I have no idea if this is actually a contribiting factor here or
not, but the way we traditionally do versioning *inside the kernel)
for changes like the one which is most likely the cause here is
fraught with danger, and could easily lead to very subtle compat
errors which would be very difficult to find.
I took a look at the routing socket changes in the cvs logs a couple
of days ago, and to me it looked like the change that Paul has
called out was the likely problem -- purely as it was the only
change of this nature that fit the versioning requirements).
The practice we have when versioning an identifier (for an ioctl or
sysctl or struct, or format identified in a field in a struct) in the
kernel, is to take the name, change it, and assign the old name to
the new functionality.
That of itself is just fine. The problem is the names that we mostly
choose to to use as the replacement name for the old interface, that we
are keeping for binary compat with previous versions, is hideous and
stupid (and yes, I know the convention dates way back to actual CSRG
changes - it was started long before NetBSD).
To illustrate the potential problem, I will use a specific change
the rtsock changes to illustrate ... note I am not claiming that this
particular one is in any way associated with the actual problem here.
The change in question altered the NET_RT_IFLIST ioctl.
So, a new number was chosen to be NET_RT_IFLIST and made visible in
the header files, so all newly compiled userlevel would get the new
format with whatever the updated semantics were. That part of this
is fine, and the way it needs to be.
Inside the kernel (and visible only there) a new name was picked to
use for the old number, so old binaries can continue using the old
ABI - and provided the compat code is coded correctly, will continue
working without noticing anything has changed.
In this case (and following the ancient tradition) the name chosen
for the old version of NET_RT_IFLIST was NET_RT_OIFLIST (adding an 'O'
to mean the "old" interface list).
Then in the kernel we get (at least in the old form, following the
modularisation changes, the details might differ, but this is how it
was when this change was made ... the other changes are not relevant
to the point here)
#ifdef COMPAT_70
case NET_RT_OIFLIST:
/* code to generate/consume the old format */
break;
#endif
which is all fine. If we're seeing the NetBSD 7 version of NET_RT_IFLIST
we use the data the way that it was used in NetBSD 7, and the
case NET_RT_IFLIST:
code handles the new (NetBSD 8 and beyond) version of the sysctl (in
this case, I think).
So far so good.
The issue is that NET_RT_IFLIST had been versioned before, at least
twice before (that I can see), and each time the same kind of change
had been made - the only difference being that the COMPAT_70 in the
ifdef had been COMPAT_50 for the previous change, and COMPAT_14 for
the one before that.
What this means is that before the COMPAT_70 code was added, the name
NET_RT_OIFLIST had been used internally for the COMPAT_50 code, and
before that was added, it had been used for the COMPAT_14 code.
Each time we version this interface (or any similar ones) we always make
the name NET_RT_OIFLIST represent the immediately previous version of
the interface, and rename all the older ones. In this case, NET_RT_OIFLIST
for COMPAT_50 became NET_RT_OOIFLIST and NET_RT_OOIFLIST which had been
used for COMPAT_14 became NET_RT_OOOIFLIST.
People, this is insane! (It is also meaningless unnecessary makework
for whoever is doing the update).
All it would take, is for somewhere, anywhere, in the kernel, a single use
of the NET_RT_OIFLIST name to miss getting changed to NET_RT_OOIFLIST and
all kinds of bad things would happen. And there's no assistance at all
from the compiler, or anything else, in detecting this problem - it takes
something like der Mouse's old NetBSD 5 binary running with a NetBSD 8 (or
later) kernel, to detect a failure.
What we should be doing is something more like we do (in the exact opposite
direction, because of the different requiremnents) when versioning libc
interfaces - which is a scheme which works (that it looks ugly when people
actually see it at the lower level is a nuisance, but that can't be avoided).
So, if when the COMPAT_14 code for NET_RT_IFLIST had been added, we had
renamed the compatibility identifier NET_RT_IFLIST_14 (or something like
that) then when the COMPAT_50 code was added, it would have created
NET_RT_IFLIST_50 and the NET_RT_IFLIST_14 version would not have needed
renaming (the code it executes might need altering - that's a different
issue - and is unchanged from what we have now).
Similarly, when the COMPAT_70 version was added, we would get NET_RT_IFLIST_70
and neither NET_RT_IFLIST_14 nor NET_RT_IFLIST_50 would alter.
This means that when these names are used in other places in the kernel,
(as inside the sysctl_iflist() function in this case perhaps, if that needs
to alter) it will be plainly obvious to all who look at the code which
version of the interface is supposed to be being supported. Further, if
we don't have COMPAT_xx included for some xx (in someone's kernel or other)
then the right code will go away - and if the #ifdef's are not correct,
the compiler (or linker) will help find the problem.
The actual problem in this case might be that the COMPAT_50 (and if that,
then probably the COMPAT_14 code as well) might not have been correctly
updated to deal with the NetBSD 8 kernel's internal data (it might be still
trying to convert NetBSD 7 format messages into NetBSD 5 format for example)
but it could just as easily be that the code is OK, and all that is
going wrong is that the wrong version of it is running, as one of those
extra O's missed getting added somewhere.
In general, I wouldn't suggest going back through all of our existing
compat code and replacing the O names with new ones - too much work for
too little benefit, but in the future, when adding compat code, don't
add an O name, add a _xx name instead, and if there already were O names
from pervious versioning (since these in the old method would have all
needed updating - so the work would be the same) instead of adding new
O's in their names, change them all to the correct _xx names instead.
In this particular case however, it might help spot the problem, if we
were to go way back to the COMPAT_14 routing socket changes, and change
the O names added in that to _14 names, and then go to the COMPAT_50 changes
and change the O names added there to _50 names, and then the COMPAT_70
names added can get changed to _70 names. The process of doing this might
just make whatever didn't get correctly updated stand out more easily than
any other way, as it would force all of the changes to be examined aggain.
Note that it is much more than this one name - this change also added
RTM_ODELADDR RTM_ONEWADDR RTM_OCHGADDR (maybe more) and previous changes
had added RT_OADVANCE(a,b) PF_OROUTE RT_OROUNDUP(n) etc ... (which were
not changed ... ie: the number of O's in a name gives no clue as to which
version it is intended to be the compat code for - just how many times
the interface in question has altered over the years, which, while an
interesting statistic, isn't really very useful.)
And of course, all of this is confused even more by COMPAT_RTSOCK which
adds a whole set of X names which need to be correctly matched with the
right O names as other things change.
It is a wonder than any of this works at all, the way it has been done
(and again, this isn't anyone here's fault - way back in the ancient
past, when things were first bneing versioned this way, and there was
only the old and the new, with no expectation that the new would not
last forever into the future, this 'O' scheme looked just fine).
kre
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de