Discussion:
refactoring ip_output() and the L2 _output()
(too old to reply)
David Young
2013-02-01 23:27:07 UTC
Permalink
I've been trying to get a handle on a few of the many MP-safety problems
in the network stack. My investigations brought me yesterday to
ether_output().

ether_output() is surprisingly complicated. The ether_output() that one
would hope for takes an ifnet, a packet, and a nexthop as arguments,
adds the correct L2 header to the front of the packet, and calls into a
driver's transmit routine. The ether_output() that we have does some
acrobatics with routes, ARP resolution, cons'ing up and prepending an L2
header, queueing the packet and nudging the driver.

Here are the route acrobatics with some annotations:

if ((rt = rt0) != NULL) {
if ((rt->rt_flags & RTF_UP) == 0) {

It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.

if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
rt->rt_refcnt--;
if (rt->rt_ifp != ifp)
return (*rt->rt_ifp->if_output)
(ifp, m0, dst, rt);
} else
senderr(EHOSTUNREACH);
}

Yikes! To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it? ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).

It feels wrong that a packet should ever make a hairpin turn in an
L2 output routine, but there it is: first it looks like it's going
out ifp, then it goes out rt->rt_ifp.

The code that follows checks for a gateway route. Finding one, it looks
for a host route to the gateway. It seems to me that ether_output() is
a peculiar place to do that: ip_output() should have taken care of that
for us. Also, it seems to me that the (potentially) two-step lookup,

1) lookup(destination) -> gateway_rt,
2) lookup(gateway_rt->destination)

ought to be a one-step lookup, lookup(destination) -> nexthop.

if ((rt->rt_flags & RTF_GATEWAY) && dst->sa_family != AF_NS) {

I wonder what the Xerox NS (AF_NS) exemption is about? Didn't NetBSD
have a sys/netns/ that was deleted? Surely the check isn't needed
any more?

if (rt->rt_gwroute == NULL)
goto lookup;
if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
rtfree(rt); rt = rt0;
lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
if ((rt = rt->rt_gwroute) == NULL)
senderr(EHOSTUNREACH);
/* the "G" test below also prevents rt == rt0 */
if ((rt->rt_flags & RTF_GATEWAY) ||
(rt->rt_ifp != ifp)) {
rt->rt_refcnt--;
rt0->rt_gwroute = NULL;
senderr(EHOSTUNREACH);
}
}
}
if (rt->rt_flags & RTF_REJECT)
if (rt->rt_rmx.rmx_expire == 0 ||
(u_long) time_second < rt->rt_rmx.rmx_expire)
senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
}

The route acrobatics alone are replicated in nd6_output(). nd6_output()
will call if_output == ether_output, so similar acrobatics are
performed twice. Of course, the latter performance is cut short.

Here is the ARP resolution:

case AF_INET:
if (m->m_flags & M_BCAST)
(void)memcpy(edst, etherbroadcastaddr, sizeof(edst));
else if (m->m_flags & M_MCAST)
ETHER_MAP_IP_MULTICAST(&satocsin(dst)->sin_addr, edst);
else if (!arpresolve(ifp, rt, m, dst, edst))
return (0); /* if not yet resolved */
/* If broadcasting on a simplex interface, loopback a copy */
if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
mcopy = m_copy(m, 0, (int)M_COPYALL);
etype = htons(ETHERTYPE_IP);
break;

BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine. Here's the rest of that,

if (mcopy)
(void)looutput(ifp, mcopy, dst, rt);

Shouldn't that be looutput(lo0ifp, ...) or lo0ifp->if_output(lo0ifp,
...) ? It seems that the network layer should handle loopback itself by
copying the packet and ip_input()'ing it directly.

Both the route acrobatics and the ARP resolution are replicated in
arc_output(), ieee1394_output(), and fddi_output().

As a first attempt at cleaning this up, I've extracted the route
acrobatics from ether_output and moved them to a routine called
ip_hresolv_output that wraps the ifp->if_output call in ip_output. See
attached patch. I guess that this if any protocols other than IPv4 and
IPv6 rely on the acrobatics, this change will break them.

I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out. I'm aiming for a streamlined IP transmission
path. Today we have something comparatively baroque,

# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
# enter ether_output
if (nexthop is gateway)
if (gateway is cached)
nexthop <- gwcache(nexthop)
else
gwcache(nexthop) <- lookup(ip routing table,
gwaddr(nexthop))
nexthop <- gwcache(nexthop)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
l2hdr <- create_l2header(interface(nexthop), gwaddr(nexthop))
transmit(interface(nexthop), l2hdr | packet)

I'm aiming for something more like this:

# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
# enter ether_output
transmit(interface(nexthop), l2header(nexthop), packet)

It will take several steps to get there. Here is one. A step in the
wrong direction? Your feedback, please. :-)

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981
Mindaugas Rasiukevicius
2013-02-02 01:17:48 UTC
Permalink
Post by David Young
<...>
I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out. I'm aiming for a streamlined IP transmission
path. Today we have something comparatively baroque,
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
# enter ether_output
if (nexthop is gateway)
if (gateway is cached)
nexthop <- gwcache(nexthop)
else
gwcache(nexthop) <- lookup(ip routing table,
gwaddr(nexthop))
nexthop <- gwcache(nexthop)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
l2hdr <- create_l2header(interface(nexthop), gwaddr
(nexthop)) transmit(interface(nexthop), l2hdr | packet)
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
# enter ether_output
transmit(interface(nexthop), l2header(nexthop), packet)
It will take several steps to get there. Here is one. A step in the
wrong direction? Your feedback, please. :-)
Seems reasonable direction to me.
Post by David Young
+/*
+ * Send an IP packet to a host.
+ *
+ * If necessary, resolve the arbitrary IP route, rt0, to an IP host route before
+ * calling ifp's output routine.
+ */
+int
+ip_hresolv_output(struct ifnet * const ifp, struct mbuf * const m,
+ const struct sockaddr * const dst, struct rtentry *rt0)
+{
+#define senderr(e) { error = (e); goto bad;}
I do not think it is worth to macrify such error path; most of the code in
the kernel does not. Actually, it might be worth to just leave the error
handling for the caller here.
Post by David Young
+ int error = 0;
+ struct rtentry *rt;
+
+ if (!ip_hresolv_needed(ifp))
+ return klock_if_output(ifp, m, dst, rt0);
+
+ if ((rt = rt0) == NULL)
+ return klock_if_output(ifp, m, dst, NULL);
+
+ /* The following block is highly questionable. How did we get
here
+ * with a !RTF_UP route? Does rtalloc1() always return an RTF_UP
+ * route?
+ */
/*
* KNF. :)
*/
Post by David Young
+ if ((rt->rt_flags & RTF_UP) == 0) {
+ if ((rt0 = rt = rtalloc1(dst, 1)) == NULL)
+ senderr(EHOSTUNREACH);
+ rt->rt_refcnt--;
+ if (rt->rt_ifp != ifp)
+ return ip_hresolv_output(rt->rt_ifp, m, dst, rt);
+ }
Recursive? How about handling this in a loop?
Post by David Young
+static bool
+ip_hresolv_needed(const struct ifnet * const ifp)
+{
+ switch (ifp->if_type) {
+ return 1;
+ return 0;
+ }
+}
There are standard "true" and "false" for bool.

+static __attribute__((always_inline)) inline int
+klock_if_output(struct ifnet * const ifp, struct mbuf * const m,
+ const struct sockaddr * const dst, struct rtentry *rt)
+{
+ int error;
+
+ if (ifp->if_intern_txenqueue == NULL)
+ KERNEL_LOCK(1, NULL);
+
+ error = (*ifp->if_output)(ifp, m, dst, rt);

I do not think always-inline is worth here, seems like nano-optimising,
especially when this routine performs call on a function pointer. Better
code would probably be generated by having:

const bool mpsafe = ifp->if_intern_txenqueue != NULL;
--
Mindaugas

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-02 01:58:01 UTC
Permalink
Post by Mindaugas Rasiukevicius
Post by David Young
<...>
I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out. I'm aiming for a streamlined IP transmission
path. Today we have something comparatively baroque,
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
# enter ether_output
if (nexthop is gateway)
if (gateway is cached)
nexthop <- gwcache(nexthop)
else
gwcache(nexthop) <- lookup(ip routing table,
gwaddr(nexthop))
nexthop <- gwcache(nexthop)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
l2hdr <- create_l2header(interface(nexthop), gwaddr
(nexthop)) transmit(interface(nexthop), l2hdr | packet)
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
# enter ether_output
transmit(interface(nexthop), l2header(nexthop), packet)
It will take several steps to get there. Here is one. A step in the
wrong direction? Your feedback, please. :-)
Seems reasonable direction to me.
Post by David Young
+/*
+ * Send an IP packet to a host.
+ *
+ * If necessary, resolve the arbitrary IP route, rt0, to an IP host route before
+ * calling ifp's output routine.
+ */
+int
+ip_hresolv_output(struct ifnet * const ifp, struct mbuf * const m,
+ const struct sockaddr * const dst, struct rtentry *rt0)
+{
+#define senderr(e) { error = (e); goto bad;}
I do not think it is worth to macrify such error path; most of the code in
the kernel does not. Actually, it might be worth to just leave the error
handling for the caller here.
Yeah, I don't think we should any new macros like this. This one came
with the code I was refactoring.
Post by Mindaugas Rasiukevicius
/*
* KNF. :)
*/
Heh. I like to save the vertical space, but KNF comment form has a
pleasing symmetry. :-)
Post by Mindaugas Rasiukevicius
Post by David Young
+ if ((rt->rt_flags & RTF_UP) == 0) {
+ if ((rt0 = rt = rtalloc1(dst, 1)) == NULL)
+ senderr(EHOSTUNREACH);
+ rt->rt_refcnt--;
+ if (rt->rt_ifp != ifp)
+ return ip_hresolv_output(rt->rt_ifp, m, dst, rt);
+ }
Recursive? How about handling this in a loop?
What guarantees that the loop completes---or the recursion, for that
matter? :-) I hope that I can just delete this code: if at first we get
an rtentry that's !RTF_UP, what good is it to try again?
Post by Mindaugas Rasiukevicius
Post by David Young
+static bool
+ip_hresolv_needed(const struct ifnet * const ifp)
+{
+ switch (ifp->if_type) {
+ return 1;
+ return 0;
+ }
+}
There are standard "true" and "false" for bool.
Thanks, I missed that. I copied & pasted & edited nd6_need_cache().
Post by Mindaugas Rasiukevicius
+static __attribute__((always_inline)) inline int
+klock_if_output(struct ifnet * const ifp, struct mbuf * const m,
+ const struct sockaddr * const dst, struct rtentry *rt)
+{
+ int error;
+
+ if (ifp->if_intern_txenqueue == NULL)
+ KERNEL_LOCK(1, NULL);
+
+ error = (*ifp->if_output)(ifp, m, dst, rt);
I do not think always-inline is worth here, seems like nano-optimising,
especially when this routine performs call on a function pointer. Better
const bool mpsafe = ifp->if_intern_txenqueue != NULL;
Good point. I'd always_inline'd it before it five lines long. It's
unbelievable what GCC will neglect to inline.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Mouse
2013-02-02 05:17:38 UTC
Permalink
Post by David Young
I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes,
IMO this is a misfeature - borderline bug - and is responsible for at
least one other botch. Back in early March 2012, I went looking to see
if I could eliminate sin_zero. This turned out to be "impossible"
(really meaning "way more work than it was worth to me"), and the
reason was exactly this: because of the way the ARP code (ab)uses the
routing table. I think I wrote up my conclusions and sent them to the
list...yeah, at least for other values of "the list": I sent my note to
tech-kern.
http://mail-index.netbsd.org/tech-kern/2012/03/06/msg012854.html is
(the archived copy of) my report to the list.

Arranging for ARP entries to have nothing to do with the (IPv4) routing
table would be an excellent move, IMO.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Darren Reed
2013-02-02 07:15:12 UTC
Permalink
Post by David Young
...
It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.
if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
rt->rt_refcnt--;
if (rt->rt_ifp != ifp)
return (*rt->rt_ifp->if_output)
(ifp, m0, dst, rt);
} else
senderr(EHOSTUNREACH);
}
Yikes! To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it? ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).
...

If both bge0 and bge1 are on the same subnet then a route out
of either can be used to transmit out either.

So if bge0's IP is 1.1.1.1 and bge1 is 1.1.1.2 and I want to
reach 1.1.1.3, a cached route might point to bge0 however the
kernel could just as easily decide to use bge1 (maybe bge0 is
down.)

So in places where you have, for example:
+ return ip_hresolv_output(rt->rt_ifp, m, dst, rt);

you should replace the "rt->rt_ifp" with just "ifp".

For the purposes of research, you may want to add a statistic
that is bumped whenever "rt->rt_ifp != ifp" when ip_output()
is called.

Darren


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Laight
2013-02-02 12:46:17 UTC
Permalink
Post by David Young
/* If broadcasting on a simplex interface, loopback a copy */
if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
mcopy = m_copy(m, 0, (int)M_COPYALL);
...
Post by David Young
BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine...
The 'should you receive your transmits as if someone else had sent
the packet' question - which is what I think IFF_SIMPLEX is about
is 'interesting'.
Not doing it used to stop some stuff working, doing it confused
other places - unfortunately I can't remember what!

One one hand it might be worth making all the MAC drivers ensure it
happens, OTOH it probably only happens in hardware for some HDX interfaces.
So it might be always doing the software version and filitering out
the duplicates in places where they can happen.

In either case, doing the reflect in the middle of the ARP code
seems wrong!

IIRC some hardware would receive its own transmittiona at 10M HDX, but
not at 100M HDX (sun feps sbus card). Since the driver doesn't need to know
whether the speed in 10M or 100M (only HDX v FDX) the driver I did
always did the software reflect and software filtered any received
packets from its own MAC address.

David
--
David Laight: ***@l8s.co.uk

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2013-02-02 14:06:22 UTC
Permalink
Post by David Laight
Post by David Young
/* If broadcasting on a simplex interface, loopback a copy */
if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
mcopy = m_copy(m, 0, (int)M_COPYALL);
...
Post by David Young
BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine...
The 'should you receive your transmits as if someone else had sent
the packet' question - which is what I think IFF_SIMPLEX is about
is 'interesting'.
Not doing it used to stop some stuff working, doing it confused
other places - unfortunately I can't remember what!
Yes, I think IFF_SIMPLEX is about this. Note that it is crtiical for
multicast to work right.
Mihai Chelaru
2013-02-02 14:07:29 UTC
Permalink
Hi Dave,

Sounds like a good idea !

That return (*rt->rt_ifp->if_output)(ifp, m0, dst, rt) line looks
broken, but I don't think it ever happened - the code path and the big
lock took good care of us. Anyhow adding a debug around that and see if
it's really happening would be interesting as someone else suggested.

Some observations:

Please move ip_hresolv_needed() as a bool property (called
resolv_needed or such) in struct ifnet - it will be used by other
protocols as well. Moreover ip_hresolve_output() doesn't look to be IP
dependent, so throwing it inside ip_output.c doesn't look good. The
klock_if_output name looks unfortunate.

I still wonder what you'll do about arpresolve - personaly, I'd like to
see an independent l2 resolving layer attached to struct ifnet and a
dumb if_output, so the frame will be opaque from if_output perspective.
Meaning that I'd like to see if_output called with 4 parameters: ifp,
mbuf, a sockaddr represeting the next hop l2 address (arp address for
ether), and frame_type (ETHERTYPE* for ether). The last parameter
should be aquired by upper layers from a function associated with
struct ifnet - int getframetype(int address_family) while the l2address
should be aquired by upper protocol (IP) using the mentioned resolving
layer (ARP for IP/ETH).
--
Mihai
Post by David Young
I've been trying to get a handle on a few of the many MP-safety problems
in the network stack. My investigations brought me yesterday to
ether_output().
ether_output() is surprisingly complicated. The ether_output() that one
would hope for takes an ifnet, a packet, and a nexthop as arguments,
adds the correct L2 header to the front of the packet, and calls into a
driver's transmit routine. The ether_output() that we have does some
acrobatics with routes, ARP resolution, cons'ing up and prepending an L2
header, queueing the packet and nudging the driver.
if ((rt = rt0) != NULL) {
if ((rt->rt_flags & RTF_UP) == 0) {
It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.
if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
rt->rt_refcnt--;
if (rt->rt_ifp != ifp)
return (*rt->rt_ifp->if_output)
(ifp, m0, dst, rt);
} else
senderr(EHOSTUNREACH);
}
Yikes! To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it? ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).
It feels wrong that a packet should ever make a hairpin turn in an
L2 output routine, but there it is: first it looks like it's going
out ifp, then it goes out rt->rt_ifp.
The code that follows checks for a gateway route. Finding one, it looks
for a host route to the gateway. It seems to me that ether_output() is
a peculiar place to do that: ip_output() should have taken care of that
for us. Also, it seems to me that the (potentially) two-step lookup,
1) lookup(destination) -> gateway_rt,
2) lookup(gateway_rt->destination)
ought to be a one-step lookup, lookup(destination) -> nexthop.
if ((rt->rt_flags & RTF_GATEWAY) && dst->sa_family != AF_NS) {
I wonder what the Xerox NS (AF_NS) exemption is about? Didn't NetBSD
have a sys/netns/ that was deleted? Surely the check isn't needed
any more?
if (rt->rt_gwroute == NULL)
goto lookup;
if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
rtfree(rt); rt = rt0;
lookup: rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);
if ((rt = rt->rt_gwroute) == NULL)
senderr(EHOSTUNREACH);
/* the "G" test below also prevents rt == rt0 */
if ((rt->rt_flags & RTF_GATEWAY) ||
(rt->rt_ifp != ifp)) {
rt->rt_refcnt--;
rt0->rt_gwroute = NULL;
senderr(EHOSTUNREACH);
}
}
}
if (rt->rt_flags & RTF_REJECT)
if (rt->rt_rmx.rmx_expire == 0 ||
(u_long) time_second < rt->rt_rmx.rmx_expire)
senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
}
The route acrobatics alone are replicated in nd6_output(). nd6_output()
will call if_output == ether_output, so similar acrobatics are
performed twice. Of course, the latter performance is cut short.
if (m->m_flags & M_BCAST)
(void)memcpy(edst, etherbroadcastaddr, sizeof(edst));
else if (m->m_flags & M_MCAST)
ETHER_MAP_IP_MULTICAST(&satocsin(dst)->sin_addr, edst);
else if (!arpresolve(ifp, rt, m, dst, edst))
return (0); /* if not yet resolved */
/* If broadcasting on a simplex interface, loopback a copy */
if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
mcopy = m_copy(m, 0, (int)M_COPYALL);
etype = htons(ETHERTYPE_IP);
break;
BTW, it makes me grimace to see the simplex-loopback handled in an L2
transmit routine. Here's the rest of that,
if (mcopy)
(void)looutput(ifp, mcopy, dst, rt);
Shouldn't that be looutput(lo0ifp, ...) or lo0ifp->if_output(lo0ifp,
...) ? It seems that the network layer should handle loopback itself by
copying the packet and ip_input()'ing it directly.
Both the route acrobatics and the ARP resolution are replicated in
arc_output(), ieee1394_output(), and fddi_output().
As a first attempt at cleaning this up, I've extracted the route
acrobatics from ether_output and moved them to a routine called
ip_hresolv_output that wraps the ifp->if_output call in ip_output. See
attached patch. I guess that this if any protocols other than IPv4 and
IPv6 rely on the acrobatics, this change will break them.
I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes, so eventually I will have
to factor that out. I'm aiming for a streamlined IP transmission
path. Today we have something comparatively baroque,
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
# enter ether_output
if (nexthop is gateway)
if (gateway is cached)
nexthop <- gwcache(nexthop)
else
gwcache(nexthop) <- lookup(ip routing table,
gwaddr(nexthop))
nexthop <- gwcache(nexthop)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
l2hdr <- create_l2header(interface(nexthop), gwaddr(nexthop))
transmit(interface(nexthop), l2hdr | packet)
# enter ip_output
dst <- destination(packet)
nexthop <- lookup(ip routing table, dst)
if (nexthop l2 info incomplete)
defer_until_arp_complete(nexthop, packet)
else
# enter ether_output
transmit(interface(nexthop), l2header(nexthop), packet)
It will take several steps to get there. Here is one. A step in the
wrong direction? Your feedback, please. :-)
Dave
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-03 19:42:28 UTC
Permalink
Post by Darren Reed
Post by David Young
...
It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.
if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
rt->rt_refcnt--;
if (rt->rt_ifp != ifp)
return (*rt->rt_ifp->if_output)
(ifp, m0, dst, rt);
} else
senderr(EHOSTUNREACH);
}
Yikes! To call rt->rt_ifp's output routine with a different ifp is not
advisable, is it? ISTR finding a bug in carp(4) involving a similar
pattern, ifpX->if_output(ifqY, ...).
...
If both bge0 and bge1 are on the same subnet then a route out
of either can be used to transmit out either.
So if bge0's IP is 1.1.1.1 and bge1 is 1.1.1.2 and I want to
reach 1.1.1.3, a cached route might point to bge0 however the
kernel could just as easily decide to use bge1 (maybe bge0 is
down.)
+ return ip_hresolv_output(rt->rt_ifp, m, dst, rt);
you should replace the "rt->rt_ifp" with just "ifp".
How is it that we're getting into ether_output() with !RTF_UP routes
to begin with? Is it that bge0 is down but its routes haven't been
withdrawn yet?

It seems to me that inconsistencies in the routing lead to this
strange code in ether_output(). We could actually prolong
ether_output()'s dithering about which ifp to use forever.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-03 19:45:56 UTC
Permalink
Post by Mouse
Post by David Young
I'd like ether_output never, ever to deal with the routing table,
but arpresolve() deals some in routes,
IMO this is a misfeature - borderline bug - and is responsible for at
least one other botch. Back in early March 2012, I went looking to see
if I could eliminate sin_zero. This turned out to be "impossible"
(really meaning "way more work than it was worth to me"), and the
reason was exactly this: because of the way the ARP code (ab)uses the
routing table. I think I wrote up my conclusions and sent them to the
list...yeah, at least for other values of "the list": I sent my note to
tech-kern.
http://mail-index.netbsd.org/tech-kern/2012/03/06/msg012854.html is
(the archived copy of) my report to the list.
Arranging for ARP entries to have nothing to do with the (IPv4) routing
table would be an excellent move, IMO.
It sounds to me like we both favor a move toward greater simplicity
and efficiency. I'm pretty keen on doing just one table lookup
per packet forwarded.

Insofar as ARP installs routing-table entries using a different key than
a "standard" IPv4 address and mask, if that is what you're saying that
it does, that seems too complicated.

Insofar as looking up an IP host in the routing table produces an L2
header and an output interface---i.e., the total information you need to
forward a packet to that host---that seems most efficient. If sometimes
ARP is the agent who finds the L2 headers out, inserts the corresponding
host routes into the routing table, and modifies/withdraws the routes as
they expire, that's ok by me.

Do you understand what this sin_srcaddr member of sockaddr_inarp is for?

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Mouse
2013-02-03 20:59:39 UTC
Permalink
Post by David Young
Post by Mouse
[...] but arpresolve() deals some in routes,
IMO this is a misfeature - borderline bug - and is responsible for
at least one other botch. [...sin_zero...]
Arranging for ARP entries to have nothing to do with the (IPv4)
routing table would be an excellent move, IMO.
It sounds to me like we both favor a move toward greater simplicity
and efficiency. I'm pretty keen on doing just one table lookup per
packet forwarded.
Well, pushing the ARP table out of the v4 routing table would mean
pushing it into somewhere else instead. Depending on how complex that
"somewhere else" turns out to be, this may not be a net win.

My guess would be that it would be a win. But that's just a guess.
Post by David Young
Insofar as ARP installs routing-table entries using a different key
than a "standard" IPv4 address and mask, if that is what you're
saying that it does, that seems too complicated.
Yes, that's what I think is going on. There are 8 bytes of information
in each IPv4 routing table entry that are ignored by almost everything
that prints routes as routes. They are the three fields struct
sockaddr_inarp has that overlay sin_zero in struct sockaddr_in, and, as
far as I can recall, only ARP code ever uses them. However, other
interfaces accept them, since the routing table code is (perhaps
excessively) general and doesn't know they're special; this is why
sin_zero must exist and be zeroed by userland: if not, you wind up with
routes and/or addresses with garbage in the hidden 8 bytes, which don't
match when they should (because of the garbage) but are hard to detect
(because most tools don't print those bytes). As I wrote in the change
I made in my private patch tree when I looked into this,

* That interfaces like bind(2) or routing socket messages pay
* attention to what userland has in sin_zero is the real bug here.
* (Well, that plus the abuse of the AF_INET routing table to hold
* extra hidden data for structs sockaddr_inarp.)
Post by David Young
Do you understand what this sin_srcaddr member of sockaddr_inarp is for?
No. I have a vague feeling that I understood that extra data back when
I investigated sin_zero, but I can't recall it now, and as far as I can
tell nothing uses sin_srcaddr by name. I think ARP code uses it, but,
as I say, grep over the source tree makes me think it must do so
without using the sin_srcaddr name, which is annoying. (For example, I
have a vague memory that there is code which creates routing-socket
messages by serializing something into the corresponding space in a
data buffer.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2013-02-04 03:03:31 UTC
Permalink
Post by David Young
It sounds to me like we both favor a move toward greater simplicity
and efficiency. I'm pretty keen on doing just one table lookup
per packet forwarded.
Maybe while you're at it you can find a way to avoid the pointer-chase
to get the actual useful data once given the table entry... ugh.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2013-02-04 13:26:47 UTC
Permalink
Post by Mouse
Well, pushing the ARP table out of the v4 routing table would mean
pushing it into somewhere else instead. Depending on how complex that
"somewhere else" turns out to be, this may not be a net win.
My guess would be that it would be a win. But that's just a guess.
I don't think it would be an improvement - you'd need a separate table,
and you'd need to do two lookups, one to find the cloning route, and
then to find the arp entry. Right now the arp entry, if present, is
found directly.
Post by Mouse
Post by David Young
Insofar as ARP installs routing-table entries using a different key
than a "standard" IPv4 address and mask, if that is what you're
saying that it does, that seems too complicated.
Yes, that's what I think is going on. There are 8 bytes of information
in each IPv4 routing table entry that are ignored by almost everything
that prints routes as routes. They are the three fields struct
sockaddr_inarp has that overlay sin_zero in struct sockaddr_in, and, as
far as I can recall, only ARP code ever uses them. However, other
interfaces accept them, since the routing table code is (perhaps
excessively) general and doesn't know they're special; this is why
sin_zero must exist and be zeroed by userland: if not, you wind up with
routes and/or addresses with garbage in the hidden 8 bytes, which don't
match when they should (because of the garbage) but are hard to detect
(because most tools don't print those bytes).
The lookup code does not understand address formats; that's a feature so
that it works with various protocols.

It sounds like what's broken is that there is not a documented invariant
that all regular v4 entries in the routing table have zeros in the
unused bytes, lack of asserting that if DIAGNOSTIC, and lack of zeroing
it on the way in. (And lack of user-space routing tools not printing a
warning if not zero.) This seems like it should be easy to fix. Fixing
that seems far less likely to introduce bad behavior than a grand
reorganization.
Post by Mouse
As I wrote in the change
I made in my private patch tree when I looked into this,
* That interfaces like bind(2) or routing socket messages pay
* attention to what userland has in sin_zero is the real bug here.
* (Well, that plus the abuse of the AF_INET routing table to hold
* extra hidden data for structs sockaddr_inarp.)
Agreed, 90%. The kernel should probably zero sin_zero when needed,
unless the syscall interface, which means posix, perhaps, documents that
they must be zero.
Joerg Sonnenberger
2013-02-04 13:57:57 UTC
Permalink
Post by Greg Troxel
I don't think it would be an improvement - you'd need a separate table,
and you'd need to do two lookups, one to find the cloning route, and
then to find the arp entry. Right now the arp entry, if present, is
found directly.
I disagree. You are ignoring the impact of having the leaf nodes in the
routing table, which can be quite significant.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Mouse
2013-02-04 15:37:06 UTC
Permalink
Post by Greg Troxel
Post by Mouse
Well, pushing the ARP table out of the v4 routing table would mean
pushing it into somewhere else instead. Depending on how complex
that "somewhere else" turns out to be, this may not be a net win.
My guess would be that it would be a win. But that's just a guess.
I don't think it would be an improvement - you'd need a separate
table, and you'd need to do two lookups, one to find the cloning
route, and then to find the arp entry. Right now the arp entry, if
present, is found directly.
It's been long enough I'm not sure, but I don't think so. I _think_
the ARP routes are found only by ARP-aware code - for IP-layer lookups,
the non-ARP route is more general, in the radix-tree sense, so it's the
one that's found.
Post by Greg Troxel
Post by Mouse
However, other interfaces accept [the sin_zero bytes], since the
routing table code is (perhaps excessively) general and doesn't know
they're special;
The lookup code does not understand address formats; that's a feature
so that it works with various protocols.
Sure. But (ab)using it this way, by sticking an extra 8 bytes on the
end and then counting on all of userland to know that it has to zero
them on input and ignore them on output? Oh, except for the ARP code.
It's just pushing the ugliness around.

Using the radix-tree code for the ARP table might even be a good idea.
But not the same table that's used for IP-layer IPv4 routing!

...well, that's my opinion. :-)
Post by Greg Troxel
It sounds like what's broken is that there is not a documented
invariant that all regular v4 entries in the routing table have zeros
in the unused bytes, lack of asserting that if DIAGNOSTIC, and lack
of zeroing it on the way in.
If you want to keep ARP entries in the same table as IPv4 routes, yes,
that's approximately what I think it would mean to do it right. You'd
need to be careful you don't break the ARP-aware paths, though.
Post by Greg Troxel
Agreed, 90%. The kernel should probably zero sin_zero when needed,
unless the syscall interface, which means posix, perhaps, documents
that they must be zero.
Even then, I think it should. POSIX is more about political
compromises, in many cases, than it is about good interface
engineering. Having eight bytes of magic data which is barely
mentioned in the manpage (inet(4) lists it in the struct elements but
doesn't mention it elsewhere, at least as of 5.1) but which must be
zero for proper operation - but, if it's not zero, leads to mysterious,
obscure, and silent failure modes? Depending on accumulated cultural
lore for people to learn that it's MBZ? That is _horrible_ interface
engineering. I've been working with that struct for over twenty years
and still didn't really understand what the deal was with sin_zero
until I put a couple of days into trying to get rid of it and tracking
down the resulting weird failures.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-04 16:10:01 UTC
Permalink
Post by Joerg Sonnenberger
Post by Greg Troxel
I don't think it would be an improvement - you'd need a separate table,
and you'd need to do two lookups, one to find the cloning route, and
then to find the arp entry. Right now the arp entry, if present, is
found directly.
I disagree. You are ignoring the impact of having the leaf nodes in the
routing table, which can be quite significant.
Please explain more about this. It's not clear what assumptions about
the routing data structure you have made.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2013-02-04 16:43:57 UTC
Permalink
Post by David Young
Post by Joerg Sonnenberger
Post by Greg Troxel
I don't think it would be an improvement - you'd need a separate table,
and you'd need to do two lookups, one to find the cloning route, and
then to find the arp entry. Right now the arp entry, if present, is
found directly.
I disagree. You are ignoring the impact of having the leaf nodes in the
routing table, which can be quite significant.
Please explain more about this. It's not clear what assumptions about
the routing data structure you have made.
Moving the ARP entries out of the main routing table makes the PATRICA
tree more shallow. My assumption here is that a plain hash table is no
worse in performance than adding at least one level of the PATRICA tree,
when collisions are not an issue.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2013-02-04 18:13:28 UTC
Permalink
Post by Mouse
Post by Greg Troxel
Agreed, 90%. The kernel should probably zero sin_zero when needed,
unless the syscall interface, which means posix, perhaps, documents
that they must be zero.
Even then, I think it should. POSIX is more about political
compromises, in many cases, than it is about good interface
engineering. Having eight bytes of magic data which is barely
mentioned in the manpage (inet(4) lists it in the struct elements but
doesn't mention it elsewhere, at least as of 5.1) but which must be
zero for proper operation - but, if it's not zero, leads to mysterious,
obscure, and silent failure modes? Depending on accumulated cultural
lore for people to learn that it's MBZ? That is _horrible_ interface
engineering. I've been working with that struct for over twenty years
and still didn't really understand what the deal was with sin_zero
until I put a couple of days into trying to get rid of it and tracking
down the resulting weird failures.
Sure, I have no issue with the kernel checking/zeroing (maybe logging if
DIAGNOSTIC, or some such) if there is an interface failure, if it turns
out that the spec says they are zero. I just meant we should get spec
clarity and know what we are doing relative to it. I agree this is
unnecessarily tricky.
Dennis Ferguson
2013-02-10 03:24:32 UTC
Permalink
Post by David Young
if ((rt = rt0) != NULL) {
if ((rt->rt_flags & RTF_UP) == 0) {
It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.
if ((rt0 = rt = rtalloc1(dst, 1)) != NULL) {
rt->rt_refcnt--;
if (rt->rt_ifp != ifp)
return (*rt->rt_ifp->if_output)
(ifp, m0, dst, rt);
} else
senderr(EHOSTUNREACH);
}
I'm just catching up on reading this.

To understand this you might look at the code paths in ip_output() dealing
with the SO_DONTROUTE socket option, or maybe at how the strict source
route option is handled (I would do this but I get a feeling of despair
when I look in there..). If nothing has changed since I last dealt with
this the traditional way operations which required matching a non-gatewayed,
interface route were implemented was to avoid the routing table in favor of
scanning the interface list looking for one with an address matching the packet's
destination (which sucks something awful on a router with both many interfaces
configured and a lot of applications needing to use SO_DONTROUTE) and then
delivering the packet to that interface. This packet won't come with a decent route
since it didn't look at the routing table to arrive at this interface, but
the ARP entries are in the routing table so someone needs to look in
there anyway. In addition, if you have two ethernets on the same subnet
the routing table will generally only have routes for one of them (since
the interface routes will be the same it is unable to keep both interfaces'
routes in there at the same time), but the SO_DONTROUTE search may instead
find the interface whose routes are not in the table. In this case I
think the packet gets sent out the interface the SO_DONTROUTE search
found but uses an ARP entry the other interface learned to do so, which
might explain the code above (though I think it is possible to construct
interface configurations involving and ethernet and a p2p interface with
an overlapping destination address where that code fails).

I actually remember when this code made more sense. The very earliest
ancestor of this code thought that knowing an output interface and
a next hop IP address was equivalent to knowing the L2 header for
the outgoing packet, because it was. Most early networks stored
L2 neighbour addresses in the low order bits of the network's IP
addresses; RFC 796 lists these mappings for some large networks, but
doesn't mention that many of the earliest LAN technologies used
1 byte MAC addresses (copied from 3 Mbps Experimental Ethernet) which
were set to be the same as the low order byte of the IP interface
address on the network. If knowing the outgoing interface and next
hop IP address is sufficient to tell you everything you need to know
to send the packet then it doesn't really matter whether you use
the routing table or the interface address configuration to make a
routing decision because the end product is the same: an output
interface and a next hop IP address.

Note that in those days 10 Mbps Ethernet was actually an outlier,
a wart. The 6 byte MAC addresses couldn't be fit in the low order
bits of a 4 byte IP address, so the next hop IP address wasn't
directly useful since none of the bits in there told you what the
L2 header should look like. The idea of maintaining an ARP cache,
burying a fairly complicated protocol in interface code and using
the next hop IP address as an indirection was arrived at because,
while it made Ethernet interfaces bear the additional cost of
resolving this indirection, it also made Ethernet interfaces look
like all the other interfaces where this address was not an indirection
but instead was directly what you needed to know. Essentially
the ARP cache was a mechanism to make ethernet interfaces look
like ARPAnet interfaces. Since we've reached a state where virtually
all the networks that worked like the ARPAnet are long gone and just
about everything is, or looks like, an ethernet interface, it is
probably a good idea to try to optimize this for ethernet rather
than for the ARPAnet.

So I think that treating ARP like other routing protocols, having
it store the routes it learns in the kernel routing table the way
other routing protocols do and getting directly from the route lookup
to the L2 header without an indirection, was exactly the right thing
to do. The problem is that it was incompletely implemented. If you
want to be able to get rid of interface code this way then everything
needs to make its routing decision by looking in the route table, including
SO_DONTROUTE packets and strict source route packets.

This is not difficult to do, but requires dealing with the semantic
shift in what "route lookup" would then mean. To see this it is probably
worth distinguishing a "route lookup" from a "forwarding lookup". A
forwarding lookup is done using only address information as the search
key, while a route lookup uses both address information and other data
associated with the route for the search (like "match routes used for
forwarding", or "match only interface routes", or "match only interface
routes for a particular interface". In a forwarding table it makes no sense
to have more than one route with any particular destination prefix since, with
only address information as the search key, there is nothing to distinguish
routes with the same prefix. In a routing table there is. Since the current
kernel radix tree only has search entries taking addresses alone as keys,
and can't store more than one route with the same address and mask, it is
a forwarding table.

To do SO_DONTROUTE and strict source route routing decisions in the route
table, and eliminate that interface code, you need to provide route lookup
semantics ("match only interface routes") and you need to guarantee that
all the routes they would need to find (i.e. that they would arrive at by
scanning the interface address configuration instead) are always in that
table. The latter is almost always the case anyway, since interface routes
are normally important for forwarding lookups too, but it still requires
bridging the gap between "almost" and "always" and that includes dealing
with cases like two ethernets attached to the same subnet; I'm pretty sure
the data structure would need to be able to store and search more than one
route to the same destination.

Speaking of which, that radix tree implementation in the kernel is very
old and crufty. It was a very early implementation of a variable length
mask search (when routes were classed route lookups were done in a hash
table), the code was always opaque and had minor bugs that seemed impossible
to find, it isn't very fast and operates in a way which unnecessarily
penalizes longer addresses (i.e. IPv6), and we've learned just a whole lot
more about data structures for longest match lookups since then. If
noncontiguous masks are no longer interesting then maybe it is time to
replace that?

Dennis Ferguson
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-11 01:05:01 UTC
Permalink
Post by Dennis Ferguson
Speaking of which, that radix tree implementation in the kernel is very
old and crufty. It was a very early implementation of a variable length
mask search (when routes were classed route lookups were done in a hash
table), the code was always opaque and had minor bugs that seemed impossible
to find, it isn't very fast and operates in a way which unnecessarily
penalizes longer addresses (i.e. IPv6), and we've learned just a whole lot
more about data structures for longest match lookups since then. If
noncontiguous masks are no longer interesting then maybe it is time to
replace that?
Yes, it is time to replace that. We have one candidate to replace the
existing data structure in the form of src/common/lib/libc/gen/ptree.c.
I'm not sure if anyone's using it for a forwarding/routing table, yet.

Speaking of radix trees, it seems that in an SMP networking stack the PCB
tables could be replaced by a couple of them. The tree for unconnected
sockets would use keys

[protocol (tcp|udp|...), local port (1..65535)]

for sockets binding the same port and INADDR_ANY, and use keys

[protocol (tcp|udp|...), local port (1..65535)], local address]

for sockets bound to a particular local address.

For connected sockets, and TCP sessions being established, or what have
you, there would be the flow table that you have described previously,
using the key you mention

[local address, remote address, protocol (tcp|udp|...),
first 4 bytes of L4 header]

or some permutation thereof.

The kernel will try for matches in the flow-table entry, first; failing
to find a match there, it would try the unconnected table.

I suppose that the two tables could be made one if the fields in the
key were suitably ordered,

[protocol, local port, local address, remote port, remote address]

Just a thought, anyway.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2013-02-11 01:43:31 UTC
Permalink
Post by David Young
Speaking of radix trees, it seems that in an SMP networking stack the PCB
tables could be replaced by a couple of them. The tree for unconnected
sockets would use keys
[protocol (tcp|udp|...), local port (1..65535)]
for sockets binding the same port and INADDR_ANY, and use keys
[protocol (tcp|udp|...), local port (1..65535)], local address]
for sockets bound to a particular local address.
Wouldn't this have the unfortunate effect of squandering the ability
of most modern network adapters to return the appropriate hash bucket
for your (presumed, but reasonably, on the part of their designers)
hashed connection datastructures on receive?

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-11 02:01:56 UTC
Permalink
Post by Thor Lancelot Simon
Post by David Young
Speaking of radix trees, it seems that in an SMP networking stack the PCB
tables could be replaced by a couple of them. The tree for unconnected
sockets would use keys
[protocol (tcp|udp|...), local port (1..65535)]
for sockets binding the same port and INADDR_ANY, and use keys
[protocol (tcp|udp|...), local port (1..65535)], local address]
for sockets bound to a particular local address.
Wouldn't this have the unfortunate effect of squandering the ability
of most modern network adapters to return the appropriate hash bucket
for your (presumed, but reasonably, on the part of their designers)
hashed connection datastructures on receive?
That's a good point, it probably would.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Dennis Ferguson
2013-02-11 03:00:14 UTC
Permalink
Post by Thor Lancelot Simon
Post by David Young
Speaking of radix trees, it seems that in an SMP networking stack the PCB
tables could be replaced by a couple of them. The tree for unconnected
sockets would use keys
[protocol (tcp|udp|...), local port (1..65535)]
for sockets binding the same port and INADDR_ANY, and use keys
[protocol (tcp|udp|...), local port (1..65535)], local address]
for sockets bound to a particular local address.
Wouldn't this have the unfortunate effect of squandering the ability
of most modern network adapters to return the appropriate hash bucket
for your (presumed, but reasonably, on the part of their designers)
hashed connection datastructures on receive?
That's interesting. Would it? The stuff he's got listed there is all
unconnected socket state, not the full 5-tuple; it is what you would look
for if you couldn't find a full matching 5-tuple in your hash table. Is
it possible to compute a hash only on the bits significant to those entries
without knowing the answer before you compute the hash?

Additionally, is there now a standard for computing a hash over a connection
identifier and, if so, where can I find it? If not, how is this supposed
to work if you have two network adapters from different vendors, or if some
packet has come out of a tunnel and you need to compute the hash in software
to look up in the same connection table? Or is it assuming that each interface
has a separate hashed connection data structure particular to that interface
and that all packets for a particular connection will come in the same
interface?

This sounds neat in theory, but I'm always suspicious since I've noticed it
isn't unusual for assists that network adapter designers include in their products
to be already squandered by making no sense if your operating system may do stuff
that windows doesn't.

Dennis Ferguson
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2013-02-23 22:35:09 UTC
Permalink
Post by David Young
I've been trying to get a handle on a few of the many MP-safety problems
in the network stack. My investigations brought me yesterday to
ether_output().
ether_output() is surprisingly complicated. The ether_output() that one
would hope for takes an ifnet, a packet, and a nexthop as arguments,
adds the correct L2 header to the front of the packet, and calls into a
driver's transmit routine. The ether_output() that we have does some
acrobatics with routes, ARP resolution, cons'ing up and prepending an L2
header, queueing the packet and nudging the driver.
if ((rt = rt0) != NULL) {
if ((rt->rt_flags & RTF_UP) == 0) {
It's funny that we get into ether_output() with an rtentry that's
not even usable. I'm not sure how that happens. I will have to
look more carefully at what ip_output() is doing.
I think the checks for !RTF_UP have something to do with the route
caches (struct route). I'm experimenting with a tree that uses no route
caches and dispenses with this strange logic: if I remember, I will let
you know how that goes.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...