Discussion:
IPv6 forwarding fail = mbuf leak?
Mouse
2013-02-08 00:41:50 UTC
Okay, I want to investigate this further, but it's been a few days and
I haven't got the round tuits yet. So I'll put it out here in case
it's relevant to anyone else - which it may well be; see the last
paragraph.

My house network's router, including uplink, is a 4.0.1 i386 machine.
(Yes, I know 4.0.1 is no longer officially supported; I'm not asking
for support here. If I wanted support, I'd file a PR. Just throwing
out a heads up in case the problem still exists; if nobody else picks
this up, I'll track it down myself some one of these days. And if
anyone does manage to find it before I do and cares to say so, bonus.)

Recently, thanks to a failure upstream from me, its IPv6 uplink went
away. (As in, nothing responded at the address it was expecting to see
it at - in IPv4, I'd say nothing answered the ARP request, but I forget
what IPv6 calls the analogous protocol.) At the same time, the machine
started wedging. The wedges proved to be some kind of "out of network
memory" condition; rebooting helped...for about a day.

I built a kernel with MBUFTRACE, and it developed that it was indeed an
mbuf leak: according to netstat -mssv, the leaking allocations were
charged to "vlan2 rx". vlan2 is/was
the main house-facing vlan. I set up a cronjob to run netstat -m
periodically and reboot if it got close to the limit. This job
rebooted the machine a bit more than once a day - not good, but I
prefer brief downtimes to wedging while I try to figure out what's
wrong. (Okay, not strictly total wedging, but for my main house
router, stopping talking with the network amounts to much the same
thing in practice.)
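A sketch of the sort of cronjob I mean (the threshold and the netstat -m
output parsing here are illustrative, not what I actually ran; adjust
both to your machine):

```shell
#!/bin/sh
# Crude mbuf watchdog: reboot before the mbuf pool is exhausted.
# THRESH is a placeholder; set it from your kernel's actual limit.
THRESH=3500

# Assumes "netstat -m" prints a line like "123 mbufs in use:";
# pull out the leading count.
used=$(netstat -m 2>/dev/null | awk '/mbufs in use/ { print $1; exit }')

if [ -n "$used" ] && [ "$used" -ge "$THRESH" ]; then
    logger -t mbufwatch "mbuf usage $used >= $THRESH, rebooting"
    shutdown -r now "mbuf watchdog"
fi
```

Run from cron every few minutes; a brief scheduled reboot beats an
unattended wedge.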

I started looking into various ways to figure out more details of where
the mbufs were going. My first few attempts failed badly (the first
one panicked when autoconf first attached the interface; the second, as
soon as I brought it up).

Before I got anything better working, I got fed up with IPv6 not
working and fixed it (well, fixed most of it myself and got the
relevant person to fix the rest).

The mbuf leak stopped dead. The machine's been up for multiple days,
now, and vlan2 rx usage is still at only 1.

This leads me to a very strong suspicion that there's an mbuf leak in
the code paths used when an IPv6 packet is forwarded to a host that
isn't answering whatever IPv6's IPv6-address-to-MAC-address mapping
protocol is. The house network would try to speak to the v6 world
occasionally even without v6 connectivity, so I would expect a low
level of outgoing v6 traffic.
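For anyone who wants to poke at this before I do, a rough reproduction
sketch (the address is a documentation-prefix placeholder; note that
the leak I saw was charged to received-then-forwarded packets, so
generating the traffic from a host behind the router is probably closer
to my situation than running this on the router itself):

```shell
#!/bin/sh
# Aim v6 traffic at an on-link address that will never answer
# neighbor discovery, and see whether mbuf usage creeps up.
DEAD=2001:db8::dead        # placeholder: pick an unused on-link address

before=$(netstat -m 2>/dev/null | awk '/mbufs in use/ { print $1; exit }')
ping6 -c 50 "$DEAD" >/dev/null 2>&1
sleep 2                    # give ND time to give up on queued packets
after=$(netstat -m 2>/dev/null | awk '/mbufs in use/ { print $1; exit }')
echo "mbufs in use: before=$before after=$after"
```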

The machine actually is not quite 4.0.1; it's 4.0.1 plus my fixes. But
I am moderately sure none of them are likely to have any bearing (as
one simple example, the route in question does not go out an srt).

As I said, I'll track it down someday if nobody else does. But, unless
that code has been reworked between 4.x and now, the bug may well still
exist, in which case someone might want to look into it in more modern
NetBSD.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2013-02-08 02:21:56 UTC
What is the interface type on the upstream link?

Are you sending v6 natively on the link, or is there some encapsulation?
Static route, or some IGP, or BGP, or ?

When you said "isn't answering", do you mean that you have an ndp cache
entry, but when the unicast packet is sent, there is no link-level
receiver? Or do you mean that there is no ndp entry, because it doesn't
answer the ndp request packets?

I'm reading code in -current, but look in
src/sys/net/if_ethersubr.c:ether_output() and audit the memory handling.
I haven't found anything yet, but that's what I would first suspect.
Mouse
2013-02-08 03:14:06 UTC
Post by Greg Troxel
What is the interface type on the upstream link?
As seen by the host, Ethernet. The physical layer is DSL (MVL, to be
specific), but as far as I've been able to tell it's operationally just
a really slow Ethernet. (In case it matters, the DSL CPE is behind a
switch which vlan-trunks that, among other things, to the router host.)

The other end is not a mass-market big-ISP head; the DSL is being run
over dry copper to a DSLAM at my upstream/employer.
Post by Greg Troxel
Are you sending v6 natively on the link, or is there some
encapsulation?
Native, or at least as native as IPv6-over-Ethernet ever is. (Not
counting comments, there's nothing in the network config that says
vlan4 is a DSL line instead of a normal Ethernet.)
Post by Greg Troxel
Static route, or some IGP, or BGP, or ?
Entirely manually-configured statics.
Post by Greg Troxel
When you said "isn't answering", do you mean that you have an ndp
cache entry, but when the unicast packet is sent, there is no
link-level receiver? Or do you mean that there is no ndp entry,
because it doesn't answer the ndp request packets?
The latter. (It might have been the former initially, but certainly
after the reboot to clear the first wedge it will have been the latter
until I fixed things. The upstream host and my router disagreed over
what /127 to use on the DSL; I must have changed one end but not the
other and not rebooted since, or some such.)
Post by Greg Troxel
I'm reading code in -current, but look in
src/sys/net/if_ethersubr.c:ether_output() and audit the memory
handling. I haven't found anything yet, but that's what I would
first suspect.
Possibly. But I would be inclined to suspect there's something
IPv6-specific going on; if this happened for v4 as well I would surely
have run into it long since. This is relevant because, AIUI,
ether_output() is not protocol-aware...though it's been a while since I
read it, so even if that memory is correct it could be out of date.

Of course, I still haven't ruled out that it's something idiosyncratic
about my setup, but I'm considering that unlikely enough to be not
worth looking into unless, for example, I find I can't reproduce it in
a test-bench setup.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Greg Troxel
2013-02-08 17:57:31 UTC
Post by Mouse
Post by Greg Troxel
What is the interface type on the upstream link?
As seen by the host, Ethernet. The physical layer is DSL (MVL, to be
specific), but as far as I've been able to tell it's operationally just
a really slow Ethernet. (In case it matters, the DSL CPE is behind a
switch which vlan-trunks that, among other things, to the router host.)
The other end is not a mass-market big-ISP head; the DSL is being run
over dry copper to a DSLAM at my upstream/employer.
Wow, how unusual.

So:

What driver (wm? fxp? vr?) is in use?

Is the underlying physical interface the same one for the DSL-facing
vlan as the house-LAN-facing vlan?
Post by Mouse
Native, or at least as native as IPv6-over-Ethernet ever is. (Not
counting comments, there's nothing in the network config that says
vlan4 is a DSL line instead of a normal Ethernet.)
OK, that's what I meant by native.
Post by Mouse
Post by Greg Troxel
Static route, or some IGP, or BGP, or ?
Entirely manually-configured statics.
So this should be easy to reproduce. I do wonder if it has to do with
vlans.
Post by Mouse
Post by Greg Troxel
When you said "isn't answering", do you mean that you have an ndp
cache entry, but when the unicast packet is sent, there is no
link-level receiver? Or do you mean that there is no ndp entry,
because it doesn't answer the ndp request packets?
The latter. (It might have been the former initially, but certainly
after the reboot to clear the first wedge it will have been the latter
until I fixed things. The upstream host and my router disagreed over
what /127 to use on the DSL; I must have changed one end but not the
other and not rebooted since, or some such.)
I dimly remember that ARP used to store a packet when sending an arp
request, so that it could send it when the reply came. But that
requires freeing the old one (with a queue of 1). I am unclear on how
NDP does this, but it's something else to check.
Mouse
2013-02-08 18:20:34 UTC
Post by Greg Troxel
Post by Mouse
Post by Greg Troxel
What is the interface type on the upstream link?
As seen by the host, Ethernet. The physical layer is DSL (MVL, to
be specific), [...]. (In case it matters, the DSL CPE is behind a
switch which vlan-trunks that, among other things, to the router host.)
The other end is not a mass-market big-ISP head; the DSL is being
run over dry copper to a DSLAM at my upstream/employer.
Wow, how unusual.
:-)
Post by Greg Troxel
What driver (wm? fxp? vr?) is in use?
The one underlying the vlans? fxp. The only fxp - indeed, the only
real Ethernet hardware - in the system.
Post by Greg Troxel
Is the underlying physical interface the same one for the
DSl-facing vlan as the house-lan-facing vlan?
Yes (as you can no doubt deduce from the above :-).
Post by Greg Troxel
Post by Mouse
Post by Greg Troxel
Static route, or some IGP, or BGP, or ?
Entirely manually-configured statics.
So this should be easy to reproduce. I do wonder if it has to do
with vlans.
I now suspect it may. I just tried it with a setup involving multiple
real hardware interfaces and have been unable to reproduce it. This
news is just minutes old, so I haven't had a chance to set up a closer
replica involving vlans (preferably over fxp). I should see if I have
a vlan-capable switch here (I'm in a different city at the moment,
working with my secondary house LAN, not the primary one this first
occurred with).

If necessary I'll put the system I saw this on back the way it was, but
if it takes that then it'll slow down the hypothesize-test cycle
significantly.
Post by Greg Troxel
I dimly remember that ARP used to store a packet when sending an arp
request, so that it could send it when the reply came.
It does. I've sometimes started a ping and then, when the destination
host finally answered, gotten a reply with an RTT in the tens or
hundreds of seconds.
Post by Greg Troxel
But that requires freeing the old one (with a queue of 1).
Or freeing the new one, which the high RTTs I've seen imply is what
happens.
Post by Greg Troxel
I am unclear on how NDP does this, but it's something else to check.
I'm not sure either. It's something I should check.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
