missing KERNEL

Discussion:

missing KERNEL_LOCK ?

(too old to reply)

Manuel Bouyer

2014-04-09 15:11:51 UTC

Hello,
as reported on current-users@ some time ago, I experience mbuf leaks
on a router running netbsd-6-1 of december. It's not a slow, continous leak
but a very sudden one:
in regular use netstat -m reports between 800 and 3000 mbufs in use,
and it can run for days. But very suddenly, in a few minutes the number of
mbufs can rise up to the limit (which I raised to 512k mbclusters).
I suspected it was associated with a specific traffic, eventually associated
with ipf. I suspected this because it seems it started to show up after
a ipf rules changes, which caused ipf to return icmp errors to ntp scans,
and MBUFTRACE showed that the lost mbufs were in output queues.

The attached patch does 2 things.
The first one is to add KASSERT(KERNEL_LOCKED_P()) at strategic places,
where we're about to touch the output queues. This is because the
output queues are protected by spl() calls only, so the kernel lock
must be held to protect them.
As the KASSERT() did fire, I then added KERNEL_LOCK()/UNLOCK() where needed.
I found that ifp_ouput() was being called without KERNEL_LOCK() from
ipf(4), and carp(4) so far.

I'm now running with this patch for 2 days without apparent problem, and
the leak didn't show up (but it's too soon to say if the issue is
fixed or not).

So, 2 questions:
- is my analysis right ?
- if so, should I commit the patch as is ?

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Greg Troxel

2014-04-10 19:32:29 UTC

Permalink

That looks plausibly correct.

Two thoughts:

Have you run with LOCKDEBUG? We have found that netbsd-6 can run
with LOCKDEBUG and stay up, only about a 2x speed penalty.

you are making changes to the upstream ipfilter. that's just how it
is. But I wonder if the lock should be in the ifdef on recent netbsd;
I do not recall when the locking rules changed.

Manuel Bouyer

2014-04-11 14:26:04 UTC

Permalink

Post by Greg Troxel
That looks plausibly correct.
Have you run with LOCKDEBUG? We have found that netbsd-6 can run
with LOCKDEBUG and stay up, only about a 2x speed penalty.

No, I've not tried this. Indeed LOCKDEBUG causes more serialization so it
could hide this kind of problem.

Post by Greg Troxel
you are making changes to the upstream ipfilter. that's just how it
is. But I wonder if the lock should be in the ifdef on recent netbsd;
I do not recall when the locking rules changed.

somewhere between -5 and -6, and only for netinet (the netinet6 stack is
still running under KENREL_LOCK)

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2014-04-11 14:38:30 UTC

Permalink

Post by Manuel Bouyer

Post by Greg Troxel
That looks plausibly correct.
Have you run with LOCKDEBUG? We have found that netbsd-6 can run
with LOCKDEBUG and stay up, only about a 2x speed penalty.

No, I've not tried this. Indeed LOCKDEBUG causes more serialization so it
could hide this kind of problem.

Perhaps, but the system not crashing with LOCKDEBUG is an important
property to maintain. So I personally would not commit this without a
LOCKDEBUG run.

Manuel Bouyer

2014-04-11 15:21:36 UTC

Permalink

Post by Greg Troxel

Post by Manuel Bouyer

Post by Greg Troxel
That looks plausibly correct.
Have you run with LOCKDEBUG? We have found that netbsd-6 can run
with LOCKDEBUG and stay up, only about a 2x speed penalty.

No, I've not tried this. Indeed LOCKDEBUG causes more serialization so it
could hide this kind of problem.

Perhaps, but the system not crashing with LOCKDEBUG is an important
property to maintain. So I personally would not commit this without a
LOCKDEBUG run.

AFAIK LOCKDEBUG doens't covers KERNEL_LOCK, but I can test it anyway.

Manuel Bouyer

2014-04-24 10:21:22 UTC

Permalink

Post by Manuel Bouyer
[...]
The attached patch does 2 things.
The first one is to add KASSERT(KERNEL_LOCKED_P()) at strategic places,
where we're about to touch the output queues. This is because the
output queues are protected by spl() calls only, so the kernel lock
must be held to protect them.
As the KASSERT() did fire, I then added KERNEL_LOCK()/UNLOCK() where needed.
I found that ifp_ouput() was being called without KERNEL_LOCK() from
ipf(4), and carp(4) so far.
I'm now running with this patch for 2 days without apparent problem, and
the leak didn't show up (but it's too soon to say if the issue is
fixed or not).
- is my analysis right ?
- if so, should I commit the patch as is ?

Hello,
I've still not had any problem with this patch. I've now also tested a
LOCKDEBUG kernel, without problem either.
So, unless someone objects, I'll commit it in the next few days.