lockdebug kernel instacrash when npf enabled
Mindaugas Rasiukevicius
2019-02-24 21:22:48 UTC
Enabling NPF.
[ 22.6038371] panic: kernel debugging assertion "pserialize_not_in_read_section()" failed: file "/work/src/sys/kern/kern_mutex.c", line 527
[ 22.7529500] cpu0: Begin traceback...
[ 22.7976654] 0x99deba54: netbsd:db_panic+0x14
[ 22.8465447] 0x99deba6c: netbsd:vpanic+0x194
[ 22.8985454]
<...>
r1.29 of npf_tableset.c changed t_lock from IPL_NET to IPL_NONE.
Based on the above it looks like it needs to be at IPL_SOFTNET.
@rmind could you please have a look?
It is a bug, but only one aspect of it. Yes, the mutex can be IPL_SOFTNET,
but at that level it behaves more or less like IPL_NONE. The real bug is
that the code path in question might block while inside a pserialize(9)
read section. There are a few ways to fix this:

- Convert the mutex to a spin lock at IPL_NET (though that is excessive) and
convert the memory allocations in that code path to KM_NOSLEEP.

- Extend pserialize(9) by implementing Sleepable RCU (SRCU) or equivalent.

- Sprinkle psref(9), but that is ugly and undesirable in the long term.
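
To illustrate the first option, here is a minimal userspace sketch, not kernel code. It assumes a simplified model in which a counter stands in for the pserialize(9) read section and a C11 atomic_flag stands in for the table lock. All names (model_read_enter, model_may_sleep, struct model_table and so on) are hypothetical and come from neither NPF nor the NetBSD kernel. It only shows why a may-sleep lock and a may-sleep allocation trip a LOCKDEBUG-style check inside a read section, while a spinning lock and a no-sleep allocation do not.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Userspace model only. Every name here is hypothetical; none of it is
 * the NetBSD kernel API or the NPF source.
 */

static int model_read_depth;   /* > 0 while inside a modeled read section */
static int model_violations;   /* may-sleep operations seen in a read section */

static void model_read_enter(void) { model_read_depth++; }
static void model_read_exit(void)  { assert(model_read_depth > 0); model_read_depth--; }

/* Stand-in for the debug check: record any may-sleep call in a read section. */
static void model_may_sleep(void)
{
	if (model_read_depth > 0)
		model_violations++;
}

struct model_table {
	atomic_flag lock;      /* guards nentries */
	int nentries;
};

static void model_table_init(struct model_table *t)
{
	atomic_flag_clear(&t->lock);
	t->nentries = 0;
}

/* Models an adaptive mutex: it could sleep if contended, so it is checked. */
static void model_lock_adaptive(struct model_table *t)
{
	model_may_sleep();
	while (atomic_flag_test_and_set(&t->lock))
		continue;
}

/* Models a spin lock: it busy-waits and never sleeps, so no check applies. */
static void model_lock_spin(struct model_table *t)
{
	while (atomic_flag_test_and_set(&t->lock))
		continue;
}

static void model_unlock(struct model_table *t) { atomic_flag_clear(&t->lock); }

/* Models a KM_SLEEP-style allocation: it may wait for memory. */
static void *model_alloc_sleep(size_t n) { model_may_sleep(); return malloc(n); }

/* Models a KM_NOSLEEP-style allocation: it returns NULL instead of waiting. */
static void *model_alloc_nosleep(size_t n) { return malloc(n); }

/* Problematic shape: the lock and the allocation may both sleep. */
static int model_insert_blocking(struct model_table *t)
{
	model_read_enter();
	model_lock_adaptive(t);
	void *e = model_alloc_sleep(16);
	int ok = (e != NULL);
	if (ok)
		t->nentries++;
	free(e);               /* the entry itself is irrelevant to the model */
	model_unlock(t);
	model_read_exit();
	return ok ? 0 : -1;
}

/* Shape of the first option: spin lock plus no-sleep allocation. */
static int model_insert_nonblocking(struct model_table *t)
{
	model_read_enter();
	model_lock_spin(t);
	void *e = model_alloc_nosleep(16);
	int ok = (e != NULL);  /* the caller must cope with failure */
	if (ok)
		t->nentries++;
	free(e);
	model_unlock(t);
	model_read_exit();
	return ok ? 0 : -1;
}
```

With a freshly initialized table, model_insert_blocking() records two violations (one for the lock, one for the allocation), and model_insert_nonblocking() records none. A real no-sleep path would also need a sensible way to handle allocation failure, which this sketch only reports as -1.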

I have not had free time to work on a solution yet, but I hope to fix
this soonish and commit it with the next batch of NPF fixes/improvements.

Meanwhile, if you want to run with LOCKDEBUG until this gets fixed, then
as a workaround I suggest commenting out that assert, as you are very
unlikely to hit the crash condition of this bug; it can only happen when
you perform an NPF reload, and you need to be unlucky enough to have the
relevant mutex (used only for LPM-type tables) contended and blocking.
--
Mindaugas

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Christos Zoulas
2019-02-25 00:33:57 UTC
But commenting out the assert will cripple the test for everything. We've
discussed this before, and we even had a psref patch IIRC; why did it
get lost?

christos

