Taylor R Campbell
2020-04-19 00:15:46 UTC
I recently upgraded to netbsd-9, and I've been seeing this panic every
couple days, sometimes more than once a day:
panic: kernel diagnostic assertion "(ln->la_flags & LLE_VALID) != 0" failed: file "/home/riastradh/netbsd/9/src/sys/netinet6/nd6.c", line 2412
This is at:
https://nxr.netbsd.org/xref/src/sys/netinet6/nd6.c#2426
(The line number is slightly different in HEAD, but I think the logic
is essentially the same.)
I suspect what happened is:
1. Thread 0 issued nd6_lookup which:
   (a) acquired IF_AFDATA_RLOCK(ifp),
   (b) looked up lle and acquired LLE_WLOCK(lle), and then
   (c) released IF_AFDATA_RLOCK(ifp); meanwhile,
2. Thread 1 did something which called lltable_unlink_entry without
   holding LLE_WLOCK, perhaps llentries_unlink either via
   lltable_purge_entries or via lltable_prefix_free ->
   htable_prefix_free.  lltable_unlink_entry -> htable_unlink_entry
   clears LLE_VALID.
3. Thread 0 chokes on the cleared LLE_VALID.
Since thread 0 no longer holds IF_AFDATA_*LOCK, thread 1 can take it
and proceed, and since thread 1 _doesn't need_ LLE_*LOCK, the fact
that thread 0 is holding it doesn't prevent thread 1 from unlinking
lle.
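To make the suspected interleaving concrete, here is a toy userland
model in C.  It is only a sketch, not kernel code: the toy_* names are
made up for illustration, and plain booleans and counters stand in for
the real IF_AFDATA and LLE locks so that the interleaving can be
replayed deterministically in a single thread.

```c
#include <stdbool.h>
#include <stddef.h>

#define LLE_VALID 0x1	/* stand-in for the kernel flag of the same name */

/* Hypothetical toy types; the real structures are much larger. */
struct toy_lle {
	int	la_flags;
	bool	wlocked;	/* models LLE_WLOCK being held */
};

struct toy_ifp {
	int		readers;	/* models IF_AFDATA_RLOCK holders */
	bool		writer;		/* models IF_AFDATA_WLOCK being held */
	struct toy_lle	*entry;		/* one-entry "table" */
};

/* Thread 0, steps (a)-(c): returns the entry with its lock held. */
static struct toy_lle *
toy_lookup(struct toy_ifp *ifp)
{
	struct toy_lle *lle;

	ifp->readers++;			/* (a) take the table read lock */
	lle = ifp->entry;
	if (lle != NULL)
		lle->wlocked = true;	/* (b) take the entry lock */
	ifp->readers--;			/* (c) drop the table read lock */
	return lle;
}

/*
 * Thread 1: unlinks under the table write lock only.  It never looks
 * at lle->wlocked, which is the suspected bug.  Returns false if the
 * table lock is busy (the real code would block instead).
 */
static bool
toy_unlink_unlocked(struct toy_ifp *ifp)
{
	if (ifp->readers != 0 || ifp->writer)
		return false;
	ifp->writer = true;
	if (ifp->entry != NULL) {
		ifp->entry->la_flags &= ~LLE_VALID;
		ifp->entry = NULL;
	}
	ifp->writer = false;
	return true;
}

/* Replays steps 1-3; returns whether thread 0 still sees LLE_VALID. */
static bool
toy_replay_race(void)
{
	struct toy_lle lle = { .la_flags = LLE_VALID, .wlocked = false };
	struct toy_ifp ifp = { .readers = 0, .writer = false, .entry = &lle };
	struct toy_lle *ln;

	ln = toy_lookup(&ifp);			/* thread 0 */
	(void)toy_unlink_unlocked(&ifp);	/* thread 1 */
	return (ln->la_flags & LLE_VALID) != 0;	/* thread 0's assertion */
}
```

In this model toy_replay_race returns false: thread 0 still holds the
entry lock, yet the flag it is about to assert on has been cleared
underneath it.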
I haven't proven that lltable_purge_entries or lltable_prefix_free
happened at the time of the panic -- perhaps they are a red herring.
Anecdotally the system seems to start dropping packets for a few
seconds before it panics. I'm not the only one who has seen this
symptom. Has anyone dug into this?
The attached patch changes llentries_unlink to acquire LLE_WLOCK
before calling lltable_unlink_entry, and changes lltable_unlink_entry
to assert that the LLE_WLOCK is held before modifying the lle in case
there are other code paths I haven't found that need LLE_WLOCK but
lack it. I haven't tested it yet.
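The shape of that change can be sketched in the same toy userland
style.  This is not the attached patch: the toy_* names are
hypothetical, a boolean stands in for LLE_WLOCK, and where the real
code would sleep waiting for the lock this model just reports that it
would have to wait.

```c
#include <assert.h>
#include <stdbool.h>

#define LLE_VALID 0x1	/* stand-in for the kernel flag of the same name */

/* Hypothetical toy entry; wlocked models LLE_WLOCK being held. */
struct toy_lle {
	int	la_flags;
	bool	wlocked;
};

/* Low-level unlink: now insists that the caller holds the entry lock. */
static void
toy_unlink_entry(struct toy_lle *lle)
{
	assert(lle->wlocked);
	lle->la_flags &= ~LLE_VALID;
}

/*
 * Patched caller: takes the entry lock before unlinking.  Returns
 * false if the lock is already held by someone else; the real code
 * would block in LLE_WLOCK at that point rather than return.
 */
static bool
toy_llentries_unlink(struct toy_lle *lle)
{
	if (lle->wlocked)
		return false;
	lle->wlocked = true;
	toy_unlink_entry(lle);
	lle->wlocked = false;
	return true;
}

/*
 * Replays the earlier interleaving against the patched caller.
 * Returns true if thread 1 had to wait and LLE_VALID survived.
 */
static bool
toy_replay_patched(void)
{
	struct toy_lle lle = { .la_flags = LLE_VALID, .wlocked = false };
	bool unlinked;

	lle.wlocked = true;			/* thread 0 after lookup */
	unlinked = toy_llentries_unlink(&lle);	/* thread 1 */
	return !unlinked && (lle.la_flags & LLE_VALID) != 0;
}
```

With the entry lock required, thread 1 cannot clear LLE_VALID while
thread 0 holds the entry, and any remaining caller that skips the lock
trips the new assertion instead of silently racing.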
(Unclear whether *_link_entry needs the same treatment -- the two
callers, in_lltable_create and in6_lltable_create, both acquire
LLE_WLOCK immediately after lltable_link_entry but could acquire it
immediately before instead, I think.)
Does this sound plausible?