panic: "(ln->la_flags & LLE_VALID) != 0" failed

Taylor R Campbell

2020-04-19 00:15:46 UTC

I recently upgraded to netbsd-9, and I've been seeing this panic every
couple of days, sometimes more than once a day:

panic: kernel diagnostic assertion "(ln->la_flags & LLE_VALID) != 0" failed: file "/home/riastradh/netbsd/9/src/sys/netinet6/nd6.c", line 2412

This is at:

https://nxr.netbsd.org/xref/src/sys/netinet6/nd6.c#2426

(The line number is slightly different in HEAD, but I think the logic
is essentially the same.)

I suspect what happened is:

1. Thread 0 issued nd6_lookup which:
   (a) acquired IF_AFDATA_RLOCK(ifp),
   (b) looked up lle and acquired LLE_WLOCK(lle), and then
   (c) released IF_AFDATA_RLOCK(ifp); meanwhile,

2. Thread 1 did something which called lltable_unlink_entry without
   holding LLE_WLOCK, perhaps llentries_unlink either via
   lltable_purge_entries or via lltable_prefix_free ->
   htable_prefix_free.  lltable_unlink_entry -> htable_unlink_entry
   clears LLE_VALID.

3. Thread 0 chokes on the cleared LLE_VALID.

Since thread 0 no longer holds IF_AFDATA_*LOCK, thread 1 can take it
and proceed, and since thread 1 _doesn't need_ LLE_*LOCK, the fact
that thread 0 is holding it doesn't prevent thread 1 from unlinking
lle.
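
To make the suspected interleaving concrete, here is a minimal
single-threaded model that replays steps 1-3 in order. It is only a
sketch: struct model_lle, the model_* functions, and the LLE_VALID value
are stand-ins invented for illustration, not the kernel's definitions,
and no real locks are involved.

```c
#include <stdbool.h>

#define LLE_VALID 0x1	/* stand-in value for this model only */

/* Hypothetical, simplified stand-in for struct llentry. */
struct model_lle {
	int  la_flags;
	bool wlocked;	/* models LLE_WLOCK being held by some thread */
};

/* Step 1: thread 0's lookup returns the entry write-locked,
 * with the table lock already dropped. */
static void
model_lookup(struct model_lle *lle)
{
	lle->wlocked = true;
}

/* Step 2: thread 1 unlinks under the table lock only;
 * it never looks at the per-entry lock. */
static void
model_unlink_unlocked(struct model_lle *lle)
{
	lle->la_flags &= ~LLE_VALID;
}

/* Step 3: the condition that thread 0's KASSERT checks. */
static bool
model_assert_would_hold(const struct model_lle *lle)
{
	return (lle->la_flags & LLE_VALID) != 0;
}

/* Replays steps 1-3 in order; returns true if the assertion would fire. */
static bool
model_replay_race(void)
{
	struct model_lle lle = { .la_flags = LLE_VALID, .wlocked = false };

	model_lookup(&lle);
	model_unlink_unlocked(&lle);
	return !model_assert_would_hold(&lle);
}
```

In this model the entry lock is held throughout step 2, yet nothing
stops the flag from being cleared, which is the point of the hypothesis.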

I haven't proven that lltable_purge_entries or lltable_prefix_free
happened at the time of the panic -- perhaps they are a red herring.
Anecdotally the system seems to start dropping packets for a few
seconds before it panics. I'm not the only one who has seen this
symptom. Has anyone dug into this?

The attached patch changes llentries_unlink to acquire LLE_WLOCK
before calling lltable_unlink_entry, and changes lltable_unlink_entry
to assert that the LLE_WLOCK is held before modifying the lle in case
there are other code paths I haven't found that need LLE_WLOCK but
lack it. Haven't tested it yet.

(Unclear whether *_link_entry needs the same treatment -- the two
callers, in_lltable_create and in6_lltable_create, both acquire
LLE_WLOCK immediately after lltable_link_entry but could acquire it
immediately before instead, I think.)
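
The shape of the change described above might look roughly like the
following userland sketch. It is not the attached patch: the sketch_*
names, the boolean standing in for LLE_WLOCK, and the decision to drop
the lock right after unlinking are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LLE_VALID 0x1	/* stand-in value for this sketch */

/* Hypothetical stand-in for struct llentry; wlocked models LLE_WLOCK. */
struct sketch_lle {
	int  la_flags;
	bool wlocked;
};

static void
sketch_wlock(struct sketch_lle *lle)
{
	assert(!lle->wlocked);	/* a real rwlock would block or panic here */
	lle->wlocked = true;
}

static void
sketch_wunlock(struct sketch_lle *lle)
{
	assert(lle->wlocked);
	lle->wlocked = false;
}

/* Analogue of lltable_unlink_entry with the proposed assertion added. */
static void
sketch_unlink_entry(struct sketch_lle *lle)
{
	assert(lle->wlocked);
	lle->la_flags &= ~LLE_VALID;
}

/* Analogue of llentries_unlink with the proposed locking added. */
static void
sketch_llentries_unlink(struct sketch_lle *entries, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		sketch_wlock(&entries[i]);
		sketch_unlink_entry(&entries[i]);
		sketch_wunlock(&entries[i]);
	}
}
```

The assertion in sketch_unlink_entry is what would flush out any other
caller that reaches the unlink path without the entry lock.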

Does this sound plausible?

Taylor R Campbell

2020-04-19 05:37:35 UTC

> Date: Sun, 19 Apr 2020 00:15:46 +0000
>
> The attached patch changes llentries_unlink to acquire LLE_WLOCK
> before calling lltable_unlink_entry, and changes lltable_unlink_entry
> to assert that the LLE_WLOCK is held before modifying the lle in case
> there are other code paths I haven't found that need LLE_WLOCK but
> lack it. Haven't tested it yet.

Evidently this doesn't work -- something already holds the LLE_*LOCK
in llentries_unlink:

#0 0xffffffff80222965 in cpu_reboot (howto=***@entry=260,
bootstr=***@entry=0x0)
at /home/riastradh/netbsd/9/src/sys/arch/amd64/amd64/machdep.c:728
#1 0xffffffff80985343 in vpanic (
fmt=***@entry=0xffffffff811d2c78 "lock error: %s: %s,%zu: %s: lock %p cpu %d lwp %p", ap=***@entry=0xffffa40255962568)
at /home/riastradh/netbsd/9/src/sys/kern/subr_prf.c:336
#2 0xffffffff809853f4 in panic (
fmt=***@entry=0xffffffff811d2c78 "lock error: %s: %s,%zu: %s: lock %p cpu %d lwp %p") at /home/riastradh/netbsd/9/src/sys/kern/subr_prf.c:255
#3 0xffffffff8097e816 in lockdebug_abort (
func=***@entry=0xffffffff810843b0 <__func__.6488> "rw_vector_enter",
line=***@entry=350, lock=***@entry=0xffff8cdb44308540,
ops=***@entry=0xffffffff814612d0 <rwlock_lockops>,
msg=***@entry=0xffffffff811cf84f "locking against myself")
at /home/riastradh/netbsd/9/src/sys/kern/subr_lockdebug.c:1047
#4 0xffffffff80959e2d in rw_abort (rw=***@entry=0xffff8cdb44308540,
msg=0xffffffff811cf84f "locking against myself", line=350,
func=<synthetic pointer>)
at /home/riastradh/netbsd/9/src/sys/kern/kern_rwlock.c:193
#5 0xffffffff8095a0ef in rw_vector_enter (rw=0xffff8cdb44308540, op=RW_WRITER)
at /home/riastradh/netbsd/9/src/sys/kern/kern_rwlock.c:350
#6 0xffffffff80a1de17 in llentries_unlink (
llt=<optimized out>)
at /home/riastradh/netbsd/9/src/sys/net/if_llatbl.c:309
#7 0xffffffff80a1deca in htable_prefix_free (llt=0xffff8cd47ecce108,
prefix=<optimized out>, mask=<optimized out>, flags=<optimized out>)
at /home/riastradh/netbsd/9/src/sys/net/if_llatbl.c:288
#8 0xffffffff80a1e67e in lltable_prefix_free (af=24,
prefix=0xffffa402559627e8, mask=0xffff8cdb53ee432c, flags=0)
at /home/riastradh/netbsd/9/src/sys/net/if_llatbl.c:521
#9 0xffffffff80a42fde in rtrequest1 (req=***@entry=2,
info=***@entry=0xffffa402559628d0,
ret_nrt=***@entry=0xffffa402559628c8)
at /home/riastradh/netbsd/9/src/sys/net/route.c:1221
#10 0xffffffff80a43861 in rtinit (ifa=***@entry=0xffff8cdb53ee4248,
cmd=***@entry=2, flags=***@entry=0)
at /home/riastradh/netbsd/9/src/sys/net/route.c:1625
#11 0xffffffff80708ab6 in in6_ifremprefix (
target=***@entry=0xffff8cdb53ee4248)
at /home/riastradh/netbsd/9/src/sys/netinet6/in6.c:313
#12 0xffffffff80708cb3 in in6_ifremprefix (target=0xffff8cdb53ee4248)
at /home/riastradh/netbsd/9/src/sys/netinet6/in6.c:1490
#13 in6_purgeaddr (ifa=0xffff8cdb53ee4248)
at /home/riastradh/netbsd/9/src/sys/netinet6/in6.c:1428
#14 0xffffffff8070b5e3 in in6_control1 (ifp=0xffffa400469ca008,
data=0xffff8cd790d34680, cmd=2166384921, so=0x81206919)
at /home/riastradh/netbsd/9/src/sys/netinet6/in6.c:719
#15 in6_control (so=***@entry=0xffff8cdb52550728, cmd=***@entry=2166384921,
data=***@entry=0xffff8cd790d34680, ifp=***@entry=0xffffa400469ca008)
at /home/riastradh/netbsd/9/src/sys/netinet6/in6.c:772
#16 0xffffffff8072a0e6 in udp6_ioctl (ifp=0xffffa400469ca008,
addr6=0xffff8cd790d34680, cmd=2166384921, so=0xffff8cdb52550728)
at /home/riastradh/netbsd/9/src/sys/netinet6/udp6_usrreq.c:1210
#17 udp6_ioctl_wrapper (a=0xffff8cdb52550728, b=2166384921,
c=0xffff8cd790d34680, d=0xffffa400469ca008)
at /home/riastradh/netbsd/9/src/sys/netinet6/udp6_usrreq.c:1491
#18 0xffffffff806866a9 in compat_ifioctl (so=0xffff8cdb52550728,
ocmd=2166384921, cmd=2166384921, data=0xffff8cd790d34680,
l=<optimized out>)
at /home/riastradh/netbsd/9/src/sys/compat/common/if_43.c:278
#19 0xffffffff80a0dbd3 in doifioctl (so=0xffff8cdb52550728,
cmd=<optimized out>, data=0xffff8cd790d34680, l=0xffff8cdb52c1da20)
at /home/riastradh/netbsd/9/src/sys/net/if.c:3394
#20 0xffffffff80990408 in sys_ioctl (l=<optimized out>,
uap=0xffffa40255963000, retval=<optimized out>)
at /home/riastradh/netbsd/9/src/sys/kern/sys_generic.c:671
#21 0xffffffff8024bb37 in sy_call (rval=0xffffa40255962fb0,
uap=0xffffa40255963000, l=0xffff8cdb52c1da20,
sy=0xffffffff8145c950 <sysent+1296>)
at /home/riastradh/netbsd/9/src/sys/sys/syscallvar.h:65
#22 sy_invoke (code=54, rval=0xffffa40255962fb0, uap=0xffffa40255963000,
l=0xffff8cdb52c1da20, sy=0xffffffff8145c950 <sysent+1296>)
at /home/riastradh/netbsd/9/src/sys/sys/syscallvar.h:94
#23 syscall (frame=0xffffa40255963000)
at /home/riastradh/netbsd/9/src/sys/arch/x86/x86/syscall.c:138
#24 0xffffffff802096dd in handle_syscall ()

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Roy Marples

2020-04-22 19:41:55 UTC

Taylor R Campbell wrote:
> I recently upgraded to netbsd-9, and I've been seeing this panic every
> panic: kernel diagnostic assertion "(ln->la_flags & LLE_VALID) != 0" failed: file "/home/riastradh/netbsd/9/src/sys/netinet6/nd6.c", line 2412
> https://nxr.netbsd.org/xref/src/sys/netinet6/nd6.c#2426
> (The line number is slightly different in HEAD, but I think the logic
> is essentially the same.)

This is now fixed in src/sys/netinet6/nd6_nbr.c r1.178.
Here's the commit message explaining the issue:

inet6: nd6_na_input() now considers ln_state <= ND6_LLINFO_INCOMPLETE

Otherwise, if ln_state != ND6_LLINFO_INCOMPLETE, there is no lladdr,
and this message was solicited, then ln_state is set to ND6_LLINFO_REACHABLE,
which could then cause a panic in nd6_resolve().
If ln_state > ND6_LLINFO_INCOMPLETE then it's assumed we have a lladdr.

Potentially this could have been triggered by the introduction of
ND6_LLINFO_PURGE in nd6.c r1.143 but also by the re-introduction of
ND6_LLINFO_INCOMPLETE in nd6.c r1.263.
Depending on the timing, it's technically possible to receive such
a message after the llentry is created with ND6_LLINFO_NOSTATE.

Ironically, NetBSD-8 and older are not affected because the KASSERT logic is
inverted - if we have a lladdr, ln_state MUST be > ND6_LLINFO_INCOMPLETE.
However, ln_state is still set incorrectly, which *might* affect things elsewhere.
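
As a rough illustration of the comparison change described in the commit
message, the sketch below models only the state check. It is not the
code from nd6_nbr.c: struct model_ln and the model_* functions are
hypothetical, has_lladdr stands in for the entry having a valid
link-layer address, and the numeric state values are assumptions (only
their ordering relative to ND6_LLINFO_INCOMPLETE matters here).

```c
#include <stdbool.h>

/* Assumed values for illustration; only the ordering matters. */
#define ND6_LLINFO_PURGE	(-3)
#define ND6_LLINFO_NOSTATE	(-2)
#define ND6_LLINFO_INCOMPLETE	0
#define ND6_LLINFO_REACHABLE	1

/* Hypothetical, simplified neighbour cache entry. */
struct model_ln {
	int  ln_state;
	bool has_lladdr;
};

/* Models handling of a solicited neighbour advertisement.
 * fixed == false uses the old "==" comparison, true uses "<=". */
static void
model_na_input(struct model_ln *ln, bool msg_has_lladdr, bool fixed)
{
	bool unresolved = fixed
	    ? ln->ln_state <= ND6_LLINFO_INCOMPLETE
	    : ln->ln_state == ND6_LLINFO_INCOMPLETE;

	if (unresolved) {
		if (!msg_has_lladdr)
			return;		/* nothing to record; leave the entry alone */
		ln->has_lladdr = true;
	}
	ln->ln_state = ND6_LLINFO_REACHABLE;
}

/* The combination that the later assertion objects to:
 * a state above INCOMPLETE without a link-layer address. */
static bool
model_is_inconsistent(const struct model_ln *ln)
{
	return ln->ln_state > ND6_LLINFO_INCOMPLETE && !ln->has_lladdr;
}
```

With the old comparison, an entry still in ND6_LLINFO_NOSTATE that
receives a solicited advertisement without a link-layer address ends up
REACHABLE with no address; with the new comparison it is left untouched.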

I've submitted a pullup for -9 already.

Roy
