Discussion:
hangs with awge(4) on pine64 rock64 board
matthew green
2019-07-21 05:17:18 UTC
hi folks..


i've been debugging a hang on the rock64. it's fairly easy to
trigger -- send a lot of data at it.

from ddb i would usually see one cpu with an lwp, usually the
idle lwp, fast-switched to softnet, and then fast-switched again
to the softser lwp. it looked like a kernel lock issue, as the
kernel lock was held and at least one thread was waiting for
it. i couldn't really tell what was going on.

i tried enabling NET_MPSAFE (which changes the behaviour of
awge(4) / dwc_gmac.c as well as the network stack). that kernel
ran for a lot longer, but ended up locking up again, this time
with rt_lock being waited upon. again, i couldn't find where it
was held or what context should have been giving it up, though
i did again think about arm's pic_dispatch() being the last
lock and unlock of kernel_lock. then i realised that even with
NET_MPSAFE, awge(4)'s frontends don't set up MPSAFE interrupts.
with a kernel patched to do that under NET_MPSAFE i've had over
5 hours of heavy network access without a hang.
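
for readers unfamiliar with the MPSAFE distinction, here is a minimal userspace sketch of the idea, not the actual kernel code. it assumes a dispatcher that wraps handlers in a big lock unless the interrupt source is marked MP-safe, which is roughly the role pic_dispatch() plays with kernel_lock as described above. every name in it (struct big_lock, struct intr_source, intr_dispatch, count_handler) is hypothetical and invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's big lock; it only counts usage. */
struct big_lock {
	int held;		/* current hold depth */
	int acquisitions;	/* total number of times the lock was taken */
};

/* Hypothetical interrupt source: handler, its argument, and an MP-safe flag. */
struct intr_source {
	int (*func)(void *);
	void *arg;
	bool mpsafe;
};

void
big_lock_enter(struct big_lock *bl)
{
	bl->held++;
	bl->acquisitions++;
}

void
big_lock_exit(struct big_lock *bl)
{
	assert(bl->held > 0);	/* releasing a lock that is not held is a bug */
	bl->held--;
}

/*
 * Dispatch one interrupt. Handlers that are not marked MP-safe run
 * under the big lock; MP-safe handlers run without it and must do
 * their own locking.
 */
int
intr_dispatch(struct big_lock *bl, const struct intr_source *is)
{
	int rv;

	if (!is->mpsafe) {
		big_lock_enter(bl);
		rv = is->func(is->arg);
		big_lock_exit(bl);
	} else {
		rv = is->func(is->arg);
	}
	return rv;
}

/* Example handler: counts how many times it has run. */
int
count_handler(void *arg)
{
	int *count = arg;

	(*count)++;
	return 1;
}
```

in this model, a driver frontend that registers its handler with mpsafe set to false keeps taking the big lock on every interrupt, regardless of how the rest of the stack is locked. that is the situation the patch changes for awge(4).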

i don't know what the underlying issue is here. it could be
some network stack bug, it could be an awge/gmac bug, it could
be an arm or arm64 bug..

anyone have a clue where to investigate next? alternatively,
how far off is NET_MPSAFE default? :)


.mrg.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Manuel Bouyer
2019-07-21 10:12:01 UTC
It looks like something I fixed some time ago in the arm pmap:
http://mail-index.netbsd.org/source-changes/2019/04/23/msg105355.html

maybe arm64 has a similar issue.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference