Discussion: nfs server lockup (again)
Manuel Bouyer
2011-06-17 09:45:11 UTC
Hello,
some time ago I reported that an NFS server, rock solid when running a
5.0_STABLE kernel from February 2010, would lock up when upgraded to
5.1_STABLE. I couldn't reproduce this on a testbed, but I got the
opportunity to try the faulty kernel again on a production server, and
I can now report how it deadlocks.

The root cause is an nfsd thread closing a socket:
db{0}> tr/a 0x00000000d01585c0
trace: pid 232 lid 19 at 0xd01dc9ac
sleepq_block(1,0,c063c259,c06a513c,c2027108,1,c0723d00,0,1,2df948) at netbsd:sleepq_block+0xeb
kpause(c063c259,0,1,0,c0715eca,c0387e83,c569f858,1000,d4263000,c569f800) at netbsd:kpause+0xe3
uvm_unloanpage(c569f858,1,2,c0123b65,c569f800,0,1000,0,c58902d8,4dfabc08) at netbsd:uvm_unloanpage+0x69
sodopendfreel(c0715ec8,0,d01dcabc,c03880a2,c58902d8,6,0,0,0,0) at netbsd:sodopendfreel+0xe1
sodopendfree(c58902d8,6,0,0,0,0,d01dcadc,cd5a4f80,0,c58902d8) at netbsd:sodopendfree+0x2a
sodisconnect(c58902d8,d01585c0,7,c0723e60,d6fdd700,4dfabc08,c0723e60,c032c4c2,d6fdd700,d6fdd700) at netbsd:sodisconnect+0x62
soclose(c58902d8,0,d01dcb4c,c031a3cd,d6fdd700,c0724060,d01dcb2c,0,c0272501,d01585c0) at netbsd:soclose+0x1b9
closef(d6fdd700,c02737e4,0,d4261d40,d4261d40,d01d93cc,d01dcbec,c027313d,d4261d40,d01d93cc) at netbsd:closef+0x5d
nfsrv_slpderef(d4261d40,d01d93cc,d01dcbd4,d01dcbda,c07277a0,c0724060,0,c06a7b18,c0724060,0) at netbsd:nfsrv_slpderef+0x9e
nfssvc_nfsd(d01dcc38,b8fffc28,d01585c0,0,0,0,c0724002,c0368cd4,0,d01585c0) at netbsd:nfssvc_nfsd+0x1cd
sys_nfssvc(d01585c0,d01dcd00,d01dcd28,0,d0153c80,0,0,4,b8fffc28,7c) at netbsd:sys_nfssvc+0x332
syscall(d01dcd48,b3,ab,1f,1f,0,b8e00000,b8fffcac,b8fffc28,4) at netbsd:syscall+0

The page passed to uvm_unloanpage() has a uobject, and for some reason
the object's lock was not available when uvm_unloanpage() tried to take
it, so it went to kpause(). Note that the lock is free by now:
db{0}> x/x c569f858
0xc569f858: c2027108
db{0}> show page c2027108
PAGE 0xc2027108:
flags=c<TABLED,CLEAN>, pqflags=200<PRIVATE2>, wire_count=0, pa=0x55ef8000
uobject=0xd38d80b8, uanon=0x0, offset=0x0 loan_count=2
[page ownership tracking disabled]

(gdb) print &((struct uvm_object*)0xd38d80b8)->vmobjlock
$3 = (kmutex_t *) 0xd38d80b8
db{0}> sh lock 0xd38d80b8
lock address : 0x00000000d38d80b8 type : sleep/adaptive
initialized : 0x00000000c03a0746
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 2
current lwp : 0x00000000cd5a7c80 last held: 000000000000000000
last locked : 0x00000000c03a121e unlocked : 0x00000000c03a15d9
owner field : 000000000000000000 wait/spin: 0/0
Turnstile chain at 0xc0723ce0.
=> No active turnstile for this lock.

It looks like this is a vnode interlock:
0xc03a121e is in vrelel (/home/src/src-5/src/sys/kern/vfs_subr.c:1497).
1492 *
1493 * Note that VOP_INACTIVE() will drop the vnode lock.
1494 */
1495 VOP_INACTIVE(vp, &recycle);
1496 mutex_enter(&vp->v_interlock);
1497 vp->v_iflag &= ~VI_INACTNOW;
1498 cv_broadcast(&vp->v_cv);
1499 if (!recycle) {
1500 if (vtryrele(vp)) {
1501 mutex_exit(&vp->v_interlock);
(gdb) l *(0x00000000c03a15d9)
0xc03a15d9 is in vrelel (/home/src/src-5/src/sys/kern/vfs_subr.c:1567).
1562 } else {
1563 vp->v_freelisthd = &vnode_free_list;
1564 }
1565 TAILQ_INSERT_TAIL(vp->v_freelisthd, vp, v_freelist);
1566 mutex_exit(&vnode_free_list_lock);
1567 mutex_exit(&vp->v_interlock);
1568 }
1569 }

(side question: is it OK to put a vnode on a free list while a loan is
still active?)

soclose() took the socket lock, which happens to be softnet_lock
(from a quick look this seems OK; I couldn't find where in nfs or tcp
the socket's lock would be changed to something other than softnet_lock).

So we are kpause()ing with softnet_lock held, and this is bad because
the CPU's softint thread can run, call one of the nfs or tcp timer
routines, and sleep on softnet_lock:
19 3 0 4 d01585c0 slave livelock
46 3 6 204 cd5bd860 softclk/6 tstile
5 3 0 204 cd5a7500 softclk/0 tstile

db{0}> tr/a cd5bd860
trace: pid 0 lid 46 at 0xcece0bc0
sleepq_block(0,0,c0647bf4,c06a4a9c,cd5a4f80,c0724000,ced28a78,dc,40,1000001) at netbsd:sleepq_block+0xeb
turnstile_block(ced28a60,1,cd5a4f80,c06a4a9c,cec6724c,cece4000,cece0c80,ced28a60,0,0) at netbsd:turnstile_block+0x261
mutex_vector_enter(cd5a4f80,cece4000,cece0ce0,0,c0349cc2,cd5bd860,7,cece4000,cece4000,10c) at netbsd:mutex_vector_enter+0x317
tcp_timer_rexmt(c573e408,cd5bd860,7,cece4000,cecd6074,cece4060,cece4868,cece5068,cece5868,c0135fb0) at netbsd:tcp_timer_rexmt+0x1f
callout_softclock(0,10,30,10,10,0,156320,d01cca6c,30,cece0010) at netbsd:callout_softclock+0x25f
softint_dispatch(cd5bb0c0,2,0,0,0,0,cece0d90,cece0b68,cece0bc0,0) at netbsd:softint_dispatch+0x9f

And now that our softclk threads are sleeping, waiting for the lock held
by the nfsd in kpause(), the kpause() will never be woken up and the
kernel is deadlocked.
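
To make the cycle concrete, here is a small userland analogue (nothing
NetBSD-specific; the names only mirror the roles above, and this is
purely illustrative, not kernel code): one thread sleeps waiting for a
"tick" while holding the big lock, and the only thread that can deliver
the tick has to take that lock first.

/* cc -o deadlock deadlock.c -lpthread; the program hangs forever */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t biglock = PTHREAD_MUTEX_INITIALIZER; /* plays softnet_lock */
static pthread_mutex_t tickmtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tick    = PTHREAD_COND_INITIALIZER;  /* plays the clock tick */

static void *
nfsd_thread(void *arg)		/* plays the nfsd stuck in kpause() */
{
	pthread_mutex_lock(&biglock);		/* soclose() takes the socket lock */
	pthread_mutex_lock(&tickmtx);
	printf("nfsd: waiting for a tick while holding biglock\n");
	pthread_cond_wait(&tick, &tickmtx);	/* kpause(): only "softclk" can wake us */
	pthread_mutex_unlock(&tickmtx);
	pthread_mutex_unlock(&biglock);
	return NULL;
}

static void *
softclk_thread(void *arg)	/* plays the softclock thread */
{
	sleep(1);
	pthread_mutex_lock(&biglock);	/* tcp_timer_rexmt() wants softnet_lock: blocks forever */
	pthread_cond_broadcast(&tick);	/* never reached, so the tick is never delivered */
	pthread_mutex_unlock(&biglock);
	return NULL;
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, nfsd_thread, NULL);
	pthread_create(&b, NULL, softclk_thread, NULL);
	pthread_join(a, NULL);		/* never returns */
	return 0;
}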

Now the question is why this doesn't happen with the February 2010
5.0_STABLE kernel; nfsd_use_loan is also set to 1 in that kernel.
One thing that may be relevant, because it defers more work to a
callout running under softnet_lock, is tcp_input.c 1.291.4.3.

I guess this can happen with any TCP application using page loan, not only
nfsd.
Any idea on how to properly fix this?
A workaround could be to use yield() in uvm_loan.c, because it would not
require a clock tick to wake up.
I'm not sure whether it's possible to drop the socket lock before
unloaning the pages.
But I wonder if this could be a more general issue with callouts.
Maybe we should have one thread per callout, a la fast_softint,
to be used when a callout needs to sleep?
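
The yield() idea would look roughly like this (only the hypothetical
shape of the retry; I haven't checked how the loop in uvm_unloanpage()
is actually structured):

#include <sys/param.h>
#include <sys/mutex.h>
#include <sys/proc.h>

/* hypothetical helper, not the real uvm_loan.c code */
static void
unloan_lock_object(kmutex_t *slock)
{
	while (!mutex_tryenter(slock)) {
		/*
		 * Today the code kpause()s for one tick here, which is
		 * what deadlocks against softclock.  yield() just gives
		 * up the CPU and retries; no callout is involved.
		 */
		yield();
	}
	/* lock is now held; unloan the page, then mutex_exit(slock) */
}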
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Manuel Bouyer
2011-06-23 14:23:59 UTC
[...]
Wouldn't a workqueue be a better solution? No need to set up anything,
just create a wq and use it where it is needed.
I'm not familiar enough with workqueues. How well do they deal with
frequent calls (under heavy network load I guess this can be called
very often)? What happens when the thread blocks? Does this defer other,
unrelated workqueues?
Also, there is this (from workqueue_enqueue(9)):
"The enqueued work will be processed in a thread context. A work must not
be enqueued again until the callback is called by the workqueue(9) frame-
work."
so it's not as easy as a persistent thread and cv_broadcast() ...
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Masao Uebayashi
2011-06-23 14:57:36 UTC
I think the sodopendfree() call in sodisconnect() can be either
a) removed, or
b) moved to the very end of soclose() (after sounlock())

uipc_socket.c Rev. 1.63 introduced the loaning code & thorpej@
sprinkled sodopendfree(). I see no strong reason why sodopendfree()
has to be called from sodisconnect().

Anyway, thanks for this, I really enjoyed reading your analysis. :)
Post by Manuel Bouyer
[...]
Wouldn't a workqueue be a better solution? No need to set up anything,
just create a wq and use it where it is needed.
I'm not familiar enough with workqueues. How well do they deal with
frequent calls (under heavy network load I guess this can be called
very often)? What happens when the thread blocks? Does this defer other,
unrelated workqueues?
"The enqueued work will be processed in a thread context.  A work must not
 be enqueued again until the callback is called by the workqueue(9) frame-
 work."
so it's not as easy as a persistent thread and cv_broadcast() ...
--
    NetBSD: 26 ans d'experience feront toujours la difference
Manuel Bouyer
2011-06-28 19:39:33 UTC
Post by Masao Uebayashi
I think the sodopendfree() call in sodisconnect() can be either
a) removed, or
b) moved to the very end of soclose() (after sounlock())
uipc_socket.c Rev. 1.63 introduced the loaning code & thorpej@
sprinkled sodopendfree(). I see no strong reason why sodopendfree()
has to be called from sodisconnect().
Yes, this would probably work. But I think it's better in the long term
to have a separate thread for sodopendfree() and wake it up from the
place where we enqueue mbufs on the pendfree queue (rather than
hijacking existing threads, which requires some code analysis to see
whether it's OK to take locks and sleep there ...)
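
Roughly the shape I'm thinking of (hypothetical names, not a patch
against uipc_socket.c, and ignoring which context the enqueue side runs
in; if it can be called from interrupt context the lock would need a
suitable IPL):

#include <sys/param.h>
#include <sys/condvar.h>
#include <sys/kthread.h>
#include <sys/mutex.h>
#include <sys/socketvar.h>	/* sodopendfree() */

static kmutex_t pendfree_thread_lock;
static kcondvar_t pendfree_thread_cv;
static bool pendfree_pending;

/* the only thread that ever sleeps in the unloan path */
static void
pendfree_thread(void *arg)
{

	for (;;) {
		mutex_enter(&pendfree_thread_lock);
		while (!pendfree_pending)
			cv_wait(&pendfree_thread_cv, &pendfree_thread_lock);
		pendfree_pending = false;
		mutex_exit(&pendfree_thread_lock);

		/* no socket lock held here, sleeping in uvm_unloanpage() is harmless */
		sodopendfree();
	}
}

/* to be called right after an mbuf has been put on the pendfree queue */
static void
pendfree_kick(void)
{

	mutex_enter(&pendfree_thread_lock);
	pendfree_pending = true;
	cv_signal(&pendfree_thread_cv);
	mutex_exit(&pendfree_thread_lock);
}

static int
pendfree_thread_start(void)
{

	mutex_init(&pendfree_thread_lock, MUTEX_DEFAULT, IPL_NONE);
	cv_init(&pendfree_thread_cv, "pendfree");
	return kthread_create(PRI_NONE, KTHREAD_MPSAFE, NULL,
	    pendfree_thread, NULL, NULL, "pendfree");
}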
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Thor Lancelot Simon
2011-06-23 14:59:17 UTC
Post by Manuel Bouyer
"The enqueued work will be processed in a thread context. A work must not
be enqueued again until the callback is called by the workqueue(9) frame-
work."
This is bizarre. Reading it, I note that I've written some buggy code
assuming that workqueues were, you know, *queues* -- not
queues-of-one-element.

I think it means that you cannot enqueue the _same_ struct work
repeatedly, though. Not a problem, in practice, if you embed one
struct work in each instance of your other datastructure. No?
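
Something along these lines, I mean (hypothetical names, untested; the
point is just that each object carries its own struct work):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kmem.h>
#include <sys/workqueue.h>

/* one struct work embedded in each request */
struct frob {
	struct work	f_work;		/* queued at most once per frob at a time */
	int		f_payload;
};

static struct workqueue *frob_wq;

static void
frob_worker(struct work *wk, void *arg)
{
	/* f_work is the first member, so the cast back is safe */
	struct frob *f = (struct frob *)wk;

	/* thread context: taking locks and sleeping is fine here */
	printf("frob %d processed\n", f->f_payload);
	kmem_free(f, sizeof(*f));
}

static int
frob_init(void)
{

	return workqueue_create(&frob_wq, "frobwq", frob_worker, NULL,
	    PRI_NONE, IPL_NONE, WQ_MPSAFE);
}

static void
frob_submit(int payload)
{
	struct frob *f = kmem_alloc(sizeof(*f), KM_SLEEP);

	f->f_payload = payload;
	/* each frob has its own struct work, so many can be in flight at once */
	workqueue_enqueue(frob_wq, &f->f_work, NULL);
}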
--
Thor Lancelot Simon ***@panix.com
"All of my opinions are consistent, but I cannot present them all
at once." -Jean-Jacques Rousseau, On The Social Contract

David Holland
2011-06-23 18:06:14 UTC
Post by Manuel Bouyer
Any idea on how to properly fix this?
A workaround could be to use yield() in uvm_loan.c, because it would not
require a clock tick to wake up.
I'm not sure whether it's possible to drop the socket lock before
unloaning the pages.
But I wonder if this could be a more general issue with callouts.
Maybe we should have one thread per callout, a la fast_softint,
to be used when a callout needs to sleep?
I opened kern/45093 for this.
I've been thinking that the nfs client problems I've been seeing,
which seem to be callout-related, are maybe connected, as could be the
odd problem mrg has been having with raidframe.

Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
--
David A. Holland
***@netbsd.org

Manuel Bouyer
2011-06-24 16:30:14 UTC
Post by David Holland
I've been thinking that the nfs client problems I've been seeing,
which seem to be callout-related, are maybe connected, as could be the
odd problem mrg has been having with raidframe.
Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
unfortunately, no (AFAIK).
My system has been hanging on heavy disk usage (my endless compiles) and
I just noticed that one of the CPUs is pinned in rpc.something according
to the top of "top" though it doesn't show up in the list below.
So I'm not sure what's happening (I haven't been following) but I'm
seeing issues on an amd64-current system.
To see if it's the problem you'll need to enter ddb and see what the
per-cpu soft interrupt threads are doing.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
David Holland
2011-06-24 16:41:44 UTC
Post by David Holland
I've been thinking that the nfs client problems I've been seeing,
which seem to be callout-related, are maybe connected, as could be the
odd problem mrg has been having with raidframe.
Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
unfortunately, no (AFAIK).
without looking at the code (today) I'm wondering if it would be
feasible to have callout_schedule panic if the head of the callout
queue is more than, say, one second in the past. Maybe only for
DIAGNOSTIC kernels?
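
Something like this helper, called from the appropriate places in
kern_timeout.c under DIAGNOSTIC (names hypothetical; I haven't checked
what per-CPU state the callout wheel actually keeps):

#include <sys/param.h>
#include <sys/kernel.h>		/* hz */
#include <sys/systm.h>		/* panic() */

/*
 * "now" and "earliest" are in hardclock ticks, "earliest" being when
 * the oldest pending callout should have fired.  Panic if processing
 * is more than a second behind.
 */
static inline void
callout_check_stalled(int now, int earliest, bool have_pending)
{

#ifdef DIAGNOSTIC
	if (have_pending && now - earliest > hz)
		panic("callouts stalled: head expired %d ticks ago",
		    now - earliest);
#endif
}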
--
David A. Holland
***@netbsd.org

Hisashi T Fujinaka
2011-06-24 16:45:19 UTC
Post by Manuel Bouyer
Post by David Holland
I've been thinking that the nfs client problems I've been seeing,
which seem to be callout-related, are maybe connected, as could be the
odd problem mrg has been having with raidframe.
Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
unfortunately, no (AFAIK).
My system has been hanging on heavy disk usage (my endless compiles) and
I just noticed that one of the CPUs is pinned in rpc.something according
to the top of "top" though it doesn't show up in the list below.
So I'm not sure what's happening (I haven't been following) but I'm
seeing issues on an amd64-current system.
To see if it's the problem you'll need to enter ddb and see what the
per-cpu soft interrupt threads are doing.
OK, do I need to add kernel debugging, etc? I've done this in the past
but it's been a while.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte

Hisashi T Fujinaka
2011-06-28 15:48:17 UTC
Post by Hisashi T Fujinaka
Post by Manuel Bouyer
To see if it's the problem you'll need to enter ddb and see what the
per-cpu soft interrupt threads are doing.
OK, do I need to add kernel debugging, etc? I've done this in the past
but it's been a while.
ddb should be enough. After entering ddb, use ps to see what threads
are waiting on what; you can get a stack trace for each blocked thread
with tr/a or tr/t (warning: ps shows PID and LID in decimal but tr/t
takes its input in hex by default; tr/t 0t1234 will take the input as
decimal).
Arrgh. My stupid motherboard seems to have intermittent issues with the
USB keyboard and I can't get it into DDB. And I can't complete a build
without nfs locking up.

Is there a workaround? I'll try disabling nfsd...
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte

David Laight
2011-06-24 17:37:09 UTC
Post by David Holland
Post by David Holland
I've been thinking that the nfs client problems I've been seeing,
which seem to be callout-related, are maybe connected, as could be the
odd problem mrg has been having with raidframe.
Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
unfortunately, no (AFAIK).
without looking at the code (today) I'm wondering if it would be
feasible to have callout_schedule panic if the head of the callout
queue is more than, say, one second in the past. Maybe only for
DIAGNOSTIC kernels?
Wouldn't you want to do that from the timer interrupt ??

David
--
David Laight: ***@l8s.co.uk

David Holland
2011-06-24 17:58:10 UTC
Post by David Laight
Post by David Holland
Post by David Holland
Do we have a debug hook or other method that will raise a panic if
callouts stop getting processed?
unfortunately, no (AFAIK).
without looking at the code (today) I'm wondering if it would be
feasible to have callout_schedule panic if the head of the callout
queue is more than, say, one second in the past. Maybe only for
DIAGNOSTIC kernels?
Wouldn't you want to do that from the timer interrupt ??
Not if the problem has to do with the timer interrupt not being
delivered, you wouldn't.

although the case Manuel was describing (and that I think I may also
be seeing) is not that.
--
David A. Holland
***@netbsd.org
