Strange NFS lockup through IPsec

Discussion:

(too old to reply)

Edgar Fuß

2008-02-11 22:27:15 UTC

We're experiencing strange NFS lockups. These only happen from time to time,
and they only happen to a few clients conneted to the NFS server in question
via an IPsec tunnel.

Clients are Linux (2.6) boxes, server is NetBSD 4.0/amd64. NFS runs through TCP.

Sometimes NFS on one of the clients gets stuck. There seems to be no way out
of this other than rebooting the client. All other clients continue to
function normally.

In the stuck state, the server receives an NFS request, acknowledges it
(TCP-wise, I mean), but fails to answer. There's a TCP window, the Send-Q
fills up, but nothing gets sent.

Eventually, the NFS RPC gets retransmitted, gets ACK'ed again, but remains
unanswered (on the wire) while the Send-Q fills.

Why on earth would the TCP stack refuse to send if there's a route (otherwise
the TCP ACK's wouldn't be sent), there's a window and there's data in the queue?

There can't be anything grossly wrong with the server because there're quite
a number of other clients working normally. Even in the stuck state, I see
UDP NFS Null Calls/Replies to the stuck client. Just TCP hangs.

Strange, isn't it?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-15 13:46:05 UTC

Permalink

I've investigated this further.

Invoking netstat -P on the PCB of the connection hanging yields there is data
to send, there is a window, but the REXMT timer is zero.

Following a hint of ws@, I investigated the PCB in question with gdb, and I'm
confused about the timers. The entry for the retransmit timer shows it has long
expired (i.e. c_time is much larger than kern.hardclock_ticks), and it has
fired, but is no longer pending (c_flags = 4).

Now I looked at the timer handling in tcp_output.c and tcp_timers.c and am
unsure how this can happen, as well as confused about the usage of the
TCP_TIMER_ISARMED macro. Browsing the source, I would get the impression that
this means "the timer is already running" while it's actually equivalent to
callout_active, i. e. callout_pending || callout_fired.

What's the reason for testing whether a callout caused by the retransmit
timer having expired has happened in the past before setting it again?

This looks completely broken to me, but that's most probably because of me,
not because of the code.

Can someone please enlighten me?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-15 23:08:59 UTC

Permalink

Post by Edgar FuÃ
This looks completely broken to me, but that's most probably because of me,
not because of the code.

After ws@ was so kind as to spend more than an hour with me investigating
the code this evening, we both think there's something grossly wrong here.

Before the advent of callouts, the code was using the macros from sys/protosw.h
and tcp_slowtimeo() would explicitly disarm a timer after it fired.
This means that before the rewrite to use callouts, on an expired timer,
TCP_TIMER_ISARMED would usually be false. But now, after the rewrite,
TCP_TIMER_ISARMED will be true, so the code will not arm the timer again.

However, this still doesn't explain why the stack refuses to send even if
more data is arriving in the send queue.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

YAMAMOTO Takashi

2008-02-17 16:12:29 UTC

Permalink

Post by Edgar FuÃ

Post by Edgar FuÃ
This looks completely broken to me, but that's most probably because of me,
not because of the code.

the code this evening, we both think there's something grossly wrong here.
Before the advent of callouts, the code was using the macros from sys/protosw.h
and tcp_slowtimeo() would explicitly disarm a timer after it fired.
This means that before the rewrite to use callouts, on an expired timer,
TCP_TIMER_ISARMED would usually be false. But now, after the rewrite,
TCP_TIMER_ISARMED will be true, so the code will not arm the timer again.

it was changed more recently.
see the "Recent nbftp panic analysis" thread:
http://mail-index.NetBSD.org/tech-net/2005/02/oindex.html

YAMAMOTO Takashi

Post by Edgar FuÃ
However, this still doesn't explain why the stack refuses to send even if
more data is arriving in the send queue.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-17 14:45:00 UTC

Permalink

I think I'm getting a clue what might be happening here.
Could someone in the know please save my weekend and comment on this?

If tcp_timer_rexmt() gets called during PMTU discovery, TCPT_REXMT will have
fired, but will not be disarmed. This means that tcp_output() will never ever
arm the retransmit timer again beacause it thinks its armed already.

Now, if we have a SACK hole, retransmit the hole, loose the segment again
(or the ACK), tcp_output will exit at just_return without sending anything.
There's even a comment above that this is possible in SACK, so there's code
to re-arm the retransmit timer without having sent a segment, only this will
not happen because the timer is already considered armed (i.e.
TCP_TIMER_ISARMED(tp, TCP_REXMIT) wil be true), but it will not fire.
So we hang.

The problem is I can't get the NFS connection to hang right now and although
I did print out the entire PCB with gdb, I didn't notice there was a SACK
hole so I don't have that hole's retransmission state. So I can't be entirely
sure this analysis is correct until I manage to make the connection hang again
and print out the missing information.

Second problem is I can't just go ahead and install a patched kernel on that
file server unless I'm really confident (i. e. someone in the know concurs)
I found the problem, because, well, that's the file server.

I must admit I don't fully understand how the code is supposed to behave if
it's still in SACK recovery (but hasn't timed out on the lost segment) and
new data comes in to be sent.

So, please, could someone comment whether this sounds plausible or whether
I'm missing the point?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

YAMAMOTO Takashi

2008-02-17 16:13:02 UTC

Permalink

Post by Edgar FuÃ
I think I'm getting a clue what might be happening here.
Could someone in the know please save my weekend and comment on this?
If tcp_timer_rexmt() gets called during PMTU discovery, TCPT_REXMT will have
fired, but will not be disarmed. This means that tcp_output() will never ever
arm the retransmit timer again beacause it thinks its armed already.

it seems like a bug to me.

YAMAMOTO Takashi

Post by Edgar FuÃ
Now, if we have a SACK hole, retransmit the hole, loose the segment again
(or the ACK), tcp_output will exit at just_return without sending anything.
There's even a comment above that this is possible in SACK, so there's code
to re-arm the retransmit timer without having sent a segment, only this will
not happen because the timer is already considered armed (i.e.
TCP_TIMER_ISARMED(tp, TCP_REXMIT) wil be true), but it will not fire.
So we hang.
The problem is I can't get the NFS connection to hang right now and although
I did print out the entire PCB with gdb, I didn't notice there was a SACK
hole so I don't have that hole's retransmission state. So I can't be entirely
sure this analysis is correct until I manage to make the connection hang again
and print out the missing information.
Second problem is I can't just go ahead and install a patched kernel on that
file server unless I'm really confident (i. e. someone in the know concurs)
I found the problem, because, well, that's the file server.
I must admit I don't fully understand how the code is supposed to behave if
it's still in SACK recovery (but hasn't timed out on the lost segment) and
new data comes in to be sent.
So, please, could someone comment whether this sounds plausible or whether
I'm missing the point?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-17 17:34:02 UTC

Permalink

Post by YAMAMOTO Takashi
it seems like a bug to me.

Thanks.

I'm only wondering why tcp_setpersist() doesn't panic all the time.

What about this patch?

--- sys/netinet/tcp_timer.c.orig 2006-10-09 18:27:07.000000000 +0200
+++ sys/netinet/tcp_timer.c 2008-02-17 18:21:52.000000000 +0100
@@ -317,6 +317,7 @@
*/
in_pcbnotifyall(&tcbtable, icmpsrc.sin_addr, EMSGSIZE,
tcp_mtudisc);
+ TCP_TIMER_DISARM(tp, TCPT_REXMT);
splx(s);
return;
}

It looks to me like no equivalent modification needs to be applied to
tcp_timer_persist() (it always calls tcp_setpersist() which re-arms the timer)
or tcp_timer_keep() (it always re-arms or tcp_drop()s) or tcp_timer_2msl().

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

YAMAMOTO Takashi

2008-02-18 00:03:33 UTC

Permalink

[ cc: christos because he committed this code ]

Post by Edgar FuÃ

Post by YAMAMOTO Takashi
it seems like a bug to me.

i guess it should be before the in_pcbnotifyall call.

YAMAMOTO Takashi

Post by Edgar FuÃ
It looks to me like no equivalent modification needs to be applied to
tcp_timer_persist() (it always calls tcp_setpersist() which re-arms the timer)
or tcp_timer_keep() (it always re-arms or tcp_drop()s) or tcp_timer_2msl().

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Christos Zoulas

2008-02-18 00:43:47 UTC

Permalink

On Feb 18, 9:03am, ***@mwd.biglobe.ne.jp (YAMAMOTO Takashi) wrote:
-- Subject: Re: TCP PMTU/SACK/timer problem

| [ cc: christos because he committed this code ]
|
| > > it seems like a bug to me.
| > Thanks.
| >
| > I'm only wondering why tcp_setpersist() doesn't panic all the time.
| >
| > What about this patch?
| >
| > --- sys/netinet/tcp_timer.c.orig 2006-10-09 18:27:07.000000000 +0200
| > +++ sys/netinet/tcp_timer.c 2008-02-17 18:21:52.000000000 +0100
| > @@ -317,6 +317,7 @@
| > */
| > in_pcbnotifyall(&tcbtable, icmpsrc.sin_addr, EMSGSIZE,
| > tcp_mtudisc);
| > + TCP_TIMER_DISARM(tp, TCPT_REXMT);
| > splx(s);
| > return;
| > }
|
| i guess it should be before the in_pcbnotifyall call.

Comparing the code with OpenBSD, I think thst they are clearing the
timer in tcp_input.c.

christos

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-18 10:07:40 UTC

Permalink

Post by Christos Zoulas
Comparing the code with OpenBSD, I think thst they are clearing the
timer in tcp_input.c.

Sorry, I don't undestand what you are trying to tell me.
I may be overlooking something, but I think it's well possible for
tcp_input not being called at all in the course of events I described.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-07-25 13:09:00 UTC

Permalink

Post by YAMAMOTO Takashi
it seems like a bug to me.

[...]

Post by YAMAMOTO Takashi
i guess it should be before the in_pcbnotifyall call.

May I humbly ask what happend to this? Shall I file a PR about it?
It looks to me like yamt concurs that my analysis was correct.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Edgar Fuß

2008-02-18 13:44:10 UTC

Permalink

Post by YAMAMOTO Takashi
it seems like a bug to me.

Although I'm happy my analysis seems to be correct, I'm still not sure I
fully undestand what's happening. The retransmit timer is in this messed-up
state where it's considered armed but will not go off. We have a SACK hole
that apparently has been retransmitted. We have new data in the socket buffer,
but why will tcp_output() refuse to send (I checked with tcpdump that nothing
is sent; I can't drop into DDB)?

AFAICT, we have sack_rxmit == 0 and sack_bytes_rxmt == 0. We also have
tp->t_partialacks == 0 so the code around netinet/tcp_input.c:833 should
find there's data in the socket buffer it can send, shouldn't it?

Could someone please prove me wrong on the following data or tell me
there's another bug?

I printed out the pcb twice since some fields changed in between.

Thanks in advance for any hints.

I can probably temporarily work around the problem by setting an MTU
on the route from the file server trough the tunnel. So the retransmit
timer shouldn't be messed up any more by PMTU discovery and I can go
ahead enjoing my vacation.

(gdb) print *(struct tcpcb *)0xffff80000914e238
$1 = {t_family = 2, segq = {tqh_first = 0x0, tqh_last = 0xffff80000914e240},
t_segqlen = 0, t_timer = {{c_list = {cq_next = {elem = 0xffffffff80606020,
list = 0xffffffff80606020}, cq_prev = {elem = 0xffffffff806a8800,
list = 0xffffffff806a8800}},
c_func = 0xffffffff801393e0 <tcp_timer_rexmt>,
c_arg = 0xffff80000914e238, c_time = 828374206, c_flags = 4}, {c_list = {
cq_next = {elem = 0x0, list = 0x0}, cq_prev = {elem = 0x0,
list = 0x0}}, c_func = 0xffffffff801392f0 <tcp_timer_persist>,
c_arg = 0xffff80000914e238, c_time = 0, c_flags = 0}, {c_list = {
cq_next = {elem = 0xffff800009885708, list = 0xffff800009885708},
cq_prev = {elem = 0xffff8000091662d8, list = 0xffff8000091662d8}},
c_func = 0xffffffff80139140 <tcp_timer_keep>,
c_arg = 0xffff80000914e238, c_time = 829784260, c_flags = 2}, {c_list = {
cq_next = {elem = 0x0, list = 0x0}, cq_prev = {elem = 0x0,
list = 0x0}}, c_func = 0xffffffff80138f60 <tcp_timer_2msl>,
c_arg = 0xffff80000914e238, c_time = 0, c_flags = 0}}, t_state = 4,
t_rxtshift = 0, t_rxtcur = 3, t_dupacks = 3, t_partialacks = 0,
t_peermss = 1460, t_ourmss = 1460, t_segsz = 1448, t_force = 0 '\0',
t_flags = 2532, t_template = 0xffff800009ac9000,
t_inpcb = 0xffff80000cd352d0, t_in6pcb = 0x0, t_delack_ch = {c_list = {
cq_next = {elem = 0xffff80004e3458d8, list = 0xffff80004e3458d8},
cq_prev = {elem = 0xffff800011d426e8, list = 0xffff800011d426e8}},
c_func = 0xffffffff801390c0 <tcp_delack>, c_arg = 0xffff80000914e238,
c_time = 829064280, c_flags = 0}, snd_una = 1869460940,
snd_nxt = 1869460940, snd_up = 1869460940, snd_wl1 = 2357157762,
snd_wl2 = 1869460940, iss = 1857980151, snd_wnd = 138112,
snd_recover = 1869494764, snd_high = 1869460940, rcv_wnd = 65948,
rcv_nxt = 2357158522, rcv_up = 2357157762, irs = 2336413453,
rcv_adv = 2357223710, snd_max = 1869494764, snd_cwnd = 4380,
snd_ssthresh = 65001, t_rcvtime = 16581286, t_rtttime = 0,
t_rtseq = 1869460940, t_srtt = 39, t_rttvar = 3, t_rttmin = 2,
max_sndwnd = 138112, t_oobflags = 0 '\0', t_iobc = 0 '\0', t_softerror = 0,
snd_scale = 7 '\a', rcv_scale = 0 '\0', request_r_scale = 0 '\0',
requested_s_scale = 7 '\a', ts_recent = 17944682, ts_recent_age = 16581286,
ts_timebase = 16566161, last_ack_sent = 2357158522, t_bytes_acked = 753,
rcv_sack_flags = 0 '\0', rcv_dsack_block = {left = 0, right = 0}, timeq = {
tqh_first = 0x0, tqh_last = 0xffff80000914e3f8}, snd_holes = {
tqh_first = 0xffff80000cd22f98, tqh_last = 0xffff80000cd22fa8},
snd_numholes = 1, rcv_lastsack = 1869494764, sack_newdata = 1869494764,
snd_fack = 1869494764, t_sc = {lh_first = 0x0},
t_lastm = 0xffff800009952a00, t_inoff = 2670, t_lastoff = 32398,
t_lastlen = 94, t_mtudisc = 1, t_pmtud_mss_acked = 2766,
t_pmtud_mtu_sent = 1500, t_pmtud_th_seq = 1869460940,
t_pmtud_nextmtu = 39685, t_pmtud_ip_len = 56325, t_pmtud_ip_hl = 5,
t_ecn_retries = 0 '\0', t_congctl = 0xffffffff805ff6c0}
(gdb) print *((struct tcpcb *)0xffff80000914e238)->snd_holes->tqh_first
$2 = {start = 1869460940, end = 1869462388, rxmit = 1869462388, sackhole_q = {
tqe_next = 0x0, tqe_prev = 0xffff80000914e408}}
(gdb) print *(*(struct tcpcb *)0xffff80000914e238)->t_inpcb->inp_head.inph_socket
$3 = {so_type = 1, so_options = 12, so_linger = 0, so_state = 2,
so_pcb = 0xffff80000cd352d0, so_proto = 0xffffffff80459660, so_head = 0x0,
so_onq = 0x0, so_q0 = {tqh_first = 0x0, tqh_last = 0x0}, so_q = {
tqh_first = 0x0, tqh_last = 0x0}, so_qe = {tqe_next = 0x0,
tqe_prev = 0xffff8000098801b8}, so_q0len = 0, so_qlen = 0, so_qlimit = 0,
so_timeo = 0, so_error = 0, so_pgid = 0, so_oobmark = 0, so_snd = {sb_sel = {
sel_klist = {slh_first = 0x0}, sel_pid = 0, sel_collision = 0 '\0'},
sb_mowner = 0x0, sb_cc = 63420, sb_hiwat = 65948, sb_mbcnt = 115200,
sb_mbmax = 131896, sb_lowat = 4096, sb_mb = 0xffff800011c68200,
sb_mbtail = 0xffff800011c7ee00, sb_lastrecord = 0xffff800011c68200,
sb_flags = 4, sb_timeo = 0, sb_overflowed = 0}, so_rcv = {sb_sel = {
sel_klist = {slh_first = 0x0}, sel_pid = 0, sel_collision = 0 '\0'},
sb_mowner = 0x0, sb_cc = 0, sb_hiwat = 65948, sb_mbcnt = 0,
sb_mbmax = 131896, sb_lowat = 1, sb_mb = 0x0, sb_mbtail = 0x0,
sb_lastrecord = 0x0, sb_flags = 32, sb_timeo = 0, sb_overflowed = 0},
so_internal = 0x0, so_upcall = 0xffffffff8023e510 <nfsrv_rcv>,
so_upcallarg = 0xffff800009cf5400 "", so_send = 0xffffffff8031a490 <sosend>,
so_receive = 0xffffffff80319390 <soreceive>, so_mowner = 0x0,
so_uidinfo = 0xffff800007ecdfc0}
(gdb) print *(struct tcpcb *)0xffff80000914e238
$4 = {t_family = 2, segq = {tqh_first = 0x0, tqh_last = 0xffff80000914e240},
t_segqlen = 0, t_timer = {{c_list = {cq_next = {elem = 0xffffffff80606020,
list = 0xffffffff80606020}, cq_prev = {elem = 0xffffffff806a8800,
list = 0xffffffff806a8800}},
c_func = 0xffffffff801393e0 <tcp_timer_rexmt>,
c_arg = 0xffff80000914e238, c_time = 828374206, c_flags = 4}, {c_list = {
cq_next = {elem = 0x0, list = 0x0}, cq_prev = {elem = 0x0,
list = 0x0}}, c_func = 0xffffffff801392f0 <tcp_timer_persist>,
c_arg = 0xffff80000914e238, c_time = 0, c_flags = 0}, {c_list = {
cq_next = {elem = 0xffff800009885708, list = 0xffff800009885708},
cq_prev = {elem = 0xffff8000091662d8, list = 0xffff8000091662d8}},
c_func = 0xffffffff80139140 <tcp_timer_keep>,
c_arg = 0xffff80000914e238, c_time = 829789467, c_flags = 2}, {c_list = {
cq_next = {elem = 0x0, list = 0x0}, cq_prev = {elem = 0x0,
list = 0x0}}, c_func = 0xffffffff80138f60 <tcp_timer_2msl>,
c_arg = 0xffff80000914e238, c_time = 0, c_flags = 0}}, t_state = 4,
t_rxtshift = 0, t_rxtcur = 3, t_dupacks = 3, t_partialacks = 0,
t_peermss = 1460, t_ourmss = 1460, t_segsz = 1448, t_force = 0 '\0',
t_flags = 2532, t_template = 0xffff800009ac9000,
t_inpcb = 0xffff80000cd352d0, t_in6pcb = 0x0, t_delack_ch = {c_list = {
cq_next = {elem = 0xffffffff80606020, list = 0xffffffff80606020},
cq_prev = {elem = 0xffffffff806a8800, list = 0xffffffff806a8800}},
c_func = 0xffffffff801390c0 <tcp_delack>, c_arg = 0xffff80000914e238,
c_time = 829069487, c_flags = 0}, snd_una = 1869460940,
snd_nxt = 1869460940, snd_up = 1869460940, snd_wl1 = 2357158522,
snd_wl2 = 1869460940, iss = 1857980151, snd_wnd = 138112,
snd_recover = 1869494764, snd_high = 1869460940, rcv_wnd = 65948,
rcv_nxt = 2357158666, rcv_up = 2357158522, irs = 2336413453,
rcv_adv = 2357224201, snd_max = 1869494764, snd_cwnd = 4380,
snd_ssthresh = 65001, t_rcvtime = 16581390, t_rtttime = 0,
t_rtseq = 1869460940, t_srtt = 39, t_rttvar = 3, t_rttmin = 2,
max_sndwnd = 138112, t_oobflags = 0 '\0', t_iobc = 0 '\0', t_softerror = 0,
snd_scale = 7 '\a', rcv_scale = 0 '\0', request_r_scale = 0 '\0',
requested_s_scale = 7 '\a', ts_recent = 17996741, ts_recent_age = 16581390,
ts_timebase = 16566161, last_ack_sent = 2357158666, t_bytes_acked = 753,
rcv_sack_flags = 0 '\0', rcv_dsack_block = {left = 0, right = 0}, timeq = {
tqh_first = 0x0, tqh_last = 0xffff80000914e3f8}, snd_holes = {
tqh_first = 0xffff80000cd22f98, tqh_last = 0xffff80000cd22fa8},
snd_numholes = 1, rcv_lastsack = 1869494764, sack_newdata = 1869494764,
snd_fack = 1869494764, t_sc = {lh_first = 0x0},
t_lastm = 0xffff800009952a00, t_inoff = 2670, t_lastoff = 32398,
t_lastlen = 94, t_mtudisc = 1, t_pmtud_mss_acked = 2766,
t_pmtud_mtu_sent = 1500, t_pmtud_th_seq = 1869460940,
t_pmtud_nextmtu = 39685, t_pmtud_ip_len = 56325, t_pmtud_ip_hl = 5,
t_ecn_retries = 0 '\0', t_congctl = 0xffffffff805ff6c0}

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de