Discussion:
race between tcp_close and delack ?
(too old to reply)
Manuel Bouyer
2008-10-09 20:56:15 UTC
Permalink
Hi,
context: on a netbsd-3 amd64 server I've got several instances of this panic:
panic: tcp_output REXMT
panic() at netbsd:panic+0x1c8
tcp_segsize() at netbsd:tcp_segsize
tcp_timer_persist() at netbsd:tcp_timer_persist+0x73
softclock() at netbsd:softclock+0x2c9
softintr_dispatch() at netbsd:softintr_dispatch+0x99
DDB lost frame for netbsd:Xsoftclock+0x2d, trying 0xffffffff8069bcd0
Xsoftclock() at netbsd:Xsoftclock+0x2d

I think I've found the cause for this: in tcp_close(), we
check if a timer is being invoked, and set TF_DEAD in this case, otherwise
we just pool_put() the pcb. In the TF_DEAD case, it'll be pool_put() by
tcp_isdead(), called from the timer hanbler.

Problem: we check if a timer is being invoked with tcp_timers_invoking(),
but this forgets the t_delack_ch timer. So we may tcp_close() a pcb
while the tcp_delack() is being called. The pcb will be returned to
the pool, then tcp_delack() continue to run, eventually causing one
of the other timers to be rearmed. In the meantime, this pcb may have
been reused, causing the above panic when the timer fires.
This issue is also present in netbsd-4 but seems to be fixed in -current
(by using proper locking).

Does anyone see anything wrong with the above analysis, or the attached
patch ?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Edgar Fuß
2008-10-10 09:51:25 UTC
Permalink
Post by Manuel Bouyer
Problem: we check if a timer is being invoked with tcp_timers_invoking(),
but this forgets the t_delack_ch timer.
Sorry, I don't understand. I may be stupid, but tcp_timers_invoking() reads

for (i = 0; i < TCPT_NTIMERS; i++)
if (callout_invoking(&tp->t_timer[i]))
count++;
if (callout_invoking(&tp->t_delack_ch))
count++;

so I'm under the impression it does care about t_delack_ch.


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Manuel Bouyer
2008-10-10 11:29:00 UTC
Permalink
Post by Edgar Fuß
Post by Manuel Bouyer
Problem: we check if a timer is being invoked with tcp_timers_invoking(),
but this forgets the t_delack_ch timer.
Sorry, I don't understand. I may be stupid, but tcp_timers_invoking() reads
for (i = 0; i < TCPT_NTIMERS; i++)
if (callout_invoking(&tp->t_timer[i]))
count++;
if (callout_invoking(&tp->t_delack_ch))
count++;
so I'm under the impression it does care about t_delack_ch.
gasp, you're right. So this isn't my bug :( Sigh.
--
Manuel Bouyer, LIP6, Universite Paris VI. ***@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Manuel Bouyer
2008-10-10 16:27:26 UTC
Permalink
Post by Manuel Bouyer
gasp, you're right. So this isn't my bug :( Sigh.
I found another path which could lead to a similar race.
The tcp timers are armed/disarmed, but we don't check if the callback is
being invoked. One of them at last can cause the panic I'm seeing in
tcp_output():
if (win == 0) {
TCP_TIMER_DISARM(tp, TCPT_REXMT);
tp->t_rxtshift = 0;
tp->snd_nxt = tp->snd_una;
if (TCP_TIMER_ISARMED(tp, TCPT_PERSIST) == 0)
tcp_setpersist(tp);
}

If TCPT_REXMT's callback is invoking at this point, it'll be run just
after tcp_output() exists. tcp_timer_rexmt() will restart the TCPT_REXMT
timer, so now we have both TCPT_REXMT and TCPT_PERSIST pending.

There are other places where we stop TCPT_REXMT, and potentially
arm another timer in the same splsoftnet() block (e.g. tcp_input
stops TCPT_REXMT and may call tcp_output, which may arm TCPT_PERSIST,
tcp_sack does it too).

There are also in tcp_output() possible paths where we disarm TCPT_PERSIST,
and later TCPT_REXMT while TCPT_PERSIST may still be invoking. I think
there could also cause the tcp_setpersist panic, as tcp_timer_persist can
call tcp_setpersist().

As TCP_TIMER_DISARM() will clear the pending and expirted states, I think a
possible way to close this race is to check callout_expired()
in the callout handler, and abort the handler if it's not expired.
See attached patch.

It looks like the same issue exists in netbsd-4 and current.
Comments ? Did I miss sometheing ?
--
Manuel Bouyer, LIP6, Universite Paris VI. ***@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--
Manuel Bouyer
2008-10-19 13:36:47 UTC
Permalink
Post by Manuel Bouyer
Post by Manuel Bouyer
gasp, you're right. So this isn't my bug :( Sigh.
I found another path which could lead to a similar race.
[...]
I've recorded this as kern/39769
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...