Manuel Bouyer
2008-10-09 20:56:15 UTC
Hi,
context: on a netbsd-3 amd64 server I've got several instances of this panic:
panic: tcp_output REXMT
panic() at netbsd:panic+0x1c8
tcp_segsize() at netbsd:tcp_segsize
tcp_timer_persist() at netbsd:tcp_timer_persist+0x73
softclock() at netbsd:softclock+0x2c9
softintr_dispatch() at netbsd:softintr_dispatch+0x99
DDB lost frame for netbsd:Xsoftclock+0x2d, trying 0xffffffff8069bcd0
Xsoftclock() at netbsd:Xsoftclock+0x2d
I think I've found the cause for this: in tcp_close(), we
check if a timer is being invoked, and set TF_DEAD in this case, otherwise
we just pool_put() the pcb. In the TF_DEAD case, it'll be pool_put() by
tcp_isdead(), called from the timer handler.
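
To make the deferred-free scheme concrete, here is a minimal user-space sketch of the handler side. It is not the kernel code: `df_pcb`, `DF_DEAD`, `df_isdead()` and `df_timer_handler()` are hypothetical stand-ins for the tcpcb, TF_DEAD, tcp_isdead() and a timer handler, and the `in_pool` field merely models the effect of pool_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DF_DEAD 0x01            /* hypothetical stand-in for TF_DEAD */

struct df_pcb {
	int  flags;
	bool in_pool;           /* models "has been pool_put()" */
};

/*
 * Stand-in for tcp_isdead(): if the close path marked the pcb dead
 * while a handler was running, the handler is the one that frees it.
 */
static bool
df_isdead(struct df_pcb *p)
{
	assert(p != NULL);
	if ((p->flags & DF_DEAD) == 0)
		return false;
	p->in_pool = true;      /* deferred pool_put() happens here */
	return true;
}

/* Skeleton of a timer handler: bail out before touching a dead pcb. */
static bool
df_timer_handler(struct df_pcb *p)
{
	if (df_isdead(p))
		return false;   /* pcb is gone, nothing may be rearmed */
	/* ... normal timer work on a live pcb would go here ... */
	return true;
}
```

The scheme only works if the close path learns about every handler that may be running. A handler it does not know about never gets the chance to do the deferred free.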
Problem: we check if a timer is being invoked with tcp_timers_invoking(),
but this forgets the t_delack_ch timer. So we may tcp_close() a pcb
while tcp_delack() is being called. The pcb will be returned to
the pool, then tcp_delack() continues to run, eventually causing one
of the other timers to be rearmed. In the meantime, this pcb may have
been reused, causing the above panic when the timer fires.
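
The gap described above can be illustrated with a small user-space model. This is a sketch under assumptions, not the attached patch: all `sk_` names are hypothetical, the number of regular timers is illustrative, and the `invoking` and `freed` fields only model the callout "invoking" state and pool_put(). One possible shape of a fix is to include the delayed-ACK callout in the check; the real patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NTIMERS 4            /* illustrative count of regular timers */
#define SK_TF_DEAD 0x01         /* hypothetical stand-in for TF_DEAD */

struct sk_callout {
	bool invoking;          /* true while the handler is running */
};

struct sk_tcpcb {
	struct sk_callout t_timer[SK_NTIMERS];
	struct sk_callout t_delack_ch;  /* delayed-ACK callout */
	int  t_flags;
	bool freed;             /* models "has been pool_put()" */
};

/* The check as described in the post: the delack callout is ignored. */
static bool
sk_timers_invoking_old(const struct sk_tcpcb *tp)
{
	for (size_t i = 0; i < SK_NTIMERS; i++)
		if (tp->t_timer[i].invoking)
			return true;
	return false;
}

/* Assumed fix: also treat a running delack handler as "invoking". */
static bool
sk_timers_invoking_fixed(const struct sk_tcpcb *tp)
{
	if (tp->t_delack_ch.invoking)
		return true;
	return sk_timers_invoking_old(tp);
}

/* Model of the end of the close path: defer the free if a handler runs. */
static void
sk_close(struct sk_tcpcb *tp, bool (*invoking)(const struct sk_tcpcb *))
{
	assert(tp != NULL);
	if (invoking(tp))
		tp->t_flags |= SK_TF_DEAD;  /* handler frees it later */
	else
		tp->freed = true;           /* immediate pool_put() */
}
```

With `t_delack_ch.invoking` set, `sk_close()` using the old check frees the pcb immediately, which is the use-after-free window; with the fixed check it only sets the dead flag and leaves the free to the handler.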
This issue is also present in netbsd-4 but seems to be fixed in -current
(by using proper locking).
Does anyone see anything wrong with the above analysis, or the attached
patch?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference