5.2: something wrong with TCP retransmits?

Discussion:

(too old to reply)

Mouse

2013-02-09 21:13:03 UTC

I've just started trying to use 5.2 (I've dabbled a little in 5.1, but
mostly have avoided 5.x so far). I had a connection hang mysteriously,
so I did it again with tcpdump running on an intermediate host.

It looks to me as though something's wrong with TCP retransmissions,
though that's a fairly superficial diagnosis.

Here's the tail of the tcpdump output.

15:40:28.696730 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34290468 win 32432 <nop,nop,timestamp 2413 2412>
15:40:28.697043 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34315332:34316780(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.697164 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34316780:34318228(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.697287 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34318228:34319676(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.916594 IP 216.46.14.122.22 > 10.0.7.14.65521: P 43441:43485(44) ack 34291616 win 33580 <nop,nop,timestamp 2414 2412>
15:40:28.916641 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34293912 win 32432 <nop,nop,timestamp 2414 2412>
15:40:28.917976 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34319676:34321124(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918104 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34321124:34322572(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918221 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34322572:34324020(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918344 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34324020:34325468(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:29.180893 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2415 2412>
15:40:29.181292 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34325468:34326916(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2415>
15:40:31.190954 IP 216.46.14.122.22 > 10.0.7.14.65521: P 43441:43485(44) ack 34295060 win 33580 <nop,nop,timestamp 2419 2412>
15:40:31.191204 IP 10.0.7.14.65521 > 216.46.14.122.22: P 34326916:34327216(300) ack 43485 win 4197 <nop,nop,timestamp 2418 2419,nop,nop,sack sack 1 {43441:43485} >
15:40:31.406736 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2419 2412,nop,nop,sack sack 1 {34326916:34327216} >
15:40:31.407115 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34295060:34296508(1448) ack 43485 win 4197 <nop,nop,timestamp 2419 2419>
15:40:33.899361 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34296508:34297656(1148) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>
15:40:33.899435 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34297656:34298504(848) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>
15:40:33.899517 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34298504:34299440(936) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>
15:40:34.006752 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34297656}{34326916:34327216} >
15:40:34.226326 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34298504}{34326916:34327216} >
15:40:34.226371 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34299440}{34326916:34327216} >

At this point, it's 15:51:36 and tcpdump hasn't shown anything further.
I checked with netstat on each end, and 10.0.7.14 has data to send

Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 32156 10.0.7.14.65521 216.46.14.122.22 ESTABLISHED

but 216.46.14.122 doesn't

Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 216.46.14.122.22 10.0.7.14.65521 ESTABLISHED

(and, yes, as that implies, there is no NAT on this connection, despite
oen end being RFC1918-private).

I see nothing wrong with the underlying connectivity; ping -n -c 2 -I
10.0.7.14 216.46.14.122 produces a quick response, which shows up as
expected in the tcpdump output:

15:52:41.989631 IP 10.0.7.14 > 216.46.14.122: icmp 64: echo request seq 0
15:52:42.024922 IP 216.46.14.122 > 10.0.7.14: icmp 64: echo reply seq 0

The obvious question is, since 10.0.7.14 has data to send, why isn't it
filling in the SACK-indicated holes? Or at least retransmitting
_something_? As an off-the-cuff guess, my first inclination would be
to suspect the retransmission timers. But that's such a basic thing
that if that were what's wrong, it surely would have been noticed long
since.

It's now 16:12 and the tcpdump indicates no further traffic at all.

The tcpdump is being done very close to the 10.0.7.14 end - it's on
10.0.7.1, which is 10.0.7.14's default route, and on the Ethernet
shared between the two.

Any ideas what could be going wrong here? I'm going to leave the
connection in this wedged state for at least an hour or two in case
anyone happens to see this promptly and has anything to suggest.
(Unfortunately I can't really leave it this way indefinitely, though if
it's as repeatable as this seems to indicate I can probably recreate it
fairly easily.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2013-02-10 14:11:08 UTC

Permalink

Mouse <***@Rodents-Montreal.ORG> writes:

(I sent a longer private note, but for others) Install
pkgsrc/xplot-devel, and grok share/docs/xplot/README*.

Post by Mouse
15:40:28.696730 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34290468 win 32432 <nop,nop,timestamp 2413 2412>
15:40:28.697043 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34315332:34316780(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.697164 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34316780:34318228(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.697287 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34318228:34319676(1448) ack 43441 win 4197 <nop,nop,timestamp 2413 2413>
15:40:28.916594 IP 216.46.14.122.22 > 10.0.7.14.65521: P 43441:43485(44) ack 34291616 win 33580 <nop,nop,timestamp 2414 2412>
15:40:28.916641 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34293912 win 32432 <nop,nop,timestamp 2414 2412>
15:40:28.917976 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34319676:34321124(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918104 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34321124:34322572(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918221 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34322572:34324020(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:28.918344 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34324020:34325468(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2414>
15:40:29.180893 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2415 2412>

This is the last regular ack before a sack, and acked seqno is off the
top of your (posted) capture.

Post by Mouse
15:40:29.181292 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34325468:34326916(1448) ack 43485 win 4197 <nop,nop,timestamp 2414 2415>
15:40:31.190954 IP 216.46.14.122.22 > 10.0.7.14.65521: P 43441:43485(44) ack 34295060 win 33580 <nop,nop,timestamp 2419 2412>
15:40:31.191204 IP 10.0.7.14.65521 > 216.46.14.122.22: P 34326916:34327216(300) ack 43485 win 4197 <nop,nop,timestamp 2418 2419,nop,nop,sack sack 1 {43441:43485} >
15:40:31.406736 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2419 2412,nop,nop,sack sack 1 {34326916:34327216} >

So the previous segment is sacked, but the ack seqno is from long ago.

Post by Mouse
15:40:31.407115 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34295060:34296508(1448) ack 43485 win 4197 <nop,nop,timestamp 2419 2419>
15:40:33.899361 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34296508:34297656(1148) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>
15:40:33.899435 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34297656:34298504(848) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>
15:40:33.899517 IP 10.0.7.14.65521 > 216.46.14.122.22: . 34298504:34299440(936) ack 43485 win 4197 <nop,nop,timestamp 2424 2419>

These are retransmits (this is the sort of things that's far easier to
see in xplot).

Post by Mouse
15:40:34.006752 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34297656}{34326916:34327216} >
15:40:34.226326 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34298504}{34326916:34327216} >
15:40:34.226371 IP 216.46.14.122.22 > 10.0.7.14.65521: . ack 34295060 win 33580 <nop,nop,timestamp 2424 2412,nop,nop,sack sack 2 {34296508:34299440}{34326916:34327216} >

the main ack seqno is not moving. So of the 4 above, the first was
lost, and next 3 made it. The packed sacked at 15:40:31.406 is still
being sacked.

So overall, you are having serious loss problems. TCP should cope, of
course.

Are you sure all the loss is between the tcpdump host and 216.46.14.122?

I would take a tcpdump on the sending machine. If you can post 'tcpdump
-S -tt <filter to the connection of interest', I can look at it in xplot.

Post by Mouse
I see nothing wrong with the underlying connectivity; ping -n -c 2 -I
10.0.7.14 216.46.14.122 produces a quick response, which shows up as
15:52:41.989631 IP 10.0.7.14 > 216.46.14.122: icmp 64: echo request seq 0
15:52:42.024922 IP 216.46.14.122 > 10.0.7.14: icmp 64: echo reply seq 0

Those are short packets. I've seen misterminated ethernets that work
for short but not long.

Post by Mouse
The obvious question is, since 10.0.7.14 has data to send, why isn't it
filling in the SACK-indicated holes? Or at least retransmitting
_something_? As an off-the-cuff guess, my first inclination would be
to suspect the retransmission timers. But that's such a basic thing
that if that were what's wrong, it surely would have been noticed long
since.

Also, do 'netstat -s' on the sender. Perhaps obvious, but netstat -s,
pause, netstat -s, diff, is highly useful.

I have been looking at TCP xplots from netbsd-5 for a while. There are
issues, but they are minor failures to be as aggressive as the spec
permits; I've never seen something like this.

Mouse

2013-02-10 21:46:23 UTC

Permalink

Post by Greg Troxel
So overall, you are having serious loss problems.

Yes, packet loss was obviously involved. It was an inter-city
connection over multiple large NSPs, so occasional packet loss is no
surprise. The surprise was the TCP stack's failure to recover.

Post by Greg Troxel
TCP should cope, of course.

And that's the part that bothered me: it just went dead. Completely
stopped retransmitting, for over ten minutes.

Post by Greg Troxel
Are you sure all the loss is between the tcpdump host and
216.46.14.122?

Not certain, of course, but it's likely. That's 100Mbps, whereas the
next hop towards 216.46.14.122 is a consumer-grade DSL line.

Post by Greg Troxel
I would take a tcpdump on the sending machine.

I'm trying from a different piece of the 10.* network in question now;
if it hangs again I'll have full captures (this time I'm capturing on
both ends of the connection).

Post by Greg Troxel

Post by Mouse
15:52:41.989631 IP 10.0.7.14 > 216.46.14.122: icmp 64: echo request seq 0
15:52:42.024922 IP 216.46.14.122 > 10.0.7.14: icmp 64: echo reply seq 0

Those are short packets. I've seen misterminated ethernets that work
for short but not long.

None of it is real Ethernet, actually; termination is not relevant.
The parts under my control are all the twisted-pair that typically
passes for Ethernet these days; all the other parts are shared
large-carrier infrastructure, even less likely to be true Ethernet.

Post by Greg Troxel
I have been looking at TCP xplots from netbsd-5 for a while. There
are issues, but they are minor failures to be as aggressive as the
spec permits; I've never seen something like this.

Neither have I. That's why I found it so noteworthy.

I've tried artifically introducing packet loss by briefly turning off
net.inet.ip.forwarding on an intermediate host - it recovered fine. I
then flooded the slowest hop on the path with ping -f -s 1000 and it
locked up for the duration of the ping but has now recovered. I'm
suspecting I'll have to reproduce the original setup to make it
misbehave again. I ought to be able to do that, though it'll mean a
few more days of delay.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Mouse

2013-02-11 06:23:02 UTC

Permalink

[...strange TCP lockup...]

I'm suspecting I'll have to reproduce the original setup to make it
misbehave again.

Fortunately (or perhaps not :/), this turned out to be unnecessary. It
locked up on me just recently, and proved to be cooperative enough to
do it again with tcpdumps running everywhere.

The setup is mostly straightforward, though there is one subtlety.

There are three machines involved: the two endpoints and one gateway in
between them. They are 10.0.1.44 (exact) and 216.46.0.66 (chip), which
are the endpoints, and stone, in between them. The subtlety is that
between stone and chip, there are, in IP terms, two links, one of them
being a tunnel layered atop the other; packet routing between stone and
chip is asymmetric, with packets taking the underlying DSL link in one
direction and the tunnel in the other. (This is a historical artifact;
while it may be part of what provokes the problem, I believe the
problem it's provoking is real.)

I snooped on exact (bge0) and chip (fxp1 and tun0). If you can tell
anything from packet payloads, ssh isn't doing its job properly, so I
don't mind putting the raw pcap files up for FTP; alongside them are
text files generated from them with tcpdump -S -tt -n -s 2000.

This is all on ftp.rodents-montreal.org, in /mouse/misc/tcp-lockup/,
as {pcap,txt}.{bgep,fxp1,tun0}. The FTP server should be willing to
bzip2 or gzip files on the fly if you ask for (eg) txt.bge0.bz2.
(There's not much point in trying to compress the pcap files, since the
bulk data in them is encrypted. But the text files compress by better
than ten to one with bzip2.)

The asymmetric routing is a bit odd, but unless it leads to complete
communication failure, TCP should be able to recover, even if not at
full possible speed. But, instead, I'm seeing data trucking along at
tens of KB a second, then suddenly locking up completely, apparently
because exact decides to stop retransmitting. The captures end at
05:05:54 UTC; I waited until at least 05:17 before killing them and
assembling the results. So, regardless of what goes on with stone and
chip, it seems to me exact should be doing _some_ retransmitting, and
as far as I can tell it isn't. At all. (When the failure happens.)

One other thing I just noticed: this case occurred with a 5.1 kernel,
not the 5.2 I initially had trouble with.

I took your suggestion and checked netstat -s. Specifically, I ran

sleep 10; netstat -s > z.1; sleep 60; netstat -s > z.2

on exact and chip, carefully avoiding anything that should provoke
traffic (the initial sleep is to allow me to switch the active window
away from either machine). Chip had a lot of changes, which is not
surprising because it's a relatively busy machine, being
ftp.rodents-montreal.org and also in the path between my house network
(including my NTP pool member) and the world. But exact should have
been doing nothing, and, indeed, none of the stats increased by more
than two, those probably because of broadcast NTP on that subnet. The
stats that changed:

ip
total packets received +2
packets for this host +2
packets sent from this host +1
udp
datagrams received +2
delivered +2
PCB has misses +1
datagrams output +1
arp
packets received +1
valid request packets +1
broadcast/multicast packets +1

In particular, none of the tcp numbers changed at all.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Mouse

2013-02-11 06:28:19 UTC

Permalink

Post by Mouse
This is all on ftp.rodents-montreal.org, in /mouse/misc/tcp-lockup/,
as {pcap,txt}.{bgep,fxp1,tun0}.

s/bgep/bge0/

*blush* And I proofread the whole post at least twice, too. :(

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de