Discussion:
BSD doesn't track window size changes correctly.
Darren Reed
2012-04-01 08:13:36 UTC
I've been having some problems with FreeBSD not noticing when
the remote end updates its window size to 0, and one of the
problematic code paths is shared with NetBSD.

In tcp_input.c (NetBSD), we find this:

	/*
	 * Update window information.
	 * Don't look at window if no ACK: TAC's send garbage on first SYN.
	 */
	if ((tiflags & TH_ACK) &&
	    (SEQ_LT(tp->snd_wl1, th->th_seq) ||
	     (tp->snd_wl1 == th->th_seq &&
	      (SEQ_LT(tp->snd_wl2, th->th_ack) ||
	       (tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd))))) {

... the problem here is that it only recognises window updates
that increase the window size. Thus if the remote end sends back
a packet that indicates the window size is 0, it is ignored.
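For anyone who wants to experiment with the condition quoted above outside the kernel, here is a minimal standalone C sketch of it. The struct `wnd_state` and the function names are invented for illustration (the kernel uses `struct tcpcb` and the `SEQ_LT` macro), and the TH_ACK test is assumed to have passed already. As written, the test has three alternatives: a segment with a newer sequence number, or with the same sequence number and a newer ack, updates the window whatever its size; only a segment whose seq and ack both match the last update is required to carry a larger window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a < b" on 32-bit sequence numbers, in the spirit of SEQ_LT. */
static bool seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Hypothetical stand-in for the few struct tcpcb fields the test reads. */
struct wnd_state {
    uint32_t snd_wl1; /* seq of the segment used for the last window update */
    uint32_t snd_wl2; /* ack of the segment used for the last window update */
    uint32_t snd_wnd; /* send window currently believed to be open */
};

/* Returns true if the quoted condition would take the window from this segment. */
static bool window_update_accepted(const struct wnd_state *tp,
                                   uint32_t seg_seq, uint32_t seg_ack,
                                   uint32_t tiwin)
{
    if (seq_lt(tp->snd_wl1, seg_seq))
        return true;                 /* alternative 1: newer sequence number */
    if (tp->snd_wl1 != seg_seq)
        return false;                /* older segment: ignore its window */
    if (seq_lt(tp->snd_wl2, seg_ack))
        return true;                 /* alternative 2: same seq, newer ack */
    /* alternative 3: same seq and ack, window must have grown */
    return tp->snd_wl2 == seg_ack && tiwin > tp->snd_wnd;
}
```

With this model, a zero window is ignored only in the third case, that is, when neither the sequence number nor the ack has moved since the last accepted update.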

Has anyone else noticed bad behaviour when the TCP window size is 0?

Darren


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Robert Elz
2012-04-01 12:35:28 UTC
Date: Sun, 01 Apr 2012 18:13:36 +1000
From: Darren Reed <***@NetBSD.org>
Message-ID: <***@netbsd.org>

| ... the problem here is that it only recognises window updates
| that increase the window size.

That's not quite what the code says: in only one of the three alternatives
is the window required to grow, but ...

| Thus if the remote end sends back
| a packet that indicates the window size is 0, it is ignored.

In that case, yes, but that's also very poor behaviour from the peer,
and not something it can rely upon working (while I'd agree that
perhaps strictly it should).

That is, if the peer has previously authorised our sending some data,
and later tries to take it back, it cannot assume that we had not already
sent the data up to the advertised window size, nor can it assume (in
the case that matters) that we ever even saw its attempted window update
(if I read that right, that third alternative should normally only occur
for a packet with the same seq & ack as previously received, so there's no
way any reply we send can distinguish receiving that packet from receiving
the previous one - unless perhaps it had more data than the earlier packet,
but that would be a very unusual implementation.)

That is, a peer that acts like this cannot complain if we keep sending data
to the old authorised window, and even if we keep retransmitting it if there's
no ack.

If I were you I'd concentrate on finding out what wacky system is closing
the window after previously opening it, that's anti-social TCP at best.

kre


Darren Reed
2012-04-02 03:27:27 UTC
Post by Robert Elz
| If I were you I'd concentrate on finding out what wacky system is closing
| the window after previously opening it, that's anti-social TCP at best.
The problems I have observed are running a client from a local FreeBSD server
with a NetBSD server at the other end.

I am observing two things:
1) NetBSD advertises a window that shrinks as it receives more data, until
its buffer is full.
2) NetBSD then advertises a window size of 0 that is summarily ignored.

So both ends are BSD TCP...

Here's one such extract:
09:01:33.740701 IP FreeBSD.35421 > NetBSD.ssh: Flags [S], seq 2474756569, win 65535, options [mss 1460,nop,wscale 3,sackOK,TS val 1064054 ecr 0], length 0
09:01:33.952922 IP NetBSD.ssh > FreeBSD.35421: Flags [S.], seq 1491889712, ack 2474756570, win 32768, options [mss 1452,nop,wscale 5,nop,nop,TS val 1 ecr 1064054,sackOK,nop,nop], length 0
...
09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0
09:03:38.437943 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 249 ecr 1076502], length 1440
09:03:38.439168 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239672:241112, ack 4441251, win 7740, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439290 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 241112:242552, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440516 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 255512:256952, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440638 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 256952:258392, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440761 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 258392:259832, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440885 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 259832:261272, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441007 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 261272:262712, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441143 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 262712:264152, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441250 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 264152:265592, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:39.128537 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239672:241112, ack 4441251, win 8280, options [nop,nop,TS val 1076593 ecr 249], length 1440
09:03:39.720321 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 252 ecr 1076502], length 1440
09:03:39.721146 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4441251, win 8280, options [nop,nop,TS val 1076652 ecr 249], length 0
09:03:39.932057 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239672, win 0, options [nop,nop,TS val 252 ecr 1076502], length 0

Darren

Darren Reed
2012-04-02 06:32:50 UTC
Further packets:

09:03:38.020855 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 230249, win 295, options [nop,nop,TS val 249 ecr 1076458], length 0
09:03:38.053904 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 232917, win 211, options [nop,nop,TS val 249 ecr 1076458], length 0
09:03:38.080020 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 235568, win 128, options [nop,nop,TS val 249 ecr 1076463], length 0
09:03:38.112052 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076464], length 0
09:03:38.222896 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4438371:4439811, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440
09:03:38.223740 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239664:239672, ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4439811}], length 8
09:03:38.225524 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4439811:4441251, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440
09:03:38.226304 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4441251}], length 0
09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0
The packets were captured on a firewall that had rules to allow all packets in
and out to the NetBSD box without dropping any or enforcing any stateful filtering.

Darren

Dennis Ferguson
2012-04-02 19:25:52 UTC
I haven't paid much attention to TCP recently so I'm not sure how this
is supposed to work, but it seems like the problem is clear(?). I notice
below that NetBSD is using a window scale of 5, so this packet
Post by Darren Reed
09:03:38.112052 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076464], length 0
acks up to 238424 and advertises a window of 39*32=1248 bytes, so its window
is open out to sequence 238424+1248=239672. FreeBSD seems to believe this
too, since it sends this short packet
Post by Darren Reed
09:03:38.223740 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239664:239672, ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4439811}], length 8
which completely fills the advertised window out to sequence 239672. The
problem seems to be that here
Post by Darren Reed
09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0
NetBSD, having previously advertised that its window was open to 239672, now
claims that its window closed at sequence 239664 while apparently chucking
FreeBSD's last 8 bytes. FreeBSD should ignore this since NetBSD shouldn't
be shrinking its window from a previously advertised size. The problem, then,
isn't FreeBSD ignoring the closed window but rather NetBSD not accepting those
last 8 bytes.
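The arithmetic above can be checked with a small C sketch. The helper name `adv_right_edge` is made up for this example; the sequence numbers are the relative ones tcpdump prints, and the shift of 5 is the wscale NetBSD announced in its SYN/ACK.

```c
#include <stdint.h>

/*
 * Right edge of the window implied by an ACK segment:
 * ack number plus the 16-bit window field scaled by the
 * peer's window scale shift (arithmetic is modulo 2^32).
 */
static uint32_t adv_right_edge(uint32_t ack, uint16_t win, unsigned wscale)
{
    return ack + ((uint32_t)win << wscale);
}
```

For the packets quoted above, `adv_right_edge(238424, 39, 5)` gives 239672 and `adv_right_edge(239664, 0, 5)` gives 239664, so the second advertisement pulls the right edge back by 8 bytes.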

I'm pretty sure the problem has something to do with window scaling, but I've
not paid attention to how this is supposed to work. What is clear is that
with a wscale of 5 NetBSD can't specify the right hand edge of its window
with a precision better than 32 bytes, so the edge it advertises is going to
vary by <32 bytes from acknowledgement to acknowledgement. This being the
case, NetBSD should have accepted data out to the maximum sequence it has
ever advertised as being willing to accept. If NetBSD is trying to keep
within the precise limit of some internal buffer it needs to make very sure
that it always rounds the window size it advertises down from that. NetBSD
should have accepted the last 8 bytes it was sent.

Dennis Ferguson


Dennis Ferguson
2012-04-02 20:13:19 UTC
Post by Dennis Ferguson
I haven't paid much attention to TCP recently so I'm not sure how this
is supposed to work, but it seems like the problem is clear(?). I notice
below that NetBSD is using a window scale of 5, so this packet
...
Post by Dennis Ferguson
Post by Darren Reed
09:03:38.223740 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239664:239672, ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4439811}], length 8
which completely fills the advertised window out to sequence 239672. The
I should say that I'm making the assumption that NetBSD actually
received the packet above. I assumed that might be a safe assumption
since the trace was taken on the NetBSD box but, if not, then never
mind.

The reason I'm backtracking a bit is that there is a netstat -s
TCP statistic concerning data received beyond the advertised window,
i.e.

6 packets (0 bytes) of data after window

which seems like it should be incrementing if the problem I thought
was happening is in fact occurring. The fact that only one of the
NetBSD machines I have has incremented this at all is making me think
that the problem must be a bit more subtle.

Dennis Ferguson


Robert Elz
2012-04-02 23:41:39 UTC
Date: Mon, 2 Apr 2012 12:25:52 -0700
From: Dennis Ferguson <***@gmail.com>
Message-ID: <72B44EA1-F0F4-45A5-9A03-***@gmail.com>


| NetBSD, having previously advertised that its window was open to 239672, now
| claims that its window closed at sequence 239664 while apparently chucking
| FreeBSD's last 8 bytes. FreeBSD should ignore this

No, it shouldn't: shrinking the window is legal, just not recommended.
Senders need to be prepared for it to happen and deal with it (see page 42
of RFC 793).

Window management gets real messy with window scaling - hosts cannot advertise
a window size > their available buffer space (obviously) but need to work
out what to do when receiving data that isn't a multiple of the window
scale factor.

Here NetBSD's window of 39 (39 * 32 bytes, or 1248) was advertised, and
FreeBSD sent 1240 bytes. At that point the buffer has 8 bytes left.
What's NetBSD to do now? If it sends a window size of 1 (32 bytes
scaled) it allows FreeBSD to send 32 more bytes, but has buffer space
for just 8. That's wrong. The next lower window size it can advertise
is 0 - that's really the only rational choice, it is shrinking the window,
but that's legal, and FreeBSD is supposed to be able to handle that.
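The rounding problem described here can be sketched in a few lines of C. The function names are invented for illustration; the only assumption is that the receiver never wants to advertise more than the buffer space it actually has, so it rounds down.

```c
#include <stdint.h>

/*
 * Largest 16-bit window field value that does not promise more
 * than `space` bytes when the peer multiplies it by 2^wscale.
 */
static uint16_t window_field(uint32_t space, unsigned wscale)
{
    uint32_t units = space >> wscale;   /* round down to whole scale units */
    return units > UINT16_MAX ? UINT16_MAX : (uint16_t)units;
}

/* Number of bytes the peer will believe it may send for a given field value. */
static uint32_t window_bytes(uint16_t field, unsigned wscale)
{
    return (uint32_t)field << wscale;
}
```

With a shift of 5, 1248 free bytes map to a field of 39, while 8 free bytes map to 0; a field of 1 would tell the peer it may send 32 bytes, which is more than the 8 available.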

Whatever tricks NetBSD (or any receiving system, whatever the OS) plays in
this case, it really cannot avoid shrinking the window in some cases.
Starting out smaller, and allowing extra bytes from the buffer that hadn't
been advertised to gradually be consumed by these fragments of window scale
units in data packets, will still eventually consume all the available
space, at which point shrinking the window is the only option.

| The problem, then, isn't FreeBSD ignoring the closed window

Well, it is, really, but

| but rather NetBSD not accepting those last 8 bytes.

that might also be a problem - this one isn't clear to me, that is,
whether a system should enforce the window it advertised, or the one
it really has but cannot advertise. Before window scaling this couldn't
happen, now it can.

| I'm pretty sure the problem has something to do with window scaling,

Possibly, but I suspect there's more to it than that, as after the 0 window
is advertised, FreeBSD goes wild and starts sending all kinds of stuff, not
just those missing 8 bytes - almost as if it was treating the 0 as 65536 or
something.

And ...

***@gmail.com said:
| I should say that I'm making the assumption that NetBSD actually received
| the packet above. I assumed that might be a safe assumption since the
| trace was taken on the NetBSD box

Darren told me the trace is from an intermediate firewall, I believe one
much closer to the FreeBSD host than the NetBSD one, so this probably isn't
a safe assumption. The timestamp option values probably allow a better
analysis of what has been received where, and when, but I haven't bothered
to look at them. I'm not sure it matters: whatever it believes, there's
nothing I can see in the packet sequence that authorises the FreeBSD to
start sending from (relative) seq 239672 onwards, and it does, in a packet
burst that goes all the way up to 265592 - that's 25920 bytes beyond the
window (give or take that missing 8).
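Taking the figures in the excerpt at face value, the overshoot can be reproduced with a short C sketch. The function names are hypothetical; `snd_una`, `snd_wnd` and `snd_nxt` follow the usual BSD naming but are plain parameters here, not kernel fields.

```c
#include <stdint.h>

/* Bytes the sender may still transmit: (snd_una + snd_wnd) - snd_nxt, never negative. */
static uint32_t usable_window(uint32_t snd_una, uint32_t snd_wnd, uint32_t snd_nxt)
{
    int32_t left = (int32_t)(snd_una + snd_wnd - snd_nxt);
    return left > 0 ? (uint32_t)left : 0;
}

/* Bytes sent past the advertised right edge, or 0 if the sender stayed inside it. */
static uint32_t bytes_beyond_window(uint32_t right_edge, uint32_t highest_sent)
{
    int32_t over = (int32_t)(highest_sent - right_edge);
    return over > 0 ? (uint32_t)over : 0;
}
```

After the "ack 239672, win 0" segment, `usable_window(239672, 0, 239672)` is 0, yet the burst ends at 265592, so `bytes_beyond_window(239672, 265592)` is 25920, which is the eighteen 1440-byte segments of the full burst.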

That's just broken.

Why it does that I have no idea. Does anyone know how similar FreeBSD's
TCP is to NetBSD's these days?

kre


Dennis Ferguson
2012-04-03 03:28:57 UTC
Post by Robert Elz
Date: Mon, 2 Apr 2012 12:25:52 -0700
| NetBSD, having previously advertised that its window was open to 239672, now
| claims that its window closed at sequence 239664 while apparently chucking
| FreeBSD's last 8 bytes. FreeBSD should ignore this
No, it shouldn't, shrinking the window is legal, just not recommended.
Senders need to be prepared for it to happen and deal with it (see page 42
of rfc793).
It seems to be more than just "not recommended", as in "strongly discouraged"
and the "robustness principle dictates" that you should never do it yourself
even if someone else does.

Never mind the rest, though, most of what I wrote was based on insufficient
reading. I somehow thought the packet trace was obtained from the NetBSD end, but
I see now it was obtained somewhere else and the temporal ordering of the packets
in the trace isn't what NetBSD is seeing. Looking at it this way, the packet
trace is fully consistent. FreeBSD sends the extra 8 bytes, and NetBSD later
acks them.
Post by Robert Elz
Here NetBSD's window of 39 (39 * 32 bytes, or 1248) was advertised, and
FreeBSD sent 1240 bytes. At that point the buffer has 8 bytes left.
What's NetBSD to do now? If it sends a window size of 1 (32 bytes
scaled) it allows FreeBSD to send 32 more bytes, but has buffer space
for just 8. That's wrong. The next lower window size it can advertise
is 0 - that's really the only rational choice, it is shrinking the window,
but that's legal, and FreeBSD is supposed to be able to handle that.
Yes, this is fine. NetBSD should decide how much data it will accept
and then truncate its advertised window down to the next lower 32 byte
boundary (so that the advertised right edge of the window is always
between what it will accept and 31 bytes lower than that). This will
ensure that NetBSD never gets more data than it is willing to accept,
though it might get a bit less than that. Now that I'm reading the
trace correctly, this is how it seems to work.

Since NetBSD will often need to jitter the right hand edge of the
window back and forth like this, the remaining issue is only
whether FreeBSD needs to track this jitter or not, and I believe
an acceptable answer to that is "not". Given that window shrinkage
is unlikely (if not illegal) I think tracking the most advanced window
edge it has seen from NetBSD and "handling" window shrinkage in the
same way that it "handles" a broken neighbour which advertises a non-zero
window but refuses to ack the packets you are sending is not unreasonable.
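A minimal sketch of that "track the most advanced edge" idea, in C. This is an illustration of the policy described above, not FreeBSD's actual code; the struct and function names are invented.

```c
#include <stdint.h>

/* Hypothetical sender-side state: furthest right edge the peer ever advertised. */
struct snd_state {
    uint32_t max_edge;
};

/*
 * Process the window carried by an incoming ACK. The edge only ever
 * moves to the right, so small backward jitter caused by window
 * scale rounding on the receiver is not tracked.
 */
static void on_ack(struct snd_state *ss, uint32_t ack, uint16_t win, unsigned wscale)
{
    uint32_t edge = ack + ((uint32_t)win << wscale);

    if ((int32_t)(edge - ss->max_edge) > 0)  /* wraparound-safe "edge > max_edge" */
        ss->max_edge = edge;
}
```

Fed the two NetBSD ACKs from the trace (ack 238424 win 39, then ack 239664 win 0, shift 5), `max_edge` moves to 239672 and stays there. The cost of this policy is that a genuine shrink is only noticed indirectly, through unacknowledged data and retransmission.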
Post by Robert Elz
Darren told me the trace is from an intermediate firewall, I believe one
much closer to the FreeBSD than the NetBSD, so this probably isn't a safe
assumption. The timestamp option values probably allow a better analysis
of what has been received where, and when, but I haven't bothered to look
at them, I'm not sure it matters, whatever it believes, there's nothing I
can see in the packet sequence that authorises the FreeBSD to start sending
from (relative) seq 239672 onwards, and it does, in a packet burst that
goes all the way up to 265592 - that's 25920 bytes beyond the window
(give or take that missing 8).
That's just broken.
Notice there's another inconsistency, however. While the zero window
packet NetBSD sends is carrying data which advances the (NetBSD) sequence
to 4438371, the FreeBSD packets are carrying an ack for 4441251. FreeBSD
has seen NetBSD packets which are not in that trace, so it is reasonable for
FreeBSD to ignore the zero window in those old packets, and there's no way
to tell whether what it is doing is reasonable without seeing the NetBSD
packet with a sequence of 4441251.

I think the only brokenness there is evidence for in that trace is the trace
itself. There's not enough of the conversation to figure out what is going
on.

Dennis Ferguson
Robert Elz
2012-04-03 06:15:40 UTC
Date: Mon, 2 Apr 2012 20:28:57 -0700
From: Dennis Ferguson <***@gmail.com>
Message-ID: <5E7FDFF5-1E2E-4375-8451-***@gmail.com>

| It seems to be more than just "not recommended", as in "strongly
| discouraged" and the "robustness principle dictates" that you should
| never do it yourself even if someone else does.

Agreed - that was pretty much what I said in my first message(s) on
the topic.

| trace is fully consistent. FreeBSD sends the extra 8 bytes, NetBSD later
| ack's it.

Yes, I eventually found that too.

| Notice there's another inconsistency, however. While the zero window
| packet NetBSD sends is carrying data which advances the (NetBSD) sequence
| to 4438371, the FreeBSD packets are carrying an ack for 4441251.

I hadn't really bothered looking at what was happening in that direction,
but you're right, I should, and will, a little later.

| I think the only brokenness there is evidence for in that trace is the trace
| itself.

There was a message I sent only to Darren (because of his Reply-To header...)
in which I said much the same thing - the data he included was missing the
important packets.

| There's not enough of the conversation to figure out what is going on.

So he gave me access to the raw packet capture of the whole trace ... I'll
look at it closer including the sequence numbers flowing the other way.

What I did notice already is that the reason for the window filling seems to
be packet loss (you already saw that it is a long RTT connection, it is
entirely possible for one lost packet to result in a window full stall before
the retransmit starts it all going again).

It is possible there's actually nothing at all wrong, though the trace does
end with a RST during one of these window full stalls (there are a few through
the full connection). But the FreeBSD node's behaviour still looks a bit
odd (especially if the assumption I believe is correct that the trace was
taken quite near it, meaning what it shows and what the FreeBSD system saw
are quite close, both in packet accuracy & order, and time).

kre


Darren Reed
2012-04-03 07:16:46 UTC
Post by Dennis Ferguson
...
Notice there's another inconsistency, however. While the zero window
packet NetBSD sends is carrying data which advances the (NetBSD) sequence
to 4438371, the FreeBSD packets are carrying an ack for 4441251. FreeBSD
has seen NetBSD packets which are not in that trace, it is reasonable for
FreeBSD to ignore the zero window in those old packets, and there's no way
to tell whether what it is doing is reasonable without seeing the NetBSD
packet with a sequence of 4441251.
I think the only brokenness there is evidence for in that trace is the trace
itself. There's not enough of the conversation to figure out what is going
on.
09:03:38.011416 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4436931, win 7520, options [nop,nop,TS val 1076481 ecr 249], length 0
09:03:38.019028 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4436931, win 8033, options [nop,nop,TS val 1076482 ecr 249], length 0
09:03:38.020855 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 230249, win 295, options [nop,nop,TS val 249 ecr 1076458], length 0
09:03:38.053904 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 232917, win 211, options [nop,nop,TS val 249 ecr 1076458], length 0
09:03:38.080020 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 235568, win 128, options [nop,nop,TS val 249 ecr 1076463], length 0
09:03:38.112052 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076464], length 0
09:03:38.222896 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4438371:4439811, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440
09:03:38.223740 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239664:239672, ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4439811}], length 8
09:03:38.225524 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4439811:4441251, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440
09:03:38.226304 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4441251}], length 0
09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0
09:03:38.437943 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 249 ecr 1076502], length 1440
09:03:38.439168 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239672:241112, ack 4441251, win 7740, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439290 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 241112:242552, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439412 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 242552:243992, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439535 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 243992:245432, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439657 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 245432:246872, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439780 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 246872:248312, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.439902 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 248312:249752, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440024 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 249752:251192, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440147 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 251192:252632, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440269 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 252632:254072, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440393 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 254072:255512, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440516 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 255512:256952, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440638 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 256952:258392, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440761 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 258392:259832, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.440885 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 259832:261272, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441007 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 261272:262712, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441143 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 262712:264152, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:38.441250 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 264152:265592, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
09:03:39.128537 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239672:241112, ack 4441251, win 8280, options [nop,nop,TS val 1076593 ecr 249], length 1440
09:03:39.720321 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 252 ecr 1076502], length 1440
09:03:39.721146 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4441251, win 8280, options [nop,nop,TS val 1076652 ecr 249], length 0
09:03:39.932057 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239672, win 0, options [nop,nop,TS val 252 ecr 1076502], length 0
09:03:42.722687 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 258 ecr 1076502], length 1440
09:03:42.723459 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4441251, win 8280, options [nop,nop,TS val 1076952 ecr 252], length 0
09:03:42.937384 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4441251:4442691, ack 239672, win 0, options [nop,nop,TS val 258 ecr 1076952], length 1440

Now when I modified FreeBSD's TCP to track all window changes (NetBSD
uses the same logic), not just those that increased the window, the
behaviour observed was this:

16:43:12.670784 IP FreeBSD.56171 > NetBSD.ssh: . 216603:217839(1236) ack 4387088 win 8280 <nop,nop,timestamp 259836 269,nop,nop,sack 1 {4388528:4391408}>
16:43:12.672440 IP NetBSD.ssh > FreeBSD.56171: . 4387088:4388528(1440) ack 213747 win 157 <nop,nop,timestamp 269 259812>
16:43:12.673315 IP FreeBSD.56171 > NetBSD.ssh: . ack 4391408 win 7740 <nop,nop,timestamp 259836 269>
16:43:12.675599 IP FreeBSD.56171 > NetBSD.ssh: . ack 4391408 win 8280 <nop,nop,timestamp 259836 269>
16:43:12.799261 IP NetBSD.ssh > FreeBSD.56171: . ack 216603 win 68 <nop,nop,timestamp 270 259812>
16:43:12.900628 IP NetBSD.ssh > FreeBSD.56171: . 4391408:4392848(1440) ack 217839 win 30 <nop,nop,timestamp 270 259836>
16:43:12.903525 IP NetBSD.ssh > FreeBSD.56171: . 4392848:4394288(1440) ack 217839 win 30 <nop,nop,timestamp 270 259836>
16:43:12.904630 IP FreeBSD.56171 > NetBSD.ssh: . 217839:218799(960) ack 4394288 win 8100 <nop,nop,timestamp 259859 270>
16:43:12.906231 IP NetBSD.ssh > FreeBSD.56171: . 4394288:4395728(1440) ack 217839 win 30 <nop,nop,timestamp 270 259836>
16:43:12.907031 IP FreeBSD.56171 > NetBSD.ssh: . ack 4395728 win 8280 <nop,nop,timestamp 259859 270>
16:43:13.131700 IP NetBSD.ssh > FreeBSD.56171: . 4395728:4397168(1440) ack 218799 win 0 <nop,nop,timestamp 270 259859>
16:43:13.134319 IP NetBSD.ssh > FreeBSD.56171: . 4397168:4398608(1440) ack 218799 win 0 <nop,nop,timestamp 270 259859>
16:43:13.135221 IP FreeBSD.56171 > NetBSD.ssh: . ack 4398608 win 8100 <nop,nop,timestamp 259882 270>
16:43:13.136843 IP NetBSD.ssh > FreeBSD.56171: . 4398608:4400048(1440) ack 218799 win 0 <nop,nop,timestamp 270 259859>
16:43:13.137692 IP FreeBSD.56171 > NetBSD.ssh: . ack 4400048 win 8280 <nop,nop,timestamp 259882 270>
16:43:13.139551 IP NetBSD.ssh > FreeBSD.56171: . 4400048:4401488(1440) ack 218799 win 0 <nop,nop,timestamp 270 259859>
16:43:13.142504 IP NetBSD.ssh > FreeBSD.56171: . 4401488:4402928(1440) ack 218799 win 0 <nop,nop,timestamp 270 259859>
16:43:13.143268 IP FreeBSD.56171 > NetBSD.ssh: . ack 4402928 win 7993 <nop,nop,timestamp 259883 270>

However, that was not a perfect change, because the connection ends like this:
16:43:31.734465 IP FreeBSD.56171 > NetBSD.ssh: . 477711:479143(1432) ack
5174804 win 7113 <nop,nop,timestamp 261742 308>
16:43:31.739882 IP FreeBSD.56171 > NetBSD.ssh: . ack 5174804 win 7625
<nop,nop,timestamp 261743 308>
16:43:31.744796 IP FreeBSD.56171 > NetBSD.ssh: . ack 5174804 win 8157
<nop,nop,timestamp 261743 308>
16:43:31.905789 IP NetBSD.ssh > FreeBSD.56171: . ack 453231 win 0
<nop,nop,timestamp 308 261704>
16:43:31.956684 IP NetBSD.ssh > FreeBSD.56171: . 5176244:5177684(1440)
ack 453231 win 0 <nop,nop,timestamp 308 261704>
16:43:31.957530 IP FreeBSD.56171 > NetBSD.ssh: . ack 5174804 win 8280
<nop,nop,timestamp 261765 308,nop,nop,sack 1 {5176244:5177684}>
16:43:31.958955 IP NetBSD.ssh > FreeBSD.56171: . ack 453231 win 0
<nop,nop,timestamp 308 261704>
16:43:32.174523 IP NetBSD.ssh > FreeBSD.56171: . 5174804:5176244(1440)
ack 453231 win 0 <nop,nop,timestamp 308 261765>
16:43:32.175434 IP FreeBSD.56171 > NetBSD.ssh: . ack 5177684 win 7920
<nop,nop,timestamp 261786 308>
16:43:32.175500 IP FreeBSD.56171 > NetBSD.ssh: . ack 5177684 win 8280
<nop,nop,timestamp 261786 308>
16:43:32.177426 IP NetBSD.ssh > FreeBSD.56171: . 5177684:5179124(1440)
ack 453231 win 0 <nop,nop,timestamp 308 261765>
16:43:32.185135 IP NetBSD.ssh > FreeBSD.56171: . ack 453231 win 128
<nop,nop,timestamp 308 261765>
16:43:32.186063 IP FreeBSD.56171 > NetBSD.ssh: . 453231:454671(1440) ack
5179124 win 8280 <nop,nop,timestamp 261787 308>
16:43:32.186184 IP FreeBSD.56171 > NetBSD.ssh: . 454671:456111(1440) ack
5179124 win 8280 <nop,nop,timestamp 261787 308>
16:43:32.186289 IP FreeBSD.56171 > NetBSD.ssh: . 456111:457327(1216) ack
5179124 win 8280 <nop,nop,timestamp 261787 308>
16:43:32.392350 IP NetBSD.ssh > FreeBSD.56171: . 5179124:5180564(1440)
ack 453231 win 128 <nop,nop,timestamp 309 261786>
16:43:32.394970 IP NetBSD.ssh > FreeBSD.56171: . 5180564:5182004(1440)
ack 453231 win 128 <nop,nop,timestamp 309 261786>
16:43:32.395842 IP FreeBSD.56171 > NetBSD.ssh: . ack 5182004 win 8100
<nop,nop,timestamp 261808 309>
16:43:32.397691 IP NetBSD.ssh > FreeBSD.56171: . 5182004:5183444(1440)
ack 453231 win 128 <nop,nop,timestamp 309 261786>
16:43:32.408085 IP FreeBSD.56171 > NetBSD.ssh: . ack 5183444 win 8280
<nop,nop,timestamp 261810 309>
16:43:32.416708 IP NetBSD.ssh > FreeBSD.56171: . 5183444:5184884(1440)
ack 454671 win 83 <nop,nop,timestamp 309 261787>
16:43:32.419593 IP NetBSD.ssh > FreeBSD.56171: . 5184884:5186324(1440)
ack 454671 win 83 <nop,nop,timestamp 309 261787>
16:43:32.420415 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8100
<nop,nop,timestamp 261811 309>
16:43:32.439702 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 309 261787>
16:43:32.610095 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 309 261787>
16:43:32.622473 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 309 261787>
16:43:32.634325 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 309 261787>
16:43:33.446151 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8280
<nop,nop,timestamp 261914 309>
16:43:33.660807 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 311 261787>
16:43:34.410749 IP NetBSD.ssh > FreeBSD.56171: . 5179124:5180564(1440)
ack 457327 win 0 <nop,nop,timestamp 313 261787>
16:43:34.411650 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8280
<nop,nop,timestamp 262010 311>
16:43:34.625870 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 313 261787>
16:43:37.412651 IP NetBSD.ssh > FreeBSD.56171: . 5179124:5180564(1440)
ack 457327 win 0 <nop,nop,timestamp 319 261787>
16:43:37.413365 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8280
<nop,nop,timestamp 262310 313>
16:43:37.628045 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 319 261787>
16:43:43.414789 IP NetBSD.ssh > FreeBSD.56171: . 5179124:5179648(524)
ack 457327 win 0 <nop,nop,timestamp 331 261787>
16:43:43.415237 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8280
<nop,nop,timestamp 262911 319>
16:43:43.629769 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 331 261787>
16:43:55.423047 IP NetBSD.ssh > FreeBSD.56171: . 5179124:5179648(524)
ack 457327 win 0 <nop,nop,timestamp 355 261787>
16:43:55.423645 IP FreeBSD.56171 > NetBSD.ssh: . ack 5186324 win 8280
<nop,nop,timestamp 264112 331>
16:43:55.637501 IP NetBSD.ssh > FreeBSD.56171: . ack 457327 win 0
<nop,nop,timestamp 355 261787>

and the connection is eventually reset after a couple of minutes of those
last three packets being replayed.

Darren


Darren Reed
2012-04-03 07:45:54 UTC
Post by Robert Elz
> ...
> What I did notice already is that the reason for the window filling seems to
> be packet loss (you already saw that it is a long RTT connection, it is
> entirely possible for one lost packet to result in a window full stall before
> the retransmit starts it all going again).

I was wondering something similar: is FreeBSD simply sending packets
because it believes that the window will open up when, in fact, the
window does not?

Post by Robert Elz
> It is possible there's actually nothing at all wrong, though the trace does
> end with a RST during one of these window full stalls (there are a few through
> the full connection). But the FreeBSD node's behaviour still looks a bit
> odd (especially if the assumption I believe is correct that the trace was
> taken quite near it, meaning what it shows and what the FreeBSD system saw
> are quite close, both in packet accuracy & order, and time).

Correct. The system on which the capture is being made is on the
same LAN as FreeBSD (~1ms away), so any packet in the trace that
is from NetBSD is very likely seen by FreeBSD.

It may be that the ISP throttling of packets is leading to some
packets from NetBSD being dropped and that they will be missing
from the trace.

Darren


David Laight
2012-04-03 18:21:02 UTC
Post by Robert Elz
> Date: Mon, 2 Apr 2012 20:28:57 -0700
> | It seems to be more than just "not recommended", as in "strongly
> | discouraged" and the "robustness principle dictates" that you should
> | never do it yourself even if someone else does.
> Agreed - that was pretty much what I said in my first message(s) on
> the topic.

Hmmmm....

If you are trying to send data over a high speed (relatively) high
latency link it may be necessary to have several hundred full sized
ethernet packets 'in flight'.

This requires the receiving end advertise a very large window, but
the receiving end will also expect the receiving application to
(continue to?) read data out of the socket more or less as soon as
it arrives.

This means that the receiving end is likely to 'overcommit' (in some
sense) the receive buffer space, so if the application stops reading
data - particularly if the kernel itself is low on memory, it may wish
to reduce the size of the receive window hoping to apply flow control
back to the sending system more thoroughly.

OTOH having sent a 'window slam' there is no reason not to accept
(and ack) data received afterwards provided there is receive
buffering available. (TCP might forbid this, but within a RTT the
sender can probably not tell.)

From experiments I've done with other protocols, it is certainly
best not to reduce the window to zero - because missing the window
opening msg is problematical. Best to leave it open a bit and
defer sending acks. (might confuse RTT timings though!)

David
--
David Laight: ***@l8s.co.uk

Dennis Ferguson
2012-04-03 21:02:38 UTC
On 3 Apr, 2012, at 00:16 , Darren Reed wrote:

Ah, that's a trace that is easier to look at. I see no evidence
that NetBSD has shrunk its window significantly anywhere in that
sequence, however, other than the unavoidable window scaling jitter.
More than this, it seems like FreeBSD is actually tracking the
window shrinkage that jitter causes.
Post by Darren Reed
> 09:03:38.020855 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 230249, win 295, options [nop,nop,TS val 249 ecr 1076458], length 0

The right edge of the window for this ack is 230249+295*32 = 239689.
That's the furthest right it ever extends.

Post by Darren Reed
> 09:03:38.053904 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 232917, win 211, options [nop,nop,TS val 249 ecr 1076458], length 0

The right edge of this window is 239669.

Post by Darren Reed
> 09:03:38.080020 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 235568, win 128, options [nop,nop,TS val 249 ecr 1076463], length 0

The right edge of this window is 239664.

Post by Darren Reed
> 09:03:38.112052 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076464], length 0

The right edge of this window is 239672.

Post by Darren Reed
> 09:03:38.222896 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4438371:4439811, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440

The right edge of this window is 239672.

Post by Darren Reed
> 09:03:38.223740 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239664:239672, ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4439811}], length 8

FreeBSD agrees the right edge of the window is 239672 at this point since it
sends 8 more bytes to fill it up. It doesn't seem to remember that the window
had been advertised as open to 239689 a few packets earlier (it didn't send 25
bytes), however, so the evidence seems to be that FreeBSD is in fact tracking
the window shrinkage jitter. FreeBSD has also figured out that 4436931:4438371
is lost and wants it.

Post by Darren Reed
> 09:03:38.225524 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4439811:4441251, ack 238424, win 39, options [nop,nop,TS val 249 ecr 1076479], length 1440

The right edge is still 239672.

Post by Darren Reed
> 09:03:38.226304 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], ack 4436931, win 8280, options [nop,nop,TS val 1076502 ecr 249,nop,nop,sack 1 {4438371:4441251}], length 0
> 09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0

The right edge of this window is 239664.

Post by Darren Reed
> 09:03:38.437943 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 249 ecr 1076502], length 1440

This packet is the interesting one. It is a retransmission of the missing
data that FreeBSD asked for 200 ms earlier. It also ack's the 8 bytes that
FreeBSD sent in the same packet, which means that it has accepted and is
ack'ing data beyond the window edge the immediately previous packet advertised
(and the 239689 thing above suggests that FreeBSD would have noted this). The
right hand edge of the window has moved to 239672, but is still closed.

About 1 ms later FreeBSD blasts out beyond the edge of the window. This is
clearly incorrect. It is no longer sack'ing for the data in the packet
above so this must be in response to receiving the packet above.

Post by Darren Reed
> 09:03:38.439168 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 239672:241112, ack 4441251, win 7740, options [nop,nop,TS val 1076524 ecr 249], length 1440
> 09:03:38.439290 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 241112:242552, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
> 09:03:38.439412 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 242552:243992, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440
> 09:03:38.439535 IP FreeBSD.35421 > NetBSD.ssh: Flags [.], seq 243992:245432, ack 4441251, win 8139, options [nop,nop,TS val 1076524 ecr 249], length 1440

Somehow I think it is unlikely this problem has much to do with windows
shrinking. There is evidence that FreeBSD is actually tracking window
shrinkage (it could have sent 25 bytes instead of 8, but didn't) so it
appears that FreeBSD is fine with that. There is no evidence that
NetBSD ever advertised a window as big as FreeBSD thinks it now is. The
unique things about the NetBSD packet which prompted the insanity are
that (1) it was a retransmission of missing data, and (2) NetBSD
acknowledges data from FreeBSD in the packet which the previous
packet NetBSD sent indicated was outside the window.
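
The right-edge figures in this analysis can be checked with a few lines of C. This is only a sketch: the window scale shift of 5 (a factor of 32) is inferred from the 230249+295*32 arithmetic, and the name right_edge() is made up for illustration rather than taken from any BSD source.

```c
#include <stdint.h>

/* Assumption: the peer's window field is scaled by 2^5 = 32, as the arithmetic above implies. */
#define WSCALE_SHIFT 5

/*
 * Right-hand edge of the advertised window: the ack number plus the
 * scaled window field. Sequence space wraps, so 32-bit unsigned
 * arithmetic is used throughout.
 */
uint32_t
right_edge(uint32_t ack, uint16_t win)
{
	return ack + ((uint32_t)win << WSCALE_SHIFT);
}
```

Feeding in the ack/win pairs from the trace gives 239689, 239669, 239664 and 239672 in turn, and the difference between the furthest edge (239689) and the closed window at ack 239664 is the 25 bytes mentioned.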

I think you really want to look at what the FreeBSD code does when
it starts off knowing that its neighbor's window is closed and then
gets a packet that does those two things.

Dennis Ferguson



Darren Reed
2012-04-04 02:43:52 UTC
Post by Dennis Ferguson
> ...
> I think you really want to look at what the FreeBSD code does when
> it starts off knowing that its neighbor's window is closed and then
> gets a packet that does those two things.

Indeed, the discussion over this issue started with a thread
on the FreeBSD networking list:
http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031894.html

The latest in that thread is here, with a proposed patch:
http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031934.html

Note that many of the logic changes here are in code that looks
the same on NetBSD, so this patch may be relevant for NetBSD too.

Darren


Dennis Ferguson
2012-04-04 22:19:18 UTC
Post by Darren Reed
Post by Dennis Ferguson
>> ...
>> I think you really want to look at what the FreeBSD code does when
>> it starts off knowing that its neighbor's window is closed and then
>> gets a packet that does those two things.
> Indeed, the discussion over this issue started with a thread
> http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031894.html
> http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031934.html
> Note that much of the logic changes here are in code that looks
> the same on NetBSD so this patch may be relevant for NetBSD too.

Yes, NetBSD has the problem too and that change probably works. I must
be getting slow, though: it still took me a while to figure out
how that change fixed it, since I couldn't see how the problem occurred
in the first place and couldn't find it in the discussion above. The
root cause seems to be an underflow of an unsigned variable, tp->snd_wnd.

For whoever patches this, I think the problem goes as follows. The
last two packets in the trace prior to the bad behavior are these:

09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, options [nop,nop,TS val 249 ecr 1076479], length 0
09:03:38.437943 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, ack 239672, win 0, options [nop,nop,TS val 249 ecr 1076502], length 1440

That second packet is unusual because (1) it is carrying old, retransmitted data,
and (2) it ack's 8 bytes of data outside the window advertised in the previous
packet. The code in tcp_input() which updates the window from information in a
packet is this, as Darren pointed out:

    /*
     * Update window information.
     * Don't look at window if no ACK: TAC's send garbage on first SYN.
     */
    if ((tiflags & TH_ACK) && (SEQ_LT(tp->snd_wl1, th->th_seq) ||
        (tp->snd_wl1 == th->th_seq && (SEQ_LT(tp->snd_wl2, th->th_ack) ||
        (tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd))))) {
            /* keep track of pure window updates */
            if (tlen == 0 &&
                tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd)
                    TCP_STATINC(TCP_STAT_RCVWINUPD);
---->       tp->snd_wnd = tiwin;
            tp->snd_wl1 = th->th_seq;
            tp->snd_wl2 = th->th_ack;
            if (tp->snd_wnd > tp->max_sndwnd)
                    tp->max_sndwnd = tp->snd_wnd;
            needoutput = 1;
    }

If the condition in the if() is satisfied this copies the send window
from the packet. The condition in the if() can only be true, however,
if the packet is carrying a current sequence number like the first
packet above; if the packet is retransmitting old data, like the
second packet, it ignores the window in the packet and retains the
previous value of the window.
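
To make the acceptance test easier to experiment with, here is a cut-down sketch of the same condition as a standalone C function. The struct win_state type and the function name window_update_accepted() are hypothetical stand-ins, not kernel names; the TH_ACK check is left out on the assumption that the caller has already made it; and SEQ_LT() is written in the style of the usual BSD wraparound-safe macro.

```c
#include <stdint.h>

/* Wraparound-safe "a is before b", in the style of the BSD SEQ_LT() macro. */
#define SEQ_LT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)

/* Hypothetical, cut-down stand-in for the struct tcpcb fields used here. */
struct win_state {
	uint32_t snd_wl1;	/* seq number of the segment used for the last update */
	uint32_t snd_wl2;	/* ack number of the segment used for the last update */
	uint32_t snd_wnd;	/* current send window, in bytes */
};

/*
 * Returns 1 if a segment with the given seq, ack and window would pass
 * the three-way test quoted above, 0 if its window would be ignored.
 */
int
window_update_accepted(const struct win_state *tp, uint32_t seq,
    uint32_t ack, uint32_t tiwin)
{
	return SEQ_LT(tp->snd_wl1, seq) ||
	    (tp->snd_wl1 == seq && (SEQ_LT(tp->snd_wl2, ack) ||
	    (tp->snd_wl2 == ack && tiwin > tp->snd_wnd)));
}
```

Assuming the pure ACK at 09:03:38.416891 carried sequence number 4441251 (tcpdump does not print it, so that value is a guess based on the data NetBSD had already sent), the state afterwards is snd_wl1 = 4441251, snd_wl2 = 239664, snd_wnd = 0. The retransmission with seq 4436931 and ack 239672 is then rejected, because its sequence number is older than snd_wl1, even though its ack number is newer.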

Note, however, that if the packet carrying an old sequence number is also
carrying an ack for previously un-acked data, tp->snd_wnd must be updated
regardless of the sequence number. In the existing code this is handled
before the if() above, with code which looks like

    acked = th->th_ack - tp->snd_una;
    . . .
    if (acked > so->so_snd.sb_cc) {
            tp->snd_wnd -= so->so_snd.sb_cc;
            sbdrop(&so->so_snd, (int)so->so_snd.sb_cc);
            ourfinisacked = 1;
    } else {
            if (acked > (tp->t_lastoff - tp->t_inoff))
                    tp->t_lastm = NULL;
            sbdrop(&so->so_snd, acked);
            tp->t_lastoff -= acked;
---->       tp->snd_wnd -= acked;
            ourfinisacked = 0;
    }


So the algorithm seems to be:

- If the packet ack's some data, adjust the current tp->snd_wnd to account
for this. That is, keep the window advertised in a previous packet, but
update it to account for what's been ack'd since then.

- If it then decides to believe the window in the new packet, it overwrites
tp->snd_wnd with the new packet's window. If it decides not to believe it, it
continues to use the window it got previously, appropriately adjusted.

Given this, the first packet above is carrying a window the if() would decide to
believe, so after it is processed tp->snd_wnd will be 0, copied from the packet.

The odd things about the second packet cause the problem as follows. Odd thing
(2), the fact that the packet ack's 8 bytes outside the window advertised in the
previous packet, causes it to subtract 8 from the zero-valued tp->snd_wnd. Since
tp->snd_wnd is a u_long it ends up being a really big number. Odd thing (1), the
fact that the packet is carrying old data, makes the if() evaluate false so the
previous error is not overwritten with the window from the packet, as it might
normally be. tcp_output() gets called with a really big tp->snd_wnd, which prompts
it to emit a whole congestion window of new packets.
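
The wraparound itself is easy to reproduce outside the kernel. The sketch below is an illustration only: adjust_window_unsigned() is a made-up name, and a fixed 32-bit type is used so the result is the same everywhere, whereas the real field is a u_long whose width depends on the platform.

```c
#include <stdint.h>

/*
 * Hypothetical reduction of the ack-processing step: subtract the newly
 * acked byte count from an unsigned send window, as "tp->snd_wnd -= acked"
 * does. With unsigned arithmetic the result wraps modulo 2^32 whenever
 * acked is larger than the window.
 */
uint32_t
adjust_window_unsigned(uint32_t snd_wnd, uint32_t acked)
{
	return snd_wnd - acked;
}
```

With a zero window and 8 newly acked bytes this returns 4294967288, i.e. 2^32 - 8, which is the "really big number" handed to tcp_output().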

The patch above for FreeBSD changes the behavior of the current code by making it
update tp->snd_wnd from any packet which ack's new data. That makes the broken
adjustment code go away and otherwise seems like a reasonable thing to do to me,
though I don't know for sure since I don't know why the old code didn't do this.
An alternative patch, which would fix the bug but retain the behavior of the old
code, would be to just avoid letting tp->snd_wnd get decremented below zero.
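
A minimal sketch of that alternative, assuming nothing beyond what is described here: the helper name adjust_window_clamped() is hypothetical and the 32-bit type is chosen for illustration. It is not the FreeBSD patch and has not been tested in a kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shrink the send window by the number of newly
 * acked bytes, but stop at zero instead of wrapping around.
 */
uint32_t
adjust_window_clamped(uint32_t snd_wnd, uint32_t acked)
{
	return (acked > snd_wnd) ? 0 : snd_wnd - acked;
}
```

In the failing case (window 0, 8 bytes acked) this leaves the window closed, so the old behaviour is kept for every case where the subtraction was already safe.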

Dennis Ferguson
--
Dennis Ferguson
2012-04-04 22:49:47 UTC
Post by David Laight
> From experiments I've done with other protocols, it is certainly
> best not to reduce the window to zero - because missing the window
> opening msg is problematical. Best to leave it open a bit and
> defer sending acks. (might confuse RTT timings though!)

You probably won't find it in any specification document but the
routing protocol BGP depends quite strongly, for its stability, on
zero-window TCP behaviour working well. In fact it could be argued that BGP
came to be used for almost all the heavy lifting of Internet routing
for a couple of decades now precisely because of this feature of the
TCP transport. Prior routing protocols had custom transports with no
equivalent to a zero window to shut your neighbors up, and as a result
were inherently difficult to keep from melting if you carried a lot of
routes in there and then put the routers under stress (networks running
BGP have been known to melt as well, but generally from bad implementations
rather than difficulties inherent in the design).

I don't usually pay much attention to TCP, but I cringe a bit when I hear
about problems with zero window behavior.

Dennis Ferguson

Robert Elz
2012-04-05 01:23:08 UTC
I was thinking in a similar way, but what I think I'd do to
fix this is get rid of the concept of storing the peer's window
size as advertised - that's a meaningless number, and is only
carried in the packet as it is to save bytes.

What the implementation should be tracking is the actual window,
that is, the begin & end sequence numbers of the window, rather
than its size. We have the begin - that's the ack point. What's
needed is to keep the far end of the window (whether the last in,
or first not in, whichever makes for the simpler implementation)
and do away with the window size field (tp->snd_wnd). If knowing
the max window size (tp->max_sndwnd) is actually useful for something
(aside from reporting, I can't really imagine what) it can be
retained, otherwise it could go too.

Then the problem as you analysed it would not have happened - if the
ack moves forward, then so does the front edge of the window (and
the window becomes smaller without adjusting anything else).
If we choose to use the window size value, then we move the far
edge of the window (usually forwards, but potentially backwards)
independently of whatever happened to the front edge.
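
One way to picture this proposal is the sketch below. It is an assumption-laden illustration, not code from any BSD tree: struct snd_window, its snd_edge field and the three functions are invented names, and the "first byte not in the window" convention is picked arbitrarily from the two options mentioned.

```c
#include <stdint.h>

/* Wraparound-safe "a is after b", in the style of the BSD SEQ_GT() macro. */
#define SEQ_GT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/* Hypothetical state: both ends of the peer's window kept as sequence numbers. */
struct snd_window {
	uint32_t snd_una;	/* front edge: oldest unacknowledged byte */
	uint32_t snd_edge;	/* far edge: first byte NOT in the window */
};

/* An ack only moves the front edge forward; the far edge is left alone. */
void
window_ack(struct snd_window *w, uint32_t ack)
{
	if (SEQ_GT(ack, w->snd_una))
		w->snd_una = ack;
}

/* A window advertisement we choose to believe only moves the far edge. */
void
window_advertise(struct snd_window *w, uint32_t ack, uint32_t win)
{
	w->snd_edge = ack + win;
}

/* Bytes that may still be sent from snd_nxt; zero if the window is closed or overrun. */
uint32_t
window_usable(const struct snd_window *w, uint32_t snd_nxt)
{
	return SEQ_GT(w->snd_edge, snd_nxt) ? w->snd_edge - snd_nxt : 0;
}
```

Replaying the problem case: with both edges at 239664 (window closed) and 8 bytes already sent past it, an ack for 239672 moves only the front edge, and window_usable() at snd_nxt 239672 still reports 0. No subtraction from a stored size takes place, so there is nothing to underflow.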

kre


Dennis Ferguson
2012-04-05 02:57:16 UTC
Post by Robert Elz
> Then the problem as you analysed it would not have happened - if the
> ack moves forward, then so does the front edge of the window (and
> the window becomes smaller without adjusting anything else).
> If we choose to use the window size value, then we move the far
> edge of the window (usually forwards, but potentially backwards)
> independently of whatever happened to the front edge.

Yes, I also thought this would be better since I generally find
it easier to think about the advertised window in terms of the
right-hand end of the sequence space. The current code confuses
me.

After looking at the code, however, I realized that it isn't quite
so clear-cut. It actually tracks two windows, the advertised window
and the congestion window, and since these are often compared they
probably need to be maintained in the same units. Keeping them both
as a byte count means you have to adjust the advertised window for acks
(if you don't just overwrite it, like the FreeBSD patch) but the
congestion window isn't changed by that since it is just a pipe size.
Keeping them both as sequence numbers would save adjusting the advertised
window for acks but would make you adjust the congestion window sequence
number to account for those acks instead. This makes it kind of a wash.
The advertised window is easier to think about as a sequence, the
congestion window is easier to think about as a byte count and either
way you are adjusting one of those windows to account for acks.

The FreeBSD patch does make the problem go away if it works (no more
adjusting the advertised window for acks, it is always copied) but I'd
be a little worried about corner cases at the end or start of connections
which the current code somehow handles. Making tp->snd_wnd signed and
dealing with negative values might be the easiest way to preserve the
current behaviour.
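
As a rough sketch of what "signed, and dealing with negative values" could look like, assuming the advertised window stays a byte count so it can still be compared with the congestion window: both function names are hypothetical, and a 64-bit signed type is used purely to keep the illustration free of overflow concerns.

```c
#include <stdint.h>

/*
 * Hypothetical: the advertised window kept as a signed byte count, so
 * that acks for data beyond the window make it negative instead of
 * wrapping to a huge value.
 */
int64_t
adjust_window_signed(int64_t snd_wnd, uint32_t acked)
{
	return snd_wnd - (int64_t)acked;
}

/*
 * Effective send window: the smaller of the advertised and congestion
 * windows, with a zero or negative advertised window treated as closed.
 */
uint32_t
effective_window(int64_t snd_wnd, uint32_t snd_cwnd)
{
	if (snd_wnd <= 0)
		return 0;
	return (snd_wnd < (int64_t)snd_cwnd) ? (uint32_t)snd_wnd : snd_cwnd;
}
```

In the failing case the window becomes -8 rather than a very large unsigned value, and the comparison with the congestion window then yields 0 bytes to send. Every place that reads the field would need the same negative-value check, which is the cost of this approach.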

Dennis Ferguson

Robert Elz
2012-04-01 12:51:11 UTC
When I read my earlier reply when it came back from the list, I
see that I was less clear than I intended...

I suspect what you're seeing in the BSD code is an attempt to deal
with the issue of receiving two otherwise identical packets, with
differing window sizes advertised.

In that scenario, we can't tell which was sent first (it is not
safe to assume that the packets were not reordered during transit.)

On the other hand, a properly (well) implemented peer will not lower
the window size in that scenario, where it is entirely proper to
send a larger window (which typically happens when data is delivered
to the peer application.) So, given those two otherwise identical
packets, it is a reasonable assumption to make that the one with the
bigger window size was sent after the one with the smaller window size,
and hence we should process them as if that were true. That, as I
understand it, is what the code does.

If there is some other way to order the packets (differing sequence
numbers, differing ack numbers) then that is used to order them, and
window size reductions work just fine (as well as they're ever able to
work anyway.)

kre

