How to use TCP window autosizing?

Discussion:

Dave Huang

2015-08-03 06:08:33 UTC

Hi, reading through
https://wiki.netbsd.org/tutorials/tuning_netbsd_for_performance/#index3h2
and http://proj.sunet.se/E2E/tcptune.html , my understanding is that
the net.inet.tcp.recvbuf_auto and net.inet.tcp.sendbuf_auto
enable/disable TCP window autosizing, and that the initial window size
is net.inet.tcp.{recv,send}space and that it'll increase by
net.inet.tcp.{recv,send}buf_inc up to net.inet.tcp.{recv,send}buf_max.
And that kern.sbmax also limits the maximum window size?

If window autosizing is enabled, is it supposed to just work
everywhere automatically, or does each program need to opt-in to it?
Because I'm not seeing anything happening.

I have a 100Mbps internet connection, and need to transfer files from
a server on the other side of the world. Round trip ping times are in
the 250ms range. So, according to the formula on the NetBSD wiki,
buffer size = RTT * bandwidth = 250ms * 100Mbps = 3.125MB.
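As a quick check of that arithmetic, here is a minimal Python sketch of the bandwidth-delay product. The function name is made up for illustration; the 100 Mbps and 250 ms figures and the 4194304-byte limit are the ones quoted in this post.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the pipe full for one RTT."""
    return bandwidth_bits_per_s * rtt_s / 8  # divide by 8: bits -> bytes

# 100 Mbps link, 250 ms round-trip time (values from the post)
bdp = bandwidth_delay_product_bytes(100e6, 0.250)
print(bdp)  # 3125000.0 bytes, i.e. 3.125 MB

# The configured recvbuf_max/sbmax of 4194304 bytes is large enough to hold it
fits_in_max = bdp <= 4194304
print(fits_in_max)  # True
```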

I'm running NetBSD-alpha/7.0_RC2, with a kernel compiled with
NMBCLUSTERS=16384. The sysctls mentioned in those two webpages about TCP
tuning are set as:

kern.mbuf.nmbclusters = 16384
kern.somaxkva = 16777216
kern.sbmax = 4194304
net.inet.tcp.rfc1323 = 1
net.inet.tcp.recvspace = 32768
net.inet.tcp.sendspace = 32768
net.inet.tcp.recvbuf_auto = 1
net.inet.tcp.recvbuf_inc = 16384
net.inet.tcp.recvbuf_max = 4194304
net.inet.tcp.sendbuf_auto = 1
net.inet.tcp.sendbuf_inc = 8192
net.inet.tcp.sendbuf_max = 4194304

A tcpdump of scp from the remote machine (running Linux) to the local
NetBSD 7.0_RC2 shows:

00:02:32.693344 IP linux.36692 > netbsd.22: Flags [S], seq 2376757141, win 26883, options [mss 8961,sackOK,TS val 17840090 ecr 0,nop,wscale 7], length 0
00:02:32.693595 IP netbsd.22 > linux.36692: Flags [S.], seq 2458802765, ack 2376757142, win 32768, options [mss 1460,nop,wscale 7,nop,nop,TS val 1 ecr 17840090,sackOK,nop,nop], length 0
00:02:32.935663 IP linux.36692 > netbsd.22: Flags [.], ack 1, win 211, options [nop,nop,TS val 17840150 ecr 1], length 0

So it looks like NetBSD starts with an initial window size of 32768,
which I guess is expected given net.inet.tcp.recvspace = 32768? But
when does the autosizing come into play?

I let it run for 20 seconds, hoping to see the window size increase,
but in the ACKs from NetBSD to Linux, I never see the "win" reported
by tcpdump go above 262 (which I guess with a scaling factor of 2^7 is
262*128 = 33536), and the throughput is around 125kB/s (which is what
I'd expect; 32768 bytes/250 ms = 131kB/s). There doesn't seem to be
any packet loss. The remote side sends a burst of about 32K worth of
data, then there's a pause of about 250ms, then another burst of 32K,
etc.
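The two calculations in the paragraph above can be reproduced with a short Python sketch. The helper names are hypothetical; the inputs (win 262, wscale 7, a 32768-byte window, 250 ms RTT) come from the tcpdump output and sysctl values in this post.

```python
def scaled_window_bytes(win_field, wscale):
    """Window advertised on a non-SYN segment: the 16-bit field shifted by wscale."""
    return win_field << wscale

def window_limited_rate(window_bytes, rtt_s):
    """Upper bound on throughput when only one window can be sent per RTT."""
    return window_bytes / rtt_s

window = scaled_window_bytes(262, 7)
print(window)  # 33536 bytes, barely above the 32768-byte recvspace

rate = window_limited_rate(32768, 0.250)
print(rate)  # 131072.0 bytes/s, close to the ~125 kB/s observed
```

One burst of roughly 32K followed by a 250 ms pause is exactly the pattern this bound predicts.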

00:02:56.012112 IP linux.36692 > netbsd.22: Flags [.], seq 2183886:2185334, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 1448
00:02:56.012233 IP linux.36692 > netbsd.22: Flags [.], seq 2185334:2186782, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 1448
00:02:56.012312 IP linux.36692 > netbsd.22: Flags [P.], seq 2186782:2187774, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 992
00:02:56.012488 IP netbsd.22 > linux.36692: Flags [.], ack 2185334, win 152, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.012589 IP netbsd.22 > linux.36692: Flags [.], ack 2187774, win 133, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.013013 IP netbsd.22 > linux.36692: Flags [.], ack 2187774, win 261, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.022967 IP netbsd.22 > linux.36692: Flags [P.], seq 5360:5400, ack 2187774, win 262, options [nop,nop,TS val 48 ecr 17845919], length 40
[ I think the "win 262" in the previous packet shows that NetBSD has
not increased its window size over about 32K... NetBSD has
consumed all the data in its buffer and is waiting for more, but
the remote Linux is waiting to get its ACKs before sending more ]
00:02:56.264399 IP linux.36692 > netbsd.22: Flags [.], seq 2187774:2189222, ack 5400, win 269, options [nop,nop,TS val 17845982 ecr 48], length 1448
00:02:56.264490 IP linux.36692 > netbsd.22: Flags [.], seq 2189222:2190670, ack 5400, win 269, options [nop,nop,TS val 17845982 ecr 48], length 1448

If I increase net.inet.tcp.recvspace to 4194304, the scp connects and
does the ssh protocol handshake (according to "scp -v"), but the data
transfer never actually starts... no idea what that means. If I set
recvspace to 3145728, scp reports about 3MB/s throughput when it first
starts, but that gradually decreases to around 600kB/s.

So, what's going on, and what can I do to get a decent transfer rate?
If I scp from Windows to the same remote Linux box, the throughput
slowly increases, and after 20 seconds, it's up to about 3.7MB/s, and
it continues to increase very slowly--after 2 minutes, the throughput
is about 4.1MB/s. The network connection is definitely capable of
doing better than 120kB/s or 600kB/s. Of course, the hardware is
completely different... I'm not running the Alpha edition of Windows
NT :) But scp between the Alpha and another machine on the LAN can do
about 2MB/s while maxing out the Alpha's CPU. If needed, I can do some
testing on a NetBSD machine with a modern/fast amd64 CPU, but I'm
pretty sure the Alpha should be able to do better than what I'm
currently seeing.

P.S. The NetBSD wiki mentions, "The automatic setting for sendbuf and
recvbuf is disabled in the default installation." However, it looks
like it was enabled by default since NetBSD 6.0. It also says, "The
initial value for maximal send buffer and receive buffer is both 256k,
which is very tiny," which is still the case. Is there a reason to
keep it so tiny?

--
Name: Dave Huang | Mammal, mammal / their names are called /
INet: ***@azeotrope.org | they raise a paw / the bat, the cat /
FurryMUCK: Dahan | dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 39 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2015-08-03 12:26:31 UTC

Post by Dave Huang
Hi, reading through
https://wiki.netbsd.org/tutorials/tuning_netbsd_for_performance/#index3h2
and http://proj.sunet.se/E2E/tcptune.html , my understanding is that
the net.inet.tcp.recvbuf_auto and net.inet.tcp.sendbuf_auto
enable/disable TCP window autosizing, and that the initial window size
is net.inet.tcp.{recv,send}space and that it'll increase by
net.inet.tcp.{recv,send}buf_inc up to net.inet.tcp.{recv,send}buf_max.

Yes, that's right. It has worked for me.

Post by Dave Huang
And that kern.sbmax also limits the maximum window size?

That makes sense for transmit, since unacked data is in the socket
buffer. And for receive, since advertising a large window implies a
readiness to store data that arrives within it.

Post by Dave Huang
If window autosizing is enabled, is it supposed to just work
everywhere automatically, or does each program need to opt-in to it?
Because I'm not seeing anything happening.

I have not actually read the code, but I am 99% sure that the window is
only increased if there are signs that it isn't big enough. That's
complicated but the key point is that the connection has to be up
against the window, rather than congestion.

Post by Dave Huang
I have a 100Mbps internet connection, and need to transfer files from
a server on the other side of the world. Round trip ping times are in
the 250ms range. So, according to the formula on the NetBSD wiki,
buffer size = RTT * bandwidth = 250ms * 100Mbps = 3.125MB.

That's big, and perhaps you will fill that pipe, but I would be very
surprised if you got 100 Mbps with no loss between you and the other
end.

Post by Dave Huang
I'm running NetBSD-alpha/7.0_RC2, with a kernel compiled with
NMBCLUSTERS=16384. The sysctls mentioned in those two webpages about TCP
tuning are set as:
kern.mbuf.nmbclusters = 16384
kern.somaxkva = 16777216
kern.sbmax = 4194304
net.inet.tcp.rfc1323 = 1
net.inet.tcp.recvspace = 32768
net.inet.tcp.sendspace = 32768
net.inet.tcp.recvbuf_auto = 1
net.inet.tcp.recvbuf_inc = 16384
net.inet.tcp.recvbuf_max = 4194304
net.inet.tcp.sendbuf_auto = 1
net.inet.tcp.sendbuf_inc = 8192
net.inet.tcp.sendbuf_max = 4194304
A tcpdump of scp from the remote machine (running Linux) to the local
NetBSD 7.0_RC2 shows:
00:02:32.693344 IP linux.36692 > netbsd.22: Flags [S], seq 2376757141, win 26883, options [mss 8961,sackOK,TS val 17840090 ecr 0,nop,wscale 7], length 0
00:02:32.693595 IP netbsd.22 > linux.36692: Flags [S.], seq 2458802765, ack 2376757142, win 32768, options [mss 1460,nop,wscale 7,nop,nop,TS val 1 ecr 17840090,sackOK,nop,nop], length 0
00:02:32.935663 IP linux.36692 > netbsd.22: Flags [.], ack 1, win 211, options [nop,nop,TS val 17840150 ecr 1], length 0
So it looks like NetBSD starts with an initial window size of 32768,
which I guess is expected given net.inet.tcp.recvspace = 32768? But
when does the autosizing come into play?
I let it run for 20 seconds, hoping to see the window size increase,
but in the ACKs from NetBSD to Linux, I never see the "win" reported
by tcpdump go above 262 (which I guess with a scaling factor of 2^7 is
262*128 = 33536), and the throughput is around 125kB/s (which is what
I'd expect; 32768 bytes/250 ms = 131kB/s). There doesn't seem to be
any packet loss. The remote side sends a burst of about 32K worth of
data, then there's a pause of about 250ms, then another burst of 32K,
etc.

Hmm. I was expecting to get to this point and suspect packet loss and
tell you to run "xplot" (in pkgsrc) which lets you visualize the
ack/etc. behavior.

Post by Dave Huang
00:02:56.012112 IP linux.36692 > netbsd.22: Flags [.], seq 2183886:2185334, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 1448
00:02:56.012233 IP linux.36692 > netbsd.22: Flags [.], seq 2185334:2186782, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 1448
00:02:56.012312 IP linux.36692 > netbsd.22: Flags [P.], seq 2186782:2187774, ack 5360, win 269, options [nop,nop,TS val 17845919 ecr 47], length 992
00:02:56.012488 IP netbsd.22 > linux.36692: Flags [.], ack 2185334, win 152, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.012589 IP netbsd.22 > linux.36692: Flags [.], ack 2187774, win 133, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.013013 IP netbsd.22 > linux.36692: Flags [.], ack 2187774, win 261, options [nop,nop,TS val 48 ecr 17845919], length 0
00:02:56.022967 IP netbsd.22 > linux.36692: Flags [P.], seq 5360:5400, ack 2187774, win 262, options [nop,nop,TS val 48 ecr 17845919], length 40
[ I think the "win 262" in the previous packet shows that NetBSD has
not increased its window size over about 32K... NetBSD has
consumed all the data in its buffer and is waiting for more, but
the remote Linux is waiting to get its ACKs before sending more ]
00:02:56.264399 IP linux.36692 > netbsd.22: Flags [.], seq 2187774:2189222, ack 5400, win 269, options [nop,nop,TS val 17845982 ecr 48], length 1448
00:02:56.264490 IP linux.36692 > netbsd.22: Flags [.], seq 2189222:2190670, ack 5400, win 269, options [nop,nop,TS val 17845982 ecr 48], length 1448
If I increase net.inet.tcp.recvspace to 4194304, the scp connects and
does the ssh protocol handshake (according to "scp -v"), but the data
transfer never actually starts... no idea what that means. If I set

I wonder if there is a 32-bit bug someplace. I would try a number that
fits in 31 bits.

Post by Dave Huang
recvspace to 3145728, scp reports about 3MB/s throughput when it first
starts, but that gradually decreases to around 600kB/s.

Here, you should use xplot, which will plot transmitted packets, the ack
line, the window, and sacks. Read the README, and use tcpdump2xplot.
It will take you an hour the first time, but then you'll wonder how
anybody can pore over numbers in tcpdump output to understand TCP
ack/window/congestion behavior.

Post by Dave Huang
So, what's going on, and what can I do to get a decent transfer rate?
If I scp from Windows to the same remote Linux box, the throughput
slowly increases, and after 20 seconds, it's up to about 3.7MB/s, and
it continues to increase very slowly--after 2 minutes, the throughput
is about 4.1MB/s. The network connection is definitely capable of
doing better than 120kB/s or 600kB/s. Of course, the hardware is
completely different... I'm not running the Alpha edition of Windows
NT :) But scp between the Alpha and another machine on the LAN can do
about 2MB/s while maxing out the Alpha's CPU. If needed, I can do some
testing on a NetBSD machine with a modern/fast amd64 CPU, but I'm
pretty sure the Alpha should be able to do better than what I'm
currently seeing.

My guess is that the code that's deciding to open the receive window is
not firing. From the receiver's viewpoint, the receive window is too
small if the traffic is bunched up, which is much harder to articulate
precisely than "we are allowed to send data to the other side, the
buffer is full, and we have no unsent data".

Post by Dave Huang
P.S. The NetBSD wiki mentions, "The automatic setting for sendbuf and
recvbuf is disabled in the default installation." However, it looks
like it was enabled by default since NetBSD 6.0. It also says, "The
initial value for maximal send buffer and receive buffer is both 256k,
which is very tiny," which is still the case. Is there a reason to
keep it so tiny?

This is a balance between supporting a large number of connections and
high speed on a small number, depending on memory. Your case is
somewhat unusual, and there are a lot of machines with only a G or so of
memory. Arguably what would be nice is autosizing of socket buffers
too.

Mouse

2015-08-03 13:00:40 UTC

Your case is somewhat unusual, and there are a lot of machines with
only a G or so of memory.

"Only" a G or so? NetBSD supports - or at least I thought it still
paid lip service to supporting - two machines that max out at sixteen
megs.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Dave Huang

2015-08-03 17:28:20 UTC

Post by Greg Troxel
I have not actually read the code, but I am 99% sure that the window is
only increased if there are signs that it isn't big enough. That's
complicated but the key point is that the connection has to be up
against the window, rather than congestion.

The code in question seems to be in sys/netinet/tcp_input.c, around
line 2072. The comment says:

* The criteria to step up the receive buffer one notch are:
* 1. the number of bytes received during the time it takes
* one timestamp to be reflected back to us (the RTT);
* 2. received bytes per RTT is within seven eighth of the
* current socket buffer size;
* 3. receive buffer size has not hit maximal automatic size;

It seems like those criteria should be met... I haven't tried adding
printfs or something in the series of "if"s to see whether it's making
it all the way to the block that increases the size, and if not, which
check is failing.
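To make those criteria concrete, here is a small Python sketch of the decision as I read the comment. It is not the kernel code: the function and parameter names are invented, and treating "within seven eighth" as "more than 7/8 of the current buffer size" is my assumption.

```python
def next_recv_buffer_size(bytes_per_rtt, current_size, increment, max_size):
    """Hypothetical restatement of the three criteria in the quoted comment.

    bytes_per_rtt: bytes received while one timestamp was echoed back (1 RTT)
    current_size:  current socket receive buffer size
    increment:     step size (net.inet.tcp.recvbuf_inc)
    max_size:      automatic ceiling (net.inet.tcp.recvbuf_max)
    """
    # Assumed reading of criterion 2: traffic per RTT exceeds 7/8 of the buffer
    near_full = bytes_per_rtt > current_size * 7 // 8
    # Criterion 3: the buffer has not yet reached the automatic maximum
    below_max = current_size < max_size
    if near_full and below_max:
        return min(current_size + increment, max_size)
    return current_size

# A full 32768-byte burst per RTT should step the buffer up by recvbuf_inc
stepped = next_recv_buffer_size(32768, 32768, 16384, 4194304)
print(stepped)  # 49152

# Half a buffer per RTT leaves the size unchanged
unchanged = next_recv_buffer_size(16000, 32768, 16384, 4194304)
print(unchanged)  # 32768

# Already at the maximum: no further growth
capped = next_recv_buffer_size(4194304, 4194304, 16384, 4194304)
print(capped)  # 4194304
```

Under this reading, a sender that delivers a full 32K burst every RTT would trigger growth, which is why the lack of growth in the capture looks surprising.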

Post by Greg Troxel
That's big, and perhaps you will fill that pipe, but I would be very
surprised if you got 100 Mbps with no loss between you and the other
end.

Perhaps there'd be an occasional lost packet, but it looks like there
are enough contiguous streams of packets with no loss that the
autosizing code should've kicked in.

Post by Greg Troxel
Hmm. I was expecting to get to this point and suspect packet loss and
tell you to run "xplot" (in pkgsrc) which lets you visualize the
ack/etc. behavior.

I used xplot a long time ago, but can't get it working now...
tcpdump2xplot complains about "Malformed entry in dump file". I found
http://mail-index.netbsd.org/current-users/2004/11/30/0010.html which
suggests removing the string " IP" from the tcpdump output, but that
didn't help:

tcpdump2xplot: Malformed entry in dump file :1 "1438577662.919916 52.74.238.147.36628 > 10.1.1.73.22: Flags [S], seq 3042363803, win 26883, options [mss 8961,sackOK,TS val 17717649 ecr 0,nop,wscale 7], length 0"

Actually, I think I used tcptrace last time instead of
tcpdump2xplot... I'm not very proficient at interpreting these graphs,
but I think I know the basics, and I think it confirms that the window
size is too small and that the autosizing should increase the size.
You can take a look at http://www.azeotrope.org/~khym/pics/tsg1.png
and http://www.azeotrope.org/~khym/pics/tsg1_zoomed.png

Post by Greg Troxel

Post by Dave Huang
If I increase net.inet.tcp.recvspace to 4194304, the scp connects and
does the ssh protocol handshake (according to "scp -v"), but the data
transfer never actually starts... no idea what that means. If I set

I wonder if there is a 32-bit bug someplace. I would try a number that
fits in 31 bits.

I set it to 4MB, not 4GB...

Post by Greg Troxel
Here, you should use xplot, which will plot transmitted packets, the ack
line, the window, and sacks. Read the README, and use tcpdump2xplot.
It will take you an hour the first time, but then you'll wonder how
anybody can pore over numbers in tcpdump output to understand TCP
ack/window/congestion behavior.

http://www.azeotrope.org/~khym/pics/tsg2.png and
http://www.azeotrope.org/~khym/pics/tsg2_zoomed.png

I don't really know how to interpret this one, but it doesn't seem
like there was any packet loss/retransmission.

Greg Troxel

2015-08-03 17:42:02 UTC

Post by Dave Huang

The code in question seems to be in sys/netinet/tcp_input.c, around
line 2072. The comment says:
* The criteria to step up the receive buffer one notch are:
* 1. the number of bytes received during the time it takes
* one timestamp to be reflected back to us (the RTT);
* 2. received bytes per RTT is within seven eighth of the
* current socket buffer size;
* 3. receive buffer size has not hit maximal automatic size;
It seems like those criteria should be met... I haven't tried adding
printfs or something in the series of "if"s to see whether it's making
it all the way to the block that increases the size, and if not, which
check is failing.

That more or less makes sense for the criteria.

Post by Dave Huang

Post by Greg Troxel
Hmm. I was expecting to get to this point and suspect packet loss and
tell you to run "xplot" (in pkgsrc) which lets you visualize the
ack/etc. behavior.

I should have pointed you to xplot-devel. (I need to straighten that
out sometime...).

Post by Dave Huang
Actually, I think I used tcptrace last time instead of
tcpdump2xplot... I'm not very proficient at interpreting these graphs,
but I think I know the basics, and I think it confirms that the window
size is too small and that the autosizing should increase the size.
You can take a look at http://www.azeotrope.org/~khym/pics/tsg1.png
and http://www.azeotrope.org/~khym/pics/tsg1_zoomed.png

Those plots look pretty much the same.

Post by Dave Huang

Post by Greg Troxel

I wonder if there is a 32-bit bug someplace. I would try a number that
fits in 31 bits.

I set it to 4MB, not 4GB...

Indeed - and then I have no other ideas.

Post by Dave Huang
http://www.azeotrope.org/~khym/pics/tsg2.png and
http://www.azeotrope.org/~khym/pics/tsg2_zoomed.png
I don't really know how to interpret this one, but it doesn't seem
like there was any packet loss/retransmission.

If that's at the sender, it seems to show packets being sent and the
window line being far away. I do see what looks like dupacks (downward
green tick).

I guess trying to debug/check the receive window growth code is in
order. It's certainly possible there has been a bug in that ~forever.

Greg Troxel

2015-08-03 17:46:26 UTC

One thing to be careful about is that if you don't have the SYN packet,
these sorts of analysis programs get the window wrong because they don't
have wscale. I think you are not having this problem, but it's good to
keep an eye out for it.
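A minimal Python sketch of that pitfall, with a hypothetical helper name: the window field in a SYN segment is never scaled, later segments are shifted by the wscale negotiated in the handshake, and a tool that never saw the SYN has no shift to apply.

```python
def interpret_window(win_field, wscale=None, is_syn=False):
    """Return the window in bytes as an analysis tool would compute it.

    wscale=None models a capture that is missing the SYN, so the
    negotiated shift is unknown and the raw field is used as-is.
    """
    if is_syn or wscale is None:
        return win_field
    return win_field << wscale

# SYN-ACK from the capture: win 32768 is taken literally
syn_window = interpret_window(32768, wscale=7, is_syn=True)
print(syn_window)  # 32768

# Later ACK with the handshake captured: 262 << 7
with_syn = interpret_window(262, wscale=7)
print(with_syn)  # 33536

# Same ACK when the SYN was not captured: the window looks 128x too small
without_syn = interpret_window(262)
print(without_syn)  # 262
```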

Mouse

2015-08-03 19:21:14 UTC

Post by Dave Huang

Post by Greg Troxel
I wonder if there is a 32-bit bug someplace. I would try a number
that fits in 31 bits.

I set it to 4MB, not 4GB...

Sure it's in bytes, not K?

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Dave Huang

2015-08-03 19:31:45 UTC

Post by Mouse
Sure it's in bytes, not K?

sysctl(7) just says it's "The default TCP receive buffer size." without
giving any unit. Pretty sure it's bytes though--if the default of 32768
meant 32MB, that would be plenty, and I don't think I'd be having this
problem.

Dave Huang

2015-08-04 10:27:47 UTC

So it turns out NetBSD's TCP window autosizing works fine after all...
apparently the problem is scp. I was able to get about 6Mbytes/s
throughput by transferring the file via HTTP. While the file is
transferring via HTTP, netstat on the remote Linux end shows about 4MB
in the "Send-Q". Whereas while it's transferring via scp, netstat
shows only 32K or so in the Send-Q.

According to http://www.psc.edu/index.php/hpn-ssh , "SCP and the
underlying SSH2 protocol implementation in OpenSSH is network
performance limited by statically defined internal flow control
buffers." I'm not too sure why I was able to get OK performance using
PuTTY's pscp on Windows, but that page also says, "HPN clients will be
able to download faster from non HPN servers." Perhaps PuTTY doesn't
have the same bottleneck as OpenSSH, so it's able to download faster
even with the slow sshd.
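A rough Python sketch of why an application-level flow-control buffer would cap throughput no matter how large the TCP window grows. The function name is invented, and the 32768-byte application window is an assumption chosen to match the 32K Send-Q seen with scp, not a value taken from the OpenSSH source.

```python
def throughput_ceiling(link_bytes_per_s, tcp_window, rtt_s, app_window=None):
    """Upper bound on throughput in bytes/s.

    The data in flight per RTT is limited by the smaller of the TCP window
    and any application-level window; the link rate caps the result.
    """
    in_flight = tcp_window if app_window is None else min(tcp_window, app_window)
    return min(link_bytes_per_s, in_flight / rtt_s)

link = 100e6 / 8   # 100 Mbps link in bytes/s
rtt = 0.250        # 250 ms round trip

# HTTP: only the 4 MB TCP window applies, so the link is the limit
http_ceiling = throughput_ceiling(link, 4194304, rtt)
print(http_ceiling)  # 12500000.0

# scp: an assumed 32K application window dominates
scp_ceiling = throughput_ceiling(link, 4194304, rtt, app_window=32768)
print(scp_ceiling)  # 131072.0
```

These are ceilings only; the measured 6 Mbytes/s over HTTP sits below the first one, and the roughly 125 kB/s seen with scp sits just under the second.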

Looks like pkgsrc openssh has an option for the HPN patch--I'll give
that a try, and see if I can find something similar for the Linux
side.

Christos Zoulas

2015-08-05 06:09:03 UTC

Post by Dave Huang
So it turns out NetBSD's TCP window autosizing works fine after all...
apparently the problem is scp. I was able to get about 6Mbytes/s
throughput by transferring the file via HTTP. While the file is
transferring via HTTP, netstat on the remote Linux end shows about 4MB
in the "Send-Q". Whereas while it's transferring via scp, netstat
shows only 32K or so in the Send-Q.
According to http://www.psc.edu/index.php/hpn-ssh , "SCP and the
underlying SSH2 protocol implementation in OpenSSH is network
performance limited by statically defined internal flow control
buffers." I'm not too sure why I was able to get OK performance using
PuTTY's pscp on Windows, but that page also says, "HPN clients will be
able to download faster from non HPN servers." Perhaps PuTTY doesn't
have the same bottleneck as OpenSSH, so it's able to download faster
even with the slow sshd.

$ telnet quasar.astron.com 22
Trying 2604:2000:efc0:5:9998:523c:d378:8572...
Connected to quasar.astron.com.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.9 NetBSD_Secure_Shell-20150602-hpn13v14-lpk

christos

Joerg Sonnenberger

2015-08-09 12:33:42 UTC

Note that ssh needs custom buffer scaling as the crypto overhead adds a
noticeable delay, even on local networks.

Post by Dave Huang
I'm running NetBSD-alpha/7.0_RC2, with a kernel compiled with
NMBCLUSTERS=16384.

That shouldn't be needed any longer? Default should be scaling up with
memory.

Joerg
