Discussion:
problems in TCP RTO calculation
(too old to reply)
Greg Troxel
2011-03-12 14:32:03 UTC
Permalink
Two colleagues, Bev Schwartz and Laura Ma, and I have been looking at
the RTT estimation code after Bev noticed excessive RTO times and found
two problems in the code: extra ticks wrongly included it the estimate,
and an inability to lower the RTT estimate far enough due to truncation
rather than rounding in the smoothing calculation. Bev has fixed these
issues in a private tree and we are seeing better TCP performance, and
she noted that srtt storage format does not match the comments.

Looking at the code history, I found that the storage format of srtt was
changed from the 4.4BSD-era convention of fixed point slow ticks with 3
bits to the right of the radix to fixed point seconds/6 bits in
1995/1996, but that the comments weren't updated. At some later point,
the bug in tick counting origin crept in.

The enclosed patch is my attempt to correct the comments and provide
more insight. It's not ready to commit, but I'm heading for finishing
it and looking over it a bit and then commiting. Then, I'll make
smaller commits that affect the specific issues (those commits are
hinted at in the comments attached).

I would appreciate review and comments, and hearing of any objections to
my proposed approach.

Index: sys/netinet/tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.307
diff -u -p -r1.307 tcp_input.c
--- sys/netinet/tcp_input.c 9 Mar 2011 00:44:23 -0000 1.307
+++ sys/netinet/tcp_input.c 12 Mar 2011 14:23:51 -0000
@@ -1687,6 +1687,10 @@ after_listen:
* the current time, or is extremely old, fall back to non-1323
* RTT calculation. Since ts_ecr is unsigned, we can test both
* at the same time.
+ *
+ * Add 1 to the result, so that ts_rtt has an
+ * absraction function that maps 0 to invalid and
+ * positive integers n to n-1 ticks.
*/
ts_rtt = TCP_TIMESTAMP(tp) - opti.ts_ecr + 1;
if (ts_rtt > TCP_PAWS_IDLE)
@@ -1742,6 +1746,7 @@ after_listen:
* this is a pure ack for outstanding data.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime &&
SEQ_GT(th->th_ack, tp->t_rtseq))
@@ -2425,6 +2430,7 @@ after_listen:
* Recompute the initial retransmit timer.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
@@ -3216,6 +3222,10 @@ tcp_pulloutofband(struct socket *so, str
/*
* Collect new round-trip time estimate
* and update averages and current timeout.
+ *
+ * The argument rtt is in units of ticks, and should straightforwardly
+ * be the difference between teh tick counter at transmit and at
+ * receive.
*/
void
tcp_xmit_timer(struct tcpcb *tp, uint32_t rtt)
@@ -3225,14 +3235,29 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
TCP_STATINC(TCP_STAT_RTTUPDATED);
if (tp->t_srtt != 0) {
/*
- * srtt is stored as fixed point with 3 bits after the
- * binary point (i.e., scaled by 8). The following magic
- * is equivalent to the smoothing algorithm in rfc793 with
- * an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
- * point). Adjust rtt to origin 0.
+ * srtt is storeed in fixed-point seconds with 6 bits
+ * after the radix point. The smoothing algorithm
+ * from rfc793 with an alpha of .875 is "srtt = rtt/8
+ * + srtt*7/8". So, we convert rtt in ticks to
+ * seconds (>>1) in 6-bit fixed point (<<6) and then
+ * divide by 8 (>>3), resulting in a net <<2. The
+ * stored srtt is shifted right by lg(1-alpha), or 3.
+ *
+ * In considering how our fixed-point math implements
+ * the equation from rfc793, an important issue is
+ * quantization of the (1-alpha). Given a shift of 3,
+ * srtt cannot fall below a storage value of 8,
+ * because 8>>3 is 0.
+ * XXX TODO: Round rather than truncate, by adding
+ * "1 << (TCP_RTT_SHIFT-1)" before shifting.
*/
delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
+ /*
+ * XXX: explain why 1<<2 is the right value
+ * when rtt goes negative, and how this can
+ * ever happen.
+ */
tp->t_srtt = 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
@@ -3243,6 +3268,10 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* equivalent to rfc793 smoothing with an alpha of .75
* (rttvar = rttvar*3/4 + |delta| / 4). This replaces
* rfc793's wired-in beta.
+ *
+ * XXX delta is in units of seconds with 6-bit fixed
+ * point representation. So this code appears to
+ * assume that rttvar has the same representation.
*/
if (delta < 0)
delta = -delta;
@@ -3254,6 +3283,13 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* No rtt measurement yet - use the unsmoothed rtt.
* Set the variance to half the rtt (so our first
* retransmit happens at 3*rtt).
+ *
+ * For srtt, the extra 2 is <<3 to make up for 6 bits
+ * of fixed point instead of 3, less 1 bit for
+ * converting for ticks to seconds.
+ * XXX For rttvar, the -1 is for halving the rtt, and
+ * the 2 must be for an extra 3 bits of storage format
+ * less 1 bit for ticks to seconds.
*/
tp->t_srtt = rtt << (TCP_RTT_SHIFT + 2);
tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT + 2 - 1);
Index: sys/netinet/tcp_timer.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_timer.h,v
retrieving revision 1.26
diff -u -p -r1.26 tcp_timer.h
--- sys/netinet/tcp_timer.h 28 Apr 2008 20:24:09 -0000 1.26
+++ sys/netinet/tcp_timer.h 12 Mar 2011 14:23:51 -0000
@@ -114,6 +114,7 @@

/*
* Time constants.
+ * Constants beginning TCPTV_ are in integer seconds.
*/
#define TCPTV_MSL ( 30*PR_SLOWHZ) /* max seg lifetime (hah!) */
#define TCPTV_SRTTBASE 0 /* base roundtrip time;
@@ -135,6 +136,7 @@

#define TCP_MAXRXTSHIFT 12 /* maximum retransmits */

+/* Acks are delayed for 1 second; constant is in fast ticks. */
#define TCP_DELACK_TICKS (hz / PR_FASTHZ) /* time to delay ACK */

#ifdef TCPTIMERS
Index: sys/netinet/tcp_var.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_var.h,v
retrieving revision 1.162
diff -u -p -r1.162 tcp_var.h
--- sys/netinet/tcp_var.h 16 Sep 2009 15:23:05 -0000 1.162
+++ sys/netinet/tcp_var.h 12 Mar 2011 14:23:51 -0000
@@ -524,13 +524,22 @@ struct syn_cache_head {
#endif

/*
- * The smoothed round-trip time and estimated variance
- * are stored as fixed point numbers scaled by the values below.
- * For convenience, these scales are also used in smoothing the average
- * (smoothed = (1/scale)sample + ((scale-1)/scale)smoothed).
- * With these scales, srtt has 3 bits to the right of the binary point,
- * and thus an "ALPHA" of 0.875. rttvar has 2 bits to the right of the
- * binary point, and is smoothed with an ALPHA of 0.75.
+ * Historically, 4.4BSD stored smoothed RTT (srtt) and variance
+ * (rttvar) in units of slow ticks using a fixed-point representation
+ * with 3 and 2 bits to the right of the radix point.
+ *
+ * NetBSD now stores srtt in units of seconds, with 6 bits to the
+ * right of the radix point, in order to allow for more precision
+ * (15.6 ms instead of 62.5 ms). TODO: document rttvar storage.
+ *
+ * The original code used the same shift values to represent the
+ * storage format and to compute the weighted moving average. For
+ * srtt, An alpha of 7/8, following RFC793, was chosen and still
+ * remains. For rttvar, the alpha is 3/4.
+ *
+ * XXX Change SHIFT values to LGWEIGHT and REP_SHIFT, and adjust
+ * the code to use the correct ones.
+ *
*/
#define TCP_RTT_SHIFT 3 /* shift for srtt; 3 bits frac. */
#define TCP_RTTVAR_SHIFT 2 /* multiplier for rttvar; 2 bits */
Thor Lancelot Simon
2011-03-12 17:37:00 UTC
Permalink
Post by Greg Troxel
+ *
+ * NetBSD now stores srtt in units of seconds, with 6 bits to the
+ * right of the radix point, in order to allow for more precision
+ * (15.6 ms instead of 62.5 ms). TODO: document rttvar storage.
So, I'm wondering: this makes 15.6ms the minimum value that can
be represented, given the odd <= 0 test you pointed out elsewhere
in the code (the math doesn't appear to be able to cope with a "0" RTT
anyway, even were we able to arrange for it to never go negative (how
does it go negative?) but allow 0 values).

Typical LAN latencies today are on the order of 1/20 the smallest value
we can represent. Latencies on local, routed gigabit networks (for
example, Columbia's and NYU's campus networks) are about 1/10 the
smallest value we can represent, with gigabit bandwidth.

Latencies on metropolitan-area gigabit networks, even with some link
congestion (for example, from NetBSD at ISC to Internet Archive at 300
Paul or from Columbia to NYU) are also about 1/10 the smallest value
we can represent.

Even regional networks (for example, Columbia to MIT) yield RTTs of
about 7ms -- 1/2 the smallest value we can represent.

This suggests to me that with this representation, even with bugs fixed,
many, many cases of interest the RTT will have no effect at all on our
network stack. Still more precision is required.

Meanwhile, 27 bits to the left of the decimal point is still a *lot*
more seconds than we'll ever see for a TCP RTT.

Do I misunderstand something here?

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Erik Fair
2011-03-12 20:16:56 UTC
Permalink
It strikes me that we should just use a 32 bit integer and represent microseconds; that would give us a maximum measurement of an hour and 11.5 minutes, which would allow for TCP connections to devices beyond the moon's orbit, and should be fine enough for 10 gigabit Ethernet latencies, no?

Or would nanosecond units be better?

Erik <***@netbsd.org>


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Laight
2011-03-12 21:32:44 UTC
Permalink
Post by Thor Lancelot Simon
Typical LAN latencies today are on the order of 1/20 the smallest value
we can represent (16.5ms).
What about atypical LANs - eg two systems connected by MII crossover?
Or MII (SERDES?) crossover via a small GE switch?
ie when the host MAC and target MAC are all on the same PCB.

We (at work) nearly did our own GE switch in an fpga (we found a
commercial part (from marvell IIRC) that didn't need PHYs in the end
and used that to save the development effort.
There wasn't the power budget available to use back to back GE PHYs!

The elapsed time for packet is probably only slightly greater than
twice the transmission time - not long for small packets.

We also have atypical traffic patterns - I've had problems with
an external switch failing to process 30000 packets/sec on one TCP/IP
connection! (with almost no other traffic).
This traffic isn't really synchronised between the send and receive
sides (it is data packets from multiple channels of a slower network)
and I've seen problems (linux both ends) with the 0 RTT making it
always be doing 'slow start'!

NFI how NetBSD would fair here, but a lot of the code seems to be optimised
for FTP (or similar bulk transfer) while giving something half reasonable
for command-response and terminal work.
Any other traffic pattern tends to cause grief!

David
--
David Laight: ***@l8s.co.uk

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-03-12 23:07:03 UTC
Permalink
So, I'm wondering: this makes 15.6ms the minimum value that can
be represented, given the odd <= 0 test you pointed out elsewhere
in the code (the math doesn't appear to be able to cope with a "0" RTT
anyway, even were we able to arrange for it to never go negative (how
does it go negative?) but allow 0 values).

Even regional networks (for example, Columbia to MIT) yield RTTs of
about 7ms -- 1/2 the smallest value we can represent.

That's funny; BBN to MIT is 80ms, but that's because MIT routes to the
west coast on an academic testbed. Before, it was more like 20ms,
because it went to NY and back. But I see your point.

This suggests to me that with this representation, even with bugs fixed,
many, many cases of interest the RTT will have no effect at all on our
network stack. Still more precision is required.

I had similar thoughts, but you are missing two points which cause this
not to really matter:

The minimum RTO is longish, I think 1s, regardless of estimates.

Currently, we are measuring RTT in slow ticks. Given that, 15.6ms is
plenty.

The real problem now is that RTT measurements that should be often 0 and
sometimes 1 slow ticks are being computed as often 1 and sometimes 2.
Fixing that makes a real difference.

A minor problem is that the lowest the EWMA will get to following a 1
sample, even with all 0 samples, is storage-rep 8, which is 124 ms,
because 8>>3 is 0. With the rounding fix, that becomes storage-rep 4
and 62 ms.

Estimating RTT really well in super-low delay environments probably
doesn't matter much, because timeouts will be few (due to delays more
than drops). And, since any drops would be due to big congestion
spikes, and buffering is huge compared to unloaded RTT, the odds of a
delayed packet before srtt+4*rttvar seems high, and the point of Van's
1988 paper referenced in RFC2988 is that it's very important that
retranmissions not happen for non-lost segments, and then given that
it's important that they be as prompt as possible.


Also, we wrote a monitoring framework which has TCP put internal state
like srtt values, cwnd, etc plus note about why it's doing things in a
bpf buffer, and modified xplot to show this. So I think really
understanding what's going on with RTT estimation is in order before
making big changes - right now I'm just fixing old bugs. But, our
monitoring framework was key to finding them and understanding their
impact.
m***@gmail.com
2011-03-13 12:30:06 UTC
Permalink
Post by Greg Troxel
Looking at the code history, I found that the storage format of srtt was
changed from the 4.4BSD-era convention of fixed point slow ticks with 3
bits to the right of the radix to fixed point seconds/6 bits in
1995/1996, but that the comments weren't updated. At some later point,
the bug in tick counting origin crept in.
[...]
I would appreciate review and comments, and hearing of any objections to
my proposed approach.
Wasn't it fixed last year? See this thread:
http://marc.info/?l=netbsd-tech-net&m=128020154709349&w=2


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-03-13 12:42:49 UTC
Permalink
Post by m***@gmail.com
http://marc.info/?l=netbsd-tech-net&m=128020154709349&w=2
Huh? That's the begining of this current conversation, when at least I
both understood the code less well than we I now, and I am pretty sure
that nothing changed from it.
m***@gmail.com
2011-03-13 12:59:58 UTC
Permalink
Sorry for the noise, I was under the impression it was already fixed.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-04-20 13:49:45 UTC
Permalink
I've committed a large comment change.

The following diff fixes a 1-tick bias error in rtt estimation. We have
been running it in a private tree, and I haven't tested it in current.
I'll check it in at the earlier of my testing it, someone reporting that
they have tested it, or the tech-net TCP experts concurring.

The rationale was in the previous commit. Basically, ts_rtt is set to

now - send + 1

so that it can be tested against 0 to represent 'no valid estimate' and
one has to undo the 1 before use. (This was ok in 4.4BSD and is a
NetBSD regression from 15 years ago.)

Index: tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.309
diff -u -p -r1.309 tcp_input.c
--- tcp_input.c 20 Apr 2011 13:35:51 -0000 1.309
+++ tcp_input.c 20 Apr 2011 13:42:32 -0000
@@ -1754,7 +1754,7 @@ after_listen:
* this is a pure ack for outstanding data.
*/
if (ts_rtt)
- tcp_xmit_timer(tp, ts_rtt);
+ tcp_xmit_timer(tp, ts_rtt - 1);
else if (tp->t_rtttime &&
SEQ_GT(th->th_ack, tp->t_rtseq))
tcp_xmit_timer(tp,
@@ -2437,7 +2437,7 @@ after_listen:
* Recompute the initial retransmit timer.
*/
if (ts_rtt)
- tcp_xmit_timer(tp, ts_rtt);
+ tcp_xmit_timer(tp, ts_rtt - 1);
else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
Greg Troxel
2011-05-25 23:24:46 UTC
Permalink
I've committed a fix to RTO calculation; now I see 1s instead of 1.5s
for RTO delays.

Thanks to all who volunteered discard servers. I'm getting enough loss
to trigger timeouts to someone's box in France, and also seeing drops to
New Zealand. If people don't mind leaving them for me, I'll use them
to check further TCP changes. But I think I have enough now.

Next is making SACK not set cwnd to 1 packet when retransmitting.
Erik Fair
2011-05-25 23:41:25 UTC
Permalink
Post by Greg Troxel
I've committed a fix to RTO calculation; now I see 1s instead of 1.5s
for RTO delays.
Did you request a pullup to the netbsd-5 branch?

curious,

Erik <***@netbsd.org>


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...