Greg Troxel
2011-03-12 14:32:03 UTC
Two colleagues, Bev Schwartz and Laura Ma, and I have been looking at
the RTT estimation code after Bev noticed excessive RTO times and found
two problems in the code: extra ticks wrongly included it the estimate,
and an inability to lower the RTT estimate far enough due to truncation
rather than rounding in the smoothing calculation. Bev has fixed these
issues in a private tree and we are seeing better TCP performance, and
she noted that srtt storage format does not match the comments.
Looking at the code history, I found that the storage format of srtt was
changed from the 4.4BSD-era convention of fixed point slow ticks with 3
bits to the right of the radix to fixed point seconds/6 bits in
1995/1996, but that the comments weren't updated. At some later point,
the bug in tick counting origin crept in.
The enclosed patch is my attempt to correct the comments and provide
more insight. It's not ready to commit, but I'm heading for finishing
it and looking over it a bit and then commiting. Then, I'll make
smaller commits that affect the specific issues (those commits are
hinted at in the comments attached).
I would appreciate review and comments, and hearing of any objections to
my proposed approach.
Index: sys/netinet/tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.307
diff -u -p -r1.307 tcp_input.c
--- sys/netinet/tcp_input.c 9 Mar 2011 00:44:23 -0000 1.307
+++ sys/netinet/tcp_input.c 12 Mar 2011 14:23:51 -0000
@@ -1687,6 +1687,10 @@ after_listen:
* the current time, or is extremely old, fall back to non-1323
* RTT calculation. Since ts_ecr is unsigned, we can test both
* at the same time.
+ *
+ * Add 1 to the result, so that ts_rtt has an
+ * absraction function that maps 0 to invalid and
+ * positive integers n to n-1 ticks.
*/
ts_rtt = TCP_TIMESTAMP(tp) - opti.ts_ecr + 1;
if (ts_rtt > TCP_PAWS_IDLE)
@@ -1742,6 +1746,7 @@ after_listen:
* this is a pure ack for outstanding data.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime &&
SEQ_GT(th->th_ack, tp->t_rtseq))
@@ -2425,6 +2430,7 @@ after_listen:
* Recompute the initial retransmit timer.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
@@ -3216,6 +3222,10 @@ tcp_pulloutofband(struct socket *so, str
/*
* Collect new round-trip time estimate
* and update averages and current timeout.
+ *
+ * The argument rtt is in units of ticks, and should straightforwardly
+ * be the difference between teh tick counter at transmit and at
+ * receive.
*/
void
tcp_xmit_timer(struct tcpcb *tp, uint32_t rtt)
@@ -3225,14 +3235,29 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
TCP_STATINC(TCP_STAT_RTTUPDATED);
if (tp->t_srtt != 0) {
/*
- * srtt is stored as fixed point with 3 bits after the
- * binary point (i.e., scaled by 8). The following magic
- * is equivalent to the smoothing algorithm in rfc793 with
- * an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
- * point). Adjust rtt to origin 0.
+ * srtt is storeed in fixed-point seconds with 6 bits
+ * after the radix point. The smoothing algorithm
+ * from rfc793 with an alpha of .875 is "srtt = rtt/8
+ * + srtt*7/8". So, we convert rtt in ticks to
+ * seconds (>>1) in 6-bit fixed point (<<6) and then
+ * divide by 8 (>>3), resulting in a net <<2. The
+ * stored srtt is shifted right by lg(1-alpha), or 3.
+ *
+ * In considering how our fixed-point math implements
+ * the equation from rfc793, an important issue is
+ * quantization of the (1-alpha). Given a shift of 3,
+ * srtt cannot fall below a storage value of 8,
+ * because 8>>3 is 0.
+ * XXX TODO: Round rather than truncate, by adding
+ * "1 << (TCP_RTT_SHIFT-1)" before shifting.
*/
delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
+ /*
+ * XXX: explain why 1<<2 is the right value
+ * when rtt goes negative, and how this can
+ * ever happen.
+ */
tp->t_srtt = 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
@@ -3243,6 +3268,10 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* equivalent to rfc793 smoothing with an alpha of .75
* (rttvar = rttvar*3/4 + |delta| / 4). This replaces
* rfc793's wired-in beta.
+ *
+ * XXX delta is in units of seconds with 6-bit fixed
+ * point representation. So this code appears to
+ * assume that rttvar has the same representation.
*/
if (delta < 0)
delta = -delta;
@@ -3254,6 +3283,13 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* No rtt measurement yet - use the unsmoothed rtt.
* Set the variance to half the rtt (so our first
* retransmit happens at 3*rtt).
+ *
+ * For srtt, the extra 2 is <<3 to make up for 6 bits
+ * of fixed point instead of 3, less 1 bit for
+ * converting for ticks to seconds.
+ * XXX For rttvar, the -1 is for halving the rtt, and
+ * the 2 must be for an extra 3 bits of storage format
+ * less 1 bit for ticks to seconds.
*/
tp->t_srtt = rtt << (TCP_RTT_SHIFT + 2);
tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT + 2 - 1);
Index: sys/netinet/tcp_timer.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_timer.h,v
retrieving revision 1.26
diff -u -p -r1.26 tcp_timer.h
--- sys/netinet/tcp_timer.h 28 Apr 2008 20:24:09 -0000 1.26
+++ sys/netinet/tcp_timer.h 12 Mar 2011 14:23:51 -0000
@@ -114,6 +114,7 @@
/*
* Time constants.
+ * Constants beginning TCPTV_ are in integer seconds.
*/
#define TCPTV_MSL ( 30*PR_SLOWHZ) /* max seg lifetime (hah!) */
#define TCPTV_SRTTBASE 0 /* base roundtrip time;
@@ -135,6 +136,7 @@
#define TCP_MAXRXTSHIFT 12 /* maximum retransmits */
+/* Acks are delayed for 1 second; constant is in fast ticks. */
#define TCP_DELACK_TICKS (hz / PR_FASTHZ) /* time to delay ACK */
#ifdef TCPTIMERS
Index: sys/netinet/tcp_var.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_var.h,v
retrieving revision 1.162
diff -u -p -r1.162 tcp_var.h
--- sys/netinet/tcp_var.h 16 Sep 2009 15:23:05 -0000 1.162
+++ sys/netinet/tcp_var.h 12 Mar 2011 14:23:51 -0000
@@ -524,13 +524,22 @@ struct syn_cache_head {
#endif
/*
- * The smoothed round-trip time and estimated variance
- * are stored as fixed point numbers scaled by the values below.
- * For convenience, these scales are also used in smoothing the average
- * (smoothed = (1/scale)sample + ((scale-1)/scale)smoothed).
- * With these scales, srtt has 3 bits to the right of the binary point,
- * and thus an "ALPHA" of 0.875. rttvar has 2 bits to the right of the
- * binary point, and is smoothed with an ALPHA of 0.75.
+ * Historically, 4.4BSD stored smoothed RTT (srtt) and variance
+ * (rttvar) in units of slow ticks using a fixed-point representation
+ * with 3 and 2 bits to the right of the radix point.
+ *
+ * NetBSD now stores srtt in units of seconds, with 6 bits to the
+ * right of the radix point, in order to allow for more precision
+ * (15.6 ms instead of 62.5 ms). TODO: document rttvar storage.
+ *
+ * The original code used the same shift values to represent the
+ * storage format and to compute the weighted moving average. For
+ * srtt, An alpha of 7/8, following RFC793, was chosen and still
+ * remains. For rttvar, the alpha is 3/4.
+ *
+ * XXX Change SHIFT values to LGWEIGHT and REP_SHIFT, and adjust
+ * the code to use the correct ones.
+ *
*/
#define TCP_RTT_SHIFT 3 /* shift for srtt; 3 bits frac. */
#define TCP_RTTVAR_SHIFT 2 /* multiplier for rttvar; 2 bits */
the RTT estimation code after Bev noticed excessive RTO times and found
two problems in the code: extra ticks wrongly included it the estimate,
and an inability to lower the RTT estimate far enough due to truncation
rather than rounding in the smoothing calculation. Bev has fixed these
issues in a private tree and we are seeing better TCP performance, and
she noted that srtt storage format does not match the comments.
Looking at the code history, I found that the storage format of srtt was
changed from the 4.4BSD-era convention of fixed point slow ticks with 3
bits to the right of the radix to fixed point seconds/6 bits in
1995/1996, but that the comments weren't updated. At some later point,
the bug in tick counting origin crept in.
The enclosed patch is my attempt to correct the comments and provide
more insight. It's not ready to commit, but I'm heading for finishing
it and looking over it a bit and then commiting. Then, I'll make
smaller commits that affect the specific issues (those commits are
hinted at in the comments attached).
I would appreciate review and comments, and hearing of any objections to
my proposed approach.
Index: sys/netinet/tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.307
diff -u -p -r1.307 tcp_input.c
--- sys/netinet/tcp_input.c 9 Mar 2011 00:44:23 -0000 1.307
+++ sys/netinet/tcp_input.c 12 Mar 2011 14:23:51 -0000
@@ -1687,6 +1687,10 @@ after_listen:
* the current time, or is extremely old, fall back to non-1323
* RTT calculation. Since ts_ecr is unsigned, we can test both
* at the same time.
+ *
+ * Add 1 to the result, so that ts_rtt has an
+ * absraction function that maps 0 to invalid and
+ * positive integers n to n-1 ticks.
*/
ts_rtt = TCP_TIMESTAMP(tp) - opti.ts_ecr + 1;
if (ts_rtt > TCP_PAWS_IDLE)
@@ -1742,6 +1746,7 @@ after_listen:
* this is a pure ack for outstanding data.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime &&
SEQ_GT(th->th_ack, tp->t_rtseq))
@@ -2425,6 +2430,7 @@ after_listen:
* Recompute the initial retransmit timer.
*/
if (ts_rtt)
+ /* XXX ts_rtt - 1 */
tcp_xmit_timer(tp, ts_rtt);
else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
tcp_xmit_timer(tp, tcp_now - tp->t_rtttime);
@@ -3216,6 +3222,10 @@ tcp_pulloutofband(struct socket *so, str
/*
* Collect new round-trip time estimate
* and update averages and current timeout.
+ *
+ * The argument rtt is in units of ticks, and should straightforwardly
+ * be the difference between teh tick counter at transmit and at
+ * receive.
*/
void
tcp_xmit_timer(struct tcpcb *tp, uint32_t rtt)
@@ -3225,14 +3235,29 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
TCP_STATINC(TCP_STAT_RTTUPDATED);
if (tp->t_srtt != 0) {
/*
- * srtt is stored as fixed point with 3 bits after the
- * binary point (i.e., scaled by 8). The following magic
- * is equivalent to the smoothing algorithm in rfc793 with
- * an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
- * point). Adjust rtt to origin 0.
+ * srtt is storeed in fixed-point seconds with 6 bits
+ * after the radix point. The smoothing algorithm
+ * from rfc793 with an alpha of .875 is "srtt = rtt/8
+ * + srtt*7/8". So, we convert rtt in ticks to
+ * seconds (>>1) in 6-bit fixed point (<<6) and then
+ * divide by 8 (>>3), resulting in a net <<2. The
+ * stored srtt is shifted right by lg(1-alpha), or 3.
+ *
+ * In considering how our fixed-point math implements
+ * the equation from rfc793, an important issue is
+ * quantization of the (1-alpha). Given a shift of 3,
+ * srtt cannot fall below a storage value of 8,
+ * because 8>>3 is 0.
+ * XXX TODO: Round rather than truncate, by adding
+ * "1 << (TCP_RTT_SHIFT-1)" before shifting.
*/
delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
+ /*
+ * XXX: explain why 1<<2 is the right value
+ * when rtt goes negative, and how this can
+ * ever happen.
+ */
tp->t_srtt = 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
@@ -3243,6 +3268,10 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* equivalent to rfc793 smoothing with an alpha of .75
* (rttvar = rttvar*3/4 + |delta| / 4). This replaces
* rfc793's wired-in beta.
+ *
+ * XXX delta is in units of seconds with 6-bit fixed
+ * point representation. So this code appears to
+ * assume that rttvar has the same representation.
*/
if (delta < 0)
delta = -delta;
@@ -3254,6 +3283,13 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* No rtt measurement yet - use the unsmoothed rtt.
* Set the variance to half the rtt (so our first
* retransmit happens at 3*rtt).
+ *
+ * For srtt, the extra 2 is <<3 to make up for 6 bits
+ * of fixed point instead of 3, less 1 bit for
+ * converting for ticks to seconds.
+ * XXX For rttvar, the -1 is for halving the rtt, and
+ * the 2 must be for an extra 3 bits of storage format
+ * less 1 bit for ticks to seconds.
*/
tp->t_srtt = rtt << (TCP_RTT_SHIFT + 2);
tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT + 2 - 1);
Index: sys/netinet/tcp_timer.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_timer.h,v
retrieving revision 1.26
diff -u -p -r1.26 tcp_timer.h
--- sys/netinet/tcp_timer.h 28 Apr 2008 20:24:09 -0000 1.26
+++ sys/netinet/tcp_timer.h 12 Mar 2011 14:23:51 -0000
@@ -114,6 +114,7 @@
/*
* Time constants.
+ * Constants beginning TCPTV_ are in integer seconds.
*/
#define TCPTV_MSL ( 30*PR_SLOWHZ) /* max seg lifetime (hah!) */
#define TCPTV_SRTTBASE 0 /* base roundtrip time;
@@ -135,6 +136,7 @@
#define TCP_MAXRXTSHIFT 12 /* maximum retransmits */
+/* Acks are delayed for 1 second; constant is in fast ticks. */
#define TCP_DELACK_TICKS (hz / PR_FASTHZ) /* time to delay ACK */
#ifdef TCPTIMERS
Index: sys/netinet/tcp_var.h
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_var.h,v
retrieving revision 1.162
diff -u -p -r1.162 tcp_var.h
--- sys/netinet/tcp_var.h 16 Sep 2009 15:23:05 -0000 1.162
+++ sys/netinet/tcp_var.h 12 Mar 2011 14:23:51 -0000
@@ -524,13 +524,22 @@ struct syn_cache_head {
#endif
/*
- * The smoothed round-trip time and estimated variance
- * are stored as fixed point numbers scaled by the values below.
- * For convenience, these scales are also used in smoothing the average
- * (smoothed = (1/scale)sample + ((scale-1)/scale)smoothed).
- * With these scales, srtt has 3 bits to the right of the binary point,
- * and thus an "ALPHA" of 0.875. rttvar has 2 bits to the right of the
- * binary point, and is smoothed with an ALPHA of 0.75.
+ * Historically, 4.4BSD stored smoothed RTT (srtt) and variance
+ * (rttvar) in units of slow ticks using a fixed-point representation
+ * with 3 and 2 bits to the right of the radix point.
+ *
+ * NetBSD now stores srtt in units of seconds, with 6 bits to the
+ * right of the radix point, in order to allow for more precision
+ * (15.6 ms instead of 62.5 ms). TODO: document rttvar storage.
+ *
+ * The original code used the same shift values to represent the
+ * storage format and to compute the weighted moving average. For
+ * srtt, An alpha of 7/8, following RFC793, was chosen and still
+ * remains. For rttvar, the alpha is 3/4.
+ *
+ * XXX Change SHIFT values to LGWEIGHT and REP_SHIFT, and adjust
+ * the code to use the correct ones.
+ *
*/
#define TCP_RTT_SHIFT 3 /* shift for srtt; 3 bits frac. */
#define TCP_RTTVAR_SHIFT 2 /* multiplier for rttvar; 2 bits */