idnits 2.17.1 

draft-paxson-tcpm-rfc2988bis-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 8 instances of too long lines in the document, the longest one
     being 3 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (December 6, 2010) is 4882 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2988' is mentioned on line 381, but not defined

  ** Obsolete undefined reference: RFC 2988 (Obsoleted by RFC 6298)

  == Missing Reference: 'JBB92' is mentioned on line 158, but not defined

  == Missing Reference: 'RFC1122' is mentioned on line 381, but not defined

  == Missing Reference: 'RFC5681' is mentioned on line 404, but not defined

  ** Obsolete normative reference: RFC 2581 (ref. 'APS99') (Obsoleted by RFC
     5681)

  ** Obsolete normative reference: RFC  793 (ref. 'Pos81') (Obsoleted by RFC
     9293)


     Summary: 5 errors (**), 0 flaws (~~), 6 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                                V. Paxson
2	INTERNET DRAFT                                          ICSI/UC Berkeley
3	File: draft-paxson-tcpm-rfc2988bis-01.txt                      M. Allman
4	                                                                    ICSI
5	                                                                  J. Chu
6	                                                                  Google
7	                                                              M. Sargent
8	                                                                    CWRU
9	                                                        December 6, 2010

11	                  Computing TCP's Retransmission Timer

13	Status of this Memo

15	    This Internet-Draft is submitted to IETF in full conformance with
16	    the provisions of BCP 78 and BCP 79.

18	    Internet-Drafts are working documents of the Internet Engineering
19	    Task Force (IETF), its areas, and its working groups.  Note that
20	    other groups may also distribute working documents as Internet-
21	    Drafts.

23	    Internet-Drafts are draft documents valid for a maximum of six
24	    months and may be updated, replaced, or obsoleted by other documents
25	    at any time.  It is inappropriate to use Internet-Drafts as
26	    reference material or to cite them other than as "work in progress."

28	    The list of current Internet-Drafts can be accessed at
29	    http://www.ietf.org/ietf/1id-abstracts.txt.

31	    The list of Internet-Draft Shadow Directories can be accessed at
32	    http://www.ietf.org/shadow.html.

34	    This Internet-Draft will expire on June 6, 2011.

36	Copyright Notice

38	    Copyright (c) 2010 IETF Trust and the persons identified as the
39	    document authors.  All rights reserved.

41	    This document is subject to BCP 78 and the IETF Trust's Legal
42	    Provisions Relating to IETF Documents
43	    (http://trustee.ietf.org/license-info) in effect on the date of
44	    publication of this document.  Please review these documents
45	    carefully, as they describe your rights and restrictions with
46	    respect to this document.  Code Components extracted from this
47	    document must include Simplified BSD License text as described in
48	    Section 4.e of the Trust Legal Provisions and are provided without
49	    warranty as described in the BSD License.

51	Abstract

53	   This document defines the standard algorithm that Transmission
54	   Control Protocol (TCP) senders are required to use to compute and
55	   manage their retransmission timer.  It expands on the discussion in
56	   section 4.2.3.1 of RFC 1122 and upgrades the requirement of
57	   supporting the algorithm from a SHOULD to a MUST.

59	1   Introduction

61	   The Transmission Control Protocol (TCP) [Pos81] uses a retransmission
62	   timer to ensure data delivery in the absence of any feedback from the
63	   remote data receiver.  The duration of this timer is referred to as
64	   RTO (retransmission timeout).  RFC 1122 [Bra89] specifies that the
65	   RTO should be calculated as outlined in [Jac88].

67	   This document codifies the algorithm for setting the RTO.  In
68	   addition, this document expands on the discussion in section 4.2.3.1
69	   of RFC 1122 and upgrades the requirement of supporting the algorithm
70	   from a SHOULD to a MUST.  RFC 2581 [APS99] outlines the algorithm TCP
71	   uses to begin sending after the RTO expires and a retransmission is
72	   sent.  This document does not alter the behavior outlined in RFC 2581
73	   [APS99].

75	   In some situations it may be beneficial for a TCP sender to be more
76	   conservative than the algorithms detailed in this document allow.
77	   However, a TCP MUST NOT be more aggressive than the following
78	   algorithms allow.

80	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
81	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
82	   document are to be interpreted as described in [Bra97].

84	2   The Basic Algorithm

86	   To compute the current RTO, a TCP sender maintains two state
87	   variables, SRTT (smoothed round-trip time) and RTTVAR (round-trip
88	   time variation).  In addition, we assume a clock granularity of G
89	   seconds.

91	   The rules governing the computation of SRTT, RTTVAR, and RTO are as
92	   follows:

94	   (2.1) Until a round-trip time (RTT) measurement has been made for a
95	         segment sent between the sender and receiver, the sender SHOULD
96	         set RTO <- 1 second, though the "backing off" on repeated
97	         retransmission discussed in (5.5) still applies.

99	           Note that the previous version of this document used an
100	           initial RTO of 3 seconds [RFC2988].  A TCP implementation MAY
101	           still use this value (or any other value > 1 second).  This
102	           change in the lower bound on the initial RTO is discussed in
103	           further detail in Appendix A.

105	   (2.2) When the first RTT measurement R is made, the host MUST set

107	            SRTT <- R
108	            RTTVAR <- R/2
109	            RTO <- SRTT + max (G, K*RTTVAR)

111	         where K = 4.

113	   (2.3) When a subsequent RTT measurement R' is made, a host MUST set

115	            RTTVAR <- (1 - beta) * RTTVAR + beta * |SRTT - R'|
116	            SRTT <- (1 - alpha) * SRTT + alpha * R'

118	         The value of SRTT used in the update to RTTVAR is its value
119	         before updating SRTT itself using the second assignment.  That
120	         is, updating RTTVAR and SRTT MUST be computed in the above
121	         order.

123	         The above SHOULD be computed using alpha=1/8 and beta=1/4 (as
124	         suggested in [JK88]).

126	         After the computation, a host MUST update
127	         RTO <- SRTT + max (G, K*RTTVAR)

129	   (2.4) Whenever RTO is computed, if it is less than 1 second then the
130	         RTO SHOULD be rounded up to 1 second.

132	         Traditionally, TCP implementations use coarse grain clocks to
133	         measure the RTT and trigger the RTO, which imposes a large
134	         minimum value on the RTO.  Research suggests that a large
135	         minimum RTO is needed to keep TCP conservative and avoid
136	         spurious retransmissions [AP99].  Therefore, this
137	         specification requires a large minimum RTO as a conservative
138	         approach, while at the same time acknowledging that at some
139	         future point, research may show that a smaller minimum RTO is
140	         acceptable or superior.

142	   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
143	         seconds.
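
As an illustration only, rules (2.1) through (2.5) might be sketched
in Python as follows.  The class name RtoEstimator, the method name
on_rtt_sample, the granularity G = 0.1 seconds, and the 60 second cap
are assumptions made for this sketch; none of them is defined by this
document.

```python
from dataclasses import dataclass
from typing import Optional

K = 4
ALPHA = 1 / 8      # suggested value from rule (2.3)
BETA = 1 / 4       # suggested value from rule (2.3)
INITIAL_RTO = 1.0  # rule (2.1), seconds
MIN_RTO = 1.0      # rule (2.4), seconds


@dataclass
class RtoEstimator:
    """Hypothetical holder for the state used in rules (2.1)-(2.5)."""
    granularity: float = 0.1         # G in seconds (assumed value)
    max_rto: Optional[float] = 60.0  # optional cap, rule (2.5) (assumed)
    srtt: Optional[float] = None     # no RTT measurement yet
    rttvar: Optional[float] = None
    rto: float = INITIAL_RTO

    def on_rtt_sample(self, r: float) -> float:
        """Fold one RTT measurement (seconds) into the state; return RTO."""
        if self.srtt is None:
            # Rule (2.2): first measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # Rule (2.3): RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        rto = self.srtt + max(self.granularity, K * self.rttvar)
        rto = max(rto, MIN_RTO)           # rule (2.4): round up to 1 second
        if self.max_rto is not None:
            rto = min(rto, self.max_rto)  # rule (2.5): optional upper bound
        self.rto = rto
        return rto
```

Under these assumptions, feeding two 0.5 second samples to a fresh
estimator yields RTO values of 1.5 seconds and then 1.25 seconds.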

145	3   Taking RTT Samples

147	   TCP MUST use Karn's algorithm [KP87] for taking RTT samples.  That
148	   is, RTT samples MUST NOT be made using segments that were
149	   retransmitted (and thus for which it is ambiguous whether the reply
150	   was for the first instance of the packet or a later instance).  The
151	   only case when TCP can safely take RTT samples from retransmitted
152	   segments is when the TCP timestamp option [JBB92] is employed, since
153	   the timestamp option removes the ambiguity regarding which instance
154	   of the data segment triggered the acknowledgment.

156	   Traditionally, TCP implementations have taken one RTT measurement at
157	   a time (typically once per RTT).  However, when using the timestamp
158	   option, each ACK can be used as an RTT sample.  RFC 1323 [JBB92]
159	   suggests that TCP connections utilizing large congestion windows
160	   should take many RTT samples per window of data to avoid aliasing
161	   effects in the estimated RTT.  A TCP implementation MUST take at
162	   least one RTT measurement per RTT (unless that is not possible per
163	   Karn's algorithm).

165	   For fairly modest congestion window sizes, research suggests that
166	   timing each segment does not lead to a better RTT estimator [AP99].
167	   Additionally, when multiple samples are taken per RTT, the alpha and
168	   beta defined in section 2 may keep an inadequate RTT history.  A
169	   method for changing these constants is currently an open research
170	   question.
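
The sampling rule above might be sketched as follows for a sender
that does not use the timestamp option.  The class RttSampler and its
methods are hypothetical, and the sketch assumes that each ACK can be
matched to a single segment by its sequence number, which simplifies
away TCP's cumulative acknowledgments.

```python
from typing import Dict, Optional, Tuple


class RttSampler:
    """Hypothetical bookkeeping for Karn's algorithm (no timestamps)."""

    def __init__(self) -> None:
        # seq -> (time of first transmission, retransmitted?)
        self._sent: Dict[int, Tuple[float, bool]] = {}

    def on_send(self, seq: int, now: float) -> None:
        """Record a transmission; a repeat marks the segment ambiguous."""
        if seq in self._sent:
            first_sent, _ = self._sent[seq]
            self._sent[seq] = (first_sent, True)
        else:
            self._sent[seq] = (now, False)

    def on_ack(self, seq: int, now: float) -> Optional[float]:
        """Return an RTT sample, or None if no safe sample exists."""
        entry = self._sent.pop(seq, None)
        if entry is None:
            return None  # unknown or already acknowledged segment
        first_sent, retransmitted = entry
        # Karn's algorithm: never sample a retransmitted segment.
        return None if retransmitted else now - first_sent
```

A sample returned here would then be passed to the computation in
section 2; a result of None means the measurement is skipped.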

172	4   Clock Granularity

174	   There is no requirement for the clock granularity G used for
175	   computing RTT measurements and the different state variables.
176	   However, if the K*RTTVAR term in the RTO calculation equals zero,
177	   the variance term MUST be rounded to G seconds (i.e., use the
178	   equation given in step 2.3).

180	       RTO <- SRTT + max (G, K*RTTVAR)

182	   Experience has shown that finer clock granularities (<= 100 msec)
183	   perform somewhat better than more coarse granularities.

185	   Note that [Jac88] outlines several clever tricks that can be used to
186	   obtain better precision from coarse granularity timers.  These
187	   changes are widely implemented in current TCP implementations.
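
To illustrate why the max (G, K*RTTVAR) term matters, the sketch
below shows RTTVAR decaying when successive samples equal SRTT, as
can happen with a coarse clock.  The function names and the values
G = 0.1 seconds and an initial RTTVAR of 0.25 seconds are assumptions
chosen for the example.

```python
BETA = 1 / 4
K = 4


def variance_term(rttvar: float, g: float) -> float:
    """The max (G, K*RTTVAR) term of the RTO equation."""
    return max(g, K * rttvar)


def rttvar_after_identical_samples(rttvar: float, n: int) -> float:
    """RTTVAR after n samples with R' equal to SRTT (|SRTT - R'| = 0)."""
    for _ in range(n):
        rttvar = (1 - BETA) * rttvar
    return rttvar


# After 20 identical samples K*RTTVAR has fallen below G, so the
# variance term is held at G instead of shrinking toward zero.
decayed = rttvar_after_identical_samples(0.25, 20)
term = variance_term(decayed, g=0.1)
```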

189	5   Managing the RTO Timer

191	   An implementation MUST manage the retransmission timer(s) in such a
192	   way that a segment is never retransmitted too early, i.e., less than
193	   one RTO after the previous transmission of that segment.

195	   The following is the RECOMMENDED algorithm for managing the
196	   retransmission timer:

198	   (5.1) Every time a packet containing data is sent (including a
199	         retransmission), if the timer is not running, start it running
200	         so that it will expire after RTO seconds (for the current value
201	         of RTO).

203	   (5.2) When all outstanding data has been acknowledged, turn off the
204	         retransmission timer.

206	   (5.3) When an ACK is received that acknowledges new data, restart the
207	         retransmission timer so that it will expire after RTO seconds
208	         (for the current value of RTO).

210	   When the retransmission timer expires, do the following:

212	   (5.4) Retransmit the earliest segment that has not been acknowledged
213	         by the TCP receiver.

215	   (5.5) The host MUST set RTO <- RTO * 2 ("back off the timer").  The
216	         maximum value discussed in (2.5) above may be used to provide an
217	         upper bound to this doubling operation.

219	   (5.6) Start the retransmission timer, such that it expires after RTO
220	         seconds (for the value of RTO after the doubling operation
221	         outlined in 5.5).

223	   (5.7) If the timer expires awaiting the ACK of a SYN segment and the
224	         TCP implementation is using an RTO less than 3 seconds, the RTO
225	         MUST be re-initialized to 3 seconds when data transmission
226	         begins (i.e., after the three-way handshake completes).

228	         This represents a change from the previous version of this
229	         document [RFC2988] and is discussed in Appendix A.

231	   Note that after retransmitting, once a new RTT measurement is
232	   obtained (which can only happen when new data has been sent and
233	   acknowledged), the computations outlined in section 2 are performed,
234	   including the computation of RTO, which may result in "collapsing"
235	   RTO back down after it has been subject to exponential backoff
236	   (rule 5.5).

238	   Note that a TCP implementation MAY clear SRTT and RTTVAR after
239	   backing off the timer multiple times as it is likely that the
240	   current SRTT and RTTVAR are bogus in this situation.  Once SRTT and
241	   RTTVAR are cleared they should be initialized with the next RTT
242	   sample taken per (2.2) rather than using (2.3).
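
A minimal sketch of rules (5.1) through (5.7), driven by explicit
timestamps rather than a real clock, might look as follows.  The
class RetransmitTimer, its method names, and the 60 second cap are
assumptions of the sketch.  The handling of (5.7) reflects one
reading of that rule: the RTO is raised to 3 seconds only if a SYN
timed out and the current RTO is below 3 seconds.  Retransmitting the
segment itself (5.4) is left to the caller.

```python
from typing import Optional

MAX_RTO = 60.0  # optional cap from rule (2.5); assumed value


class RetransmitTimer:
    """Hypothetical timer; all times are in seconds."""

    def __init__(self, rto: float = 1.0) -> None:
        self.rto = rto
        self.deadline: Optional[float] = None  # None means "not running"
        self.syn_timed_out = False

    def on_data_sent(self, now: float) -> None:
        # Rule (5.1): start the timer only if it is not already running.
        if self.deadline is None:
            self.deadline = now + self.rto

    def on_all_data_acked(self) -> None:
        # Rule (5.2): nothing outstanding, so turn the timer off.
        self.deadline = None

    def on_new_ack(self, now: float) -> None:
        # Rule (5.3): restart the timer with the current RTO.
        self.deadline = now + self.rto

    def on_expire(self, now: float, awaiting_syn_ack: bool = False) -> None:
        # Rule (5.5): back off the timer, bounded by the optional cap.
        self.rto = min(self.rto * 2, MAX_RTO)
        # Rule (5.6): restart the timer with the doubled RTO.
        self.deadline = now + self.rto
        if awaiting_syn_ack:
            self.syn_timed_out = True

    def on_handshake_complete(self) -> None:
        # Rule (5.7), as read by this sketch.
        if self.syn_timed_out and self.rto < 3.0:
            self.rto = 3.0
```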

244	6   Security Considerations

246	   This document requires a TCP to wait for a given interval before
247	   retransmitting an unacknowledged segment.  An attacker could cause a
248	   TCP sender to compute a large value of RTO by adding delay to a
249	   timed packet's latency, or that of its acknowledgment.  However,
250	   the ability to add delay to a packet's latency often coincides with
251	   the ability to cause the packet to be lost, so it is difficult to
252	   see what an attacker might gain from such an attack that could cause
253	   more damage than simply discarding some of the TCP connection's
254	   packets.

256	   The Internet to a considerable degree relies on the correct
257	   implementation of the RTO algorithm (as well as those described in
258	   RFC 2581) in order to preserve network stability and avoid
259	   congestion collapse.  An attacker could cause TCP endpoints to
260	   respond more aggressively in the face of congestion by forging
261	   acknowledgments for segments before the receiver has actually
262	   received the data, thus lowering RTO to an unsafe value.  But to do
263	   so requires spoofing the acknowledgments correctly, which is
264	   difficult unless the attacker can monitor traffic along the path
265	   between the sender and the receiver.  In addition, even if the
266	   attacker can cause the sender's RTO to reach too small a value, it
267	   appears the attacker cannot leverage this into much of an attack
268	   (compared to the other damage they can do if they can spoof packets
269	   belonging to the connection), since the sending TCP will still back
270	   off its timer in the face of an incorrectly transmitted packet's
271	   loss due to actual congestion.

273	7   IANA Considerations

275	   None

277	Acknowledgments

279	   The RTO algorithm described in this memo was originated by Van
280	   Jacobson in [Jac88].

282	   Much of the data that motivated changing the initial RTO from 3
283	   seconds to 1 second came from Robert Love, Andre Broido and Mike
284	   Belshe.

286	Normative References

288	   [APS99] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
289	           Control", RFC 2581, April 1999.

291	   [Bra89] Braden, R., "Requirements for Internet Hosts --
292	           Communication Layers", STD 3, RFC 1122, October 1989.

294	   [Bra97] Bradner, S., "Key words for use in RFCs to Indicate
295	           Requirement Levels", BCP 14, RFC 2119, March 1997.

297	   [Pos81] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
298	           September 1981.

300	Non-Normative References

302	   [AP99]  Allman, M. and V. Paxson, "On Estimating End-to-End Network
303	           Path Properties", SIGCOMM 99.

305	   [Chu09] Chu, J., "Tuning TCP Parameters for the 21st Century",
306	           http://www.ietf.org/proceedings/75/slides/tcpm-1.pdf, July
307	           2009.

309	   [SLS09] Schulman, A., Levin, D., and N. Spring, "CRAWDAD data set
310	           umd/sigcomm2008 (v. 2009-03-02)",
311	           http://crawdad.cs.dartmouth.edu/umd/sigcomm2008, March
312	           2009.

314	   [HKA04] Henderson, T., Kotz, D., and I. Abyzov, "CRAWDAD trace
315	           dartmouth/campus/tcpdump/fall03 (v. 2004-11-09)",
316	           http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03,
317	           November 2004.

319	   [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
320	           Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.

322	   [JK88]  Jacobson, V. and M. Karels, "Congestion Avoidance and
323	           Control", ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

325	   [KP87]  Karn, P. and C. Partridge, "Improving Round-Trip Time
326	           Estimates in Reliable Transport Protocols", SIGCOMM 87.

328	Authors' Addresses

330	   Vern Paxson
331	   ICSI
332	   1947 Center Street
333	   Suite 600
334	   Berkeley, CA 94704-1198

336	   Phone: 510-666-2882
337	   EMail: vern@icir.org
338	   http://www.icir.org/vern/

340	   Mark Allman
341	   ICSI
342	   1947 Center Street
343	   Suite 600
344	   Berkeley, CA 94704-1198

346	   Phone: 440-235-1792
347	   EMail: mallman@icir.org
348	   http://www.icir.org/mallman/

350	   H.K. Jerry Chu
351	   Google, Inc.
352	   1600 Amphitheatre Parkway
353	   Mountain View, CA 94043

355	   Phone: 650-253-3010
356	   EMail: hkchu@google.com

358	   Matt Sargent
359	   Case Western Reserve University, Olin Building
360	   10900 Euclid Avenue
361	   Room 505
362	   Cleveland, OH 44106

364	   Phone: 440-223-5932
365	   EMail: mts71@case.edu

367	Appendix A

369	    Choosing a reasonable initial RTO requires balancing two
370	    competing considerations:

372	    1. The initial RTO should be sufficiently large to cover most of the
373	       end-to-end paths to avoid spurious retransmissions and their
374	       associated negative performance impact.

376	    2. The initial RTO should be small enough to ensure a timely
377	       recovery from packet loss occurring before an RTT sample is
378	       taken.

380	    Traditionally, TCP has used 3 seconds as the initial RTO
381	    [RFC1122,RFC2988].  This document calls for lowering this value to 1
382	    second using the following rationale:

384	     - Modern networks are simply faster than the state-of-the-art was
385	       at the time the initial RTO of 3 seconds was defined.

387	     - Studies have found that the round-trip times of more than 97.5% of
388	       the connections observed in a large scale analysis were less than
389	       1 second [Chu09], suggesting that 1 second meets criterion 1 above.

391	     - In addition, the studies observed retransmission rates within
392	       the three-way handshake of roughly 2%.  This shows that reducing
393	       the initial RTO benefits a non-negligible set of connections.

395	     - However, roughly 2.5% of the connections studied in [Chu09] have
396	       an RTT longer than 1 second.  For those connections, a 1 second
397	       initial RTO guarantees a retransmission during connection
398	       establishment (needed or not).

400	       When this happens, this document calls for reverting to an initial
401	       RTO of 3 seconds for the data transmission phase.  Therefore, the
402	       implications of the spurious retransmission are modest: (1) an
403	       extra SYN is transmitted into the network, and (2) according to
404	       [RFC5681] the initial congestion window will be limited to 1
405	       segment.  While (2) clearly puts such connections at a
406	       disadvantage, this document at least resets the RTO such that the
407	       connection will not continually run into problems with a short
408	       timeout.  (Of course, if the RTT is more than three seconds, the
409	       connection will still encounter difficulties.  But that is not a
410	       new issue for TCP.)

412	       In addition, we note that when using timestamps, TCP will be able
413	       to take an RTT sample even in the presence of a spurious
414	       retransmission, facilitating convergence to a correct RTT estimate
415	       when the RTT exceeds 1 second.
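
The effect of the initial RTO on recovery from lost SYNs can be
illustrated with a small calculation based on the doubling in rule
(5.5).  The function name retransmit_times is hypothetical, and the
sketch ignores the optional maximum RTO of rule (2.5).

```python
from typing import List


def retransmit_times(initial_rto: float, attempts: int) -> List[float]:
    """Seconds after the first SYN at which each retransmission is sent."""
    times: List[float] = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(attempts):
        elapsed += rto  # the timer expires one RTO after the last send
        times.append(elapsed)
        rto *= 2        # rule (5.5): back off the timer
    return times


print(retransmit_times(1.0, 3))  # [1.0, 3.0, 7.0]
print(retransmit_times(3.0, 3))  # [3.0, 9.0, 21.0]
```

With a 1 second initial RTO, a single lost SYN is retransmitted 2
seconds sooner than with a 3 second initial RTO, and the gap grows
with each further loss.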

417	    As an additional check on the results presented in [Chu09], we
418	    analyzed packet traces of client behavior collected at four
419	    different vantage points at different times, as follows:

421	      Name       Dates            Pkts.   Cnns.  Clnts. Servs.
422	      --------------------------------------------------------
423	      LBL-1      Oct/05--Mar/06   292M    242K   228    74K
424	      LBL-2      Nov/09--Feb/10   1.1B    1.2M   1047   38K
425	      ICSI-1     Sep/11--18/07    137M    2.1M   193    486K
426	      ICSI-2     Sep/11--18/08    163M    1.9M   177    277K
427	      ICSI-3     Sep/14--21/09    334M    3.1M   170    253K
428	      ICSI-4     Sep/11--18/10    298M    5M     183    189K
429	      Dartmouth  Jan/4--21/04     1B      4M     3782   132K
430	      SIGCOMM    Aug/17--21/08    11.6M   133K   152    29K

432	    The "LBL" data was taken at the Lawrence Berkeley National
433	    Laboratory, the "ICSI" data from the International Computer Science
434	    Institute, the "SIGCOMM" data from the wireless network that served
435	    the attendees of SIGCOMM 2008, and the "Dartmouth" data was
436	    collected from Dartmouth College's wireless network.  The latter two
437	    datasets are available from the CRAWDAD data repository
438	    [HKA04,SLS09].  The table lists the dates of the data collections,
439	    the number of packets collected, the number of TCP connections
440	    observed, the number of local clients monitored, and the number of
441	    remote servers contacted.  We consider only connections initiated
442	    near the tracing vantage point.

444	    Analysis of these datasets finds the prevalence of retransmitted
445	    SYNs to be between 0.03% (ICSI-4) and roughly 2% (LBL-1 and
446	    Dartmouth).

448	    We then analyzed the data to determine the number of
449	    additional---and spurious---retransmissions that would have been
450	    incurred if the initial RTO were assumed to be 1 second.  In most of
451	    the datasets, the proportion of connections with spurious
452	    retransmits was less than 0.1%.  However, in the Dartmouth dataset
453	    approximately 1.1% of the connections would have sent a spurious
454	    retransmit with a lower initial RTO.  We attribute this to the fact
455	    that the monitored network is wireless and therefore susceptible to
456	    additional delays from RF effects.

458	    Finally, there are obviously performance benefits from
459	    retransmitting lost SYNs with a reduced initial RTO.  Across our
460	    datasets, the percentage of connections that retransmitted a SYN and
461	    would realize at least a 10% performance improvement by using the
462	    smaller initial RTO specified in this document ranges from 43%
463	    (LBL-1) to 87% (ICSI-4).  The percentage of connections that would
464	    realize at least a 50% performance improvement ranges from 17%
465	    (ICSI-1 and SIGCOMM) to 73% (ICSI-4).

467	    From the data to which we have access, we conclude that the lower
468	    initial RTO is likely to be beneficial to many connections, and
469	    harmful to relatively few.