Internet Engineering Task Force                                V. Paxson
INTERNET DRAFT                                          ICSI/UC Berkeley
File: draft-paxson-tcpm-rfc2988bis-01.txt                      M. Allman
                                                                    ICSI
                                                                  J. Chu
                                                                  Google
                                                              M. Sargent
                                                                    CWRU
                                                        December 6, 2010

                  Computing TCP's Retransmission Timer

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 6, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

Abstract

   This document defines the standard algorithm that Transmission
   Control Protocol (TCP) senders are required to use to compute and
   manage their retransmission timer. It expands on the discussion in
   section 4.2.3.1 of RFC 1122 and upgrades the requirement of
   supporting the algorithm from a SHOULD to a MUST.

1 Introduction

   The Transmission Control Protocol (TCP) [Pos81] uses a retransmission
   timer to ensure data delivery in the absence of any feedback from the
   remote data receiver. The duration of this timer is referred to as
   RTO (retransmission timeout). RFC 1122 [Bra89] specifies that the
   RTO should be calculated as outlined in [Jac88].

   This document codifies the algorithm for setting the RTO. In
   addition, this document expands on the discussion in section 4.2.3.1
   of RFC 1122 and upgrades the requirement of supporting the algorithm
   from a SHOULD to a MUST. RFC 2581 [APS99] outlines the algorithm TCP
   uses to begin sending after the RTO expires and a retransmission is
   sent. This document does not alter the behavior outlined in RFC 2581
   [APS99].

   In some situations it may be beneficial for a TCP sender to be more
   conservative than the algorithms detailed in this document allow.
   However, a TCP MUST NOT be more aggressive than the following
   algorithms allow.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [Bra97].

2 The Basic Algorithm

   To compute the current RTO, a TCP sender maintains two state
   variables, SRTT (smoothed round-trip time) and RTTVAR (round-trip
   time variation).
   In addition, we assume a clock granularity of G seconds.

   The rules governing the computation of SRTT, RTTVAR, and RTO are as
   follows:

   (2.1) Until a round-trip time (RTT) measurement has been made for a
         segment sent between the sender and receiver, the sender SHOULD
         set RTO <- 1 second, though the "backing off" on repeated
         retransmission discussed in (5.5) still applies.

         Note that the previous version of this document used an
         initial RTO of 3 seconds [RFC2988]. A TCP implementation MAY
         still use this value (or any other value > 1 second). This
         change in the lower bound on the initial RTO is discussed in
         further detail in Appendix A.

   (2.2) When the first RTT measurement R is made, the host MUST set

            SRTT <- R
            RTTVAR <- R/2
            RTO <- SRTT + max (G, K*RTTVAR)

         where K = 4.

   (2.3) When a subsequent RTT measurement R' is made, a host MUST set

            RTTVAR <- (1 - beta) * RTTVAR + beta * |SRTT - R'|
            SRTT <- (1 - alpha) * SRTT + alpha * R'

         The value of SRTT used in the update to RTTVAR is its value
         before updating SRTT itself using the second assignment. That
         is, updating RTTVAR and SRTT MUST be computed in the above
         order.

         The above SHOULD be computed using alpha=1/8 and beta=1/4 (as
         suggested in [JK88]).

         After the computation, a host MUST update

            RTO <- SRTT + max (G, K*RTTVAR)

   (2.4) Whenever RTO is computed, if it is less than 1 second then the
         RTO SHOULD be rounded up to 1 second.

         Traditionally, TCP implementations use coarse grain clocks to
         measure the RTT and trigger the RTO, which imposes a large
         minimum value on the RTO. Research suggests that a large
         minimum RTO is needed to keep TCP conservative and avoid
         spurious retransmissions [AP99].
         Therefore, this specification requires a large minimum RTO as
         a conservative approach, while at the same time acknowledging
         that at some future point, research may show that a smaller
         minimum RTO is acceptable or superior.

   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
         seconds.

3 Taking RTT Samples

   TCP MUST use Karn's algorithm [KP87] for taking RTT samples. That
   is, RTT samples MUST NOT be made using segments that were
   retransmitted (and thus for which it is ambiguous whether the reply
   was for the first instance of the packet or a later instance). The
   only case when TCP can safely take RTT samples from retransmitted
   segments is when the TCP timestamp option [JBB92] is employed, since
   the timestamp option removes the ambiguity regarding which instance
   of the data segment triggered the acknowledgment.

   Traditionally, TCP implementations have taken one RTT measurement at
   a time (typically once per RTT). However, when using the timestamp
   option, each ACK can be used as an RTT sample. RFC 1323 [JBB92]
   suggests that TCP connections utilizing large congestion windows
   should take many RTT samples per window of data to avoid aliasing
   effects in the estimated RTT. A TCP implementation MUST take at
   least one RTT measurement per RTT (unless that is not possible per
   Karn's algorithm).

   For fairly modest congestion window sizes research suggests that
   timing each segment does not lead to a better RTT estimator [AP99].
   Additionally, when multiple samples are taken per RTT the alpha and
   beta defined in section 2 may keep an inadequate RTT history. A
   method for changing these constants is currently an open research
   question.

4 Clock Granularity

   There is no requirement for the clock granularity G used for
   computing RTT measurements and the different state variables.
   However, if the K*RTTVAR term in the RTO calculation equals zero,
   the variance term MUST be rounded to G seconds (i.e., use the
   equation given in step 2.3).

         RTO <- SRTT + max (G, K*RTTVAR)

   Experience has shown that finer clock granularities (<= 100 msec)
   perform somewhat better than more coarse granularities.

   Note that [Jac88] outlines several clever tricks that can be used to
   obtain better precision from coarse granularity timers. These
   changes are widely implemented in current TCP implementations.

5 Managing the RTO Timer

   An implementation MUST manage the retransmission timer(s) in such a
   way that a segment is never retransmitted too early, i.e., less than
   one RTO after the previous transmission of that segment.

   The following is the RECOMMENDED algorithm for managing the
   retransmission timer:

   (5.1) Every time a packet containing data is sent (including a
         retransmission), if the timer is not running, start it running
         so that it will expire after RTO seconds (for the current value
         of RTO).

   (5.2) When all outstanding data has been acknowledged, turn off the
         retransmission timer.

   (5.3) When an ACK is received that acknowledges new data, restart the
         retransmission timer so that it will expire after RTO seconds
         (for the current value of RTO).

   When the retransmission timer expires, do the following:

   (5.4) Retransmit the earliest segment that has not been acknowledged
         by the TCP receiver.

   (5.5) The host MUST set RTO <- RTO * 2 ("back off the timer"). The
         maximum value discussed in (2.5) above may be used to provide
         an upper bound to this doubling operation.

   (5.6) Start the retransmission timer, such that it expires after RTO
         seconds (for the value of RTO after the doubling operation
         outlined in 5.5).

   (5.7) If the timer expires awaiting the ACK of a SYN segment and the
         TCP implementation is using an RTO less than 3 seconds, the RTO
         MUST be re-initialized to 3 seconds when data transmission
         begins (i.e., after the three-way handshake completes).

         This represents a change from the previous version of this
         document [RFC2988] and is discussed in Appendix A.

   Note that after retransmitting, once a new RTT measurement is
   obtained (which can only happen when new data has been sent and
   acknowledged), the computations outlined in section 2 are performed,
   including the computation of RTO, which may result in "collapsing"
   RTO back down after it has been subject to exponential backoff
   (rule 5.5).

   Note that a TCP implementation MAY clear SRTT and RTTVAR after
   backing off the timer multiple times as it is likely that the
   current SRTT and RTTVAR are bogus in this situation. Once SRTT and
   RTTVAR are cleared they should be initialized with the next RTT
   sample taken per (2.2) rather than using (2.3).

6 Security Considerations

   This document requires a TCP to wait for a given interval before
   retransmitting an unacknowledged segment. An attacker could cause a
   TCP sender to compute a large value of RTO by adding delay to a
   timed packet's latency, or that of its acknowledgment. However,
   the ability to add delay to a packet's latency often coincides with
   the ability to cause the packet to be lost, so it is difficult to
   see what an attacker might gain from such an attack that could cause
   more damage than simply discarding some of the TCP connection's
   packets.

   The Internet to a considerable degree relies on the correct
   implementation of the RTO algorithm (as well as those described in
   RFC 2581) in order to preserve network stability and avoid
   congestion collapse.
   An attacker could cause TCP endpoints to respond more aggressively
   in the face of congestion by forging acknowledgments for segments
   before the receiver has actually received the data, thus lowering
   RTO to an unsafe value. But to do so requires spoofing the
   acknowledgments correctly, which is difficult unless the attacker
   can monitor traffic along the path between the sender and the
   receiver. In addition, even if the attacker can cause the sender's
   RTO to reach too small a value, it appears the attacker cannot
   leverage this into much of an attack (compared to the other damage
   they can do if they can spoof packets belonging to the connection),
   since the sending TCP will still back off its timer in the face of
   an incorrectly transmitted packet's loss due to actual congestion.

7 IANA Considerations

   None

Acknowledgments

   The RTO algorithm described in this memo was originated by Van
   Jacobson in [Jac88].

   Much of the data that motivated changing the initial RTO from 3
   seconds to 1 second came from Robert Love, Andre Broido and Mike
   Belshe.

Normative References

   [APS99] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
           Control", RFC 2581, April 1999.

   [Bra89] Braden, R., "Requirements for Internet Hosts --
           Communication Layers", STD 3, RFC 1122, October 1989.

   [Bra97] Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Pos81] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
           September 1981.

Non-Normative References

   [AP99]  Allman, M. and V. Paxson, "On Estimating End-to-End Network
           Path Properties", SIGCOMM 99.

   [Chu09] Chu, J., "Tuning TCP Parameters for the 21st Century",
           http://www.ietf.org/proceedings/75/slides/tcpm-1.pdf, July
           2009.

   [SLS09] Schulman, A., Levin, D., and Spring, N., "CRAWDAD data set
           umd/sigcomm2008 (v.
           2009-03-02)",
           http://crawdad.cs.dartmouth.edu/umd/sigcomm2008, March
           2009.

   [HKA04] Henderson, T., Kotz, D., and Abyzov, I., "CRAWDAD trace
           dartmouth/campus/tcpdump/fall03 (v. 2004-11-09)",
           http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03,
           November 2004.

   [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
           Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.

   [JK88]  Jacobson, V. and M. Karels, "Congestion Avoidance and
           Control", ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

   [KP87]  Karn, P. and C. Partridge, "Improving Round-Trip Time
           Estimates in Reliable Transport Protocols", SIGCOMM 87.

Authors' Addresses

   Vern Paxson
   ICSI
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198

   Phone: 510-666-2882
   EMail: vern@icir.org
   http://www.icir.org/vern/

   Mark Allman
   ICSI
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198

   Phone: 440-235-1792
   EMail: mallman@icir.org
   http://www.icir.org/mallman/

   H.K. Jerry Chu
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA 94043

   Phone: 650-253-3010
   Email: hkchu@google.com

   Matt Sargent
   Case Western Reserve University Olin Building
   10900 Euclid Avenue
   Room 505
   Cleveland, OH 44106

   Phone: 440-223-5932
   Email: mts71@case.edu

Appendix A

   Choosing a reasonable initial RTO requires balancing two
   competing considerations:

   1. The initial RTO should be sufficiently large to cover most of the
      end-to-end paths to avoid spurious retransmissions and their
      associated negative performance impact.

   2. The initial RTO should be small enough to ensure a timely
      recovery from packet loss occurring before an RTT sample is
      taken.

   Traditionally, TCP has used 3 seconds as the initial RTO
   [RFC1122,RFC2988].
   This document calls for lowering this value to 1 second using the
   following rationale:

   - Modern networks are simply faster than the state-of-the-art was
     at the time the initial RTO of 3 seconds was defined.

   - Studies have found that the round-trip times of more than 97.5% of
     the connections observed in a large scale analysis were less than
     1 second [Chu09], suggesting that 1 second meets criterion 1
     above.

   - In addition, the studies observed retransmission rates within
     the three-way handshake of roughly 2%. This shows that reducing
     the initial RTO has benefit to a non-negligible set of
     connections.

   - However, roughly 2.5% of the connections studied in [Chu09] have
     an RTT longer than 1 second. For those connections, a 1 second
     initial RTO guarantees a retransmission during connection
     establishment (needed or not).

     When this happens, this document calls for reverting to an initial
     RTO of 3 seconds for the data transmission phase. Therefore, the
     implications of the spurious retransmission are modest: (1) an
     extra SYN is transmitted into the network, and (2) according to
     [RFC5681] the initial congestion window will be limited to 1
     segment. While (2) clearly puts such connections at a
     disadvantage, this document at least resets the RTO such that the
     connection will not continually run into problems with a short
     timeout. (Of course, if the RTT is more than three seconds, the
     connection will still encounter difficulties. But that is not a
     new issue for TCP.)

     In addition, we note that when using timestamps, TCP will be able
     to take an RTT sample even in the presence of a spurious
     retransmission, facilitating convergence to a correct RTT estimate
     when the RTT exceeds 1 second.
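
   As an informal illustration of the fallback described above (rule
   5.7), the following Python sketch shows one way an implementation
   might choose the RTO for the data transmission phase. The function
   name rto_after_handshake and its two parameters are hypothetical
   and do not appear in this document. The sketch also assumes that
   the RTO being tested is the value the connection holds when the
   three-way handshake completes.

```python
INITIAL_RTO = 1.0    # seconds, initial RTO from rule (2.1)
FALLBACK_RTO = 3.0   # seconds, value required by rule (5.7)


def rto_after_handshake(rto: float, syn_timer_expired: bool) -> float:
    """Return the RTO to use once the three-way handshake completes.

    Hypothetical helper, not part of the specification.
    rto               -- RTO (seconds) the connection currently holds
    syn_timer_expired -- True if the retransmission timer fired while
                         waiting for the ACK of a SYN segment
    """
    # Rule (5.7): a SYN timeout combined with an RTO below 3 seconds
    # means the RTO is re-initialized to 3 seconds for the data phase.
    if syn_timer_expired and rto < FALLBACK_RTO:
        return FALLBACK_RTO
    # Otherwise the current RTO is carried over unchanged.
    return rto
```

   With these assumptions, a connection that started from the 1 second
   initial RTO and saw its SYN timer expire would begin data
   transmission with an RTO of 3 seconds, while a connection whose SYN
   was acknowledged without a timeout keeps its current RTO.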

   As an additional check on the results presented in [Chu09], we
   analyzed packet traces of client behavior collected at four
   different vantage points at different times, as follows:

       Name       Dates           Pkts.   Cnns.   Clnts.  Servs.
       --------------------------------------------------------
       LBL-1      Oct/05--Mar/06  292M    242K    228     74K
       LBL-2      Nov/09--Feb/10  1.1B    1.2M    1047    38K
       ICSI-1     Sep/11--18/07   137M    2.1M    193     486K
       ICSI-2     Sep/11--18/08   163M    1.9M    177     277K
       ICSI-3     Sep/14--21/09   334M    3.1M    170     253K
       ICSI-4     Sep/11--18/10   298M    5M      183     189K
       Dartmouth  Jan/4--21/04    1B      4M      3782    132K
       SIGCOMM    Aug/17--21/08   11.6M   133K    152     29K

   The "LBL" data was taken at the Lawrence Berkeley National
   Laboratory, the "ICSI" data from the International Computer Science
   Institute, the "SIGCOMM" data from the wireless network that served
   the attendees of SIGCOMM 2008, and the "Dartmouth" data was
   collected from Dartmouth College's wireless network. The latter two
   datasets are available from the CRAWDAD data repository
   [HKA04,SLS09]. The table lists the dates of the data collections,
   the number of packets collected, the number of TCP connections
   observed, the number of local clients monitored, and the number of
   remote servers contacted. We consider only connections initiated
   near the tracing vantage point.

   Analysis of these datasets finds the prevalence of retransmitted
   SYNs to range from 0.03% (ICSI-4) to roughly 2% (LBL-1 and
   Dartmouth).

   We then analyzed the data to determine the number of
   additional---and spurious---retransmissions that would have been
   incurred if the initial RTO was assumed to be 1 second. In most of
   the datasets, the proportion of connections with spurious
   retransmits was less than 0.1%. However, in the Dartmouth dataset
   approximately 1.1% of the connections would have sent a spurious
   retransmit with a lower initial RTO.
   We attribute this to the fact that the monitored network is
   wireless and therefore susceptible to additional delays from RF
   effects.

   Finally, there are obviously performance benefits from
   retransmitting lost SYNs with a reduced initial RTO. Across our
   datasets, the percentage of connections that retransmitted a SYN and
   would realize at least a 10% performance improvement by using the
   smaller initial RTO specified in this document ranges from 43%
   (LBL-1) to 87% (ICSI-4). The percentage of connections that would
   realize at least a 50% performance improvement ranges from 17%
   (ICSI-1 and SIGCOMM) to 73% (ICSI-4).

   From the data to which we have access, we conclude that the lower
   initial RTO is likely to be beneficial to many connections, and
   harmful to relatively few.
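
   For readers who find code easier to follow than the rule list, the
   computation in rules (2.1) through (2.5) and the backoff in rule
   (5.5) can be sketched in Python as below. This is an illustrative
   sketch only, not a reference implementation. The class name
   RtoEstimator, its method names, the 0.1 second clock granularity,
   and the choice of exactly 60 seconds as the optional upper bound
   are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8    # SRTT gain suggested in rule (2.3)
BETA = 1 / 4     # RTTVAR gain suggested in rule (2.3)
K = 4            # variance multiplier from rule (2.2)
MIN_RTO = 1.0    # seconds, lower bound from rule (2.4)
MAX_RTO = 60.0   # seconds, assumed cap; rule (2.5) allows any cap >= 60


@dataclass
class RtoEstimator:
    """Hypothetical helper tracking SRTT, RTTVAR, and RTO in seconds."""
    granularity: float = 0.1          # clock granularity G (assumed value)
    srtt: Optional[float] = None      # unset until the first RTT sample
    rttvar: Optional[float] = None    # unset until the first RTT sample
    rto: float = 1.0                  # initial RTO from rule (2.1)

    def on_rtt_sample(self, r: float) -> float:
        """Fold in one RTT measurement (taken per Karn's algorithm)."""
        if self.srtt is None:
            # Rule (2.2): first measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # Rule (2.3): RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        rto = self.srtt + max(self.granularity, K * self.rttvar)
        # Rules (2.4) and (2.5): clamp to the permitted range.
        self.rto = min(max(rto, MIN_RTO), MAX_RTO)
        return self.rto

    def on_timeout(self) -> float:
        """Rule (5.5): double the RTO, bounded by the optional maximum."""
        self.rto = min(self.rto * 2, MAX_RTO)
        return self.rto
```

   Under these assumptions, two successive RTT samples of 0.5 seconds
   yield RTO values of 1.5 and then 1.25 seconds, and a subsequent
   timer expiry doubles the latter to 2.5 seconds. A first sample of
   0.1 seconds would compute an RTO below 1 second, which the sketch
   rounds up to 1 second per rule (2.4).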