Network Working Group                                          M. Allman
Internet-Draft                                                 V. Paxson
Obsoletes: 2581                                                     ICSI
Intended status: Draft Standard                               E. Blanton
Expires: January 27 2010                               Purdue University
                                                            July 27 2009

                         TCP Congestion Control
                    draft-ietf-tcpm-rfc2581bis-07.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79. This document may contain
   material from IETF Documents or IETF Contributions published or made
   publicly available before November 10, 2008. The person(s)
   controlling the copyright in some of this material may not have
   granted the IETF Trust the right to allow modifications of such
   material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Statement

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   This document defines TCP's four intertwined congestion control
   algorithms: slow start, congestion avoidance, fast retransmit, and
   fast recovery. In addition, the document specifies how TCP should
   begin transmission after a relatively long idle period, as well as
   discussing various acknowledgment generation methods. This document
   obsoletes RFC 2581.

Table Of Contents

   1.   Introduction
   2.   Definitions
   3.   Congestion Control Algorithms
   3.1  Slow Start and Congestion Avoidance
   3.2  Fast Retransmit/Fast Recovery
   4.   Additional Considerations
   4.1  Re-starting Idle Connections
   4.2  Generating Acknowledgments
   4.3  Loss Recovery Mechanisms
   5.   Security Considerations
   6.   Changes Between RFC 2001 and RFC 2581
   7.   Changes Relative to RFC 2581
   8.   IANA Considerations

1. Introduction

   This document specifies four TCP [RFC793] congestion control
   algorithms: slow start, congestion avoidance, fast retransmit and
   fast recovery. These algorithms were devised in [Jac88] and
   [Jac90]. Their use with TCP is standardized in [RFC1122].
   Additional early work in additive-increase, multiplicative-decrease
   congestion control is given in [CJ89].

   Note that [Ste94] provides examples of these algorithms in action
   and [WS95] provides an explanation of the source code for the BSD
   implementation of these algorithms.

   In addition to specifying these congestion control algorithms, this
   document specifies what TCP connections should do after a relatively
   long idle period, as well as specifying and clarifying some of the
   issues pertaining to TCP ACK generation.

   This document obsoletes [RFC2581], which in turn obsoleted
   [RFC2001].

   This document is organized as follows. Section 2 provides various
   definitions which will be used throughout the document. Section 3
   provides a specification of the congestion control
   algorithms. Section 4 outlines concerns related to the congestion
   control algorithms and finally, section 5 outlines security
   considerations.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Definitions

   This section provides the definition of several terms that will be
   used throughout the remainder of this document.

   SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
      both).

   SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
      largest segment that the sender can transmit. This value can be
      based on the maximum transmission unit of the network, the path
      MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
      or other factors. The size does not include the TCP/IP headers
      and options.

   RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
      largest segment the receiver is willing to accept. This is the
      value specified in the MSS option sent by the receiver during
      connection startup. Or, if the MSS option is not used, 536
      bytes [RFC1122].
      The size does not include the TCP/IP headers and options.

   FULL-SIZED SEGMENT: A segment that contains the maximum number of
      data bytes permitted (i.e., a segment containing SMSS bytes of
      data).

   RECEIVER WINDOW (rwnd): The most recently advertised receiver
      window.

   CONGESTION WINDOW (cwnd): A TCP state variable that limits the
      amount of data a TCP can send. At any given time, a TCP MUST
      NOT send data with a sequence number higher than the sum of the
      highest acknowledged sequence number and the minimum of cwnd and
      rwnd.

   INITIAL WINDOW (IW): The initial window is the size of the sender's
      congestion window after the three-way handshake is completed.

   LOSS WINDOW (LW): The loss window is the size of the congestion
      window after a TCP sender detects loss using its retransmission
      timer.

   RESTART WINDOW (RW): The restart window is the size of the
      congestion window after a TCP restarts transmission after an
      idle period (if the slow start algorithm is used; see section
      4.1 for more discussion).

   FLIGHT SIZE: The amount of data that has been sent but not yet
      cumulatively acknowledged.

   DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
      "duplicate" in the following algorithms when (a) the receiver of
      the ACK has outstanding data, (b) the incoming acknowledgment
      carries no data, (c) the SYN and FIN bits are both off, (d) the
      acknowledgment number is equal to the greatest acknowledgment
      received on the given connection (TCP.UNA from [RFC793]) and (e)
      the advertised window in the incoming acknowledgment equals the
      advertised window in the last incoming acknowledgment.

      Alternatively, a TCP that utilizes selective acknowledgments
      [RFC2018,RFC2883] can leverage the SACK information to determine
      when an incoming ACK is a "duplicate" (e.g., if the ACK contains
      previously unknown SACK information).

3. Congestion Control Algorithms

   This section defines the four congestion control algorithms: slow
   start, congestion avoidance, fast retransmit and fast recovery,
   developed in [Jac88] and [Jac90]. In some situations it may be
   beneficial for a TCP sender to be more conservative than the
   algorithms allow, however a TCP MUST NOT be more aggressive than the
   following algorithms allow (that is, MUST NOT send data when the
   value of cwnd computed by the following algorithms would not allow
   the data to be sent).

   Also note that the algorithms specified in this document work in
   terms of using loss as the signal of congestion. Explicit
   Congestion Notification (ECN) could also be used as specified in
   [RFC3168].

3.1 Slow Start and Congestion Avoidance

   The slow start and congestion avoidance algorithms MUST be used by a
   TCP sender to control the amount of outstanding data being injected
   into the network. To implement these algorithms, two variables are
   added to the TCP per-connection state. The congestion window (cwnd)
   is a sender-side limit on the amount of data the sender can transmit
   into the network before receiving an acknowledgment (ACK), while the
   receiver's advertised window (rwnd) is a receiver-side limit on the
   amount of outstanding data. The minimum of cwnd and rwnd governs
   data transmission.

   Another state variable, the slow start threshold (ssthresh), is used
   to determine whether the slow start or congestion avoidance
   algorithm is used to control data transmission, as discussed below.

   Beginning transmission into a network with unknown conditions
   requires TCP to slowly probe the network to determine the available
   capacity, in order to avoid congesting the network with an
   inappropriately large burst of data. The slow start algorithm is
   used for this purpose at the beginning of a transfer, or after
   repairing loss detected by the retransmission timer.
   Slow start additionally serves to start the "ACK clock" used by the
   TCP sender to release data into the network in the slow start,
   congestion avoidance, and loss recovery algorithms.

   IW, the initial value of cwnd, MUST be set using the following
   guidelines as an upper bound.

   If SMSS > 2190 bytes:
       IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
   If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
       IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
   If SMSS <= 1095 bytes:
       IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

   As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
   SYN/ACK MUST NOT increase the size of the congestion window.
   Further, if the SYN or SYN/ACK is lost, the initial window used by a
   sender after a correctly transmitted SYN MUST be one segment
   consisting of at most SMSS bytes.

   A detailed rationale and discussion of the IW setting is provided in
   [RFC3390].

   When initial congestion windows of more than one segment are
   implemented along with Path MTU Discovery [RFC1191], and the MSS
   being used is found to be too large, the congestion window cwnd
   SHOULD be reduced to prevent large bursts of smaller segments.
   Specifically, cwnd SHOULD be reduced by the ratio of the old segment
   size to the new segment size.

   The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
   to the size of the largest possible advertised window), but ssthresh
   MUST be reduced in response to congestion. Setting ssthresh as high
   as possible allows the network conditions, rather than some
   arbitrary host limit, to dictate the sending rate. In cases where
   the end systems have a solid understanding of the network path, more
   carefully setting the initial ssthresh value may have merit (e.g.,
   such that the end host does not create congestion along the path).
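
   As an illustration only, the IW upper bound given earlier in this
   section can be written as a small function. This is a minimal sketch
   in Python and not part of the specification. The function name
   initial_window and the syn_lost parameter are hypothetical names
   chosen for the example, and all sizes are in bytes.

```python
def initial_window(smss: int, syn_lost: bool = False) -> int:
    """Upper bound on IW in bytes (hypothetical helper, sizes in bytes)."""
    if syn_lost:
        # A lost SYN or SYN/ACK limits the initial window to one segment
        # of at most SMSS bytes.
        return smss
    if smss > 2190:
        return 2 * smss  # no more than 2 segments
    if smss > 1095:
        return 3 * smss  # no more than 3 segments
    return 4 * smss      # no more than 4 segments


if __name__ == "__main__":
    # Print the bound for a few example segment sizes.
    for smss in (536, 1460, 4000):
        print(smss, initial_window(smss))
```

   Since the values are upper bounds, an implementation remains free to
   start with a smaller window than this function returns.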

   The slow start algorithm is used when cwnd < ssthresh, while the
   congestion avoidance algorithm is used when cwnd > ssthresh. When
   cwnd and ssthresh are equal the sender may use either slow start or
   congestion avoidance.

   During slow start, a TCP increments cwnd by at most SMSS bytes for
   each ACK received that cumulatively acknowledges new data. Slow
   start ends when cwnd exceeds ssthresh (or, optionally, when it
   reaches it, as noted above) or when congestion is observed. While
   traditionally TCP implementations have increased cwnd by precisely
   SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
   that TCP implementations increase cwnd, per:

      cwnd += min (N, SMSS)                      (2)

   where N is the number of previously unacknowledged bytes
   acknowledged in the incoming ACK. This adjustment is part of
   Appropriate Byte Counting [RFC3465] and provides robustness against
   misbehaving receivers which may attempt to induce a sender to
   artificially inflate cwnd using a mechanism known as "ACK Division"
   [SCWA99]. ACK Division consists of a receiver sending multiple ACKs
   for a single TCP data segment, each acknowledging only a portion of
   its data. A TCP that increments cwnd by SMSS for each such ACK will
   inappropriately inflate the amount of data injected into the
   network.

   During congestion avoidance, cwnd is incremented by roughly 1
   full-sized segment per round-trip time (RTT). Congestion avoidance
   continues until congestion is detected. The basic guidelines for
   incrementing cwnd during congestion avoidance are:

      * MAY increment cwnd by SMSS bytes

      * SHOULD increment cwnd per equation (2) once per RTT

      * MUST NOT increment cwnd by more than SMSS bytes

   We note that [RFC3465] allows for cwnd increases of more than SMSS
   bytes for incoming acknowledgments during slow start on an
   experimental basis, however such behavior is not allowed as part of
   the standard.
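
   The effect of equation (2) under ACK Division can be sketched as
   follows. This is an illustrative Python fragment under stated
   assumptions, not a definitive implementation: the function names are
   hypothetical, sizes are in bytes, and the sketch arbitrarily treats
   cwnd equal to ssthresh as the end of slow start (the text above
   permits either choice).

```python
def slow_start_increase(cwnd: int, ssthresh: int, acked: int, smss: int) -> int:
    """Return cwnd after one ACK that newly acknowledges `acked` bytes."""
    if acked <= 0 or cwnd >= ssthresh:
        # No new data acknowledged, or slow start is no longer in effect.
        return cwnd
    return cwnd + min(acked, smss)  # equation (2)


def naive_increase(cwnd: int, smss: int) -> int:
    """Traditional behavior: a full SMSS per ACK, whatever it covers."""
    return cwnd + smss


if __name__ == "__main__":
    smss, ssthresh = 1460, 1 << 30
    start = 4 * smss
    robust = naive = start
    # ACK Division: one 1460-byte segment acknowledged by ten 146-byte ACKs.
    for _ in range(10):
        robust = slow_start_increase(robust, ssthresh, 146, smss)
        naive = naive_increase(naive, smss)
    print("byte-counted growth:", robust - start)
    print("per-ACK growth:", naive - start)
```

   With byte counting, the ten partial ACKs together open the window by
   one segment's worth of bytes, whereas the per-ACK rule opens it by
   ten segments.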

   The RECOMMENDED way to increase cwnd during congestion avoidance is
   to count the number of bytes that have been acknowledged by ACKs for
   new data. (A drawback of this implementation is that it requires
   maintaining an additional state variable.) When the number of bytes
   acknowledged reaches cwnd, then cwnd can be incremented by up to
   SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be
   increased by more than SMSS bytes per RTT. This method both allows
   TCPs to increase cwnd by one segment per RTT in the face of delayed
   ACKs and provides robustness against ACK Division attacks.

   Another common formula that a TCP MAY use to update cwnd during
   congestion avoidance is given in equation 3:

      cwnd += SMSS*SMSS/cwnd                     (3)

   This adjustment is executed on every incoming ACK that acknowledges
   new data. Equation (3) provides an acceptable approximation to the
   underlying principle of increasing cwnd by 1 full-sized segment per
   RTT. (Note that for a connection in which the receiver is
   acknowledging every-other packet, (3) is less aggressive than
   allowed -- roughly increasing cwnd every second RTT.)

   Implementation Note: Since integer arithmetic is usually used in TCP
   implementations, the formula given in equation 3 can fail to
   increase cwnd when the congestion window is larger than SMSS*SMSS.
   If the above formula yields 0, the result SHOULD be rounded up to 1
   byte.

   Implementation Note: Older implementations have an additional
   additive constant on the right-hand side of equation (3). This is
   incorrect and can actually lead to diminished performance [RFC2525].

   Implementation Note: Some implementations maintain cwnd in units of
   bytes, while others in units of full-sized segments. The latter
   will find equation (3) difficult to use, and may prefer to use the
   counting approach discussed in the previous paragraph.
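
   The two congestion avoidance methods above can be compared with a
   short sketch. It is written in Python purely for illustration and
   makes several assumptions: the class and function names are
   hypothetical, sizes are in bytes, cwnd is held in bytes, and the
   byte counter is reduced by the old cwnd when the window grows (one
   reasonable reading of "reaches cwnd").

```python
class ByteCountingCA:
    """Byte-counting congestion avoidance (hypothetical class, bytes)."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0  # the additional state variable

    def on_ack(self, acked: int) -> None:
        """Process an ACK that newly acknowledges `acked` bytes."""
        self.bytes_acked += acked
        if self.bytes_acked >= self.cwnd:
            # A full window has been acknowledged: grow by one segment.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss


def equation3_increase(cwnd: int, smss: int) -> int:
    """Equation (3) in integer arithmetic, with a 0 result rounded up to 1."""
    return cwnd + max(1, smss * smss // cwnd)


if __name__ == "__main__":
    smss, start = 1460, 14600
    # Delayed ACKs: five ACKs, each covering two full-sized segments.
    counted = ByteCountingCA(start, smss)
    per_ack = start
    for _ in range(5):
        counted.on_ack(2 * smss)
        per_ack = equation3_increase(per_ack, smss)
    print("byte counting growth:", counted.cwnd - start)
    print("equation (3) growth:", per_ack - start)
```

   In this scenario the byte-counting method grows cwnd by a full
   segment over the window, while equation (3) grows it by roughly half
   that, matching the parenthetical note about receivers that
   acknowledge every other packet.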

   When a TCP sender detects segment loss using the retransmission
   timer and the given segment has not yet been resent by way of the
   retransmission timer, the value of ssthresh MUST be set to no more
   than the value given in equation 4:

      ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

   where, as discussed above, FlightSize is the amount of outstanding
   data in the network.

   On the other hand, when a TCP sender detects segment loss using the
   retransmission timer and the given segment has already been
   retransmitted by way of the retransmission timer at least once, the
   value of ssthresh is held constant.

   Implementation Note: An easy mistake to make is to simply use cwnd,
   rather than FlightSize, which in some implementations may
   incidentally increase well beyond rwnd.

   Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
   set to no more than the loss window, LW, which equals 1 full-sized
   segment (regardless of the value of IW). Therefore, after
   retransmitting the dropped segment the TCP sender uses the slow
   start algorithm to increase the window from 1 full-sized segment to
   the new value of ssthresh, at which point congestion avoidance again
   takes over.

   As shown in [FF96,RFC3782], slow start-based loss recovery after a
   timeout can cause spurious retransmissions that trigger duplicate
   acknowledgments. The reaction to the arrival of these duplicate
   ACKs in TCP implementations varies widely. This document does not
   specify how to treat such acknowledgments, but does note this as an
   area that may benefit from additional attention, experimentation and
   specification.

3.2 Fast Retransmit/Fast Recovery

   A TCP receiver SHOULD send an immediate duplicate ACK when an out-
   of-order segment arrives. The purpose of this ACK is to inform the
   sender that a segment was received out-of-order and which sequence
   number is expected.
   From the sender's perspective, duplicate ACKs can be caused by a
   number of network problems. First, they can be
   caused by dropped segments. In this case, all segments after the
   dropped segment will trigger duplicate ACKs until the loss is
   repaired. Second, duplicate ACKs can be caused by the re-ordering
   of data segments by the network (not a rare event along some network
   paths [Pax97]). Finally, duplicate ACKs can be caused by
   replication of ACK or data segments by the network. In addition, a
   TCP receiver SHOULD send an immediate ACK when the incoming segment
   fills in all or part of a gap in the sequence space. This will
   generate more timely information for a sender recovering from a loss
   through a retransmission timeout, a fast retransmit, or an advanced
   loss recovery algorithm, as outlined in section 4.3.

   The TCP sender SHOULD use the "fast retransmit" algorithm to detect
   and repair loss, based on incoming duplicate ACKs. The fast
   retransmit algorithm uses the arrival of 3 duplicate ACKs (as
   defined in section 2, without any intervening ACKs which move
   SND.UNA) as an indication that a segment has been lost. After
   receiving 3 duplicate ACKs, TCP performs a retransmission of what
   appears to be the missing segment, without waiting for the
   retransmission timer to expire.

   After the fast retransmit algorithm sends what appears to be the
   missing segment, the "fast recovery" algorithm governs the
   transmission of new data until a non-duplicate ACK arrives. The
   reason for not performing slow start is that the receipt of the
   duplicate ACKs not only indicates that a segment has been lost, but
   also that segments are most likely leaving the network (although a
   massive segment duplication by the network can invalidate this
   conclusion).
   In other words, since the receiver can only generate a
   duplicate ACK when a segment has arrived, that segment has left the
   network and is in the receiver's buffer, so we know it is no longer
   consuming network resources. Furthermore, since the ACK "clock"
   [Jac88] is preserved, the TCP sender can continue to transmit new
   segments (although transmission must continue using a reduced cwnd,
   since loss is an indication of congestion).

   The fast retransmit and fast recovery algorithms are implemented
   together as follows.

   1.  On the first and second duplicate ACKs received at a sender, a
       TCP SHOULD send a segment of previously unsent data per
       [RFC3042] provided that the receiver's advertised window allows,
       the total FlightSize would remain less than or equal to cwnd
       plus 2*SMSS, and that new data is available for transmission.
       Further, the TCP sender MUST NOT change cwnd to reflect these
       two segments [RFC3042]. Note that a sender using SACK [RFC2018]
       MUST NOT send new data unless the incoming duplicate
       acknowledgment contains new SACK information.

   2.  When the third duplicate ACK is received, a TCP MUST set
       ssthresh to no more than the value given in equation 4. When
       [RFC3042] is in use, additional data sent in limited transmit
       MUST NOT be included in this calculation.

   3.  The lost segment starting at SND.UNA MUST be retransmitted and
       cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
       the congestion window by the number of segments (three) that
       have left the network and which the receiver has buffered.

   4.  For each additional duplicate ACK received (after the third),
       cwnd MUST be incremented by SMSS. This artificially inflates
       the congestion window in order to reflect the additional segment
       that has left the network.

       Note: [SCWA99] discusses a receiver-based attack whereby many
       bogus duplicate ACKs are sent to the data sender in order to
       artificially inflate cwnd and cause a higher than appropriate
       sending rate to be used. A TCP MAY therefore limit the number
       of times cwnd is artificially inflated during loss recovery
       to the number of outstanding segments (or, an approximation
       thereof).

       Note: When an advanced loss recovery mechanism (such as outlined
       in section 4.3) is not in use, this increase in FlightSize can
       cause equation 4 to slightly inflate cwnd and ssthresh, as some
       of the segments between SND.UNA and SND.NXT are assumed to have
       left the network but are still reflected in FlightSize.

   5.  When previously unsent data is available and the new value of
       cwnd and the receiver's advertised window allow, a TCP SHOULD
       send 1*SMSS bytes of previously unsent data.

   6.  When the next ACK arrives that acknowledges previously
       unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
       set in step 2). This is termed "deflating" the window.

       This ACK should be the acknowledgment elicited by the
       retransmission from step 3, one RTT after the retransmission
       (though it may arrive sooner in the presence of significant out-
       of-order delivery of data segments at the receiver).
       Additionally, this ACK should acknowledge all the intermediate
       segments sent between the lost segment and the receipt of the
       third duplicate ACK, if none of these were lost.

   Note: This algorithm is known to generally not recover efficiently
   from multiple losses in a single flight of packets [FF96]. Section
   4.3 below addresses such cases.

4. Additional Considerations

4.1 Re-starting Idle Connections

   A known problem with the TCP congestion control algorithms described
   above is that they allow a potentially inappropriate burst of
   traffic to be transmitted after TCP has been idle for a relatively
   long period of time. After an idle period, TCP cannot use the ACK
   clock to strobe new segments into the network, as all the ACKs have
   drained from the network. Therefore, as specified above, TCP can
   potentially send a cwnd-size line-rate burst into the network after
   an idle period. In addition, changing network conditions may have
   rendered TCP's notion of the available end-to-end network capacity
   between two endpoints, as estimated by cwnd, inaccurate during the
   course of a long idle period.

   [Jac88] recommends that a TCP use slow start to restart
   transmission after a relatively long idle period. Slow start
   serves to restart the ACK clock, just as it does at the beginning
   of a transfer. This mechanism has been widely deployed in the
   following manner. When TCP has not received a segment for more
   than one retransmission timeout, cwnd is reduced to the value of
   the restart window (RW) before transmission begins.

   For the purposes of this standard, we define RW = min(IW,cwnd).

   Using the last time a segment was received to determine whether or
   not to decrease cwnd can fail to deflate cwnd in the common case of
   persistent HTTP connections [HTH98]. In this case, a Web server
   receives a request before transmitting data to the Web client. The
   reception of the request makes the test for an idle connection fail,
   and allows the TCP to begin transmission with a possibly
   inappropriately large cwnd.

   Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
   transmission if the TCP has not sent data in an interval exceeding
   the retransmission timeout.
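
   The restart rule above can be sketched as a check performed just
   before transmission resumes. The following Python fragment is an
   illustration under stated assumptions rather than a definitive
   implementation: the function and parameter names are hypothetical,
   sizes are in bytes, and times are in seconds. The idle interval is
   measured from the last time data was sent, not from the last segment
   received, for the reason given in the HTTP example.

```python
def cwnd_before_sending(cwnd: int, iw: int,
                        idle_since_last_send: float, rto: float) -> int:
    """Return the cwnd to use when transmission resumes (sizes in bytes)."""
    if idle_since_last_send > rto:
        # Idle for longer than the retransmission timeout:
        # fall back to the restart window, RW = min(IW, cwnd).
        return min(iw, cwnd)
    return cwnd


if __name__ == "__main__":
    iw = 4380  # example initial window for a 1460-byte SMSS
    print(cwnd_before_sending(50000, iw, idle_since_last_send=2.5, rto=1.0))
    print(cwnd_before_sending(50000, iw, idle_since_last_send=0.2, rto=1.0))
```

   Because RW is the minimum of IW and cwnd, a connection whose cwnd is
   already below IW is left unchanged by the restart.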

4.2 Generating Acknowledgments

   The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
   TCP receiver. When using delayed ACKs, a TCP receiver MUST NOT
   excessively delay acknowledgments. Specifically, an ACK SHOULD be
   generated for at least every second full-sized segment, and MUST be
   generated within 500 ms of the arrival of the first unacknowledged
   packet.

   The requirement that an ACK "SHOULD" be generated for at least every
   second full-sized segment is listed in [RFC1122] in one place as a
   SHOULD and another as a MUST. Here we unambiguously state it is a
   SHOULD. We also emphasize that this is a SHOULD, meaning that an
   implementor should indeed only deviate from this requirement after
   careful consideration of the implications. See the discussion of
   "Stretch ACK violation" in [RFC2525] and the references therein for
   a discussion of the possible performance problems with generating
   ACKs less frequently than every second full-sized segment.

   In some cases, the sender and receiver may not agree on what
   constitutes a full-sized segment. An implementation is deemed to
   comply with this requirement if it sends at least one acknowledgment
   every time it receives 2*RMSS bytes of new data from the sender,
   where RMSS is the Maximum Segment Size specified by the receiver to
   the sender (or the default value of 536 bytes, per [RFC1122], if the
   receiver does not specify an MSS option during connection
   establishment). The sender may be forced to use a segment size less
   than RMSS due to the maximum transmission unit (MTU), the path MTU
   discovery algorithm or other factors. For instance, consider the
   case when the receiver announces an RMSS of X bytes but the sender
   ends up using a segment size of Y bytes (Y < X) due to path MTU
   discovery (or the sender's MTU size).
   The receiver will generate stretch ACKs if it waits for 2*X bytes to
   arrive before an ACK is
   sent. Clearly this will take more than 2 segments of size Y bytes.
   Therefore, while a specific algorithm is not defined, it is
   desirable for receivers to attempt to prevent this situation, for
   example by acknowledging at least every second segment, regardless
   of size. Finally, we repeat that an ACK MUST NOT be delayed for
   more than 500 ms waiting on a second full-sized segment to arrive.

   Out-of-order data segments SHOULD be acknowledged immediately, in
   order to accelerate loss recovery. To trigger the fast retransmit
   algorithm, the receiver SHOULD send an immediate duplicate ACK when
   it receives a data segment above a gap in the sequence space. To
   provide feedback to senders recovering from losses, the receiver
   SHOULD send an immediate ACK when it receives a data segment that
   fills in all or part of a gap in the sequence space.

   A TCP receiver MUST NOT generate more than one ACK for every
   incoming segment, other than to update the offered window as the
   receiving application consumes new data [page 42, RFC793][RFC813].

4.3 Loss Recovery Mechanisms

   A number of loss recovery algorithms that augment fast retransmit
   and fast recovery have been suggested by TCP researchers and
   specified in the RFC series. While some of these algorithms are
   based on the TCP selective acknowledgment (SACK) option [RFC2018],
   such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
   [Hoe96,FF96,RFC3782]. The non-SACK algorithms use "partial
   acknowledgments" (ACKs which cover previously unacknowledged data,
   but not all the data outstanding when loss was detected) to trigger
   retransmissions.
While this document does not standardize any of
the specific algorithms that may improve fast retransmit/fast
recovery, these enhanced algorithms are implicitly allowed, as long
as they follow the general principles of the basic four algorithms
outlined above.

That is, when the first loss in a window of data is detected,
ssthresh MUST be set to no more than the value given by equation
(4). Second, until all lost segments in the window of data in
question are repaired, the number of segments transmitted in each
RTT MUST be no more than half the number of outstanding segments
when the loss was detected. Finally, after all loss in the given
window of segments has been successfully retransmitted, cwnd MUST be
set to no more than ssthresh and congestion avoidance MUST be used
to further increase cwnd. Loss in two successive windows of data,
or the loss of a retransmission, should be taken as two indications
of congestion and, therefore, cwnd (and ssthresh) MUST be lowered
twice in this case.

We RECOMMEND that TCP implementers employ some form of advanced loss
recovery that can cope with multiple losses in a window of data.
The algorithms detailed in [RFC3782] and [RFC3517] conform to the
general principles outlined above. We note that while these are not
the only two algorithms that conform to the above general principles
these two algorithms have been vetted by the community and are
currently on the standards track.

5. Security Considerations

This document requires a TCP to diminish its sending rate in the
presence of retransmission timeouts and the arrival of duplicate
acknowledgments. An attacker can therefore impair the performance
of a TCP connection by either causing data packets or their
acknowledgments to be lost, or by forging excessive duplicate
acknowledgments.
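As a non-normative illustration of the rate reduction described
above, the following Python sketch shows how repeated loss
indications, whether genuine or induced by an attacker, shrink
ssthresh. It assumes equation (4) has the form
ssthresh = max(FlightSize / 2, 2*SMSS), as given earlier in this
document. The function names, the use of integer division, and the
simplification that FlightSize has fallen to the previous ssthresh
by the time of the next loss indication are assumptions made for
illustration only.

```python
def ssthresh_after_loss(flight_size, smss):
    """Upper bound on ssthresh per equation (4), in bytes.

    Integer division is an assumption of this sketch; the document
    writes the bound as max(FlightSize / 2, 2*SMSS).
    """
    return max(flight_size // 2, 2 * smss)


def apply_loss_indications(flight_size, smss, indications):
    """Return the ssthresh value after each successive loss indication.

    Hypothetical simplification: after each reduction the amount of
    outstanding data is taken to equal the new ssthresh.
    """
    history = []
    for _ in range(indications):
        ssthresh = ssthresh_after_loss(flight_size, smss)
        history.append(ssthresh)
        flight_size = ssthresh  # simplification, see docstring
    return history


if __name__ == "__main__":
    SMSS = 1460  # example sender maximum segment size, in bytes
    print(apply_loss_indications(64 * SMSS, SMSS, 3))
    # prints [46720, 23360, 11680]
```

With an SMSS of 1460 bytes and 64 segments in flight, three such
indications leave the sender with a threshold of 8 segments. The
2*SMSS floor bounds how far the reduction can go.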

In response to the ACK division attack outlined in [SCWA99] this
document RECOMMENDS increasing the congestion window based on the
number of bytes newly acknowledged in each arriving ACK rather than
by a particular constant on each arriving ACK (as outlined in
section 3.1).

The Internet to a considerable degree relies on the correct
implementation of these algorithms in order to preserve network
stability and avoid congestion collapse. An attacker could cause
TCP endpoints to respond more aggressively in the face of congestion
by forging excessive duplicate acknowledgments or excessive
acknowledgments for new data. Conceivably, such an attack could
drive a portion of the network into congestion collapse.

6. Changes Between RFC 2001 and RFC 2581

[RFC2001] was extensively rewritten editorially and it is not
feasible to itemize the list of changes between [RFC2001] and
[RFC2581]. The intention of [RFC2581] was to not change any of the
recommendations given in [RFC2001], but to further clarify cases
that were not discussed in detail in [RFC2001]. Specifically,
[RFC2581] suggested what TCP connections should do after a
relatively long idle period, as well as specified and clarified
some of the issues pertaining to TCP ACK generation. Finally, the
allowable upper bound for the initial congestion window was raised
from one to two segments.

7. Changes Relative to RFC 2581

A specific definition for "duplicate acknowledgment" has been
added, based on the definition used by BSD TCP.

The document now notes that what to do with duplicate ACKs after the
retransmission timer has fired is future work and explicitly
unspecified in this document.

The initial window requirements were changed to allow Larger
Initial Windows as standardized in [RFC3390].
Additionally, the
steps to take when an initial window is discovered to be too large
due to Path MTU Discovery [RFC1191] are detailed.

The recommended initial value for ssthresh has been changed to say
that it SHOULD be arbitrarily high, where it was previously MAY.
This is to provide additional guidance to implementors on the
matter.

During slow start, the usage of Appropriate Byte Counting [RFC3465]
with L=1*SMSS is explicitly recommended. The method of increasing
cwnd given in [RFC2581] is still explicitly allowed. Byte counting
during congestion avoidance is also recommended, while the method
from [RFC2581] and other safe methods are still allowed.

The treatment of ssthresh on retransmission timeout was clarified.
In particular, ssthresh must be set to half the FlightSize on the
first retransmission of a given segment and then is held constant on
subsequent retransmissions of the same segment.

The description of fast retransmit and fast recovery has been
clarified, and the use of Limited Transmit [RFC3042] is now
recommended.

TCPs now MAY limit the number of duplicate ACKs that artificially
inflate cwnd during loss recovery to the number of segments
outstanding to avoid the duplicate ACK spoofing attack described in
[SCWA99].

The restart window has been changed to min(IW,cwnd) from IW. This
behavior was described as "experimental" in [RFC2581].

It is now recommended that TCP implementors implement an advanced
loss recovery algorithm conforming to the principles outlined in
this document.

The security considerations have been updated to discuss ACK
division and recommend byte counting as a counter to this attack.

8. IANA Considerations

This document contains no IANA considerations, but apparently an
Internet *Draft* can no longer be published without this section.

Acknowledgments

The core algorithms we describe were developed by Van Jacobson
[Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was
developed in conjunction with Hari Balakrishnan and Sally Floyd.
The initial congestion window size specified in this document is a
result of work with Sally Floyd and Craig Partridge
[RFC2414,RFC3390].

W. Richard ("Rich") Stevens wrote the first version of this document
[RFC2001] and co-authored the second version [RFC2581]. This
present version much benefits from his clarity and thoughtfulness of
description, and we are grateful for Rich's contributions in
elucidating TCP congestion control, as well as in more broadly
helping us understand numerous issues relating to networking.

We wish to emphasize that the shortcomings and mistakes of this
document are solely the responsibility of the current authors.

Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens
(Addison-Wesley, 1995). This material is used with the permission
of Addison-Wesley.

Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch
contributed a number of helpful suggestions.

Normative References

[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, September 1981.

[RFC1122] Braden, R., "Requirements for Internet Hosts --
Communication Layers", STD 3, RFC 1122, October 1989.

[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
November 1990.

Informative References

[CJ89] Chiu, D. and R.
Jain, "Analysis of the Increase/Decrease
Algorithms for Congestion Avoidance in Computer Networks",
Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
pp. 1-14, June 1989.

[FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
Tahoe, Reno and SACK TCP", Computer Communication Review, July
1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

[Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
Control Scheme for TCP", In ACM SIGCOMM, August 1996.

[HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
Slow-Start Restart After Idle", Work in Progress.

[Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

[Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
end2end-interest mailing list, April 30, 1990.
ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

[MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control", Proceedings of SIGCOMM'96, August,
1996, Stanford, CA. Available
from http://www.psc.edu/networking/papers/papers.html

[MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters", Technical report. Available from
http://www.psc.edu/networking/papers/FACKnotes/current.

[Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

[RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
813, July 1982.

[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", RFC 2001, January
1997.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
Selective Acknowledgement Options", RFC 2018, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window Size", RFC 2414, September 1998.

[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner,
J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
Implementation Problems", RFC 2525, March 1999.

[RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
Control", RFC 2581, April 1999.

[RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
Extension to the Selective Acknowledgement (SACK) Option for
TCP", RFC 2883, July 2000.

[RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000.

[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
2001.

[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of
Explicit Congestion Notification (ECN) to IP", RFC 3168,
September 2001.

[RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window", RFC 3390, October 2002.

[RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
Counting (ABC)", RFC 3465, February 2003.

[RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP", RFC 3517, April 2003.

[RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
2004.

[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007.

[SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
"TCP Congestion Control With a Misbehaving Receiver", ACM
Computer Communication Review, 29(5), October 1999.

[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994.

[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
Implementation", Addison-Wesley, 1995.

Authors' Addresses

Mark Allman
International Computer Science Institute (ICSI)
1947 Center Street
Suite 600
Berkeley, CA 94704-1198
Phone: +1 440 235 1792
EMail: mallman@icir.org
http://www.icir.org/mallman/

Vern Paxson
International Computer Science Institute (ICSI)
1947 Center Street
Suite 600
Berkeley, CA 94704-1198
Phone: +1 510/642-4274 x302
EMail: vern@icir.org
http://www.icir.org/vern/

Ethan Blanton
Purdue University Computer Sciences
305 North University Street
West Lafayette, IN 47907
EMail: eblanton@cs.purdue.edu
http://www.cs.purdue.edu/homes/eblanton/

Acknowledgment

Funding for the RFC Editor function is currently provided by the
Internet Society.