idnits 2.17.1

draft-ietf-tcpm-rfc2581bis-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 855.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 831.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 838.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 844.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)
  -- The document date (April 2008) is 5855 days in the past. Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Flo94' is defined on line 699, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 14 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    Network Working Group                                          M. Allman
2    Internet-Draft                                                 V. Paxson
3    Expires: October 2008                                               ICSI
4                                                                  E. Blanton
5                                                           Purdue University
6                                                                  April 2008

8                             TCP Congestion Control
9                       draft-ietf-tcpm-rfc2581bis-04.txt

11   Status of this Memo

13      By submitting this Internet-Draft, each author represents that any
14      applicable patent or other IPR claims of which he or she is aware
15      have been or will be disclosed, and any of which he or she becomes
16      aware will be disclosed, in accordance with Section 6 of BCP 79.
18      Internet-Drafts are working documents of the Internet Engineering
19      Task Force (IETF), its areas, and its working groups. Note that
20      other groups may also distribute working documents as
21      Internet-Drafts.

23      Internet-Drafts are draft documents valid for a maximum of six
24      months and may be updated, replaced, or obsoleted by other documents
25      at any time. It is inappropriate to use Internet-Drafts as
26      reference material or to cite them other than as "work in progress."

28      The list of current Internet-Drafts can be accessed at
29      http://www.ietf.org/ietf/1id-abstracts.txt.

31      The list of Internet-Draft Shadow Directories can be accessed at
32      http://www.ietf.org/shadow.html.

34   Abstract

36      This document defines TCP's four intertwined congestion control
37      algorithms: slow start, congestion avoidance, fast retransmit, and
38      fast recovery. In addition, the document specifies how TCP should
39      begin transmission after a relatively long idle period, as well as
40      discussing various acknowledgment generation methods.

42   1. Introduction

44      This document specifies four TCP [RFC793] congestion control
45      algorithms: slow start, congestion avoidance, fast retransmit and
46      fast recovery. These algorithms were devised in [Jac88] and
47      [Jac90]. Their use with TCP is standardized in [RFC1122].
48      Additional early work in additive-increase, multiplicative-decrease
49      congestion control is given in [CJ89].

51      This document obsoletes [RFC2581] which in turn obsoleted
52      [RFC2001].

54      In addition to specifying the congestion control algorithms, this
55      document specifies what TCP connections should do after a relatively
56      long idle period, as well as specifying and clarifying some of the
57      issues pertaining to TCP ACK generation.

59      Note that [Ste94] provides examples of these algorithms in action
60      and [WS95] provides an explanation of the source code for the BSD
61      implementation of these algorithms.

63      This document is organized as follows.
        Section 2 provides various
64      definitions which will be used throughout the document. Section 3
65      provides a specification of the congestion control
66      algorithms. Section 4 outlines concerns related to the congestion
67      control algorithms and finally, section 5 outlines security
68      considerations.

70      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
71      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
72      document are to be interpreted as described in [RFC2119].

74   2. Definitions

76      This section provides the definition of several terms that will be
77      used throughout the remainder of this document.

79      SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
80         both).

82      SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
83         largest segment that the sender can transmit. This value can be
84         based on the maximum transmission unit of the network, the path
85         MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
86         or other factors. The size does not include the TCP/IP headers
87         and options.

89      RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
90         largest segment the receiver is willing to accept. This is the
91         value specified in the MSS option sent by the receiver during
92         connection startup. If the MSS option is not used, the RMSS is
93         536 bytes [RFC1122]. The size does not include the TCP/IP headers
94         and options.

96      FULL-SIZED SEGMENT: A segment that contains the maximum number of
97         data bytes permitted (i.e., a segment containing SMSS bytes of
98         data).

100     RECEIVER WINDOW (rwnd): The most recently advertised receiver
101        window.

103     CONGESTION WINDOW (cwnd): A TCP state variable that limits the
104        amount of data a TCP can send. At any given time, a TCP MUST
105        NOT send data with a sequence number higher than the sum of the
106        highest acknowledged sequence number and the minimum of cwnd and
107        rwnd.
109     INITIAL WINDOW (IW): The initial window is the size of the sender's
110        congestion window after the three-way handshake is completed.

112     LOSS WINDOW (LW): The loss window is the size of the congestion
113        window after a TCP sender detects loss using its retransmission
114        timer.

116     RESTART WINDOW (RW): The restart window is the size of the
117        congestion window after a TCP restarts transmission after an
118        idle period (if the slow start algorithm is used; see section
119        4.1 for more discussion).

121     FLIGHT SIZE: The amount of data that has been sent but not yet
122        cumulatively acknowledged.

124     DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
125        "duplicate" in the following algorithms when (a) the receiver of
126        the ACK has outstanding data, (b) the incoming acknowledgment
127        carries no data, (c) the SYN and FIN bits are both off, (d) the
128        acknowledgment number is equal to the greatest acknowledgment
129        received on the given connection (SND.UNA from [RFC793]) and (e)
130        the advertised window in the incoming acknowledgment equals the
131        advertised window in the last incoming acknowledgment.

133        Alternatively, a TCP that utilizes selective acknowledgments
134        [RFC2018,RFC2883] can leverage the SACK information to determine
135        when an incoming ACK is a "duplicate" (e.g., if the ACK contains
136        previously unknown SACK information).

138  3. Congestion Control Algorithms

140     This section defines the four congestion control algorithms: slow
141     start, congestion avoidance, fast retransmit and fast recovery,
142     developed in [Jac88] and [Jac90]. In some situations it may be
143     beneficial for a TCP sender to be more conservative than the
144     algorithms allow; however, a TCP MUST NOT be more aggressive than the
145     following algorithms allow (that is, MUST NOT send data when the
146     value of cwnd computed by the following algorithms would not allow
147     the data to be sent).
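
The send limit from the cwnd definition and the five-part duplicate
acknowledgment test in section 2 can be sketched in Python as shown
below. This is an illustrative sketch, not part of the specification.
The names SenderState, IncomingAck, may_send and is_duplicate_ack are
hypothetical, sequence-number wraparound is ignored, the SACK-based
alternative is not modeled, and the reading of the send limit as "bytes
in flight after sending must not exceed min(cwnd, rwnd)" is an
assumption about how to count sequence numbers.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    """Hypothetical per-connection sender state (all values in bytes)."""
    snd_una: int   # highest cumulatively acknowledged sequence number
    snd_nxt: int   # next sequence number the sender would transmit
    cwnd: int      # congestion window
    rwnd: int      # most recently advertised receiver window

    @property
    def flight_size(self) -> int:
        # FLIGHT SIZE: data sent but not yet cumulatively acknowledged.
        return self.snd_nxt - self.snd_una


@dataclass
class IncomingAck:
    """Hypothetical view of the fields of an arriving segment."""
    ack_no: int        # acknowledgment number
    payload_len: int   # bytes of data carried by the segment
    syn: bool
    fin: bool
    window: int        # advertised window carried by the segment


def may_send(state: SenderState, nbytes: int) -> bool:
    """True if nbytes more may be sent within SND.UNA + min(cwnd, rwnd)."""
    limit = state.snd_una + min(state.cwnd, state.rwnd)
    return state.snd_nxt + nbytes <= limit


def is_duplicate_ack(state: SenderState, ack: IncomingAck) -> bool:
    """Apply conditions (a) through (e) of the section 2 definition."""
    return (
        state.flight_size > 0                 # (a) data is outstanding
        and ack.payload_len == 0              # (b) ACK carries no data
        and not ack.syn and not ack.fin       # (c) SYN and FIN both off
        and ack.ack_no == state.snd_una       # (d) no advance of SND.UNA
        and ack.window == state.rwnd          # (e) window unchanged
    )
```

Condition (e) is what separates a duplicate ACK from a pure window
update: an ACK that repeats the acknowledgment number but advertises a
different window is not counted toward fast retransmit.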
149     Also note that the algorithms specified in this document work in
150     terms of using loss as the signal of congestion. Explicit
151     Congestion Notification (ECN) could also be used as specified in
152     [RFC3168].

154  3.1 Slow Start and Congestion Avoidance

156     The slow start and congestion avoidance algorithms MUST be used by a
157     TCP sender to control the amount of outstanding data being injected
158     into the network. To implement these algorithms, two variables are
159     added to the TCP per-connection state. The congestion window (cwnd)
160     is a sender-side limit on the amount of data the sender can transmit
161     into the network before receiving an acknowledgment (ACK), while the
162     receiver's advertised window (rwnd) is a receiver-side limit on the
163     amount of outstanding data. The minimum of cwnd and rwnd governs
164     data transmission.

166     Another state variable, the slow start threshold (ssthresh), is used
167     to determine whether the slow start or congestion avoidance
168     algorithm is used to control data transmission, as discussed below.

170     Beginning transmission into a network with unknown conditions
171     requires TCP to slowly probe the network to determine the available
172     capacity, in order to avoid congesting the network with an
173     inappropriately large burst of data. The slow start algorithm is
174     used for this purpose at the beginning of a transfer, or after
175     repairing loss detected by the retransmission timer. Slow start
176     additionally serves to start the "ACK clock" used by the TCP sender
177     to release data into the network in the slow start, congestion
178     avoidance, and loss recovery algorithms.

180     IW, the initial value of cwnd, MUST be set using the following
181     guidelines as an upper bound.
183     If SMSS > 2190 bytes:
184         IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
185     If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
186         IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
187     If SMSS <= 1095 bytes:
188         IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

190     As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
191     SYN/ACK MUST NOT increase the size of the congestion window.
192     Further, if the SYN or SYN/ACK is lost, the initial window used by a
193     sender after a correctly transmitted SYN MUST be one segment
194     consisting of at most SMSS bytes.

196     A detailed rationale and discussion of the IW setting is provided in
197     [RFC3390].

199     When initial congestion windows of more than one segment are
200     implemented along with Path MTU Discovery [RFC1191], and the MSS
201     being used is found to be too large, the congestion window cwnd
202     SHOULD be reduced to prevent large bursts of smaller segments.
203     Specifically, cwnd SHOULD be reduced by the ratio of the old segment
204     size to the new segment size.

206     The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
207     to the size of the largest possible advertised window), but ssthresh
208     MUST be reduced in response to congestion. Setting ssthresh as high
209     as possible allows the network conditions, rather than some
210     arbitrary host limit, to dictate the sending rate. In cases where
211     the end systems have a solid understanding of the network path, more
212     carefully setting the initial ssthresh value may have merit (e.g.,
213     such that the end host does not create congestion along the path).

215     The slow start algorithm is used when cwnd < ssthresh, while the
216     congestion avoidance algorithm is used when cwnd > ssthresh. When
217     cwnd and ssthresh are equal the sender may use either slow start or
218     congestion avoidance.
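
The initial window bound and the choice between slow start and
congestion avoidance described above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not a
reference implementation. All function names are hypothetical, and
rescale_cwnd_after_pmtu_drop encodes one reading of "reduced by the
ratio of the old segment size to the new segment size" (dividing cwnd by
old/new), which is an interpretation rather than text from this
document.

```python
def initial_window(smss: int) -> int:
    """Upper bound on IW in bytes for a given SMSS (section 3.1)."""
    if smss > 2190:
        segments = 2
    elif smss > 1095:
        segments = 3
    else:
        segments = 4
    return segments * smss


def initial_window_after_syn_loss(smss: int) -> int:
    """If the SYN or SYN/ACK was lost, IW is one segment of at most SMSS."""
    return smss


def rescale_cwnd_after_pmtu_drop(cwnd: int, old_smss: int, new_smss: int) -> int:
    """Shrink cwnd when Path MTU Discovery lowers the segment size.

    Assumption: "reduced by the ratio of old to new" is read as
    cwnd / (old_smss / new_smss), computed here in integer arithmetic.
    """
    return cwnd * new_smss // old_smss


def in_slow_start(cwnd: int, ssthresh: int, slow_start_on_tie: bool = True) -> bool:
    """Select the algorithm; the cwnd == ssthresh case is left to the sender."""
    if cwnd < ssthresh:
        return True
    if cwnd > ssthresh:
        return False
    return slow_start_on_tie
```

For a typical Ethernet-derived SMSS of 1460 bytes, initial_window(1460)
yields 4380 bytes, i.e. three segments.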
220     During slow start, a TCP increments cwnd by at most SMSS bytes for
221     each ACK received that cumulatively acknowledges new data. Slow
222     start ends when cwnd exceeds ssthresh (or, optionally, when it
223     reaches it, as noted above) or when congestion is observed. While
224     traditionally TCP implementations have increased cwnd by precisely
225     SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
226     that TCP implementations increase cwnd, per:

228        cwnd += min (N, SMSS)                      (2)

230     where N is the number of previously unacknowledged bytes
231     acknowledged in the incoming ACK. This adjustment is part of
232     Appropriate Byte Counting [RFC3465] and provides robustness against
233     misbehaving receivers which may attempt to induce a sender to
234     artificially inflate cwnd using a mechanism known as "ACK Division"
235     [SCWA99]. ACK Division consists of a receiver sending multiple ACKs
236     for a single TCP data segment, each acknowledging only a portion of
237     its data. A TCP that increments cwnd by SMSS for each such ACK will
238     inappropriately inflate the amount of data injected into the
239     network.

241     During congestion avoidance, cwnd is incremented by roughly 1
242     full-sized segment per round-trip time (RTT). Congestion avoidance
243     continues until congestion is detected. The basic guidelines for
244     incrementing cwnd during congestion avoidance are:

246        * MAY increment cwnd by SMSS bytes

248        * SHOULD increment cwnd per equation (2) once per RTT

250        * MUST NOT increment cwnd by more than SMSS bytes

252     We note that [RFC3465] allows for cwnd increases of more than SMSS
253     bytes for incoming acknowledgments during slow start on an
254     experimental basis; however, such behavior is not allowed as part of
255     the standard.

257     The RECOMMENDED way to increase cwnd during congestion avoidance is
258     to count the number of bytes that have been acknowledged by ACKs for
259     new data.
        (A drawback of this implementation is that it requires
260     maintaining an additional state variable.) When the number of bytes
261     acknowledged reaches cwnd, then cwnd can be incremented by up to
262     SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be
263     increased by more than SMSS bytes per RTT. This method both allows
264     TCPs to increase cwnd by one segment per RTT in the face of delayed
265     ACKs and provides robustness against ACK Division attacks.

267     Another common formula that a TCP MAY use to update cwnd during
268     congestion avoidance is given in equation 3:

270        cwnd += SMSS*SMSS/cwnd                     (3)

272     This adjustment is executed on every incoming ACK that acknowledges
273     new data. Equation (3) provides an acceptable approximation to the
274     underlying principle of increasing cwnd by 1 full-sized segment per
275     RTT. (Note that for a connection in which the receiver is
276     acknowledging every other packet, (3) is less aggressive than
277     allowed -- roughly increasing cwnd every second RTT.)

279     Implementation Note: Since integer arithmetic is usually used in TCP
280     implementations, the formula given in equation 3 can fail to
281     increase cwnd when the congestion window is larger than SMSS*SMSS.
282     If the above formula yields 0, the result SHOULD be rounded up to 1
283     byte.

285     Implementation Note: Older implementations have an additional
286     additive constant on the right-hand side of equation (3). This is
287     incorrect and can actually lead to diminished performance [RFC2525].

289     Implementation Note: Some implementations maintain cwnd in units of
290     bytes, while others in units of full-sized segments. The latter
291     will find equation (3) difficult to use, and may prefer to use the
292     counting approach discussed in the previous paragraph.
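
The three cwnd growth rules discussed above (equation (2) for slow
start, byte counting for congestion avoidance, and equation (3) with the
integer rounding from the first Implementation Note) can be sketched in
Python as follows. This is an illustrative sketch only. The names
slow_start_increase, ByteCountingCA and ca_increase_eq3 are
hypothetical, and carrying the surplus of the byte counter over to the
next round is one possible choice that the text above does not mandate.

```python
def slow_start_increase(cwnd: int, bytes_newly_acked: int, smss: int) -> int:
    """Equation (2): cwnd += min(N, SMSS), applied once per ACK of new data."""
    return cwnd + min(bytes_newly_acked, smss)


class ByteCountingCA:
    """Congestion avoidance by counting acknowledged bytes.

    cwnd grows by SMSS once a full cwnd worth of bytes has been
    acknowledged, so growth is bounded by about one segment per RTT.
    """

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0  # the additional state variable noted in the text

    def on_ack(self, bytes_newly_acked: int) -> int:
        self.bytes_acked += bytes_newly_acked
        if self.bytes_acked >= self.cwnd:
            # Assumption: keep the surplus rather than resetting to zero.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
        return self.cwnd


def ca_increase_eq3(cwnd: int, smss: int) -> int:
    """Equation (3) in integer arithmetic, rounded up to 1 byte if it yields 0."""
    return cwnd + max(smss * smss // cwnd, 1)
```

Because slow_start_increase credits only the bytes actually
acknowledged, a receiver that splits one segment's acknowledgment into
many small ACKs gains nothing: the total growth is still at most SMSS
per segment.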
294     When a TCP sender detects segment loss using the retransmission
295     timer and the given segment has not yet been retransmitted, the
296     value of ssthresh MUST be set to no more than the value given in
297     equation 4:

299        ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

301     where, as discussed above, FlightSize is the amount of outstanding
302     data in the network.

304     On the other hand, when a TCP sender detects segment loss using the
305     retransmission timer and the given segment has already been
306     retransmitted by way of the retransmission timer at least once, the
307     value of ssthresh is held constant.

309     Implementation Note: An easy mistake to make is to simply use cwnd,
310     rather than FlightSize, which in some implementations may
311     incidentally increase well beyond rwnd.

313     Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
314     set to no more than the loss window, LW, which equals 1 full-sized
315     segment (regardless of the value of IW). Therefore, after
316     retransmitting the dropped segment the TCP sender uses the slow
317     start algorithm to increase the window from 1 full-sized segment to
318     the new value of ssthresh, at which point congestion avoidance again
319     takes over.

321     As shown in [FF96,RFC3782], slow start-based loss recovery after a
322     timeout can cause spurious retransmissions that trigger duplicate
323     acknowledgments. The reaction to the arrival of these duplicate
324     ACKs in TCP implementations varies widely. This document does not
325     specify how to treat such acknowledgments, but does note this as an
326     area that may benefit from additional attention, experimentation and
327     specification.

329  3.2 Fast Retransmit/Fast Recovery

331     A TCP receiver SHOULD send an immediate duplicate ACK when an out-
332     of-order segment arrives. The purpose of this ACK is to inform the
333     sender that a segment was received out-of-order and which sequence
334     number is expected.
        From the sender's perspective, duplicate ACKs
335     can be caused by a number of network problems. First, they can be
336     caused by dropped segments. In this case, all segments after the
337     dropped segment will trigger duplicate ACKs until the loss is
338     repaired. Second, duplicate ACKs can be caused by the re-ordering
339     of data segments by the network (not a rare event along some network
340     paths [Pax97]). Finally, duplicate ACKs can be caused by
341     replication of ACK or data segments by the network. In addition, a
342     TCP receiver SHOULD send an immediate ACK when the incoming segment
343     fills in all or part of a gap in the sequence space. This will
344     generate more timely information for a sender recovering from a loss
345     through a retransmission timeout, a fast retransmit, or an advanced
346     loss recovery algorithm, as outlined in section 4.3.

348     The TCP sender SHOULD use the "fast retransmit" algorithm to detect
349     and repair loss, based on incoming duplicate ACKs. The fast
350     retransmit algorithm uses the arrival of 3 duplicate ACKs (as
351     defined in section 2, without any intervening ACKs which move
352     SND.UNA) as an indication that a segment has been lost. After
353     receiving 3 duplicate ACKs, TCP performs a retransmission of what
354     appears to be the missing segment, without waiting for the
355     retransmission timer to expire.

357     After the fast retransmit algorithm sends what appears to be the
358     missing segment, the "fast recovery" algorithm governs the
359     transmission of new data until a non-duplicate ACK arrives. The
360     reason for not performing slow start is that the receipt of the
361     duplicate ACKs not only indicates that a segment has been lost, but
362     also that segments are most likely leaving the network (although a
363     massive segment duplication by the network can invalidate this
364     conclusion).
        In other words, since the receiver can only generate a
365     duplicate ACK when a segment has arrived, that segment has left the
366     network and is in the receiver's buffer, so we know it is no longer
367     consuming network resources. Furthermore, since the ACK "clock"
368     [Jac88] is preserved, the TCP sender can continue to transmit new
369     segments (although transmission must continue using a reduced cwnd,
370     since loss is an indication of congestion).

372     The fast retransmit and fast recovery algorithms are implemented
373     together as follows.

375     1.  On the first and second duplicate ACKs received at a sender, a
376         TCP SHOULD send a segment of previously unsent data per
377         [RFC3042] provided that the receiver's advertised window allows,
378         the total FlightSize would remain less than or equal to cwnd
379         plus 2*SMSS, and that new data is available for transmission.
380         Further, the TCP sender MUST NOT change cwnd to reflect these
381         two segments [RFC3042]. Note that a sender using SACK [RFC2018]
382         MUST NOT send new data unless the incoming duplicate
383         acknowledgment contains new SACK information.

385     2.  When the third duplicate ACK is received, a TCP MUST set
386         ssthresh to no more than the value given in equation 4. When
387         [RFC3042] is in use, additional data sent in limited transmit
388         MUST NOT be included in this calculation.

390     3.  The lost segment starting at SND.UNA MUST be retransmitted and
391         cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
392         the congestion window by the number of segments (three) that
393         have left the network and which the receiver has buffered.

395     4.  For each additional duplicate ACK received (after the third),
396         cwnd MUST be incremented by SMSS. This artificially inflates
397         the congestion window in order to reflect the additional segment
398         that has left the network.
400         Note: [SCWA99] discusses a receiver-based attack whereby many
401         bogus duplicate ACKs are sent to the data sender in order to
402         artificially inflate cwnd and cause a higher than appropriate
403         sending rate to be used. A TCP MAY therefore limit the number
404         of times cwnd is artificially inflated during loss recovery
405         to the number of outstanding segments (or, an approximation
406         thereof).

408     5.  When previously unsent data is available and the new value of
409         cwnd and the receiver's advertised window allow, a TCP SHOULD
410         send 1*SMSS bytes of previously unsent data.

412     6.  When the next ACK arrives that acknowledges previously
413         unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
414         set in step 2). This is termed "deflating" the window.

416         This ACK should be the acknowledgment elicited by the
417         retransmission from step 3, one RTT after the retransmission
418         (though it may arrive sooner in the presence of significant out-
419         of-order delivery of data segments at the receiver).
420         Additionally, this ACK should acknowledge all the intermediate
421         segments sent between the lost segment and the receipt of the
422         third duplicate ACK, if none of these were lost.

424     Note: This algorithm is known to generally not recover efficiently
425     from multiple losses in a single flight of packets [FF96]. Section
426     4.3 below addresses such cases.

428  4. Additional Considerations

430  4.1 Re-starting Idle Connections

432     A known problem with the TCP congestion control algorithms described
433     above is that they allow a potentially inappropriate burst of
434     traffic to be transmitted after TCP has been idle for a relatively
435     long period of time. After an idle period, TCP cannot use the ACK
436     clock to strobe new segments into the network, as all the ACKs have
437     drained from the network. Therefore, as specified above, TCP can
438     potentially send a cwnd-size line-rate burst into the network after
439     an idle period.
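
Returning briefly to section 3.2: the six fast retransmit/fast recovery
steps, together with the timeout response built on equation (4) in
section 3.1, can be sketched in Python as shown below. This is a
simplified illustration under stated assumptions, not a complete
sender. The class Sender and the handler names are hypothetical, the
handlers only update window state and return a label for the action the
caller would take, flight_size is assumed to be maintained by the caller
and to exclude limited transmit data as step 2 requires, and neither
SACK nor the optional cap on inflation from the [SCWA99] note is
modeled.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Hypothetical congestion state, all window values in bytes."""
    smss: int
    cwnd: int
    ssthresh: int
    flight_size: int          # outstanding data, maintained by the caller
    dupacks: int = 0
    in_fast_recovery: bool = False


def ssthresh_after_loss(flight_size: int, smss: int) -> int:
    """Equation (4): max(FlightSize / 2, 2*SMSS), using integer division."""
    return max(flight_size // 2, 2 * smss)


def on_duplicate_ack(s: Sender) -> str:
    s.dupacks += 1
    if s.dupacks < 3:
        # Step 1: limited transmit may send one new segment; cwnd unchanged.
        return "limited_transmit"
    if s.dupacks == 3:
        # Steps 2 and 3: halve via equation (4), retransmit, inflate by 3*SMSS.
        s.ssthresh = ssthresh_after_loss(s.flight_size, s.smss)
        s.cwnd = s.ssthresh + 3 * s.smss
        s.in_fast_recovery = True
        return "retransmit"
    # Step 4: each further duplicate ACK signals one more departed segment.
    # Step 5 (sending new data if cwnd and rwnd allow) is left to the caller.
    s.cwnd += s.smss
    return "inflate"


def on_new_ack(s: Sender) -> None:
    # Step 6: an ACK for new data "deflates" the window back to ssthresh.
    if s.in_fast_recovery:
        s.cwnd = s.ssthresh
        s.in_fast_recovery = False
    s.dupacks = 0


def on_retransmission_timeout(s: Sender, first_timeout_for_segment: bool) -> None:
    # ssthresh is recomputed only on the first timeout of a given segment;
    # on repeated timeouts of the same segment it is held constant.
    if first_timeout_for_segment:
        s.ssthresh = ssthresh_after_loss(s.flight_size, s.smss)
    s.cwnd = s.smss           # loss window LW: one full-sized segment
    s.dupacks = 0
    s.in_fast_recovery = False
```

With SMSS = 1000 and 10000 bytes in flight, the third duplicate ACK sets
ssthresh to 5000 and cwnd to 8000, and the ACK that ends recovery
deflates cwnd to 5000.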
        In addition, changing network conditions may have
440     rendered TCP's notion of the available end-to-end network capacity
441     between two endpoints, as estimated by cwnd, inaccurate during the
442     course of a long idle period.

444     [Jac88] recommends that a TCP use slow start to restart
445     transmission after a relatively long idle period. Slow start
446     serves to restart the ACK clock, just as it does at the beginning
447     of a transfer. This mechanism has been widely deployed in the
448     following manner. When TCP has not received a segment for more
449     than one retransmission timeout, cwnd is reduced to the value of
450     the restart window (RW) before transmission begins.

452     For the purposes of this standard, we define RW = min(IW,cwnd).

454     Using the last time a segment was received to determine whether or
455     not to decrease cwnd can fail to deflate cwnd in the common case of
456     persistent HTTP connections [HTH98]. In this case, a Web server
457     receives a request before transmitting data to the Web client. The
458     reception of the request makes the test for an idle connection fail,
459     and allows the TCP to begin transmission with a possibly
460     inappropriately large cwnd.

462     Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
463     transmission if the TCP has not sent data in an interval exceeding
464     the retransmission timeout.

466  4.2 Generating Acknowledgments

468     The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
469     TCP receiver. When using delayed ACKs, a TCP receiver MUST NOT
470     excessively delay acknowledgments. Specifically, an ACK SHOULD be
471     generated for at least every second full-sized segment, and MUST be
472     generated within 500 ms of the arrival of the first unacknowledged
473     packet.

475     The requirement that an ACK "SHOULD" be generated for at least every
476     second full-sized segment is listed in [RFC1122] in one place as a
477     SHOULD and another as a MUST. Here we unambiguously state it is a
478     SHOULD.
        We also emphasize that this is a SHOULD, meaning that an
479     implementor should indeed only deviate from this requirement after
480     careful consideration of the implications. See the discussion of
481     "Stretch ACK violation" in [RFC2525] and the references therein for
482     a discussion of the possible performance problems with generating
483     ACKs less frequently than every second full-sized segment.

485     In some cases, the sender and receiver may not agree on what
486     constitutes a full-sized segment. An implementation is deemed to
487     comply with this requirement if it sends at least one acknowledgment
488     every time it receives 2*RMSS bytes of new data from the sender,
489     where RMSS is the Maximum Segment Size specified by the receiver to
490     the sender (or the default value of 536 bytes, per [RFC1122], if the
491     receiver does not specify an MSS option during connection
492     establishment). The sender may be forced to use a segment size less
493     than RMSS due to the maximum transmission unit (MTU), the path MTU
494     discovery algorithm or other factors. For instance, consider the
495     case when the receiver announces an RMSS of X bytes but the sender
496     ends up using a segment size of Y bytes (Y < X) due to path MTU
497     discovery (or the sender's MTU size). The receiver will generate
498     stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
499     sent. Clearly this will take more than 2 segments of size Y bytes.
500     Therefore, while a specific algorithm is not defined, it is
501     desirable for receivers to attempt to prevent this situation, for
502     example by acknowledging at least every second segment, regardless
503     of size. Finally, we repeat that an ACK MUST NOT be delayed for
504     more than 500 ms waiting on a second full-sized segment to arrive.

506     Out-of-order data segments SHOULD be acknowledged immediately, in
507     order to accelerate loss recovery.
        To trigger the fast retransmit
508     algorithm, the receiver SHOULD send an immediate duplicate ACK when
509     it receives a data segment above a gap in the sequence space. To
510     provide feedback to senders recovering from losses, the receiver
511     SHOULD send an immediate ACK when it receives a data segment that
512     fills in all or part of a gap in the sequence space.

514     A TCP receiver MUST NOT generate more than one ACK for every
515     incoming segment, other than to update the offered window as the
516     receiving application consumes new data [page 42, RFC793][RFC813].

518  4.3 Loss Recovery Mechanisms

520     A number of loss recovery algorithms that augment fast retransmit
521     and fast recovery have been suggested by TCP researchers and
522     specified in the RFC series. While some of these algorithms are
523     based on the TCP selective acknowledgment (SACK) option [RFC2018],
524     such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
525     [Hoe96,FF96,RFC3782]. The non-SACK algorithms use "partial
526     acknowledgments" (ACKs which cover previously unacknowledged data,
527     but not all the data outstanding when loss was detected) to trigger
528     retransmissions. While this document does not standardize any of
529     the specific algorithms that may improve fast retransmit/fast
530     recovery, these enhanced algorithms are implicitly allowed, as long
531     as they follow the general principles of the basic four algorithms
532     outlined above.

534     That is, when the first loss in a window of data is detected,
535     ssthresh MUST be set to no more than the value given by equation
536     (4). Second, until all lost segments in the window of data in
537     question are repaired, the number of segments transmitted in each
538     RTT MUST be no more than half the number of outstanding segments
539     when the loss was detected.
        Finally, after all loss in the given
540     window of segments has been successfully retransmitted, cwnd MUST be
541     set to no more than ssthresh and congestion avoidance MUST be used
542     to further increase cwnd. Loss in two successive windows of data,
543     or the loss of a retransmission, should be taken as two indications
544     of congestion and, therefore, cwnd (and ssthresh) MUST be lowered
545     twice in this case.

547     We RECOMMEND that TCP implementers employ some form of advanced loss
548     recovery that can cope with multiple losses in a window of data.
549     The algorithms detailed in [RFC3782] and [RFC3517] conform to the
550     general principles outlined above. We note that while these are not
551     the only two algorithms that conform to the above general principles
552     these two algorithms have been vetted by the community and are
553     currently on the standards track.

555  5. Security Considerations

557     This document requires a TCP to diminish its sending rate in the
558     presence of retransmission timeouts and the arrival of duplicate
559     acknowledgments. An attacker can therefore impair the performance
560     of a TCP connection by either causing data packets or their
561     acknowledgments to be lost, or by forging excessive duplicate
562     acknowledgments. Causing two congestion control events back-to-back
563     will often cut ssthresh to its minimum value of 2*SMSS, causing the
564     connection to immediately enter the slower-performing congestion
565     avoidance phase.

567     In response to the ACK division attack outlined in [SCWA99] this
568     document RECOMMENDS increasing the congestion window based on the
569     number of bytes newly acknowledged in each arriving ACK rather than
570     by a particular constant on each arriving ACK (as outlined in
571     section 3.1).

573     The Internet to a considerable degree relies on the correct
574     implementation of these algorithms in order to preserve network
575     stability and avoid congestion collapse.
An attacker could cause TCP endpoints to respond more aggressively in
the face of congestion by forging excessive duplicate acknowledgments
or excessive acknowledgments for new data. Conceivably, such an attack
could drive a portion of the network into congestion collapse.

6. Changes Between RFC 2001 and RFC 2581

[RFC2001] has been extensively rewritten editorially and it is not
feasible to itemize the list of changes between [RFC2001] and
[RFC2581]. The intention of [RFC2581] is to not change any of the
recommendations given in [RFC2001], but to further clarify cases that
were not discussed in detail in [RFC2001]. Specifically, [RFC2581]
suggests what TCP connections should do after a relatively long idle
period, as well as specifying and clarifying some of the issues
pertaining to TCP ACK generation. Finally, the allowable upper bound
for the initial congestion window has also been raised from one to two
segments.

7. Changes Relative to RFC 2581

A specific definition for "duplicate acknowledgment" has been added,
based on the definition used by BSD TCP.

The document now notes that what to do with duplicate ACKs after the
retransmission timer has fired is future work and explicitly
unspecified in this document.

The initial window requirements were changed to allow Larger Initial
Windows as standardized in [RFC3390]. Additionally, the steps to take
when an initial window is discovered to be too large due to Path MTU
Discovery [RFC1191] are detailed.

The recommended initial value for ssthresh has been changed to say
that it SHOULD be arbitrarily high, where it was previously MAY. This
is to provide additional guidance to implementors on the matter.

During slow start, the usage of Appropriate Byte Counting [RFC3465]
with L=1*SMSS is explicitly recommended. The method of increasing cwnd
given in [RFC2581] is still explicitly allowed.
Byte counting during congestion avoidance is also recommended, while
the method from [RFC2581] and other safe methods are still allowed.

The treatment of ssthresh on retransmission timeout was clarified. In
particular, ssthresh must be set to half the FlightSize on the first
retransmission of a given segment and then is held constant on
subsequent retransmissions of the same segment.

The description of fast retransmit and fast recovery has been
clarified, and the use of Limited Transmit [RFC3042] is now
recommended.

TCPs now MAY limit the number of duplicate ACKs that artificially
inflate cwnd during loss recovery to the number of segments
outstanding to avoid the duplicate ACK spoofing attack described in
[SCWA99].

The restart window has been changed to min(IW,cwnd) from IW. This
behavior was described as "experimental" in [RFC2581].

It is now recommended that TCP implementors implement an advanced loss
recovery algorithm conforming to the principles outlined in this
document.

The security considerations have been updated to discuss ACK division
and recommend byte counting as a counter to this attack.

8. IANA Considerations

This document contains no IANA considerations, but apparently an
Internet *Draft* can no longer be published without this section.

Acknowledgments

The core algorithms we describe were developed by Van Jacobson
[Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was developed
in conjunction with Hari Balakrishnan and Sally Floyd. The initial
congestion window size specified in this document is a result of work
with Sally Floyd and Craig Partridge [RFC2414,RFC3390].

W. Richard ("Rich") Stevens wrote the first version of this document
[RFC2001] and co-authored the second version [RFC2581].
This present version much benefits from his clarity and thoughtfulness
of description, and we are grateful for Rich's contributions in
elucidating TCP congestion control, as well as in more broadly helping
us understand numerous issues relating to networking.

We wish to emphasize that the shortcomings and mistakes of this
document are solely the responsibility of the current authors.

Some of the text from this document is taken from "TCP/IP Illustrated,
Volume 1: The Protocols" by W. Richard Stevens (Addison-Wesley, 1994)
and "TCP/IP Illustrated, Volume 2: The Implementation" by Gary R.
Wright and W. Richard Stevens (Addison-Wesley, 1995). This material is
used with the permission of Addison-Wesley.

Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch contributed
a number of helpful suggestions.

Normative References

[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
September 1981.

[RFC1122] Braden, R., "Requirements for Internet Hosts --
Communication Layers", STD 3, RFC 1122, October 1989.

[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
November 1990.

Informative References

[CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
Algorithms for Congestion Avoidance in Computer Networks", Journal of
Computer Networks and ISDN Systems, vol. 17, no. 1, pp. 1-14, June
1989.

[FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of Tahoe,
Reno and SACK TCP", Computer Communication Review, July 1996.
ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

[Flo94] Floyd, S., "TCP and Successive Fast Retransmits", Technical
report, October 1994. ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

[Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
Control Scheme for TCP", In ACM SIGCOMM, August 1996.

[HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
Slow-Start Restart After Idle", Work in Progress.

[Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

[Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
end2end-interest mailing list, April 30, 1990.
ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

[MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control", Proceedings of SIGCOMM'96, August, 1996,
Stanford, CA. Available from
http://www.psc.edu/networking/papers/papers.html

[MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters", Technical report. Available from
http://www.psc.edu/networking/papers/FACKnotes/current.

[Pax97] Paxson, V., "End-to-End Internet Packet Dynamics", Proceedings
of SIGCOMM '97, Cannes, France, Sep. 1997.

[RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
813, July 1982.

[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", RFC 2001, January 1997.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
Selective Acknowledgement Options", RFC 2018, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window Size", RFC 2414, September 1998.

[RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
Implementation Problems", RFC 2525, March 1999.

[RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
Control", RFC 2581, April 1999.

[RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
Extension to the Selective Acknowledgement (SACK) Option for TCP", RFC
2883, July 2000.

[RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000.

[RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing TCP's
Loss Recovery Using Limited Transmit", RFC 3042, January 2001.

[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of
Explicit Congestion Notification (ECN) to IP", RFC 3168, September
2001.

[RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Initial Window", RFC 3390, October 2002.

[RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
Counting (ABC)", RFC 3465, February 2003.

[RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP", RFC 3517, April 2003.

[RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
Modification to TCP's Fast Recovery Algorithm", RFC 3782, April 2004.

[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007.

[SCWA99] Savage, S., Cardwell, N., Wetherall, D. and T. Anderson, "TCP
Congestion Control With a Misbehaving Receiver", ACM Computer
Communication Review, 29(5), October 1999.

[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994.

[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
Implementation", Addison-Wesley, 1995.

Authors' Addresses

Mark Allman
International Computer Science Institute (ICSI)
1947 Center Street
Suite 600
Berkeley, CA 94704-1198
Phone: +1 440 235 1792
EMail: mallman@icir.org
http://www.icir.org/mallman/

Vern Paxson
International Computer Science Institute (ICSI)
1947 Center Street
Suite 600
Berkeley, CA 94704-1198
Phone: +1 510/642-4274 x302
EMail: vern@icir.org
http://www.icir.org/vern/

Ethan Blanton
Purdue University Computer Sciences
1398 Computer Science Building
West Lafayette, IN 47907
EMail: eblanton@cs.purdue.edu
http://www.cs.purdue.edu/homes/eblanton/

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be found
in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this specification
can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The IETF Trust (2008). This document is subject to the
rights, licenses and restrictions contained in BCP 78, and except as
set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the
Internet Society.