idnits 2.17.1 draft-ietf-tcpm-rfc2581bis-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 790. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 767. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 774. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 780. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 5 instances of too long lines in the document, the longest one being 2 characters in excess of 72. ** There are 5 instances of lines with control characters in the document. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (January 2006) is 6669 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC3390' is mentioned on line 606, but not defined == Unused Reference: 'Flo94' is defined on line 650, but no explicit reference was found in the text ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 813 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 2001 (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 2414 (Obsoleted by RFC 3390) -- Obsolete informational reference (is this intentional?): RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 2988 (Obsoleted by RFC 6298) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) -- Obsolete informational reference (is this intentional?): RFC 3782 (Obsoleted by RFC 6582) Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 14 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group M. Allman 2 Internet-Draft V. Paxson 3 Expires: July 2006 ICIR / ICSI 4 E. Blanton 5 Purdue University 6 January 2006 8 TCP Congestion Control 9 draft-ietf-tcpm-rfc2581bis-00.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as 21 Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other documents 25 at any time. It is inappropriate to use Internet-Drafts as 26 reference material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 Copyright Notice 36 Copyright (C) The Internet Society (2006). 38 Abstract 40 This document defines TCP's four intertwined congestion control 41 algorithms: slow start, congestion avoidance, fast retransmit, and 42 fast recovery. In addition, the document specifies how TCP should 43 begin transmission after a relatively long idle period, as well as 44 discussing various acknowledgment generation methods. 46 1. Introduction 48 This document specifies four TCP [RFC793] congestion control 49 algorithms: slow start, congestion avoidance, fast retransmit and 50 fast recovery. 
These algorithms were devised in [Jac88] and 51 [Jac90]. Their use with TCP is standardized in [RFC1122]. Additional 52 early work in additive-increase, multiplicative-decrease congestion 53 control is given in [CJ89]. 55 This document is an update of [RFC2001] and [RFC2581]. 57 In addition to specifying the congestion control algorithms, this 58 document specifies what TCP connections should do after a relatively 59 long idle period, as well as specifying and clarifying some of the 60 issues pertaining to TCP ACK generation. 62 Note that [Ste94] provides examples of these algorithms in action 63 and [WS95] provides an explanation of the source code for the BSD 64 implementation of these algorithms. 66 This document is organized as follows. Section 2 provides various 67 definitions which will be used throughout the document. Section 3 68 provides a specification of the congestion control 69 algorithms. Section 4 outlines concerns related to the congestion 70 control algorithms and finally, section 5 outlines security 71 considerations. 73 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 74 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 75 document are to be interpreted as described in [RFC2119]. 77 2. Definitions 79 This section provides the definition of several terms that will be 80 used throughout the remainder of this document. 82 SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or 83 both). 85 SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the 86 largest segment that the sender can transmit. This value can be 87 based on the maximum transmission unit of the network, the path 88 MTU discovery [RFC1191] algorithm, RMSS (see next item), or other 89 factors. The size does not include the TCP/IP headers and 90 options. 92 RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the 93 largest segment the receiver is willing to accept. 
This is the 94 value specified in the MSS option sent by the receiver during 95 connection startup or, if the MSS option is not used, 536 96 bytes [RFC1122]. The size does not include the TCP/IP headers and 97 options. 99 FULL-SIZED SEGMENT: A segment that contains the maximum number of 100 data bytes permitted (i.e., a segment containing SMSS bytes of 101 data). 103 RECEIVER WINDOW (rwnd): The most recently advertised receiver window. 105 CONGESTION WINDOW (cwnd): A TCP state variable that limits the 106 amount of data a TCP can send. At any given time, a TCP MUST 107 NOT send data with a sequence number higher than the sum of the 108 highest acknowledged sequence number and the minimum of cwnd and 109 rwnd. 111 INITIAL WINDOW (IW): The initial window is the size of the sender's 112 congestion window after the three-way handshake is completed. 114 LOSS WINDOW (LW): The loss window is the size of the congestion 115 window after a TCP sender detects loss using its retransmission 116 timer. 118 RESTART WINDOW (RW): The restart window is the size of the 119 congestion window after a TCP restarts transmission after an 120 idle period (if the slow start algorithm is used; see section 121 4.1 for more discussion). 123 FLIGHT SIZE: The amount of data that has been sent but not yet 124 acknowledged. 126 DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a 127 "duplicate" in the following algorithms when (a) the 128 receiver of the ACK has outstanding data, (b) the incoming 129 acknowledgment carries no data, (c) the SYN and FIN bits are 130 both off, (d) the acknowledgment number is equal to the greatest 131 acknowledgment received on the given connection (TCP.UNA from 132 [RFC793]) and (e) the advertised window in the incoming 133 acknowledgment equals the advertised window in the last incoming 134 acknowledgment. 
Alternatively, a TCP that utilizes selective 135 acknowledgments [RFC2018] can determine an incoming ACK is a 136 "duplicate" if the ACK contains previously unknown SACK 137 information. 139 3. Congestion Control Algorithms 141 This section defines the four congestion control algorithms: slow 142 start, congestion avoidance, fast retransmit and fast recovery, 143 developed in [Jac88] and [Jac90]. In some situations it may be 144 beneficial for a TCP sender to be more conservative than the 145 algorithms allow, however a TCP MUST NOT be more aggressive than the 146 following algorithms allow (that is, MUST NOT send data when the 147 value of cwnd computed by the following algorithms would not allow 148 the data to be sent). 150 3.1 Slow Start and Congestion Avoidance 152 The slow start and congestion avoidance algorithms MUST be used by a 153 TCP sender to control the amount of outstanding data being injected 154 into the network. To implement these algorithms, two variables are 155 added to the TCP per-connection state. The congestion window (cwnd) 156 is a sender-side limit on the amount of data the sender can transmit 157 into the network before receiving an acknowledgment (ACK), while the 158 receiver's advertised window (rwnd) is a receiver-side limit on the 159 amount of outstanding data. The minimum of cwnd and rwnd governs 160 data transmission. 162 Another state variable, the slow start threshold (ssthresh), is used 163 to determine whether the slow start or congestion avoidance 164 algorithm is used to control data transmission, as discussed below. 166 Beginning transmission into a network with unknown conditions 167 requires TCP to slowly probe the network to determine the available 168 capacity, in order to avoid congesting the network with an 169 inappropriately large burst of data. The slow start algorithm is 170 used for this purpose at the beginning of a transfer, or after 171 repairing loss detected by the retransmission timer. 
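As an illustration of how the minimum of cwnd and rwnd governs transmission, the following is a minimal sketch, not a definitive implementation. The function names and the use of plain integers for sequence numbers are assumptions made for this example; in particular, 32-bit sequence number wraparound is ignored.

```python
def usable_window(cwnd: int, rwnd: int) -> int:
    """Bytes the sender may have outstanding: the minimum of cwnd and rwnd."""
    return min(cwnd, rwnd)


def may_send(last_seq: int, highest_acked: int, cwnd: int, rwnd: int) -> bool:
    """Return True if data whose highest sequence number is last_seq may be sent.

    Follows the cwnd definition in section 2: a TCP must not send data with a
    sequence number higher than the highest acknowledged sequence number plus
    min(cwnd, rwnd). Sequence numbers are plain integers here (assumption:
    no wraparound handling).
    """
    return last_seq <= highest_acked + usable_window(cwnd, rwnd)
```

With highest_acked = 1000, cwnd = 1460 and rwnd = 65535, data ending at sequence number 2460 may be sent, while data ending at 2461 may not, because cwnd is the smaller of the two limits.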
173 IW, the initial value of cwnd, MUST be set using the following 174 guidelines as an upper bound. 176 If SMSS > 2190 bytes: 177 IW = 2 * SMSS bytes and MUST NOT be more than 2 segments 178 If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): 179 IW = 3 * SMSS bytes and MUST NOT be more than 3 segments 180 If SMSS <= 1095 bytes: 181 IW = 4 * SMSS bytes and MUST NOT be more than 4 segments 183 A detailed rationale and discussion of the IW setting is provided in 184 [RFC3390]. 186 When larger initial windows are implemented along with Path MTU 187 Discovery [RFC1191], and the MSS being used is found to be too 188 large, the congestion window cwnd SHOULD be reduced to prevent 189 large bursts of smaller segments. Specifically, cwnd SHOULD be 190 reduced by the ratio of the old segment size to the new segment 191 size. 193 The initial value of ssthresh SHOULD be arbitrarily high (for 194 example, some implementations use the size of the advertised 195 window), but ssthresh MUST be reduced in response to congestion. 196 The slow start algorithm is used when cwnd < ssthresh, while the 197 congestion avoidance algorithm is used when cwnd > ssthresh. When 198 cwnd and ssthresh are equal, the sender may use either slow start or 199 congestion avoidance. 201 During slow start, a TCP increments cwnd by at most SMSS bytes for 202 each ACK received that acknowledges new data. Slow start ends when 203 cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted 204 above) or when congestion is observed. While traditionally TCP 205 implementations have increased cwnd by precisely SMSS bytes upon 206 receipt of an ACK covering new data, we RECOMMEND that TCP 207 implementations increase cwnd per: 209 cwnd += min (N, SMSS) (2) 211 where N is the number of previously unacknowledged bytes 212 acknowledged in the incoming ACK. 
This adjustment is part of 213 Appropriate Byte Counting [RFC3465] and provides robustness against 214 misbehaving receivers that may attempt to induce a sender to 215 artificially inflate cwnd using a mechanism known as "ACK Division" 216 [SCWA99]. ACK Division consists of a receiver sending multiple ACKs 217 for a single TCP data segment, each acknowledging only a portion of 218 its data. A TCP that increments cwnd by SMSS for each such ACK will 219 inappropriately inflate the amount of data injected into the 220 network. 222 During congestion avoidance, cwnd is incremented by roughly 1 223 full-sized segment per round-trip time (RTT). Congestion avoidance 224 continues until congestion is detected. The basic guidelines for 225 incrementing cwnd during congestion avoidance are: 227 * MAY increment cwnd by SMSS bytes 229 * SHOULD increment cwnd per equation (2) 231 * MUST NOT increment cwnd by more than SMSS bytes 233 We note that [RFC3465] allows for cwnd increases of more than SMSS 234 bytes for incoming acknowledgments during slow start on an 235 experimental basis; however, such behavior is not allowed as part of 236 the standard. 238 The RECOMMENDED way to increase cwnd during congestion avoidance is 239 to count the number of bytes that have been acknowledged by ACKs for 240 new data. (A drawback of this implementation is that it requires 241 maintaining an additional state variable.) When the number of bytes 242 acknowledged reaches cwnd, then cwnd can be incremented by up to 243 SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be 244 increased by more than SMSS bytes per RTT. This method both allows 245 TCPs to increase cwnd by one segment per RTT in the face of delayed 246 ACKs and provides robustness against ACK Division attacks. 
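The initial window bound, the slow start increase of equation (2), and the byte-counting approach to congestion avoidance described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: all names are hypothetical, all quantities are in bytes, and carrying the surplus of the byte counter over to the next round (rather than resetting it to zero) is a choice made for this example.

```python
def initial_window(smss: int) -> int:
    """Upper bound on IW in bytes, following the three SMSS ranges above."""
    if smss > 2190:
        return 2 * smss
    if smss > 1095:
        return 3 * smss
    return 4 * smss


def slow_start_increase(cwnd: int, newly_acked: int, smss: int) -> int:
    """Equation (2): cwnd += min(N, SMSS), N = bytes newly acknowledged."""
    return cwnd + min(newly_acked, smss)


class ByteCountingAvoidance:
    """Congestion avoidance by counting acknowledged bytes (hypothetical name)."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0  # the additional state variable noted in the text

    def on_ack(self, newly_acked: int) -> int:
        """Process an ACK for new data and return the resulting cwnd."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # Roughly one cwnd of data was acknowledged, i.e. about one RTT.
            # Assumption: the surplus is carried over to the next round.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss  # never more than SMSS per RTT
        return self.cwnd
```

Because the increase depends on the number of bytes acknowledged rather than the number of ACKs, splitting one segment's acknowledgment into many small ACKs does not grow cwnd any faster in either function.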
248 Another common formula that a TCP MAY use to update cwnd during 249 congestion avoidance is given in equation 3: 251 cwnd += SMSS*SMSS/cwnd (3) 253 This adjustment is executed on every incoming ACK that acknowledges 254 new data. 255 Equation (3) provides an acceptable approximation to the underlying 256 principle of increasing cwnd by 1 full-sized segment per RTT. (Note 257 that for a connection in which the receiver is acknowledging 258 every-other packet, (3) is less aggressive than allowed -- roughly 259 increasing cwnd every second RTT.) 261 Implementation Note: Since integer arithmetic is usually used in TCP 262 implementations, the formula given in equation 3 can fail to 263 increase cwnd when the congestion window is larger than SMSS*SMSS. 264 If the above formula yields 0, the result SHOULD be rounded up to 1 265 byte. 267 Implementation Note: older implementations have an additional 268 additive constant on the right-hand side of equation (3). This is 269 incorrect and can actually lead to diminished performance [RFC2525]. 271 Implementation Note: some implementations maintain cwnd in units of 272 bytes, while others in units of full-sized segments. The latter 273 will find equation (3) difficult to use, and may prefer to use the 274 counting approach discussed in the previous paragraph. 276 When a TCP sender detects segment loss using the retransmission 277 timer and the given segment has not yet been retransmitted, the 278 value of ssthresh MUST be set to no more than the value given in 279 equation 4: 281 ssthresh = max (FlightSize / 2, 2*SMSS) (4) 283 where, as discussed above, FlightSize is the amount of outstanding 284 data in the network. 
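Equations (3) and (4) above can be sketched with integer arithmetic as follows. This is a minimal illustration, assuming cwnd, SMSS and FlightSize are all kept in bytes; the function names are hypothetical. Equation (4) gives an upper bound on ssthresh, and the sketch simply returns that bound.

```python
def avoidance_increase_eq3(cwnd: int, smss: int) -> int:
    """Equation (3): cwnd += SMSS*SMSS/cwnd, applied per ACK for new data."""
    increment = (smss * smss) // cwnd
    if increment == 0:
        # Integer division yields 0 once cwnd exceeds SMSS*SMSS;
        # the implementation note says to round the result up to 1 byte.
        increment = 1
    return cwnd + increment


def ssthresh_upper_bound_eq4(flight_size: int, smss: int) -> int:
    """Equation (4): max(FlightSize / 2, 2*SMSS), using integer division."""
    return max(flight_size // 2, 2 * smss)
```

For example, with SMSS = 1460 and cwnd = 14600, each ACK for new data adds 146 bytes, so ten such ACKs (about one RTT when every segment is acknowledged) add roughly one full-sized segment.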
286 On the other hand, when a TCP sender detects segment loss using the 287 retransmission timer and the given segment has already been 288 retransmitted at least once, the value of ssthresh MUST be set to no 289 more than the value given in equation 5: 291 ssthresh = max (ssthresh / 2, 2*SMSS) (5) 293 In other words, upon the first retransmission of a segment the value 294 of ssthresh should be set to half the amount of outstanding data in 295 the network, whereas on subsequent retransmissions the value of 296 ssthresh should simply be halved. 298 Implementation Note: an easy mistake to make is to simply use cwnd, 299 rather than FlightSize, which in some implementations may 300 incidentally increase well beyond rwnd. 302 Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be 303 set to no more than the loss window, LW, which equals 1 full-sized 304 segment (regardless of the value of IW). Therefore, after 305 retransmitting the dropped segment the TCP sender uses the slow 306 start algorithm to increase the window from 1 full-sized segment to 307 the new value of ssthresh, at which point congestion avoidance again 308 takes over. 310 3.2 Fast Retransmit/Fast Recovery 312 A TCP receiver SHOULD send an immediate duplicate ACK when an out- 313 of-order segment arrives. The purpose of this ACK is to inform the 314 sender that a segment was received out-of-order and which sequence 315 number is expected. From the sender's perspective, duplicate ACKs 316 can be caused by a number of network problems. First, they can be 317 caused by dropped segments. In this case, all segments after the 318 dropped segment will trigger duplicate ACKs until the loss is 319 repaired. Second, duplicate ACKs can be caused by the re-ordering 320 of data segments by the network (not a rare event along some network 321 paths [Pax97]). Finally, duplicate ACKs can be caused by 322 replication of ACK or data segments by the network. 
In addition, a 323 TCP receiver SHOULD send an immediate ACK when the incoming segment 324 fills in all or part of a gap in the sequence space. This will 325 generate more timely information for a sender recovering from a loss 326 through a retransmission timeout, a fast retransmit, or an advanced 327 loss recovery algorithm, as outlined in section 4.3. 329 The TCP sender SHOULD use the "fast retransmit" algorithm to detect 330 and repair loss, based on incoming duplicate ACKs. The fast 331 retransmit algorithm uses the arrival of 3 duplicate ACKs (4 332 identical ACKs without the arrival of any other intervening packets) 333 as an indication that a segment has been lost. After receiving 3 334 duplicate ACKs, TCP performs a retransmission of what appears to be 335 the missing segment, without waiting for the retransmission timer to 336 expire. 338 After the fast retransmit algorithm sends what appears to be the 339 missing segment, the "fast recovery" algorithm governs the 340 transmission of new data until a non-duplicate ACK arrives. The 341 reason for not performing slow start is that the receipt of the 342 duplicate ACKs not only indicates that a segment has been lost, but 343 also that segments are most likely leaving the network (although a 344 massive segment duplication by the network can invalidate this 345 conclusion). In other words, since the receiver can only generate a 346 duplicate ACK when a segment has arrived, that segment has left the 347 network and is in the receiver's buffer, so we know it is no longer 348 consuming network resources. Furthermore, since the ACK "clock" 349 [Jac88] is preserved, the TCP sender can continue to transmit new 350 segments (although transmission must continue using a reduced cwnd, 351 since loss is an indication of congestion). 353 The fast retransmit and fast recovery algorithms are implemented 354 together as follows. 356 1. 
On the first and second duplicate ACKs received at a sender, a 357 TCP SHOULD send a segment of previously unsent data per 358 [RFC3042] provided that the receiver's advertised window allows, 359 the total FlightSize would remain less than or equal to cwnd 360 plus 2*SMSS, and that new data is available for transmission. 361 Further, the TCP sender MUST NOT change cwnd to reflect these 362 two segments [RFC3042]. Note that a sender using SACK [RFC2018] 363 MUST NOT send new data unless the incoming duplicate 364 acknowledgment contains new SACK information. 366 2. When the third duplicate ACK is received, a TCP MUST set 367 ssthresh to no more than the value given in equation 4. 369 3. The lost segment MUST be retransmitted and cwnd set to 370 ssthresh plus 3*SMSS. This artificially "inflates" the 371 congestion window by the number of segments (three) that have 372 left the network and which the receiver has buffered. 374 4. For each additional duplicate ACK received (after the third), 375 cwnd MUST be incremented by SMSS. This artificially inflates 376 the congestion window in order to reflect the additional segment 377 that has left the network. 379 5. Transmit a segment, if allowed by the new value of cwnd and the 380 receiver's advertised window. 382 6. When the next ACK arrives that acknowledges new data, a TCP 383 MUST set cwnd to ssthresh (the value set in step 2). This is 384 termed "deflating" the window. 386 This ACK should be the acknowledgment elicited by the 387 retransmission from step 3, one RTT after the retransmission 388 (though it may arrive sooner in the presence of significant out- 389 of-order delivery of data segments at the 390 receiver). Additionally, this ACK should acknowledge all the 391 intermediate segments sent between the lost segment and the 392 receipt of the third duplicate ACK, if none of these were lost. 
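The six steps above can be sketched as a small sender-side state machine. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, actual segment transmission is left to the caller (each method only returns a label naming the action the steps call for), SACK is not modeled, and ssthresh is set to the upper bound given by equation (4).

```python
class FastRecoverySender:
    """Tracks cwnd and ssthresh across duplicate ACKs (hypothetical name)."""

    def __init__(self, cwnd: int, ssthresh: int, smss: int) -> None:
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.dup_acks = 0
        self.in_recovery = False

    def on_duplicate_ack(self, flight_size: int) -> str:
        """Handle one duplicate ACK and return the action for the caller."""
        self.dup_acks += 1
        if self.dup_acks < 3:
            # Step 1: cwnd is left unchanged. The caller may send one new
            # segment if rwnd allows and FlightSize stays <= cwnd + 2*SMSS.
            return "limited_transmit"
        if self.dup_acks == 3:
            # Step 2: set ssthresh per equation (4).
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 3: retransmit the lost segment and inflate cwnd by 3*SMSS.
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
            return "fast_retransmit"
        # Step 4: inflate cwnd by SMSS for each additional duplicate ACK.
        self.cwnd += self.smss
        # Step 5: the caller transmits a segment if cwnd and rwnd allow.
        return "send_if_allowed"

    def on_new_data_ack(self) -> None:
        """Step 6: deflate cwnd when an ACK for new data arrives."""
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dup_acks = 0
```

Starting from cwnd = 14600, SMSS = 1460 and FlightSize = 14600, the third duplicate ACK sets ssthresh to 7300 and cwnd to 11680, a fourth raises cwnd to 13140, and the next ACK for new data deflates cwnd to 7300.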
394 Note: This algorithm is known to generally not recover efficiently 395 from multiple losses in a single flight of packets [FF96]. Section 396 4.3 below addresses such cases. 398 4. Additional Considerations 400 4.1 Re-starting Idle Connections 402 A known problem with the TCP congestion control algorithms described 403 above is that they allow a potentially inappropriate burst of 404 traffic to be transmitted after TCP has been idle for a relatively 405 long period of time. After an idle period, TCP cannot use the ACK 406 clock to strobe new segments into the network, as all the ACKs have 407 drained from the network. Therefore, as specified above, TCP can 408 potentially send a cwnd-size line-rate burst into the network after 409 an idle period. 411 [Jac88] recommends that a TCP use slow start to restart 412 transmission after a relatively long idle period. Slow start 413 serves to restart the ACK clock, just as it does at the beginning 414 of a transfer. This mechanism has been widely deployed in the 415 following manner. When TCP has not received a segment for more 416 than one retransmission timeout, cwnd is reduced to the value of 417 the restart window (RW) before transmission begins. 419 For the purposes of this standard, we define RW = min(IW,cwnd). 421 Using the last time a segment was received to determine whether or 422 not to decrease cwnd can fail to deflate cwnd in the common case of 423 persistent HTTP connections [HTH98]. In this case, a Web server 424 receives a request before transmitting data to the Web client. The 425 reception of the request makes the test for an idle connection fail, 426 and allows the TCP to begin transmission with a possibly 427 inappropriately large cwnd. 429 Therefore, a TCP SHOULD set cwnd to no more than RW before beginning 430 transmission if the TCP has not sent data in an interval exceeding 431 the retransmission timeout. 
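The restart rule of section 4.1 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, idle_time is assumed to be the time since the TCP last sent data, and both times are assumed to be in the same unit (seconds here).

```python
def restart_window(iw: int, cwnd: int) -> int:
    """RW = min(IW, cwnd), as defined in section 4.1."""
    return min(iw, cwnd)


def cwnd_before_sending(cwnd: int, iw: int, idle_time: float, rto: float) -> int:
    """Return the cwnd to use when transmission resumes.

    If no data has been sent for an interval exceeding the retransmission
    timeout, cwnd is reduced to no more than RW; otherwise it is unchanged.
    """
    if idle_time > rto:
        return min(cwnd, restart_window(iw, cwnd))
    return cwnd
```

Measuring idleness from the last data sent, rather than the last segment received, means that an incoming request on a persistent HTTP connection does not defeat the idle test.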
433 4.2 Generating Acknowledgments 435 The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a 436 TCP receiver. When using delayed ACKs, a TCP receiver MUST NOT 437 excessively delay acknowledgments. Specifically, an ACK SHOULD be 438 generated for at least every second full-sized segment, and MUST be 439 generated within 500 ms of the arrival of the first unacknowledged 440 packet. 442 The requirement that an ACK "SHOULD" be generated for at least every 443 second full-sized segment is listed in [RFC1122] in one place as a 444 SHOULD and another as a MUST. Here we unambiguously state it is a 445 SHOULD. We also emphasize that this is a SHOULD, meaning that an 446 implementor should indeed only deviate from this requirement after 447 careful consideration of the implications. See the discussion of 448 "Stretch ACK violation" in [RFC2525] and the references therein for a 449 discussion of the possible performance problems with generating ACKs 450 less frequently than every second full-sized segment. 452 In some cases, the sender and receiver may not agree on what 453 constitutes a full-sized segment. An implementation is deemed to 454 comply with this requirement if it sends at least one acknowledgment 455 every time it receives 2*RMSS bytes of new data from the sender, 456 where RMSS is the Maximum Segment Size specified by the receiver to 457 the sender (or the default value of 536 bytes, per [RFC1122], if the 458 receiver does not specify an MSS option during connection 459 establishment). The sender may be forced to use a segment size less 460 than RMSS due to the maximum transmission unit (MTU), the path MTU 461 discovery algorithm or other factors. For instance, consider the 462 case when the receiver announces an RMSS of X bytes but the sender 463 ends up using a segment size of Y bytes (Y < X) due to path MTU 464 discovery (or the sender's MTU size). 
The receiver will generate 465 stretch ACKs if it waits for 2*X bytes to arrive before an ACK is 466 sent. Clearly this will take more than 2 segments of size Y bytes. 467 Therefore, while a specific algorithm is not defined, it is 468 desirable for receivers to attempt to prevent this situation, for 469 example by acknowledging at least every second segment, regardless 470 of size. Finally, we repeat that an ACK MUST NOT be delayed for 471 more than 500 ms waiting on a second full-sized segment to arrive. 473 Out-of-order data segments SHOULD be acknowledged immediately, in 474 order to accelerate loss recovery. To trigger the fast retransmit 475 algorithm, the receiver SHOULD send an immediate duplicate ACK when 476 it receives a data segment above a gap in the sequence space. To 477 provide feedback to senders recovering from losses, the receiver 478 SHOULD send an immediate ACK when it receives a data segment that 479 fills in all or part of a gap in the sequence space. 481 A TCP receiver MUST NOT generate more than one ACK for every 482 incoming segment, other than to update the offered window as the 483 receiving application consumes new data [page 42, RFC793][RFC813]. 485 4.3 Loss Recovery Mechanisms 486 A number of loss recovery algorithms that augment fast retransmit 487 and fast recovery have been suggested by TCP researchers and 488 specified in the RFC series. While some of these algorithms are 489 based on the TCP selective acknowledgment (SACK) option [RFC2018], 490 such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs 491 [Hoe96,FF96,RFC3782]. The non-SACK algorithms use "partial 492 acknowledgments" (ACKs which cover previously unacknowledged data, 493 but not all the data outstanding when loss was detected) to trigger 494 retransmissions. 
While this document does not standardize any of 495 the specific algorithms that may improve fast retransmit/fast 496 recovery, these enhanced algorithms are implicitly allowed, as long 497 as they follow the general principles of the basic four algorithms 498 outlined above. 500 That is, when the first loss in a window of data is detected, 501 ssthresh MUST be set to no more than the value given by equation 502 (4). Second, until all lost segments in the window of data in 503 question are repaired, the number of segments transmitted in each 504 RTT MUST be no more than half the number of outstanding segments 505 when the loss was detected. Finally, after all loss in the given 506 window of segments has been successfully retransmitted, cwnd MUST be 507 set to no more than ssthresh and congestion avoidance MUST be used 508 to further increase cwnd. Loss in two successive windows of data, 509 or the loss of a retransmission, should be taken as two indications 510 of congestion and, therefore, cwnd (and ssthresh) MUST be lowered 511 twice in this case. 513 We RECOMMEND that TCP implementers employ some form of advanced loss 514 recovery that can cope with multiple losses in a window of data. 515 The algorithms detailed in [RFC3782] and [RFC3517] conform to the 516 general principles outlined above. We note that while these are not 517 the only two algorithms that conform to the above general principles 518 these two algorithms have been vetted by the community and are 519 currently on the standards track. 521 5. Security Considerations 523 This document requires a TCP to diminish its sending rate in the 524 presence of retransmission timeouts and the arrival of duplicate 525 acknowledgments. An attacker can therefore impair the performance 526 of a TCP connection by either causing data packets or their 527 acknowledgments to be lost, or by forging excessive duplicate 528 acknowledgments. 
Causing two congestion control events back-to-back 529 will often cut ssthresh to its minimum value of 2*SMSS, causing the 530 connection to immediately enter the slower-performing congestion 531 avoidance phase. 533 In response to the ACK division attack outlined in [SCWA99], this 534 document RECOMMENDS increasing the congestion window based on the 535 number of bytes newly acknowledged in each arriving ACK rather than 536 by a particular constant on each arriving ACK (as outlined in 537 section 3.1). 539 The Internet relies to a considerable degree on the correct 540 implementation of these algorithms in order to preserve network 541 stability and avoid congestion collapse. An attacker could cause 542 TCP endpoints to respond more aggressively in the face of congestion 543 by forging excessive duplicate acknowledgments or excessive 544 acknowledgments for new data. Conceivably, such an attack could 545 drive a portion of the network into congestion collapse. 547 6. Changes Between RFC 2001 and RFC 2581 549 This document has been extensively rewritten editorially and it is 550 not feasible to itemize the list of changes between the two 551 documents. The intention of this document is not to change any of 552 the recommendations given in RFC 2001, but to further clarify cases 553 that were not discussed in detail in RFC 2001. Specifically, this 554 document suggests what TCP connections should do after a relatively 555 long idle period, as well as specifying and clarifying some of the 556 issues pertaining to TCP ACK generation. Finally, the allowable 557 upper bound for the initial congestion window has also been raised 558 from one to two segments. 560 7. Changes Relative to RFC 2581 562 A specific definition for "duplicate acknowledgment" has been 563 added, based on the definition used by BSD TCP. 565 The initial window requirements were changed to allow Larger 566 Initial Windows as standardized in [RFC3390]. 
   Additionally, the steps to take when an initial window is discovered
   to be too large due to Path MTU Discovery [RFC1191] are detailed.

   The recommended initial value for ssthresh has been changed to say
   that it SHOULD be arbitrarily high, where it was previously MAY.
   This is to provide additional guidance to implementers on the
   matter.

   During slow start, the usage of Appropriate Byte Counting [RFC3465]
   with L=1*SMSS is explicitly recommended. The method of increasing
   cwnd given in [RFC2581] is still explicitly allowed. Byte counting
   during congestion avoidance is also recommended, while the method
   from [RFC2581] and other safe methods are still allowed.

   The treatment of ssthresh on retransmission timeout was clarified.
   Specifically, Equation (3) from [RFC2581] was split into Equations
   (4) and (5) in this document.

   The description of fast retransmit and fast recovery has been
   clarified, and the use of Limited Transmit [RFC3042] is now
   recommended.

   The restart window has been changed to min(IW, cwnd) from IW. This
   behavior was described as "experimental" in [RFC2581].

   It is now recommended that TCP implementers implement an advanced
   loss recovery algorithm conforming to the principles outlined in
   this document.

   The security considerations have been updated to discuss ACK
   division and recommend byte counting as a counter to this attack.

Acknowledgments

   The core algorithms we describe were developed by Van Jacobson
   [Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was
   developed in conjunction with Hari Balakrishnan and Sally Floyd.
   The initial congestion window size specified in this document is a
   result of work with Sally Floyd and Craig Partridge
   [RFC2414, RFC3390].

   W. Richard ("Rich") Stevens wrote the first version of this document
   [RFC2001] and co-authored the second version [RFC2581].
   The present version benefits greatly from his clarity and
   thoughtfulness of description, and we are grateful for Rich's
   contributions in elucidating TCP congestion control, as well as in
   more broadly helping us understand numerous issues relating to
   networking.

   We wish to emphasize that the shortcomings and mistakes of this
   document are solely the responsibility of the current authors.

   Some of the text from this document is taken from "TCP/IP
   Illustrated, Volume 1: The Protocols" by W. Richard Stevens
   (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
   Implementation" by Gary R. Wright and W. Richard Stevens
   (Addison-Wesley, 1995). This material is used with the permission
   of Addison-Wesley.

   Neal Cardwell, Noritoshi Demizu, Kevin Fall, Sally Floyd, Craig
   Partridge and Joe Touch contributed a number of helpful suggestions.

Normative References

   [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
       793, September 1981.

   [RFC1122] Braden, R., "Requirements for Internet Hosts --
       Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
       November 1990.

Informative References

   [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
       Algorithms for Congestion Avoidance in Computer Networks",
       Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
       pp. 1-14, June 1989.

   [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
       Tahoe, Reno and SACK TCP", Computer Communication Review, July
       1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

   [Flo94] Floyd, S., "TCP and Successive Fast Retransmits", Technical
       report, October 1994.
       ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

   [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
       Control Scheme for TCP", In ACM SIGCOMM, August 1996.

   [HTH98] Hughes, A., Touch, J. and J.
       Heidemann, "Issues in TCP Slow-Start Restart After Idle", Work
       in Progress.

   [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
       Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
       ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

   [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
       end2end-interest mailing list, April 30, 1990.
       ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

   [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
       TCP Congestion Control", Proceedings of SIGCOMM '96, August
       1996, Stanford, CA. Available from
       http://www.psc.edu/networking/papers/papers.html.

   [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
       Parameters", Technical report. Available from
       http://www.psc.edu/networking/papers/FACKnotes/current.

   [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
       Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

   [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
       813, July 1982.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
       Retransmit, and Fast Recovery Algorithms", RFC 2001, January
       1997.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
       Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
       Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
       Initial Window", RFC 2414, September 1998.

   [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
       Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
       Implementation Problems", RFC 2525, March 1999.

   [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
       Control", RFC 2581, April 1999.

   [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
       Timer", RFC 2988, November 2000.

   [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
       TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
       2001.

   [RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
       Initial Window", RFC 3390, October 2002.

   [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
       Counting (ABC)", RFC 3465, February 2003.

   [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
       Conservative Selective Acknowledgment (SACK)-based Loss Recovery
       Algorithm for TCP", RFC 3517, April 2003.

   [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
       Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
       2004.

   [SCWA99] Savage, S., Cardwell, N., Wetherall, D. and T. Anderson,
       "TCP Congestion Control With a Misbehaving Receiver", ACM
       Computer Communication Review, 29(5), October 1999.

   [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
       Addison-Wesley, 1994.

   [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
       Implementation", Addison-Wesley, 1995.

Authors' Addresses

   Mark Allman
   ICIR / ICSI
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 440 243 7361
   EMail: mallman@icir.org
   http://www.icir.org/mallman/

   Vern Paxson
   ICIR / ICSI
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 510/642-4274 x302
   EMail: vern@icir.org
   http://www.icir.org/vern/

   Ethan Blanton
   Purdue University Computer Sciences
   1398 Computer Science Building
   West Lafayette, IN 47907
   EMail: eblanton@cs.purdue.edu
   http://www.cs.purdue.edu/homes/eblanton/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any
   independent effort to identify any such rights. Information on the
   procedures with respect to rights in RFC documents can be found in
   BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.