idnits 2.17.1 draft-ietf-tcpm-rfc2581bis-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 832. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 808. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 815. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 821. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1, updated by RFC 4748 (on line 838), which is fine, but *also* found old RFC 3978, Section 5.4, paragraph 1 text on line 36. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 5 instances of too long lines in the document, the longest one being 2 characters in excess of 72. ** There are 4 instances of lines with control characters in the document. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == The copyright year in the IETF Trust Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (February 2007) is 6280 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'Flo94' is defined on line 684, but no explicit reference was found in the text ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 813 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 2001 (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 2414 (Obsoleted by RFC 3390) -- Obsolete informational reference (is this intentional?): RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 2988 (Obsoleted by RFC 6298) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) -- Obsolete informational reference (is this intentional?): RFC 3782 (Obsoleted by RFC 6582) Summary: 7 errors (**), 0 flaws (~~), 4 warnings (==), 14 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group M. Allman 2 Internet-Draft V. Paxson 3 Expires: August 2007 ICIR / ICSI 4 E. Blanton 5 Purdue University 6 February 2007 8 TCP Congestion Control 9 draft-ietf-tcpm-rfc2581bis-02.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as 21 Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other documents 25 at any time. It is inappropriate to use Internet-Drafts as 26 reference material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 Copyright Notice 36 Copyright (C) The Internet Society (2007). 38 Abstract 40 This document defines TCP's four intertwined congestion control 41 algorithms: slow start, congestion avoidance, fast retransmit, and 42 fast recovery. In addition, the document specifies how TCP should 43 begin transmission after a relatively long idle period, as well as 44 discussing various acknowledgment generation methods. 46 1. Introduction 48 This document specifies four TCP [RFC793] congestion control 49 algorithms: slow start, congestion avoidance, fast retransmit and 50 fast recovery. 
These algorithms were devised in [Jac88] and 51 [Jac90]. Their use with TCP is standardized in [RFC1122]. Additional 52 early work in additive-increase, multiplicative-decrease congestion 53 control is given in [CJ89]. 55 This document obsoletes [RFC2581], which in turn obsoleted 56 [RFC2001]. 58 In addition to specifying the congestion control algorithms, this 59 document specifies what TCP connections should do after a relatively 60 long idle period, as well as specifying and clarifying some of the 61 issues pertaining to TCP ACK generation. 63 Note that [Ste94] provides examples of these algorithms in action 64 and [WS95] provides an explanation of the source code for the BSD 65 implementation of these algorithms. 67 This document is organized as follows. Section 2 provides various 68 definitions which will be used throughout the document. Section 3 69 provides a specification of the congestion control 70 algorithms. Section 4 outlines concerns related to the congestion 71 control algorithms and, finally, Section 5 outlines security 72 considerations. 74 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 75 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 76 document are to be interpreted as described in [RFC2119]. 78 2. Definitions 80 This section provides the definition of several terms that will be 81 used throughout the remainder of this document. 83 SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or 84 both). 86 SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the 87 largest segment that the sender can transmit. This value can be 88 based on the maximum transmission unit of the network, the path 89 MTU discovery [RFC1191] algorithm, RMSS (see next item), or other 90 factors. The size does not include the TCP/IP headers and 91 options. 93 RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the 94 largest segment the receiver is willing to accept.
This is the 95 value specified in the MSS option sent by the receiver during 96 connection startup. Or, if the MSS option is not used, it is 536 97 bytes [RFC1122]. The size does not include the TCP/IP headers and 98 options. 100 FULL-SIZED SEGMENT: A segment that contains the maximum number of 101 data bytes permitted (i.e., a segment containing SMSS bytes of 102 data). 104 RECEIVER WINDOW (rwnd): The most recently advertised receiver 105 window. 107 CONGESTION WINDOW (cwnd): A TCP state variable that limits the 108 amount of data a TCP can send. At any given time, a TCP MUST 109 NOT send data with a sequence number higher than the sum of the 110 highest acknowledged sequence number and the minimum of cwnd and 111 rwnd. 113 INITIAL WINDOW (IW): The initial window is the size of the sender's 114 congestion window after the three-way handshake is completed. 116 LOSS WINDOW (LW): The loss window is the size of the congestion 117 window after a TCP sender detects loss using its retransmission 118 timer. 120 RESTART WINDOW (RW): The restart window is the size of the 121 congestion window after a TCP restarts transmission after an 122 idle period (if the slow start algorithm is used; see section 123 4.1 for more discussion). 125 FLIGHT SIZE: The amount of data that has been sent but not yet 126 acknowledged. 128 DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a 129 "duplicate" in the following algorithms when (a) the receiver of 130 the ACK has outstanding data, (b) the incoming acknowledgment 131 carries no data, (c) the SYN and FIN bits are both off, (d) the 132 acknowledgment number is equal to the greatest acknowledgment 133 received on the given connection (SND.UNA from [RFC793]) and (e) 134 the advertised window in the incoming acknowledgment equals the 135 advertised window in the last incoming acknowledgment.
136 Alternatively, a TCP that utilizes selective acknowledgments 137 [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" 138 if the ACK contains previously unknown SACK information. 140 3. Congestion Control Algorithms 142 This section defines the four congestion control algorithms: slow 143 start, congestion avoidance, fast retransmit and fast recovery, 144 developed in [Jac88] and [Jac90]. In some situations it may be 145 beneficial for a TCP sender to be more conservative than the 146 algorithms allow, however a TCP MUST NOT be more aggressive than the 147 following algorithms allow (that is, MUST NOT send data when the 148 value of cwnd computed by the following algorithms would not allow 149 the data to be sent). 151 3.1 Slow Start and Congestion Avoidance 153 The slow start and congestion avoidance algorithms MUST be used by a 154 TCP sender to control the amount of outstanding data being injected 155 into the network. To implement these algorithms, two variables are 156 added to the TCP per-connection state. The congestion window (cwnd) 157 is a sender-side limit on the amount of data the sender can transmit 158 into the network before receiving an acknowledgment (ACK), while the 159 receiver's advertised window (rwnd) is a receiver-side limit on the 160 amount of outstanding data. The minimum of cwnd and rwnd governs 161 data transmission. 163 Another state variable, the slow start threshold (ssthresh), is used 164 to determine whether the slow start or congestion avoidance 165 algorithm is used to control data transmission, as discussed below. 167 Beginning transmission into a network with unknown conditions 168 requires TCP to slowly probe the network to determine the available 169 capacity, in order to avoid congesting the network with an 170 inappropriately large burst of data. The slow start algorithm is 171 used for this purpose at the beginning of a transfer, or after 172 repairing loss detected by the retransmission timer. 
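The send limit described above (the minimum of cwnd and rwnd, applied on top of the highest acknowledged sequence number as in the cwnd definition of section 2) can be illustrated with a minimal sketch. The function and parameter names are hypothetical and chosen for illustration only, and the sketch assumes plain integers that never wrap, which real TCP sequence numbers do.

```python
def usable_window(snd_una: int, snd_nxt: int, cwnd: int, rwnd: int) -> int:
    """Return how many more bytes a sender may transmit right now.

    snd_una: highest acknowledged sequence number (name borrowed from
             RFC 793's SND.UNA; treated here as a plain integer).
    snd_nxt: next sequence number the sender would use (hypothetical name).
    cwnd, rwnd: congestion window and receiver window, in bytes.

    Assumption: sequence numbers do not wrap around in this sketch.
    """
    # Data beyond snd_una + min(cwnd, rwnd) must not be sent.
    limit = snd_una + min(cwnd, rwnd)
    # Never report a negative allowance if the window has shrunk.
    return max(0, limit - snd_nxt)
```

With nothing outstanding, the allowance equals min(cwnd, rwnd); as FlightSize (snd_nxt minus snd_una in this sketch) grows, the allowance shrinks to zero.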
174 IW, the initial value of cwnd, MUST be set using the following 175 guidelines as an upper bound. 177 If SMSS > 2190 bytes: 178 IW = 2 * SMSS bytes and MUST NOT be more than 2 segments 179 If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): 180 IW = 3 * SMSS bytes and MUST NOT be more than 3 segments 181 if SMSS <= 1095 bytes: 182 IW = 4 * SMSS bytes and MUST NOT be more than 4 segments 184 As specified in [RFC3390], the SYN/ACK and the acknowledgment of the 185 SYN/ACK MUST NOT increase the size of the congestion window. 186 Further, if the SYN or SYN/ACK is lost, the initial window used by a 187 sender after a correctly transmitted SYN MUST be one segment 188 consisting of at most SMSS bytes. 190 A detailed rationale and discussion of the IW setting is provided in 191 [RFC3390]. 193 When larger initial windows are implemented along with Path MTU 194 Discovery [RFC1191], and the MSS being used is found to be too 195 large, the congestion window cwnd SHOULD be reduced to prevent 196 large bursts of smaller segments. Specifically, cwnd SHOULD be 197 reduced by the ratio of the old segment size to the new segment 198 size. 200 The initial value of ssthresh SHOULD be set arbitrarily high (e.g., 201 to the size of the largest possible advertised window), but ssthresh 202 MUST be reduced in response to congestion. Setting ssthresh as high 203 as possible allows the network conditions, rather than some 204 arbitrary host limit, to dictate the sending rate. In cases where 205 the end systems have a solid understanding of the network path, more 206 carefully setting the initial ssthresh value may have merit (e.g., 207 such that the end host does not create congestion along the path). 209 The slow start algorithm is used when cwnd < ssthresh, while the 210 congestion avoidance algorithm is used when cwnd > ssthresh. When 211 cwnd and ssthresh are equal the sender may use either slow start or 212 congestion avoidance. 
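As an illustration of the IW upper bounds listed earlier in this section and of the cwnd/ssthresh comparison just described, the following is a minimal sketch. The function names are hypothetical. Choosing congestion avoidance when cwnd equals ssthresh is an assumption of the sketch; the text permits either algorithm in that case.

```python
def initial_window_limit(smss: int) -> int:
    """Upper bound on IW in bytes for a given SMSS, per the three cases above."""
    if smss > 2190:
        return 2 * smss      # at most 2 segments
    if smss > 1095:
        return 3 * smss      # at most 3 segments
    return 4 * smss          # at most 4 segments


def use_slow_start(cwnd: int, ssthresh: int) -> bool:
    """True when slow start governs; False when congestion avoidance does.

    Assumption: when cwnd == ssthresh this sketch picks congestion
    avoidance, although either choice is allowed.
    """
    return cwnd < ssthresh
```

For example, an SMSS of 1460 bytes falls in the middle case and yields a bound of 3 segments (4380 bytes), while an SMSS of 536 bytes yields 4 segments (2144 bytes).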
214 During slow start, a TCP increments cwnd by at most SMSS bytes for 215 each ACK received that acknowledges new data. Slow start ends when 216 cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted 217 above) or when congestion is observed. While traditionally TCP 218 implementations have increased cwnd by precisely SMSS bytes upon 219 receipt of an ACK covering new data, we RECOMMEND that TCP 220 implementations increase cwnd, per: 222 cwnd += min (N, SMSS) (2) 224 where N is the number of previously unacknowledged bytes 225 acknowledged in the incoming ACK. This adjustment is part of 226 Appropriate Byte Counting [RFC3465] and provides robustness against 227 misbehaving receivers which may attempt to induce a sender to 228 artificially inflate cwnd using a mechanism known as "ACK Division" 229 [SCWA99]. ACK Division consists of a receiver sending multiple ACKs 230 for a single TCP data segment, each acknowledging only a portion of 231 its data. A TCP that increments cwnd by SMSS for each such ACK will 232 inappropriately inflate the amount of data injected into the 233 network. 235 During congestion avoidance, cwnd is incremented by roughly 1 236 full-sized segment per round-trip time (RTT). Congestion avoidance 237 continues until congestion is detected. The basic guidelines for 238 incrementing cwnd during congestion avoidance are: 240 * MAY increment cwnd by SMSS bytes 242 * SHOULD increment cwnd per equation (2) 244 * MUST NOT increment cwnd by more than SMSS bytes 246 We note that [RFC3465] allows for cwnd increases of more than SMSS 247 bytes for incoming acknowledgments during slow start on an 248 experimental basis, however such behavior is not allowed as part of 249 the standard. 251 The RECOMMENDED way to increase cwnd during congestion avoidance is 252 to count the number of bytes that have been acknowledged by ACKs for 253 new data. (A drawback of this implementation is that it requires 254 maintaining an additional state variable.) 
When the number of bytes 255 acknowledged reaches cwnd, then cwnd can be incremented by up to 256 SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be 257 increased by more than SMSS bytes per RTT. This method both allows 258 TCPs to increase cwnd by one segment per RTT in the face of delayed 259 ACKs and provides robustness against ACK Division attacks. 261 Another common formula that a TCP MAY use to update cwnd during 262 congestion avoidance is given in equation 3: 264 cwnd += SMSS*SMSS/cwnd (3) 266 This adjustment is executed on every incoming ACK that acknowledges 267 new data. 268 Equation (3) provides an acceptable approximation to the underlying 269 principle of increasing cwnd by 1 full-sized segment per RTT. (Note 270 that for a connection in which the receiver is acknowledging 271 every-other packet, (3) is less aggressive than allowed -- roughly 272 increasing cwnd every second RTT.) 274 Implementation Note: Since integer arithmetic is usually used in TCP 275 implementations, the formula given in equation 3 can fail to 276 increase cwnd when the congestion window is larger than SMSS*SMSS. 277 If the above formula yields 0, the result SHOULD be rounded up to 1 278 byte. 280 Implementation Note: Older implementations have an additional 281 additive constant on the right-hand side of equation (3). This is 282 incorrect and can actually lead to diminished performance [RFC2525]. 284 Implementation Note: Some implementations maintain cwnd in units of 285 bytes, while others in units of full-sized segments. The latter 286 will find equation (3) difficult to use, and may prefer to use the 287 counting approach discussed in the previous paragraph. 
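The window-increase rules of this section, namely equation (2) for slow start, the byte-counting method for congestion avoidance, and equation (3) with its round-up to 1 byte, can be sketched as follows. This is an illustrative sketch only: all names are hypothetical, cwnd is kept in bytes, and reducing the byte counter by cwnd (rather than resetting it to zero) is one possible choice, modeled on Appropriate Byte Counting [RFC3465].

```python
def slow_start_increase(cwnd: int, bytes_acked: int, smss: int) -> int:
    """Equation (2): cwnd += min(N, SMSS), where N is the newly acked byte count."""
    return cwnd + min(bytes_acked, smss)


def congestion_avoidance_eq3(cwnd: int, smss: int) -> int:
    """Equation (3) in integer arithmetic, rounded up to 1 byte if it yields 0."""
    return cwnd + max(1, (smss * smss) // cwnd)


class ByteCountingCA:
    """Byte-counting congestion avoidance: one SMSS increase per cwnd bytes acked."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0  # the additional state variable mentioned in the text

    def on_ack(self, newly_acked: int) -> int:
        """Account for an ACK of new data and return the resulting cwnd."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # Assumption: carry the excess forward instead of discarding it.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
        return self.cwnd
```

Note how an ACK that covers only 100 bytes grows cwnd by 100 bytes under equation (2), not by a full SMSS, which is what blunts ACK Division.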
289 When a TCP sender detects segment loss using the retransmission 290 timer and the given segment has not yet been retransmitted, the 291 value of ssthresh MUST be set to no more than the value given in 292 equation 4: 294 ssthresh = max (FlightSize / 2, 2*SMSS) (4) 296 where, as discussed above, FlightSize is the amount of outstanding 297 data in the network. 299 On the other hand, when a TCP sender detects segment loss using the 300 retransmission timer and the given segment has already been 301 retransmitted at least once, the value of ssthresh is held 302 constant. 304 Implementation Note: An easy mistake to make is to simply use cwnd 305 rather than FlightSize; in some implementations cwnd may 306 incidentally increase well beyond rwnd. 308 Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be 309 set to no more than the loss window, LW, which equals 1 full-sized 310 segment (regardless of the value of IW). Therefore, after 311 retransmitting the dropped segment the TCP sender uses the slow 312 start algorithm to increase the window from 1 full-sized segment to 313 the new value of ssthresh, at which point congestion avoidance again 314 takes over. 316 As shown in [FF96,RFC3782], slow start-based loss recovery after a 317 timeout can cause spurious retransmissions that trigger duplicate 318 acknowledgments. The reaction to the arrival of these duplicate 319 ACKs in TCP implementations varies widely. This document does not 320 specify how to treat such acknowledgments, but does note this as an 321 area that may benefit from additional attention, experimentation and 322 specification. 324 3.2 Fast Retransmit/Fast Recovery 325 A TCP receiver SHOULD send an immediate duplicate ACK when an out- 326 of-order segment arrives. The purpose of this ACK is to inform the 327 sender that a segment was received out-of-order and which sequence 328 number is expected.
From the sender's perspective, duplicate ACKs 329 can be caused by a number of network problems. First, they can be 330 caused by dropped segments. In this case, all segments after the 331 dropped segment will trigger duplicate ACKs until the loss is 332 repaired. Second, duplicate ACKs can be caused by the re-ordering 333 of data segments by the network (not a rare event along some network 334 paths [Pax97]). Finally, duplicate ACKs can be caused by 335 replication of ACK or data segments by the network. In addition, a 336 TCP receiver SHOULD send an immediate ACK when the incoming segment 337 fills in all or part of a gap in the sequence space. This will 338 generate more timely information for a sender recovering from a loss 339 through a retransmission timeout, a fast retransmit, or an advanced 340 loss recovery algorithm, as outlined in section 4.3. 342 The TCP sender SHOULD use the "fast retransmit" algorithm to detect 343 and repair loss, based on incoming duplicate ACKs. The fast 344 retransmit algorithm uses the arrival of 3 duplicate ACKs (as 345 defined in section 2, without any intervening ACKs which move 346 SND.UNA) as an indication that a segment has been lost. After 347 receiving 3 duplicate ACKs, TCP performs a retransmission of what 348 appears to be the missing segment, without waiting for the 349 retransmission timer to expire. 351 After the fast retransmit algorithm sends what appears to be the 352 missing segment, the "fast recovery" algorithm governs the 353 transmission of new data until a non-duplicate ACK arrives. The 354 reason for not performing slow start is that the receipt of the 355 duplicate ACKs not only indicates that a segment has been lost, but 356 also that segments are most likely leaving the network (although a 357 massive segment duplication by the network can invalidate this 358 conclusion). 
In other words, since the receiver can only generate a 359 duplicate ACK when a segment has arrived, that segment has left the 360 network and is in the receiver's buffer, so we know it is no longer 361 consuming network resources. Furthermore, since the ACK "clock" 362 [Jac88] is preserved, the TCP sender can continue to transmit new 363 segments (although transmission must continue using a reduced cwnd, 364 since loss is an indication of congestion). 366 The fast retransmit and fast recovery algorithms are implemented 367 together as follows. 369 1. On the first and second duplicate ACKs received at a sender, a 370 TCP SHOULD send a segment of previously unsent data per 371 [RFC3042] provided that the receiver's advertised window allows, 372 the total FlightSize would remain less than or equal to cwnd 373 plus 2*SMSS, and that new data is available for transmission. 374 Further, the TCP sender MUST NOT change cwnd to reflect these 375 two segments [RFC3042]. Note that a sender using SACK [RFC2018] 376 MUST NOT send new data unless the incoming duplicate 377 acknowledgment contains new SACK information. 379 2. When the third duplicate ACK is received, a TCP MUST set 380 ssthresh to no more than the value given in equation 4. 382 3. The lost segment MUST be retransmitted and cwnd set to 383 ssthresh plus 3*SMSS. This artificially "inflates" the 384 congestion window by the number of segments (three) that have 385 left the network and which the receiver has buffered. 387 4. For each additional duplicate ACK received (after the third), 388 cwnd MUST be incremented by SMSS. This artificially inflates 389 the congestion window in order to reflect the additional segment 390 that has left the network. 392 Note: [SCWA99] discusses a receiver-based attack whereby many 393 bogus duplicate ACKs are sent to the data sender in order to 394 artificially inflate cwnd and cause a higher than appropriate 395 sending rate to be used. 
A TCP MAY therefore limit the number 396 of times cwnd is artificially inflated during loss recovery 397 to the number of outstanding segments (or, an approximation 398 thereof). 400 5. Transmit a segment, if allowed by the new value of cwnd and the 401 receiver's advertised window. 403 6. When the next ACK arrives that acknowledges previously 404 unacknowledged data, a TCP MUST set cwnd to ssthresh (the value 405 set in step 2). This is termed "deflating" the window. 407 This ACK should be the acknowledgment elicited by the 408 retransmission from step 3, one RTT after the retransmission 409 (though it may arrive sooner in the presence of significant out- 410 of-order delivery of data segments at the receiver). 411 Additionally, this ACK should acknowledge all the intermediate 412 segments sent between the lost segment and the receipt of the 413 third duplicate ACK, if none of these were lost. 415 Note: This algorithm is known to generally not recover efficiently 416 from multiple losses in a single flight of packets [FF96]. Section 417 4.3 below addresses such cases. 419 4. Additional Considerations 421 4.1 Re-starting Idle Connections 423 A known problem with the TCP congestion control algorithms described 424 above is that they allow a potentially inappropriate burst of 425 traffic to be transmitted after TCP has been idle for a relatively 426 long period of time. After an idle period, TCP cannot use the ACK 427 clock to strobe new segments into the network, as all the ACKs have 428 drained from the network. Therefore, as specified above, TCP can 429 potentially send a cwnd-size line-rate burst into the network after 430 an idle period. 432 [Jac88] recommends that a TCP use slow start to restart 433 transmission after a relatively long idle period. Slow start 434 serves to restart the ACK clock, just as it does at the beginning 435 of a transfer. This mechanism has been widely deployed in the 436 following manner. 
When TCP has not received a segment for more 437 than one retransmission timeout, cwnd is reduced to the value of 438 the restart window (RW) before transmission begins. 440 For the purposes of this standard, we define RW = min(IW,cwnd). 442 Using the last time a segment was received to determine whether or 443 not to decrease cwnd can fail to deflate cwnd in the common case of 444 persistent HTTP connections [HTH98]. In this case, a Web server 445 receives a request before transmitting data to the Web client. The 446 reception of the request makes the test for an idle connection fail, 447 and allows the TCP to begin transmission with a possibly 448 inappropriately large cwnd. 450 Therefore, a TCP SHOULD set cwnd to no more than RW before beginning 451 transmission if the TCP has not sent data in an interval exceeding 452 the retransmission timeout. 454 4.2 Generating Acknowledgments 456 The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a 457 TCP receiver. When using delayed ACKs, a TCP receiver MUST NOT 458 excessively delay acknowledgments. Specifically, an ACK SHOULD be 459 generated for at least every second full-sized segment, and MUST be 460 generated within 500 ms of the arrival of the first unacknowledged 461 packet. 463 The requirement that an ACK "SHOULD" be generated for at least every 464 second full-sized segment is listed in [RFC1122] in one place as a 465 SHOULD and another as a MUST. Here we unambiguously state it is a 466 SHOULD. We also emphasize that this is a SHOULD, meaning that an 467 implementor should indeed only deviate from this requirement after 468 careful consideration of the implications. See the discussion of 469 "Stretch ACK violation" in [RFC2525] and the references therein for a 470 discussion of the possible performance problems with generating ACKs 471 less frequently than every second full-sized segment. 473 In some cases, the sender and receiver may not agree on what 474 constitutes a full-sized segment. 
An implementation is deemed to 475 comply with this requirement if it sends at least one acknowledgment 476 every time it receives 2*RMSS bytes of new data from the sender, 477 where RMSS is the Maximum Segment Size specified by the receiver to 478 the sender (or the default value of 536 bytes, per [RFC1122], if the 479 receiver does not specify an MSS option during connection 480 establishment). The sender may be forced to use a segment size less 481 than RMSS due to the maximum transmission unit (MTU), the path MTU 482 discovery algorithm or other factors. For instance, consider the 483 case when the receiver announces an RMSS of X bytes but the sender 484 ends up using a segment size of Y bytes (Y < X) due to path MTU 485 discovery (or the sender's MTU size). The receiver will generate 486 stretch ACKs if it waits for 2*X bytes to arrive before an ACK is 487 sent. Clearly this will take more than 2 segments of size Y bytes. 488 Therefore, while a specific algorithm is not defined, it is 489 desirable for receivers to attempt to prevent this situation, for 490 example by acknowledging at least every second segment, regardless 491 of size. Finally, we repeat that an ACK MUST NOT be delayed for 492 more than 500 ms waiting on a second full-sized segment to arrive. 494 Out-of-order data segments SHOULD be acknowledged immediately, in 495 order to accelerate loss recovery. To trigger the fast retransmit 496 algorithm, the receiver SHOULD send an immediate duplicate ACK when 497 it receives a data segment above a gap in the sequence space. To 498 provide feedback to senders recovering from losses, the receiver 499 SHOULD send an immediate ACK when it receives a data segment that 500 fills in all or part of a gap in the sequence space. 502 A TCP receiver MUST NOT generate more than one ACK for every 503 incoming segment, other than to update the offered window as the 504 receiving application consumes new data [page 42, RFC793][RFC813]. 
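The receiver-side rules of this section can be summarized in a small decision sketch. It is illustrative only: the names are hypothetical, the 200 ms delay is an assumed value chosen to stay well inside the 500 ms bound, and the sketch counts segments regardless of size, which is the example remedy mentioned above for avoiding stretch ACKs.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed delayed-ACK timeout; the text only requires an ACK within 500 ms.
DELAYED_ACK_TIMEOUT_MS = 200


@dataclass
class AckState:
    unacked_segments: int = 0                  # in-order segments not yet ACKed
    first_unacked_ms: Optional[float] = None   # arrival time of the oldest one


def should_ack_now(state: AckState, now_ms: float,
                   out_of_order: bool, fills_gap: bool) -> bool:
    """Decide whether the receiver should send an ACK immediately."""
    # Segments above a gap, or filling all or part of a gap, are ACKed at once.
    if out_of_order or fills_gap:
        return True
    # ACK at least every second segment, regardless of segment size.
    if state.unacked_segments >= 2:
        return True
    # Otherwise ACK once the (assumed) delayed-ACK timeout has elapsed.
    if state.first_unacked_ms is not None:
        return now_ms - state.first_unacked_ms >= DELAYED_ACK_TIMEOUT_MS
    return False
```

A real implementation would typically arm a timer when the first unacknowledged segment arrives rather than poll a predicate; the predicate form is used here only to keep the rules visible in one place.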
506 4.3 Loss Recovery Mechanisms 508 A number of loss recovery algorithms that augment fast retransmit 509 and fast recovery have been suggested by TCP researchers and 510 specified in the RFC series. While some of these algorithms are 511 based on the TCP selective acknowledgment (SACK) option [RFC2018], 512 such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs 513 [Hoe96,FF96,RFC3782]. The non-SACK algorithms use "partial 514 acknowledgments" (ACKs which cover previously unacknowledged data, 515 but not all the data outstanding when loss was detected) to trigger 516 retransmissions. While this document does not standardize any of 517 the specific algorithms that may improve fast retransmit/fast 518 recovery, these enhanced algorithms are implicitly allowed, as long 519 as they follow the general principles of the basic four algorithms 520 outlined above. 522 That is, when the first loss in a window of data is detected, 523 ssthresh MUST be set to no more than the value given by equation 524 (4). Second, until all lost segments in the window of data in 525 question are repaired, the number of segments transmitted in each 526 RTT MUST be no more than half the number of outstanding segments 527 when the loss was detected. Finally, after all loss in the given 528 window of segments has been successfully retransmitted, cwnd MUST be 529 set to no more than ssthresh and congestion avoidance MUST be used 530 to further increase cwnd. Loss in two successive windows of data, 531 or the loss of a retransmission, should be taken as two indications 532 of congestion and, therefore, cwnd (and ssthresh) MUST be lowered 533 twice in this case. 535 We RECOMMEND that TCP implementers employ some form of advanced loss 536 recovery that can cope with multiple losses in a window of data. 537 The algorithms detailed in [RFC3782] and [RFC3517] conform to the 538 general principles outlined above. 
We note that while these are not 539 the only two algorithms that conform to the above general principles, 540 these two algorithms have been vetted by the community and are 541 currently on the standards track. 543 5. Security Considerations 545 This document requires a TCP to diminish its sending rate in the 546 presence of retransmission timeouts and the arrival of duplicate 547 acknowledgments. An attacker can therefore impair the performance 548 of a TCP connection by either causing data packets or their 549 acknowledgments to be lost, or by forging excessive duplicate 550 acknowledgments. Causing two congestion control events back-to-back 551 will often cut ssthresh to its minimum value of 2*SMSS, causing the 552 connection to immediately enter the slower-performing congestion 553 avoidance phase. 555 In response to the ACK division attack outlined in [SCWA99] this 556 document RECOMMENDS increasing the congestion window based on the 557 number of bytes newly acknowledged in each arriving ACK rather than 558 by a particular constant on each arriving ACK (as outlined in 559 section 3.1). 561 The Internet to a considerable degree relies on the correct 562 implementation of these algorithms in order to preserve network 563 stability and avoid congestion collapse. An attacker could cause 564 TCP endpoints to respond more aggressively in the face of congestion 565 by forging excessive duplicate acknowledgments or excessive 566 acknowledgments for new data. Conceivably, such an attack could 567 drive a portion of the network into congestion collapse. 569 6. Changes Between RFC 2001 and RFC 2581 571 This document has been extensively rewritten editorially and it is 572 not feasible to itemize the list of changes between the two 573 documents. The intention of this document is not to change any of 574 the recommendations given in RFC 2001, but to further clarify cases 575 that were not discussed in detail in RFC 2001.
Specifically, this 576 document suggests what TCP connections should do after a relatively 577 long idle period, as well as specifying and clarifying some of the 578 issues pertaining to TCP ACK generation. Finally, the allowable 579 upper bound for the initial congestion window has also been raised 580 from one to two segments. 582 7. Changes Relative to RFC 2581 584 A specific definition for "duplicate acknowledgment" has been 585 added, based on the definition used by BSD TCP. 587 The document now notes that what to do with duplicate ACKs after the 588 retransmission timer has fired is future work and explicitly 589 unspecified in this document. 591 The initial window requirements were changed to allow Larger 592 Initial Windows as standardized in [RFC3390]. Additionally, the 593 steps to take when an initial window is discovered to be too large 594 due to Path MTU Discovery [RFC1191] are detailed. 596 The recommended initial value for ssthresh has been changed to say 597 that it SHOULD be arbitrarily high, where it was previously MAY. 598 This is to provide additional guidance to implementors on the 599 matter. 601 During slow start, the usage of Appropriate Byte Counting [RFC3465] 602 with L=1*SMSS is explicitly recommended. The method of increasing 603 cwnd given in [RFC2581] is still explicitly allowed. Byte counting 604 during congestion avoidance is also recommended, while the method 605 from [RFC2581] and other safe methods are still allowed. 607 The treatment of ssthresh on retransmission timeout was clarified. 608 In particular, ssthresh must be set to half the FlightSize on the 609 first retransmission of a given segment and then is held constant on 610 subsequent retransmissions of the same segment. 612 The description of fast retransmit and fast recovery has been 613 clarified, and the use of Limited Transmit [RFC3042] is now 614 recommended. 
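
As an illustration only, the byte-counting and retransmission-timeout
changes listed above might be sketched in Python as follows. The class
and method names, the 1460-byte SMSS, and the bookkeeping by sequence
number are assumptions made for this sketch and do not come from this
document. The ssthresh floor of 2*SMSS is the minimum value noted in
Section 5; resetting cwnd to one segment after a timeout is likewise
an assumption of the sketch.

```python
SMSS = 1460  # assumed sender maximum segment size, in bytes


class CongestionState:
    """Hypothetical sender state; field names mirror the draft's variables."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd          # congestion window, in bytes
        self.ssthresh = ssthresh  # slow start threshold, in bytes
        self.rto_seq = None       # segment whose timeout last lowered ssthresh

    def on_new_ack_slow_start(self, bytes_acked):
        # Byte counting with L = 1*SMSS: grow cwnd by the number of newly
        # acknowledged bytes, but by no more than one SMSS per ACK.
        self.cwnd += min(bytes_acked, SMSS)

    def on_rto(self, seq, flight_size):
        # Lower ssthresh only on the first retransmission of this segment;
        # later timeouts of the same segment leave it unchanged.
        if seq != self.rto_seq:
            self.ssthresh = max(flight_size // 2, 2 * SMSS)
            self.rto_seq = seq
        # Assumption of this sketch: restart from a one-segment window.
        self.cwnd = SMSS
```

With these assumptions, a timeout with ten segments outstanding sets
ssthresh to five segments, and a second timeout of the same segment
leaves ssthresh where it is rather than halving it again. An ACK that
covers only part of a segment grows cwnd by just the bytes it newly
acknowledges.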

616  TCPs now MAY limit the number of duplicate ACKs that artificially
617  inflate cwnd during loss recovery to the number of segments
618  outstanding, to avoid the duplicate ACK spoofing attack described in
619  [SCWA99].

621  The restart window has been changed to min(IW,cwnd) from IW. This
622  behavior was described as "experimental" in [RFC2581].

624  It is now recommended that TCP implementers implement an advanced
625  loss recovery algorithm conforming to the principles outlined in
626  this document.

628  The security considerations have been updated to discuss ACK
629  division and recommend byte counting as a counter to this attack.

631  Acknowledgments

633  The core algorithms we describe were developed by Van Jacobson
634  [Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was
635  developed in conjunction with Hari Balakrishnan and Sally Floyd.
636  The initial congestion window size specified in this document is a
637  result of work with Sally Floyd and Craig Partridge
638  [RFC2414,RFC3390].

640  W. Richard ("Rich") Stevens wrote the first version of this document
641  [RFC2001] and co-authored the second version [RFC2581]. This
642  present version benefits greatly from his clarity and thoughtfulness
643  of description, and we are grateful for Rich's contributions in
644  elucidating TCP congestion control, as well as in more broadly
645  helping us understand numerous issues relating to networking.

647  We wish to emphasize that the shortcomings and mistakes of this
648  document are solely the responsibility of the current authors.

650  Some of the text from this document is taken from "TCP/IP
651  Illustrated, Volume 1: The Protocols" by W. Richard Stevens
652  (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
653  Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
654  Wesley, 1995). This material is used with the permission of
655  Addison-Wesley.

657  Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John
658  Heffner, Alfred Hoenes, Sally Floyd, Reiner Ludwig, Matt Mathis,
659  Craig Partridge and Joe Touch contributed a number of helpful
660  suggestions.

662  Normative References

664  [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
665  793, September 1981.

667  [RFC1122] Braden, R., "Requirements for Internet Hosts --
668  Communication Layers", STD 3, RFC 1122, October 1989.

670  [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
671  November 1990.

673  Informative References

675  [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
676  Algorithms for Congestion Avoidance in Computer Networks",
677  Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
678  pp. 1-14, June 1989.

680  [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
681  Tahoe, Reno and SACK TCP", Computer Communication Review, July
682  1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

684  [Flo94] Floyd, S., "TCP and Successive Fast Retransmits", Technical
685  report, October 1994.
686  ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

688  [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
689  Control Scheme for TCP", In ACM SIGCOMM, August 1996.

691  [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
692  Slow-Start Restart After Idle", Work in Progress.

694  [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
695  Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
696  ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

698  [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
699  end2end-interest mailing list, April 30, 1990.
700  ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

702  [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
703  TCP Congestion Control", Proceedings of SIGCOMM'96, August,
704  1996, Stanford, CA. Available
705  from http://www.psc.edu/networking/papers/papers.html

707  [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
708  Parameters", Technical report. Available from
709  http://www.psc.edu/networking/papers/FACKnotes/current.

711  [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
712  Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

714  [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
715  813, July 1982.

717  [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
718  Retransmit, and Fast Recovery Algorithms", RFC 2001, January
719  1997.

721  [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
722  Selective Acknowledgement Options", RFC 2018, October 1996.

724  [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
725  Requirement Levels", BCP 14, RFC 2119, March 1997.

727  [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
728  Initial Window Size", RFC 2414, September 1998.

730  [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
731  Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
732  Implementation Problems", RFC 2525, March 1999.

734  [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
735  Control", RFC 2581, April 1999.

737  [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
738  Extension to the Selective Acknowledgement (SACK) Option for
739  TCP", RFC 2883, July 2000.

741  [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
742  Timer", RFC 2988, November 2000.

744  [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
745  TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
746  2001.

748  [RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
749  Initial Window", RFC 3390, October 2002.

751  [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
752  Counting (ABC)", RFC 3465, February 2003.

754  [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
755  Conservative Selective Acknowledgment (SACK)-based Loss Recovery
756  Algorithm for TCP", RFC 3517, April 2003.

758  [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
759  Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
760  2004.

762  [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
763  "TCP Congestion Control With a Misbehaving Receiver", ACM
764  Computer Communication Review, 29(5), October 1999.

766  [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
767  Addison-Wesley, 1994.

769  [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
770  Implementation", Addison-Wesley, 1995.

772  Authors' Addresses

774  Mark Allman
775  ICIR / ICSI
776  1947 Center Street
777  Suite 600
778  Berkeley, CA 94704-1198
779  Phone: +1 440 235 1792
780  EMail: mallman@icir.org
781  http://www.icir.org/mallman/

783  Vern Paxson
784  ICIR / ICSI
785  1947 Center Street
786  Suite 600
787  Berkeley, CA 94704-1198
788  Phone: +1 510/642-4274 x302
789  EMail: vern@icir.org
790  http://www.icir.org/vern/

792  Ethan Blanton
793  Purdue University Computer Sciences
794  1398 Computer Science Building
795  West Lafayette, IN 47907
796  EMail: eblanton@cs.purdue.edu
797  http://www.cs.purdue.edu/homes/eblanton/

799  Intellectual Property Statement

801  The IETF takes no position regarding the validity or scope of any
802  Intellectual Property Rights or other rights that might be claimed
803  to pertain to the implementation or use of the technology described
804  in this document or the extent to which any license under such
805  rights might or might not be available; nor does it represent that
806  it has made any independent effort to identify any such rights.
807  Information on the procedures with respect to rights in RFC
808  documents can be found in BCP 78 and BCP 79.

810  Copies of IPR disclosures made to the IETF Secretariat and any
811  assurances of licenses to be made available, or the result of an
812  attempt made to obtain a general license or permission for the use
813  of such proprietary rights by implementers or users of this
814  specification can be obtained from the IETF on-line IPR repository
815  at http://www.ietf.org/ipr.

817  The IETF invites any interested party to bring to its attention any
818  copyrights, patents or patent applications, or other proprietary
819  rights that may cover technology that may be required to implement
820  this standard. Please address the information to the IETF at
821  ietf-ipr@ietf.org.

823  Disclaimer of Validity

825  This document and the information contained herein are provided
826  on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
827  REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
828  IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
829  WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
830  WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
831  ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
832  FOR A PARTICULAR PURPOSE.

834  Copyright Statement

836  Copyright (C) The IETF Trust (2007). This document is subject to
837  the rights, licenses and restrictions contained in BCP 78, and
838  except as set forth therein, the authors retain all their rights.

840  Acknowledgment

842  Funding for the RFC Editor function is currently provided by the
843  Internet Society.