idnits 2.17.1

draft-ietf-tcpm-rfc2581bis-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices. See
     https://trustee.ietf.org/license-info/.)

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008. The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust. If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 2009) is 5459 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)

     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                          M. Allman
Internet-Draft                                                 V. Paxson
Expires: October 2009                                               ICSI
                                                              E. Blanton
                                                       Purdue University
                                                                May 2009

                         TCP Congestion Control
                   draft-ietf-tcpm-rfc2581bis-05.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79. This document may contain
   material from IETF Documents or IETF Contributions published or made
   publicly available before November 10, 2008. The person(s)
   controlling the copyright in some of this material may not have
   granted the IETF Trust the right to allow modifications of such
   material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Statement

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   This document defines TCP's four intertwined congestion control
   algorithms: slow start, congestion avoidance, fast retransmit, and
   fast recovery. In addition, the document specifies how TCP should
   begin transmission after a relatively long idle period, as well as
   discussing various acknowledgment generation methods.

1. Introduction

   This document specifies four TCP [RFC793] congestion control
   algorithms: slow start, congestion avoidance, fast retransmit and
   fast recovery. These algorithms were devised in [Jac88] and
   [Jac90]. Their use with TCP is standardized in [RFC1122].
   Additional early work in additive-increase, multiplicative-decrease
   congestion control is given in [CJ89].

   Note that [Ste94] provides examples of these algorithms in action
   and [WS95] provides an explanation of the source code for the BSD
   implementation of these algorithms.

   In addition to specifying these congestion control algorithms, this
   document specifies what TCP connections should do after a relatively
   long idle period, as well as specifying and clarifying some of the
   issues pertaining to TCP ACK generation.

   This document obsoletes [RFC2581], which in turn obsoleted
   [RFC2001].

   This document is organized as follows. Section 2 provides various
   definitions which will be used throughout the document. Section 3
   provides a specification of the congestion control
   algorithms. Section 4 outlines concerns related to the congestion
   control algorithms and finally, section 5 outlines security
   considerations.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Definitions

   This section provides the definition of several terms that will be
   used throughout the remainder of this document.

   SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
      both).

   SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
      largest segment that the sender can transmit. This value can be
      based on the maximum transmission unit of the network, the path
      MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
      or other factors. The size does not include the TCP/IP headers
      and options.

   RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
      largest segment the receiver is willing to accept. This is the
      value specified in the MSS option sent by the receiver during
      connection startup. Or, if the MSS option is not used, 536
      bytes [RFC1122]. The size does not include the TCP/IP headers
      and options.

   FULL-SIZED SEGMENT: A segment that contains the maximum number of
      data bytes permitted (i.e., a segment containing SMSS bytes of
      data).

   RECEIVER WINDOW (rwnd): The most recently advertised receiver
      window.

   CONGESTION WINDOW (cwnd): A TCP state variable that limits the
      amount of data a TCP can send. At any given time, a TCP MUST
      NOT send data with a sequence number higher than the sum of the
      highest acknowledged sequence number and the minimum of cwnd and
      rwnd.

   INITIAL WINDOW (IW): The initial window is the size of the sender's
      congestion window after the three-way handshake is completed.

   LOSS WINDOW (LW): The loss window is the size of the congestion
      window after a TCP sender detects loss using its retransmission
      timer.

   RESTART WINDOW (RW): The restart window is the size of the
      congestion window after a TCP restarts transmission after an
      idle period (if the slow start algorithm is used; see section
      4.1 for more discussion).

   FLIGHT SIZE: The amount of data that has been sent but not yet
      cumulatively acknowledged.

   DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
      "duplicate" in the following algorithms when (a) the receiver of
      the ACK has outstanding data, (b) the incoming acknowledgment
      carries no data, (c) the SYN and FIN bits are both off, (d) the
      acknowledgment number is equal to the greatest acknowledgment
      received on the given connection (TCP.UNA from [RFC793]) and (e)
      the advertised window in the incoming acknowledgment equals the
      advertised window in the last incoming acknowledgment.

      Alternatively, a TCP that utilizes selective acknowledgments
      [RFC2018,RFC2883] can leverage the SACK information to determine
      when an incoming ACK is a "duplicate" (e.g., if the ACK contains
      previously unknown SACK information).

3. Congestion Control Algorithms

   This section defines the four congestion control algorithms: slow
   start, congestion avoidance, fast retransmit and fast recovery,
   developed in [Jac88] and [Jac90]. In some situations it may be
   beneficial for a TCP sender to be more conservative than the
   algorithms allow, however a TCP MUST NOT be more aggressive than the
   following algorithms allow (that is, MUST NOT send data when the
   value of cwnd computed by the following algorithms would not allow
   the data to be sent).

   Also note that the algorithms specified in this document work in
   terms of using loss as the signal of congestion. Explicit
   Congestion Notification (ECN) could also be used as specified in
   [RFC3168].
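
   As an illustration of the sending limit stated in the cwnd
   definition in section 2 and restated above, the following is a
   minimal sketch, not a normative implementation. The function name
   and parameter names are hypothetical, sequence numbers are treated
   as plain integers, and 32-bit sequence number wraparound is
   deliberately ignored.

```python
def usable_window(snd_una: int, snd_nxt: int, cwnd: int, rwnd: int) -> int:
    """Return how many bytes of new data may be sent right now.

    snd_una: highest acknowledged sequence number (assumed name)
    snd_nxt: next sequence number to be sent (assumed name)
    cwnd, rwnd: congestion window and receiver window, in bytes
    Sequence wraparound is ignored in this sketch.
    """
    # Data must not carry a sequence number beyond
    # highest-acknowledged + min(cwnd, rwnd).
    limit = snd_una + min(cwnd, rwnd)
    # Never report a negative allowance when the limit has shrunk.
    return max(0, limit - snd_nxt)
```

   For example, with 2000 bytes already outstanding, cwnd of 4000
   bytes and rwnd of 8000 bytes, the sketch allows 2000 more bytes.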

3.1 Slow Start and Congestion Avoidance

   The slow start and congestion avoidance algorithms MUST be used by a
   TCP sender to control the amount of outstanding data being injected
   into the network. To implement these algorithms, two variables are
   added to the TCP per-connection state. The congestion window (cwnd)
   is a sender-side limit on the amount of data the sender can transmit
   into the network before receiving an acknowledgment (ACK), while the
   receiver's advertised window (rwnd) is a receiver-side limit on the
   amount of outstanding data. The minimum of cwnd and rwnd governs
   data transmission.

   Another state variable, the slow start threshold (ssthresh), is used
   to determine whether the slow start or congestion avoidance
   algorithm is used to control data transmission, as discussed below.

   Beginning transmission into a network with unknown conditions
   requires TCP to slowly probe the network to determine the available
   capacity, in order to avoid congesting the network with an
   inappropriately large burst of data. The slow start algorithm is
   used for this purpose at the beginning of a transfer, or after
   repairing loss detected by the retransmission timer. Slow start
   additionally serves to start the "ACK clock" used by the TCP sender
   to release data into the network in the slow start, congestion
   avoidance, and loss recovery algorithms.

   IW, the initial value of cwnd, MUST be set using the following
   guidelines as an upper bound.

   If SMSS > 2190 bytes:
       IW = 2 * SMSS bytes and MUST NOT be more than 2 segments

   If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
       IW = 3 * SMSS bytes and MUST NOT be more than 3 segments

   If SMSS <= 1095 bytes:
       IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

   As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
   SYN/ACK MUST NOT increase the size of the congestion window.
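
   The IW upper bound given above can be sketched as a small function.
   This is only an illustration under the assumption that SMSS is
   known in bytes; the function name is hypothetical and the result is
   an upper bound, not a required value.

```python
def initial_window(smss: int) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS.

    Follows the three SMSS ranges in the text; the segment-count caps
    (2, 3 or 4 segments) are implied by the multiplier in each branch.
    """
    if smss > 2190:
        return 2 * smss   # at most 2 segments
    if smss > 1095:
        return 3 * smss   # at most 3 segments
    return 4 * smss       # at most 4 segments
```

   With a common Ethernet-derived SMSS of 1460 bytes, the sketch
   yields an upper bound of 3 segments, or 4380 bytes.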
   Further, if the SYN or SYN/ACK is lost, the initial window used by a
   sender after a correctly transmitted SYN MUST be one segment
   consisting of at most SMSS bytes.

   A detailed rationale and discussion of the IW setting is provided in
   [RFC3390].

   When initial congestion windows of more than one segment are
   implemented along with Path MTU Discovery [RFC1191], and the MSS
   being used is found to be too large, the congestion window cwnd
   SHOULD be reduced to prevent large bursts of smaller segments.
   Specifically, cwnd SHOULD be reduced by the ratio of the old segment
   size to the new segment size.

   The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
   to the size of the largest possible advertised window), but ssthresh
   MUST be reduced in response to congestion. Setting ssthresh as high
   as possible allows the network conditions, rather than some
   arbitrary host limit, to dictate the sending rate. In cases where
   the end systems have a solid understanding of the network path, more
   carefully setting the initial ssthresh value may have merit (e.g.,
   such that the end host does not create congestion along the path).

   The slow start algorithm is used when cwnd < ssthresh, while the
   congestion avoidance algorithm is used when cwnd > ssthresh. When
   cwnd and ssthresh are equal the sender may use either slow start or
   congestion avoidance.

   During slow start, a TCP increments cwnd by at most SMSS bytes for
   each ACK received that cumulatively acknowledges new data. Slow
   start ends when cwnd exceeds ssthresh (or, optionally, when it
   reaches it, as noted above) or when congestion is observed.
   While traditionally TCP implementations have increased cwnd by
   precisely SMSS bytes upon receipt of an ACK covering new data, we
   RECOMMEND that TCP implementations increase cwnd, per:

      cwnd += min (N, SMSS)                      (2)

   where N is the number of previously unacknowledged bytes
   acknowledged in the incoming ACK. This adjustment is part of
   Appropriate Byte Counting [RFC3465] and provides robustness against
   misbehaving receivers which may attempt to induce a sender to
   artificially inflate cwnd using a mechanism known as "ACK Division"
   [SCWA99]. ACK Division consists of a receiver sending multiple ACKs
   for a single TCP data segment, each acknowledging only a portion of
   its data. A TCP that increments cwnd by SMSS for each such ACK will
   inappropriately inflate the amount of data injected into the
   network.

   During congestion avoidance, cwnd is incremented by roughly 1
   full-sized segment per round-trip time (RTT). Congestion avoidance
   continues until congestion is detected. The basic guidelines for
   incrementing cwnd during congestion avoidance are:

      * MAY increment cwnd by SMSS bytes

      * SHOULD increment cwnd per equation (2) once per RTT

      * MUST NOT increment cwnd by more than SMSS bytes

   We note that [RFC3465] allows for cwnd increases of more than SMSS
   bytes for incoming acknowledgments during slow start on an
   experimental basis, however such behavior is not allowed as part of
   the standard.

   The RECOMMENDED way to increase cwnd during congestion avoidance is
   to count the number of bytes that have been acknowledged by ACKs for
   new data. (A drawback of this implementation is that it requires
   maintaining an additional state variable.) When the number of bytes
   acknowledged reaches cwnd, then cwnd can be incremented by up to
   SMSS bytes. Note that during congestion avoidance, cwnd MUST NOT be
   increased by more than SMSS bytes per RTT.
   This method both allows TCPs to increase cwnd by one segment per RTT
   in the face of delayed ACKs and provides robustness against ACK
   Division attacks.

   Another common formula that a TCP MAY use to update cwnd during
   congestion avoidance is given in equation 3:

      cwnd += SMSS*SMSS/cwnd                     (3)

   This adjustment is executed on every incoming ACK that acknowledges
   new data. Equation (3) provides an acceptable approximation to the
   underlying principle of increasing cwnd by 1 full-sized segment per
   RTT. (Note that for a connection in which the receiver is
   acknowledging every-other packet, (3) is less aggressive than
   allowed -- roughly increasing cwnd every second RTT.)

   Implementation Note: Since integer arithmetic is usually used in TCP
   implementations, the formula given in equation 3 can fail to
   increase cwnd when the congestion window is larger than SMSS*SMSS.
   If the above formula yields 0, the result SHOULD be rounded up to 1
   byte.

   Implementation Note: Older implementations have an additional
   additive constant on the right-hand side of equation (3). This is
   incorrect and can actually lead to diminished performance [RFC2525].

   Implementation Note: Some implementations maintain cwnd in units of
   bytes, while others in units of full-sized segments. The latter
   will find equation (3) difficult to use, and may prefer to use the
   counting approach discussed in the previous paragraph.

   When a TCP sender detects segment loss using the retransmission
   timer and the given segment has not yet been resent by way of the
   retransmission timer, the value of ssthresh MUST be set to no more
   than the value given in equation 4:

      ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

   where, as discussed above, FlightSize is the amount of outstanding
   data in the network.
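
   The window arithmetic of equations (2) through (4) can be sketched
   as follows. This is an illustrative sketch rather than a reference
   implementation: the class, field and function names are
   hypothetical, cwnd is kept in bytes, and the case where cwnd equals
   ssthresh is arbitrarily handled as congestion avoidance (the text
   permits either algorithm). The byte-counting branch follows the
   counting approach described above, with the counter reduced by cwnd
   before the increment, which is an assumption borrowed from
   [RFC3465].

```python
from dataclasses import dataclass


@dataclass
class CongestionState:
    smss: int             # sender maximum segment size, bytes
    cwnd: int             # congestion window, bytes
    ssthresh: int         # slow start threshold, bytes
    bytes_acked: int = 0  # extra state variable used for byte counting


def on_new_ack(s: CongestionState, n: int) -> None:
    """Update cwnd for an ACK that newly acknowledges n bytes."""
    if s.cwnd < s.ssthresh:
        # Slow start, equation (2): grow by at most SMSS per ACK.
        s.cwnd += min(n, s.smss)
    else:
        # Congestion avoidance by byte counting: once a full cwnd of
        # data has been acknowledged, grow by one SMSS.
        s.bytes_acked += n
        if s.bytes_acked >= s.cwnd:
            s.bytes_acked -= s.cwnd
            s.cwnd += s.smss


def ca_increment_eq3(smss: int, cwnd: int) -> int:
    """Per-ACK increment from equation (3) using integer arithmetic.

    A result of 0 is rounded up to 1 byte, per the implementation note.
    """
    return max(1, smss * smss // cwnd)


def ssthresh_after_loss(flight_size: int, smss: int) -> int:
    """Equation (4): upper bound for ssthresh when loss is first detected."""
    return max(flight_size // 2, 2 * smss)
```

   Note that ssthresh_after_loss takes FlightSize, not cwnd, as its
   input, in keeping with equation (4).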

   On the other hand, when a TCP sender detects segment loss using the
   retransmission timer and the given segment has already been
   retransmitted by way of the retransmission timer at least once, the
   value of ssthresh is held constant.

   Implementation Note: An easy mistake to make is to simply use cwnd,
   rather than FlightSize, which in some implementations may
   incidentally increase well beyond rwnd.

   Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
   set to no more than the loss window, LW, which equals 1 full-sized
   segment (regardless of the value of IW). Therefore, after
   retransmitting the dropped segment the TCP sender uses the slow
   start algorithm to increase the window from 1 full-sized segment to
   the new value of ssthresh, at which point congestion avoidance again
   takes over.

   As shown in [FF96,RFC3782], slow start-based loss recovery after a
   timeout can cause spurious retransmissions that trigger duplicate
   acknowledgments. The reaction to the arrival of these duplicate
   ACKs in TCP implementations varies widely. This document does not
   specify how to treat such acknowledgments, but does note this as an
   area that may benefit from additional attention, experimentation and
   specification.

3.2 Fast Retransmit/Fast Recovery

   A TCP receiver SHOULD send an immediate duplicate ACK when an
   out-of-order segment arrives. The purpose of this ACK is to inform
   the sender that a segment was received out-of-order and which
   sequence number is expected. From the sender's perspective,
   duplicate ACKs can be caused by a number of network problems.
   First, they can be caused by dropped segments. In this case, all
   segments after the dropped segment will trigger duplicate ACKs until
   the loss is repaired. Second, duplicate ACKs can be caused by the
   re-ordering of data segments by the network (not a rare event along
   some network paths [Pax97]).
   Finally, duplicate ACKs can be caused by replication of ACK or data
   segments by the network. In addition, a TCP receiver SHOULD send an
   immediate ACK when the incoming segment fills in all or part of a
   gap in the sequence space. This will generate more timely
   information for a sender recovering from a loss through a
   retransmission timeout, a fast retransmit, or an advanced loss
   recovery algorithm, as outlined in section 4.3.

   The TCP sender SHOULD use the "fast retransmit" algorithm to detect
   and repair loss, based on incoming duplicate ACKs. The fast
   retransmit algorithm uses the arrival of 3 duplicate ACKs (as
   defined in section 2, without any intervening ACKs which move
   SND.UNA) as an indication that a segment has been lost. After
   receiving 3 duplicate ACKs, TCP performs a retransmission of what
   appears to be the missing segment, without waiting for the
   retransmission timer to expire.

   After the fast retransmit algorithm sends what appears to be the
   missing segment, the "fast recovery" algorithm governs the
   transmission of new data until a non-duplicate ACK arrives. The
   reason for not performing slow start is that the receipt of the
   duplicate ACKs not only indicates that a segment has been lost, but
   also that segments are most likely leaving the network (although a
   massive segment duplication by the network can invalidate this
   conclusion). In other words, since the receiver can only generate a
   duplicate ACK when a segment has arrived, that segment has left the
   network and is in the receiver's buffer, so we know it is no longer
   consuming network resources. Furthermore, since the ACK "clock"
   [Jac88] is preserved, the TCP sender can continue to transmit new
   segments (although transmission must continue using a reduced cwnd,
   since loss is an indication of congestion).
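
   The three-duplicate-ACK trigger described above can be sketched as
   a small counter. This is a hedged illustration only: the class and
   method names are hypothetical, the caller is assumed to have
   already applied the duplicate acknowledgment test from section 2
   (passed in as is_duplicate), and sequence number wraparound is
   ignored.

```python
DUP_ACK_THRESHOLD = 3  # number of duplicate ACKs that signals loss


class DupAckCounter:
    """Counts consecutive duplicate ACKs for fast retransmit."""

    def __init__(self, snd_una: int) -> None:
        self.snd_una = snd_una  # oldest unacknowledged sequence number
        self.dup_acks = 0

    def on_ack(self, ack_no: int, is_duplicate: bool) -> bool:
        """Return True when this ACK should trigger a fast retransmit."""
        if ack_no > self.snd_una:
            # The ACK moves SND.UNA: new data is acknowledged, so the
            # run of duplicates is broken and counting starts over.
            self.snd_una = ack_no
            self.dup_acks = 0
            return False
        if is_duplicate:
            self.dup_acks += 1
            # Fire exactly once, on the third duplicate.
            return self.dup_acks == DUP_ACK_THRESHOLD
        return False
```

   The sketch fires only on the third duplicate; later duplicates are
   counted but do not trigger a second retransmission.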

   The fast retransmit and fast recovery algorithms are implemented
   together as follows.

   1.  On the first and second duplicate ACKs received at a sender, a
       TCP SHOULD send a segment of previously unsent data per
       [RFC3042] provided that the receiver's advertised window allows,
       the total FlightSize would remain less than or equal to cwnd
       plus 2*SMSS, and that new data is available for transmission.
       Further, the TCP sender MUST NOT change cwnd to reflect these
       two segments [RFC3042]. Note that a sender using SACK [RFC2018]
       MUST NOT send new data unless the incoming duplicate
       acknowledgment contains new SACK information.

   2.  When the third duplicate ACK is received, a TCP MUST set
       ssthresh to no more than the value given in equation 4. When
       [RFC3042] is in use, additional data sent in limited transmit
       MUST NOT be included in this calculation.

   3.  The lost segment starting at SND.UNA MUST be retransmitted and
       cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
       the congestion window by the number of segments (three) that
       have left the network and which the receiver has buffered.

   4.  For each additional duplicate ACK received (after the third),
       cwnd MUST be incremented by SMSS. This artificially inflates
       the congestion window in order to reflect the additional segment
       that has left the network.

       Note: [SCWA99] discusses a receiver-based attack whereby many
       bogus duplicate ACKs are sent to the data sender in order to
       artificially inflate cwnd and cause a higher than appropriate
       sending rate to be used. A TCP MAY therefore limit the number
       of times cwnd is artificially inflated during loss recovery
       to the number of outstanding segments (or, an approximation
       thereof).

       Note: When an advanced loss recovery mechanism (such as outlined
       in section 4.3) is not in use, this increase in FlightSize can
       cause equation 4 to slightly inflate cwnd and ssthresh, as some
       of the segments between SND.UNA and SND.NXT are assumed to have
       left the network but are still reflected in FlightSize.

   5.  When previously unsent data is available and the new value of
       cwnd and the receiver's advertised window allow, a TCP SHOULD
       send 1*SMSS bytes of previously unsent data.

   6.  When the next ACK arrives that acknowledges previously
       unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
       set in step 2). This is termed "deflating" the window.

       This ACK should be the acknowledgment elicited by the
       retransmission from step 3, one RTT after the retransmission
       (though it may arrive sooner in the presence of significant
       out-of-order delivery of data segments at the receiver).
       Additionally, this ACK should acknowledge all the intermediate
       segments sent between the lost segment and the receipt of the
       third duplicate ACK, if none of these were lost.

   Note: This algorithm is known to generally not recover efficiently
   from multiple losses in a single flight of packets [FF96]. Section
   4.3 below addresses such cases.

4. Additional Considerations

4.1 Re-starting Idle Connections

   A known problem with the TCP congestion control algorithms described
   above is that they allow a potentially inappropriate burst of
   traffic to be transmitted after TCP has been idle for a relatively
   long period of time. After an idle period, TCP cannot use the ACK
   clock to strobe new segments into the network, as all the ACKs have
   drained from the network. Therefore, as specified above, TCP can
   potentially send a cwnd-size line-rate burst into the network after
   an idle period.
   In addition, changing network conditions may have rendered TCP's
   notion of the available end-to-end network capacity between two
   endpoints, as estimated by cwnd, inaccurate during the course of a
   long idle period.

   [Jac88] recommends that a TCP use slow start to restart
   transmission after a relatively long idle period. Slow start
   serves to restart the ACK clock, just as it does at the beginning
   of a transfer. This mechanism has been widely deployed in the
   following manner. When TCP has not received a segment for more
   than one retransmission timeout, cwnd is reduced to the value of
   the restart window (RW) before transmission begins.

   For the purposes of this standard, we define RW = min(IW,cwnd).

   Using the last time a segment was received to determine whether or
   not to decrease cwnd can fail to deflate cwnd in the common case of
   persistent HTTP connections [HTH98]. In this case, a Web server
   receives a request before transmitting data to the Web client. The
   reception of the request makes the test for an idle connection fail,
   and allows the TCP to begin transmission with a possibly
   inappropriately large cwnd.

   Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
   transmission if the TCP has not sent data in an interval exceeding
   the retransmission timeout.

4.2 Generating Acknowledgments

   The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
   TCP receiver. When using delayed ACKs, a TCP receiver MUST NOT
   excessively delay acknowledgments. Specifically, an ACK SHOULD be
   generated for at least every second full-sized segment, and MUST be
   generated within 500 ms of the arrival of the first unacknowledged
   packet.

   The requirement that an ACK "SHOULD" be generated for at least every
   second full-sized segment is listed in [RFC1122] in one place as a
   SHOULD and another as a MUST. Here we unambiguously state it is a
   SHOULD.
   We also emphasize that this is a SHOULD, meaning that an
   implementor should indeed only deviate from this requirement after
   careful consideration of the implications. See the discussion of
   "Stretch ACK violation" in [RFC2525] and the references therein for
   a discussion of the possible performance problems with generating
   ACKs less frequently than every second full-sized segment.

   In some cases, the sender and receiver may not agree on what
   constitutes a full-sized segment. An implementation is deemed to
   comply with this requirement if it sends at least one acknowledgment
   every time it receives 2*RMSS bytes of new data from the sender,
   where RMSS is the Maximum Segment Size specified by the receiver to
   the sender (or the default value of 536 bytes, per [RFC1122], if the
   receiver does not specify an MSS option during connection
   establishment). The sender may be forced to use a segment size less
   than RMSS due to the maximum transmission unit (MTU), the path MTU
   discovery algorithm or other factors. For instance, consider the
   case when the receiver announces an RMSS of X bytes but the sender
   ends up using a segment size of Y bytes (Y < X) due to path MTU
   discovery (or the sender's MTU size). The receiver will generate
   stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
   sent. Clearly this will take more than 2 segments of size Y bytes.
   Therefore, while a specific algorithm is not defined, it is
   desirable for receivers to attempt to prevent this situation, for
   example by acknowledging at least every second segment, regardless
   of size. Finally, we repeat that an ACK MUST NOT be delayed for
   more than 500 ms waiting on a second full-sized segment to arrive.

   Out-of-order data segments SHOULD be acknowledged immediately, in
   order to accelerate loss recovery.
   To trigger the fast retransmit algorithm, the receiver SHOULD send
   an immediate duplicate ACK when it receives a data segment above a
   gap in the sequence space. To provide feedback to senders
   recovering from losses, the receiver SHOULD send an immediate ACK
   when it receives a data segment that fills in all or part of a gap
   in the sequence space.

   A TCP receiver MUST NOT generate more than one ACK for every
   incoming segment, other than to update the offered window as the
   receiving application consumes new data [page 42, RFC793][RFC813].

4.3 Loss Recovery Mechanisms

   A number of loss recovery algorithms that augment fast retransmit
   and fast recovery have been suggested by TCP researchers and
   specified in the RFC series. While some of these algorithms are
   based on the TCP selective acknowledgment (SACK) option [RFC2018],
   such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
   [Hoe96,FF96,RFC3782]. The non-SACK algorithms use "partial
   acknowledgments" (ACKs which cover previously unacknowledged data,
   but not all the data outstanding when loss was detected) to trigger
   retransmissions. While this document does not standardize any of
   the specific algorithms that may improve fast retransmit/fast
   recovery, these enhanced algorithms are implicitly allowed, as long
   as they follow the general principles of the basic four algorithms
   outlined above.

   That is, when the first loss in a window of data is detected,
   ssthresh MUST be set to no more than the value given by equation
   (4). Second, until all lost segments in the window of data in
   question are repaired, the number of segments transmitted in each
   RTT MUST be no more than half the number of outstanding segments
   when the loss was detected.
   Finally, after all loss in the given window of segments has been
   successfully retransmitted, cwnd MUST be set to no more than
   ssthresh and congestion avoidance MUST be used to further increase
   cwnd. Loss in two successive windows of data, or the loss of a
   retransmission, should be taken as two indications of congestion
   and, therefore, cwnd (and ssthresh) MUST be lowered twice in this
   case.

   We RECOMMEND that TCP implementors employ some form of advanced loss
   recovery that can cope with multiple losses in a window of data.
   The algorithms detailed in [RFC3782] and [RFC3517] conform to the
   general principles outlined above. We note that, while these are
   not the only two algorithms that conform to the above general
   principles, these two algorithms have been vetted by the community
   and are currently on the standards track.

5. Security Considerations

   This document requires a TCP to diminish its sending rate in the
   presence of retransmission timeouts and the arrival of duplicate
   acknowledgments. An attacker can therefore impair the performance
   of a TCP connection by either causing data packets or their
   acknowledgments to be lost, or by forging excessive duplicate
   acknowledgments.

   In response to the ACK division attack outlined in [SCWA99] this
   document RECOMMENDS increasing the congestion window based on the
   number of bytes newly acknowledged in each arriving ACK rather than
   by a particular constant on each arriving ACK (as outlined in
   section 3.1).

   The Internet to a considerable degree relies on the correct
   implementation of these algorithms in order to preserve network
   stability and avoid congestion collapse. An attacker could cause
   TCP endpoints to respond more aggressively in the face of congestion
   by forging excessive duplicate acknowledgments or excessive
   acknowledgments for new data.
   Conceivably, such an attack could drive a portion of the network
   into congestion collapse.

6. Changes Between RFC 2001 and RFC 2581

   [RFC2001] was extensively rewritten editorially and it is not
   feasible to itemize the list of changes between [RFC2001] and
   [RFC2581]. The intention of [RFC2581] was to not change any of the
   recommendations given in [RFC2001], but to further clarify cases
   that were not discussed in detail in [RFC2001]. Specifically,
   [RFC2581] suggested what TCP connections should do after a
   relatively long idle period, as well as specified and clarified
   some of the issues pertaining to TCP ACK generation. Finally, the
   allowable upper bound for the initial congestion window was raised
   from one to two segments.

7. Changes Relative to RFC 2581

   A specific definition for "duplicate acknowledgment" has been
   added, based on the definition used by BSD TCP.

   The document now notes that what to do with duplicate ACKs after the
   retransmission timer has fired is future work and explicitly
   unspecified in this document.

   The initial window requirements were changed to allow Larger
   Initial Windows as standardized in [RFC3390]. Additionally, the
   steps to take when an initial window is discovered to be too large
   due to Path MTU Discovery [RFC1191] are detailed.

   The recommended initial value for ssthresh has been changed to say
   that it SHOULD be arbitrarily high, where it was previously MAY.
   This is to provide additional guidance to implementors on the
   matter.

   During slow start, the usage of Appropriate Byte Counting [RFC3465]
   with L=1*SMSS is explicitly recommended. The method of increasing
   cwnd given in [RFC2581] is still explicitly allowed. Byte counting
   during congestion avoidance is also recommended, while the method
   from [RFC2581] and other safe methods are still allowed.

   The treatment of ssthresh on retransmission timeout was clarified.
   In particular, ssthresh must be set to half the FlightSize on the
   first retransmission of a given segment and then is held constant on
   subsequent retransmissions of the same segment.

   The description of fast retransmit and fast recovery has been
   clarified, and the use of Limited Transmit [RFC3042] is now
   recommended.

   TCPs now MAY limit the number of duplicate ACKs that artificially
   inflate cwnd during loss recovery to the number of segments
   outstanding to avoid the duplicate ACK spoofing attack described in
   [SCWA99].

   The restart window has been changed to min(IW,cwnd) from IW. This
   behavior was described as "experimental" in [RFC2581].

   It is now recommended that TCP implementors implement an advanced
   loss recovery algorithm conforming to the principles outlined in
   this document.

   The security considerations have been updated to discuss ACK
   division and recommend byte counting as a counter to this attack.

8. IANA Considerations

   This document contains no IANA considerations, but apparently an
   Internet *Draft* can no longer be published without this section.

Acknowledgments

   The core algorithms we describe were developed by Van Jacobson
   [Jac88, Jac90]. In addition, Limited Transmit [RFC3042] was
   developed in conjunction with Hari Balakrishnan and Sally Floyd.
   The initial congestion window size specified in this document is a
   result of work with Sally Floyd and Craig Partridge
   [RFC2414,RFC3390].

   W. Richard ("Rich") Stevens wrote the first version of this document
   [RFC2001] and co-authored the second version [RFC2581]. This
   present version much benefits from his clarity and thoughtfulness of
   description, and we are grateful for Rich's contributions in
   elucidating TCP congestion control, as well as in more broadly
   helping us understand numerous issues relating to networking.
   We wish to emphasize that the shortcomings and mistakes of this
   document are solely the responsibility of the current authors.

   Some of the text from this document is taken from "TCP/IP
   Illustrated, Volume 1: The Protocols" by W. Richard Stevens
   (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
   Implementation" by Gary R. Wright and W. Richard Stevens
   (Addison-Wesley, 1995). This material is used with the permission
   of Addison-Wesley.

   Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
   Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
   Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch
   contributed a number of helpful suggestions.

Normative References

   [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
       793, September 1981.

   [RFC1122] Braden, R., "Requirements for Internet Hosts --
       Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
       November 1990.

Informative References

   [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
       Algorithms for Congestion Avoidance in Computer Networks",
       Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
       pp. 1-14, June 1989.

   [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
       Tahoe, Reno and SACK TCP", Computer Communication Review, July
       1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

   [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
       Control Scheme for TCP", In ACM SIGCOMM, August 1996.

   [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
       Slow-Start Restart After Idle", Work in Progress.

   [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
       Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
       ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

   [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
       end2end-interest mailing list, April 30, 1990.
       ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

   [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
       TCP Congestion Control", Proceedings of SIGCOMM'96, August,
       1996, Stanford, CA. Available
       from http://www.psc.edu/networking/papers/papers.html

   [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
       Parameters", Technical report. Available from
       http://www.psc.edu/networking/papers/FACKnotes/current.

   [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
       Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

   [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
       813, July 1982.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
       Retransmit, and Fast Recovery Algorithms", RFC 2001, January
       1997.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
       Selective Acknowledgement Options", RFC 2018, October 1996.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
       Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
       Initial Window Size", RFC 2414, September 1998.

   [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner,
       J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
       Implementation Problems", RFC 2525, March 1999.

   [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
       Control", RFC 2581, April 1999.

   [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
       Extension to the Selective Acknowledgement (SACK) Option for
       TCP", RFC 2883, July 2000.

   [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
       Timer", RFC 2988, November 2000.

   [RFC3042] Allman, M., Balakrishnan, H. and S.
       Floyd, "Enhancing TCP's Loss Recovery Using Limited Transmit",
       RFC 3042, January 2001.

   [RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of
       Explicit Congestion Notification (ECN) to IP", RFC 3168,
       September 2001.

   [RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
       Initial Window", RFC 3390, October 2002.

   [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
       Counting (ABC)", RFC 3465, February 2003.

   [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
       Conservative Selective Acknowledgment (SACK)-based Loss Recovery
       Algorithm for TCP", RFC 3517, April 2003.

   [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
       Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
       2004.

   [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
       Discovery", RFC 4821, March 2007.

   [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
       "TCP Congestion Control With a Misbehaving Receiver", ACM
       Computer Communication Review, 29(5), October 1999.

   [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
       Addison-Wesley, 1994.

   [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
       Implementation", Addison-Wesley, 1995.

Authors' Addresses

   Mark Allman
   International Computer Science Institute (ICSI)
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 440 235 1792
   EMail: mallman@icir.org
   http://www.icir.org/mallman/

   Vern Paxson
   International Computer Science Institute (ICSI)
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 510/642-4274 x302
   EMail: vern@icir.org
   http://www.icir.org/vern/

   Ethan Blanton
   Purdue University Computer Sciences
   305 North University Street
   West Lafayette, IN 47907
   EMail: eblanton@cs.purdue.edu
   http://www.cs.purdue.edu/homes/eblanton/

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.