idnits 2.17.1 draft-ietf-tsvwg-newreno-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 19 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 20 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The abstract seems to contain references ([RFC2018,RFC3517], [RFC2581], [FF96], [RFC2582], [Hoe95]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 528: '... retransmission of three packets [Gur03]. The TCP sender MAY use such...' RFC 2119 keyword, line 607: '... [RFC2581] specifies that "Out-of-order data segments SHOULD be...' RFC 2119 keyword, line 762: '...two heuristics that the TCP sender MAY...' Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 2003) is 7461 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2988' is defined on line 807, but no explicit reference was found in the text == Unused Reference: 'Hen98' is defined on line 844, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782) ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298) -- Obsolete informational reference (is this intentional?): RFC 2001 (ref. 'F98') (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) Summary: 9 errors (**), 0 flaws (~~), 5 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force S. Floyd 3 INTERNET DRAFT ICSI 4 draft-ietf-tsvwg-newreno-02.txt T. Henderson 5 Boeing 6 A. Gurtov 7 U. Helsinki 8 November 2003 10 The NewReno Modification to TCP's Fast Recovery Algorithm 12 Status of this Memo 14 This document is an Internet-Draft and is in full conformance with 15 all provisions of Section 10 of RFC2026. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 Abstract 35 RFC 2581 [RFC2581] documents the following four intertwined TCP 36 congestion control algorithms: Slow Start, Congestion Avoidance, Fast 37 Retransmit, and Fast Recovery. RFC 2581 [RFC2581] explicitly allows 38 certain modifications of these algorithms, including modifications 39 that use the TCP Selective Acknowledgement (SACK) option 40 [RFC2018,RFC3517], and modifications that respond to "partial 41 acknowledgments" (ACKs which cover new data, but not all the data 42 outstanding when loss was detected) in the absence of SACK. The 43 NewReno mechanism uses an algorithm for responding to partial 44 acknowledgments that was first proposed by Janey Hoe in [Hoe95]. 46 RFC 2582 [RFC2582] specified the NewReno mechanisms as Experimental 47 in 1999. This document is a small revision of RFC 2582 intended to 48 advance the NewReno mechanisms to Proposed Standard. RFC 2581 notes 49 that the Fast Retransmit/Fast Recovery algorithm specified in that 50 document does not recover very efficiently from multiple losses in a 51 single flight of packets, and that RFC 2582 contains one set of 52 modifications to address this problem. 54 NOTE TO THE RFC EDITOR: PLEASE REMOVE THIS SECTION UPON PUBLICATION. 56 Changes from draft-ietf-tsvwg-newreno-01.txt: 58 * Some improvements in phrasing suggested by Mark Allman. 60 Changes from draft-ietf-tsvwg-newreno-00.txt: 62 * In Section 8, added a cautionary note about using the duplicate 63 acknowledgment counter as a flag for whether Fast Recovery is in 64 effect. 66 * In Section 8, added a note about pulling along "recover" with 67 "snd_una" when Fast Recovery is not in effect. 69 * Added a discussion in Section 6 about heuristics for distinguishing 70 between a retransmitted packet that was dropped, and three duplicate 71 acknowledgements simply from the unnecessary retransmission of three 72 packets. 74 * Added more text and examples for comparing the Impatient and the 75 Slow-but-Steady variants. 77 * In Section 8, added a cautionary note saying that when the sender 78 is not in Fast Retransmit, the sender should not use the Fast 79 Recovery response to multiple duplicate acknowledgements. 81 Changes from draft-floyd-newreno-00.txt: 83 * In Section 8 on "Implementation issues for the data sender", 84 mentioned alternate methods for limiting bursts when exiting Fast 85 Recovery. 87 * Changed draft from draft-floyd-newreno to draft-ietf-tsvwg-newreno 89 Changes from RFC 2582: 91 * Rephrasing and rearrangements of the text. 93 * RFC 2582 described the Careful and Less Careful variants of 94 NewReno, along with a default version that was neither Careful nor 95 Less Careful, and recommended the Careful variant. This document 96 only specifies the Careful version. 98 * RFC 2582 used two separate variables, "send_high" and "recover", 99 and this document has merged them into a single variable "recover". 101 * Added sections on "Comparisons between Reno and NewReno TCP", and 102 on "Changes relative to RFC 2582". The section on "Comparisons 103 between Reno and NewReno TCP" includes a discussion of the one area 104 where NewReno is known to perform worse than Reno or SACK, and that 105 is in the response to reordering. 107 * Moved all of the discussions of the Impatient and Slow-but-Steady 108 variants to one place, and recommended the Impatient variant (as in 109 the default version in RFC 2582). 111 * Added a section on Implementation issues for the data sender, 112 mentioning maxburst_. 114 * Added a paragraph about differences between RFC 2582 and [FF96]. 116 END OF NOTE TO RFC EDITOR 118 1. Introduction 120 For the typical implementation of the TCP Fast Recovery algorithm 121 described in [RFC2581] (first implemented in the 1990 BSD Reno 122 release, and referred to as the Reno algorithm in [FF96]), the TCP 123 data sender only retransmits a packet after a retransmit timeout has 124 occurred, or after three duplicate acknowledgements have arrived 125 triggering the Fast Retransmit algorithm. A single retransmit 126 timeout might result in the retransmission of several data packets, 127 but each invocation of the Fast Retransmit algorithm in RFC 2581 128 leads to the retransmission of only a single data packet. 130 Problems can arise, therefore, when multiple packets have been 131 dropped from a single window of data and the Fast Retransmit and Fast 132 Recovery algorithms are invoked. In this case, if the SACK option is 133 available, the TCP sender has the information to make intelligent 134 decisions about which packets to retransmit and which packets not to 135 retransmit during Fast Recovery. This document applies only for TCP 136 connections that are unable to use the TCP Selective Acknowledgement 137 (SACK) option, either because the option is not locally supported or 138 because the TCP peer did not indicate a willingness to use SACK. 140 In the absence of SACK, there is little information available to the 141 TCP sender in making retransmission decisions during Fast Recovery. 143 From the three duplicate acknowledgements, the sender infers a packet 144 loss, and retransmits the indicated packet. After this, the data 145 sender could receive additional duplicate acknowledgements, as the 146 data receiver acknowledges additional data packets that were already 147 in flight when the sender entered Fast Retransmit. 149 In the case of multiple packets dropped from a single window of data, 150 the first new information available to the sender comes when the 151 sender receives an acknowledgement for the retransmitted packet (that 152 is, the packet retransmitted when Fast Retransmit was first entered). 153 If there had been a single packet drop and no reordering, then the 154 acknowledgement for this packet will acknowledge all of the packets 155 transmitted before Fast Retransmit was entered. However, when there 156 were multiple packet drops, then the acknowledgement for the 157 retransmitted packet will acknowledge some but not all of the packets 158 transmitted before the Fast Retransmit. We call this acknowledgement 159 a partial acknowledgment. 161 Along with several other suggestions, [Hoe95] suggested that during 162 Fast Recovery the TCP data sender responds to a partial 163 acknowledgment by inferring that the next in-sequence packet has been 164 lost, and retransmitting that packet. This document describes a 165 modification to the Fast Recovery algorithm in RFC 2581 that 166 incorporates a response to partial acknowledgements received during 167 Fast Recovery. We call this modified Fast Recovery algorithm 168 NewReno, because it is a slight but significant variation of the 169 basic Reno algorithm in RFC 2581. This document does not discuss the 170 other suggestions in [Hoe95] and [Hoe96], such as a change to the 171 ssthresh parameter during Slow-Start, or the proposal to send a new 172 packet for every two duplicate acknowledgements during Fast Recovery. 173 The version of NewReno in this document also draws on other 174 discussions of NewReno in the literature [LM97]. 176 We do not claim that the NewReno version of Fast Recovery described 177 here is an optimal modification of Fast Recovery for responding to 178 partial acknowledgements, for TCP connections that are unable to use 179 SACK. Based on our experiences with the NewReno modification in the 180 NS simulator [NS] and with numerous implementations of NewReno, we 181 believe that this modification improves the performance of the Fast 182 Retransmit and Fast Recovery algorithms in a wide variety of 183 scenarios. 185 2. Terminology and Definitions 187 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 188 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 189 and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 190 and indicate requirement levels for compliant TCP implementations 191 implementing the NewReno Fast Retransmit and Fast Recovery algorithms 192 described in this document. 194 This document assumes that the reader is familiar with the terms 195 SENDER MAXIMUM SEGMENT SIZE (SMSS), CONGESTION WINDOW (cwnd), and 196 FLIGHT SIZE (FlightSize) defined in [RFC2581]. FLIGHT SIZE is 197 defined as in [RFC2581] as follows: 199 FLIGHT SIZE: 200 The amount of data that has been sent but not yet acknowledged. 202 3. The Fast Retransmit and Fast Recovery Algorithms in NewReno 204 The standard implementation of the Fast Retransmit and Fast Recovery 205 algorithms is given in [RFC2581]. This section specifies the basic 206 NewReno algorithm. Sections 4 through 6 describe some optional 207 variants, and the motivations behind them, that an implementor may 208 want to consider when tuning performance for certain network 209 scenarios. Sections 7 and 8 provide some guidance to implementors 210 based on experience with NewReno implementations. 212 The NewReno modification concerns the Fast Recovery procedure that 213 begins when three duplicate ACKs are received and ends when either a 214 retransmission timeout occurs or an ACK arrives that acknowledges all 215 of the data up to and including the data that was outstanding when 216 the Fast Recovery procedure began. 218 The NewReno algorithm specified in this document differs from the 219 implementation in [RFC2581] in the introduction of the variable 220 "recover" in step 1, in the response to a partial or new 221 acknowledgement in step 5, and in modifications to step 1 and the 222 addition of step 6 for avoiding multiple Fast Retransmits caused by 223 the retransmission of packets already received by the receiver. 225 The algorithm specified in this document uses a variable "recover", 226 whose initial value is the initial send sequence number. 228 1) Three duplicate ACKs: 229 When the third duplicate ACK is received and the sender is not 230 already in the Fast Recovery procedure, check to see if the 231 Cumulative Acknowledgement field covers more than "recover". 232 If so, go to Step 1A. Otherwise, go to Step 1B. 234 1A) Invoking Fast Retransmit: 235 If so, then set ssthresh to no more than the value given in 236 equation 1 below. (This is equation 3 from [RFC2581]). 238 ssthresh = max (FlightSize / 2, 2*SMSS) (1) 240 In addition, record the highest sequence number transmitted in 241 the variable "recover", and go to Step 2. 243 1B) Not invoking Fast Retransmit: 244 Do not enter the Fast Retransmit and Fast Recovery procedure. 245 In particular, do not change ssthresh, do not go to Step 2 to 246 retransmit the "lost" segment, and do not execute Step 3 upon 247 subsequent duplicate ACKs. 249 2) Entering Fast Retransmit: 250 Retransmit the lost segment and set cwnd to ssthresh plus 3*SMSS. 251 This artificially "inflates" the congestion window by the number 252 of segments (three) that have left the network and which the 253 receiver has buffered. 255 3) Fast Recovery: 256 For each additional duplicate ACK received while in Fast 257 Recovery, increment cwnd by SMSS. This artificially inflates 258 the congestion window in order to reflect the additional segment 259 that has left the network. 261 4) Fast Recovery, continued: 262 Transmit a segment, if allowed by the new value of cwnd and the 263 receiver's advertised window. 265 5) When an ACK arrives that acknowledges new data, this ACK could be 266 the acknowledgment elicited by the retransmission from step 2, or 267 elicited by a later retransmission. 269 Full acknowledgements: 270 If this ACK acknowledges all of the data up to and including 271 "recover", then the ACK acknowledges all the intermediate 272 segments sent between the original transmission of the lost 273 segment and the receipt of the third duplicate ACK. Set cwnd to 274 either (1) min (ssthresh, FlightSize + SMSS); or (2) ssthresh, 275 where ssthresh is the value set in step 1; this is termed 276 "deflating" the window. (We note that "FlightSize" in step 1 277 referred to the amount of data outstanding in step 1, when Fast 278 Recovery was entered, while "FlightSize" in step 5 refers to the 279 amount of data outstanding in step 5, when Fast Recovery is 280 exited.) If the second option is selected, the implementation 281 is encouraged to take measures to avoid a possible burst of 282 data, in case the amount of data outstanding in the network was 283 much less than the new congestion window allows. A simple 284 mechanism is to limit the number of data packets that can be 285 sent in response to a single acknowledgement. (This is known 286 as "maxburst_" in the NS simulator). Exit the Fast Recovery 287 procedure. 289 Partial acknowledgements: 290 If this ACK does *not* acknowledge all of the data up to and 291 including "recover", then this is a partial ACK. In this case, 292 retransmit the first unacknowledged segment. Deflate the 293 congestion window by the amount of new data acknowledged by the 294 cumulative acknowledgement field. If the partial ACK 295 acknowledges at least one SMSS of new data, then add back SMSS 296 bytes to the congestion window. As in Step 3, this artificially 297 inflates the congestion window in order to reflect the additional 298 segment that has left the network. Send a new segment if 299 permitted by the new value of cwnd. This "partial window 300 deflation" attempts to ensure that, when Fast Recovery eventually 301 ends, approximately ssthresh amount of data will be outstanding 302 in the network. Do not exit the Fast Recovery procedure (i.e., 303 if any duplicate ACKs subsequently arrive, execute Steps 3 and 304 4 above). 306 For the first partial ACK that arrives during Fast Recovery, also 307 reset the retransmit timer. Timer management is discussed in 308 more detail in Section 4. 310 6) Retransmit timeouts: 311 After a retransmit timeout, record the highest sequence number 312 transmitted in the variable "recover" and exit the Fast 313 Recovery procedure if applicable. 315 Step 1 specifies a check that the Cumulative Acknowledgement field 316 covers more than "recover". Because the acknowledgement field 317 contains the sequence number that the sender next expects to receive, 318 the acknowledgement "ack_number" covers more than "recover" when: 320 ack_number - 1 > recover. 322 Note that in Step 5, the congestion window is deflated after a 323 partial acknowledgement is received. The congestion window was 324 likely to have been inflated considerably when the partial 325 acknowledgement was received. In addition, depending on the original 326 pattern of packet losses, the partial acknowledgement might 327 acknowledge nearly a window of data. In this case, if the congestion 328 window was not deflated, the data sender might be able to send nearly 329 a window of data back-to-back. 331 This document does not specify the sender's response to duplicate 332 ACKs when the Fast Retransmit/Fast Recovery algorithm is not invoked. 333 This is addressed in other documents, such as those describing the 334 Limited Transmit procedure [RFC3042]. This document also does not 335 address issues of adjusting the duplicate acknowledgement threshold, 336 but assumes the threshold specified in the IETF standards; the 337 current standard is RFC 2581, which specifies a threshold of three 338 duplicate acknowledgements. 340 As a final note, we would observe that in the absence of the SACK 341 option, the data sender is working from limited information. When 342 the issue of recovery from multiple dropped packets from a single 343 window of data is of particular importance, the best alternative 344 would be to use the SACK option. 346 4. Resetting the Retransmit Timer in Response to Partial 347 Acknowledgements. 349 One possible variant to the response to partial acknowledgements 350 specified in Section 3 concerns when to reset the retransmit timer 351 after a partial acknowledgement. The algorithm in Section 3, Step 5, 352 resets the retransmit timer only after the first partial ACK. In 353 this case, if a large number of packets were dropped from a window of 354 data, the TCP data sender's retransmit timer will ultimately expire, 355 and the TCP data sender will invoke Slow-Start. (This is illustrated 356 on page 12 of [F98].) We call this the Impatient variant of NewReno. 357 We note that the Impatient variant in Section 3 doesn't follow the 358 recommended algorithm in RFC 2988 of restarting the retransmit timer 359 after every packet transmission or retransmission [RFC2988, Step 360 5.1]. 362 In contrast, the NewReno simulations in [FF96] illustrate the 363 algorithm described above with the modification that the retransmit 364 timer is reset after each partial acknowledgement. We call this the 365 Slow-but-Steady variant of NewReno. In this case, for a window with 366 a large number of packet drops, the TCP data sender retransmits at 367 most one packet per roundtrip time. (This behavior is illustrated in 368 the New-Reno TCP simulation of Figure 5 in [FF96], and on page 11 of 369 [F98]. 371 When N packets have been dropped from a window of data for a large 372 value of N, the Slow-but-Steady variant can remain in Fast Recovery 373 for N round-trip times, retransmitting one more dropped packet each 374 round-trip time; for these scenarios, the Impatient variant gives a 375 faster recovery and better performance. The tests "ns test-suite- 376 newreno.tcl impatient1" and "ns test-suite-newreno.tcl slow1" in the 377 NS simulator illustrate such a scenario, where the Impatient variant 378 performs better than the Slow-but-Steady variant. The Impatient 379 variant can be particularly important for TCP connections with large 380 congestion windows, as illustrated by the tests "ns test-suite- 381 newreno.tcl impatient4" and "ns test-suite-newreno.tcl slow4" in the 382 NS simulator. 384 One can also construct scenarios where the Slow-but-Steady variant 385 gives better performance than the Impatient variant. As an example, 386 this occurs then only a small number of packets are dropped, the RTO 387 is sufficiently small that the retransmit timer expires, and 388 performance would have been better without a retransmit timeout. The 389 tests "ns test-suite-newreno.tcl impatient2" and "ns test-suite- 390 newreno.tcl slow2" in the NS simulator illustrate such a scenario. 392 The Slow-but-Steady variant can also achieve higher goodput than the 393 Impatient variant, by avoiding unnecessary retransmissions. This 394 could be of special interest for cellular links, where every 395 transmission costs battery power and money. The tests "ns test- 396 suite-newreno.tcl impatient3" and "ns test-suite-newreno.tcl slow3" 397 in the NS simulator illustrate such a scenario. The Slow-but-Steady 398 variant can also be more robust to delay variation in the network, 399 where a delay spike might force the Impatient variant into a timeout 400 and go-back-N recovery. 402 Neither of the two variants discussed above are optimal. Our 403 recommendation is for the Impatient variant, as specified in Section 404 3 of this document, because of the poor performance of the Slow-but- 405 Steady variant for TCP connections with large congestion windows. 407 One possibility for a more optimal algorithm would be one that 408 recovered from multiple packet drops as quickly as does slow-start, 409 while resetting the retransmit timers after each partial 410 acknowledgement, as described in the section below. We note, 411 however, that there is a limitation to the potential performance in 412 this case in the absence of the SACK option. 414 5. Retransmissions after a Partial Acknowledgement 416 One possible variant to the response to partial acknowledgements 417 specified in Section 3 would be to retransmit more than one packet 418 after each partial acknowledgement, and to reset the retransmit timer 419 after each retransmission. The algorithm specified in Section 3 420 retransmits a single packet after each partial acknowledgement. This 421 is the most conservative alternative, in that it is the least likely 422 to result in an unnecessarily-retransmitted packet. A variant that 423 would recover faster from a window with many packet drops would be to 424 effectively Slow-Start, retransmitting two packets after each partial 425 acknowledgement. Such an approach would take less than N roundtrip 426 times to recover from N losses [Hoe96]. However, in the absence of 427 SACK, recovering as quickly as slow-start introduces the likelihood 428 of unnecessarily retransmitting packets, and this could significantly 429 complicate the recovery mechanisms. 431 We note that the response to partial acknowledgements specified in 432 Section 3 of this document and in RFC 2582 differs from the response 433 in [FF96], even though both approaches only retransmit one packet in 434 response to a partial acknowledgement. Step 5 of Section 3 specifies 435 that the TCP sender responds to a partial ACK by deflating the 436 congestion window by the amount of new data acknowledged, then adding 437 back SMSS bytes if the partial ACK acknowledges at least SMSS bytes 438 of new data, and sending a new segment if permitted by the new value 439 of cwnd. Thus, only one previously-sent packet is retransmitted in 440 response to each partial acknowledgement, but additional new packets 441 might be transmitted as well, depending on the amount of new data 442 acknowledged by the partial acknowledgement. In contrast, the 443 variant of NewReno illustrated in [FF96] simply set the congestion 444 window to ssthresh when a partial acknowledgement was received. The 445 approach in [FF96] is more conservative, and does not attempt to 446 accurately track the actual number of outstanding packets after a 447 partial acknowledgement is received. While either of these 448 approaches gives acceptable performance, the variant specified in 449 Section 3 recovers more smoothly when multiple packets are dropped 450 from a window of data. (The [FF96] behavior can be seen in the NS 451 simulator by setting the variable "partial_window_deflation_" for 452 "Agent/TCP/Newreno" to 0, and the behavior specified in Section 3 is 453 achieved by setting "partial_window_deflation_" to 1.) 455 6. Avoiding Multiple Fast Retransmits 457 This section describes the motivation for the sender's state variable 458 "recover", and discusses possible heuristics for distinguishing 459 between a retransmitted packet that was dropped, and three duplicate 460 acknowledgements simply from the unnecessary retransmission of three 461 packets. 463 In the absence of the SACK option or timestamps, a duplicate 464 acknowledgement carries no information to identify the data packet or 465 packets at the TCP data receiver that triggered that duplicate 466 acknowledgement. In this case, the TCP data sender is unable to 467 distinguish between a duplicate acknowledgement that results from a 468 lost or delayed data packet, and a duplicate acknowledgement that 469 results from the sender's unnecessary retransmission of a data packet 470 that had already been received at the TCP data receiver. Because of 471 this, with the Retransmit and Fast Recovery algorithms in Reno TCP, 472 multiple segment losses from a single window of data can sometimes 473 result in unnecessary multiple Fast Retransmits (and multiple 474 reductions of the congestion window) [F94]. 476 With the Fast Retransmit and Fast Recovery algorithms in Reno TCP, 477 the performance problems caused by multiple Fast Retransmits are 478 relatively minor compared to the potential problems with Tahoe TCP, 479 which does not implement Fast Recovery. Nevertheless, unnecessary 480 Fast Retransmits can occur with Reno TCP unless some explicit 481 mechanism is added to avoid this, such as the use of the "recover" 482 variable. (This modification is called "bugfix" in [F98], and is 483 illustrated on pages 7 and 9 of that document. Unnecessary Fast 484 Retransmits for Reno without "bugfix" is illustrated on page 6 of 485 [F98].) 487 Section 3 of RFC 2582 defined a default variant of NewReno TCP that 488 did not use the variable "recover", and did not check if duplicate 489 ACKs cover the variable "recover" before invoking Fast Retransmit. 490 With this default variant from RFC 2582, the problem of multiple Fast 491 Retransmits from a single window of data can occur after a Retransmit 492 Timeout (as in page 8 of [F98]) or in scenarios with reordering (as 493 in the validation test "./test-all-newreno newreno5_noBF" in 494 directory "tcl/test" of the NS simulator. This gives performance 495 similar to that on page 8 of [F03].) RFC 2582 also defined Careful 496 and Less Careful variants of the NewReno algorithm, and recommended 497 the Careful variant. 499 The algorithm specified in Section 3 of this document corresponds to 500 the Careful variant of NewReno TCP from RFC 2582, and eliminates the 501 problem of multiple Fast Retransmits. This algorithm uses the 502 variable "recover", whose initial value is the initial send sequence 503 number. After each retransmit timeout, the highest sequence number 504 transmitted so far is recorded in the variable "recover". 506 If, after a retransmit timeout, the TCP data sender retransmits three 507 consecutive packets that have already been received by the data 508 receiver, then the TCP data sender will receive three duplicate 509 acknowledgements that do not cover more than "recover". In this 510 case, the duplicate acknowledgements are not an indication of a new 511 instance of congestion. They are simply an indication that the 512 sender has unnecessarily retransmitted at least three packets. 514 However, when a retransmitted packet is itself dropped, the sender 515 can also receive three duplicate acknowledgements that do not cover 516 more than "recover", and in this case the sender would have been 517 better off if it had initiated Fast Retransmit. For a TCP that 518 implements the algorithm specified in Section 3 of this document, the 519 sender does not infer a packet drop from duplicate acknowledgements 520 in this scenario. As always, the retransmit timer is the backup 521 mechanism for inferring packet loss in this case. 523 There are several heuristics, based on timestamps or on the amount of 524 advancement of the cumulative acknowledgement field, that allow the 525 sender to distinguish in some cases between three duplicate 526 acknowledgements following a retransmitted packet that was dropped, 527 and three duplicate acknowledgements simply from the unnecessary 528 retransmission of three packets [Gur03]. The TCP sender MAY use such 529 a heuristic to decide to invoke a Fast Retransmit in some cases even 530 when the three duplicate acknowledgements do not cover more than 531 "recover". 533 For example, when three duplicate acknowledgements are caused by the 534 unnecessary retransmission of three packets, this is likely to be 535 accompanied by the cumulative acknowledgement field advancing by at 536 least four segments. Similarly, a heuristic based on timestamps uses 537 the fact that when there is a hole in the sequence space, the 538 timestamp echoed in the duplicate acknowledgement is the timestamp of 539 the most recent data packet that advanced the cumulative 540 acknowledgement field [RFC1323]. If timestamps are used, and the 541 sender stores the timestamp of the last acknowledged segment, then 542 the timestamp echoed by duplicate acknowledgements can be used to 543 distinguish between a retransmitted packet that was dropped, and 544 three duplicate acknowledgements simply from the unnecessary 545 retransmission of three packets. The heuristics are illustrated in 546 the NS simulator in the validation test "./test-all-newreno". 548 6.1. ACK Heuristic. 550 If the ACK-based heuristic is used, then following the advancement of 551 the cumulative acknowledgement field, the sender stores the value of 552 previous cumulative acknowledgement as prev_highest_ack and stores 553 the latest cumulative ACK as highest_ack. In addition, the following 554 step is performed if Step 1 in Section 3 fails, before proceeding to 555 Step 1B. 557 1*) If the Cumulative Acknowledgement field didn't cover more than 558 "recover", check to see if the congestion window is greater 559 than SMSS bytes and the difference between highest_ack and 560 prev_highest_ack is at most 4*SMSS bytes. If true, duplicate 561 ACKs indicate a lost segment (proceed to Step 1A in Section 562 3). Otherwise, duplicate ACKs likely result from unnecessary 563 retransmissions (proceed to Step 1B in Section 3). 565 The congestion window check serves to protect against fast retransmit 566 immediately after a retransmit timeout, similar to the 567 "exitFastRetrans_" variable in NS. Examples of applying the ACK 568 heuristic are in validation tests "./test-all-newreno 569 newreno_rto_loss_ack" and "./test-all-newreno newreno_rto_dup_ack" in 570 directory "tcl/test" of the NS simulator. 572 If several ACKs are lost, the sender can see a jump in the cumulative 573 ACK of more than three segments and the heuristic can fail. A 574 validation test for this scenario is "./test-all-newreno 575 newreno_rto_loss_ackf". RFC 2581 recommends that a receiver should 576 send duplicate ACKs for every out-of-order data packet, such as a 577 data packet received during Fast Recovery. The ACK heuristic is more 578 likely to fail if the receiver does not follow this advice, because 579 then a smaller number of ACK losses are needed to produce a 580 sufficient jump in the cumulative ACK. 582 6.2. Timestamp Heuristic. 584 If this heuristic is used, the sender stores the timestamp of the 585 last acknowledged segment. In addition, the second paragraph of step 586 1 in Section 3 is replaced as follows: 588 1**) If the Cumulative Acknowledgement field didn't cover more than 589 "recover", check to see if the echoed timestamp equals the 590 stored timestamp. If true, duplicate ACKs indicate a lost 591 segment (proceed to Step 1A in Section 3). Otherwise, duplicate 592 ACKs likely result from unnecessary retransmissions (proceed 593 to Step 1B in Section 3). 595 Examples of applying the timestamp heuristic are in validation tests 596 "./test-all-newreno newreno_rto_loss_tsh" and "./test-all-newreno 597 newreno_rto_dup_tsh". The timestamp heuristic works correctly both 598 when the receiver echoes timestamps as specified by [RFC1323] or by 599 its revision attempts. However, if the receiver arbitrarily echos 600 timestamps, the heuristic can fail. The heuristic can also fail if a 601 timeout was spurious and returning ACKs are not from retransmitted 602 segments. This can be prevented by detection algorithms such as 603 [RFC3522]. 605 7. Implementation Issues for the Data Receiver 607 [RFC2581] specifies that "Out-of-order data segments SHOULD be 608 acknowledged immediately, in order to accelerate loss recovery." 609 Neal Cardwell has noted that some data receivers do not send an 610 immediate acknowledgement when they send a partial acknowledgment, 611 but instead wait first for their delayed acknowledgement timer to 612 expire [C98]. As [C98] notes, this severely limits the potential 613 benefit from NewReno by delaying the receipt of the partial 614 acknowledgement at the data sender. Echoing RFC 2581, our 615 recommendation is that the data receiver send an immediate 616 acknowledgement for an out-of-order segment, even when that out-of- 617 order segment fills a hole in the buffer. 619 8. Implementation Issues for the Data Sender 621 In Section 3, Step 5 above, it is noted that implementations should 622 take measures to avoid a possible burst of data when leaving Fast 623 Recovery, in case the amount of new data that the sender is eligible 624 to send due to the new value of the congestion window is large. This 625 can arise during NewReno when ACKs are lost or treated as pure window 626 updates, thereby causing the sender to underestimate the number of 627 new segments that can be sent during the recovery procedure. 628 Specifically, bursts can occur when the FlightSize is much less than 629 the new congestion window when exiting from Fast Recovery. One 630 simple mechanism to avoid a burst of data when leaving Fast Recovery 631 is to limit the number of data packets that can be sent in response 632 to a single acknowledgment. (This is known as "maxburst_" in the ns 633 simulator.) Other possible mechanisms for avoiding bursts include 634 rate-based pacing, or setting the slow-start threshold to the 635 resultant congestion window and then resetting the congestion window 636 to FlightSize. A recommendation on the general mechanism to avoid 637 excessively bursty sending patterns is outside the scope of this 638 document. 640 An implementation may want to use a separate flag to record whether 641 or not it is presently in the Fast Recovery procedure. The use of 642 the value of the duplicate acknowledgment counter as such a flag is 643 not reliable because it can be reset upon window updates and out-of- 644 order acknowledgments. 646 When not in Fast Recovery, the value of the state variable "recover" 647 should be pulled along with the value of the state variable for 648 acknowledgments (typically, "snd_una") so that, when large amounts of 649 data has been sent and acked, the sequence space does not wrap and 650 falsely indicate that Fast Recovery should not be entered (Section 3, 651 step 1, last paragraph). 653 It is important for the sender to respond correctly to duplicate ACKs 654 received when the sender is no longer in Fast Recovery (e.g., because 655 of a Retransmit Timeout). The Limited Transmit procedure [RFC3042] 656 describes possible responses to the first and second duplicate 657 acknowledgements. When three or more duplicate acknowledgements are 658 received, the Cumulative Acknowledgement field doesn't cover more 659 than "recover", and a new Fast Recovery is not invoked, it is 660 important that the sender not execute the Fast Recovery steps (3) and 661 (4) in Section 3. Otherwise, the sender could end up in a chain of 662 spurious timeouts. We mention this only because several NewReno 663 implementations had this bug, including the implementation in the NS 664 simulator. (This bug in the NS simulator was fixed in July 2003, 665 with the variable "exitFastRetrans_".) 667 9. Simulations 669 Simulations with NewReno are illustrated with the validation test 670 "tcl/test/test-all-newreno" in the NS simulator. The command 671 "../../ns test-suite-newreno.tcl reno" shows a simulation with Reno 672 TCP, illustrating the data sender's lack of response to a partial 673 acknowledgement. In contrast, the command "../../ns test-suite- 674 newreno.tcl newreno_B" shows a simulation with the same scenario 675 using the NewReno algorithms described in this paper. 677 10. Comparisons between Reno and NewReno TCP 679 As we stated in the introduction, we believe that the NewReno 680 modification described in this document improves the performance of 681 the Fast Retransmit and Fast Recovery algorithms of Reno TCP in a 682 wide variety of scenarios. This has been discussed in some depth in 683 [FF96], which illustrates Reno TCP's poor performance when multiple 684 packets are dropped from a window of data and also illustrates 685 NewReno TCP's good performance in that scenario. 687 We do, however, know of one scenario where Reno TCP gives better 688 performance than NewReno TCP, that we describe here for the sake of 689 completeness. Consider a scenario with no packet loss, but with 690 sufficient reordering that the TCP sender receives three duplicate 691 acknowledgements. This will trigger the Fast Retransmit and Fast 692 Recovery algorithms. With Reno TCP or with Sack TCP, this will 693 result in the unnecessary retransmission of a single packet, combined 694 with a halving of the congestion window (shown on pages 4 and 6 of 695 [F03]). With NewReno TCP, however, this reordering will also result 696 in the unnecessary retransmission of an entire window of data (shown 697 on page 5 of [F03]). 699 While Reno TCP performs better than NewReno TCP in the presence of 700 reordering, NewReno's superior performance in the presence of 701 multiple packet drops generally outweighs its less optimal 702 performance in the presence of reordering. (Sack TCP is the 703 preferred solution, with good performance in both scenarios.) This 704 document recommends the Fast Retransmit and Fast Recovery algorithms 705 of NewReno TCP instead of those of Reno TCP for those TCP connections 706 that do not support SACK. We would also note that NewReno's Fast 707 Retransmit and Fast Recovery mechanisms are widely deployed in TCP 708 implementations in the Internet today, as documented in [PF01]. For 709 example, tests of TCP implementations in several thousand web servers 710 in 2001 showed that for those TCP connections where the web browser 711 was not SACK-capable, more web servers used the Fast Retransmit and 712 Fast Recovery algorithms of NewReno than those of Reno or Tahoe TCP 713 [PF01]. 715 11. Changes Relative to RFC 2582 717 The purpose of this document is to advance the NewReno's Fast 718 Retransmit and Fast Recovery algorithms in RFC 2582 to Proposed 719 Standard. 721 The main change in this document relative to RFC 2582 is to specify 722 the Careful variant of NewReno's Fast Retransmit and Fast Recovery 723 algorithms. The base algorithm described in RFC 2582 did not attempt 724 to avoid unnecessary multiple Fast Retransmits that can occur after a 725 timeout (described in more detail in the section above). However, 726 RFC 2582 also defined "Careful" and "Less Careful" variants that 727 avoid these unnecessary Fast Retransmits, and recommended the Careful 728 variant. This document specifies the previously-named "Careful" 729 variant as the basic version of NewReno. As described below, this 730 algorithm uses a variable "recover", whose initial value is the send 731 sequence number. 733 The algorithm specified in Section 3 checks whether the 734 acknowledgement field of a partial acknowledgement covers *more* than 735 "recover". Another possible variant would be to require simply that 736 the acknowledgement field *cover* "recover" before initiating another 737 Fast Retransmit. We called this the Less Careful variant in RFC 738 2582. 740 There are two separate scenarios in which the TCP sender could 741 receive three duplicate acknowledgements acknowledging "recover" but 742 no more than "recover". One scenario would be that the data sender 743 transmitted four packets with sequence numbers higher than "recover", 744 that the first packet was dropped in the network, and the following 745 three packets triggered three duplicate acknowledgements 746 acknowledging "recover". The second scenario would be that the 747 sender unnecessarily retransmitted three packets below "recover", and 748 that these three packets triggered three duplicate acknowledgements 749 acknowledging "recover". In the absence of SACK, the TCP sender in 750 unable to distinguish between these two scenarios. 752 For the Careful variant of Fast Retransmit, the data sender would 753 have to wait for a retransmit timeout in the first scenario, but 754 would not have an unnecessary Fast Retransmit in the second scenario. 755 For the Less Careful variant to Fast Retransmit, the data sender 756 would Fast Retransmit as desired in the first scenario, and would 757 unnecessarily Fast Retransmit in the second scenario. This document 758 only specifies the Careful variant in Section 3. Unnecessary Fast 759 Retransmits with the Less Careful variant in scenarios with 760 reordering are illustrated in page 8 of [F03]. 762 The document also specifies two heuristics that the TCP sender MAY 763 use to decide to invoke Fast Retransmit even when the three duplicate 764 acknowledgements do not cover more than "recover". These heuristics, 765 an ACK-based heuristic and a timestamp heuristic, are described in 766 Sections 6.1 and 6.2 respectively. 768 12. Conclusions 770 This document specifies the NewReno Fast Retransmit and Fast Recovery 771 algorithms for TCP. This NewReno modification to TCP can be 772 important even for TCP implementations that support the SACK option, 773 because the SACK option can only be used for TCP connections when 774 both TCP end-nodes support the SACK option. NewReno performs better 775 than Reno (RFC 2581) in a number of scenarios discussed herein. 777 A number of options to the basic algorithm presented in Section 3 are 778 also described. These include the handling of the retransmission 779 timer (Section 4), the response to partial acknowledgments (Section 780 5), and the value of the congestion window when leaving Fast Recovery 781 (section 3, step 5). Our belief is that the differences between 782 these variants of NewReno are small compared to the differences 783 between Reno and NewReno. That is, the important thing is to 784 implement NewReno instead of Reno, for a TCP connection without SACK; 785 it is less important exactly which of the variants of NewReno is 786 implemented. 788 13. Acknowledgements 790 Many thanks to Anil Agarwal, Mark Allman, Armando Caro, Jeffrey Hsu, 791 Vern Paxson, Kacheong Poon, Keyur Shah, and Bernie Volz for detailed 792 feedback on this document or on its precursor RFC 2582. 794 14. References 796 Normative References 798 [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective 799 Acknowledgement Options", RFC 2018, October 1996. 801 [RFC2581] W. Stevens, M. Allman, and V. Paxson, "TCP Congestion 802 Control", RFC 2581, April 1999. 804 [RFC2582] S. Floyd and T. Henderson, The NewReno Modification to 805 TCP's Fast Recovery Algorithm, RFC 2582, April 1999. 807 [RFC2988] V. Paxson and M. Allman, Computing TCP's Retransmission 808 Timer, RFC 2988, November 2000. 810 [RFC3042] M. Allman, H. Balakrishnan, and S. Floyd, Enhancing TCP's 811 Loss Recovery Using Limited Transmit, RFC 3042, January 2001. 813 Informative References 815 [C98] N. Cardwell, "delayed ACKs for retransmitted packets: ouch!". 816 November 1998, Email to the tcpimpl mailing list, Message-ID 817 "Pine.LNX.4.02A.9811021421340.26785-100000@sake.cs.washington.edu", 818 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 820 [F98] S. Floyd, Revisions to RFC 2001, "Presentation to the TCPIMPL 821 Working Group", August 1998. URLs "ftp://ftp.ee.lbl.gov/talks/sf- 822 tcpimpl-aug98.ps" and "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl- 823 aug98.pdf". 825 [F03] S. Floyd, "Moving NewReno from Experimental to Proposed 826 Standard? Presentation to the TSVWG Working Group", March 2003. 827 URLs "http://www.icir.org/floyd/talks/newreno-Mar03.ps" and 828 "http://www.icir.org/floyd/talks/newreno-Mar03.pdf". 830 [FF96] K. Fall and S. Floyd, "Simulation-based Comparisons of Tahoe, 831 Reno and SACK TCP", Computer Communication Review, July 1996. URL 832 "ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z". 834 [F94] S. Floyd, "TCP and Successive Fast Retransmits", Technical 835 report, October 1994. URL 836 "ftp://ftp.ee.lbl.gov/papers/fastretrans.ps". 838 [Gur03] A. Gurtov, "[Tsvwg] resolving the problem of unnecessary fast 839 retransmits in go-back-N", email to the tsvwg mailing list, message 840 ID <3F25B467.9020609@cs.helsinki.fi>, July 28, 2003. URL 841 "http://www1.ietf.org/mail-archive/working- 842 groups/tsvwg/current/msg04334.html". 844 [Hen98] T. Henderson, Re: NewReno and the 2001 Revision. September 845 1998. Email to the tcpimpl mailing list, Message ID 846 "Pine.BSI.3.95.980923224136.26134A-100000@raptor.CS.Berkeley.EDU", 847 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 849 [Hoe95] J. Hoe, "Startup Dynamics of TCP's Congestion Control and 850 Avoidance Schemes", Master's Thesis, MIT, 1995. URL "http://ana- 851 www.lcs.mit.edu/anaweb/ps-papers/hoe-thesis.ps". 853 [Hoe96] J. Hoe, "Improving the Start-up Behavior of a Congestion 854 Control Scheme for TCP", ACM SIGCOMM, August 1996. URL 855 "http://www.acm.org/sigcomm/sigcomm96/program.html". 857 [LM97] D. Lin and R. Morris, "Dynamics of Random Early Detection", 858 SIGCOMM 97, September 1997. URL 859 "http://www.acm.org/sigcomm/sigcomm97/program.html". 861 [NS] The Network Simulator (NS). URL "http://www.isi.edu/nsnam/ns/". 863 [PF01] J. Padhye and S. Floyd, "Identifying the TCP Behavior of Web 864 Servers", June 2001, SIGCOMM 2001. 866 [RFC1323] V. Jacobson, R. Braden, and D. Borman, "TCP Extensions for 867 High Performance,", RFC 1323, May 1992. 869 [RFC3517] E. Blanton, M. Allman, K. Fall, and L. Wang, "A 870 Conservative Selective Acknowledgment (SACK)-based Loss Recovery 871 Algorithm for TCP", RFC 3517, April 2003. 873 [RFC3522] R. Ludwig and M. Meyer, The Eifel Detection Algorithm for 874 TCP, RFC 3522, April 2003. 876 15. Security Considerations 878 RFC 2581 discusses general security considerations concerning TCP 879 congestion control. This document describes a specific algorithm 880 that conforms with the congestion control requirements of RFC 2581, 881 and so those considerations apply to this algorithm, too. There are 882 no known additional security concerns for this specific algorithm. 884 AUTHORS' ADDRESSES 886 Sally Floyd 887 International Computer Science Institute 888 Phone: +1 (510) 666-2989 889 Email: floyd@acm.org 890 URL: http://www.icir.org/floyd/ 892 Tom Henderson 893 The Boeing Company 895 Email: thomas.r.henderson@boeing.com 897 Andrei Gurtov 898 U. Helsinki 900 Email: gurtov@cs.helsinki.fi