idnits 2.17.1 draft-henderson-tcpm-rfc3782-bis-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 39 instances of too long lines in the document, the longest one being 20 characters in excess of 72. -- The draft header indicates that this document obsoletes RFC3782, but the abstract doesn't seem to mention this, which it should. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 25, 2010) is 4903 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2018' is defined on line 775, but no explicit reference was found in the text == Unused Reference: 'RFC2988' is defined on line 787, but no explicit reference was found in the text == Unused Reference: 'RFC3517' is defined on line 853, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782) ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298) -- Obsolete informational reference (is this intentional?): RFC 2001 (ref. 'F98') (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) Summary: 6 errors (**), 0 flaws (~~), 6 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group S. Floyd 3 Internet-Draft ICSI 4 Obsoletes: 3782 (if approved) T. Henderson 5 Intended status: Standards Track Boeing 6 Expires: April 28, 2011 A. Gurtov 7 HIIT 8 October 25, 2010 10 The NewReno Modification to TCP's Fast Recovery Algorithm 11 draft-henderson-tcpm-rfc3782-bis-01.txt 13 Abstract 15 The purpose of this document is to make errata changes and adopt a 16 proposal from Yoshifumi Nishida to slightly increase the minimum 17 window size after Fast Recovery from one to two segments, to improve 18 performance when the receiver uses delayed acknowledgments. 20 Status of this Memo 22 This Internet-Draft is submitted to IETF in full conformance with the 23 provisions of BCP 78 and BCP 79. 25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF). Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/. 30 Internet-Drafts are draft documents valid for a maximum of six months 31 and may be updated, replaced, or obsoleted by other documents at any 32 time. It is inappropriate to use Internet-Drafts as reference 33 material or to cite them other than as "work in progress." 35 This Internet-Draft will expire on April 28, 2011. 37 Copyright Notice 39 Copyright (c) 2010 IETF Trust and the persons identified as 40 the document authors. All rights reserved. 42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with respect 47 to this document. Code Components extracted from this document must 48 include Simplified BSD License text as described in Section 4.e of 49 the Trust Legal Provisions and are provided without warranty as 50 described in the Simplified BSD License. 52 This document may contain material from IETF Documents or IETF 53 Contributions published or made publicly available before November 54 10, 2008. The person(s) controlling the copyright in some of this 55 material may not have granted the IETF Trust the right to allow 56 modifications of such material outside the IETF Standards Process. 57 Without obtaining an adequate license from the person(s) controlling 58 the copyright in such materials, this document may not be modified 59 outside the IETF Standards Process, and derivative works of it may 60 not be created outside the IETF Standards Process, except to format 61 it for publication as an RFC or to translate it into languages other 62 than English. 64 1. Introduction 66 For the typical implementation of the TCP Fast Recovery algorithm 67 described in [RFC2581] (first implemented in the 1990 BSD Reno 68 release, and referred to as the Reno algorithm in [FF96]), the TCP 69 data sender only retransmits a packet after a retransmit timeout has 70 occurred, or after three duplicate acknowledgements have arrived 71 triggering the Fast Retransmit algorithm. A single retransmit 72 timeout might result in the retransmission of several data packets, 73 but each invocation of the Fast Retransmit algorithm in RFC 2581 74 leads to the retransmission of only a single data packet. 76 Problems can arise, therefore, when multiple packets are 77 dropped from a single window of data and the Fast Retransmit and Fast 78 Recovery algorithms are invoked. In this case, if the SACK option is 79 available, the TCP sender has the information to make intelligent 80 decisions about which packets to retransmit and which packets not to 81 retransmit during Fast Recovery. This document applies only for TCP 82 connections that are unable to use the TCP Selective Acknowledgement 83 (SACK) option, either because the option is not locally supported or 84 because the TCP peer did not indicate a willingness to use SACK. 86 In the absence of SACK, there is little information available to the 87 TCP sender in making retransmission decisions during Fast 88 Recovery. From the three duplicate acknowledgements, the sender 89 infers a packet loss, and retransmits the indicated packet. After 90 this, the data sender could receive additional duplicate 91 acknowledgements, as the data receiver acknowledges additional data 92 packets that were already in flight when the sender entered Fast Retransmit. 94 In the case of multiple packets dropped from a single window of data, 95 the first new information available to the sender comes when the 96 sender receives an acknowledgement for the retransmitted packet (that 97 is, the packet retransmitted when Fast Retransmit was first 98 entered). If there is a single packet drop and no reordering, then the 99 acknowledgement for this packet will acknowledge all of the packets 100 transmitted before Fast Retransmit was entered. However, if there 101 are multiple packet drops, then the acknowledgement for the 102 retransmitted packet will acknowledge some but not all of the packets 103 transmitted before the Fast Retransmit. We call this acknowledgement 104 a partial acknowledgment. 106 Along with several other suggestions, [Hoe95] suggested that during 107 Fast Recovery the TCP data sender responds to a partial 108 acknowledgment by inferring that the next in-sequence packet has been 109 lost, and retransmitting that packet. This document describes a 110 modification to the Fast Recovery algorithm in RFC 2581 that 111 incorporates a response to partial acknowledgements received during 112 Fast Recovery. We call this modified Fast Recovery algorithm 113 NewReno, because it is a slight but significant variation of the 114 basic Reno algorithm in RFC 2581. This document does not discuss the 115 other suggestions in [Hoe95] and [Hoe96], such as a change to the 116 ssthresh parameter during Slow-Start, or the proposal to send a new 117 packet for every two duplicate acknowledgements during Fast 118 Recovery. The version of NewReno in this document also draws on other 119 discussions of NewReno in the literature [LM97, Hen98]. 121 We do not claim that the NewReno version of Fast Recovery described 122 here is an optimal modification of Fast Recovery for responding to 123 partial acknowledgements, for TCP connections that are unable to use 124 SACK. Based on our experiences with the NewReno modification in the 125 NS simulator [NS] and with numerous implementations of NewReno, we 126 believe that this modification improves the performance of the Fast 127 Retransmit and Fast Recovery algorithms in a wide variety of 128 scenarios. 130 2. Terminology and Definitions 132 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 133 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 134 and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 135 [RFC2119]. This RFC indicates requirement levels for compliant TCP 136 implementations implementing the NewReno Fast Retransmit and Fast 137 Recovery algorithms described in this document. 139 This document assumes that the reader is familiar with the terms 140 SENDER MAXIMUM SEGMENT SIZE (SMSS), CONGESTION WINDOW (cwnd), and 141 FLIGHT SIZE (FlightSize) defined in [RFC2581]. FLIGHT SIZE is 142 defined as in [RFC2581] as follows: 144 FLIGHT SIZE: 145 The amount of data that has been sent but not yet acknowledged. 147 3. The Fast Retransmit and Fast Recovery Algorithms in NewReno 149 The standard implementation of the Fast Retransmit and Fast Recovery 150 algorithms is given in [RFC2581]. This section specifies the basic 151 NewReno algorithm. Sections 4 through 6 describe some optional 152 variants, and the motivations behind them, that an implementor may 153 want to consider when tuning performance for certain network 154 scenarios. Sections 7 and 8 provide some guidance to implementors 155 based on experience with NewReno implementations. 157 The NewReno modification concerns the Fast Recovery procedure that 158 begins when three duplicate ACKs are received and ends when either a 159 retransmission timeout occurs or an ACK arrives that acknowledges all 160 of the data up to and including the data that was outstanding when 161 the Fast Recovery procedure began. 163 The NewReno algorithm specified in this document differs from the 164 implementation in [RFC2581] in the introduction of the variable 165 "recover" in step 1, in the response to a partial or new 166 acknowledgement in step 5, and in modifications to step 1 and the 167 addition of step 6 for avoiding multiple Fast Retransmits caused by 168 the retransmission of packets already received by the receiver. 170 The algorithm specified in this document uses a variable "recover", 171 whose initial value is the initial send sequence number. 173 1) Three duplicate ACKs: 174 When the third duplicate ACK is received and the sender is not 175 already in the Fast Recovery procedure, check to see if the 176 Cumulative Acknowledgement field covers more than 177 "recover". If so, go to Step 1A. Otherwise, go to Step 1B. 179 1A) Invoking Fast Retransmit: 180 If so, then set ssthresh to no more than the value given in 181 equation 1 below. (This is equation 3 from [RFC2581]). 183 ssthresh = max (FlightSize / 2, 2*SMSS) (1) 185 In addition, record the highest sequence number transmitted in 186 the variable "recover", and go to Step 2. 188 1B) Not invoking Fast Retransmit: 189 Do not enter the Fast Retransmit and Fast Recovery procedure. In 190 particular, do not change ssthresh, do not go to Step 2 to 191 retransmit the "lost" segment, and do not execute Step 3 upon 192 subsequent duplicate ACKs. 194 2) Entering Fast Retransmit: 195 Retransmit the lost segment and set cwnd to ssthresh plus 196 3*SMSS. This artificially "inflates" the congestion window by the number 197 of segments (three) that have left the network and the 198 receiver has buffered. 200 3) Fast Recovery: 201 For each additional duplicate ACK received while in Fast 202 Recovery, increment cwnd by SMSS. This artificially inflates 203 the congestion window in order to reflect the additional segment 204 that has left the network. 206 4) Fast Recovery, continued: 207 Transmit a segment, if allowed by the new value of cwnd and the 208 receiver's advertised window. 210 5) When an ACK arrives that acknowledges new data, this ACK could be 211 the acknowledgment elicited by the retransmission from step 2, or 212 elicited by a later retransmission. 214 Full acknowledgements: 215 If this ACK acknowledges all of the data up to and including 216 "recover", then the ACK acknowledges all the intermediate 217 segments sent between the original transmission of the lost 218 segment and the receipt of the third duplicate ACK. Set cwnd to 219 either (1) min (ssthresh, max(FlightSize, SMSS) + SMSS) or (2) ssthresh, 220 where ssthresh is the value set in step 1; this is termed 221 "deflating" the window. (We note that "FlightSize" in step 1 222 referred to the amount of data outstanding in step 1, when Fast 223 Recovery was entered, while "FlightSize" in step 5 refers to the 224 amount of data outstanding in step 5, when Fast Recovery is 225 exited.) If the second option is selected, the implementation 226 is encouraged to take measures to avoid a possible burst of 227 data, in case the amount of data outstanding in the network is 228 much less than the new congestion window allows. A simple mechanism 229 is to limit the number of data packets that can be 230 sent in response to a single acknowledgement; this is known 231 as "maxburst_" in the NS simulator. Exit the Fast Recovery 232 procedure. 234 Partial acknowledgements: 235 If this ACK does *not* acknowledge all of the data up to and 236 including "recover", then this is a partial ACK. In this case, 237 retransmit the first unacknowledged segment. Deflate the 238 congestion window by the amount of new data acknowledged by the 239 cumulative acknowledgement field. If the partial ACK 240 acknowledges at least one SMSS of new data, then add back SMSS 241 bytes to the congestion window. As in Step 3, this artificially 242 inflates the congestion window in order to reflect the additional 243 segment that has left the network. Send a new segment if 244 permitted by the new value of cwnd. This "partial window 245 deflation" attempts to ensure that, when Fast Recovery eventually 246 ends, approximately ssthresh amount of data will be outstanding 247 in the network. Do not exit the Fast Recovery procedure (i.e., 248 if any duplicate ACKs subsequently arrive, execute Steps 3 and 249 4 above). 251 For the first partial ACK that arrives during Fast Recovery, also 252 reset the retransmit timer. Timer management is discussed in 253 more detail in Section 4. 255 6) Retransmit timeouts: 256 After a retransmit timeout, record the highest sequence number 257 transmitted in the variable "recover" and exit the Fast 258 Recovery procedure if applicable. 260 Step 1 specifies a check that the Cumulative Acknowledgement field 261 covers more than "recover". Because the acknowledgement field 262 contains the sequence number that the sender next expects to receive, 263 the acknowledgement "ack_number" covers more than "recover" when: 265 ack_number - 1 > recover; 267 i.e., at least one byte more of data is acknowledged beyond the 268 highest byte that was outstanding when Fast Retransmit was last 269 entered. 271 Note that in Step 5, the congestion window is deflated after a 272 partial acknowledgement is received. The congestion window was 273 likely to have been inflated considerably when the partial 274 acknowledgement was received. In addition, depending on the original 275 pattern of packet losses, the partial acknowledgement might 276 acknowledge nearly a window of data. In this case, if the congestion 277 window was not deflated, the data sender might be able to send nearly 278 a window of data back-to-back. 280 This document does not specify the sender's response to duplicate 281 ACKs when the Fast Retransmit/Fast Recovery algorithm is not 282 invoked. This is addressed in other documents, such as those describing the 283 Limited Transmit procedure [RFC3042]. This document also does not 284 address issues of adjusting the duplicate acknowledgement threshold, 285 but assumes the threshold specified in the IETF standards; the 286 current standard is RFC 2581, which specifies a threshold of three 287 duplicate acknowledgements. 289 As a final note, we would observe that in the absence of the SACK 290 option, the data sender is working from limited information. When 291 the issue of recovery from multiple dropped packets from a single 292 window of data is of particular importance, the best alternative 293 would be to use the SACK option. 295 4. Resetting the Retransmit Timer in Response to Partial 296 Acknowledgements 298 One possible variant to the response to partial acknowledgements 299 specified in Section 3 concerns when to reset the retransmit timer 300 after a partial acknowledgement. The algorithm in Section 3, Step 5, 301 resets the retransmit timer only after the first partial ACK. In 302 this case, if a large number of packets were dropped from a window of 303 data, the TCP data sender's retransmit timer will ultimately expire, 304 and the TCP data sender will invoke Slow-Start. (This is illustrated 305 on page 12 of [F98].) We call this the Impatient variant of NewReno. 306 We note that the Impatient variant in Section 3 doesn't follow the 307 recommended algorithm in RFC 2988 of restarting the retransmit timer 308 after every packet transmission or retransmission [RFC2988, Step 309 5.1]. 311 In contrast, the NewReno simulations in [FF96] illustrate the 312 algorithm described above with the modification that the retransmit 313 timer is reset after each partial acknowledgement. We call this the 314 Slow-but-Steady variant of NewReno. In this case, for a window with 315 a large number of packet drops, the TCP data sender retransmits at 316 most one packet per roundtrip time. (This behavior is illustrated in 317 the New-Reno TCP simulation of Figure 5 in [FF96], and on page 11 of 318 [F98]). 320 When N packets have been dropped from a window of data for a large 321 value of N, the Slow-but-Steady variant can remain in Fast Recovery 322 for N round-trip times, retransmitting one more dropped packet each 323 round-trip time; for these scenarios, the Impatient variant gives a 324 faster recovery and better performance. The tests "ns 325 test-suite-newreno.tcl impatient1" and "ns test-suite-newreno.tcl 326 slow1" in the NS simulator illustrate such a scenario, where the 327 Impatient variant performs better than the Slow-but-Steady 328 variant. The Impatient variant can be particularly important for TCP 329 connections with large congestion windows, as illustrated by the tests 330 "ns test-suite-newreno.tcl impatient4" and "ns test-suite-newreno.tcl 331 slow4" in the NS simulator. 333 One can also construct scenarios where the Slow-but-Steady variant 334 gives better performance than the Impatient variant. As an example, 335 this occurs when only a small number of packets are dropped, the RTO 336 is sufficiently small that the retransmit timer expires, and 337 performance would have been better without a retransmit timeout. The 338 tests "ns test-suite-newreno.tcl impatient2" and "ns 339 test-suite-newreno.tcl slow2" in the NS simulator illustrate such a 340 scenario. 342 The Slow-but-Steady variant can also achieve higher goodput than the 343 Impatient variant, by avoiding unnecessary retransmissions. This 344 could be of special interest for cellular links, where every 345 transmission costs battery power and money. The tests "ns 346 test-suite-newreno.tcl impatient3" and "ns test-suite-newreno.tcl 347 slow3" in the NS simulator illustrate such a scenario. The Slow-but-Steady 348 variant can also be more robust to delay variation in the network, 349 where a delay spike might force the Impatient variant into a timeout 350 and go-back-N recovery. 352 Neither of the two variants discussed above are optimal. Our 353 recommendation is for the Impatient variant, as specified in Section 354 3 of this document, because of the poor performance of the 355 Slow-but-Steady variant for TCP connections with large congestion 356 windows. 358 One possibility for a more optimal algorithm would be one that 359 recovered from multiple packet drops as quickly as does slow-start, 360 while resetting the retransmit timers after each partial 361 acknowledgement, as described in the section below. We note, 362 however, that there is a limitation to the potential performance in 363 this case in the absence of the SACK option. 365 5. Retransmissions after a Partial Acknowledgement 367 One possible variant to the response to partial acknowledgements 368 specified in Section 3 would be to retransmit more than one packet 369 after each partial acknowledgement, and to reset the retransmit timer 370 after each retransmission. The algorithm specified in Section 3 371 retransmits a single packet after each partial acknowledgement. This 372 is the most conservative alternative, in that it is the least likely 373 to result in an unnecessarily-retransmitted packet. A variant that 374 would recover faster from a window with many packet drops would be to 375 effectively Slow-Start, retransmitting two packets after each partial 376 acknowledgement. Such an approach would take less than N roundtrip 377 times to recover from N losses [Hoe96]. However, in the absence of 378 SACK, recovering as quickly as slow-start introduces the likelihood 379 of unnecessarily retransmitting packets, and this could significantly 380 complicate the recovery mechanisms. 382 We note that the response to partial acknowledgements specified in 383 Section 3 of this document and in RFC 2582 differs from the response 384 in [FF96], even though both approaches only retransmit one packet in 385 response to a partial acknowledgement. Step 5 of Section 3 specifies 386 that the TCP sender responds to a partial ACK by deflating the 387 congestion window by the amount of new data acknowledged, adding 388 back SMSS bytes if the partial ACK acknowledges at least SMSS bytes 389 of new data, and sending a new segment if permitted by the new value 390 of cwnd. Thus, only one previously-sent packet is retransmitted in 391 response to each partial acknowledgement, but additional new packets 392 might be transmitted as well, depending on the amount of new data 393 acknowledged by the partial acknowledgement. In contrast, the 394 variant of NewReno illustrated in [FF96] simply set the congestion 395 window to ssthresh when a partial acknowledgement was received. The 396 approach in [FF96] is more conservative, and does not attempt to 397 accurately track the actual number of outstanding packets after a 398 partial acknowledgement is received. While either of these 399 approaches gives acceptable performance, the variant specified in 400 Section 3 recovers more smoothly when multiple packets are dropped 401 from a window of data. (The [FF96] behavior can be seen in the NS 402 simulator by setting the variable "partial_window_deflation_" for 403 "Agent/TCP/Newreno" to 0; the behavior specified in Section 3 is 404 achieved by setting "partial_window_deflation_" to 1.) 406 6. Avoiding Multiple Fast Retransmits 408 This section describes the motivation for the sender's state variable 409 "recover", and discusses possible heuristics for distinguishing 410 between a retransmitted packet that was dropped, and three duplicate 411 acknowledgements from the unnecessary retransmission of three 412 packets. 414 In the absence of the SACK option or timestamps, a duplicate 415 acknowledgement carries no information to identify the data packet or 416 packets at the TCP data receiver that triggered that duplicate 417 acknowledgement. In this case, the TCP data sender is unable to 418 distinguish between a duplicate acknowledgement that results from a 419 lost or delayed data packet, and a duplicate acknowledgement that 420 results from the sender's unnecessary retransmission of a data packet 421 that had already been received at the TCP data receiver. Because of 422 this, with the Retransmit and Fast Recovery algorithms in Reno TCP, 423 multiple segment losses from a single window of data can sometimes 424 result in unnecessary multiple Fast Retransmits (and multiple 425 reductions of the congestion window) [F94]. 427 With the Fast Retransmit and Fast Recovery algorithms in Reno TCP, 428 the performance problems caused by multiple Fast Retransmits are 429 relatively minor compared to the potential problems with Tahoe TCP, 430 which does not implement Fast Recovery. Nevertheless, unnecessary 431 Fast Retransmits can occur with Reno TCP unless some explicit 432 mechanism is added to avoid this, such as the use of the "recover" 433 variable. (This modification is called "bugfix" in [F98], and is 434 illustrated on pages 7 and 9 of that document. Unnecessary Fast 435 Retransmits for Reno without "bugfix" is illustrated on page 6 of 436 [F98].) 438 Section 3 of [RFC2582] defined a default variant of NewReno TCP that 439 did not use the variable "recover", and did not check if duplicate 440 ACKs cover the variable "recover" before invoking Fast Retransmit. 441 With this default variant from RFC 2582, the problem of multiple Fast 442 Retransmits from a single window of data can occur after a Retransmit 443 Timeout (as in page 8 of [F98]) or in scenarios with reordering (as 444 in the validation test "./test-all-newreno newreno5_noBF" in 445 directory "tcl/test" of the NS simulator. This gives performance 446 similar to that on page 8 of [F03].) RFC 2582 also defined Careful 447 and Less Careful variants of the NewReno algorithm, and recommended 448 the Careful variant. 450 The algorithm specified in Section 3 of this document corresponds to 451 the Careful variant of NewReno TCP from RFC 2582, and eliminates the 452 problem of multiple Fast Retransmits. This algorithm uses the 453 variable "recover", whose initial value is the initial send sequence 454 number. After each retransmit timeout, the highest sequence number 455 transmitted so far is recorded in the variable "recover". 457 If, after a retransmit timeout, the TCP data sender retransmits three 458 consecutive packets that have already been received by the data 459 receiver, then the TCP data sender will receive three duplicate 460 acknowledgements that do not cover more than "recover". In this 461 case, the duplicate acknowledgements are not an indication of a new 462 instance of congestion. They are simply an indication that the 463 sender has unnecessarily retransmitted at least three packets. 465 However, when a retransmitted packet is itself dropped, the sender 466 can also receive three duplicate acknowledgements that do not cover 467 more than "recover". In this case, the sender would have been 468 better off if it had initiated Fast Retransmit. For a TCP that 469 implements the algorithm specified in Section 3 of this document, the 470 sender does not infer a packet drop from duplicate acknowledgements 471 in this scenario. As always, the retransmit timer is the backup 472 mechanism for inferring packet loss in this case. 474 There are several heuristics, based on timestamps or on the amount of 475 advancement of the cumulative acknowledgement field, that allow the 476 sender to distinguish, in some cases, between three duplicate 477 acknowledgements following a retransmitted packet that was dropped, 478 and three duplicate acknowledgements from the unnecessary 479 retransmission of three packets [Gur03, GF04]. The TCP sender MAY use such 480 a heuristic to decide to invoke a Fast Retransmit in some cases, even 481 when the three duplicate acknowledgements do not cover more than 482 "recover". 484 For example, when three duplicate acknowledgements are caused by the 485 unnecessary retransmission of three packets, this is likely to be 486 accompanied by the cumulative acknowledgement field advancing by at 487 least four segments. Similarly, a heuristic based on timestamps uses 488 the fact that when there is a hole in the sequence space, the 489 timestamp echoed in the duplicate acknowledgement is the timestamp of 490 the most recent data packet that advanced the cumulative 491 acknowledgement field [RFC1323]. If timestamps are used, and the 492 sender stores the timestamp of the last acknowledged segment, then 493 the timestamp echoed by duplicate acknowledgements can be used to 494 distinguish between a retransmitted packet that was dropped and 495 three duplicate acknowledgements from the unnecessary 496 retransmission of three packets. The heuristics are illustrated in 497 the NS simulator in the validation test "./test-all-newreno". 499 6.1. ACK Heuristic 501 If the ACK-based heuristic is used, then following the advancement of 502 the cumulative acknowledgement field, the sender stores the value of the 503 previous cumulative acknowledgement as prev_highest_ack, and stores 504 the latest cumulative ACK as highest_ack. In addition, the following 505 step is performed if Step 1 in Section 3 fails, before proceeding to 506 Step 1B. 508 1*) If the Cumulative Acknowledgement field didn't cover more than 509 "recover", check to see if the congestion window is greater 510 than SMSS bytes and the difference between highest_ack and 511 prev_highest_ack is at most 4*SMSS bytes. If true, duplicate 512 ACKs indicate a lost segment (proceed to Step 1A in Section 513 3). Otherwise, duplicate ACKs likely result from unnecessary 514 retransmissions (proceed to Step 1B in Section 3). 516 The congestion window check serves to protect against fast retransmit 517 immediately after a retransmit timeout, similar to the 518 "exitFastRetrans_" variable in NS. Examples of applying the ACK 519 heuristic are in validation tests "./test-all-newreno 520 newreno_rto_loss_ack" and "./test-all-newreno newreno_rto_dup_ack" in 521 directory "tcl/test" of the NS simulator. 523 If several ACKs are lost, the sender can see a jump in the cumulative 524 ACK of more than three segments, and the heuristic can fail. A 525 validation test for this scenario is "./test-all-newreno 526 newreno_rto_loss_ackf". RFC 2581 recommends that a receiver should 527 send duplicate ACKs for every out-of-order data packet, such as a 528 data packet received during Fast Recovery. The ACK heuristic is more 529 likely to fail if the receiver does not follow this advice, because 530 then a smaller number of ACK losses are needed to produce a 531 sufficient jump in the cumulative ACK. 533 6.2. Timestamp Heuristic 535 If this heuristic is used, the sender stores the timestamp of the 536 last acknowledged segment. In addition, the second paragraph of step 537 1 in Section 3 is replaced as follows: 539 1**) If the Cumulative Acknowledgement field didn't cover more than 540 "recover", check to see if the echoed timestamp in the last 541 non-duplicate acknowledgment equals the 542 stored timestamp. If true, duplicate ACKs indicate a lost 543 segment (proceed to Step 1A in Section 3). Otherwise, duplicate 544 ACKs likely result from unnecessary retransmissions (proceed 545 to Step 1B in Section 3). 547 Examples of applying the timestamp heuristic are in validation tests 548 "./test-all-newreno newreno_rto_loss_tsh" and "./test-all-newreno 549 newreno_rto_dup_tsh". The timestamp heuristic works correctly, both 550 when the receiver echoes timestamps as specified by [RFC1323], and by 551 its revision attempts. However, if the receiver arbitrarily echoes 552 timestamps, the heuristic can fail. The heuristic can also fail if a 553 timeout was spurious and returning ACKs are not from retransmitted 554 segments. This can be prevented by detection algorithms such as 555 [RFC3522]. 557 7. Implementation Issues for the Data Receiver 559 [RFC2581] specifies that "Out-of-order data segments SHOULD be 560 acknowledged immediately, in order to accelerate loss recovery." 561 Neal Cardwell has noted that some data receivers do not send an 562 immediate acknowledgement when they send a partial acknowledgment, 563 but instead wait first for their delayed acknowledgement timer to 564 expire [C98]. As [C98] notes, this severely limits the potential 565 benefit of NewReno by delaying the receipt of the partial 566 acknowledgement at the data sender. Echoing RFC 2581, our 567 recommendation is that the data receiver send an immediate 568 acknowledgement for an out-of-order segment, even when that 569 out-of-order segment fills a hole in the buffer. 571 8. Implementation Issues for the Data Sender 573 In Section 3, Step 5 above, it is noted that implementations should 574 take measures to avoid a possible burst of data when leaving Fast 575 Recovery, in case the amount of new data that the sender is eligible 576 to send due to the new value of the congestion window is large. This 577 can arise during NewReno when ACKs are lost or treated as pure window 578 updates, thereby causing the sender to underestimate the number of 579 new segments that can be sent during the recovery procedure. 580 Specifically, bursts can occur when the FlightSize is much less than 581 the new congestion window when exiting from Fast Recovery. One 582 simple mechanism to avoid a burst of data when leaving Fast Recovery 583 is to limit the number of data packets that can be sent in response 584 to a single acknowledgment. (This is known as "maxburst_" in the ns 585 simulator.) Other possible mechanisms for avoiding bursts include 586 rate-based pacing, or setting the slow-start threshold to the 587 resultant congestion window and then resetting the congestion window 588 to FlightSize. A recommendation on the general mechanism to avoid 589 excessively bursty sending patterns is outside the scope of this 590 document. 592 An implementation may want to use a separate flag to record whether 593 or not it is presently in the Fast Recovery procedure. The use of 594 the value of the duplicate acknowledgment counter for this purpose is 595 not reliable because it can be reset upon window updates and 596 out-of-order acknowledgments. 598 When updating the Cumulative Acknowledgement field outside of 599 Fast Recovery, the "recover" state variable may also need to be 600 updated in order to continue to permit possible entry into Fast 601 Recovery (Section 3, step 1). This issue arises when an update 602 of the Cumulative Acknowledgement field results in a sequence 603 wraparound that affects the ordering between the Cumulative 604 Acknowledgement field and the "recover" state variable. Entry 605 into Fast Recovery is only possible when the Cumulative 606 Acknowledgment field covers more than the "recover" state variable. 608 It is important for the sender to respond correctly to duplicate ACKs 609 received when the sender is no longer in Fast Recovery (e.g., because 610 of a Retransmit Timeout). The Limited Transmit procedure [RFC3042] 611 describes possible responses to the first and second duplicate 612 acknowledgements. When three or more duplicate acknowledgements are 613 received, the Cumulative Acknowledgement field doesn't cover more 614 than "recover", and a new Fast Recovery is not invoked, it is 615 important that the sender not execute the Fast Recovery steps (3) and 616 (4) in Section 3. Otherwise, the sender could end up in a chain of 617 spurious timeouts. We mention this only because several NewReno 618 implementations had this bug, including the implementation in the NS 619 simulator. (This bug in the NS simulator was fixed in July 2003, 620 with the variable "exitFastRetrans_".) 622 It has been observed that some TCP implementations enter a slow start 623 or congestion avoidance window updating algorithm immediately after 624 the cwnd is set by the equation found in (Section 3, step 5), even 625 without a new external event generating the cwnd change. Note that 626 after cwnd is set based on the procedure for exiting Fast Recovery 627 (Section 3, step 5), cwnd SHOULD NOT be updated until a further 628 event occurs (e.g., arrival of an ack, or timeout) after this 629 adjustment. 631 9. Simulations 633 Simulations with NewReno are illustrated with the validation test 634 "tcl/test/test-all-newreno" in the NS simulator. The command 635 "../../ns test-suite-newreno.tcl reno" shows a simulation with Reno 636 TCP, illustrating the data sender's lack of response to a partial 637 acknowledgement. In contrast, the command "../../ns 638 test-suite-newreno.tcl newreno_B" shows a simulation with the same 639 scenario using the NewReno algorithms described in this paper. 641 10. Comparisons between Reno and NewReno TCP 643 As we stated in the introduction, we believe that the NewReno 644 modification described in this document improves the performance of 645 the Fast Retransmit and Fast Recovery algorithms of Reno TCP in a 646 wide variety of scenarios. This has been discussed in some depth in 647 [FF96], which illustrates Reno TCP's poor performance when multiple 648 packets are dropped from a window of data and also illustrates 649 NewReno TCP's good performance in that scenario. 651 We do, however, know of one scenario where Reno TCP gives better 652 performance than NewReno TCP, that we describe here for the sake of 653 completeness. Consider a scenario with no packet loss, but with 654 sufficient reordering so that the TCP sender receives three duplicate 655 acknowledgements. This will trigger the Fast Retransmit and Fast 656 Recovery algorithms. With Reno TCP or with Sack TCP, this will 657 result in the unnecessary retransmission of a single packet, combined 658 with a halving of the congestion window (shown on pages 4 and 6 of 659 [F03]). With NewReno TCP, however, this reordering will also result 660 in the unnecessary retransmission of an entire window of data (shown 661 on page 5 of [F03]). 663 While Reno TCP performs better than NewReno TCP in the presence of 664 reordering, NewReno's superior performance in the presence of 665 multiple packet drops generally outweighs its less optimal 666 performance in the presence of reordering. (Sack TCP is the 667 preferred solution, with good performance in both scenarios.) This 668 document recommends the Fast Retransmit and Fast Recovery algorithms 669 of NewReno TCP instead of those of Reno TCP for those TCP connections 670 that do not support SACK. We would also note that NewReno's Fast 671 Retransmit and Fast Recovery mechanisms are widely deployed in TCP 672 implementations in the Internet today, as documented in [PF01]. For 673 example, tests of TCP implementations in several thousand web servers 674 in 2001 showed that for those TCP connections where the web browser 675 was not SACK-capable, more web servers used the Fast Retransmit and 676 Fast Recovery algorithms of NewReno than those of Reno or Tahoe TCP 677 [PF01]. 679 11. Changes Relative to RFC 2582 681 The purpose of this document is to advance the NewReno's Fast 682 Retransmit and Fast Recovery algorithms in RFC 2582 to Standards Track. 684 The main change in this document relative to RFC 2582 is to specify 685 the Careful variant of NewReno's Fast Retransmit and Fast Recovery 686 algorithms. The base algorithm described in RFC 2582 did not attempt 687 to avoid unnecessary multiple Fast Retransmits that can occur after a 688 timeout (described in more detail in the section above). However, 689 RFC 2582 also defined "Careful" and "Less Careful" variants that 690 avoid these unnecessary Fast Retransmits, and recommended the Careful 691 variant. This document specifies the previously-named "Careful" 692 variant as the basic version of NewReno. As described below, this 693 algorithm uses a variable "recover", whose initial value is the send 694 sequence number. 696 The algorithm specified in Section 3 checks whether the 697 acknowledgement field of a partial acknowledgement covers *more* than 698 "recover", as defined in Section 3. Another possible variant would be 699 to simply require that the acknowledgement field covers *more than or 700 equal to* "recover" before initiating another Fast Retransmit. We 701 called this the Less Careful variant in RFC 2582. 703 There are two separate scenarios in which the TCP sender could 704 receive three duplicate acknowledgements acknowledging "recover" but 705 no more than "recover". One scenario would be that the data sender 706 transmitted four packets with sequence numbers higher than "recover", 707 that the first packet was dropped in the network, and the following 708 three packets triggered three duplicate acknowledgements 709 acknowledging "recover". The second scenario would be that the 710 sender unnecessarily retransmitted three packets below "recover", and 711 that these three packets triggered three duplicate acknowledgements 712 acknowledging "recover". In the absence of SACK, the TCP sender is 713 unable to distinguish between these two scenarios. 715 For the Careful variant of Fast Retransmit, the data sender would 716 have to wait for a retransmit timeout in the first scenario, but 717 would not have an unnecessary Fast Retransmit in the second 718 scenario. For the Less Careful variant to Fast Retransmit, the data sender 719 would Fast Retransmit as desired in the first scenario, and would 720 unnecessarily Fast Retransmit in the second scenario. This document 721 only specifies the Careful variant in Section 3. Unnecessary Fast 722 Retransmits with the Less Careful variant in scenarios with 723 reordering are illustrated in page 8 of [F03]. 725 The document also specifies two heuristics that the TCP sender MAY 726 use to decide to invoke Fast Retransmit even when the three duplicate 727 acknowledgements do not cover more than "recover". These heuristics, 728 an ACK-based heuristic and a timestamp heuristic, are described in 729 Sections 6.1 and 6.2 respectively. 731 12. Conclusions 733 This document specifies the NewReno Fast Retransmit and Fast Recovery 734 algorithms for TCP. This NewReno modification to TCP can even be 735 important for TCP implementations that support the SACK option, 736 because the SACK option can only be used for TCP connections when 737 both TCP end-nodes support the SACK option. NewReno performs better 738 than Reno (RFC 2581) in a number of scenarios discussed herein. 740 A number of options to the basic algorithm presented in Section 3 are 741 also described. These include the handling of the retransmission 742 timer (Section 4), the response to partial acknowledgments (Section 743 5), and the value of the congestion window when leaving Fast Recovery 744 (section 3, step 5). Our belief is that the differences between 745 these variants of NewReno are small compared to the differences 746 between Reno and NewReno. That is, the important thing is to 747 implement NewReno instead of Reno, for a TCP connection without SACK; 748 it is less important exactly which of the variants of NewReno is 749 implemented. 751 13. Security Considerations 753 RFC 2581 discusses general security considerations concerning TCP 754 congestion control. This document describes a specific algorithm 755 that conforms with the congestion control requirements of RFC 2581, 756 and so those considerations apply to this algorithm, too. There are 757 no known additional security concerns for this specific algorithm. 759 14. Acknowledgements 761 Many thanks to Anil Agarwal, Mark Allman, Armando Caro, Jeffrey Hsu, 762 Vern Paxson, Kacheong Poon, Keyur Shah, and Bernie Volz for detailed 763 feedback on this document or on its precursor, RFC 2582. Jeffrey 764 Hsu provided clarifications on the handling of the recover variable 765 that were applied to RFC 3782 as errata, and now are in Section 8 766 of this document. Yoshifumi Nishida contributed a modification 767 to the fast recovery algorithm to account for the case in which 768 flightsize is 0 when the TCP sender leaves fast recovery, and the 769 TCP receiver uses delayed acknowledgments. 771 15. References 773 15.1. Normative References 775 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP 776 Selective Acknowledgement Options", RFC 2018, October 1996. 778 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 779 Requirement Levels", BCP 14, RFC 2119, March 1997. 781 [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion 782 Control", RFC 2581, April 1999. 784 [RFC2582] Floyd, S. and T. Henderson, "The NewReno Modification to 785 TCP's Fast Recovery Algorithm", RFC 2582, April 1999. 787 [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission 788 Timer", RFC 2988, November 2000. 790 [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing TCP's 791 Loss Recovery Using Limited Transmit", RFC 3042, January 2001. 793 15.2. Informative References 795 [C98] Cardwell, N., "delayed ACKs for retransmitted packets: ouch!". 796 November 1998, Email to the tcpimpl mailing list, Message-ID 797 "Pine.LNX.4.02A.9811021421340.26785-100000@sake.cs.washington.edu", 798 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 800 [F98] Floyd, S., Revisions to RFC 2001, "Presentation to the TCPIMPL 801 Working Group", August 1998. URLs 802 "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.ps" and 803 "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.pdf". 805 [F03] Floyd, S., "Moving NewReno from Experimental to Proposed 806 Standard? Presentation to the TSVWG Working Group", March 2003. 807 URLs "http://www.icir.org/floyd/talks/newreno-Mar03.ps" and 808 "http://www.icir.org/floyd/talks/newreno-Mar03.pdf". 810 [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of Tahoe, 811 Reno and SACK TCP", Computer Communication Review, July 1996. URL 812 "ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z". 814 [F94] Floyd, S., "TCP and Successive Fast Retransmits", Technical 815 report, October 1994. URL 816 "ftp://ftp.ee.lbl.gov/papers/fastretrans.ps". 818 [GF04] Gurtov, A. and S. Floyd, "Resolving Acknowledgment Ambiguity 819 in non-SACK TCP", Next Generation Teletraffic and 820 Wired/Wireless Advanced Networking (NEW2AN'04), February 821 2004. URL "http://www.cs.helsinki.fi/u/gurtov/papers/ 822 heuristics.html". 824 [Gur03] Gurtov, A., "[Tsvwg] resolving the problem of unnecessary fast 825 retransmits in go-back-N", email to the tsvwg mailing list, message 826 ID <3F25B467.9020609@cs.helsinki.fi>, July 28, 2003. URL 827 "http://www1.ietf.org/mail-archive/working-groups/tsvwg/current/msg04334.html". 829 [Hen98] Henderson, T., Re: NewReno and the 2001 Revision. September 830 1998. Email to the tcpimpl mailing list, Message ID 831 "Pine.BSI.3.95.980923224136.26134A-100000@raptor.CS.Berkeley.EDU", 832 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 834 [Hoe95] Hoe, J., "Startup Dynamics of TCP's Congestion Control and 835 Avoidance Schemes", Master's Thesis, MIT, 1995. 837 [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion 838 Control Scheme for TCP", ACM SIGCOMM, August 1996. URL 839 "http://www.acm.org/sigcomm/sigcomm96/program.html". 841 [LM97] Lin, D. and R. Morris, "Dynamics of Random Early Detection", 842 SIGCOMM 97, September 1997. URL 843 "http://www.acm.org/sigcomm/sigcomm97/program.html". 845 [NS] The Network Simulator (NS). URL "http://www.isi.edu/nsnam/ns/". 847 [PF01] Padhye, J. and S. Floyd, "Identifying the TCP Behavior of Web 848 Servers", June 2001, SIGCOMM 2001. 850 [RFC1323] Jacobson, V., Braden, R. and D. Borman, "TCP Extensions for 851 High Performance", RFC 1323, May 1992. 853 [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A 854 Conservative Selective Acknowledgment (SACK)-based Loss Recovery 855 Algorithm for TCP", RFC 3517, April 2003. 857 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for 858 TCP", RFC 3522, April 2003. 860 Appendix A. Document Revision History 862 To be removed upon publication 864 +----------+--------------------------------------------------+ 865 | Revision | Comments | 866 +----------+--------------------------------------------------+ 867 | draft-00 | Initial version, unmodified text from RFC3782 | 868 +----------+--------------------------------------------------+ 869 | draft-01 | RFC3782 errata applied, and changes applied from | 870 | | draft-nishida-newreno-modification-02 | 871 +----------+--------------------------------------------------+ 873 Authors' Addresses 875 Sally Floyd 876 International Computer Science Institute 878 Phone: +1 (510) 666-2989 879 EMail: floyd@acm.org 880 URL: http://www.icir.org/floyd/ 882 Tom Henderson 883 The Boeing Company 884 EMail: thomas.r.henderson@boeing.com 886 Andrei Gurtov 887 HIIT 888 Helsinki Institute for Information Technology 889 Aalto University 890 P.O. Box 19800 891 Helsinki FIN-00076 892 FINLAND 894 EMail: gurtov@hiit.fi