idnits 2.17.1 draft-ietf-tcpm-rfc3782-bis-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 44 instances of too long lines in the document, the longest one being 20 characters in excess of 72. ** The abstract seems to contain references ([RFC5681], [RFC2883], [RFC3782], [Hoe95]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. -- The draft header indicates that this document obsoletes RFC3782, but the abstract doesn't seem to directly say this. It does mention RFC3782 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (January 24, 2011) is 4812 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298) -- Obsolete informational reference (is this intentional?): RFC 2001 (ref. 'F98') (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 2582 (Obsoleted by RFC 3782) -- Obsolete informational reference (is this intentional?): RFC 3782 (Obsoleted by RFC 6582) Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 6 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group T. Henderson 3 Internet-Draft Boeing 4 Obsoletes: 3782 (if approved) S. Floyd 5 Intended status: Standards Track ICSI 6 Expires: June 24, 2011 A. Gurtov 7 HIIT 8 Y. Nishida 9 WIDE Project 10 January 24, 2011 12 The NewReno Modification to TCP's Fast Recovery Algorithm 13 draft-ietf-tcpm-rfc3782-bis-00.txt 15 Abstract 17 RFC 5681 [RFC5681] documents the following four intertwined TCP 18 congestion control algorithms: Slow Start, Congestion Avoidance, Fast 19 Retransmit, and Fast Recovery. RFC 5681 explicitly allows 20 certain modifications of these algorithms, including modifications 21 that use the TCP Selective Acknowledgement (SACK) option [RFC2883], 22 and modifications that respond to "partial acknowledgments" (ACKs 23 which cover new data, but not all the data outstanding when loss was 24 detected) in the absence of SACK. This document describes a specific 25 algorithm for responding to partial acknowledgments, referred to as 26 NewReno. This response to partial acknowledgments was first proposed 27 by Janey Hoe in [Hoe95]. 29 The purpose of this revision from [RFC3782] is to make errata changes 30 and to adopt a proposal from Yoshifumi Nishida to slightly increase 31 the minimum window size after Fast Recovery from one to two segments, 32 to improve performance when the receiver uses delayed acknowledgments. 34 Status of this Memo 36 This Internet-Draft is submitted to IETF in full conformance with the 37 provisions of BCP 78 and BCP 79. 39 Internet-Drafts are working documents of the Internet Engineering 40 Task Force (IETF). Note that other groups may also distribute 41 working documents as Internet-Drafts. The list of current Internet- 42 Drafts is at http://datatracker.ietf.org/drafts/current/. 44 Internet-Drafts are draft documents valid for a maximum of six months 45 and may be updated, replaced, or obsoleted by other documents at any 46 time. It is inappropriate to use Internet-Drafts as reference 47 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on June 24, 2011. 50 Copyright Notice 52 Copyright (c) 2011 IETF Trust and the persons identified as 53 the document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 This document may contain material from IETF Documents or IETF 66 Contributions published or made publicly available before November 67 10, 2008. The person(s) controlling the copyright in some of this 68 material may not have granted the IETF Trust the right to allow 69 modifications of such material outside the IETF Standards Process. 70 Without obtaining an adequate license from the person(s) controlling 71 the copyright in such materials, this document may not be modified 72 outside the IETF Standards Process, and derivative works of it may 73 not be created outside the IETF Standards Process, except to format 74 it for publication as an RFC or to translate it into languages other 75 than English. 77 1. Introduction 79 For the typical implementation of the TCP Fast Recovery algorithm 80 described in [RFC5681] (first implemented in the 1990 BSD Reno 81 release, and referred to as the Reno algorithm in [FF96]), the TCP 82 data sender only retransmits a packet after a retransmit timeout has 83 occurred, or after three duplicate acknowledgements have arrived 84 triggering the Fast Retransmit algorithm. A single retransmit 85 timeout might result in the retransmission of several data packets, 86 but each invocation of the Fast Retransmit algorithm in RFC 5681 87 leads to the retransmission of only a single data packet. 89 Problems can arise, therefore, when multiple packets are 90 dropped from a single window of data and the Fast Retransmit and Fast 91 Recovery algorithms are invoked. In this case, if the SACK option is 92 available, the TCP sender has the information to make intelligent 93 decisions about which packets to retransmit and which packets not to 94 retransmit during Fast Recovery. This document applies only for TCP 95 connections that are unable to use the TCP Selective Acknowledgement 96 (SACK) option, either because the option is not locally supported or 97 because the TCP peer did not indicate a willingness to use SACK. 99 In the absence of SACK, there is little information available to the 100 TCP sender in making retransmission decisions during Fast 101 Recovery. From the three duplicate acknowledgements, the sender 102 infers a packet loss, and retransmits the indicated packet. After 103 this, the data sender could receive additional duplicate 104 acknowledgements, as the data receiver acknowledges additional data 105 packets that were already in flight when the sender entered Fast 106 Retransmit. 108 In the case of multiple packets dropped from a single window of data, 109 the first new information available to the sender comes when the 110 sender receives an acknowledgement for the retransmitted packet (that 111 is, the packet retransmitted when Fast Retransmit was first 112 entered). If there is a single packet drop and no reordering, then the 113 acknowledgement for this packet will acknowledge all of the packets 114 transmitted before Fast Retransmit was entered. However, if there 115 are multiple packet drops, then the acknowledgement for the 116 retransmitted packet will acknowledge some but not all of the packets 117 transmitted before the Fast Retransmit. We call this acknowledgement 118 a partial acknowledgment. 120 Along with several other suggestions, [Hoe95] suggested that during 121 Fast Recovery the TCP data sender responds to a partial 122 acknowledgment by inferring that the next in-sequence packet has been 123 lost, and retransmitting that packet. This document describes a 124 modification to the Fast Recovery algorithm in RFC 5681 that 125 incorporates a response to partial acknowledgements received during 126 Fast Recovery. We call this modified Fast Recovery algorithm 127 NewReno, because it is a slight but significant variation of the 128 basic Reno algorithm in RFC 5681. This document does not discuss the 129 other suggestions in [Hoe95] and [Hoe96], such as a change to the 130 ssthresh parameter during Slow-Start, or the proposal to send a new 131 packet for every two duplicate acknowledgements during Fast 132 Recovery. The version of NewReno in this document also draws on other 133 discussions of NewReno in the literature [LM97, Hen98]. 135 We do not claim that the NewReno version of Fast Recovery described 136 here is an optimal modification of Fast Recovery for responding to 137 partial acknowledgements, for TCP connections that are unable to use 138 SACK. Based on our experiences with the NewReno modification in the 139 NS simulator [NS] and with numerous implementations of NewReno, we 140 believe that this modification improves the performance of the Fast 141 Retransmit and Fast Recovery algorithms in a wide variety of 142 scenarios. 144 2. Terminology and Definitions 146 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 147 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 148 and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 149 [RFC2119]. This RFC indicates requirement levels for compliant TCP 150 implementations implementing the NewReno Fast Retransmit and Fast 151 Recovery algorithms described in this document. 153 This document assumes that the reader is familiar with the terms 154 SENDER MAXIMUM SEGMENT SIZE (SMSS), CONGESTION WINDOW (cwnd), and 155 FLIGHT SIZE (FlightSize) defined in [RFC5681]. FLIGHT SIZE is 156 defined as in [RFC5681] as follows: 158 FLIGHT SIZE: 159 The amount of data that has been sent but not yet cumulatively 160 acknowledged. 162 3. The Fast Retransmit and Fast Recovery Algorithms in NewReno 164 The standard implementation of the Fast Retransmit and Fast Recovery 165 algorithms is given in [RFC5681]. This section specifies the basic 166 NewReno algorithm. Sections 4 through 6 describe some optional 167 variants, and the motivations behind them, that an implementor may 168 want to consider when tuning performance for certain network 169 scenarios. Sections 7 and 8 provide some guidance to implementors 170 based on experience with NewReno implementations. 172 The NewReno modification concerns the Fast Recovery procedure that 173 begins when three duplicate ACKs are received and ends when either a 174 retransmission timeout occurs or an ACK arrives that acknowledges all 175 of the data up to and including the data that was outstanding when 176 the Fast Recovery procedure began. 178 The NewReno algorithm specified in this document differs from the 179 implementation in [RFC5681] in the introduction of the variable 180 "recover" in step 1, in the response to a partial or new 181 acknowledgement in step 5, and in modifications to step 1 and the 182 addition of step 6 for avoiding multiple Fast Retransmits caused by 183 the retransmission of packets already received by the receiver. 185 The algorithm specified in this document uses a variable "recover", 186 whose initial value is the initial send sequence number. 188 1) Three duplicate ACKs: 189 When the third duplicate ACK is received and the sender is not 190 already in the Fast Recovery procedure, check to see if the 191 Cumulative Acknowledgement field covers more than 192 "recover". If so, go to Step 1A. Otherwise, go to Step 1B. 194 1A) Invoking Fast Retransmit: 195 If so, then set ssthresh to no more than the value given in 196 equation 1 below. (This is equation 4 from [RFC5681]). 198 ssthresh = max (FlightSize / 2, 2*SMSS) (1) 200 In addition, record the highest sequence number transmitted in 201 the variable "recover", and go to Step 2. 203 1B) Not invoking Fast Retransmit: 204 Do not enter the Fast Retransmit and Fast Recovery procedure. In 205 particular, do not change ssthresh, do not go to Step 2 to 206 retransmit the "lost" segment, and do not execute Step 3 upon 207 subsequent duplicate ACKs. 209 2) Entering Fast Retransmit: 210 Retransmit the lost segment and set cwnd to ssthresh plus 211 3*SMSS. This artificially "inflates" the congestion window by the 212 number of segments (three) that have left the network and the 213 receiver has buffered. 215 3) Fast Recovery: 216 For each additional duplicate ACK received while in Fast 217 Recovery, increment cwnd by SMSS. This artificially inflates 218 the congestion window in order to reflect the additional segment 219 that has left the network. 221 4) Fast Recovery, continued: 222 Transmit a segment, if allowed by the new value of cwnd and the 223 receiver's advertised window. 225 5) When an ACK arrives that acknowledges new data, this ACK could be 226 the acknowledgment elicited by the retransmission from step 2, or 227 elicited by a later retransmission. 229 Full acknowledgements: 230 If this ACK acknowledges all of the data up to and including 231 "recover", then the ACK acknowledges all the intermediate 232 segments sent between the original transmission of the lost 233 segment and the receipt of the third duplicate ACK. Set cwnd to 234 either (1) min (ssthresh, max(FlightSize, SMSS) + SMSS) or 235 (2) ssthresh, where ssthresh is the value set in step 1; this is 236 termed "deflating" the window. (We note that "FlightSize" in step 1 237 referred to the amount of data outstanding in step 1, when Fast 238 Recovery was entered, while "FlightSize" in step 5 refers to the 239 amount of data outstanding in step 5, when Fast Recovery is 240 exited.) If the second option is selected, the implementation 241 is encouraged to take measures to avoid a possible burst of 242 data, in case the amount of data outstanding in the network is 243 much less than the new congestion window allows. A simple mechanism 244 is to limit the number of data packets that can be 245 sent in response to a single acknowledgement; this is known 246 as "maxburst_" in the NS simulator. Exit the Fast Recovery 247 procedure. 249 Partial acknowledgements: 250 If this ACK does *not* acknowledge all of the data up to and 251 including "recover", then this is a partial ACK. In this case, 252 retransmit the first unacknowledged segment. Deflate the 253 congestion window by the amount of new data acknowledged by the 254 cumulative acknowledgement field. If the partial ACK 255 acknowledges at least one SMSS of new data, then add back SMSS 256 bytes to the congestion window. As in Step 3, this artificially 257 inflates the congestion window in order to reflect the additional 258 segment that has left the network. Send a new segment if 259 permitted by the new value of cwnd. This "partial window 260 deflation" attempts to ensure that, when Fast Recovery eventually 261 ends, approximately ssthresh amount of data will be outstanding 262 in the network. Do not exit the Fast Recovery procedure (i.e., 263 if any duplicate ACKs subsequently arrive, execute Steps 3 and 264 4 above). 266 For the first partial ACK that arrives during Fast Recovery, also 267 reset the retransmit timer. Timer management is discussed in 268 more detail in Section 4. 270 6) Retransmit timeouts: 271 After a retransmit timeout, record the highest sequence number 272 transmitted in the variable "recover" and exit the Fast 273 Recovery procedure if applicable. 275 Step 1 specifies a check that the Cumulative Acknowledgement field 276 covers more than "recover". Because the acknowledgement field 277 contains the sequence number that the sender next expects to receive, 278 the acknowledgement "ack_number" covers more than "recover" when: 280 ack_number - 1 > recover; 282 i.e., at least one byte more of data is acknowledged beyond the 283 highest byte that was outstanding when Fast Retransmit was last 284 entered. 286 Note that in Step 5, the congestion window is deflated after a 287 partial acknowledgement is received. The congestion window was 288 likely to have been inflated considerably when the partial 289 acknowledgement was received. In addition, depending on the original 290 pattern of packet losses, the partial acknowledgement might 291 acknowledge nearly a window of data. In this case, if the congestion 292 window was not deflated, the data sender might be able to send nearly 293 a window of data back-to-back. 295 This document does not specify the sender's response to duplicate 296 ACKs when the Fast Retransmit/Fast Recovery algorithm is not 297 invoked. This is addressed in other documents, such as those 298 describing the Limited Transmit procedure [RFC3042]. This document 299 also does not address issues of adjusting the duplicate acknowledgement 300 threshold, but assumes the threshold specified in the IETF standards; 301 the current standard is RFC 5681, which specifies a threshold of three 302 duplicate acknowledgements. 304 As a final note, we would observe that in the absence of the SACK 305 option, the data sender is working from limited information. When 306 the issue of recovery from multiple dropped packets from a single 307 window of data is of particular importance, the best alternative 308 would be to use the SACK option. 310 4. Resetting the Retransmit Timer in Response to Partial 311 Acknowledgements 313 One possible variant to the response to partial acknowledgements 314 specified in Section 3 concerns when to reset the retransmit timer 315 after a partial acknowledgement. The algorithm in Section 3, Step 5, 316 resets the retransmit timer only after the first partial ACK. In 317 this case, if a large number of packets were dropped from a window of 318 data, the TCP data sender's retransmit timer will ultimately expire, 319 and the TCP data sender will invoke Slow-Start. (This is illustrated 320 on page 12 of [F98].) We call this the Impatient variant of NewReno. 321 We note that the Impatient variant in Section 3 doesn't follow the 322 recommended algorithm in RFC 2988 of restarting the retransmit timer 323 after every packet transmission or retransmission (step 5.1 of 324 [RFC2988]). 326 In contrast, the NewReno simulations in [FF96] illustrate the 327 algorithm described above with the modification that the retransmit 328 timer is reset after each partial acknowledgement. We call this the 329 Slow-but-Steady variant of NewReno. In this case, for a window with 330 a large number of packet drops, the TCP data sender retransmits at 331 most one packet per roundtrip time. (This behavior is illustrated in 332 the New-Reno TCP simulation of Figure 5 in [FF96], and on page 11 of 333 [F98]). 335 When N packets have been dropped from a window of data for a large 336 value of N, the Slow-but-Steady variant can remain in Fast Recovery 337 for N round-trip times, retransmitting one more dropped packet each 338 round-trip time; for these scenarios, the Impatient variant gives a 339 faster recovery and better performance. The tests "ns 340 test-suite-newreno.tcl impatient1" and "ns test-suite-newreno.tcl 341 slow1" in the NS simulator illustrate such a scenario, where the 342 Impatient variant performs better than the Slow-but-Steady 343 variant. The Impatient variant can be particularly important for TCP 344 connections with large congestion windows, as illustrated by the tests 345 "ns test-suite-newreno.tcl impatient4" and "ns test-suite-newreno.tcl 346 slow4" in the NS simulator. 348 One can also construct scenarios where the Slow-but-Steady variant 349 gives better performance than the Impatient variant. As an example, 350 this occurs when only a small number of packets are dropped, the RTO 351 is sufficiently small that the retransmit timer expires, and 352 performance would have been better without a retransmit timeout. The 353 tests "ns test-suite-newreno.tcl impatient2" and "ns 354 test-suite-newreno.tcl slow2" in the NS simulator illustrate such a 355 scenario. 357 The Slow-but-Steady variant can also achieve higher goodput than the 358 Impatient variant, by avoiding unnecessary retransmissions. This 359 could be of special interest for cellular links, where every 360 transmission costs battery power and money. The tests "ns 361 test-suite-newreno.tcl impatient3" and "ns test-suite-newreno.tcl 362 slow3" in the NS simulator illustrate such a scenario. The 363 Slow-but-Steady variant can also be more robust to delay variation in 364 the network, where a delay spike might force the Impatient variant into 365 a timeout and go-back-N recovery. 367 Neither of the two variants discussed above are optimal. Our 368 recommendation is for the Impatient variant, as specified in Section 369 3 of this document, because of the poor performance of the 370 Slow-but-Steady variant for TCP connections with large congestion 371 windows. 373 One possibility for a more optimal algorithm would be one that 374 recovered from multiple packet drops as quickly as does slow-start, 375 while resetting the retransmit timers after each partial 376 acknowledgement, as described in the section below. We note, 377 however, that there is a limitation to the potential performance in 378 this case in the absence of the SACK option. 380 5. Retransmissions after a Partial Acknowledgement 382 One possible variant to the response to partial acknowledgements 383 specified in Section 3 would be to retransmit more than one packet 384 after each partial acknowledgement, and to reset the retransmit timer 385 after each retransmission. The algorithm specified in Section 3 386 retransmits a single packet after each partial acknowledgement. This 387 is the most conservative alternative, in that it is the least likely 388 to result in an unnecessarily-retransmitted packet. A variant that 389 would recover faster from a window with many packet drops would be to 390 effectively Slow-Start, retransmitting two packets after each partial 391 acknowledgement. Such an approach would take less than N roundtrip 392 times to recover from N losses [Hoe96]. However, in the absence of 393 SACK, recovering as quickly as slow-start introduces the likelihood 394 of unnecessarily retransmitting packets, and this could significantly 395 complicate the recovery mechanisms. 397 We note that the response to partial acknowledgements specified in 398 Section 3 of this document and in RFC 2582 differs from the response 399 in [FF96], even though both approaches only retransmit one packet in 400 response to a partial acknowledgement. Step 5 of Section 3 specifies 401 that the TCP sender responds to a partial ACK by deflating the 402 congestion window by the amount of new data acknowledged, adding 403 back SMSS bytes if the partial ACK acknowledges at least SMSS bytes 404 of new data, and sending a new segment if permitted by the new value 405 of cwnd. Thus, only one previously-sent packet is retransmitted in 406 response to each partial acknowledgement, but additional new packets 407 might be transmitted as well, depending on the amount of new data 408 acknowledged by the partial acknowledgement. In contrast, the 409 variant of NewReno illustrated in [FF96] simply set the congestion 410 window to ssthresh when a partial acknowledgement was received. The 411 approach in [FF96] is more conservative, and does not attempt to 412 accurately track the actual number of outstanding packets after a 413 partial acknowledgement is received. While either of these 414 approaches gives acceptable performance, the variant specified in 415 Section 3 recovers more smoothly when multiple packets are dropped 416 from a window of data. (The [FF96] behavior can be seen in the NS 417 simulator by setting the variable "partial_window_deflation_" for 418 "Agent/TCP/Newreno" to 0; the behavior specified in Section 3 is 419 achieved by setting "partial_window_deflation_" to 1.) 421 6. Avoiding Multiple Fast Retransmits 423 This section describes the motivation for the sender's state variable 424 "recover", and discusses possible heuristics for distinguishing 425 between a retransmitted packet that was dropped, and three duplicate 426 acknowledgements from the unnecessary retransmission of three 427 packets. 429 In the absence of the SACK option or timestamps, a duplicate 430 acknowledgement carries no information to identify the data packet or 431 packets at the TCP data receiver that triggered that duplicate 432 acknowledgement. In this case, the TCP data sender is unable to 433 distinguish between a duplicate acknowledgement that results from a 434 lost or delayed data packet, and a duplicate acknowledgement that 435 results from the sender's unnecessary retransmission of a data packet 436 that had already been received at the TCP data receiver. Because of 437 this, with the Retransmit and Fast Recovery algorithms in Reno TCP, 438 multiple segment losses from a single window of data can sometimes 439 result in unnecessary multiple Fast Retransmits (and multiple 440 reductions of the congestion window) [F94]. 442 With the Fast Retransmit and Fast Recovery algorithms in Reno TCP, 443 the performance problems caused by multiple Fast Retransmits are 444 relatively minor compared to the potential problems with Tahoe TCP, 445 which does not implement Fast Recovery. Nevertheless, unnecessary 446 Fast Retransmits can occur with Reno TCP unless some explicit 447 mechanism is added to avoid this, such as the use of the "recover" 448 variable. (This modification is called "bugfix" in [F98], and is 449 illustrated on pages 7 and 9 of that document. Unnecessary Fast 450 Retransmits for Reno without "bugfix" is illustrated on page 6 of 451 [F98].) 453 Section 3 of [RFC2582] defined a default variant of NewReno TCP that 454 did not use the variable "recover", and did not check if duplicate 455 ACKs cover the variable "recover" before invoking Fast Retransmit. 456 With this default variant from RFC 2582, the problem of multiple Fast 457 Retransmits from a single window of data can occur after a Retransmit 458 Timeout (as in page 8 of [F98]) or in scenarios with reordering (as 459 in the validation test "./test-all-newreno newreno5_noBF" in 460 directory "tcl/test" of the NS simulator. This gives performance 461 similar to that on page 8 of [F03].) RFC 2582 also defined Careful 462 and Less Careful variants of the NewReno algorithm, and recommended 463 the Careful variant. 465 The algorithm specified in Section 3 of this document corresponds to 466 the Careful variant of NewReno TCP from RFC 2582, and eliminates the 467 problem of multiple Fast Retransmits. This algorithm uses the 468 variable "recover", whose initial value is the initial send sequence 469 number. After each retransmit timeout, the highest sequence number 470 transmitted so far is recorded in the variable "recover". 472 If, after a retransmit timeout, the TCP data sender retransmits three 473 consecutive packets that have already been received by the data 474 receiver, then the TCP data sender will receive three duplicate 475 acknowledgements that do not cover more than "recover". In this 476 case, the duplicate acknowledgements are not an indication of a new 477 instance of congestion. They are simply an indication that the 478 sender has unnecessarily retransmitted at least three packets. 480 However, when a retransmitted packet is itself dropped, the sender 481 can also receive three duplicate acknowledgements that do not cover 482 more than "recover". In this case, the sender would have been 483 better off if it had initiated Fast Retransmit. For a TCP that 484 implements the algorithm specified in Section 3 of this document, the 485 sender does not infer a packet drop from duplicate acknowledgements 486 in this scenario. As always, the retransmit timer is the backup 487 mechanism for inferring packet loss in this case. 489 There are several heuristics, based on timestamps or on the amount of 490 advancement of the cumulative acknowledgement field, that allow the 491 sender to distinguish, in some cases, between three duplicate 492 acknowledgements following a retransmitted packet that was dropped, 493 and three duplicate acknowledgements from the unnecessary 494 retransmission of three packets [Gur03, GF04]. The TCP sender MAY use 495 such a heuristic to decide to invoke a Fast Retransmit in some cases, 496 even when the three duplicate acknowledgements do not cover more than 497 "recover". 499 For example, when three duplicate acknowledgements are caused by the 500 unnecessary retransmission of three packets, this is likely to be 501 accompanied by the cumulative acknowledgement field advancing by at 502 least four segments. Similarly, a heuristic based on timestamps uses 503 the fact that when there is a hole in the sequence space, the 504 timestamp echoed in the duplicate acknowledgement is the timestamp of 505 the most recent data packet that advanced the cumulative 506 acknowledgement field [RFC1323]. If timestamps are used, and the 507 sender stores the timestamp of the last acknowledged segment, then 508 the timestamp echoed by duplicate acknowledgements can be used to 509 distinguish between a retransmitted packet that was dropped and 510 three duplicate acknowledgements from the unnecessary 511 retransmission of three packets. The heuristics are illustrated in 512 the NS simulator in the validation test "./test-all-newreno". 514 6.1. ACK Heuristic 516 If the ACK-based heuristic is used, then following the advancement of 517 the cumulative acknowledgement field, the sender stores the value of 518 the previous cumulative acknowledgement as prev_highest_ack, and stores 519 the latest cumulative ACK as highest_ack. In addition, the following 520 step is performed if Step 1 in Section 3 fails, before proceeding to 521 Step 1B. 523 1*) If the Cumulative Acknowledgement field didn't cover more than 524 "recover", check to see if the congestion window is greater 525 than SMSS bytes and the difference between highest_ack and 526 prev_highest_ack is at most 4*SMSS bytes. If true, duplicate 527 ACKs indicate a lost segment (proceed to Step 1A in Section 528 3). Otherwise, duplicate ACKs likely result from unnecessary 529 retransmissions (proceed to Step 1B in Section 3). 531 The congestion window check serves to protect against fast retransmit 532 immediately after a retransmit timeout, similar to the 533 "exitFastRetrans_" variable in NS. Examples of applying the ACK 534 heuristic are in validation tests "./test-all-newreno 535 newreno_rto_loss_ack" and "./test-all-newreno newreno_rto_dup_ack" in 536 directory "tcl/test" of the NS simulator. 538 If several ACKs are lost, the sender can see a jump in the cumulative 539 ACK of more than three segments, and the heuristic can fail. A 540 validation test for this scenario is "./test-all-newreno 541 newreno_rto_loss_ackf". RFC 5681 recommends that a receiver should 542 send duplicate ACKs for every out-of-order data packet, such as a 543 data packet received during Fast Recovery. The ACK heuristic is more 544 likely to fail if the receiver does not follow this advice, because 545 then a smaller number of ACK losses are needed to produce a 546 sufficient jump in the cumulative ACK. 548 6.2. Timestamp Heuristic 550 If this heuristic is used, the sender stores the timestamp of the 551 last acknowledged segment. In addition, the second paragraph of step 552 1 in Section 3 is replaced as follows: 554 1**) If the Cumulative Acknowledgement field didn't cover more than 555 "recover", check to see if the echoed timestamp in the last 556 non-duplicate acknowledgment equals the 557 stored timestamp. If true, duplicate ACKs indicate a lost 558 segment (proceed to Step 1A in Section 3). Otherwise, duplicate 559 ACKs likely result from unnecessary retransmissions (proceed 560 to Step 1B in Section 3). 562 Examples of applying the timestamp heuristic are in validation tests 563 "./test-all-newreno newreno_rto_loss_tsh" and "./test-all-newreno 564 newreno_rto_dup_tsh". The timestamp heuristic works correctly, both 565 when the receiver echoes timestamps as specified by [RFC1323], and by 566 its revision attempts. However, if the receiver arbitrarily echoes 567 timestamps, the heuristic can fail. The heuristic can also fail if a 568 timeout was spurious and returning ACKs are not from retransmitted 569 segments. This can be prevented by detection algorithms such as 570 [RFC3522]. 572 7. Implementation Issues for the Data Receiver 574 [RFC5681] specifies that "Out-of-order data segments SHOULD be 575 acknowledged immediately, in order to accelerate loss recovery." 576 Neal Cardwell has noted that some data receivers do not send an 577 immediate acknowledgement when they send a partial acknowledgment, 578 but instead wait first for their delayed acknowledgement timer to 579 expire [C98]. As [C98] notes, this severely limits the potential 580 benefit of NewReno by delaying the receipt of the partial 581 acknowledgement at the data sender. Echoing RFC 5681, our 582 recommendation is that the data receiver send an immediate 583 acknowledgement for an out-of-order segment, even when that 584 out-of-order segment fills a hole in the buffer. 586 8. Implementation Issues for the Data Sender 588 In Section 3, Step 5 above, it is noted that implementations should 589 take measures to avoid a possible burst of data when leaving Fast 590 Recovery, in case the amount of new data that the sender is eligible 591 to send due to the new value of the congestion window is large. This 592 can arise during NewReno when ACKs are lost or treated as pure window 593 updates, thereby causing the sender to underestimate the number of 594 new segments that can be sent during the recovery procedure. 595 Specifically, bursts can occur when the FlightSize is much less than 596 the new congestion window when exiting from Fast Recovery. One 597 simple mechanism to avoid a burst of data when leaving Fast Recovery 598 is to limit the number of data packets that can be sent in response 599 to a single acknowledgment. (This is known as "maxburst_" in the ns 600 simulator.) Other possible mechanisms for avoiding bursts include 601 rate-based pacing, or setting the slow-start threshold to the 602 resultant congestion window and then resetting the congestion window 603 to FlightSize. A recommendation on the general mechanism to avoid 604 excessively bursty sending patterns is outside the scope of this 605 document. 607 An implementation may want to use a separate flag to record whether 608 or not it is presently in the Fast Recovery procedure. The use of 609 the value of the duplicate acknowledgment counter for this purpose is 610 not reliable because it can be reset upon window updates and 611 out-of-order acknowledgments. 613 When updating the Cumulative Acknowledgement field outside of 614 Fast Recovery, the "recover" state variable may also need to be 615 updated in order to continue to permit possible entry into Fast 616 Recovery (Section 3, step 1). This issue arises when an update 617 of the Cumulative Acknowledgement field results in a sequence 618 wraparound that affects the ordering between the Cumulative 619 Acknowledgement field and the "recover" state variable. Entry 620 into Fast Recovery is only possible when the Cumulative 621 Acknowledgment field covers more than the "recover" state variable. 623 It is important for the sender to respond correctly to duplicate ACKs 624 received when the sender is no longer in Fast Recovery (e.g., because 625 of a Retransmit Timeout). The Limited Transmit procedure [RFC3042] 626 describes possible responses to the first and second duplicate 627 acknowledgements. When three or more duplicate acknowledgements are 628 received, the Cumulative Acknowledgement field doesn't cover more 629 than "recover", and a new Fast Recovery is not invoked, it is 630 important that the sender not execute the Fast Recovery steps (3) and 631 (4) in Section 3. Otherwise, the sender could end up in a chain of 632 spurious timeouts. We mention this only because several NewReno 633 implementations had this bug, including the implementation in the NS 634 simulator. (This bug in the NS simulator was fixed in July 2003, 635 with the variable "exitFastRetrans_".) 637 It has been observed that some TCP implementations enter a slow start 638 or congestion avoidance window updating algorithm immediately after 639 the cwnd is set by the equation found in (Section 3, step 5), even 640 without a new external event generating the cwnd change. Note that 641 after cwnd is set based on the procedure for exiting Fast Recovery 642 (Section 3, step 5), cwnd SHOULD NOT be updated until a further 643 event occurs (e.g., arrival of an ack, or timeout) after this 644 adjustment. 646 9. Simulations 648 Simulations with NewReno are illustrated with the validation test 649 "tcl/test/test-all-newreno" in the NS simulator. The command 650 "../../ns test-suite-newreno.tcl reno" shows a simulation with Reno 651 TCP, illustrating the data sender's lack of response to a partial 652 acknowledgement. In contrast, the command "../../ns 653 test-suite-newreno.tcl newreno_B" shows a simulation with the same 654 scenario using the NewReno algorithms described in this paper. 656 10. Comparisons between Reno and NewReno TCP 657 As we stated in the introduction, we believe that the NewReno 658 modification described in this document improves the performance of 659 the Fast Retransmit and Fast Recovery algorithms of Reno TCP in a 660 wide variety of scenarios. This has been discussed in some depth in 661 [FF96], which illustrates Reno TCP's poor performance when multiple 662 packets are dropped from a window of data and also illustrates 663 NewReno TCP's good performance in that scenario. 665 We do, however, know of one scenario where Reno TCP gives better 666 performance than NewReno TCP, that we describe here for the sake of 667 completeness. Consider a scenario with no packet loss, but with 668 sufficient reordering so that the TCP sender receives three duplicate 669 acknowledgements. This will trigger the Fast Retransmit and Fast 670 Recovery algorithms. With Reno TCP or with Sack TCP, this will 671 result in the unnecessary retransmission of a single packet, combined 672 with a halving of the congestion window (shown on pages 4 and 6 of 673 [F03]). With NewReno TCP, however, this reordering will also result 674 in the unnecessary retransmission of an entire window of data (shown 675 on page 5 of [F03]). 677 While Reno TCP performs better than NewReno TCP in the presence of 678 reordering, NewReno's superior performance in the presence of 679 multiple packet drops generally outweighs its less optimal 680 performance in the presence of reordering. (Sack TCP is the 681 preferred solution, with good performance in both scenarios.) This 682 document recommends the Fast Retransmit and Fast Recovery algorithms 683 of NewReno TCP instead of those of Reno TCP for those TCP connections 684 that do not support SACK. We would also note that NewReno's Fast 685 Retransmit and Fast Recovery mechanisms are widely deployed in TCP 686 implementations in the Internet today, as documented in [PF01]. For 687 example, tests of TCP implementations in several thousand web servers 688 in 2001 showed that for those TCP connections where the web browser 689 was not SACK-capable, more web servers used the Fast Retransmit and 690 Fast Recovery algorithms of NewReno than those of Reno or Tahoe TCP 691 [PF01]. 693 11. Changes Relative to RFC 2582 695 The purpose of this document is to advance the NewReno's Fast 696 Retransmit and Fast Recovery algorithms in RFC 2582 to Standards Track. 698 The main change in this document relative to RFC 2582 is to specify 699 the Careful variant of NewReno's Fast Retransmit and Fast Recovery 700 algorithms. The base algorithm described in RFC 2582 did not attempt 701 to avoid unnecessary multiple Fast Retransmits that can occur after a 702 timeout (described in more detail in the section above). However, 703 RFC 2582 also defined "Careful" and "Less Careful" variants that 704 avoid these unnecessary Fast Retransmits, and recommended the Careful 705 variant. This document specifies the previously-named "Careful" 706 variant as the basic version of NewReno. As described below, this 707 algorithm uses a variable "recover", whose initial value is the send 708 sequence number. 710 The algorithm specified in Section 3 checks whether the 711 acknowledgement field of a partial acknowledgement covers *more* than 712 "recover", as defined in Section 3. Another possible variant would be 713 to simply require that the acknowledgement field covers *more than or 714 equal to* "recover" before initiating another Fast Retransmit. We 715 called this the Less Careful variant in RFC 2582. 717 There are two separate scenarios in which the TCP sender could 718 receive three duplicate acknowledgements acknowledging "recover" but 719 no more than "recover". One scenario would be that the data sender 720 transmitted four packets with sequence numbers higher than "recover", 721 that the first packet was dropped in the network, and the following 722 three packets triggered three duplicate acknowledgements 723 acknowledging "recover". The second scenario would be that the 724 sender unnecessarily retransmitted three packets below "recover", and 725 that these three packets triggered three duplicate acknowledgements 726 acknowledging "recover". In the absence of SACK, the TCP sender is 727 unable to distinguish between these two scenarios. 729 For the Careful variant of Fast Retransmit, the data sender would 730 have to wait for a retransmit timeout in the first scenario, but 731 would not have an unnecessary Fast Retransmit in the second 732 scenario. For the Less Careful variant to Fast Retransmit, the data 733 sender would Fast Retransmit as desired in the first scenario, and would 734 unnecessarily Fast Retransmit in the second scenario. This document 735 only specifies the Careful variant in Section 3. Unnecessary Fast 736 Retransmits with the Less Careful variant in scenarios with 737 reordering are illustrated in page 8 of [F03]. 739 The document also specifies two heuristics that the TCP sender MAY 740 use to decide to invoke Fast Retransmit even when the three duplicate 741 acknowledgements do not cover more than "recover". These heuristics, 742 an ACK-based heuristic and a timestamp heuristic, are described in 743 Sections 6.1 and 6.2 respectively. 745 12. Changes Relative to RFC 3782 747 In [RFC3782], the cwnd after Full ACK reception will be set to 748 (1) min (ssthresh, FlightSize + SMSS) or (2) ssthresh. However, 749 there is a risk in the first logic which results in performance 750 degradation. With the first logic, if FlightSize is zero, the result 751 will be 1 SMSS. This means TCP can transmit only 1 segment at this 752 moment, which can cause delay in ACK transmission at receiver due to 753 delayed ACK algorithm. 755 The FlightSize on Full ACK reception can be zero in some situations. 756 A typical example is where sending window size during fast recovery is 757 small. In this case, the retransmitted packet and new data packets can 758 be transmitted within a short interval. If all these packets 759 successfully arrive, the receiver may generate a Full ACK that 760 acknowledges all outstanding data. Even if window size is not small, 761 loss of ACK packets or receive buffer shortage during fast recovery can 762 also increase the possibility to fall into this situation. 764 The proposed fix in this document ensures that sender TCP transmits at 765 least two segments on Full ACK reception. 767 In addition, errata for RFC3782 (editorial clarification to Section 8) 768 has been applied. 770 13. Conclusions 772 This document specifies the NewReno Fast Retransmit and Fast Recovery 773 algorithms for TCP. This NewReno modification to TCP can even be 774 important for TCP implementations that support the SACK option, 775 because the SACK option can only be used for TCP connections when 776 both TCP end-nodes support the SACK option. NewReno performs better 777 than Reno (RFC 5681) in a number of scenarios discussed herein. 779 A number of options to the basic algorithm presented in Section 3 are 780 also described. These include the handling of the retransmission 781 timer (Section 4), the response to partial acknowledgments (Section 782 5), and the value of the congestion window when leaving Fast Recovery 783 (section 3, step 5). Our belief is that the differences between 784 these variants of NewReno are small compared to the differences 785 between Reno and NewReno. That is, the important thing is to 786 implement NewReno instead of Reno, for a TCP connection without SACK; 787 it is less important exactly which of the variants of NewReno is 788 implemented. 790 14. Security Considerations 792 RFC 5681 discusses general security considerations concerning TCP 793 congestion control. This document describes a specific algorithm 794 that conforms with the congestion control requirements of RFC 5681, 795 and so those considerations apply to this algorithm, too. There are 796 no known additional security concerns for this specific algorithm. 798 15. IANA Considerations 800 This document has no actions for IANA. 802 16. Acknowledgements 804 Many thanks to Anil Agarwal, Mark Allman, Armando Caro, Jeffrey Hsu, 805 Vern Paxson, Kacheong Poon, Keyur Shah, and Bernie Volz for detailed 806 feedback on this document or on its precursor, RFC 2582. Jeffrey 807 Hsu provided clarifications on the handling of the recover variable 808 that were applied to RFC 3782 as errata, and now are in Section 8 809 of this document. Yoshifumi Nishida contributed a modification 810 to the fast recovery algorithm to account for the case in which 811 flightsize is 0 when the TCP sender leaves fast recovery, and the 812 TCP receiver uses delayed acknowledgments. 814 17. References 816 17.1. Normative References 818 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 819 Requirement Levels", BCP 14, RFC 2119, March 1997. 821 [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission 822 Timer", RFC 2988, November 2000. 824 [RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion 825 Control", RFC 5681, September 2009. 827 17.2. Informative References 829 [C98] Cardwell, N., "delayed ACKs for retransmitted packets: ouch!". 830 November 1998, Email to the tcpimpl mailing list, Message-ID 831 "Pine.LNX.4.02A.9811021421340.26785-100000@sake.cs.washington.edu", 832 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 834 [F98] Floyd, S., Revisions to RFC 2001, "Presentation to the TCPIMPL 835 Working Group", August 1998. URLs 836 "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.ps" and 837 "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.pdf". 839 [F03] Floyd, S., "Moving NewReno from Experimental to Proposed 840 Standard? Presentation to the TSVWG Working Group", March 2003. 841 URLs "http://www.icir.org/floyd/talks/newreno-Mar03.ps" and 842 "http://www.icir.org/floyd/talks/newreno-Mar03.pdf". 844 [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of Tahoe, 845 Reno and SACK TCP", Computer Communication Review, July 1996. URL 846 "ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z". 848 [F94] Floyd, S., "TCP and Successive Fast Retransmits", Technical 849 report, October 1994. URL 850 "ftp://ftp.ee.lbl.gov/papers/fastretrans.ps". 852 [GF04] Gurtov, A. and S. Floyd, "Resolving Acknowledgment Ambiguity 853 in non-SACK TCP", Next Generation Teletraffic and 854 Wired/Wireless Advanced Networking (NEW2AN'04), February 855 2004. URL "http://www.cs.helsinki.fi/u/gurtov/papers/ 856 heuristics.html". 858 [Gur03] Gurtov, A., "[Tsvwg] resolving the problem of unnecessary fast 859 retransmits in go-back-N", email to the tsvwg mailing list, message 860 ID <3F25B467.9020609@cs.helsinki.fi>, July 28, 2003. URL 861 "http://www1.ietf.org/mail-archive/working-groups/tsvwg/current/msg04334.html". 863 [Hen98] Henderson, T., Re: NewReno and the 2001 Revision. September 864 1998. Email to the tcpimpl mailing list, Message ID 865 "Pine.BSI.3.95.980923224136.26134A-100000@raptor.CS.Berkeley.EDU", 866 archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". 868 [Hoe95] Hoe, J., "Startup Dynamics of TCP's Congestion Control and 869 Avoidance Schemes", Master's Thesis, MIT, 1995. 871 [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion 872 Control Scheme for TCP", ACM SIGCOMM, August 1996. URL 873 "http://www.acm.org/sigcomm/sigcomm96/program.html". 875 [LM97] Lin, D. and R. Morris, "Dynamics of Random Early Detection", 876 SIGCOMM 97, September 1997. URL 877 "http://www.acm.org/sigcomm/sigcomm97/program.html". 879 [NS] The Network Simulator (NS). URL "http://www.isi.edu/nsnam/ns/". 881 [PF01] Padhye, J. and S. Floyd, "Identifying the TCP Behavior of Web 882 Servers", June 2001, SIGCOMM 2001. 884 [RFC1323] Jacobson, V., Braden, R. and D. Borman, "TCP Extensions for 885 High Performance", RFC 1323, May 1992. 887 [RFC2582] Floyd, S. and T. Henderson, "The NewReno Modification to 888 TCP's Fast Recovery Algorithm", RFC 2582, April 1999. 890 [RFC2883] Floyd, S., J. Mahdavi, M. Mathis, and M. Podolsky, "The 891 Selective Acknowledgment (SACK) Option for TCP, RFC 2883, July 2000. 893 [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing TCP's 894 Loss Recovery Using Limited Transmit", RFC 3042, January 2001. 896 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for 897 TCP", RFC 3522, April 2003. 899 [RFC3782] Floyd, S., T. Henderson, and A. Gurtov, "The NewReno 900 Modification to TCP's Fast Recovery Algorithm", RFC 3782, April 2004. 902 Appendix A. Document Revision History 904 To be removed upon publication 906 +----------+--------------------------------------------------+ 907 | Revision | Comments | 908 +----------+--------------------------------------------------+ 909 | draft-00 | RFC3782 errata applied, and changes applied from | 910 | | draft-nishida-newreno-modification-02 | 911 +----------+--------------------------------------------------+ 913 Authors' Addresses 915 Thomas R. Henderson 916 The Boeing Company 917 P.O. Box 3707 918 Seattle, WA 98124 920 EMail: thomas.r.henderson@boeing.com 922 Sally Floyd 923 International Computer Science Institute 925 Phone: +1 (510) 666-2989 926 EMail: floyd@acm.org 927 URL: http://www.icir.org/floyd/ 929 Andrei Gurtov 930 HIIT 931 Helsinki Institute for Information Technology 932 Aalto University 933 P.O. Box 19800 934 Helsinki FIN-00076 935 FINLAND 937 EMail: gurtov@hiit.fi 939 Yoshifumi Nishida 940 WIDE Project 941 Endo 5322 942 Fujisawa, Kanagawa 252-8520 943 Japan 944 Email: nishida@wide.ad.jp