Internet Engineering Task Force                              Mark Allman
INTERNET DRAFT                                                      ICIR
File: draft-allman-tcp-early-rexmt-03.txt         Konstantin Avrachenkov
                                                                   INRIA
                                                            Urtzi Ayesta
                                                      France Telecom R&D
                                                            Josh Blanton
                                                         Ohio University
                                                          December, 2003
                                                     Expires: June, 2004

                   Early Retransmit for TCP and SCTP

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of [RFC2026].

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document proposes a new mechanism for TCP and SCTP that can be
used to more effectively recover lost segments when a connection's
congestion window is small. The "Early Retransmit" mechanism allows
the transport to reduce, in certain special circumstances, the
number of duplicate acknowledgments required to trigger a fast
retransmission. This allows the transport to use fast retransmit to
recover packet losses that would otherwise require a lengthy
retransmission timeout.

Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

1 Introduction

A number of researchers have pointed out that the loss recovery
strategies employed by TCP [RFC793] and SCTP [RFC2960] do not work
well when the amount of outstanding data a TCP sender has injected
into the network is small. This can happen in a number of
situations, such as:

(1) The connection is "application limited" and has only a limited
    amount of data to send. This can happen any time the
    application does not produce enough data to fill the congestion
    window.
    A particular case when all connections become
    application limited is as the connection ends.

(2) The connection is limited by the receiver-advertised window.

(3) The connection is constrained by end-to-end congestion control
    when the connection's share of the path is small, the path has a
    small bandwidth-delay product or the transport is ascertaining
    the available bandwidth in the first few round-trip times of
    slow start.

Many researchers have studied problems with TCP when the congestion
window is small and have outlined possible mechanisms to mitigate
these problems [Mor97,BPS+98,Bal98,LK98,RFC3150,AA02]. SCTP's loss
recovery and congestion control mechanisms are based on TCP and
therefore the same problems impact the performance of SCTP
connections. When the transport detects a missing segment, the
connection enters a loss recovery phase using one of two methods.
First, if an acknowledgment (ACK) for a given segment is not
received in a certain amount of time a retransmission timer fires
and the segment is resent [RFC2988]. Second, the "Fast
Retransmit" algorithm resends a segment when three duplicate ACKs
arrive at the sender [Jac88,RFC2581]. However, because duplicate
ACKs from the receiver are also triggered by packet reordering in
the Internet, the sender waits for three duplicate ACKs in an
attempt to disambiguate segment loss from packet reordering. When
using small windows it may not be possible to generate the required
number of duplicate ACKs to trigger Fast Retransmit when a loss does
happen.

Once in a loss recovery phase, a number of techniques can be used to
retransmit lost segments. TCP can use slow start based recovery or
Fast Recovery [RFC2581], NewReno [RFC2582], and loss recovery based
on selective acknowledgments (SACKs) [RFC2018,FF96,RFC3517]. SCTP's
loss recovery is not as varied due to the built-in selective
acknowledgments.

The transport's retransmission timeout (RTO) is based on measured
round-trip times (RTT) between the sender and receiver, as specified
in [RFC2988] (for TCP) and [RFC2960] (for SCTP). To prevent
spurious retransmissions of segments that are only delayed and not
lost, the minimum RTO is conservatively chosen to be 1 second.
Therefore, it behooves TCP senders to detect and recover from as
many losses as possible without incurring a lengthy timeout during
which the connection remains idle. However, if not enough duplicate
ACKs arrive from the receiver, the Fast Retransmit algorithm is
never triggered. This situation occurs when the congestion window
is small, if a large number of segments in a window are lost, or at
the end of a transfer as data drains from the network. For
instance, consider a congestion window (cwnd) of three segments. If
one segment is dropped by the network, then at most two duplicate
ACKs will arrive at the sender, assuming no ACK loss. Since three
duplicate ACKs are required to trigger Fast Retransmit, a timeout
will be required to resend the dropped packet.

[BPS+98] shows that roughly 56% of retransmissions sent by a busy
web server are sent after the RTO timer expires, while only 44% are
handled by Fast Retransmit. In addition, only 4% of the RTO
timer-based retransmissions could have been avoided with SACK, which
has to continue to disambiguate reordering from genuine
loss. Furthermore, [All00] shows that for one particular web server
the median transfer size is less than four segments, indicating that
more than half of the connections will be forced to rely on the RTO
timer to recover from any losses that occur. Thus, loss recovery
without relying on the conservative RTO is beneficial for short TCP
transfers.
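
The three-segment window example earlier in this section can be
checked with a short sketch. The Python below is illustrative only
and is not part of the proposal: the function names and the loss
model are assumptions made for this example (no ACK loss, no
reordering, no new data sent while waiting, and every delivered
segment arriving after the first loss). It computes an upper bound on
the duplicate ACKs one flight can generate and compares that bound
with the standard threshold of three.

```python
DUPACK_THRESHOLD = 3  # duplicate ACKs needed by standard Fast Retransmit


def max_duplicate_acks(flight_segments: int, lost_segments: int = 1) -> int:
    """Upper bound on duplicate ACKs produced by one flight of segments.

    Assumptions (for illustration only): no ACK loss, no reordering, the
    sender transmits nothing new while waiting, and every segment that
    reaches the receiver arrives after the first loss, so each one
    elicits a duplicate ACK.
    """
    if not 0 <= lost_segments <= flight_segments:
        raise ValueError("lost_segments must be between 0 and flight_segments")
    if lost_segments == 0:
        return 0  # nothing missing, so no duplicate ACKs are generated
    return flight_segments - lost_segments


def fast_retransmit_possible(flight_segments: int, lost_segments: int = 1) -> bool:
    """True if the flight can yield at least DUPACK_THRESHOLD duplicate ACKs."""
    return max_duplicate_acks(flight_segments, lost_segments) >= DUPACK_THRESHOLD


if __name__ == "__main__":
    for flight in range(1, 6):
        dupacks = max_duplicate_acks(flight)
        outcome = "fast retransmit" if fast_retransmit_possible(flight) else "RTO"
        print(f"flight={flight} max_dupacks={dupacks} -> {outcome}")
```

Under these assumptions a flight of three segments with one loss
yields at most two duplicate ACKs, so the sender must fall back on the
retransmission timer. A flight of four segments is the smallest that
can reach the threshold after a single loss.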

The Limited Transmit mechanism introduced in [RFC3042] allows a TCP
sender to transmit previously unsent data upon the reception of each
of the two duplicate ACKs that precede a fast retransmit. SCTP
[RFC2960] uses SACK information to calculate the number of
outstanding segments in the network. Hence, when the first two
duplicate ACKs arrive at the sender they will indicate that data has
left the network and allow the sender to transmit new data (if
available) similar to TCP's Limited Transmit algorithm.

By sending these two new segments the TCP sender is attempting to
induce additional duplicate ACKs (if appropriate) so that Fast
Retransmit will be triggered before the retransmission timeout
expires. The "Early Retransmit" mechanism outlined in this document
covers the case when previously unsent data is not available for
transmission.

Section 2 of this document outlines a small change to TCP and SCTP
senders that will decrease the reliance on the retransmission timer,
and thereby improve performance when Fast Retransmit cannot
otherwise be triggered. Section 3 discusses related work. Section
4 sketches security issues.

2 Reduction of the Retransmission Threshold

The Early Retransmit algorithm calls for lowering the threshold for
triggering Fast Retransmit when the amount of outstanding data is
small and when no unsent data segments are enqueued. We define
variants of Early Retransmit for connections that do and do not
support selective acknowledgments (SACK) [RFC2018]. (Note: SCTP
includes SACK in the base protocol and so there is no need for the
non-SACK variant of Early Retransmit in SCTP.)

If the following two conditions hold the sender can use Early
Retransmit (regardless of SACK support).

(2.a) The amount of outstanding data (ownd) is less than 4*SMSS
      bytes.

(2.b) There is either no unsent data ready for transmission at the
      sender or the advertised window does not permit new segments to
      be transmitted.

When the above two conditions hold and the connection does not
support SACK, the duplicate ACK threshold used to trigger Fast
Retransmit MAY be reduced to:

    ER_thresh = ceiling (ownd/SMSS) - 1                          (1)

duplicate ACKs, where ownd is in terms of bytes.

When conditions (2.a) and (2.b) hold and the connection does support
SACK, Fast Retransmit MAY be used when ownd - SMSS bytes have been
SACKed.

In other words, when ownd is small enough that losing one segment
would not trigger Fast Retransmit, the trigger for Fast Retransmit
is reduced to receiving indications that all but one segment have
arrived at the receiver. This mitigation is less robust in the face
of reordered segments than the standard Fast Retransmit threshold.
Research shows that a general reduction in the number of duplicate
ACKs required to trigger fast retransmission of a segment to two
(rather than three) leads to a reduction in the ratio of good to bad
retransmits by a factor of three [Pax97]. However, this analysis
did not include the additional conditioning on the event that the
ownd was smaller than 4 segments.

The SACK variant of the Early Retransmit algorithm is preferred to
the non-SACK variant due to its robustness in the face of ACK loss
(since SACKs are sent redundantly) and due to interactions with the
delayed ACK timer. Consider a flight of three segments, S1...S3,
with S2 being dropped by the network. When S1 arrives it is
in-order and so the receiver may or may not delay the ACK, leading
to two scenarios:

(A) The ACK for S1 is delayed. In this case the arrival of S3 will
    trigger an ACK to be transmitted covering segment S1 (which was
    previously unacknowledged).
    In this case Early Retransmit
    without SACK will not prevent an RTO because no duplicate ACKs
    will arrive. However, with SACK the ACK for S1 will also
    include SACK information indicating that S3 has arrived at the
    receiver. The sender can then invoke Fast Retransmit on this
    ACK because ownd - SMSS bytes have been SACKed when the ACK
    arrives.

(B) The ACK for S1 is not delayed. In this case the arrival of S1
    triggers an ACK and the arrival of S3 triggers a second ACK
    (because it is out-of-order). Both ACKs will cover the same
    segment (S1). Therefore, regardless of whether SACK is used,
    Early Retransmit can be performed by the sender (assuming no ACK
    loss).

We note two "worst case" scenarios for Early Retransmit:

(1) Persistent reordering of segments, coupled with an application
    that does not constantly send data, can result in large numbers
    of needless retransmissions when using Early Retransmit. For
    instance, consider an application that sends data two segments
    at a time, followed by an idle period when no data is queued for
    delivery by TCP. If the network consistently reorders the two
    segments, the sender will needlessly retransmit one out of every
    two unique segments transmitted (and one-third of all segments)
    when using the above algorithm. However, this would only be a
    problem for long-lived connections from applications that
    transmit in spurts.

(2) Similar to the above, consider the case of two-segment transfers
    that always experience reordering. Just as in (1) above, one
    out of every two unique data segments will be retransmitted
    needlessly, therefore one-third of the traffic will be spurious.

Currently this document offers no suggestion on how to mitigate the
above problems.
Rather, the authors believe that the community's
consensus is that Early Retransmit is scoped enough that the worst
case problems are pathological and do not need mitigation at this
time. However, Appendix A offers a survey of possible mitigations.

3 Related Work

Deployment of Explicit Congestion Notification (ECN) [Flo94,RFC3168]
may benefit connections with small congestion window sizes
[RFC2884]. ECN provides a method for indicating congestion to the
end-host without dropping segments. While some segment drops may
still occur, ECN may allow TCP to perform better with small cwnd
sizes because the sender will be required to detect less segment
loss [RFC2884].

[Bal98] outlines another solution to the problem of having no new
segments to transmit into the network when the first two duplicate
ACKs arrive. In response to these duplicate ACKs, a TCP sender
transmits zero-byte segments to induce additional duplicate ACKs.
This method preserves the robustness of the standard Fast Retransmit
algorithm at the cost of injecting segments into the network that do
not deliver any data (and therefore potentially waste network
resources).

4 Security Considerations

The security considerations found in [RFC2581] apply to this
document. No additional security problems have been identified with
Early Retransmit at this time.

Acknowledgments

We thank Sally Floyd for her feedback in discussions about Early
Retransmit. We also thank Sally Floyd and Hari Balakrishnan who
helped with a large portion of the text of this document when it was
part of a separate document. Armando Caro and many members of the
tsvwg mailing list provided good discussions that helped shape this
document.

Normative References

[RFC793] Jon Postel. Transmission Control Protocol. Std 7, RFC
    793. September 1981.

[RFC2018] Matt Mathis, Jamshid Mahdavi, Sally Floyd, Allyn Romanow.
    TCP Selective Acknowledgement Options. RFC 2018, October 1996.

[RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens. TCP
    Congestion Control. RFC 2581, April 1999.

[RFC2883] Sally Floyd, Jamshid Mahdavi, Matt Mathis, Matt Podolsky.
    An Extension to the Selective Acknowledgement (SACK) Option for
    TCP. RFC 2883, July 2000.

[RFC2960] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H.
    Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, V.
    Paxson. Stream Control Transmission Protocol. RFC 2960,
    October 2000.

[RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission
    Timer. RFC 2988, April 2000.

[RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd. Enhancing
    TCP's Loss Recovery Using Limited Transmit. RFC 3042, January
    2001.

[RFC3522] Reiner Ludwig, Michael Meyer. The Eifel Detection
    Algorithm for TCP. RFC 3522, April 2003.

Informative References

[AA02] Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the
    Initial Window Size and Limited Transmit Algorithm on the
    Transient Behavior of TCP Transfers", In Proc. of the 15th ITC
    Internet Specialist Seminar, Wurzburg, July 2002.

[All00] Mark Allman. A Server-Side View of WWW Characteristics.
    ACM Computer Communications Review, October 2000.

[Bal98] Hari Balakrishnan. Challenges to Reliable Data Transport
    over Heterogeneous Wireless Networks. Ph.D. Thesis, University
    of California at Berkeley, August 1998.

[BPS+98] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan,
    Mark Stemm, and Randy Katz. TCP Behavior of a Busy Web Server:
    Analysis and Improvements. Proc. IEEE INFOCOM Conf., San
    Francisco, CA, March 1998.

[FF96] Kevin Fall, Sally Floyd. Simulation-based Comparisons of
    Tahoe, Reno, and SACK TCP. ACM Computer Communication Review,
    July 1996.

[Flo94] Sally Floyd. TCP and Explicit Congestion Notification.
    ACM Computer Communication Review, October 1994.

[Jac88] Van Jacobson. Congestion Avoidance and Control. ACM
    SIGCOMM 1988.

[LK98] Dong Lin, H.T. Kung. TCP Fast Recovery Strategies: Analysis
    and Improvements. Proceedings of InfoCom, March 1998.

[Mor97] Robert Morris. TCP Behavior with Many Flows. Proceedings
    of the Fifth IEEE International Conference on Network Protocols.
    October 1997.

[Pax97] Vern Paxson. End-to-End Internet Packet Dynamics. ACM
    SIGCOMM, September 1997.

[RFC2582] Sally Floyd, Tom Henderson. The NewReno Modification to
    TCP's Fast Recovery Algorithm. RFC 2582, April 1999.

[RFC2884] Jamal Hadi Salim and Uvaiz Ahmed. Performance Evaluation
    of Explicit Congestion Notification (ECN) in IP Networks. RFC
    2884, July 2000.

[RFC3150] Spencer Dawkins, Gabriel Montenegro, Markku Kojo, Vincent
    Magret. End-to-end Performance Implications of Slow Links. RFC
    3150, July 2001.

[RFC3168] K. K. Ramakrishnan, Sally Floyd, David Black. The
    Addition of Explicit Congestion Notification (ECN) to IP. RFC
    3168, September 2001.

[RFC3517] Ethan Blanton, Mark Allman, Kevin Fall, Lili Wang. A
    Conservative Selective Acknowledgment (SACK)-based Loss Recovery
    Algorithm for TCP. RFC 3517, April 2003.

Authors' Addresses:

Mark Allman
ICSI Center for Internet Research (ICIR)
1947 Center Street, Suite 600
Berkeley, CA 94704-1198
Phone: 216-243-7361
mallman@icir.org
http://www.icir.org/mallman/

Konstantin Avrachenkov
INRIA
2004 route des Lucioles, B.P.93
06902, Sophia Antipolis
France
Phone: 00 33 492 38 7751
Email: k.avrachenkov@sophia.inria.fr
http://www.inria.fr/mistral/personnel/K.Avrachenkov/moi.html

Urtzi Ayesta
France Telecom R&D
905 rue Albert Einstein
06921 Sophia Antipolis
France
Email: Urtzi.Ayesta@francetelecom.com
http://www.inria.fr/mistral/personnel/Urtzi.Ayesta/me.html

Josh Blanton
Ohio University
301 Stocker Center
Athens, OH 45701
jblanton@irg.cs.ohiou.edu

Appendix A: Research Issues in Adjusting the Duplicate ACK Threshold

Decreasing the number of duplicate ACKs required to trigger Fast
Retransmit, as suggested in section 2, has the drawback of making
Fast Retransmit less robust in the face of minor network reordering.
Two egregious examples of problems caused by reordering are given in
section 2. This appendix outlines several schemes that have been
suggested to mitigate the problems caused to Early Retransmit by
reordering. These methods need further research before they are
suggested for general use (and current consensus is that the cases
that make Early Retransmit unnecessarily retransmit a large amount
of data are pathological and therefore these mitigations are not
generally required).

MITIGATION A.1: Allow a connection to use Early Retransmit as long
as the algorithm is not injecting "too much" spurious data into
the network. For instance, using the information provided by TCP's
DSACK option [RFC2883] or SCTP's Duplicate-TSN notification, a
sender can determine when segments sent via Early Retransmit are
needless.
Likewise, using Eifel [RFC3522] the sender can detect
spurious Early Retransmits. Once spurious Early Retransmits are
detected the sender can either eliminate the use of Early Retransmit
or limit the use of the algorithm to ensure that only an acceptably
small fraction of the connection's transmissions are spurious.

Alternatively, if a sender cannot reliably determine if an Early
Retransmitted segment is spurious or not the sender could simply
limit Early Retransmits either to some fixed number per connection
(e.g., Early Retransmit is allowed only once per connection) or to
some small percentage of the total traffic being transmitted.

MITIGATION A.2: Allow a connection to trigger Early Retransmit using
the criteria given in section 2, in addition to a "small" timeout
[Pax97]. For instance, a sender may have to wait for 2 duplicate
ACKs and then T msec before Early Retransmitting a segment. The
added time gives reordered acknowledgments time to arrive at the
sender and avoid a needless retransmit. Designing a method for
choosing an appropriate timeout is part of the research that would
need to be involved in this scheme.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.
However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.