idnits 2.17.1

draft-hurtig-tcpm-rtorestart-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 22, 2012) is 4197 days in the past.  Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Missing Reference: 'SEG 1' is mentioned on line 164, but not defined

  == Missing Reference: 'SEG 2' is mentioned on line 165, but not defined

  == Missing Reference: 'SEG 3' is mentioned on line 170, but not defined

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)

  == Outdated reference: A later version (-01) exists of
     draft-dukkipati-tcpm-tcp-loss-probe-00

     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

TCP Maintenance and Minor Extensions                           P. Hurtig
(tcpm)                                               Karlstad University
Internet-Draft                                                A. Petlund
Intended status: Experimental              Simula Research Laboratory AS
Expires: April 25, 2013                                         M. Welzl
                                                      University of Oslo
                                                        October 22, 2012

                        TCP and SCTP RTO Restart
                    draft-hurtig-tcpm-rtorestart-03

Abstract

   This document describes a modified algorithm for managing the TCP and
   SCTP retransmission timers that provides faster loss recovery when a
   connection's amount of outstanding data is small.  The modification
   allows the transport to restart its retransmission timer more
   aggressively in situations where fast retransmit cannot be used.
   This enables faster loss detection and recovery for connections that
   are short-lived or application-limited.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 25, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1.  Introduction

   TCP uses two mechanisms to detect segment loss.  First, if a segment
   is not acknowledged within a certain amount of time, a retransmission
   timeout (RTO) occurs, and the segment is retransmitted [RFC6298].
   While the RTO is based on measured round-trip times (RTTs) between
   the sender and receiver, it also has a conservative lower bound of 1
   second to ensure that delayed segments are not mistaken as lost.
   Second, when a sender receives duplicate acknowledgments, the fast
   retransmit algorithm infers segment loss and triggers a
   retransmission.  Duplicate acknowledgments are generated by a
   receiver when out-of-order segments arrive.  As both segment loss and
   segment reordering cause out-of-order arrival, fast retransmit waits
   for three duplicate acknowledgments before considering the segment as
   lost.  In some situations, however, the number of outstanding
   segments is not enough to trigger three duplicate acknowledgments,
   and the sender must rely on lengthy RTOs for loss recovery.

   The number of outstanding segments can be small for several reasons:

   (1)  The connection is limited by the congestion control when the
        path has a low total capacity (bandwidth-delay product) or the
        connection's share of the capacity is small.  It is also limited
        by the congestion control in the first RTTs of a connection or
        after an RTO when the available capacity is probed using
        slow-start.

   (2)  The connection is limited by the receiver's available buffer
        space.

   (3)  The connection is limited by the application if the available
        capacity of the path is not fully utilized (e.g. interactive
        applications), or at the end of a transfer, which is frequent if
        the total amount of data is small (e.g. web traffic).

   The first two situations can occur for any flow, as external factors
   at the network and/or host level cause them.
   The third situation primarily affects flows that are short or have a
   low transmission rate.  Typical examples of applications that produce
   short flows are web servers.  [RJ10] shows that 70% of all web
   objects found at the top 500 sites are too small for fast retransmit
   to work.  [BPS98] shows that about 56% of all retransmissions sent by
   a busy web server are sent after RTO expiry.  Although those
   experiments were conducted without SACK [RFC2018], only 4% of the
   RTO-based retransmissions could have been avoided by using it.
   Applications have a low transmission rate when data is sent in
   response to actions, or as a reaction to real-life events.  Typical
   examples of such applications are stock trading systems, remote
   computer operations and online games.  What is special about this
   class of applications is that they are time-dependent, and extra
   latency can reduce the application service level [P09].  Although
   such applications may represent a small amount of data sent on the
   network, a considerable number of flows have such properties and the
   importance of low latency is high.

   The RTO restart approach outlined in this document makes the RTO
   slightly more aggressive when the number of outstanding segments is
   small, in an attempt to enable faster loss recovery for all segments
   while being robust to reordering.  While it still conforms to the
   requirement in [RFC6298] that segments must not be retransmitted
   earlier than RTO seconds after their original transmission, it could
   increase the chance of a spurious timeout, which could degrade
   performance when the congestion window (cwnd) is large -- for
   example, when an application sends enough data to reach a cwnd
   covering 100 segments and then stops.  The likelihood and potential
   impact of this problem as well as possible mitigation strategies are
   currently under investigation.

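
   As an illustration only, and not as part of the specification, the
   timer adjustment outlined above and specified in Section 3 can be
   sketched in Python.  The function name and its parameters are
   hypothetical, all times are in seconds, and the final clamp to zero
   is an assumption of this sketch for the case where the earliest
   outstanding segment is already older than the RTO.

```python
def restart_timeout(rto, outstanding_segments, has_unsent_data,
                    window_permits_sending, elapsed_since_earliest):
    """Return the number of seconds until the restarted timer expires.

    Hypothetical helper: every name here is illustrative and does not
    come from the draft or from any TCP implementation.
    """
    # Step (1): default to the standard behaviour, a full RTO from now.
    t_earliest = 0.0

    # Step (2): use the modified restart only when fewer than four
    # segments are outstanding AND no new data can be sent, either
    # because none is queued or because the receiver's advertised
    # window does not permit transmission.
    cannot_send_new_data = (not has_unsent_data) or (not window_permits_sending)
    if outstanding_segments < 4 and cannot_send_new_data:
        t_earliest = elapsed_since_earliest

    # Step (3): expire after "RTO - T_earliest" seconds.  Clamping at
    # zero is an assumption of this sketch, not stated in the draft.
    return max(rto - t_earliest, 0.0)


# One segment outstanding, sent 0.25 s ago, nothing more to send:
print(restart_timeout(1.0, 1, False, True, 0.25))
# Ten segments outstanding: the standard full-RTO restart applies.
print(restart_timeout(1.0, 10, False, True, 0.25))
```

   With these illustrative inputs, the first call shortens the timer by
   the age of the earliest outstanding segment, while the second leaves
   the standard restart unchanged because fast retransmit is expected to
   be usable.
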
   While this document focuses on TCP, the described changes are also
   valid for the Stream Control Transmission Protocol (SCTP) [RFC4960],
   which has similar loss recovery and congestion control algorithms.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  RTO Restart Overview

   The RTO management algorithm described in [RFC6298] recommends that
   the retransmission timer be restarted when an acknowledgment (ACK)
   that acknowledges new data is received and there is still outstanding
   data.  The restart is conducted to guarantee that unacknowledged
   segments will be retransmitted after approximately RTO seconds.
   However, because the timer is restarted on each incoming
   acknowledgment, retransmissions are not typically triggered RTO
   seconds after their previous transmission but rather RTO seconds
   after the last ACK arrived.  The duration of this extra delay depends
   on several factors but is in most cases approximately one RTT.
   Hence, in most situations the time before a retransmission is
   triggered is equal to "RTO + RTT".

   The extra delay can be significant, especially for applications that
   use a lower RTOmin than the standard of 1 second and/or in
   environments with high RTTs, e.g. mobile networks.  The restart
   approach is illustrated in Figure 1 where a TCP sender transmits
   three segments to a receiver.  The arrival of the first and second
   segments triggers a delayed ACK [RFC1122], which restarts the RTO
   timer at the sender.  The RTO restart is performed approximately one
   RTT after the transmission of the third segment.  Thus, if the third
   segment is lost, as indicated in Figure 1, the effective loss
   detection time is "RTO + RTT" seconds.
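
   The "RTO + RTT" figure can be checked with a few lines of
   arithmetic.  The sketch below is illustrative only: the helper name
   and the RTO and RTT values are assumptions chosen for the example,
   and it assumes that the ACK which restarts the timer arrives one RTT
   after the lost segment was sent.

```python
def standard_detection_time(rto, rtt):
    """Seconds from sending the lost segment until the RTO fires when
    the timer is restarted on the incoming ACK (standard behaviour).

    Hypothetical helper with illustrative names and inputs.
    """
    ack_arrival = rtt          # ACK for earlier segments arrives ~1 RTT later
    return ack_arrival + rto   # timer is restarted then and fires RTO later


# Assumed example values: the standard 1 s RTOmin and a lower RTOmin,
# both on a path with a 0.25 s RTT.
for rto, rtt in [(1.0, 0.25), (0.5, 0.25)]:
    total = standard_detection_time(rto, rtt)
    extra = 100.0 * rtt / rto
    print(f"RTO={rto:.2f}s RTT={rtt:.2f}s -> loss detected after "
          f"{total:.2f}s ({extra:.0f}% above RTO)")
```

   The relative cost of the extra RTT grows as the RTO shrinks, which is
   why the delay matters most for applications that use a low RTOmin.
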
   In some situations, the effective loss detection time becomes even
   longer.  Consider a scenario where only two segments are
   outstanding.  If the second segment is lost, the time to expire the
   delayed ACK timer will also be included in the effective loss
   detection time.

      Sender                               Receiver
                    ...
      DATA [SEG 1] ----------------------> (ack delayed)
      DATA [SEG 2] ----------------------> (send ack)
      DATA [SEG 3] ----X         /-------- ACK
      (restart RTO)  <----------/
                    ...
      (RTO expiry)
      DATA [SEG 3] ---------------------->

                   Figure 1: RTO restart example

   During normal TCP bulk transfer the current RTO restart approach is
   not a problem.  Actually, as long as enough segments arrive at a
   receiver to enable fast retransmit, RTO-based loss recovery should be
   avoided.  RTOs should only be used as a last resort, as they
   drastically lower the congestion window compared to fast retransmit,
   and the current approach can therefore be beneficial -- [EL04]
   describes it as a "safety margin" that compensates for some of the
   problems that its authors identified with the standard RTO
   calculation.  Notably, the authors of [EL04] also state that "this
   safety margin does not exist for highly interactive applications
   where often only a single packet is in flight."

   There are only a few situations where timeouts are appropriate, or
   the only choice.  For example, if the network is severely congested
   and no segments arrive, RTO-based recovery should be used.  In this
   situation, the time to recover from the loss(es) will not be the
   performance bottleneck.  Furthermore, for connections that do not
   utilize enough capacity to enable fast retransmit, RTO is the only
   choice.  The time needed for loss detection in such scenarios can
   become a serious performance bottleneck.

3.  RTO Restart Algorithm

   To enable faster loss recovery for connections that are unable to use
   fast retransmit, an alternative RTO restart can be used.  Resetting
   the timer to "RTO - T_earliest", where T_earliest is the time elapsed
   since the earliest outstanding segment was transmitted, ensures that
   retransmissions always occur after exactly RTO seconds.  This
   approach makes the RTO more aggressive than the standardized approach
   in [RFC6298] but still conforms to the requirement in [RFC6298] that
   segments must not be retransmitted earlier than RTO seconds after
   their original transmission.

   This document specifies the following update of step 5.3 in Section 5
   of [RFC6298] (and a similar update in Section 6.3.2 of [RFC4960] for
   SCTP):

   When an ACK is received that acknowledges new data:

   (1)  Set T_earliest = 0.

   (2)  If the following two conditions hold:

        (a)  The number of outstanding segments is less than four.

        (b)  There is no unsent data ready for transmission or the
             receiver's advertised window does not permit
             transmission.

        set T_earliest to the time elapsed since the earliest
        outstanding segment was sent.

   (3)  Restart the retransmission timer so that it will expire after
        "RTO - T_earliest" seconds (for the current value of RTO).

   The update requires TCP implementations to track the time elapsed
   since the transmission of the earliest outstanding segment
   (T_earliest).  As the alternative restart is used only when the
   number of outstanding segments is less than four, only four segments
   need to be tracked.  Furthermore, some implementations of TCP (e.g.
   Linux TCP) already track the transmission times of all segments.

4.  Discussion

   The currently standardized algorithm has been shown to add at least
   one RTT to the loss recovery process in TCP [LS00] and SCTP
   [HB08][PBP09].  Applications that have strict timing requirements
   (e.g. telephony signaling and gaming) rather than throughput
   requirements may want to use a lower RTOmin than the standard of 1
   second [RFC4166].  For such applications the modified restart
   approach could be important as the RTT and also the delayed ACK timer
   of receivers will be large components of the effective loss recovery
   time.  Measurements in [HB08] have shown that the total transfer time
   of a lost segment (including the original transmission time and the
   loss recovery time) can be reduced by up to 35% using the suggested
   approach.  These results match those presented in [PGH06][PBP09],
   where the modified restart approach is shown to significantly reduce
   retransmission latency.

   There are several proposals that address the problem of not having
   enough ACKs for loss recovery.  In what follows, we explain why the
   mechanism described here is complementary to these approaches:

   The limited transmit mechanism [RFC3042] allows a TCP sender to
   transmit a previously unsent segment for each of the first two
   duplicate acknowledgments.  By transmitting new segments, the sender
   attempts to generate additional duplicate acknowledgments to enable
   fast retransmit.  However, limited transmit does not help if no
   previously unsent data is ready for transmission or if the receiver
   is out of buffer space.  [RFC5827] specifies an early retransmit
   algorithm to enable fast loss recovery in such situations.  It
   dynamically lowers the number of duplicate acknowledgments needed
   for fast retransmit (dupthresh) based on the number of outstanding
   segments, so that fewer duplicate acknowledgments are needed to
   trigger a retransmission.  In some situations, however, the algorithm
   is of no use or might not work properly.  First, if a single segment
   is outstanding, and lost, it is impossible to use early retransmit.
   Second, if ACKs are lost, early retransmit cannot help.
   Third, if the network path reorders segments, the algorithm might
   cause more unnecessary retransmissions than fast retransmit.

   TCP-NCR [RFC4653] sets the dupthresh to three or more, to better
   disambiguate reordered and lost segments.  In addition, early
   retransmit lowers the dupthresh when the amount of outstanding data
   is small, to enable faster loss recovery.  The reasons why the RTO
   restart procedure described in this document does not take dynamic
   dupthresh considerations into account are twofold.  First, if a
   larger dupthresh is used, the RTO restart approach could be used when
   the congestion window, and the amount of outstanding data, is larger.
   However, in such situations the actual amount of outstanding data can
   significantly impact the RTT of the connection, making it potentially
   dangerous to be more aggressive.  Second, if a smaller dupthresh is
   used, the amount of outstanding data needed for a restart is smaller.
   However, as the congestion window is already small, it does not
   matter if a retransmission is due to a fast retransmit or an RTO.
   The resulting congestion window will still be very small, and the
   only difference is how quickly TCP infers segment loss.

   Tail Loss Probe [TLP] is a proposal to send up to two "probe
   segments" when a timer fires that is set to a value smaller than the
   RTO.  A "probe segment" is a new segment if new data is available,
   else a retransmission.  The intention is to compensate for sluggish
   RTO behavior in situations where the RTO greatly exceeds the RTT,
   which, according to measurements reported in [TLP], is not uncommon.
   The Probe timeout (PTO) is at least 2 RTTs, and is scheduled only if
   the RTO is farther away than the PTO.  A spurious PTO is less risky
   than a spurious RTO, as it would not have the same negative effects
   (clearing the scoreboard and restarting with slow-start).
   In contrast, RTO restart is trying to make the RTO more appropriate
   in cases where there is no need to be overly cautious.

   TLP could take effect in situations where RTO restart does not apply,
   and it could override RTO restart (yielding a similar general
   behavior, but with a lower timeout) in cases where the number of
   outstanding segments is smaller than 4 and no new segments are
   available for transmission.  The shorter RTO from RTO restart also
   reduces the probability that TLP is activated, because the PTO might
   be farther away than the RTO.  This could make RTO restart more
   aggressive than the algorithm in [TLP] when:

   (1)  no data has been sent in an interval exceeding the RTO

   (2)  the number of outstanding segments is 3

   (3)  (defined in [RFC5681]) is at least 3

   because, under these conditions, in accordance with [RFC5681], 3
   packets can immediately be retransmitted, whereas TLP only allows up
   to two consecutive PTOs.

5.  IANA Considerations

   This memo includes no request to IANA.

6.  Security Considerations

   This document discusses a change in how to set the retransmission
   timer's value when restarted.  This change does not raise any new
   security issues with TCP or SCTP.

7.  References

7.1.  Normative References

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
              January 2001.

   [RFC4166]  Coene, L. and J. Pastor-Balbas, "Telephony Signalling
              Transport over Stream Control Transmission Protocol (SCTP)
              Applicability Statement", RFC 4166, February 2006.

   [RFC4653]  Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton,
              "Improving the Robustness of TCP to Non-Congestion
              Events", RFC 4653, August 2006.

   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol",
              RFC 4960, September 2007.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
              P. Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, May 2010.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              June 2011.

7.2.  Informative References

   [BPS98]    Balakrishnan, H., Padmanabhan, V., Seshan, S., Stemm, M.,
              and R. Katz, "TCP Behavior of a Busy Web Server: Analysis
              and Improvements", Proc. IEEE INFOCOM Conf., March 1998.

   [EL04]     Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New
              End-to-End Retransmission Timer for Reliable Unicast
              Transport", IEEE INFOCOM 2004, March 2004.

   [HB08]     Hurtig, P. and A. Brunstrom, "SCTP: designed for timely
              message delivery?", Springer Telecommunication Systems,
              May 2010.

   [LS00]     Ludwig, R. and K. Sklower, "The Eifel retransmission
              timer", ACM SIGCOMM Comput. Commun. Rev., 30(3),
              July 2000.

   [P09]      Petlund, A., "Improving latency for interactive,
              thin-stream applications over reliable transport", Unipub
              PhD Thesis, Oct 2009.

   [PBP09]    Petlund, A., Beskow, P., Pedersen, J., Paaby, E., Griwodz,
              C., and P. Halvorsen, "Improving SCTP Retransmission
              Delays for Time-Dependent Thin Streams",
              Springer Multimedia Tools and Applications, 45(1-3), 2009.

   [PGH06]    Pedersen, J., Griwodz, C., and P. Halvorsen,
              "Considerations of SCTP Retransmission Delays for Thin
              Streams", IEEE LCN 2006, November 2006.

   [RJ10]     Ramachandran, S., "Web metrics: Size and number of
              resources", Google
              http://code.google.com/speed/articles/web-metrics.html,
              May 2010.

   [TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
              "TCP Loss Probe (TLP): An Algorithm for Fast Recovery of
              Tail Losses", draft-dukkipati-tcpm-tcp-loss-probe-00.txt
              (work in progress), July 2012.

Authors' Addresses

   Per Hurtig
   Karlstad University
   Universitetsgatan 2
   Karlstad, 651 88
   Sweden

   Phone: +46 54 700 23 35
   Email: per.hurtig@kau.se

   Andreas Petlund
   Simula Research Laboratory AS
   P.O. Box 134
   Lysaker, 1325
   Norway

   Phone: +47 67 82 82 00
   Email: apetlund@simula.no

   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   Oslo, N-0316
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no