TCP Maintenance Working Group                             N. Dukkipati
Internet-Draft                                              N. Cardwell
Intended status: Experimental                                  Y. Cheng
Expires: August 29, 2013                                      M. Mathis
                                                            Google, Inc
                                                      February 25, 2013

  Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses
               draft-dukkipati-tcpm-tcp-loss-probe-01.txt

Abstract

   Retransmission timeouts are detrimental to application latency,
   especially for short transfers such as Web transactions, where
   timeouts can often take longer than all the rest of the transaction.
   The primary cause of retransmission timeouts is lost segments at the
   tail of transactions.  This document describes an experimental
   algorithm for TCP to quickly recover lost segments at the end of
   transactions or when an entire window of data or acknowledgments is
   lost.  Tail Loss Probe (TLP) is a sender-only algorithm that allows
   the transport to recover tail losses through fast recovery, as
   opposed to lengthy retransmission timeouts.  If a connection is not
   receiving any acknowledgments for a certain period of time, TLP
   transmits the last unacknowledged segment (loss probe).  In the
   event of a tail loss in the original transmissions, the
   acknowledgment from the loss probe triggers SACK/FACK based fast
   recovery.  TLP effectively avoids long timeouts and thereby improves
   TCP performance.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 29, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Terminology
   2. Loss probe algorithm
      2.1. Pseudocode
      2.2. FACK threshold based recovery
   3. Detecting recovered losses
      3.1. TLP Loss Detection: The Basic Idea
      3.2. TLP Loss Detection: Algorithm Details
   4. Discussion
      4.1. Unifying loss recoveries
      4.2. Recovery of any N-degree tail loss
   5. Experiments with TLP
   6. Related work
   7. Security Considerations
   8. IANA Considerations
   9. References
   Authors' Addresses

1. Introduction

   Retransmission timeouts are detrimental to application latency,
   especially for short transfers such as Web transactions, where
   timeouts can often take longer than all the rest of the transaction.
   This document describes an experimental algorithm, Tail Loss Probe
   (TLP), to invoke fast recovery for losses that would otherwise be
   recoverable only through timeouts.

   The Transmission Control Protocol (TCP) has two methods for
   recovering lost segments.  First, the fast retransmit algorithm
   relies on incoming duplicate acknowledgments (ACKs), which indicate
   that the receiver is missing some data.  After a required number of
   duplicate ACKs have arrived at the sender, it retransmits the first
   unacknowledged segment and continues with a loss recovery algorithm
   such as SACK-based loss recovery [RFC6675].  If the fast retransmit
   algorithm fails for any reason, TCP uses a retransmission timeout as
   the last-resort mechanism to recover lost segments.  If an ACK for a
   given segment is not received within a certain amount of time,
   called the retransmission timeout (RTO), the segment is resent
   [RFC6298].

   Timeouts can occur in a number of situations, such as the following:

   (1) Drop tail at the end of transactions.
   Example: consider a
   transfer of five segments sent on a connection that has a congestion
   window of ten.  Any degree of loss in the tail, such as segments
   four and five, will only be recovered via a timeout.

   (2) Mid-transaction loss of an entire window of data or ACKs.
   Unlike (1), there is more data waiting to be sent.  Example:
   consider a transfer of four segments to be sent on a connection that
   has a congestion window of two.  If the sender transmits two
   segments and both are lost, then the loss will only be recovered via
   a timeout.

   (3) Insufficient number of duplicate ACKs to trigger fast recovery
   at the sender.  The early retransmit mechanism [RFC5827] addresses
   this problem in certain special circumstances by reducing the number
   of duplicate ACKs required to trigger a fast retransmission.

   (4) An unexpectedly long round-trip time (RTT), such that the ACKs
   arrive after the RTO timer expires.  The F-RTO algorithm [RFC5682]
   is designed to detect such spurious retransmission timeouts and at
   least partially undo the consequences of such events.

   Measurements on Google Web servers show that approximately 70% of
   retransmissions for Web transfers are sent after the RTO timer
   expires, while only 30% are handled by fast recovery.  Even on
   servers exclusively serving YouTube videos, RTO-based
   retransmissions account for about 46% of the retransmissions.  If
   the losses are detectable from the ACK stream (through duplicate
   ACKs or SACK blocks), then early retransmit, fast recovery, and
   proportional rate reduction are effective in avoiding timeouts
   [IMC11PRR].  Timeout retransmissions that occur in the recovery and
   disorder states (states indicating that a connection has received
   some duplicate ACKs) account for just 4% of the timeout episodes.
   On the other hand, 96% of the timeout episodes occur without any
   preceding duplicate ACKs or other indication of losses at the sender
   [IMC11PRR].  Early retransmit and fast recovery have no hope of
   repairing losses without these indications.  Efficiently addressing
   situations that would cause timeouts without any prior indication of
   losses is a significant opportunity for additional improvements to
   loss recovery.

   To get a sense of just how long RTOs are in relation to connection
   RTTs, the following is the distribution of RTO/RTT values on Google
   Web servers:

      percentile    50th   75th   90th   95th   99th
      RTO/RTT        4.3   11.3   28.9   53.9    214

   Large RTOs are typically caused by variance in measured RTTs, which
   can result from intermediate queueing and from service variability
   in mobile channels.  Such large RTOs make a huge contribution to the
   long tail of the latency statistics of short flows.  Note that
   simply reducing the length of the RTO does not address the latency
   problem, for two reasons: first, it increases the chances of
   spurious retransmissions.  Second, and more importantly, an RTO
   reduces TCP's congestion window to one segment and forces a slow
   start.  Recovery of losses without relying primarily on the RTO
   mechanism is beneficial for short TCP transfers.

   The question we address in this document is: Can a TCP sender
   recover tail losses of transactions through fast recovery and
   thereby avoid lengthy retransmission timeouts?
   We specify an
   algorithm, Tail Loss Probe (TLP), which sends probe segments to
   trigger duplicate ACKs, with the intent of invoking fast recovery
   more quickly than an RTO at the end of a transaction.  TLP is
   applicable only to connections in the Open state, wherein a sender
   is receiving in-sequence ACKs and has not detected any lost
   segments.  TLP can be implemented by modifying only the TCP sender,
   and does not require any TCP options or changes to the receiver for
   its operation.  For convenience, this document mostly refers to TCP,
   but the algorithms and other discussion are valid for the Stream
   Control Transmission Protocol (SCTP) as well.

   This document is organized as follows.  Section 2 describes the
   basic loss probe algorithm.  Section 3 outlines an algorithm to
   detect the cases in which the loss probe itself plugs a hole; the
   algorithm makes the sender aware that a loss occurred so that it
   performs the appropriate congestion window reduction.  Section 4
   discusses the interaction of TLP with early retransmit in recovering
   any degree of tail loss.  Section 5 presents experimental results
   with TLP on Google Web servers.  Section 6 discusses related work,
   and Section 7 discusses the security considerations.

1.1. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Loss probe algorithm

   The loss probe algorithm is designed for a sender to quickly detect
   tail losses without waiting for an RTO.  We will henceforth use tail
   loss to refer generally to either drops at the tail end of
   transactions or a loss of an entire window of data/ACKs.  TLP works
   for senders with SACK enabled and in the Open state, i.e. the sender
   has so far received in-sequence ACKs with no SACK blocks.  The risk
   of a sender incurring a timeout is high when the sender has not
   received any ACKs for a certain period of time but is unable to
   transmit any further data, either because it is application limited
   (out of new data to send), receiver window (rwnd) limited, or
   congestion window (cwnd) limited.  In these circumstances, the basic
   idea of TLP is to transmit probe segments for the specific purpose
   of eliciting additional ACKs from the receiver.  The initial idea
   was to send some form of zero window probe (ZWP) with one byte of
   new or old data.  The ACK from the ZWP would provide an additional
   opportunity for a SACK block to detect loss without an RTO.
   Additional losses can be detected subsequently and repaired as
   SACK-based fast recovery proceeds.  However, in practice sending a
   single byte of data turned out to be problematic to implement and
   more fragile than necessary.  Instead we use a full segment to
   probe, but have to add complexity to compensate for the probe itself
   masking losses.

   Define the probe timeout (PTO) to be a timer event indicating that
   an ACK is overdue on a connection.  The PTO value is set to
   max(2 * SRTT, 10ms), where SRTT is the smoothed round-trip time
   [RFC6298], and is adjusted to account for the delayed ACK timer when
   there is only one outstanding segment.
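
   The following non-normative Python sketch illustrates one way to
   compute this value, folding in the details specified in Section 2.1
   below; the function and variable names are illustrative only, and
   times are in seconds.

      MIN_PTO = 0.010       # 10ms floor on the probe timeout
      WC_DEL_ACK_T = 0.200  # RECOMMENDED worst-case delayed ACK timer

      def calc_pto(srtt, flight_size, rto):
          """Compute the probe timeout for a connection in Open state."""
          if flight_size > 1:
              pto = max(2 * srtt, MIN_PTO)
          else:
              # A single outstanding segment may be held back by the
              # receiver's delayed ACK timer, so wait longer.
              pto = max(2 * srtt, 1.5 * srtt + WC_DEL_ACK_T)
          # Never schedule the probe later than the RTO itself.
          return min(pto, rto)

   For example, with SRTT = 50ms, FlightSize = 3, and RTO = 400ms, the
   sketch yields a PTO of 100ms.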

   The basic version of the TLP algorithm transmits one probe segment
   after a probe timeout if the connection has outstanding
   unacknowledged data but is otherwise idle, i.e. not receiving any
   ACKs or being cwnd/rwnd/application limited.  The transmitted
   segment, known as a loss probe, can be either a new segment, if one
   is available and the receive window permits, or a retransmission of
   the most recently sent segment, i.e., the segment with the highest
   sequence number.  When there is tail loss, the ACK from the probe
   triggers fast recovery.  In the absence of loss, there is no change
   in the congestion control or loss recovery state of the connection,
   apart from any state related to TLP itself.

   TLP MUST NOT be used for non-SACK connections.  SACK feedback allows
   senders to use the algorithm described in Section 3 to infer whether
   any segments were lost.

2.1. Pseudocode

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined in
   [RFC5681].

   RTO: the transport's retransmission timeout (RTO), based on measured
   round-trip times (RTT) between the sender and receiver, as specified
   in [RFC6298] for TCP.

   PTO: probe timeout, a timer event indicating that an ACK is overdue.
   Its value is constrained to be smaller than or equal to an RTO.

   SRTT: smoothed round-trip time, computed as in [RFC6298].

   Open state: the sender has so far received in-sequence ACKs with no
   SACK blocks, and no other indications (such as a retransmission
   timeout) that a loss may have occurred.

   Consecutive PTOs: back-to-back PTOs all scheduled for the same tail
   packets in a flight.  The (N+1)st PTO is scheduled after
   transmitting the probe segment for the Nth PTO.

   The TLP algorithm works as follows:

   (1) Schedule PTO after transmission of new data in Open state:

       Check for the conditions to schedule a PTO outlined in step (2)
       below.
       FlightSize > 1: schedule PTO in max(2*SRTT, 10ms).
       FlightSize == 1: schedule PTO in max(2*SRTT, 1.5*SRTT+WCDelAckT).
       If RTO is earlier, schedule PTO in its place: PTO = min(RTO, PTO).

   WCDelAckT stands for worst-case delayed ACK timer.  When FlightSize
   is 1, the PTO is additionally inflated by WCDelAckT to compensate
   for a potentially long delayed ACK timer at the receiver.  The
   RECOMMENDED value for WCDelAckT is 200ms.

   A PTO value of 2*SRTT allows a sender to wait long enough to know
   that an ACK is overdue.  Under normal circumstances, i.e. no losses,
   an ACK typically arrives in one RTT.  But choosing the PTO to be
   exactly an RTT is likely to generate spurious probes, given that
   even end-system timings can easily push an ACK above an RTT.  We
   therefore chose the PTO to be the next integral multiple of RTT.  If
   the RTO is smaller than the computed value for the PTO, then a probe
   is scheduled to be sent at the RTO time.  The RTO timer is rearmed
   at the time of sending the probe, as shown in step (3) below.  This
   ensures that a loss probe is always sent before a connection
   experiences an RTO.

   (2) Conditions for scheduling PTO:

       (a) Connection is in Open state.
       (b) Connection is either cwnd limited or application limited.
       (c) Number of consecutive PTOs <= 2.
       (d) Connection is SACK enabled.

   Implementations MAY use one or two consecutive PTOs.

   (3) When PTO fires:

       (a) If a new previously unsent segment exists:
           -> Transmit new segment.
           -> FlightSize += SMSS.  cwnd remains unchanged.
       (b) If no new segment exists:
           -> Retransmit the last segment.
       (c) Increment statistics counter for loss probes.
       (d) If conditions in (2) are satisfied:
           -> Reschedule next PTO.
           Else:
           -> Rearm RTO to fire at epoch 'now+RTO'.

   The reason for retransmitting the last segment in step (3b) is so
   that the ACK will carry SACK blocks and trigger either SACK-based
   loss recovery [RFC6675] or FACK threshold based fast recovery
   [FACK].  On transmission of a TLP, a MIB counter is incremented to
   keep track of the total number of loss probes sent.

   (4) During ACK processing:

       Cancel any existing PTO.
       If conditions in (2) allow:
       -> Reschedule PTO relative to the ACK receipt time.
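
   The following non-normative Python sketch puts steps (1)-(4)
   together as event handlers.  The Connection object, its fields, and
   the helper methods are hypothetical, and calc_pto is the PTO
   computation sketched in Section 2.

      MAX_CONSECUTIVE_PTOS = 2   # implementations MAY use one or two

      def may_schedule_pto(conn):
          # Step (2): conditions for scheduling a PTO.
          return (conn.state == "Open" and                         # (a)
                  conn.cwnd_or_application_limited() and           # (b)
                  conn.consecutive_ptos < MAX_CONSECUTIVE_PTOS and # (c)
                  conn.sack_enabled)                               # (d)

      def on_new_data_sent(conn):
          # Step (1): arm the PTO after transmitting new data in Open
          # state.
          if may_schedule_pto(conn):
              conn.arm_timer("PTO",
                             calc_pto(conn.srtt, conn.flight_size, conn.rto))

      def on_pto_fired(conn):
          # Step (3): transmit the loss probe.
          conn.consecutive_ptos += 1
          if conn.has_new_segment() and conn.rwnd_permits_send():
              conn.transmit_new_segment()       # FlightSize += SMSS
          else:
              conn.retransmit_last_segment()    # highest sequence sent
          conn.stats.loss_probes_sent += 1      # step (3c): MIB counter
          if may_schedule_pto(conn):
              conn.arm_timer("PTO",
                             calc_pto(conn.srtt, conn.flight_size, conn.rto))
          else:
              conn.arm_timer("RTO", conn.rto)   # step (3d): rearm RTO

      def on_ack(conn):
          # Step (4): cancel the pending PTO and, if the conditions
          # still hold, re-arm it relative to the ACK receipt time.
          conn.cancel_timer("PTO")
          conn.consecutive_ptos = 0
          if may_schedule_pto(conn):
              conn.arm_timer("PTO",
                             calc_pto(conn.srtt, conn.flight_size, conn.rto))

   Resetting the consecutive-PTO counter on an incoming ACK is an
   assumption consistent with the definition of consecutive PTOs above,
   which ties the counter to a single tail of the flight.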

   Following is an example of TLP.  All events listed are at the TCP
   sender.

   (1) The sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.
   There is no more new data to transmit.  A PTO is scheduled to fire
   in 2 RTTs, after the transmission of the 10th segment.

   (2) The sender receives acknowledgments (ACKs) for segments 1-5;
   segments 6-10 are lost and no further ACKs are received.  Note that
   the sender (re)schedules its PTO timer relative to the last received
   ACK, which is the ACK for segment 5 in this case.  The sender sets
   the PTO interval using the calculation described in step (1) of the
   algorithm.

   (3) When the PTO fires, the sender retransmits segment 10.

   (4) After an RTT, an ACK for segment 10 arrives.  Its SACK block
   reveals the holes left by segments 6, 7, 8, and 9.  This triggers
   FACK threshold based recovery.

   (5) The connection enters fast recovery and retransmits the
   remaining lost segments.

2.2. FACK threshold based recovery

   At the core of TLP is its reliance on the FACK threshold based
   algorithm to invoke fast recovery.  In this section we specify this
   algorithm.

   Section 3.1 of the Forward Acknowledgement (FACK) paper [FACK]
   describes an alternate algorithm for triggering fast retransmit,
   based on the extent of the SACK scoreboard.  Its goal is to trigger
   fast retransmit as soon as the receiver's reassembly queue is larger
   than the dupack threshold, as indicated by the difference between
   the forward-most SACK block edge and SND.UNA.  This algorithm
   quickly and reliably triggers fast retransmit in the presence of
   burst losses -- often on the first SACK following such a loss.  Such
   a threshold based algorithm also triggers fast retransmit
   immediately in the presence of any reordering with extent greater
   than the dupack threshold.

   FACK threshold based recovery works by introducing a new TCP state
   variable at the sender called SND.FACK.  SND.FACK reflects the
   forward-most data held by the receiver: the highest sequence number
   known to have been received, plus one.  It is updated when a SACK
   block is received acknowledging data with a higher sequence number
   than the current value of SND.FACK.  Note that in non-recovery
   states, SND.FACK is the same as SND.UNA.  The following snippet is
   the pseudocode for FACK threshold based recovery:

      If (SND.FACK - SND.UNA) > dupack threshold:
          -> Invoke Fast Retransmit and Fast Recovery.
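
   A non-normative Python sketch of maintaining SND.FACK and applying
   the threshold test follows.  It assumes byte sequence numbers and
   interprets the dupack threshold in segments (DUPTHRESH * SMSS
   bytes); all names are illustrative.

      DUPTHRESH = 3   # standard duplicate ACK threshold, in segments

      def update_snd_fack(conn, sack_blocks):
          # Advance SND.FACK to one past the highest SACKed byte.
          for start, end in sack_blocks:   # 'end' is exclusive
              if end > conn.snd_fack:
                  conn.snd_fack = end
          conn.snd_fack = max(conn.snd_fack, conn.snd_una)

      def fack_recovery_due(conn, smss):
          # The receiver's reassembly queue exceeds the dupack
          # threshold: invoke fast retransmit and fast recovery.
          return (conn.snd_fack - conn.snd_una) > DUPTHRESH * smss

   In the example above, the SACK for segment 10 advances SND.FACK five
   segments past SND.UNA, so the test fires on the first ACK of the
   episode.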

3. Detecting recovered losses

   If the only loss was the last segment, there is the risk that the
   loss probe itself might repair the loss, effectively masking it from
   congestion control.  To avoid interfering with mandatory congestion
   control [RFC5681], it is imperative that TLP include a mechanism to
   detect when the probe might have masked a loss and to properly
   reduce the congestion window (cwnd).  This section describes an
   algorithm that examines subsequent ACKs to determine whether the
   original segment was lost.

   Since it has been observed that a significant fraction of the hosts
   that support SACK do not support duplicate selective acknowledgments
   (D-SACKs) [RFC2883], the TLP algorithm for detecting such lost
   segments relies only on basic SACK [RFC2018].

3.1. TLP Loss Detection: The Basic Idea

   Consider a TLP retransmission "episode" where a sender retransmits N
   consecutive TLP packets, all for the same tail packet in a flight.
   Let us say that an episode ends when the sender receives an ACK
   above the SND.NXT at the time of the episode.  We want to make sure
   that before the episode ends the sender receives N "TLP dupacks",
   indicating that all N TLP probe segments were unnecessary, so there
   was no loss/hole that needed plugging.  If the sender gets fewer
   than N "TLP dupacks" before the end of the episode, then probably
   the first TLP packet to arrive at the receiver plugged a hole, and
   only the remaining TLP packets that arrived at the receiver
   generated dupacks.

   Note that delayed ACKs complicate the picture, since a delayed ACK
   implies that the sender will receive one fewer ACK than would
   normally be expected.  To mitigate this complication, before sending
   a TLP loss probe retransmission, the sender should attempt to wait
   long enough that the receiver has sent any delayed ACKs that it is
   withholding.  The sender algorithm described in Section 2.1 features
   such a delay.

   If there is ACK loss or a delayed ACK, then this algorithm is
   conservative, because the sender will reduce cwnd when in fact there
   was no packet loss.  In practice this is acceptable, and potentially
   even desirable: if there is reverse path congestion then reducing
   cwnd is prudent.

3.2. TLP Loss Detection: Algorithm Details

   (1) State

   TLPRtxOut: the number of unacknowledged TLP retransmissions in the
   current TLP episode.  The connection maintains this integer counter,
   which tracks the number of TLP retransmissions in the current
   episode for which we have not yet received a "TLP dupack".  The
   sender initializes TLPRtxOut to 0.

   TLPHighRxt: the value of SND.NXT at the time of the TLP
   retransmission.  The TLP sender uses TLPHighRxt to record SND.NXT at
   the time it starts doing TLP transmissions during a given TLP
   episode.

   (2) Initialization

   When a connection enters the ESTABLISHED state, suffers a
   retransmission timeout, or enters fast recovery, it executes the
   following:

      TLPRtxOut = 0;
      TLPHighRxt = 0;

   (3) Upon sending a TLP retransmission:

      if (TLPRtxOut == 0)
          TLPHighRxt = SND.NXT;
      TLPRtxOut++;
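
   A non-normative Python sketch of this state handling, including the
   sending precondition specified in step (5) below, follows; the field
   and function names are illustrative.

      def tlp_init(conn):
          # Step (2): on entering ESTABLISHED, after a retransmission
          # timeout, or on entering fast recovery.
          conn.tlp_rtx_out = 0
          conn.tlp_high_rxt = 0

      def tlp_may_send_probe_rtx(conn):
          # Step (5): permit at most one TLP retransmission episode,
          # i.e. one sequence range with outstanding TLP
          # retransmissions, at a time.
          return conn.tlp_rtx_out == 0 or conn.snd_nxt == conn.tlp_high_rxt

      def on_tlp_retransmission(conn):
          # Step (3): record SND.NXT at the start of the episode and
          # count the outstanding probe.
          if conn.tlp_rtx_out == 0:
              conn.tlp_high_rxt = conn.snd_nxt
          conn.tlp_rtx_out += 1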

   (4) Upon receiving an ACK:

   (a) Tracking ACKs

   We define a "TLP dupack" as a dupack that has all the regular
   properties of a dupack that can trigger fast retransmit, plus the
   ACK acknowledges TLPHighRxt and carries no new SACK information (as
   noted earlier, TLP requires that the receiver support SACK).  This
   is the kind of ACK we expect to see for a TLP transmission if there
   were no losses.  More precisely, the TLP sender considers a TLP
   probe segment as acknowledged if all of the following conditions are
   met:

      (a) TLPRtxOut > 0
      (b) SEG.ACK == TLPHighRxt
      (c) the segment contains no SACK blocks for sequence ranges
          above TLPHighRxt
      (d) the ACK does not advance SND.UNA
      (e) the segment contains no data
      (f) the segment is not a window update

   If all of those conditions are met, then the sender executes the
   following:

      TLPRtxOut--;

   (b) Marking the end of a TLP episode and detecting losses

   If an incoming ACK is after TLPHighRxt, then the sender deems the
   TLP episode over.  At that time, the TLP sender executes the
   following:

      isLoss = (TLPRtxOut > 0) &&
               (segment does not carry a DSACK for TLP retransmission);
      TLPRtxOut = 0;
      if (isLoss)
          EnterRecovery();

   In other words, if the sender detects an ACK for data beyond the TLP
   loss probe retransmission, then (in the absence of reordering on the
   return path of ACKs) it should have received any ACKs that would
   indicate whether the original transmission or any loss probe
   retransmissions were lost.  An exception is the case when the
   segment carries a Duplicate SACK (DSACK) for the TLP retransmission.
   If the TLPRtxOut count is still non-zero, and thus indicates that
   some TLP probe segments remain unacknowledged, then the sender
   should presume that at least one segment was lost, so it should
   enter fast recovery using the proportional rate reduction algorithm
   [IMC11PRR].

   (5) Senders must only send a TLP loss probe retransmission if all
   the conditions from Section 2.1 are met and the following condition
   also holds:

      (TLPRtxOut == 0) || (SND.NXT == TLPHighRxt)

   This ensures that there is at most one sequence range with
   outstanding TLP retransmissions.  The sender maintains this
   invariant so that there is at most one TLP retransmission "episode"
   happening at a time, so that the sender can use the algorithm
   described above in this section to determine when an episode is
   over, and thus when it can infer whether any data segments were
   lost.

   Note that this condition only limits the number of outstanding TLP
   loss probes that are retransmissions.  There may be an arbitrary
   number of outstanding unacknowledged TLP loss probes that consist of
   new, previously unsent data, since the standard retransmission
   timeout and fast recovery algorithms are sufficient to detect losses
   of such probe segments.
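
   The ACK-processing side of the algorithm, steps (4a) and (4b), can
   be sketched as follows; this is non-normative Python, and the ack
   accessors are hypothetical.

      def is_tlp_dupack(conn, ack):
          # Step (4a): conditions (a)-(f) for a "TLP dupack".
          return (conn.tlp_rtx_out > 0 and                       # (a)
                  ack.ack_seq == conn.tlp_high_rxt and           # (b)
                  not ack.has_sack_above(conn.tlp_high_rxt) and  # (c)
                  not ack.advances_snd_una and                   # (d)
                  not ack.has_payload and                        # (e)
                  not ack.is_window_update)                      # (f)

      def tlp_process_ack(conn, ack):
          if is_tlp_dupack(conn, ack):
              conn.tlp_rtx_out -= 1
          if conn.tlp_high_rxt and ack.ack_seq > conn.tlp_high_rxt:
              # Step (4b): the episode is over.  Any probe still
              # unaccounted for implies a hole was plugged, unless a
              # DSACK shows the probe was a spurious retransmission.
              is_loss = (conn.tlp_rtx_out > 0 and
                         not ack.has_dsack_for_probe)
              conn.tlp_rtx_out = 0
              if is_loss:
                  conn.enter_recovery()   # e.g. PRR [IMC11PRR]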

4. Discussion

   In this section we discuss two properties related to TLP.

4.1. Unifying loss recoveries

   The existing loss recovery algorithms in TCP have a discontinuity: a
   single segment loss in the middle of a packet train can be recovered
   via fast recovery, while a loss at the end of the train causes an
   RTO.  Example: in a train of segments 1-10, loss of segment five can
   be recovered quickly through fast recovery, while loss of segment
   ten can only be recovered through a timeout.  In practice, the
   difference between losses that trigger an RTO and those invoking
   fast recovery has more to do with the position of the losses than
   with the intensity or magnitude of congestion at the link.

   TLP unifies the loss recovery mechanisms regardless of the position
   of a loss, so that with TLP a segment loss in the middle of a train,
   as well as at the tail end, triggers the same fast recovery
   mechanisms.

4.2. Recovery of any N-degree tail loss

   The TLP algorithm, when combined with a variant of the early
   retransmit mechanism described below, is capable of recovering any
   tail loss, for a flow of any size, using fast recovery.

   We propose the following enhancement to the early retransmit
   algorithm described in [RFC5827]: in addition to allowing an early
   retransmit in the scenarios described in [RFC5827], we propose to
   allow a delayed early retransmit [IMC11PRR] in the case where there
   are three outstanding segments that have not been cumulatively
   acknowledged and one segment that has been fully SACKed.

   Consider the following scenario, which illustrates an example of how
   this enhancement allows quick loss recovery in a new scenario:

      (1) scoreboard reads: A _ _ _
      (2) TLP retransmission probe of the last (fourth) segment
      (3) the arrival of a SACK for the last segment changes the
          scoreboard to: A _ _ S
      (4) early retransmit and fast recovery of the second and
          third segments

   With this enhancement to the early retransmit mechanism, any degree
   of N-segment tail loss gets a quick recovery mechanism instead of an
   RTO.

   Consider the following taxonomy of tail loss scenarios, and the
   ultimate outcome in each case:

      number of   scoreboard after
      losses      TLP retrans ACKed   mechanism            final outcome
      ---------   -----------------   ------------------   -------------
      (1) AAAL    AAAA                TLP loss detection   all repaired
      (2) AALL    AALS                early retransmit     all repaired
      (3) ALLL    ALLS                early retransmit     all repaired
      (4) LLLL    LLLS                FACK fast recovery   all repaired
      (5) >=5 L   ..LS                FACK fast recovery   all repaired

      key:
      A = ACKed segment
      L = lost segment
      S = SACKed segment

   Let us consider each tail loss scenario in more detail:

   (1) With one segment lost, the TLP loss probe itself will repair the
   loss.  In this case, the sender's TLP loss detection algorithm will
   notice that a segment was lost and repaired, and reduce its
   congestion window in response to the loss.

   (2) With two segments lost, the TLP loss probe itself is not enough
   to repair the loss.  However, when the SACK for the loss probe
   arrives at the sender, the early retransmit mechanism described in
   [RFC5827] will note that with two segments outstanding and the
   second one SACKed, the sender should retransmit the first segment.
   This retransmit will repair the single remaining lost segment.

   (3) With three segments lost, the TLP loss probe itself is not
   enough to repair the loss.  However, when the SACK for the loss
   probe arrives at the sender, the enhanced early retransmit mechanism
   described in this section will note that with three segments
   outstanding and the third one SACKed, the sender should retransmit
   the first segment and enter fast recovery.  The early retransmit and
   fast recovery phase will, together, repair the remaining two lost
   segments.

   (4) With four segments lost, the TLP loss probe itself is not enough
   to repair the loss.  However, when the SACK for the loss probe
   arrives at the sender, the FACK fast retransmit mechanism [FACK]
   will note that with four segments outstanding and the fourth one
   SACKed, the sender should retransmit the first segment and enter
   fast recovery.  The fast retransmit and fast recovery phase will,
   together, repair the remaining three lost segments.

   (5) With five or more segments lost, events proceed much as in case
   (4).  The TLP loss probe itself is not enough to repair the loss.
   However, when the SACK for the loss probe arrives at the sender, the
   FACK fast retransmit mechanism [FACK] will note that with five or
   more segments outstanding and the segment highest in sequence space
   SACKed, the sender should retransmit the first segment and enter
   fast recovery.  The fast retransmit and fast recovery phase will,
   together, repair the remaining lost segments.

   In summary, the TLP mechanism, in conjunction with the proposed
   enhancement to the early retransmit mechanism, is able to recover
   from a tail loss of any number of segments without resorting to a
   costly RTO.
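
   A non-normative Python sketch of the trigger decision after the
   probe's SACK arrives, covering rows (2)-(5) of the taxonomy above,
   follows; the scoreboard representation is illustrative.

      DUPTHRESH = 3   # standard duplicate ACK threshold, in segments

      def recovery_trigger(scoreboard):
          """Return the recovery mechanism triggered by the probe's ACK.

          scoreboard lists the segments above SND.UNA: 'L' for a lost
          (unSACKed) segment, 'S' for a SACKed one, e.g. ['L','L','S'].
          """
          outstanding = len(scoreboard)
          sacked = scoreboard.count('S')
          if outstanding == 2 and sacked == 1:
              return "early retransmit"          # [RFC5827], row (2)
          if outstanding == 3 and sacked == 1:
              return "delayed early retransmit"  # proposed, row (3)
          if sacked:
              # FACK threshold: forward-most SACKed segment is more
              # than DUPTHRESH segments above SND.UNA, rows (4)-(5).
              fack = max(i + 1 for i, s in enumerate(scoreboard)
                         if s == 'S')
              if fack > DUPTHRESH:
                  return "FACK fast recovery"
          return None

      assert recovery_trigger(['L', 'S']) == "early retransmit"
      assert recovery_trigger(['L', 'L', 'S']) == "delayed early retransmit"
      assert recovery_trigger(['L', 'L', 'L', 'S']) == "FACK fast recovery"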

5. Experiments with TLP

   In this section we describe experiments and measurements with TLP
   performed on Google Web servers running Linux 2.6.  The experiments
   ran for several weeks, and measurements were taken across a wide
   range of Google applications.  The main goal of the experiments was
   to instrument and measure TLP's performance relative to a baseline.
   The experiment and the baseline used the same kernel, with an on/off
   switch to enable TLP.

   Our experiments include both the basic TLP algorithm of Section 2
   and its loss detection component of Section 3.  All other
   algorithms, such as early retransmit and FACK threshold based
   recovery, are present in both the experiment and the baseline.
   There are three primary metrics we are interested in: impact on TCP
   latency (average and tail, or 99th percentile, latency),
   retransmission statistics, and the overhead of probe segments
   relative to the total number of transmitted segments.  TCP latency
   is the time elapsed between the server transmitting the first byte
   of the response and its receiving an ACK for the last byte.

   The table below shows the average and 99th-percentile latency
   improvement of key Web applications, including even those responses
   without losses, measured over a period of one week.  The key
   takeaway is: the average response time improved by up to 7% and the
   99th percentile improved by up to 10%.  Nearly all of the
   improvement for TLP is in the tail latency (post-90th percentile).
   The varied improvements across services are due to different
   response-size distributions and traffic patterns.  For example, TLP
   helps the most for Images, as these are served by multiple
   concurrently active TCP connections, which increase the chances of
   tail segment losses.

      Application          Average    99%

      Google Web Search      -3%      -5%
      Google Maps            -5%     -10%
      Google Images          -7%     -10%

   TLP also improved performance in mobile networks -- by 7.2% for Web
   Search and Instant, and by 7.6% for Images, transferred over the
   Verizon network.  To see why and where the latency improvements come
   from, we measured the retransmission statistics, broken down by the
   nature of the retransmission -- timeout retransmission or fast
   recovery.  TLP reduced the number of timeouts by 15% compared to the
   baseline, i.e. (timeouts_baseline - timeouts_tlp) /
   timeouts_baseline = 15%.  Instead, these losses were either
   recovered via fast recovery or by the loss probe retransmission
   itself.  The largest reduction in timeouts is when the sender is in
   the Open state, in which it receives only in-sequence ACKs and no
   duplicate ACKs, likely because of tail losses.
   Correspondingly, the retransmissions occurring in the slow start
   phase after an RTO were reduced by 46% relative to the baseline.
   Note that it is not always possible for TLP to convert 100% of the
   timeouts into fast recovery episodes, because a probe itself may be
   lost.  Also notable in our experiments is a significant decrease in
   the number of spurious timeouts -- the experiment had 61% fewer
   congestion window undo events.  The Linux TCP sender uses either
   DSACK or timestamps to determine whether retransmissions are
   spurious, and employs techniques for undoing congestion window
   reductions.  We also note that the total number of retransmissions
   decreased by 7% with TLP, because of the decrease in spurious
   retransmissions and because the TLP probe itself plugs a hole.

   We also quantified the overhead of probe packets.  The probes
   accounted for 0.48% of all outgoing segments, i.e. (number of probe
   segments / number of outgoing segments) * 100 = 0.48%.  This is a
   reasonable overhead when contrasted with the overall retransmission
   rate of 3.2%.  10% of the probes sent were new segments and the rest
   were retransmissions, which is unsurprising given that short Web
   responses often have no new data to send.  We also found that in
   about 33% of the cases the probes themselves plugged the only hole
   at the receiver, and the loss detection algorithm reduced the
   congestion window.  37% of the probes were not necessary and
   resulted in a duplicate acknowledgment.

   Besides the macro-level latency and retransmission statistics, we
   report some measurements from TCP's internal state variables at the
   point when a probe segment is transmitted.  The following
   distribution shows the FlightSize and congestion window values when
   a PTO is scheduled.  We note that cwnd is not the limiting factor
   and that nearly all of the probe segments are sent within the
   congestion window.

      percentile    10%   25%   50%   75%   90%   99%

      FlightSize      1     1     2     3    10    20
      cwnd            5    10    10    10    17    44

   We have also experimented with a few variations of TLP: multiple
   probe segments versus a single probe for the same tail loss episode,
   and several values for WCDelAckT.  Our experiments show that sending
   just one probe suffices to get most (~90%) of the latency benefits.
   The experimental results reported in this section, like our current
   implementation, limit the number of probes to one, although the
   draft itself allows up to two consecutive probes.  We chose the
   worst-case delayed ACK timer to be 200ms.  When FlightSize equals 1,
   it is important to account for the delayed ACK timer in the PTO
   value in order to bring down the number of unnecessary probe
   segments: with WCDelAckT values of 0ms and 50ms, the probe overhead
   jumped from 0.48% to 3.1% and 2.2%, respectively.  We have also
   experimented with transmitting 1-byte probe retransmissions as
   opposed to an entire-MSS retransmission probe.  While this scheme
   has the advantage of not requiring the loss detection algorithm
   outlined in Section 3, it turned out to be problematic to implement
   correctly in certain TCP stacks.  Additionally, retransmitting a
   1-byte probe costs one more RTT to recover single-packet tail
   losses, which is detrimental to short transfer latency.

6. Related work

   TCP's lengthy and conservative RTO recovery has long been identified
   as a major performance bottleneck for latency-demanding
   applications.
   A well-studied example is online gaming, which requires reliability
   and low latency but little bandwidth.  [GRIWODZ06] shows that
   repeated long RTOs are the dominant performance bottleneck for game
   responsiveness.  The authors of [PETLUND08] propose to use a linear
   RTO to improve performance, which has been incorporated in the Linux
   kernel as a non-default socket option for such thin streams.
   [MONDAL08] further argues that exponential RTO backoff should be
   removed because it is not necessary for the stability of the
   Internet.  In contrast, TLP does not change the RTO timer
   calculation or the exponential backoff.  TLP's approach is to keep
   the behavior after an RTO conservative for stability, but to allow a
   few timely probes before concluding that the network is badly
   congested and cwnd should fall to 1.

   As noted earlier in the Introduction, the F-RTO algorithm [RFC5682]
   reduces the number of spurious timeout retransmissions, and the
   early retransmit mechanism [RFC5827] reduces timeouts when a
   connection has received a certain number of duplicate ACKs.  Both
   are complementary to TLP and can work alongside it.  The rescue
   retransmission introduced in [RFC6675] deals with loss events such
   as AL*SL* (using the same notation as Section 4); TLP covers a wider
   range of events, such as AL*.  We experimented with rescue
   retransmission on Google Web servers but did not observe much
   performance improvement.  When the last segment is lost, it is
   likely that a number of contiguous segments preceding it are also
   lost, i.e. AL* is common.  Timeouts that occur in fast recovery are
   rare.

   [HURTIG13] proposes to offset the RTO timer, when re-arming it, by
   the time already spent waiting for the earliest outstanding packet.
   It is possible to apply the same idea to the TLP timer as well.  We
   have not yet tested such a change to TLP.

   Tail Loss Probe is one of several algorithms designed to maximize
   the robustness of TCP's self clock in the presence of losses.  It
   follows the same principles as Proportional Rate Reduction
   [IMC11PRR] and TCP Laminar [Laminar].

   Finally, we note that Tail Loss Probe does not eliminate all RTOs;
   RTOs remain the dominant mode of loss recovery for short transfers.
   Future work should explore the following directions: transmitting
   multiple loss probes prior to finally resorting to an RTO;
   maintaining ACK clocking for short transfers in the absence of new
   data by clocking out old data in response to incoming ACKs; and
   taking cues from applications to indicate the end of transactions
   and using them for smarter tail loss recovery.

7. Security Considerations

   The security considerations outlined in [RFC5681] apply to this
   document.  At this time we have not found any additional security
   problems with Tail Loss Probe.

8. IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

9. References

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
              M., and Y. Nishida, "A Conservative Loss Recovery
              Algorithm Based on Selective Acknowledgment (SACK) for
              TCP", RFC 6675, August 2012.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              June 2011.

   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J.,
              and P. Hurtig, "Early Retransmit for TCP and Stream
              Control Transmission Protocol (SCTP)", RFC 5827,
              April 2010.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [IMC11PRR] Mathis, M., Dukkipati, N., Cheng, Y., and M. Ghobadi,
              "Proportional Rate Reduction for TCP", Proceedings of the
              2011 ACM SIGCOMM Internet Measurement Conference (IMC),
              2011.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [FACK]     Mathis, M. and J. Mahdavi, "Forward Acknowledgement:
              Refining TCP Congestion Control", ACM SIGCOMM Computer
              Communication Review, Volume 26, Issue 4, October 1996.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              October 1996.

   [GRIWODZ06]
              Griwodz, C. and P. Halvorsen, "The Fun of Using TCP for
              an MMORPG", NOSSDAV, 2006.

   [PETLUND08]
              Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
              "TCP Enhancements for Interactive Thin-Stream
              Applications", NOSSDAV, 2008.

   [MONDAL08] Mondal, A. and A. Kuzmanovic, "Removing Exponential
              Backoff from TCP", ACM SIGCOMM Computer Communication
              Review, 2008.

   [Laminar]  Mathis, M., "Laminar TCP and the case for refactoring TCP
              congestion control", July 2012.

   [HURTIG13] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl,
              "TCP and SCTP RTO Restart", draft-ietf-tcpm-rtorestart-00
              (work in progress), February 2013.

Authors' Addresses

   Nandita Dukkipati
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: nanditad@google.com

   Neal Cardwell
   Google, Inc
   76 Ninth Avenue
   New York, NY 10011
   USA

   Email: ncardwell@google.com

   Yuchung Cheng
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: ycheng@google.com

   Matt Mathis
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: mattmathis@google.com