TCP Maintenance Working Group                                  Y. Cheng
Internet-Draft                                              N. Cardwell
Intended status: Experimental                              N. Dukkipati
Expires: May 4, 2020                                             P. Jha
                                                             Google, Inc
                                                        November 1, 2019

       RACK: a time-based fast loss detection algorithm for TCP
                        draft-ietf-tcpm-rack-06

Abstract

This document presents a new TCP loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time, instead of packet or sequence counts, to detect losses, for modern TCP implementations that can support per-packet timestamps and the selective acknowledgment (SACK) option. It is intended to replace the conventional DUPACK threshold approach and its variants, as well as other nonstandard approaches.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 4, 2020.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors.
All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. In this document, these words will appear with that interpretation only when in UPPER CASE. Lower case uses of these words are not to be interpreted as carrying [RFC2119] significance.

2. Introduction

This document presents a new loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time instead of the conventional packet or sequence counting approaches for detecting losses. RACK deems a packet lost if some packet sent sufficiently later has been delivered. It does this by recording packet transmission times and inferring losses using cumulative acknowledgments or selective acknowledgment (SACK) TCP options.

In the last couple of years we have been observing several increasingly common loss and reordering patterns in the Internet:

1. Lost retransmissions. Traffic policers [POLICER16] and burst losses often cause retransmissions to be lost again, severely increasing TCP latency.

2. Tail drops. Structured request-response traffic turns more losses into tail drops. In such cases, TCP is application-limited, so it cannot send new data to probe losses and has to rely on retransmission timeouts (RTOs).

3. Reordering. Link-layer protocols (e.g., 802.11 block ACK), link bonding, or routers' internal load-balancing can deliver TCP packets out of order. The degree of such reordering is usually within the order of the path round-trip time.

Although TCP stacks (e.g., Linux) implement many of the standard and proposed loss detection algorithms [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM], we've found that together they do not perform well. The main reason is that many of them are based on the classic rule of counting duplicate acknowledgments [RFC5681]. They can detect loss either quickly or accurately, but not both, especially when the sender is application-limited or under unpredictable reordering. And under these conditions none of them can detect lost retransmissions well.

Also, these algorithms, including RFCs, rarely address their interactions with other algorithms. For example, FACK may consider a packet lost while RFC6675 may not. Implementing N algorithms while dealing with N^2 interactions is a daunting and error-prone task.

The goal of RACK is to solve all of the problems above by replacing many of these loss detection algorithms with a single, more effective algorithm that handles both loss and reordering.

3. Overview

The main idea behind RACK is that if a packet has been delivered out of order, then the packets sent chronologically before it were either lost or reordered. This concept is not fundamentally different from [RFC5681][RFC6675][FACK]. But the key innovation in RACK is to use a per-packet transmission timestamp and widely deployed SACK options to conduct time-based inferences, instead of inferring losses with packet or sequence counting approaches.

Using a threshold for counting duplicate acknowledgments (i.e., DupThresh) alone is no longer reliable because of today's prevalent reordering patterns. A common type of reordering is that the last "runt" packet of a window's worth of packet bursts gets delivered first, then the rest arrive shortly after in order. To handle this effectively, a sender would need to constantly adjust the DupThresh to the burst size; but this would risk increasing the frequency of RTOs on real losses.

Today's prevalent lost retransmissions also cause problems with packet-counting approaches [RFC5681][RFC6675][FACK], since those approaches depend on reasoning in sequence number space. Retransmissions break the direct correspondence between ordering in sequence space and ordering in time. So when retransmissions are lost, sequence-based approaches are often unable to infer and quickly repair losses that can be deduced with time-based approaches.

Instead of counting packets, RACK uses the most recently delivered packet's transmission time to judge if some packets sent prior to that time have "expired" by passing a certain reordering settling window. On each ACK, RACK marks any already-expired packets lost, and for any packets that have not yet expired it waits until the reordering window passes and then marks those lost as well. In either case, RACK can repair the loss without waiting for a (long) RTO. RACK can be applied to both fast recovery and timeout recovery, and can detect losses on both originally transmitted and retransmitted packets, making it a great all-weather loss detection mechanism.

4. Design Rationale for Reordering Tolerance

The reordering behavior of networks can evolve (over years) in response to the behavior of transport protocols and applications, as well as the needs of network designers and operators. From a network or link designer's viewpoint, parallelization (e.g., link bonding) is the easiest way to get a network to go faster. Therefore reordering is their main constraint on speed, and there is pressure to relax that constraint. If RACK becomes widely deployed, the underlying networks may introduce more reordering for higher throughput. But this may result in excessive reordering that hurts end-to-end performance:

1. End host packet processing: extreme reordering on high-speed networks would incur high CPU cost by greatly reducing the effectiveness of aggregation mechanisms, such as large receive offload (LRO) and generic receive offload (GRO), and significantly increasing the number of ACKs.

2. Congestion control: TCP congestion control implicitly assumes that the feedback from ACKs comes from the same bottleneck. Therefore it cannot handle well scenarios where packets traverse largely disjoint paths.

3. Loss recovery: Having an excessively large reordering window to accommodate widely different latencies from different paths would increase the latency of loss recovery.

An end-to-end transport protocol cannot tell immediately whether a hole is reordering or loss. It can only distinguish between the two in hindsight, if the hole in the sequence space gets filled later without a retransmission. How long the sender waits for such potential reordering events to settle is determined by the current reordering window.

Given these considerations, a core design philosophy of RACK is to adapt to the measured duration of reordering events, within reasonable and specific bounds. To accomplish this, RACK places the following mandates on the reordering window:

1. The initial RACK reordering window SHOULD be set to a small fraction of the round-trip time.

2. If no reordering has been observed, then RACK SHOULD honor the classic 3-DUPACK rule for initiating fast recovery. One simple way to implement this is to temporarily override the reordering window to 0.

3. The RACK reordering window SHOULD leverage Duplicate Selective Acknowledgement (DSACK) information [RFC3708] to adaptively estimate the duration of reordering events.

4. The RACK reordering window MUST be bounded, and this bound SHOULD be one round trip.

As a flow starts, either condition 1 or condition 2 or both would trigger RACK to start the recovery process quickly. The low initial reordering window and the use of the 3-DUPACK rule are key to achieving low-latency loss recovery for short flows, by risking spurious retransmissions to recover losses quickly. The rationale is that spurious retransmissions for short flows are not expected to produce excessive network traffic.

For long flows the design tolerates reordering within a round trip. This handles reordering caused by path divergence on small time scales (reordering within the round-trip time of the shortest path), which should tolerate much of the reordering from link bonding, multipath routing, or link-layer out-of-order delivery. It also relaxes ordering constraints to allow sending flights of TCP packets on different paths dynamically for better load-balancing (e.g., flowlets).

However, the fact that the initial RACK reordering window is low, and the RACK reordering window's adaptive growth is bounded, means that there will continue to be a cost to reordering and a limit to RACK's adaptation to reordering. This maintains a disincentive for network designers and operators to introduce needless or excessive reordering, particularly because they have to allow for low-round-trip-time paths. This means RACK will not encourage networks to perform inconsiderate fine-grained packet-spraying over highly disjoint paths with very different characteristics. There are good alternative solutions, such as MPTCP, for such networks.

To conclude, the RACK algorithm aims to adapt to small degrees of reordering, quickly recover most losses within one to two round trips, and avoid costly retransmission timeouts (RTOs). In the presence of reordering, the adaptation algorithm can impose sometimes-needless delays when it waits to disambiguate loss from reordering, but the penalty for waiting is bounded to one round trip and such delays are confined to longer-running flows.

This document provides a concrete and detailed reordering window adaptation algorithm for implementors. We note that the specifics of the algorithm are likely to evolve over time. But that is a separate engineering optimization that is out of scope for this document.

5. Requirements

The reader is expected to be familiar with the definitions given in the TCP congestion control [RFC5681] and selective acknowledgment [RFC2018] RFCs. Familiarity with the conservative SACK-based recovery for TCP [RFC6675] is not expected, but is helpful.

RACK has three requirements:

1. The connection MUST use selective acknowledgment (SACK) options [RFC2018].

2. For each packet sent, the sender MUST store its most recent transmission time with (at least) millisecond granularity. For round-trip times lower than a millisecond (e.g., intra-datacenter communications), microsecond granularity would significantly help the detection latency, but is not required.

3. For each packet sent, the sender MUST remember whether the packet has been retransmitted or not.

We assume that requirement 1 implies the sender keeps a SACK scoreboard, which is a data structure to store selective acknowledgment information on a per-connection basis ([RFC6675] section 3). For ease of explaining the algorithm, we use a pseudo-scoreboard that manages the data in sequence number ranges. But the specifics of the data structure are left to the implementor.

RACK requires no changes at the receiver.

6. Definitions

The reader is expected to be familiar with the definitions given in [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

6.1. Definitions of variables

A sender implementing RACK needs to store these new RACK variables:

"Packet.xmit_ts" is the time of the last transmission of a data packet, including retransmissions, if any. The sender needs to record the transmission time for each packet sent and not yet acknowledged. The time MUST be stored at millisecond granularity or finer.

"RACK.packet" is the packet that was sent most recently (including retransmissions), among all the packets that have been either selectively or cumulatively acknowledged.

"RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

"RACK.end_seq" is the ending TCP sequence number of RACK.packet.

"RACK.rtt" is the RTT of the most recently delivered packet on the connection (either cumulatively acknowledged or selectively acknowledged) that was not marked invalid as a possible spurious retransmission.

"RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

"RACK.reo_wnd" is a reordering window computed in the unit of time used for recording packet transmission times. It is used to defer the moment at which RACK marks a packet lost.

"RACK.dupthresh" is a constant specifying the number of duplicate acknowledgments, or selectively acknowledged segments, that can (under certain conditions) trigger fast recovery, similar to [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined to be 3.

"RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the connection.

"RACK.ack_ts" is the time when all the sequences in RACK.packet were selectively or cumulatively acknowledged.

"RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd.

"RACK.reo_wnd_persist" is the number of loss recoveries before resetting RACK.reo_wnd.

"RACK.dsack" indicates whether a DSACK option has been received since the last RACK.reo_wnd change.

"RACK.pkts_sacked" is the total number of packets selectively acknowledged in the SACK scoreboard.

"RACK.reord" indicates whether the connection has detected packet reordering event(s).

"RACK.fack" is the highest selectively or cumulatively acknowledged sequence.

Note that the Packet.xmit_ts variable is per packet in flight. The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT variables are kept in the per-connection TCP control block. RACK.packet and RACK.ack_ts are used as local variables in the algorithm.

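As a non-normative illustration, the following Python sketch shows one way to lay out the per-packet and per-connection state defined above. The class and field names are illustrative only (they mirror the variable names above); times are floats in seconds. The sketches in later sections build on these classes:

    from dataclasses import dataclass

    @dataclass
    class Packet:
        start_seq: int               # first sequence number of the packet
        end_seq: int                 # ending sequence number of the packet
        xmit_ts: float = 0.0         # Packet.xmit_ts
        retransmitted: bool = False  # ever retransmitted?
        lost: bool = False           # marked lost in the scoreboard

    @dataclass
    class RackState:
        # Kept in the per-connection TCP control block:
        xmit_ts: float = 0.0            # RACK.xmit_ts
        end_seq: int = 0                # RACK.end_seq
        rtt: float = 0.0                # RACK.rtt
        rtt_seq: int = 0                # RACK.rtt_seq
        reo_wnd: float = 0.0            # RACK.reo_wnd
        min_rtt: float = float("inf")   # RACK.min_RTT
        dupthresh: int = 3              # RACK.dupthresh
        reo_wnd_incr: int = 1           # RACK.reo_wnd_incr
        reo_wnd_persist: int = 16       # RACK.reo_wnd_persist
        dsack: bool = False             # RACK.dsack
        pkts_sacked: int = 0            # RACK.pkts_sacked
        reord: bool = False             # RACK.reord
        fack: int = 0                   # RACK.fack
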
324 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd 326 "RACK.reo_wnd_persist" is the number of loss recoveries before 327 resetting RACK.reo_wnd 329 "RACK.dsack" indicates if a DSACK option has been received since last 330 RACK.reo_wnd change "RACK.pkts_sacked" returns the total number of 331 packets selectively acknowledged in the SACK scoreboard. 333 "RACK.reord" indicates the connection has detected packet reordering 334 event(s) 335 "RACK.fack" is the highest selectively or cumulatively acknowledged 336 sequence 338 Note that the Packet.xmit_ts variable is per packet in flight. The 339 RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT 340 variables are kept in the per-connection TCP control block. 341 RACK.packet and RACK.ack_ts are used as local variables in the 342 algorithm. 344 7. Algorithm Details 346 7.1. Transmitting a data packet 348 Upon transmitting a new packet or retransmitting an old packet, 349 record the time in Packet.xmit_ts. RACK does not care if the 350 retransmission is triggered by an ACK, new application data, an RTO, 351 or any other means. 353 7.2. Upon receiving an ACK 355 Step 1: Update RACK.min_RTT. 357 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 358 update the estimated minimum RTT in RACK.min_RTT. The sender can 359 track a simple global minimum of all RTT measurements from the 360 connection, or a windowed min-filtered value of recent RTT 361 measurements. This document does not specify an exact approach. 363 Step 2: Update RACK stats 365 Given the information provided in an ACK, each packet cumulatively 366 ACKed or SACKed is marked as delivered in the scoreboard. Among all 367 the packets newly ACKed or SACKed in the connection, record the most 368 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 369 Sometimes the timestamps of RACK.Packet and Packet could carry the 370 same transmit timestamps due to clock granularity or segmentation 371 offloading (i.e. the two packets were handed to the NIC as a single 372 unit). In that case the sequence numbers of RACK.end_seq and 373 Packet.end_seq are compared to break the tie. 375 Since an ACK can also acknowledge retransmitted data packets, and 376 retransmissions can be spurious, the sender must take care to avoid 377 spurious inferences. For example, if the sender were to use timing 378 information from a spurious retransmission, the RACK.rtt could be 379 vastly underestimated. 381 To avoid spurious inferences, ignore a packet as invalid if any of 382 its TCP sequences have been retransmitted before and either of two 383 conditions is true: 385 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 386 option [RFC7323], if available, indicates the ACK was not 387 acknowledging the last retransmission of the packet. 389 2. The packet was last retransmitted less than RACK.min_rtt ago. 391 If the ACK is not ignored as invalid, update the RACK.rtt to be the 392 RTT sample calculated using this ACK, and continue. If this ACK or 393 SACK was for the most recently sent packet, then record the 394 RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK. 395 Otherwise exit here and omit the following steps. 397 Notice that the second condition above is a heuristic. This 398 heuristic would fail to update RACK stats if the packet is spuriously 399 retransmitted because of a recent minimum RTT decrease (e.g. path 400 change). 
Consequently, RACK may not detect losses from ACK events, and the recovery would resort to the (slower) TLP or RTO timer-based events. However, such events should be rare, and the connection would pick up the new minimum RTT when the recovery ends, avoiding repeated similar failures.

Step 2 may be summarized in pseudocode as:

    RACK_sent_after(t1, seq1, t2, seq2):
        If t1 > t2:
            Return true
        Else if t1 == t2 AND seq1 > seq2:
            Return true
        Else:
            Return false

    RACK_update():
        For each Packet newly acknowledged cumulatively or selectively:
            rtt = Now() - Packet.xmit_ts
            If Packet.retransmitted is TRUE:
                If ACK.ts_option.echo_reply < Packet.xmit_ts:
                    Return
                If rtt < RACK.min_rtt:
                    Return

            RACK.rtt = rtt
            If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                               RACK.xmit_ts, RACK.end_seq):
                RACK.xmit_ts = Packet.xmit_ts
                RACK.end_seq = Packet.end_seq

Step 3: Detect packet reordering.

To detect reordering, the sender looks for original data packets being delivered out of order in sequence space. The sender tracks the highest sequence selectively or cumulatively acknowledged in the RACK.fack variable. The name "fack" stands for the most forward ACK and originated in the [FACK] draft. If the ACK selectively or cumulatively acknowledges an unacknowledged and also never retransmitted sequence below RACK.fack, then the corresponding packet has been reordered and RACK.reord is set to TRUE.

The heuristic above only detects reordering if the reordered packet has not yet been retransmitted. This is a major drawback, because if RACK has a low reordering window and the network is reordering packets, RACK may falsely retransmit frequently. Consequently, RACK may fail to detect reordering and thus fail to increase the reordering window, because the reordered packets were already (falsely) retransmitted.

DSACK [RFC3708] can help mitigate this issue. A false retransmission would solicit a DSACK option in the ACK. Therefore, if the ACK has a DSACK option covering some sequences that were both acknowledged and retransmitted, this implies the original packet was reordered but RACK retransmitted the packet too quickly; the sender should then set RACK.reord to TRUE.

    RACK_detect_reordering():
        For each Packet newly acknowledged cumulatively or selectively:
            If Packet.end_seq > RACK.fack:
                RACK.fack = Packet.end_seq
            Else if Packet.end_seq < RACK.fack AND
                    Packet.retransmitted is FALSE:
                RACK.reord = TRUE

        For each Packet covered by the DSACK option:
            If Packet.retransmitted is TRUE:
                RACK.reord = TRUE

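As a non-normative companion to the pseudocode above, the following runnable Python sketch implements Steps 2 and 3 using the Packet and RackState classes sketched in Section 6.1. The tsecr argument (the ACK's Timestamp Echo Reply, or None if unavailable) and the dsacked list (packets covered by a DSACK option) are assumptions of this sketch:

    def rack_sent_after(t1, seq1, t2, seq2):
        # Sequence numbers break ties between packets carrying the same
        # transmit timestamp (e.g., segments of a single TSO blob).
        return t1 > t2 or (t1 == t2 and seq1 > seq2)

    def rack_update(rack, newly_acked, now, tsecr=None):
        for pkt in newly_acked:
            rtt = now - pkt.xmit_ts
            if pkt.retransmitted:
                # Mirror the pseudocode: abort on a sample that may stem
                # from a spurious retransmission.
                if tsecr is not None and tsecr < pkt.xmit_ts:
                    return
                if rtt < rack.min_rtt:
                    return
            rack.rtt = rtt
            if rack_sent_after(pkt.xmit_ts, pkt.end_seq,
                               rack.xmit_ts, rack.end_seq):
                rack.xmit_ts = pkt.xmit_ts
                rack.end_seq = pkt.end_seq

    def rack_detect_reordering(rack, newly_acked, dsacked):
        for pkt in newly_acked:
            if pkt.end_seq > rack.fack:
                rack.fack = pkt.end_seq
            elif pkt.end_seq < rack.fack and not pkt.retransmitted:
                rack.reord = True
        # A DSACK for a retransmitted packet implies the original was
        # reordered but was retransmitted too quickly.
        for pkt in dsacked:
            if pkt.retransmitted:
                rack.reord = True
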
Step 4: Update RACK reordering window.

To handle the prevalent small degree of reordering, RACK.reo_wnd serves as an allowance for settling time before marking a packet lost. This section documents a detailed algorithm following the design rationale section. RACK starts initially with a conservative window of min_RTT/4. If no reordering has been observed, RACK uses a RACK.reo_wnd of 0 during loss recovery or when the number of DUPACKs exceeds the classic DUPACK threshold, in order to retransmit quickly. The subtle difference between this approach and the conventional one [RFC5681][RFC6675] is discussed later in the section "RACK and TLP discussions".

Further, RACK MAY use DSACK [RFC3708] to adapt the reordering window to higher degrees of reordering, if DSACK is supported. Receiving an ACK with a DSACK indicates a spurious retransmission, which in turn suggests that the RACK reordering window, RACK.reo_wnd, is likely too small. The sender MAY increase the RACK.reo_wnd window linearly for every round trip in which the sender receives a DSACK, so that after N distinct round trips in which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an upper bound of SRTT. The inflated RACK.reo_wnd persists for 16 loss recoveries, after which it resets to its starting value, min_RTT / 4.

The following pseudocode implements the above algorithm. Note that extensions that require additional TCP features (e.g., DSACK) would still work if the feature functions simply return false.

    RACK_update_reo_wnd():
        RACK.min_RTT = TCP_min_RTT()
        If DSACK option is present:
            RACK.dsack = true

        If SND.UNA < RACK.rtt_seq:
            RACK.dsack = false /* React to DSACK once per round trip */

        If RACK.dsack:
            RACK.reo_wnd_incr += 1
            RACK.dsack = false
            RACK.rtt_seq = SND.NXT
            RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
        Else if exiting loss recovery:
            RACK.reo_wnd_persist -= 1
            If RACK.reo_wnd_persist <= 0:
                RACK.reo_wnd_incr = 1

        If RACK.reord is FALSE:
            If in loss recovery: /* If in fast or timeout recovery */
                RACK.reo_wnd = 0
                Return
            Else if RACK.pkts_sacked >= RACK.dupthresh:
                RACK.reo_wnd = 0
                Return
        RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
        RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

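The reordering window adaptation in Step 4 can likewise be sketched as runnable Python (same assumptions as the previous sketch; srtt, snd_una, snd_nxt, and the two recovery flags are supplied by the caller, and rack.min_rtt is assumed to be up to date from Step 1):

    def rack_update_reo_wnd(rack, dsack_seen, snd_una, snd_nxt, srtt,
                            in_recovery, exiting_recovery):
        if dsack_seen:
            rack.dsack = True
        if snd_una < rack.rtt_seq:
            rack.dsack = False      # react to DSACK once per round trip
        if rack.dsack:
            rack.reo_wnd_incr += 1
            rack.dsack = False
            rack.rtt_seq = snd_nxt
            rack.reo_wnd_persist = 16   # keep window for 16 recoveries
        elif exiting_recovery:
            rack.reo_wnd_persist -= 1
            if rack.reo_wnd_persist <= 0:
                rack.reo_wnd_incr = 1
        if not rack.reord and (in_recovery or
                               rack.pkts_sacked >= rack.dupthresh):
            rack.reo_wnd = 0.0      # no reordering seen: recover quickly
            return
        rack.reo_wnd = min(rack.min_rtt / 4 * rack.reo_wnd_incr, srtt)

After N round trips in which a DSACK was received, reo_wnd_incr is N+1 and the window is (N+1) * min_RTT / 4, capped at SRTT, exactly as described above.
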
Step 5: Detect losses.

For each packet that has not been SACKed, if RACK.xmit_ts is after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its corresponding sequence range) lost in the scoreboard. The rationale is that if another packet that was sent later has been delivered, and the reordering window or "reordering settling time" has already passed, then the packet was likely lost.

If another packet that was sent later has been delivered, but the reordering window has not passed, then it is not yet safe to deem the unacked packet lost. Using the basic algorithm above, the sender would wait for the next ACK to further advance RACK.xmit_ts; but this risks a timeout (RTO) if no more ACKs come back (e.g., due to losses or application limit). For timely loss detection, the sender MAY install a "reordering settling" timer set to fire at the earliest moment at which it is safe to conclude that some packet is lost. The earliest moment is the time it takes to expire the reordering window of the earliest unacked packet in flight.

This timer expiration value can be derived as follows. As a starting point, we consider that the reordering window has passed if the RACK.packet was sent sufficiently after the packet in question, or a sufficient time has elapsed since the RACK.packet was S/ACKed, or some combination of the two. More precisely, RACK marks a packet as lost if the reordering window for a packet has elapsed through the sum of:

1. the delta in transmit time between the packet and the RACK.packet

2. the delta in time between RACK.ack_ts and now

So we mark a packet as lost if:

    RACK.xmit_ts >= Packet.xmit_ts
    AND
    (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

If we solve this second condition for "now", the moment at which we can declare a packet lost, then we get:

    now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the sender used to set RACK.xmit_ts: the round-trip time of the most recently (re)transmitted packet that has been delivered. To be more robust to reordering, RACK uses a more conservative RTT value to decide if an unacknowledged packet should be considered lost, RACK.rtt: the round-trip time of the most recently delivered packet on the connection that was not marked invalid as a possible spurious retransmission.

When packets are delivered in order, the most recently (re)transmitted packet that has been delivered is also the most recently delivered; hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts. But if packets were reordered, then the packet delivered most recently was sent before the most recently (re)transmitted packet; hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts).

Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous equation reduces to saying that the sender can declare a packet lost when:

    now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

In turn, that is equivalent to stating that a RACK sender should declare a packet lost when:

    Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

The following pseudocode implements the algorithm above. When an ACK is received or the RACK timer expires, call RACK_detect_loss(). The algorithm includes an additional optimization to break timestamp ties by using the TCP sequence space. The optimization is particularly useful to detect losses in a timely manner with TCP Segmentation Offload, where multiple packets in one TSO blob have identical timestamps. It is also useful when the timestamp clock granularity is close to or longer than the actual round-trip time.

    RACK_detect_loss():
        timeout = 0

        For each packet, Packet, not acknowledged yet:
            If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
                Continue /* Packet lost but not yet retransmitted */

            If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                               Packet.xmit_ts, Packet.end_seq):
                remaining = Packet.xmit_ts + RACK.rtt +
                            RACK.reo_wnd - Now()
                If remaining <= 0:
                    Packet.lost = TRUE
                Else:
                    timeout = max(remaining, timeout)

        If timeout != 0:
            Arm a timer to call RACK_detect_loss() after timeout

Implementation optimization: looping through packets in the SACK scoreboard above could be very costly on large-BDP networks, since the inflight could be very large. If the implementation can organize the scoreboard data structures to have packets sorted by the last (re)transmission time, then the loop can start at the least recently sent packet and abort at the first packet sent after RACK.xmit_ts. This can be implemented by using a separate list sorted in time order. The implementation inserts the packet at the tail of the list when it is (re)transmitted, and removes a packet from the list when it is delivered or marked lost. We RECOMMEND such an optimization because it enables implementations to support high-BDP networks. This optimization is implemented in Linux and sees orders of magnitude improvement in CPU usage on high-speed WAN networks.

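A non-normative sketch of the RECOMMENDED time-sorted structure follows; collections.OrderedDict stands in for a list supporting O(1) tail insertion and removal, and rack_sent_after() and the Packet class are from the earlier sketches. Unlike the unsorted pseudocode above, the time ordering lets this variant stop at the first unexpired packet and return its remaining time, the earliest moment at which some packet can be declared lost:

    from collections import OrderedDict

    class XmitOrderedScoreboard:
        def __init__(self):
            self.by_xmit_time = OrderedDict()   # start_seq -> Packet

        def on_transmit(self, pkt, now):
            pkt.xmit_ts = now
            # A (re)transmission moves the packet to the tail.
            self.by_xmit_time.pop(pkt.start_seq, None)
            self.by_xmit_time[pkt.start_seq] = pkt

        def on_delivered_or_lost(self, pkt):
            self.by_xmit_time.pop(pkt.start_seq, None)

        def rack_detect_loss(self, rack, now):
            # Walk from the least recently sent packet.
            for pkt in list(self.by_xmit_time.values()):
                if not rack_sent_after(rack.xmit_ts, rack.end_seq,
                                       pkt.xmit_ts, pkt.end_seq):
                    break   # sent after RACK.packet; so were the rest
                remaining = pkt.xmit_ts + rack.rtt + rack.reo_wnd - now
                if remaining <= 0:
                    pkt.lost = True
                    self.on_delivered_or_lost(pkt)
                else:
                    return remaining   # arm reordering settling timer
            return 0.0                 # nothing pending

Note that removing a lost packet from the list (and re-adding it upon retransmission) has the same effect as the "lost but not yet retransmitted" skip in the pseudocode above.
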
7.3. Tail Loss Probe: fast recovery for tail losses

This section describes a supplemental algorithm, Tail Loss Probe (TLP), which leverages RACK to further reduce RTO recoveries. TLP triggers fast recovery to quickly repair tail losses that can otherwise be recovered only by RTOs. After an original data transmission, TLP sends a probe data segment within one to two RTTs. The probe data segment can either be new, previously unsent data, or a retransmission of previously sent data just below SND.NXT. In either case the goal is to elicit more feedback from the receiver, in the form of an ACK (potentially with SACK blocks), to allow RACK to trigger fast recovery instead of an RTO.

An RTO occurs when the first unacknowledged sequence number is not acknowledged after a conservative period of time has elapsed [RFC6298]. Common causes of RTOs include:

1. The entire flight of data is lost.

2. Tail losses of data segments at the end of an application transaction.

3. Tail losses of ACKs at the end of an application transaction.

4. Lost retransmits, which can halt fast recovery based on [RFC6675] if the ACK stream completely dries up. For example, consider a window of three data packets (P1, P2, P3) that are sent; P1 and P2 are dropped. On receipt of a SACK for P3, RACK marks P1 and P2 as lost and retransmits them as R1 and R2. Suppose R1 and R2 are lost as well, so there are no more returning ACKs to detect R1 and R2 as lost. Recovery stalls.

5. An unexpectedly long round-trip time (RTT). This can cause ACKs to arrive after the RTO timer expires. The F-RTO algorithm [RFC5682] is designed to detect such spurious retransmission timeouts and at least partially undo the consequences of such events, but F-RTO cannot be used in many situations.

7.4. Tail Loss Probe: An Example

The following is an example of TLP. All events listed are at a TCP sender.

1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is no more new data to transmit. A PTO is scheduled to fire in 2 RTTs, after the transmission of the 10th segment.

2. Sender receives acknowledgements (ACKs) for segments 1-5; segments 6-10 are lost and no ACKs are received. The sender reschedules its PTO timer relative to the last received ACK, which is the ACK for segment 5 in this case. The sender sets the PTO interval using the calculation described in step (2) of the algorithm.

3. When the PTO fires, the sender retransmits segment 10.

4. After an RTT, a SACK for packet 10 arrives. The ACK also carries SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based loss recovery.

5. The connection enters fast recovery and retransmits the remaining lost segments.

7.5. Tail Loss Probe Algorithm Details

We define the terminology used in specifying the TLP algorithm:

FlightSize: amount of outstanding data in the network, as defined in [RFC5681].

RTO: The transport's retransmission timeout (RTO) is based on measured round-trip times (RTT) between the sender and receiver, as specified in [RFC6298] for TCP.

PTO: Probe timeout (PTO) is a timer event indicating that an ACK is overdue. Its value is constrained to be smaller than or equal to an RTO.

SRTT: smoothed round-trip time, computed as specified in [RFC6298].

TLPRxtOut: a boolean indicating whether there is an unacknowledged TLP retransmission.

TLPHighRxt: the value of SND.NXT at the time of sending a TLP retransmission.

The TLP algorithm has three phases, which we discuss in turn.

7.5.1. Phase 1: Scheduling a loss probe

Step 1: Check conditions for scheduling a PTO.

A sender should check to see if it should schedule a PTO in the following situations:

1. After transmitting new data that was not itself a TLP probe

2. Upon receiving an ACK that cumulatively acknowledges data

3. Upon receiving a SACK that selectively acknowledges data that was last sent before the segment with SEG.SEQ=SND.UNA was last (re)transmitted

A sender should schedule a PTO only if all of the following conditions are met:

1. The connection supports SACK [RFC2018]

2. The connection has no SACKed sequences in the SACK scoreboard

3. The connection is not in loss recovery

If a PTO can be scheduled according to these conditions, the sender should schedule a PTO. If there was a previously scheduled PTO or RTO pending, then that pending PTO or RTO should first be cancelled, and then the new PTO should be scheduled.

If a PTO cannot be scheduled according to these conditions, then the sender MUST arm the RTO timer if there is unacknowledged data in flight.

Step 2: Select the duration of the PTO.

A sender SHOULD use the following logic to select the duration of a PTO:

    TLP_timeout():
        If SRTT is available:
            PTO = 2 * SRTT
            If FlightSize = 1:
                PTO += WCDelAckT
            Else:
                PTO += 2ms
        Else:
            PTO = 1 sec

        If Now() + PTO > TCP_RTO_expire():
            PTO = TCP_RTO_expire() - Now()

Aiming for a PTO value of 2*SRTT allows a sender to wait long enough to know that an ACK is overdue. Under normal circumstances, i.e., no losses, an ACK typically arrives within one SRTT. But choosing PTO to be exactly an SRTT is likely to generate spurious probes, given that network delay variance and even end-system timings can easily push an ACK to be above an SRTT. We chose PTO to be the next integral multiple of SRTT.

Similarly, network delay variations and end-system processing latencies and timer granularities can easily delay ACKs beyond 2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value (and MAY add more if the sending host OS timer granularity is coarser than 1ms).

WCDelAckT stands for worst-case delayed ACK timer. When FlightSize is 1, PTO is inflated by WCDelAckT time to compensate for a potentially long delayed ACK timer at the receiver. The RECOMMENDED value for WCDelAckT is 200ms.

Finally, if the time at which an RTO would fire (here denoted "TCP_RTO_expire") is sooner than the computed time for the PTO, then a probe is scheduled to be sent at that earlier time.

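The PTO computation can be expressed as a small runnable Python function (a non-normative sketch; srtt is in seconds, or None when no RTT sample exists yet, and rto_expire_time is the absolute time at which the RTO would fire):

    WCDELACKT = 0.200     # worst-case delayed ACK timer: RECOMMENDED 200ms
    MIN_PTO_PAD = 0.002   # at least 2ms of slack beyond 2*SRTT

    def tlp_timeout(srtt, flight_size, now, rto_expire_time):
        if srtt is not None:
            pto = 2 * srtt
            # A single outstanding segment may be held by a delayed ACK.
            pto += WCDELACKT if flight_size == 1 else MIN_PTO_PAD
        else:
            pto = 1.0     # no RTT sample yet: fall back to 1 second
        # Never schedule the probe later than the RTO would fire.
        return min(pto, rto_expire_time - now)
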
7.5.2. Phase 2: Sending a loss probe

When the PTO fires, transmit a probe data segment:

    TLP_send_probe():
        If an unsent segment exists AND
           the receive window allows new data to be sent:
            Transmit the lowest-sequence unsent segment of up to SMSS
            Increment FlightSize by the size of the newly-sent segment

        Else if TLPRxtOut is not set:
            Retransmit the highest-sequence segment sent so far
            The cwnd remains unchanged

When the loss probe is a retransmission, the sender uses the highest-sequence segment sent so far. This is in order to deal with the retransmission ambiguity problem in TCP. Suppose a sender sends N segments, then retransmits the last segment (segment N) as a loss probe, and then receives a SACK for segment N. As long as the sender waits for any required RACK reordering settling timer to then expire, it does not matter whether that SACK was for the original transmission of segment N or the TLP retransmission; in either case the arrival of the SACK for segment N provides evidence that the N-1 segments preceding segment N were likely lost. In the case where there is only one original outstanding segment of data (N=1), the same logic (trivially) applies: an ACK for a single outstanding segment tells the sender the N-1=0 segments preceding that segment were lost. Furthermore, whether there are N>1 or N=1 outstanding segments, there is a question about whether the original last segment or its TLP retransmission were lost; the sender estimates this using TLP recovery detection (see below).

Note that whether or not a probe was sent in TLP_send_probe(), the sender MUST arm the RTO timer, not the PTO timer, at the end of TLP_send_probe() if FlightSize is not zero. This ensures that the sender does not send repeated, back-to-back TLP probes. Checking TLPRxtOut prior to sending the loss probe is also critical to avoid TLP loops if an application writes periodically at an interval less than PTO.

7.5.3. Phase 3: ACK processing

On each incoming ACK, the sender should check the conditions in Step 1 of Phase 1 to see if it should schedule (or reschedule) the loss probe timer.

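The probe transmission decision in Phase 2, together with the TLPRxtOut/TLPHighRxt bookkeeping described in Section 7.6.2 below, could look as follows in Python. This is a non-normative sketch: conn bundles the connection state, and unsent_segment(), rwnd_allows(), highest_sent_segment(), transmit(), and arm_rto_timer() are hypothetical helpers, not a real TCP stack API:

    def tlp_send_probe(conn, now):
        seg = conn.unsent_segment()      # lowest-sequence unsent segment
        if seg is not None and conn.rwnd_allows(seg):
            conn.transmit(seg, now)      # new data; normal accounting
            conn.flight_size += seg.length
        elif not conn.tlp_rxt_out:
            seg = conn.highest_sent_segment()
            conn.transmit(seg, now)      # retransmission; cwnd unchanged
            conn.tlp_rxt_out = True      # at most one TLP retransmission
            conn.tlp_high_rxt = conn.snd_nxt
        # Whether or not a probe went out, fall back to the RTO timer so
        # that back-to-back TLP probes are never sent.
        if conn.flight_size > 0:
            conn.arm_rto_timer(now)
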
7.6. TLP recovery detection

If the only loss in an outstanding window of data was the last segment, then a TLP loss probe retransmission of that data segment might repair the loss. TLP recovery detection examines ACKs to detect when the probe might have repaired a loss, and thus allows congestion control to properly reduce the congestion window (cwnd) [RFC5681].

Consider a TLP retransmission episode where a sender retransmits a tail packet in a flight. The TLP retransmission episode ends when the sender receives an ACK with a SEG.ACK above the SND.NXT at the time the episode started. During the TLP retransmission episode the sender checks for a duplicate ACK or D-SACK indicating that both the original segment and the TLP retransmission arrived at the receiver, meaning there was no loss that needed repairing. If the TLP sender does not receive such an indication before the end of the TLP retransmission episode, then it MUST estimate that either the original data segment or the TLP retransmission were lost, and congestion control MUST react appropriately to that loss as it would to any other loss.

Since a significant fraction of the hosts that support SACK do not support duplicate selective acknowledgments (D-SACKs) [RFC2883], the TLP algorithm for detecting such lost segments relies only on basic SACK support [RFC2018].

7.6.1. Initializing and resetting state

When a connection is created, or suffers a retransmission timeout, or enters fast recovery, it executes the following:

    TLPRxtOut = false

7.6.2. Recording loss probe states

Senders MUST only send a TLP loss probe retransmission if TLPRxtOut is false. This ensures that at any given time a connection has at most one outstanding TLP retransmission. This allows the sender to use the algorithm described in this section to estimate whether any data segments were lost.

Note that this condition only restricts TLP loss probes that are retransmissions. There may be an arbitrary number of outstanding unacknowledged TLP loss probes that consist of new, previously-unsent data, since the retransmission timeout and fast recovery algorithms are sufficient to detect losses of such probe segments.

Upon sending a TLP probe that is a retransmission, the sender sets TLPRxtOut to true and TLPHighRxt to SND.NXT.

7.6.3. Detecting recoveries accomplished by loss probes

Step 1: Track ACKs indicating receipt of original and retransmitted segments.

A sender considers both the original segment and the TLP probe retransmission segment as acknowledged if either condition 1 or condition 2 is true:

1. This is a duplicate acknowledgment (as defined in [RFC5681], section 2), and all of the following conditions are met:

   1. TLPRxtOut is true

   2. SEG.ACK == TLPHighRxt

   3. SEG.ACK == SND.UNA

   4. the segment contains no SACK blocks for sequence ranges above TLPHighRxt

   5. the segment contains no data

   6. the segment is not a window update

2. This is an ACK acknowledging a sequence number at or above TLPHighRxt and it contains a D-SACK; i.e., all of the following conditions are met:

   1. TLPRxtOut is true

   2. SEG.ACK >= TLPHighRxt

   3. the ACK contains a D-SACK block

If either of the conditions is met, then the sender estimates that the receiver received both the original data segment and the TLP probe retransmission, and so the sender considers the TLP episode to be done, and records that fact by setting TLPRxtOut to false.

Step 2: Mark the end of a TLP retransmission episode and detect losses.

If the sender receives a cumulative ACK for data beyond the TLP loss probe retransmission then, in the absence of reordering on the return path of ACKs, it should have received any ACKs for the original segment and the TLP probe retransmission segment. At that time, if the TLPRxtOut flag is still true and thus indicates that the TLP probe retransmission remains unacknowledged, then the sender should presume that at least one of its data segments was lost, so it SHOULD invoke a congestion control response equivalent to fast recovery.

More precisely, on each ACK the sender executes the following:

    if (TLPRxtOut and SEG.ACK >= TLPHighRxt) {
        TLPRxtOut = false
        EnterRecovery()
        ExitRecovery()
    }

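Putting the two steps together, TLP recovery detection on each incoming ACK can be sketched in Python as follows (non-normative; ack is a hypothetical parsed ACK with fields seg_ack, has_dsack, sack_blocks, data_len, and is_window_update, and conn is as in the previous sketch):

    def tlp_process_ack(conn, ack):
        if not conn.tlp_rxt_out:
            return
        # Step 1, condition 1: a pure duplicate ACK at TLPHighRxt with
        # no SACK blocks above it...
        is_dupack = (ack.seg_ack == conn.snd_una and
                     ack.data_len == 0 and not ack.is_window_update and
                     not any(end > conn.tlp_high_rxt
                             for _, end in ack.sack_blocks))
        if is_dupack and ack.seg_ack == conn.tlp_high_rxt:
            conn.tlp_rxt_out = False   # both copies arrived; episode done
            return
        # ...or condition 2: a D-SACK with SEG.ACK at or above TLPHighRxt.
        if ack.has_dsack and ack.seg_ack >= conn.tlp_high_rxt:
            conn.tlp_rxt_out = False
            return
        # Step 2: a cumulative ACK beyond the probe while TLPRxtOut is
        # still set means a segment was lost; react like fast recovery.
        if ack.seg_ack >= conn.tlp_high_rxt:
            conn.tlp_rxt_out = False
            conn.enter_recovery()
            conn.exit_recovery()
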
8. RACK and TLP discussions

8.1. Advantages

The biggest advantage of RACK is that every data packet, whether it is an original data transmission or a retransmission, can be used to detect losses of the packets sent chronologically prior to it.

Example: TAIL DROP. Consider a sender that transmits a window of three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the transmission of each packet is at least RACK.reo_wnd after the transmission of the previous packet. RACK will mark P1 as lost when the SACK of P2 is received, and this will trigger the retransmission of P1 as R1. When R1 is cumulatively acknowledged, RACK will mark P3 as lost and the sender will retransmit P3 as R3. This example illustrates how RACK is able to repair certain drops at the tail of a transaction without any timer. Notice that neither the conventional duplicate ACK threshold [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect such losses, because of the required packet or sequence count.

Example: LOST RETRANSMIT. Consider a window of three data packets (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the transmission of each packet is at least RACK.reo_wnd after the transmission of the previous packet. When P3 is SACKed, RACK will mark P1 and P2 lost and they will be retransmitted as R1 and R2. Suppose R1 is lost again but R2 is SACKed; RACK will mark R1 lost for retransmission again. Again, neither the conventional three duplicate ACK threshold approach, nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect such losses. And such a lost retransmission is very common when TCP is being rate-limited, particularly by token bucket policers with large bucket depth and low rate limit. Retransmissions are often lost repeatedly because standard congestion control requires multiple round trips to reduce the rate below the policed rate.

Example: SMALL DEGREE OF REORDERING. Consider a common reordering event: a window of packets are sent as (P1, P2, P3). P1 and P2 carry a full payload of MSS octets, but P3 has only a 1-octet payload. Suppose the sender has detected reordering previously and thus RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, before P1 and P2. As long as P1 and P2 are delivered within min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 are delivered outside the reordering window, then RACK will still falsely mark P1 and P2 lost. We discuss how to reduce such false positives at the end of this section.

The examples above show that RACK is particularly useful when the sender is limited by the application, which is common for interactive, request/response traffic. Similarly, RACK still works when the sender is limited by the receive window, which is common for applications that use the receive window to throttle the sender.

For some implementations (e.g., Linux), RACK works quite efficiently with TCP Segmentation Offload (TSO). RACK always marks the entire TSO blob lost because the packets in the same TSO blob have the same transmission timestamp. By contrast, the algorithms based on sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset of packets in the TSO blob lost, forcing the stack to perform expensive fragmentation of the TSO blob, or to selectively tag individual packets lost in the scoreboard.

8.2. Disadvantages

RACK requires the sender to record the transmission time of each packet sent at a clock granularity of one millisecond or finer. TCP implementations that record this already for RTT estimation do not require any new per-packet state. But implementations that are not yet recording packet transmission times will need to add per-packet internal state (commonly either 4 or 8 octets per packet or TSO blob) to track transmission times. In contrast, the conventional [RFC6675] loss detection approach does not require any per-packet state beyond the SACK scoreboard; this is particularly useful on ultra-low-RTT networks where the RTT is far less than the sender TCP clock granularity (e.g., inside data centers).

RACK can easily and optionally support the conventional approach in [RFC6675][RFC5681] by resetting the reordering window to zero when the threshold is met. Note that this approach differs slightly from [RFC6675], which considers a packet lost when at least DupThresh higher-sequence packets are SACKed. RACK's approach considers a packet lost when at least one higher-sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection sends 10 packets, and packets 3, 5, 7 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost.

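The difference is small enough to check with a toy script. The following non-normative Python snippet reproduces the example above (10 packets sent, packets 3, 5 and 7 SACKed), comparing the [RFC6675] rule with RACK's DupThresh-compatibility rule:

    def rfc6675_lost(sent, sacked, dupthresh=3):
        # Lost if at least DupThresh higher-sequence packets are SACKed.
        return [p for p in sent if p not in sacked and
                sum(1 for s in sacked if s > p) >= dupthresh]

    def rack_dupthresh_lost(sent, sacked, dupthresh=3):
        # Lost if one higher-sequence packet is SACKed and the total
        # number of SACKed packets is at least DupThresh.
        if len(sacked) < dupthresh:
            return []
        return [p for p in sent if p not in sacked and
                any(s > p for s in sacked)]

    sent, sacked = list(range(1, 11)), {3, 5, 7}
    print(rfc6675_lost(sent, sacked))         # [1, 2]
    print(rack_dupthresh_lost(sent, sacked))  # [1, 2, 4, 6]
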
8.3. Adjusting the reordering window

When the sender detects packet reordering, RACK uses a reordering window of min_rtt / 4. It uses the minimum RTT to accommodate reordering introduced by packets traversing slightly different paths (e.g., router-based parallelism schemes) or out-of-order deliveries in the lower link layer (e.g., wireless links using link-layer retransmission). RACK uses a quarter of the minimum RTT because Linux TCP used the same factor in its implementation to delay Early Retransmit [RFC5827] to reduce spurious loss detections in the presence of reordering, and experience shows that this seems to work reasonably well. We have evaluated using the smoothed RTT (SRTT from [RFC6298] RTT estimation) or the most recently measured RTT (RACK.rtt) using an experiment similar to that in the Performance Evaluation section. They do not make any significant difference in terms of total recovery latency.

8.4. Relationships with other loss recovery algorithms

The primary motivation of RACK is to ultimately provide a simple and general replacement for some of the standard loss recovery algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss detection mechanism on top of these algorithms, this is not necessary, because RACK implicitly subsumes most of them.

[RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK threshold based on the current or previous flight sizes. RACK takes a different approach, by using only one ACK event and a reordering window. RACK can be seen as an extended Early Retransmit [RFC5827] without a FlightSize limit but with an additional reordering window. [FACK] considers an original packet to be lost when its sequence range is sufficiently far below the highest SACKed sequence. In some sense RACK can be seen as a generalized form of FACK that operates in time space instead of sequence space, enabling it to better handle reordering, application-limited traffic, and lost retransmissions.

Nevertheless, RACK is still an experimental algorithm, since the oldest loss detection algorithm, the 3 duplicate ACK threshold [RFC5681], has been standardized and widely deployed. RACK can easily and optionally support the conventional approach for compatibility.

RACK is compatible with and does not interfere with the standard RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel algorithms [RFC3522]. This is because RACK only detects loss by using ACK events. It neither changes the RTO timer calculation nor detects spurious timeouts.

Furthermore, RACK naturally works well with Tail Loss Probe [TLP], because a tail loss probe solicits either an ACK or SACK, which can be used by RACK to detect more losses. RACK can be used to relax TLP's requirement for using FACK and retransmitting the highest-sequenced packet, because RACK is agnostic to packet sequence numbers and uses transmission time instead. Thus TLP could be modified to retransmit the first unacknowledged packet, which could improve application latency.

8.5. Interaction with congestion control

RACK intentionally decouples loss detection from congestion control. RACK only detects losses; it does not modify the congestion control algorithm [RFC5681][RFC6937]. However, RACK may detect losses earlier or later than the conventional duplicate ACK threshold approach does. A packet marked lost by RACK SHOULD NOT be retransmitted until congestion control deems this appropriate. Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used when using RACK.

RACK is applicable for both fast recovery and recovery after a retransmission timeout (RTO) in [RFC5681]. RACK applies equally to fast recovery and RTO recovery, because RACK is purely based on the transmission time order of packets. When a packet retransmitted by RTO is acknowledged, RACK will mark any unacked packet sent sufficiently prior to the RTO as lost, because at least one RTT has elapsed since these packets were sent.

The following simple example compares how RACK and non-RACK loss detection interact with congestion control: suppose a TCP sender has a congestion window (cwnd) of 20 packets on a SACK-enabled connection. It sends 10 data packets and all of them are lost.

Without RACK, the sender would time out, reset cwnd to 1, and retransmit the first packet. It would take four round trips (1 + 2 + 4 + 3 = 10) to retransmit all 10 lost packets using slow start. The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 packets due to congestion window validation.

With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. If the sender implements Proportional Rate Reduction [RFC6937], it would slow start to retransmit the remaining 9 lost packets, since the number of packets in flight (0) is lower than the slow start threshold (10). The slow start would again take four round trips (1 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with an ending cwnd set to the slow start threshold of 10 packets.

In both cases, the sender after the recovery would be in congestion avoidance. The difference in recovery latency (RTO + 4*RTT vs 6*RTT) can be significant if the RTT is much smaller than the minimum RTO (1 second in RFC6298) or if the RTT is large. The former case is common in local area networks, data-center networks, or content distribution networks with deep deployments. The latter case is more common in developing regions with highly congested and/or high-latency networks. The ending congestion window after recovery also impacts subsequent data transfer.

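The arithmetic in the comparison above can be made concrete with a short non-normative calculation (the 50ms RTT below is an arbitrary example value):

    def slow_start_rounds(losses):
        # Rounds to retransmit `losses` packets in slow start:
        # 1 + 2 + 4 + 3 = 10 packets takes four round trips.
        rounds, cwnd, sent = 0, 1, 0
        while sent < losses:
            sent += min(cwnd, losses - sent)
            cwnd *= 2
            rounds += 1
        return rounds

    rtt, min_rto = 0.05, 1.0    # 50ms RTT; RFC 6298 minimum RTO of 1s
    rounds = slow_start_rounds(10)                 # -> 4
    print(round(min_rto + rounds * rtt, 3))        # without RACK/TLP: 1.2s
    print(round(2 * rtt + rounds * rtt, 3))        # with RACK + TLP:  0.3s
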
The former case is common 1136 in local area networks, data-center networks, or content distribution 1137 networks with deep deployments. The latter case is more common in 1138 developing regions with highly congested and/or high-latency 1139 networks. The ending congestion window after recovery also impacts 1140 subsequent data transfer. 1142 8.6. TLP recovery detection with delayed ACKs 1144 Delayed ACKs complicate the detection of repairs done by TLP, since 1145 with a delayed ACK the sender receives one fewer ACK than would 1146 normally be expected. To mitigate this complication, before sending 1147 a TLP loss probe retransmission, the sender should attempt to wait 1148 long enough that the receiver has sent any delayed ACKs that it is 1149 withholding. The sender algorithm described above features such a 1150 delay, in the form of WCDelAckT. Furthermore, if the receiver 1151 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1152 in the case of a delayed ACK the sender's TLP recovery detection 1153 algorithm (see above) can use the D-SACK information to infer that 1154 the original and TLP retransmission both arrived at the receiver. 1156 If there is ACK loss or a delayed ACK without a D-SACK, then this 1157 algorithm is conservative, because the sender will reduce cwnd when 1158 in fact there was no packet loss. In practice this is acceptable, 1159 and potentially even desirable: if there is reverse path congestion 1160 then reducing cwnd can be prudent. 1162 8.7. RACK for other transport protocols 1164 RACK can be implemented in other transport protocols. The algorithm 1165 can be simplified by skipping step 3 if the protocol can support a 1166 unique transmission or packet identifier (e.g. TCP timestamp options 1167 [RFC7323]). For example, the QUIC protocol implements RACK [QUIC- 1168 LR]. The [Sprout] congestion control also independently designed to 1169 use a 10ms reordering window to improve its loss detection. 1171 9. Experiments and Performance Evaluations 1173 RACK and TLP have been deployed at Google, for both connections to 1174 users in the Internet and internally. We conducted a performance 1175 evaluation experiment for RACK and TLP on a small set of Google Web 1176 servers in Western Europe that serve mostly European and some African 1177 countries. The experiment lasted three days in March 2017. The 1178 servers were divided evenly into four groups of roughly 5.3 million 1179 flows each: 1181 Group 1 (control): RACK off, TLP off, RFC 6675 on 1183 Group 2: RACK on, TLP off, RFC 6675 on 1185 Group 3: RACK on, TLP on, RFC 6675 on 1187 Group 4: RACK on, TLP on, RFC 6675 off 1189 All groups used Linux with CUBIC congestion control, an initial 1190 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1191 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1192 disabled FACK because it is not an IETF RFC. FACK was excluded 1193 because the goal of this setup is to compare RACK and TLP to RFC- 1194 based loss recoveries. Since TLP depends on either FACK or RACK, we 1195 could not run another group that enables TLP only (with both RACK and 1196 FACK disabled). Group 4 is to test whether RACK plus TLP can 1197 completely replace the DupThresh-based [RFC6675]. 1199 The servers sit behind a load balancer that distributes the 1200 connections evenly across the four groups. 1202 Each group handles a similar number of connections and sends and 1203 receives similar amounts of data. 
9.  Experiments and Performance Evaluations

RACK and TLP have been deployed at Google, both for connections to
users on the Internet and for internal traffic.  We conducted a
performance evaluation experiment for RACK and TLP on a small set
of Google Web servers in Western Europe that serve mostly European
and some African countries.  The experiment lasted three days in
March 2017.  The servers were divided evenly into four groups of
roughly 5.3 million flows each:

Group 1 (control): RACK off, TLP off, RFC 6675 on

Group 2: RACK on, TLP off, RFC 6675 on

Group 3: RACK on, TLP on, RFC 6675 on

Group 4: RACK on, TLP on, RFC 6675 off

All groups used Linux with CUBIC congestion control, an initial
congestion window of 10 packets, and the fq/pacing qdisc.  In terms
of specific recovery features, all groups enabled RFC5682 (F-RTO)
but disabled FACK, because FACK is not an IETF RFC and the goal of
this setup is to compare RACK and TLP against RFC-based loss
recovery.  Since TLP depends on either FACK or RACK, we could not
run another group that enables TLP only (with both RACK and FACK
disabled).  Group 4 tests whether RACK plus TLP can completely
replace the DupThresh-based loss detection of [RFC6675].

The servers sit behind a load balancer that distributes the
connections evenly across the four groups.  Each group handles a
similar number of connections and sends and receives similar
amounts of data.

We compare the total time spent in loss recovery across groups.
The recovery time is measured from when recovery and retransmission
start until the remote host has acknowledged the highest sequence
(SND.NXT) outstanding when the recovery started.  Therefore the
measurement includes both fast recoveries and timeout recoveries.

Our data shows that Group 2's recovery latency is only 0.3% lower
than Group 1's, but Group 3's recovery latency is 25% lower than
Group 1's, due to a 40% reduction in RTO-triggered recoveries.
Therefore it is important to implement both TLP and RACK for
performance.  Group 4's total recovery latency is 0.02% lower than
Group 3's, indicating that RACK plus TLP can successfully replace
RFC6675 as a standalone recovery mechanism.

We want to emphasize that the current experiment is limited in
terms of network coverage.  The connectivity in Western Europe is
fairly good; therefore, loss recovery is not a major performance
bottleneck there.  We plan to expand our experiments to regions
with worse connectivity, in particular to networks with strong
traffic policing.

10.  Security Considerations

RACK does not change the risk profile for TCP.

An interesting scenario is ACK-splitting attacks [SCWA99]: for each
MSS-sized packet sent, the receiver or the attacker might send MSS
separate ACKs, each SACKing or acknowledging one additional byte.
This would not fool RACK.  RACK.xmit_ts would not advance, because
all the bytes of the packet were transmitted at the same time (they
carry the same transmission timestamp).  In other words, SACKing
only one byte of a packet or SACKing the packet in its entirety has
the same effect on RACK.
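A minimal sketch of this property (illustrative names, not the
draft's pseudocode): because the sender keeps one transmission
timestamp per packet, the RACK state update is independent of how
many bytes of the packet an ACK covers:

   #include <assert.h>
   #include <stdint.h>

   struct tx_record {
       uint32_t start_seq, end_seq;  /* byte range of the packet */
       uint64_t xmit_ts;             /* one timestamp, all bytes */
   };

   static uint64_t rack_xmit_ts;     /* latest delivered send time */

   /* Called for each newly (S)ACKed byte range of a packet. */
   static void rack_on_sacked_range(const struct tx_record *pkt,
                                    uint32_t sacked_start,
                                    uint32_t sacked_end)
   {
       (void)sacked_start;
       (void)sacked_end;
       /* The update depends only on the packet's send time, never
        * on how many of its bytes this ACK covers. */
       if (pkt->xmit_ts > rack_xmit_ts)
           rack_xmit_ts = pkt->xmit_ts;
   }

   int main(void)
   {
       struct tx_record pkt = { 1000, 2460, 42 };  /* 1460B, ts 42 */

       /* MSS split ACKs, one byte each, leave the same state ... */
       for (uint32_t b = pkt.start_seq; b < pkt.end_seq; b++)
           rack_on_sacked_range(&pkt, b, b + 1);
       uint64_t split = rack_xmit_ts;

       /* ... as a single ACK covering the entire packet. */
       rack_on_sacked_range(&pkt, pkt.start_seq, pkt.end_seq);
       assert(split == rack_xmit_ts);
       return 0;
   }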
11.  IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as
an RFC.

12.  Acknowledgments

The authors thank Matt Mathis for his insights in FACK and Michael
Welzl for his per-packet timer idea that inspired this work.  Eric
Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, and Martin
Duke contributed to the draft or the implementations in Linux,
FreeBSD, Windows, and QUIC.

13.  References

13.1.  Normative References

[RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
           Selective Acknowledgment Options", RFC 2018, October
           1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", RFC 2119, March 1997.

[RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
           Extension to the Selective Acknowledgement (SACK) Option
           for TCP", RFC 2883, July 2000.

[RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
           Control", RFC 5681, September 2009.

[RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
           "Forward RTO-Recovery (F-RTO): An Algorithm for
           Detecting Spurious Retransmission Timeouts with TCP",
           RFC 5682, September 2009.

[RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
           Hurtig, "Early Retransmit for TCP and Stream Control
           Transmission Protocol (SCTP)", RFC 5827, April 2010.

[RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
           "Computing TCP's Retransmission Timer", RFC 6298, June
           2011.

[RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
           M., and Y. Nishida, "A Conservative Loss Recovery
           Algorithm Based on Selective Acknowledgment (SACK) for
           TCP", RFC 6675, August 2012.

[RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
           Rate Reduction for TCP", RFC 6937, May 2013.

[RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
           Scheffenegger, "TCP Extensions for High Performance",
           RFC 7323, September 2014.

[RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
           September 1981.

13.2.  Informative References

[FACK]     Mathis, M. and J. Mahdavi, "Forward Acknowledgement:
           Refining TCP Congestion Control", ACM SIGCOMM Computer
           Communication Review, Volume 26, Issue 4, October 1996.

[POLICER16]
           Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
           Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
           "An Analysis of Traffic Policing in the Web", ACM
           SIGCOMM, 2016.

[QUIC-LR]  Iyengar, J. and I. Swett, "QUIC Loss Recovery And
           Congestion Control", draft-tsvwg-quic-loss-recovery-01
           (work in progress), June 2016.

[RFC3522]  Ludwig, R. and M. Meggers, "The Eifel Detection
           Algorithm for TCP", RFC 3522, April 2003.

[RFC7765]  Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl,
           "TCP and SCTP RTO Restart", RFC 7765, February 2016.

[SCWA99]   Savage, S., Cardwell, N., Wetherall, D., and T.
           Anderson, "TCP Congestion Control With a Misbehaving
           Receiver", ACM Computer Communication Review, 29(5),
           1999.

[Sprout]   Winstein, K., Sivaraman, A., and H. Balakrishnan,
           "Stochastic Forecasts Achieve High Throughput and Low
           Delay over Cellular Networks", USENIX Symposium on
           Networked Systems Design and Implementation (NSDI),
           2013.

[THIN-STREAM]
           Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
           "TCP enhancements for interactive thin-stream
           applications", NOSSDAV, 2008.

[TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
           "Tail Loss Probe (TLP): An Algorithm for Fast Recovery
           of Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01
           (work in progress), August 2013.

Authors' Addresses

Yuchung Cheng
Google, Inc

Email: ycheng@google.com

Neal Cardwell
Google, Inc

Email: ncardwell@google.com

Nandita Dukkipati
Google, Inc

Email: nanditad@google.com

Priyaranjan Jha
Google, Inc

Email: priyarjha@google.com