TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: October 28, 2019                                         P. Jha
                                                             Google, Inc
                                                          April 26, 2019

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-05

Abstract

   This document presents a new TCP loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time, instead of
   packet or sequence counts, to detect losses, for modern TCP
   implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as well
   as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 28, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if some packet sent
   sufficiently later has been delivered.  It does this by recording
   packet transmission times and inferring losses using cumulative
   acknowledgments or selective acknowledgment (SACK) TCP options.

   In the last couple of years we have been observing several
   increasingly common loss and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK), link
       bonding, or routers' internal load-balancing can deliver TCP
       packets out of order.  The degree of such reordering is usually
       within the order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard and
   proposed loss detection algorithms
   [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM],
   we've found that together they do not perform well.  The main reason
   is that many of them are based on the classic rule of counting
   duplicate acknowledgments [RFC5681].
   They can either detect loss quickly or accurately, but not both,
   especially when the sender is application-limited or subject to
   unpredictable reordering.  And under these conditions none of them
   can detect lost retransmissions well.

   Also, these algorithms, including RFCs, rarely address the
   interactions with other algorithms.  For example, FACK may consider
   a packet lost while RFC6675 may not.  Implementing N algorithms
   while dealing with N^2 interactions is a daunting and error-prone
   task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one more effective
   algorithm to handle loss and reordering.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered out
   of order, then the packets sent chronologically before it were
   either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC6675][FACK].  But the key innovation in
   RACK is to use a per-packet transmission timestamp and widely
   deployed SACK options to conduct time-based inferences, instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) alone is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems with
   packet-counting approaches [RFC5681][RFC6675][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge if some packets sent prior to
   that time have "expired" by passing a certain reordering settling
   window.  On each ACK, RACK marks any already-expired packets lost,
   and for any packets that have not yet expired it waits until the
   reordering window passes and then marks those lost as well.  In
   either case, RACK can repair the loss without waiting for a (long)
   RTO.  RACK can be applied to both fast recovery and timeout
   recovery, and can detect losses on both originally transmitted and
   retransmitted packets, making it a great all-weather loss detection
   mechanism.

3.  Design Rationale for Reordering Tolerance

   The reordering behavior of networks can evolve (over years) in
   response to the behavior of transport protocols and applications, as
   well as the needs of network designers and operators.  From a
   network or link designer's viewpoint, parallelization (e.g., link
   bonding) is the easiest way to get a network to go faster.
   Therefore their main constraint on speed is reordering, and there is
   pressure to relax that constraint.  If RACK becomes widely deployed,
   the underlying networks may introduce more reordering for higher
   throughput.  But this may result in excessive reordering that hurts
   end-to-end performance:

   1.  End host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.
   2.  Congestion control: TCP congestion control implicitly assumes
       that the feedback from ACKs is from the same bottleneck.
       Therefore it cannot handle well scenarios where packets are
       traversing largely disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths
       would increase the latency of loss recovery.

   An end-to-end transport protocol cannot tell immediately whether a
   hole is reordering or loss.  It can only distinguish between the
   two in hindsight, if the hole in the sequence space gets filled
   later without a retransmission.  How long the sender waits for such
   potential reordering events to settle is determined by the current
   reordering window.

   Given these considerations, a core design philosophy of RACK is to
   adapt to the measured duration of reordering events, within
   reasonable and specific bounds.  To accomplish this RACK places the
   following mandates on the reordering window:

   1.  The initial RACK reordering window SHOULD be set to a small
       fraction of the round-trip time.

   2.  If no reordering has been observed, then RACK SHOULD honor the
       classic 3-DUPACK rule for initiating fast recovery.  One simple
       way to implement this is to temporarily override the reorder
       window to 0.

   3.  The RACK reordering window SHOULD leverage Duplicate Selective
       Acknowledgement (DSACK) information [RFC3708] to adaptively
       estimate the duration of reordering events.

   4.  The RACK reordering window MUST be bounded, and this bound
       SHOULD be one round trip.

   As a flow starts, either condition 1 or condition 2, or both, would
   trigger RACK to start the recovery process quickly.  The low
   initial reordering window and use of the 3-DUPACK rule are key to
   achieving low-latency loss recovery for short flows, by risking
   spurious retransmissions to recover losses quickly.
   The rationale is that spurious retransmissions for short flows are
   not expected to produce excessive network traffic.

   For long flows the design tolerates reordering within a round trip.
   This handles reordering caused by path divergence on small time
   scales (reordering within the round-trip time of the shortest
   path), which should tolerate much of the reordering from link
   bonding, multipath routing, or link-layer out-of-order delivery.
   It also relaxes ordering constraints to allow sending flights of
   TCP packets on different paths dynamically for better
   load-balancing (e.g., flowlets).

   However, the fact that the initial RACK reordering window is low,
   and the RACK reordering window's adaptive growth is bounded, means
   that there will continue to be a cost to reordering and a limit to
   RACK's adaptation to reordering.  This maintains a disincentive for
   network designers and operators to introduce needless or excessive
   reordering, particularly because they have to allow for low
   round-trip time paths.  This means RACK will not encourage networks
   to perform inconsiderate fine-grained packet-spraying over highly
   disjoint paths with very different characteristics.  There are good
   alternative solutions, such as MPTCP, for such networks.

   To conclude, the RACK algorithm aims to adapt to small degrees of
   reordering, quickly recover most losses within one to two round
   trips, and avoid costly retransmission timeouts (RTOs).  In the
   presence of reordering, the adaptation algorithm can impose
   sometimes-needless delays when it waits to disambiguate loss from
   reordering, but the penalty for waiting is bounded to one round
   trip and such delays are confined to longer-running flows.

   This document provides a concrete and detailed reordering window
   adaptation algorithm for implementors.  We note that the specifics
   of the algorithm are likely to evolve over time.
   But that is a separate engineering optimization that is out of
   scope for this document.

4.  Requirements

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not expected but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g.,
       intra-datacenter communications) microsecond granularity would
       significantly help the detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explaining the algorithm, we use a
   pseudo-scoreboard that manages the data in sequence number ranges.
   But the specifics of the data structure are left to the
   implementor.

   RACK does not require any change on the receiver.

5.  Definitions

5.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].  In
   this document, these words will appear with that interpretation
   only when in UPPER CASE.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

   The reader is expected to be familiar with the definitions given in
   [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

5.2.  Definitions of variables

   A sender implementing RACK needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet": among all the packets that have been either
   selectively or cumulatively acknowledged, RACK.packet is the one
   that was sent most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.rtt" is the RTT of the most recently transmitted packet that
   has been delivered (either cumulatively acknowledged or selectively
   acknowledged) on the connection.

   "RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

   "RACK.reo_wnd" is a reordering window computed in the unit of time
   used for recording packet transmission times.  It is used to defer
   the moment at which RACK marks a packet lost.

   "RACK.dupthresh" is a constant specifying the number of duplicate
   acknowledgments, or selectively acknowledged segments, that can
   (under certain conditions) trigger fast recovery, similar to
   [RFC6675].  As in [RFC5681] and [RFC6675], this threshold is
   defined to be 3.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.
   "RACK.reo_wnd_incr" is the multiplier applied to adjust
   RACK.reo_wnd.

   "RACK.reo_wnd_persist" is the number of loss recoveries before
   resetting RACK.reo_wnd.

   "RACK.dsack" indicates whether a DSACK option has been received
   since the last RACK.reo_wnd change.

   "RACK.pkts_sacked" is the total number of packets selectively
   acknowledged in the SACK scoreboard.

   "RACK.reord" indicates whether the connection has detected packet
   reordering event(s).

   "RACK.fack" is the highest selectively or cumulatively acknowledged
   sequence.

   Note that the Packet.xmit_ts variable is per packet in flight.  The
   RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

6.  Algorithm Details

6.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

6.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.  The sender can
   track a simple global minimum of all RTT measurements from the
   connection, or a windowed min-filtered value of recent RTT
   measurements.  This document does not specify an exact approach.

   Step 2: Update RACK stats.

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record the
   most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.
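   For concreteness, the per-packet and per-connection RACK state
   defined above can be sketched as follows.  This is a hypothetical
   Python sketch, not a normative layout: the draft leaves the actual
   data structures to the implementor, and the field names simply
   mirror the draft's variable names.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Per-packet state kept for each packet in flight."""
    end_seq: int            # ending TCP sequence number of this packet
    xmit_ts: float = 0.0    # time of the last (re)transmission
    retransmitted: bool = False
    lost: bool = False


@dataclass
class RackState:
    """Per-connection RACK state kept in the TCP control block."""
    xmit_ts: float = 0.0          # transmission time of RACK.packet
    end_seq: int = 0              # ending sequence of RACK.packet
    rtt: float = 0.0              # RTT of most recent delivered packet
    rtt_seq: int = 0              # SND.NXT when rtt was updated
    reo_wnd: float = 0.0          # reordering window, in time units
    min_rtt: float = float("inf")  # estimated minimum RTT
    dupthresh: int = 3            # classic DUPACK threshold [RFC5681]
    reo_wnd_incr: int = 1         # multiplier applied to reo_wnd
    reo_wnd_persist: int = 16     # recoveries before resetting reo_wnd
    dsack: bool = False           # DSACK seen since last reo_wnd change
    pkts_sacked: int = 0          # packets SACKed in the scoreboard
    reord: bool = False           # reordering detected
    fack: int = 0                 # highest S/ACKed sequence
```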
   Sometimes RACK.packet and the newly delivered packet could carry
   the same transmit timestamp due to clock granularity or
   segmentation offloading (i.e., the two packets were sent as a jumbo
   frame into the NIC).  In that case the sequence numbers of
   RACK.end_seq and Packet.end_seq are compared to break the tie.

   Since an ACK can also acknowledge retransmitted data packets,
   RACK.rtt can be vastly underestimated if the retransmission was
   spurious.  To avoid that, ignore a packet if any of its TCP
   sequences have been retransmitted before and either of two
   conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_rtt ago.

   If the ACK is not ignored as invalid, update RACK.rtt to be the RTT
   sample calculated using this ACK, and continue.  If this ACK or
   SACK was for the most recently sent packet, then record the
   RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this
   ACK.  Otherwise exit here and omit the following steps.

   Note that the second condition above is a heuristic.  This
   heuristic would fail to update RACK stats if the packet is
   spuriously retransmitted because of a recent minimum RTT decrease
   (e.g., a path change).  Consequently RACK may not detect losses
   from ACK events, and the recovery resorts to the (slower) TLP or
   RTO timer-based events.  However, such events should be rare, and
   the connection would pick up the new minimum RTT when the recovery
   ends, to avoid repeated similar failures.
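   The two "ignore" conditions above can be sketched as a single
   predicate.  This is a hypothetical illustration (function and
   parameter names are ours); it assumes TSecr values are comparable
   with the sender's transmission timestamps, as in the pseudocode
   later in this step.

```python
def rtt_sample_ok(retransmitted, last_xmit_ts, tsecr, min_rtt, now):
    """Return True if an ACK for this packet may update RACK.rtt.

    A packet that was never retransmitted always yields a valid
    sample.  Otherwise the sample is rejected if the ACK's TSecr
    shows it does not acknowledge the last retransmission, or if the
    apparent RTT is implausibly below the connection's minimum RTT.
    """
    if not retransmitted:
        return True
    # Condition 1: TSecr (if available) predates the last
    # retransmission, so the ACK is for an earlier transmission.
    if tsecr is not None and tsecr < last_xmit_ts:
        return False
    # Condition 2: last retransmission was less than min_rtt ago, so
    # the ACK most likely acknowledges the original transmission.
    if now - last_xmit_ts < min_rtt:
        return False
    return True
```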
   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet.retransmitted is TRUE:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Return
               If rtt < RACK.min_rtt:
                   Return

           RACK.rtt = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts

   Step 3: Detect packet reordering.

   To detect reordering, the sender looks for original data packets
   being delivered out of order in sequence space.  The sender tracks
   the highest sequence selectively or cumulatively acknowledged in
   the RACK.fack variable.  The name "fack" stands for the most
   "forward ACK" and originated from the [FACK] draft.  If the ACK
   selectively or cumulatively acknowledges an unacknowledged and also
   never retransmitted sequence below RACK.fack, then the
   corresponding packet has been reordered and RACK.reord is set to
   TRUE.

   The heuristic above only detects reordering if the reordered packet
   has not yet been retransmitted.  This is a major drawback because
   if RACK has a low reordering window and the network is reordering
   packets, RACK may falsely retransmit frequently.  Consequently RACK
   may fail to detect reordering, and thus fail to increase the
   reordering window, because the reordered packets were already
   (falsely) retransmitted.

   DSACK [RFC3708] can help mitigate this issue.  The false
   retransmission would solicit a DSACK option in the ACK.  Therefore
   if the ACK has a DSACK option covering sequences that were both
   acknowledged and retransmitted, this implies the original packet
   was reordered but RACK retransmitted the packet too quickly; the
   sender should then set RACK.reord to TRUE.
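   The reordering check just described can also be sketched in
   runnable form.  This is a hypothetical sketch (names are ours), in
   which newly acknowledged and DSACKed packets are reduced to
   (end_seq, retransmitted) pairs:

```python
def detect_reordering(acked_packets, dsack_packets, fack, reord):
    """Update (fack, reord) from newly S/ACKed and DSACKed packets.

    acked_packets and dsack_packets are (end_seq, retransmitted)
    pairs.  A never-retransmitted packet S/ACKed below the current
    forward ACK point implies out-of-order delivery.
    """
    for end_seq, retransmitted in acked_packets:
        if end_seq > fack:
            fack = end_seq              # advance the forward ACK point
        elif end_seq < fack and not retransmitted:
            reord = True                # original packet delivered late
    # A DSACK covering a retransmitted packet means the retransmission
    # was spurious: the original packet was merely reordered.
    for _end_seq, retransmitted in dsack_packets:
        if retransmitted:
            reord = True
    return fack, reord
```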
   RACK_detect_reordering():
       For each Packet newly acknowledged cumulatively or selectively:
           If Packet.end_seq > RACK.fack:
               RACK.fack = Packet.end_seq
           Else if Packet.end_seq < RACK.fack AND
                   Packet.retransmitted is FALSE:
               RACK.reord = TRUE

       For each Packet covered by the DSACK option:
           If Packet.retransmitted is TRUE:
               RACK.reord = TRUE

   Step 4: Update RACK reordering window.

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  This section documents a detailed algorithm following the
   design rationale section.  RACK starts initially with a
   conservative window of min_RTT/4.  If no reordering has been
   observed, RACK uses a RACK.reo_wnd of 0 during loss recovery, or
   when the number of DUPACKs exceeds the classic DUPACK threshold, in
   order to retransmit quickly.  The subtle difference between this
   approach and the conventional one [RFC5681][RFC6675] is discussed
   later in the section "RACK and TLP discussions".

   Further, RACK MAY use DSACK [RFC3708] to adapt the reordering
   window to higher degrees of reordering, if DSACK is supported.
   Receiving an ACK with a DSACK indicates a spurious retransmission,
   which in turn suggests that the RACK reordering window,
   RACK.reo_wnd, is likely too small.  The sender MAY increase the
   RACK.reo_wnd window linearly for every round trip in which the
   sender receives a DSACK, so that after N distinct round trips in
   which a DSACK is received, the RACK.reo_wnd becomes
   (N+1) * min_RTT / 4, with an upper bound of SRTT.  The inflated
   RACK.reo_wnd persists for 16 loss recoveries, after which it resets
   to its starting value, min_RTT / 4.

   The following pseudocode implements the above algorithm.  Note that
   extensions that require additional TCP features (e.g., DSACK) would
   work if the feature functions simply return false.
   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If DSACK option is present:
           RACK.dsack = true

       If SND.UNA < RACK.rtt_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.rtt_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If RACK.reord is FALSE:
           If in loss recovery: /* If in fast or timeout recovery */
               RACK.reo_wnd = 0
               Return
           Else if RACK.pkts_sacked >= RACK.dupthresh:
               RACK.reo_wnd = 0
               Return
       RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
       RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

   Step 5: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.
   As a starting point, we consider that the reordering window has
   passed if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.  the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

      RACK.xmit_ts >= Packet.xmit_ts
      AND
      (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

      now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

      Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful to detect losses in a timely manner with TCP
   Segmentation Offload, where multiple packets in one TSO blob have
   identical timestamps.  It is also useful when the timestamp clock
   granularity is close to or longer than the actual round-trip time.
   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, not acknowledged yet:
           If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
               Continue /* Packet lost but not yet retransmitted */

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.rtt +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Packet.lost = TRUE
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through packets in the SACK
   scoreboard above could be very costly on large-BDP networks, since
   the inflight could be very large.  If the implementation can
   organize the scoreboard data structures to have packets sorted by
   the last (re)transmission time, then the loop can start on the
   least recently sent packet and abort on the first packet sent after
   RACK.xmit_ts.  This can be implemented by using a separate list
   sorted in time order.  The implementation inserts the packet at the
   tail of the list when it is (re)transmitted, and removes a packet
   from the list when it is delivered or marked lost.  We RECOMMEND
   such an optimization because it enables implementations to support
   high-BDP networks.  This optimization is implemented in Linux and
   sees orders of magnitude improvement in CPU usage on high-speed WAN
   networks.

6.3.  Tail Loss Probe: fast recovery for tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that can
   otherwise be recovered by RTOs only.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.
   In either case the goal is to elicit more feedback from the
   receiver, in the form of an ACK (potentially with SACK blocks), to
   allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight of data is lost.

   2.  Tail losses of data segments at the end of an application
       transaction.

   3.  Tail losses of ACKs at the end of an application transaction.

   4.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

6.4.  Tail Loss Probe: An Example

   Following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.
659 3. When PTO fires, sender retransmits segment 10. 661 4. After an RTT, a SACK for packet 10 arrives. The ACK also carries 662 SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based 663 loss recovery. 665 5. The connection enters fast recovery and retransmits the remaining 666 lost segments. 668 6.5. Tail Loss Probe Algorithm Details 670 We define the terminology used in specifying the TLP algorithm: 672 FlightSize: amount of outstanding data in the network, as defined in 673 [RFC5681]. 675 RTO: The transport's retransmission timeout (RTO) is based on 676 measured round-trip times (RTT) between the sender and receiver, as 677 specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer 678 event indicating that an ACK is overdue. Its value is constrained to 679 be smaller than or equal to an RTO. 681 SRTT: smoothed round-trip time, computed as specified in [RFC6298]. 683 The TLP algorithm has three phases, which we discuss in turn. 685 6.5.1. Phase 1: Scheduling a loss probe 687 Step 1: Check conditions for scheduling a PTO. 689 A sender should check to see if it should schedule a PTO in the 690 following situations: 692 1. After transmitting new data that was not itself a TLP probe 694 2. Upon receiving an ACK that cumulatively acknowledges data 696 3. Upon receiving a SACK that selectively acknowledges data that was 697 last sent before the segment with SEG.SEQ=SND.UNA was last 698 (re)transmitted 700 A sender should schedule a PTO only if all of the following 701 conditions are met: 703 1. The connection supports SACK [RFC2018] 705 2. The connection has no SACKed sequences in the SACK scoreboard 707 3. The connection is not in loss recovery 708 If a PTO can be scheduled according to these conditions, the sender 709 should schedule a PTO. If there was a previously scheduled PTO or 710 RTO pending, then that pending PTO or RTO should first be cancelled, 711 and then the new PTO should be scheduled. 
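As a non-normative illustration, the three scheduling conditions above can be expressed in a short Python sketch; the `Conn` structure and its field names are hypothetical, not part of this specification:

```python
from dataclasses import dataclass

@dataclass
class Conn:
    sack_supported: bool    # SACK [RFC2018] negotiated on this connection
    sacked_out: int         # number of SACKed sequences in the SACK scoreboard
    in_loss_recovery: bool  # whether loss recovery is in progress

def can_schedule_pto(conn: Conn) -> bool:
    """A PTO may be scheduled only if all three conditions hold."""
    return (conn.sack_supported
            and conn.sacked_out == 0
            and not conn.in_loss_recovery)
```

A sender that passes this check cancels any pending PTO or RTO before arming the new PTO, as described above.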
713 If a PTO cannot be scheduled according to these conditions, then the 714 sender MUST arm the RTO timer if there is unacknowledged data in 715 flight. 717 Step 2: Select the duration of the PTO. 719 A sender SHOULD use the following logic to select the duration of a 720 PTO: 722 TLP_timeout(): 723 If SRTT is available: 724 PTO = 2 * SRTT 725 If FlightSize = 1: 726 PTO += WCDelAckT 727 Else: 728 PTO += 2ms 729 Else: 730 PTO = 1 sec 732 If Now() + PTO > TCP_RTO_expire(): 733 PTO = TCP_RTO_expire() - Now() 735 Aiming for a PTO value of 2*SRTT allows a sender to wait long enough 736 to know that an ACK is overdue. Under normal circumstances, i.e. no 737 losses, an ACK typically arrives in one SRTT. But choosing PTO to be 738 exactly an SRTT is likely to generate spurious probes given that 739 network delay variance and even end-system timings can easily push an 740 ACK to be above an SRTT. We chose PTO to be the next integral 741 multiple of SRTT. 743 Similarly, network delay variations and end-system processing 744 latencies and timer granularities can easily delay ACKs beyond 745 2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value 746 (and MAY add more if the sending host OS timer granularity is more 747 coarse than 1ms). 749 WCDelAckT stands for worst case delayed ACK timer. When FlightSize 750 is 1, PTO is inflated by WCDelAckT time to compensate for a potential 751 long delayed ACK timer at the receiver. The RECOMMENDED value for 752 WCDelAckT is 200ms. 754 Finally, if the time at which an RTO would fire (here denoted 755 "TCP_RTO_expire") is sooner than the computed time for the PTO, then 756 a probe is scheduled to be sent at that earlier time. 758 6.5.2. 
Phase 2: Sending a loss probe 760 When the PTO fires, transmit a probe data segment: 762 TLP_send_probe(): 763 If an unsent segment exists AND 764 the receive window allows new data to be sent: 765 Transmit the lowest-sequence unsent segment of up to SMSS 766 Increment FlightSize by the size of the newly-sent segment 767 Else: 768 Retransmit the highest-sequence segment sent so far 769 The cwnd remains unchanged 771 When the loss probe is a retransmission, the sender uses the highest- 772 sequence segment sent so far. This is in order to deal with the 773 retransmission ambiguity problem in TCP. Suppose a sender sends N 774 segments, and then retransmits the last segment (segment N) as a loss 775 probe, and then the sender receives a SACK for segment N. As long as 776 the sender waits for any required RACK reordering settling timer to 777 then expire, it does not matter if that SACK was for the original 778 transmission of segment N or the TLP retransmission; in either case 779 the arrival of the SACK for segment N provides evidence that the N-1 780 segments preceding segment N were likely lost. In the case where 781 there is only one original outstanding segment of data (N=1), the 782 same logic (trivially) applies: an ACK for a single outstanding 783 segment tells the sender the N-1=0 segments preceding that segment 784 were lost. Furthermore, whether there are N>1 or N=1 outstanding 785 segments, there is a question about whether the original last segment 786 or its TLP retransmission was lost; the sender estimates this using 787 TLP recovery detection (see below). 789 Note that after transmitting a TLP, the sender MUST arm an RTO timer, 790 and not the PTO timer. This ensures that the sender does not send 791 repeated, back-to-back TLP probes. This is important to avoid TLP 792 loops if an application writes periodically at an interval less than 793 PTO. 795 6.5.3.
Phase 3: ACK processing 797 On each incoming ACK, the sender should check the conditions in Step 798 1 of Phase 1 to see if it should schedule (or reschedule) the loss 799 probe timer. 801 6.6. TLP recovery detection 803 If the only loss in an outstanding window of data was the last 804 segment, then a TLP loss probe retransmission of that data segment 805 might repair the loss. TLP recovery detection examines ACKs to 806 detect when the probe might have repaired a loss, and thus allows 807 congestion control to properly reduce the congestion window (cwnd) 808 [RFC5681]. 810 Consider a TLP retransmission episode where a sender retransmits a 811 tail packet in a flight. The TLP retransmission episode ends when 812 the sender receives an ACK with a SEG.ACK above the SND.NXT at the 813 time the episode started. During the TLP retransmission episode the 814 sender checks for a duplicate ACK or D-SACK indicating that both the 815 original segment and TLP retransmission arrived at the receiver, 816 meaning there was no loss that needed repairing. If the TLP sender 817 does not receive such an indication before the end of the TLP 818 retransmission episode, then it MUST estimate that either the 819 original data segment or the TLP retransmission was lost, and 820 congestion control MUST react appropriately to that loss as it would 821 any other loss. 823 Since a significant fraction of the hosts that support SACK do not 824 support duplicate selective acknowledgments (D-SACKs) [RFC2883], the 825 TLP algorithm for detecting such lost segments relies only on basic 826 SACK support [RFC2018]. 828 Definitions of variables 830 TLPRxtOut: a boolean indicating whether there is an unacknowledged 831 TLP retransmission. 833 TLPHighRxt: the value of SND.NXT at the time of sending a TLP 834 retransmission. 836 6.6.1.
Initializing and resetting state 838 When a connection is created, or suffers a retransmission timeout, or 839 enters fast recovery, it executes the following: 841 TLPRxtOut = false 843 6.6.2. Recording loss probe states 845 Senders MUST only send a TLP loss probe retransmission if TLPRxtOut 846 is false. This ensures that at any given time a connection has at 847 most one outstanding TLP retransmission. This allows the sender to 848 use the algorithm described in this section to estimate whether any 849 data segments were lost. 851 Note that this condition only restricts TLP loss probes that are 852 retransmissions. There may be an arbitrary number of outstanding 853 unacknowledged TLP loss probes that consist of new, previously-unsent 854 data, since the retransmission timeout and fast recovery algorithms 855 are sufficient to detect losses of such probe segments. 857 Upon sending a TLP probe that is a retransmission, the sender sets 858 TLPRxtOut to true and TLPHighRxt to SND.NXT. 860 Detecting recoveries accomplished by loss probes 862 Step 1: Track ACKs indicating receipt of original and retransmitted 863 segments 865 A sender considers both the original segment and TLP probe 866 retransmission segment as acknowledged if either 1 or 2 are true: 868 1. This is a duplicate acknowledgment (as defined in [RFC5681], 869 section 2), and all of the following conditions are met: 871 1. TLPRxtOut is true 873 2. SEG.ACK == TLPHighRxt 875 3. SEG.ACK == SND.UNA 877 4. the segment contains no SACK blocks for sequence ranges above 878 TLPHighRxt 880 5. the segment contains no data 882 6. the segment is not a window update 884 2. This is an ACK acknowledging a sequence number at or above 885 TLPHighRxt and it contains a D-SACK; i.e. all of the following 886 conditions are met: 888 1. TLPRxtOut is true 890 2. SEG.ACK >= TLPHighRxt 892 3. 
the ACK contains a D-SACK block 894 If either of the conditions is met, then the sender estimates that 895 the receiver received both the original data segment and the TLP 896 probe retransmission, and so the sender considers the TLP episode to 897 be done, and records that fact by setting TLPRxtOut to false. 899 Step 2: Mark the end of a TLP retransmission episode and detect 900 losses 902 If the sender receives a cumulative ACK for data beyond the TLP loss 903 probe retransmission then, in the absence of reordering on the return 904 path of ACKs, it should have received any ACKs for the original 905 segment and TLP probe retransmission segment. At that time, if the 906 TLPRxtOut flag is still true and thus indicates that the TLP probe 907 retransmission remains unacknowledged, then the sender should presume 908 that at least one of its data segments was lost, so it SHOULD invoke 909 a congestion control response equivalent to fast recovery. 911 More precisely, on each ACK the sender executes the following: 913 if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { 914 TLPRxtOut = false 915 EnterRecovery() 916 ExitRecovery() 917 } 919 7. RACK and TLP discussions 921 7.1. Advantages 923 The biggest advantage of RACK is that every data packet, whether it 924 is an original data transmission or a retransmission, can be used to 925 detect losses of the packets sent chronologically prior to it. 927 Example: TAIL DROP. Consider a sender that transmits a window of 928 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 929 transmission of each packet is at least RACK.reo_wnd (1 millisecond 930 by default) after the transmission of the previous packet. RACK will 931 mark P1 as lost when the SACK of P2 is received, and this will 932 trigger the retransmission of P1 as R1. When R1 is cumulatively 933 acknowledged, RACK will mark P3 as lost and the sender will 934 retransmit P3 as R3. 
This example illustrates how RACK is able to 935 repair certain drops at the tail of a transaction without any timer. 936 Notice that neither the conventional duplicate ACK threshold 937 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 938 algorithm can detect such losses, because of the required packet or 939 sequence count. 941 Example: LOST RETRANSMIT. Consider a window of three data packets 942 (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the 943 transmission of each packet is at least RACK.reo_wnd (1 millisecond 944 by default) after the transmission of the previous packet. When P3 945 is SACKed, RACK will mark P1 and P2 lost and they will be 946 retransmitted as R1 and R2. Suppose R1 is lost again but R2 is 947 SACKed; RACK will mark R1 lost for retransmission again. Again, 948 neither the conventional three duplicate ACK threshold approach, nor 949 [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect 950 such losses. Such a lost retransmission is very common when TCP 951 is being rate-limited, particularly by token bucket policers with 952 large bucket depth and low rate limit. Retransmissions are often 953 lost repeatedly because standard congestion control requires multiple 954 round trips to reduce the rate below the policed rate. 956 Example: SMALL DEGREE OF REORDERING. Consider a common reordering 957 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 958 a full payload of MSS octets, but P3 has only a 1-octet payload. 959 Suppose the sender has detected reordering previously and thus 960 RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, 961 before P1 and P2. As long as P1 and P2 are delivered within 962 min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 963 are delivered outside the reordering window, then RACK will still 964 falsely mark P1 and P2 lost. We discuss how to reduce false 965 positives at the end of this section.
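The per-packet decision behind these examples can be sketched in Python. This is a simplification of the RACK_detect_loss pseudocode (it compares only transmission times, omitting the sequence-number tiebreak and the reordering timer), and all names are illustrative:

```python
def rack_lost(pkt_xmit_ts, rack_xmit_ts, rack_rtt, now, reo_wnd=1.0):
    """Decide whether an unacked packet should be marked lost.

    pkt_xmit_ts  -- (re)transmission time of the unacked packet (ms)
    rack_xmit_ts -- transmission time of the most recently delivered
                    packet (RACK.xmit_ts)
    rack_rtt     -- RTT of the most recently delivered packet (ms)
    reo_wnd      -- reordering window (RACK.reo_wnd), 1 ms by default
    """
    sent_before_newest_delivered = pkt_xmit_ts < rack_xmit_ts
    wait_expired = now - pkt_xmit_ts >= rack_rtt + reo_wnd
    return sent_before_newest_delivered and wait_expired

# TAIL DROP example above: P1, P2, P3 sent at t = 0, 1, 2 ms with
# RTT = 100 ms; the SACK of P2 arrives at t = 101 ms.
assert rack_lost(0, 1, 100, 101)        # P1 marked lost without a timer
assert not rack_lost(2, 1, 100, 101)    # P3 was sent after P2
```

In the reordering example, raising reo_wnd to min_RTT/4 simply enlarges the time budget before wait_expired becomes true.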
967 The examples above show that RACK is particularly useful when the 968 sender is limited by the application, which is common for 969 interactive, request/response traffic. Similarly, RACK still works 970 when the sender is limited by the receive window, which is common for 971 applications that use the receive window to throttle the sender. 973 For some implementations (e.g., Linux), RACK works quite efficiently 974 with TCP Segmentation Offload (TSO). RACK always marks the entire 975 TSO blob lost because the packets in the same TSO blob have the same 976 transmission timestamp. By contrast, the algorithms based on 977 sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset 978 of packets in the TSO blob lost, forcing the stack to perform 979 expensive fragmentation of the TSO blob, or to selectively tag 980 individual packets lost in the scoreboard. 982 7.2. Disadvantages 984 RACK requires the sender to record the transmission time of each 985 packet sent at a clock granularity of one millisecond or finer. TCP 986 implementations that record this already for RTT estimation do not 987 require any new per-packet state. But implementations that are not 988 yet recording packet transmission times will need to add per-packet 989 internal state (commonly either 4 or 8 octets per packet or TSO blob) 990 to track transmission times. In contrast, the conventional [RFC6675] 991 loss detection approach does not require any per-packet state beyond 992 the SACK scoreboard. This lower overhead is particularly useful on ultra-low RTT 993 networks where the RTT is far less than the sender TCP clock 994 granularity (e.g. inside data-centers). 996 RACK can easily and optionally support the conventional approach in 997 [RFC6675][RFC5681] by resetting the reordering window to zero when 998 the threshold is met. Note that this approach differs slightly from 999 [RFC6675] which considers a packet lost when at least #DupThresh 1000 higher-sequence packets are SACKed.
RACK's approach considers a 1001 packet lost when at least one higher sequence packet is SACKed and 1002 the total number of SACKed packets is at least DupThresh. For 1003 example, suppose a connection sends 10 packets, and packets 3, 5, 7 1004 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK 1005 considers packets 1, 2, 4, 6 lost. 1007 7.3. Adjusting the reordering window 1009 When the sender detects packet reordering, RACK uses a reordering 1010 window of min_RTT/4. It uses the minimum RTT to accommodate 1011 reordering introduced by packets traversing slightly different paths 1012 (e.g., router-based parallelism schemes) or out-of-order deliveries 1013 in the lower link layer (e.g., wireless links using link-layer 1014 retransmission). RACK uses a quarter of minimum RTT because Linux 1015 TCP used the same factor in its implementation to delay Early 1016 Retransmit [RFC5827] to reduce spurious loss detections in the 1017 presence of reordering, and experience shows that this seems to work 1018 reasonably well. We have evaluated using the smoothed RTT (SRTT from 1019 [RFC6298] RTT estimation) or the most recently measured RTT 1020 (RACK.rtt) using an experiment similar to that in the Performance 1021 Evaluation section. They do not make any significant difference in 1022 terms of total recovery latency. 1024 7.4. Relationships with other loss recovery algorithms 1026 The primary motivation of RACK is to ultimately provide a simple and 1027 general replacement for some of the standard loss recovery algorithms 1028 [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard 1029 ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss 1030 detection mechanism on top of these algorithms, this is not 1031 necessary, because RACK implicitly subsumes most of them. 1033 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 1034 threshold based on the current or previous flight sizes.
RACK takes 1035 a different approach, by using only one ACK event and a reordering 1036 window. RACK can be seen as an extended Early Retransmit [RFC5827] 1037 without a FlightSize limit but with an additional reordering window. 1038 [FACK] considers an original packet to be lost when its sequence 1039 range is sufficiently far below the highest SACKed sequence. In some 1040 sense RACK can be seen as a generalized form of FACK that operates in 1041 time space instead of sequence space, enabling it to better handle 1042 reordering, application-limited traffic, and lost retransmissions. 1044 Nevertheless, RACK is still an experimental algorithm, since the 1045 oldest loss detection algorithm, the 3 duplicate ACK threshold 1046 [RFC5681], has been standardized and widely deployed. RACK can 1047 easily and optionally support the conventional approach for 1048 compatibility. 1050 RACK is compatible with and does not interfere with the standard 1051 RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 1052 algorithms [RFC3522]. This is because RACK only detects loss by 1053 using ACK events. It neither changes the RTO timer calculation nor 1054 detects spurious timeouts. 1056 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 1057 because a tail loss probe solicits either an ACK or SACK, which can 1058 be used by RACK to detect more losses. RACK can be used to relax 1059 TLP's requirement for using FACK and retransmitting the highest- 1060 sequenced packet, because RACK is agnostic to packet sequence 1061 numbers, and uses transmission time instead. Thus TLP could be 1062 modified to retransmit the first unacknowledged packet, which could 1063 improve application latency. 1065 7.5. Interaction with congestion control 1067 RACK intentionally decouples loss detection from congestion control. 1068 RACK only detects losses; it does not modify the congestion control 1069 algorithm [RFC5681][RFC6937].
However, RACK may detect losses 1070 earlier or later than the conventional duplicate ACK threshold 1071 approach does. A packet marked lost by RACK SHOULD NOT be 1072 retransmitted until congestion control deems this appropriate. 1073 Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used 1074 when using RACK. 1076 RACK is applicable for both fast recovery and recovery after a 1077 retransmission timeout (RTO) in [RFC5681]. RACK applies equally to 1078 fast recovery and RTO recovery because RACK is purely based on the 1079 transmission time order of packets. When a packet retransmitted by 1080 RTO is acknowledged, RACK will mark any unacked packet sent 1081 sufficiently prior to the RTO as lost, because at least one RTT has 1082 elapsed since these packets were sent. 1084 The following simple example compares how RACK and non-RACK loss 1085 detection interact with congestion control: suppose a TCP sender has 1086 a congestion window (cwnd) of 20 packets on a SACK-enabled 1087 connection. It sends 10 data packets and all of them are lost. 1089 Without RACK, the sender would time out, reset cwnd to 1, and 1090 retransmit the first packet. It would take four round trips (1 + 2 + 1091 4 + 3 = 10) to retransmit all 10 lost packets using slow start. 1092 The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 1093 packets due to congestion window validation. 1095 With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. 1096 If the sender implements Proportional Rate Reduction [RFC6937], it 1097 would slow start to retransmit the remaining 9 lost packets since the 1098 number of packets in flight (0) is lower than the slow start 1099 threshold (10). The slow start would again take four round trips (1 1100 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with 1101 an ending cwnd set to the slow start threshold of 10 packets. 1103 In both cases, the sender after the recovery would be in congestion 1104 avoidance.
The difference in recovery latency (RTO + 4*RTT vs 6*RTT) 1105 can be significant if the RTT is much smaller than the minimum RTO (1 1106 second in RFC6298) or if the RTT is large. The former case is common 1107 in local area networks, data-center networks, or content distribution 1108 networks with deep deployments. The latter case is more common in 1109 developing regions with highly congested and/or high-latency 1110 networks. The ending congestion window after recovery also impacts 1111 subsequent data transfer. 1113 7.6. TLP recovery detection with delayed ACKs 1115 Delayed ACKs complicate the detection of repairs done by TLP, since 1116 with a delayed ACK the sender receives one fewer ACK than would 1117 normally be expected. To mitigate this complication, before sending 1118 a TLP loss probe retransmission, the sender should attempt to wait 1119 long enough that the receiver has sent any delayed ACKs that it is 1120 withholding. The sender algorithm described above features such a 1121 delay, in the form of WCDelAckT. Furthermore, if the receiver 1122 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1123 in the case of a delayed ACK the sender's TLP recovery detection 1124 algorithm (see above) can use the D-SACK information to infer that 1125 the original and TLP retransmission both arrived at the receiver. 1127 If there is ACK loss or a delayed ACK without a D-SACK, then this 1128 algorithm is conservative, because the sender will reduce cwnd when 1129 in fact there was no packet loss. In practice this is acceptable, 1130 and potentially even desirable: if there is reverse path congestion 1131 then reducing cwnd can be prudent. 1133 7.7. RACK for other transport protocols 1135 RACK can be implemented in other transport protocols. The algorithm 1136 can be simplified by skipping step 3 if the protocol can support a 1137 unique transmission or packet identifier (e.g. TCP timestamp options 1138 [RFC7323]). 
For example, the QUIC protocol implements RACK [QUIC- 1138 LR]. 1141 8. Experiments and Performance Evaluations 1143 RACK and TLP have been deployed at Google, for both connections to 1144 users in the Internet and internally. We conducted a performance 1145 evaluation experiment for RACK and TLP on a small set of Google Web 1146 servers in Western Europe that serve mostly European and some African 1147 countries. The experiment lasted three days in March 2017. The 1148 servers were divided evenly into four groups of roughly 5.3 million 1149 flows each: 1151 Group 1 (control): RACK off, TLP off, RFC 6675 on 1153 Group 2: RACK on, TLP off, RFC 6675 on 1155 Group 3: RACK on, TLP on, RFC 6675 on 1157 Group 4: RACK on, TLP on, RFC 6675 off 1159 All groups used Linux with CUBIC congestion control, an initial 1160 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1161 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1162 disabled FACK, because it is not an IETF RFC and the goal of this 1163 setup is to compare RACK and TLP to RFC-based loss recoveries. 1164 Since TLP depends on either FACK or RACK, we 1165 could not run another group that enables TLP only (with both RACK and 1166 FACK disabled). Group 4 tests whether RACK plus TLP can 1167 completely replace the DupThresh-based loss detection of [RFC6675]. 1169 The servers sit behind a load balancer that distributes the 1170 connections evenly across the four groups. 1172 Each group handles a similar number of connections and sends and 1173 receives similar amounts of data. We compare total time spent in 1174 loss recovery across groups. The recovery time is measured from when 1175 the recovery and retransmission start until the remote host has 1176 acknowledged the highest sequence (SND.NXT) at the time the recovery 1177 started. Therefore the measurement includes both fast recoveries and 1178 timeout recoveries.
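The recovery-time bookkeeping described above can be stated precisely with a small sketch (a hypothetical helper, not the actual measurement tooling used in the experiment):

```python
def recovery_time(acks, recovery_start, snd_nxt_at_start):
    """Elapsed time from the start of recovery until the peer has
    cumulatively acknowledged the SND.NXT recorded when recovery
    began; this covers both fast recoveries and timeout recoveries.

    acks -- (time, cumulative_ack) pairs in arrival order
    """
    for t, cum_ack in acks:
        if t >= recovery_start and cum_ack >= snd_nxt_at_start:
            return t - recovery_start
    return None  # recovery has not completed yet

# Recovery starts at t=10 with SND.NXT=1000; it ends at the first
# ACK that covers sequence 1000.
acks = [(12, 400), (15, 900), (18, 1000)]
```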
1180 Our data shows that Group 2 recovery latency is only 0.3% lower than 1181 the Group 1 recovery latency. But Group 3 recovery latency is 25% 1182 lower than Group 1 due to a 40% reduction in RTO-triggered 1183 recoveries! Therefore it is important to implement both TLP and RACK 1184 for performance. Group 4's total recovery latency is 0.02% lower 1185 than Group 3's, indicating that RACK plus TLP can successfully 1186 replace RFC6675 as a standalone recovery mechanism. 1188 We want to emphasize that the current experiment is limited in terms 1189 of network coverage. The connectivity in Western Europe is fairly 1190 good, therefore loss recovery is not a major performance bottleneck. 1191 We plan to expand our experiments to regions with worse connectivity, 1192 in particular on networks with strong traffic policing. 1194 9. Security Considerations 1196 RACK does not change the risk profile for TCP. 1198 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 1199 MSS-size packet sent, the receiver or the attacker might send MSS 1200 ACKs that SACK or acknowledge one additional byte per ACK. This 1201 would not fool RACK. RACK.xmit_ts would not advance because all the 1202 sequences of the packet are transmitted at the same time (carry the 1203 same transmission timestamp). In other words, SACKing only one byte 1204 of a packet or SACKing the packet in entirety have the same effect on 1205 RACK. 1207 10. IANA Considerations 1209 This document makes no request of IANA. 1211 Note to RFC Editor: this section may be removed on publication as an 1212 RFC. 1214 11. Acknowledgments 1216 The authors thank Matt Mathis for his insights in FACK and Michael 1217 Welzl for his per-packet timer idea that inspired this work. 
Eric 1218 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana 1219 Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi 1220 Nishida, Bob Briscoe, Felix Weinrank, and Michael Tuexen contributed 1221 to the draft or the implementations in Linux, FreeBSD, Windows and 1222 QUIC. 1224 12. References 1226 12.1. Normative References 1228 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 1229 Options", RFC 2018, October 1996. 1231 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1232 Requirement Levels", RFC 2119, March 1997. 1234 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1235 Extension to the Selective Acknowledgement (SACK) Option 1236 for TCP", RFC 2883, July 2000. 1238 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 1239 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 1240 November 2006. 1242 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1243 Control", RFC 5681, September 2009. 1245 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 1246 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1247 Spurious Retransmission Timeouts with TCP", RFC 5682, 1248 September 2009. 1250 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 1251 Hurtig, "Early Retransmit for TCP and Stream Control 1252 Transmission Protocol (SCTP)", RFC 5827, April 2010. 1254 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1255 "Computing TCP's Retransmission Timer", RFC 6298, June 1256 2011. 1258 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1259 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1260 Based on Selective Acknowledgment (SACK) for TCP", 1261 RFC 6675, August 2012. 1263 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 1264 Rate Reduction for TCP", RFC 6937, May 2013. 1266 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
1267 Scheffenegger, "TCP Extensions for High Performance", 1268 RFC 7323, September 2014. 1270 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, 1271 September 1981. 1273 12.2. Informative References 1275 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 1276 refining TCP congestion control", ACM SIGCOMM Computer 1277 Communication Review, Volume 26, Issue 4, October 1996. 1280 [POLICER16] 1281 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1282 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1283 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 1284 2016. 1286 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1287 Congestion Control", draft-tsvwg-quic-loss-recovery-01 1288 (work in progress), June 2016. 1290 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1291 and SCTP RTO Restart", RFC 7765, February 2016. 1293 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1294 "TCP Congestion Control With a Misbehaving Receiver", ACM 1295 Computer Communication Review, 29(5), 1999. 1297 [THIN-STREAM] 1298 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1299 "TCP enhancements for interactive thin-stream 1300 applications", NOSSDAV, 2008. 1302 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1303 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1304 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1305 in progress), August 2013. 1313 Authors' Addresses 1314 Yuchung Cheng 1315 Google, Inc 1316 1600 Amphitheater Parkway 1317 Mountain View, California 94043 1318 USA 1320 Email: ycheng@google.com 1322 Neal Cardwell 1323 Google, Inc 1324 76 Ninth Avenue 1325 New York, NY 10011 1326 USA 1328 Email: ncardwell@google.com 1330 Nandita Dukkipati 1331 Google, Inc 1332 1600 Amphitheater Parkway 1333 Mountain View, California 94043 1335 Email: nanditad@google.com 1337 Priyaranjan Jha 1338 Google, Inc 1339 1600 Amphitheater Parkway 1340 Mountain View, California 94043 1342 Email: priyarjha@google.com