TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Standards Track                             N. Dukkipati
Expires: July 20, 2020                                            P. Jha
                                                              Google, Inc
                                                             Jan 17, 2020

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-07

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").  RACK uses the notion of time,
   instead of packet or sequence counts, to detect losses, for modern
   TCP implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as
   well as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on July 20, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.  In this document, these words will
   appear with that interpretation only when in UPPER CASE.  Lower
   case uses of these words are not to be interpreted as carrying
   [RFC2119] significance.

2.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if it has not been
   delivered and some packet sent sufficiently later has been
   delivered.  It does this by recording packet transmission times and
   inferring losses using cumulative acknowledgments or selective
   acknowledgment (SACK) TCP options.

   In recent years we have been observing several increasingly common
   loss and reordering patterns in the Internet:

   1.  Slow recovery due to lost retransmissions.  Traffic policers
       [POLICER16] and burst losses often cause retransmissions to be
       lost again.  This severely increases latency because the lost
       retransmissions can only be recovered by retransmission
       timeouts (RTOs).

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK),
       link bonding, or routers' internal load-balancing can deliver
       TCP packets out of order.  The degree of such reordering is
       usually within the order of the path round trip time.

   Although TCP stacks (e.g. Linux) implement many of the standard and
   proposed loss detection algorithms
   [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM],
   we've found that together they do not perform well.  The main
   reason is that many of them are based on the classic rule of
   counting duplicate acknowledgments [RFC5681].  They can either
   detect loss quickly or accurately, but not both, especially when
   the sender is application-limited or the reordering is
   unpredictable.  And under these conditions none of them can detect
   lost retransmissions well.

   Also, these algorithms, including those in RFCs, rarely address
   their interactions with other algorithms.
   For example, FACK may consider a packet lost while RFC6675 may not.
   Implementing N algorithms while dealing with N^2 interactions is a
   daunting and error-prone task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one more effective
   algorithm to handle loss and reordering.

3.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before that
   were either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC6675][FACK].  But the key innovation in
   RACK is to use a per-packet transmission timestamp and widely
   deployed SACK options to conduct time-based inferences instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) alone is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems with
   packet-counting approaches [RFC5681][RFC6675][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge if some packets sent prior to
   that time have "expired" by passing a certain reordering settling
   window.  On each ACK, RACK marks any already-expired packets lost,
   and for any packets that have not yet expired it waits until the
   reordering window passes and then marks those lost as well.  In
   either case, RACK can repair the loss without waiting for a (long)
   RTO.  RACK can be applied to both fast recovery and timeout
   recovery, and can detect losses on both originally transmitted and
   retransmitted packets, making it a great all-weather loss detection
   mechanism.

4.  Design Rationale for Reordering Tolerance

   The reordering behavior of networks can evolve (over years) in
   response to the behavior of transport protocols and applications,
   as well as the needs of network designers and operators.  From a
   network or link designer's viewpoint, parallelization (e.g. link
   bonding) is the easiest way to get a network to go faster.
   Therefore their main constraint on speed is reordering, and there
   is pressure to relax that constraint.  If RACK becomes widely
   deployed, the underlying networks may introduce more reordering for
   higher throughput.  But this may result in excessive reordering
   that hurts end-to-end performance:
   1.  End host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.

   2.  Congestion control: TCP congestion control implicitly assumes
       that the feedback from ACKs is from the same bottleneck.
       Therefore it copes poorly with scenarios where packets traverse
       largely disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths
       would increase the latency of loss recovery.

   An end-to-end transport protocol cannot tell immediately whether a
   hole is reordering or loss.  It can only distinguish between the
   two in hindsight if the hole in the sequence space gets filled
   later without a retransmission.  How long the sender waits for such
   potential reordering events to settle is determined by the current
   reordering window.

   Given these considerations, a core design philosophy of RACK is to
   adapt to the measured duration of reordering events, within
   reasonable and specific bounds.  To accomplish this RACK places the
   following mandates on the reordering window:

   1.  The initial RACK reordering window SHOULD be set to a small
       fraction of the round-trip time.

   2.  If no reordering has been observed, then RACK SHOULD honor the
       classic 3-DUPACK rule for initiating fast recovery.  One simple
       way to implement this is to temporarily override the reorder
       window to 0.

   3.  The RACK reordering window SHOULD leverage Duplicate Selective
       Acknowledgement (DSACK) information [RFC3708] to adaptively
       estimate the duration of reordering events.

   4.  The RACK reordering window MUST be bounded and this bound
       SHOULD be one round trip.

   As a flow starts, either condition 1 or condition 2 or both would
   trigger RACK to start the recovery process quickly.  The low
   initial reordering window and use of the 3-DUPACK rule are key to
   achieving low-latency loss recovery for short flows by risking
   spurious retransmissions to recover losses quickly.  The rationale
   is that spurious retransmissions for short flows are not expected
   to produce excessive network traffic.

   For long flows the design tolerates reordering within a round trip.
   This handles reordering caused by path divergence in small time
   scales (reordering within the round-trip time of the shortest
   path), which should tolerate much of the reordering from link
   bonding, multipath routing, or link-layer out-of-order delivery.
   It also relaxes ordering constraints to allow sending flights of
   TCP packets on different paths dynamically for better
   load-balancing (e.g. flowlets).

   However, the fact that the initial RACK reordering window is low,
   and the RACK reordering window's adaptive growth is bounded, means
   that there will continue to be a cost to reordering and a limit to
   RACK's adaptation to reordering.  This maintains a disincentive for
   network designers and operators to introduce needless or excessive
   reordering, particularly because they have to allow for low round
   trip time paths.  This means RACK will not encourage networks to
   perform inconsiderate fine-grained packet-spraying over highly
   disjoint paths with very different characteristics.
   There are good alternative solutions, such as MPTCP, for such
   networks.

   To conclude, the RACK algorithm aims to adapt to small degrees of
   reordering, quickly recover most losses within one to two round
   trips, and avoid costly retransmission timeouts (RTOs).  In the
   presence of reordering, the adaptation algorithm can impose
   sometimes-needless delays when it waits to disambiguate loss from
   reordering, but the penalty for waiting is bounded to one round
   trip and such delays are confined to longer-running flows.

   This document provides a concrete and detailed reordering window
   adaptation algorithm for implementers.  We note that the specifics
   of the algorithm are likely to evolve over time.  But that is a
   separate engineering optimization that's out of scope for this
   document.

5.  Requirements

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not expected but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g.,
       intra-datacenter communications), microsecond granularity would
       significantly help the detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explaining the algorithm, we use a
   pseudo-scoreboard that manages the data in sequence number ranges.
   But the specifics of the data structure are left to the
   implementer.

   RACK does not require any change on the receiver.

6.  Definitions

   The reader is expected to be familiar with the definitions given in
   [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

6.1.  Definitions of variables

   A sender implementing RACK needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet" is, among all the packets that have been either
   selectively or cumulatively acknowledged, the one that was sent
   most recently, including retransmissions.

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.rtt" is the RTT of the most recently delivered packet on the
   connection (either cumulatively acknowledged or selectively
   acknowledged) that was not marked invalid as a possible spurious
   retransmission.

   "RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

   "RACK.reo_wnd" is a reordering window computed in the unit of time
   used for recording packet transmission times.  It is used to defer
   the moment at which RACK marks a packet lost.
316 "RACK.dupthresh" is a constant specifying the number of duplicate 317 acknowledgments, or selectively acknowledged segments, that can 318 (under certain conditions) trigger fast recovery, similar to 319 [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined 320 to be 3. 322 "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the 323 connection. 325 "RACK.ack_ts" is the time when all the sequences in RACK.packet were 326 selectively or cumulatively acknowledged. 328 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd 330 "RACK.reo_wnd_persist" is the number of loss recoveries before 331 resetting RACK.reo_wnd 332 "RACK.dsack" indicates if a DSACK option has been received since last 333 RACK.reo_wnd change "RACK.pkts_sacked" returns the total number of 334 packets selectively acknowledged in the SACK scoreboard. 336 "RACK.reord" indicates the connection has detected packet reordering 337 event(s) 339 "RACK.fack" is the highest selectively or cumulatively acknowledged 340 sequence 342 Note that the Packet.xmit_ts variable is per packet in flight. The 343 RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT 344 variables are kept in the per-connection TCP control block. 345 RACK.packet and RACK.ack_ts are used as local variables in the 346 algorithm. 348 7. Algorithm Details 350 7.1. Transmitting a data packet 352 Upon transmitting a new packet or retransmitting an old packet, 353 record the time in Packet.xmit_ts. RACK does not care if the 354 retransmission is triggered by an ACK, new application data, an RTO, 355 or any other means. 357 7.2. Upon receiving an ACK 359 Step 1: Update RACK.min_RTT. 361 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 362 update the estimated minimum RTT in RACK.min_RTT. The sender can 363 track a simple global minimum of all RTT measurements from the 364 connection, or a windowed min-filtered value of recent RTT 365 measurements. This document does not specify an exact approach. 367 Step 2: Update RACK stats 369 Given the information provided in an ACK, each packet cumulatively 370 ACKed or SACKed is marked as delivered in the scoreboard. Among all 371 the packets newly ACKed or SACKed in the connection, record the most 372 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 373 Sometimes the timestamps of RACK.Packet and Packet could carry the 374 same transmit timestamps due to clock granularity or segmentation 375 offloading (i.e. the two packets were handed to the NIC as a single 376 unit). In that case the sequence numbers of RACK.end_seq and 377 Packet.end_seq are compared to break the tie. 379 Since an ACK can also acknowledge retransmitted data packets, and 380 retransmissions can be spurious, the sender must take care to avoid 381 spurious inferences. For example, if the sender were to use timing 382 information from a spurious retransmission, the RACK.rtt could be 383 vastly underestimated. 385 To avoid spurious inferences, ignore a packet as invalid if any of 386 its TCP sequences have been retransmitted before and either of two 387 conditions is true: 389 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 390 option [RFC7323], if available, indicates the ACK was not 391 acknowledging the last retransmission of the packet. 393 2. The packet was last retransmitted less than RACK.min_rtt ago. 395 If the ACK is not ignored as invalid, update the RACK.rtt to be the 396 RTT sample calculated using this ACK, and continue. 
   If this ACK or SACK was for the most recently sent packet, then
   record the RACK.xmit_ts timestamp and RACK.end_seq sequence implied
   by this ACK.  Otherwise exit here and omit the following steps.

   Notice that the second condition above is a heuristic.  This
   heuristic would fail to update RACK stats if the packet is
   spuriously retransmitted because of a recent minimum RTT decrease
   (e.g. path change).  Consequently RACK may not detect losses from
   ACK events and the recovery resorts to the (slower) TLP or RTO
   timer-based events.  However such events should be rare and the
   connection would pick up the new minimum RTT when the recovery ends
   to avoid repeated similar failures.

   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet.retransmitted is TRUE:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Continue /* Skip this invalid packet */
               If rtt < RACK.min_rtt:
                   Continue /* Skip this invalid packet */

           RACK.rtt = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts
               RACK.end_seq = Packet.end_seq
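   As a non-normative illustration, the comparison rule above can be
   expressed in runnable form.  The following Python sketch (names and
   values are illustrative only) shows how the ending sequence number
   breaks the tie when two segments carry the same transmit timestamp,
   e.g. two segments from one TSO blob:

   # Non-normative Python sketch of the Step 2 comparison rule.
   def rack_sent_after(t1, seq1, t2, seq2):
       # (t1, seq1) was sent after (t2, seq2) if it has a later
       # transmit time, or the same time but a higher ending sequence.
       return t1 > t2 or (t1 == t2 and seq1 > seq2)

   # Two segments handed to the NIC as a single unit share a transmit
   # timestamp; the ending sequence number breaks the tie.
   assert rack_sent_after(100, 3000, 100, 2000)
   assert not rack_sent_after(99, 3000, 100, 2000)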
   Step 3: Detect packet reordering

   To detect reordering, the sender looks for original data packets
   being delivered out of order in sequence space.  The sender tracks
   the highest sequence selectively or cumulatively acknowledged in
   the RACK.fack variable.  The name "fack" stands for the most
   forward ACK, a term that originated in the [FACK] draft.  If the
   ACK selectively or cumulatively acknowledges an unacknowledged and
   also never retransmitted sequence below RACK.fack, then the
   corresponding packet has been reordered and RACK.reord is set to
   TRUE.

   The heuristic above only detects reordering if the re-ordered
   packet has not yet been retransmitted.  This is a major drawback
   because if RACK has a low reordering window and the network is
   reordering packets, RACK may falsely retransmit frequently.
   Consequently RACK may fail to detect the reordering and thus fail
   to increase the reordering window, because the reordered packets
   were already (falsely) retransmitted.

   DSACK [RFC3708] can help mitigate this issue.  The false
   retransmission would solicit a DSACK option in the ACK.  Therefore
   if the ACK has a DSACK option covering some sequences that were
   both acknowledged and retransmitted, this implies that the original
   packet was reordered but RACK retransmitted it too quickly; the
   sender should then set RACK.reord to TRUE.

   RACK_detect_reordering():
       For each Packet newly acknowledged cumulatively or selectively:
           If Packet.end_seq > RACK.fack:
               RACK.fack = Packet.end_seq
           Else if Packet.end_seq < RACK.fack AND
                   Packet.retransmitted is FALSE:
               RACK.reord = TRUE

       For each Packet covered by the DSACK option:
           If Packet.retransmitted is TRUE:
               RACK.reord = TRUE

   Step 4: Update RACK reordering window

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  This section documents a detailed algorithm following the
   design rationale section.  RACK starts initially with a
   conservative window of min_RTT/4.  If no reordering has been
   observed, RACK uses a RACK.reo_wnd of 0 during loss recovery or
   when the number of DUPACKs exceeds the classic DUPACK threshold, in
   order to retransmit quickly.  The subtle difference between this
   approach and the conventional one [RFC5681][RFC6675] is discussed
   later, in the section "RACK and TLP discussions".

   Further, RACK MAY use DSACK [RFC3708] to adapt the reordering
   window to higher degrees of reordering, if DSACK is supported.
   Receiving an ACK with a DSACK indicates a spurious retransmission,
   which in turn suggests that the RACK reordering window,
   RACK.reo_wnd, is likely too small.  The sender MAY increase the
   RACK.reo_wnd window linearly for every round trip in which the
   sender receives a DSACK, so that after N distinct round trips in
   which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT
   / 4, with an upper bound of SRTT.  The inflated RACK.reo_wnd
   persists for 16 loss recoveries, after which it resets to its
   starting value, min_RTT / 4.

   The following pseudocode implements the algorithm above.  Note that
   extensions that require additional TCP features (e.g. DSACK) would
   work if the feature functions simply return false.

   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If DSACK option is present:
           RACK.dsack = true

       If SND.UNA < RACK.rtt_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.rtt_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If RACK.reord is FALSE:
           If in loss recovery:  /* If in fast or timeout recovery */
               RACK.reo_wnd = 0
               Return
           Else if RACK.pkts_sacked >= RACK.dupthresh:
               RACK.reo_wnd = 0
               Return
       RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
       RACK.reo_wnd = min(RACK.reo_wnd, SRTT)
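   To make the adaptation concrete, the following non-normative Python
   sketch (the RTT values are hypothetical) computes RACK.reo_wnd
   after successive round trips with DSACKs, showing the linear growth
   and the SRTT cap:

   # Non-normative sketch: reordering window growth under repeated
   # DSACK events, assuming min_RTT = 40 ms and SRTT = 50 ms.
   min_rtt_ms, srtt_ms = 40.0, 50.0
   for n in range(6):  # n = round trips in which a DSACK was received
       reo_wnd = min((n + 1) * min_rtt_ms / 4, srtt_ms)
       print(n, reo_wnd)  # 10, 20, 30, 40, then capped at 50 ms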
   Step 5: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.  As a
   starting point, we consider that the reordering window has passed
   if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  delta in transmit time between a packet and the RACK.packet

   2.  delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

   RACK.xmit_ts >= Packet.xmit_ts
   AND
   (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

   now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the
   sender used to set RACK.xmit_ts: the round trip time of the most
   recently (re)transmitted packet that's been delivered.  To be more
   robust to reordering, RACK uses a more conservative RTT value to
   decide if an unacknowledged packet should be considered lost,
   RACK.rtt: the round trip time of the most recently delivered packet
   on the connection that was not marked invalid as a possible
   spurious retransmission.

   When packets are delivered in order, the most recently
   (re)transmitted packet that's been delivered is also the most
   recently delivered, hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts.
   But if packets were reordered, then the packet delivered most
   recently was sent before the most recently (re)transmitted packet.
   Hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts).

   Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous
   equation reduces to saying that the sender can declare a packet
   lost when:

   now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

   In turn, that is equivalent to stating that a RACK sender should
   declare a packet lost when:

   Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful to detect losses in a timely manner with TCP
   Segmentation Offload, where multiple packets in one TSO blob have
   identical timestamps.  It is also useful when the timestamp clock
   granularity is close to or longer than the actual round trip time.

   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, not acknowledged yet:
           If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
               Continue /* Packet lost but not yet retransmitted */

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.rtt +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Packet.lost = TRUE
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through packets in the SACK
   scoreboard above could be very costly on large-BDP networks since
   the inflight could be very large.  If the implementation can
   organize the scoreboard data structures to have packets sorted by
   the last (re)transmission time, then the loop can start on the
   least recently sent packet and abort on the first packet sent after
   RACK.xmit_ts.  This can be implemented by using a separate
   doubly-linked list sorted in time order.  The implementation
   inserts the packet at the tail of the list when it is
   (re)transmitted, and removes a packet from the list when it is
   delivered or marked lost.  We RECOMMEND such an optimization
   because it enables implementations to support high-BDP networks.
   This optimization is implemented in Linux and sees orders of
   magnitude improvement in CPU usage on high-speed WAN networks.
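   As a non-normative illustration of this optimization, the Python
   sketch below scans a scoreboard kept in transmission-time order and
   aborts at the first packet sent after RACK.packet.  All names,
   types, and values are illustrative, and arming the timer is left to
   the caller:

   import collections

   # Non-normative sketch of RACK_detect_loss() over a scoreboard
   # sorted by last (re)transmission time, oldest first.  Times are
   # in milliseconds.
   Packet = collections.namedtuple("Packet", "xmit_ts end_seq")
   RackState = collections.namedtuple(
       "RackState", "xmit_ts end_seq rtt reo_wnd")

   def sent_after(t1, seq1, t2, seq2):
       return t1 > t2 or (t1 == t2 and seq1 > seq2)

   def rack_detect_loss(unacked_by_xmit_time, rack, now):
       """Return (packets to mark lost, reordering-settling timeout)."""
       lost, timeout = [], 0
       for pkt in unacked_by_xmit_time:  # least recently sent first
           if not sent_after(rack.xmit_ts, rack.end_seq,
                             pkt.xmit_ts, pkt.end_seq):
               break  # sent after RACK.packet: abort the scan early
           remaining = pkt.xmit_ts + rack.rtt + rack.reo_wnd - now
           if remaining <= 0:
               lost.append(pkt)
           else:
               timeout = max(remaining, timeout)
       return lost, timeout

   # Example: RACK.packet was sent at t=120; the two packets sent
   # before it have expired their reordering windows, and the scan
   # stops at the packet sent at t=130.
   rack = RackState(xmit_ts=120, end_seq=5000, rtt=40, reo_wnd=10)
   unacked = [Packet(100, 2000), Packet(110, 3000), Packet(130, 6000)]
   print(rack_detect_loss(unacked, rack, now=170))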
7.3.  Tail Loss Probe: fast recovery for tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that can
   otherwise be recovered by RTOs only.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.  In either case the goal is to elicit more feedback from
   the receiver, in the form of an ACK (potentially with SACK blocks),
   to allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight of data is lost.

   2.  Tail losses of data segments at the end of an application
       transaction.

   3.  Tail losses of ACKs at the end of an application transaction.

   4.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

7.4.  Tail Loss Probe: An Example

   The following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.

   3.  When PTO fires, sender retransmits segment 10.

   4.  After an RTT, a SACK for packet 10 arrives.  The ACK also
       carries SACK holes for segments 6, 7, 8 and 9.  This triggers
       RACK-based loss recovery.

   5.  The connection enters fast recovery and retransmits the
       remaining lost segments.

7.5.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined
   in [RFC5681].
   RTO: The transport's retransmission timeout (RTO) is based on
   measured round-trip times (RTT) between the sender and receiver, as
   specified in [RFC6298] for TCP.

   PTO: Probe timeout (PTO) is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in [RFC6298].

   TLPRxtOut: a boolean indicating whether there is an unacknowledged
   TLP retransmission.

   TLPHighRxt: the value of SND.NXT at the time of sending a TLP
   retransmission.

   The TLP algorithm has three phases, which we discuss in turn.

7.5.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should check to see if it should schedule a PTO in the
   following situations:

   1.  After transmitting new data that was not itself a TLP probe

   2.  Upon receiving an ACK that cumulatively acknowledges data

   3.  Upon receiving a SACK that selectively acknowledges data that
       was last sent before the segment with SEG.SEQ=SND.UNA was last
       (re)transmitted

   A sender should schedule a PTO only if all of the following
   conditions are met:

   1.  The connection supports SACK [RFC2018]

   2.  The connection has no SACKed sequences in the SACK scoreboard

   3.  The connection is not in loss recovery

   If a PTO can be scheduled according to these conditions, the sender
   should schedule a PTO.  If there was a previously scheduled PTO or
   RTO pending, then that pending PTO or RTO should first be
   cancelled, and then the new PTO should be scheduled.

   If a PTO cannot be scheduled according to these conditions, then
   the sender MUST arm the RTO timer if there is unacknowledged data
   in flight.

   Step 2: Select the duration of the PTO.

   A sender SHOULD use the following logic to select the duration of a
   PTO:

   TLP_timeout():
       If SRTT is available:
           PTO = 2 * SRTT
           If FlightSize = 1:
               PTO += WCDelAckT
           Else:
               PTO += 2ms
       Else:
           PTO = 1 sec

       If Now() + PTO > TCP_RTO_expire():
           PTO = TCP_RTO_expire() - Now()

   Aiming for a PTO value of 2*SRTT allows a sender to wait long
   enough to know that an ACK is overdue.  Under normal circumstances,
   i.e. no losses, an ACK typically arrives in one SRTT.  But choosing
   PTO to be exactly an SRTT is likely to generate spurious probes
   given that network delay variance and even end-system timings can
   easily push an ACK to be above an SRTT.  We chose PTO to be the
   next integral multiple of SRTT.

   Similarly, network delay variations and end-system processing
   latencies and timer granularities can easily delay ACKs beyond
   2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value
   (and MAY add more if the sending host OS timer granularity is
   coarser than 1ms).

   WCDelAckT stands for worst case delayed ACK timer.  When FlightSize
   is 1, PTO is inflated by WCDelAckT time to compensate for a
   potential long delayed ACK timer at the receiver.  The RECOMMENDED
   value for WCDelAckT is 200ms.

   Finally, if the time at which an RTO would fire (here denoted
   "TCP_RTO_expire") is sooner than the computed time for the PTO,
   then a probe is scheduled to be sent at that earlier time.
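   As a non-normative illustration, the PTO selection can be written
   as the following runnable Python sketch (times are in seconds;
   parameter names are illustrative, and rto_expire_at is the absolute
   time at which the RTO would fire):

   WC_DEL_ACK_T = 0.200  # worst-case delayed ACK timer: 200ms

   def tlp_timeout(srtt, flight_size, now, rto_expire_at):
       if srtt is not None:
           pto = 2 * srtt
           # Compensate for a potential delayed ACK when only one
           # segment is outstanding; otherwise add a 2ms cushion.
           pto += WC_DEL_ACK_T if flight_size == 1 else 0.002
       else:
           pto = 1.0  # no RTT sample available yet
       # Never schedule the probe later than the RTO would fire.
       return min(pto, rto_expire_at - now)

   # With SRTT = 100ms and several segments in flight, the PTO is
   # 202ms, unless the RTO would fire sooner.
   print(tlp_timeout(0.100, 10, now=0.0, rto_expire_at=1.0))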
7.5.2.  Phase 2: Sending a loss probe

   When the PTO fires, transmit a probe data segment:

   TLP_send_probe():
       If an unsent segment exists AND
          the receive window allows new data to be sent:
           Transmit the lowest-sequence unsent segment of up to SMSS
           Increment FlightSize by the size of the newly-sent segment

       Else if TLPRxtOut is not set:
           Retransmit the highest-sequence segment sent so far
           The cwnd remains unchanged

   When the loss probe is a retransmission, the sender uses the
   highest-sequence segment sent so far.  This is in order to deal
   with the retransmission ambiguity problem in TCP.  Suppose a sender
   sends N segments, and then retransmits the last segment (segment N)
   as a loss probe, and then the sender receives a SACK for segment N.
   As long as the sender waits for any required RACK reordering
   settling timer to then expire, it doesn't matter if that SACK was
   for the original transmission of segment N or the TLP
   retransmission; in either case the arrival of the SACK for segment
   N provides evidence that the N-1 segments preceding segment N were
   likely lost.  In the case where there is only one original
   outstanding segment of data (N=1), the same logic (trivially)
   applies: an ACK for a single outstanding segment tells the sender
   the N-1=0 segments preceding that segment were lost.  Furthermore,
   whether there are N>1 or N=1 outstanding segments, there is a
   question about whether the original last segment or its TLP
   retransmission was lost; the sender estimates this using TLP
   recovery detection (see below).

   Note that whether or not a probe was sent in TLP_send_probe(), the
   sender MUST arm the RTO timer, not the PTO timer, at the end of
   TLP_send_probe() if FlightSize is not zero.  This ensures that the
   sender does not send repeated, back-to-back TLP probes.  Checking
   TLPRxtOut prior to sending the loss probe is also critical to avoid
   TLP loops if an application writes periodically at an interval less
   than PTO.

7.5.3.  Phase 3: ACK processing

   On each incoming ACK, the sender should check the conditions in
   Step 1 of Phase 1 to see if it should schedule (or reschedule) the
   loss probe timer.

7.6.  TLP recovery detection

   If the only loss in an outstanding window of data was the last
   segment, then a TLP loss probe retransmission of that data segment
   might repair the loss.  TLP recovery detection examines ACKs to
   detect when the probe might have repaired a loss, and thus allows
   congestion control to properly reduce the congestion window (cwnd)
   [RFC5681].

   Consider a TLP retransmission episode where a sender retransmits a
   tail packet in a flight.  The TLP retransmission episode ends when
   the sender receives an ACK with a SEG.ACK above the SND.NXT at the
   time the episode started.  During the TLP retransmission episode
   the sender checks for a duplicate ACK or D-SACK indicating that
   both the original segment and TLP retransmission arrived at the
   receiver, meaning there was no loss that needed repairing.  If the
   TLP sender does not receive such an indication before the end of
   the TLP retransmission episode, then it MUST estimate that either
   the original data segment or the TLP retransmission was lost, and
   congestion control MUST react appropriately to that loss as it
   would any other loss.
   Since a significant fraction of the hosts that support SACK do not
   support duplicate selective acknowledgments (D-SACKs) [RFC2883],
   the TLP algorithm for detecting such lost segments relies only on
   basic SACK support [RFC2018].

7.6.1.  Initializing and resetting state

   When a connection is created, or suffers a retransmission timeout,
   or enters fast recovery, it executes the following:

   TLPRxtOut = false

7.6.2.  Recording loss probe states

   Senders MUST only send a TLP loss probe retransmission if TLPRxtOut
   is false.  This ensures that at any given time a connection has at
   most one outstanding TLP retransmission.  This allows the sender to
   use the algorithm described in this section to estimate whether any
   data segments were lost.

   Note that this condition only restricts TLP loss probes that are
   retransmissions.  There may be an arbitrary number of outstanding
   unacknowledged TLP loss probes that consist of new, previously-
   unsent data, since the retransmission timeout and fast recovery
   algorithms are sufficient to detect losses of such probe segments.

   Upon sending a TLP probe that is a retransmission, the sender sets
   TLPRxtOut to true and TLPHighRxt to SND.NXT.

   Detecting recoveries accomplished by loss probes

   Step 1: Track ACKs indicating receipt of original and retransmitted
   segments

   A sender considers both the original segment and TLP probe
   retransmission segment as acknowledged if either condition 1 or 2
   is true:

   1.  This is a duplicate acknowledgment (as defined in [RFC5681],
       section 2), and all of the following conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK == TLPHighRxt

       3.  SEG.ACK == SND.UNA

       4.  the segment contains no SACK blocks for sequence ranges
           above TLPHighRxt

       5.  the segment contains no data

       6.  the segment is not a window update

   2.  This is an ACK acknowledging a sequence number at or above
       TLPHighRxt and it contains a D-SACK; i.e. all of the following
       conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK >= TLPHighRxt

       3.  the ACK contains a D-SACK block

   If either of the conditions is met, then the sender estimates that
   the receiver received both the original data segment and the TLP
   probe retransmission, and so the sender considers the TLP episode
   to be done, and records that fact by setting TLPRxtOut to false.

   Step 2: Mark the end of a TLP retransmission episode and detect
   losses

   If the sender receives a cumulative ACK for data beyond the TLP
   loss probe retransmission then, in the absence of reordering on the
   return path of ACKs, it should have received any ACKs for the
   original segment and TLP probe retransmission segment.  At that
   time, if the TLPRxtOut flag is still true and thus indicates that
   the TLP probe retransmission remains unacknowledged, then the
   sender should presume that at least one of its data segments was
   lost, so it SHOULD invoke a congestion control response equivalent
   to fast recovery.

   More precisely, on each ACK the sender executes the following:

   if (TLPRxtOut and SEG.ACK >= TLPHighRxt) {
       TLPRxtOut = false
       EnterRecovery()
       ExitRecovery()
   }
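   As a non-normative illustration, the two steps above can be
   combined into a single per-ACK hook, sketched below in Python.  The
   state and parameter names are illustrative, and on_loss() stands in
   for the fast-recovery-equivalent congestion control response:

   from dataclasses import dataclass

   # Non-normative sketch of TLP recovery detection on each ACK.
   @dataclass
   class TlpState:
       tlp_rxt_out: bool = False
       tlp_high_rxt: int = 0

   def tlp_process_ack(state, seg_ack, snd_una, is_dup_ack,
                       sack_above_high_rxt, has_data,
                       is_window_update, has_dsack, on_loss):
       # Step 1: evidence that both the original segment and the TLP
       # retransmission arrived -> episode done, no loss to repair.
       if state.tlp_rxt_out:
           dup_ack_done = (is_dup_ack
                           and seg_ack == state.tlp_high_rxt
                           and seg_ack == snd_una
                           and not sack_above_high_rxt
                           and not has_data
                           and not is_window_update)
           dsack_done = seg_ack >= state.tlp_high_rxt and has_dsack
           if dup_ack_done or dsack_done:
               state.tlp_rxt_out = False
               return
       # Step 2: the episode ends once SEG.ACK reaches TLPHighRxt; if
       # TLPRxtOut is still set, presume a loss and let congestion
       # control react as it would to any other loss.
       if state.tlp_rxt_out and seg_ack >= state.tlp_high_rxt:
           state.tlp_rxt_out = False
           on_loss()

   # Example: a D-SACK arrives at/above TLPHighRxt during an episode.
   st = TlpState(tlp_rxt_out=True, tlp_high_rxt=1000)
   tlp_process_ack(st, 1000, 1000, False, False, False, False, True,
                   on_loss=lambda: print("loss"))
   assert st.tlp_rxt_out is False  # episode done, no loss response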
8.  RACK and TLP discussions

8.1.  Advantages

   The biggest advantage of RACK is that every data packet, whether it
   is an original data transmission or a retransmission, can be used
   to detect losses of the packets sent chronologically prior to it.

   Example: TAIL DROP.  Consider a sender that transmits a window of
   three data packets (P1, P2, P3), and P1 and P3 are lost.  Suppose
   the transmission of each packet is at least RACK.reo_wnd (1
   millisecond by default) after the transmission of the previous
   packet.  RACK will mark P1 as lost when the SACK of P2 is received,
   and this will trigger the retransmission of P1 as R1.  When R1 is
   cumulatively acknowledged, RACK will mark P3 as lost and the sender
   will retransmit P3 as R3.  This example illustrates how RACK is
   able to repair certain drops at the tail of a transaction without
   any timer.  Notice that neither the conventional duplicate ACK
   threshold [RFC5681], nor [RFC6675], nor the Forward Acknowledgment
   [FACK] algorithm can detect such losses, because of the required
   packet or sequence count.

   Example: LOST RETRANSMIT.  Consider a window of three data packets
   (P1, P2, P3) that are sent; P1 and P2 are dropped.  Suppose the
   transmission of each packet is at least RACK.reo_wnd (1 millisecond
   by default) after the transmission of the previous packet.  When P3
   is SACKed, RACK will mark P1 and P2 lost and they will be
   retransmitted as R1 and R2.  Suppose R1 is lost again but R2 is
   SACKed; RACK will mark R1 lost for retransmission again.  Again,
   neither the conventional three duplicate ACK threshold approach,
   nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses.  And such a lost retransmission is very common
   when TCP is being rate-limited, particularly by token bucket
   policers with large bucket depth and low rate limit.
   Retransmissions are often lost repeatedly because standard
   congestion control requires multiple round trips to reduce the rate
   below the policed rate.

   Example: SMALL DEGREE OF REORDERING.  Consider a common reordering
   event: a window of packets are sent as (P1, P2, P3).  P1 and P2
   carry a full payload of MSS octets, but P3 has only a 1-octet
   payload.  Suppose the sender has detected reordering previously and
   thus RACK.reo_wnd is min_RTT/4.  Now P3 is reordered and delivered
   first, before P1 and P2.  As long as P1 and P2 are delivered within
   min_RTT/4, RACK will not consider P1 and P2 lost.  But if P1 and P2
   are delivered outside the reordering window, then RACK will still
   falsely mark P1 and P2 lost.  We discuss how to reduce false
   positives at the end of this section.

   The examples above show that RACK is particularly useful when the
   sender is limited by the application, which is common for
   interactive, request/response traffic.  Similarly, RACK still works
   when the sender is limited by the receive window, which is common
   for applications that use the receive window to throttle the
   sender.

   For some implementations (e.g., Linux), RACK works quite
   efficiently with TCP Segmentation Offload (TSO).  RACK always marks
   the entire TSO blob lost because the packets in the same TSO blob
   have the same transmission timestamp.  By contrast, the algorithms
   based on sequence counting (e.g., [RFC6675][RFC5681]) may mark only
   a subset of packets in the TSO blob lost, forcing the stack to
   perform expensive fragmentation of the TSO blob, or to selectively
   tag individual packets lost in the scoreboard.

8.2.  Disadvantages

   RACK requires the sender to record the transmission time of each
   packet sent at a clock granularity of one millisecond or finer.
   TCP implementations that record this already for RTT estimation do
   not require any new per-packet state.  But implementations that are
   not yet recording packet transmission times will need to add
   per-packet internal state (commonly either 4 or 8 octets per packet
   or TSO blob) to track transmission times.  In contrast, the
   conventional [RFC6675] loss detection approach does not require any
   per-packet state beyond the SACK scoreboard.  This absence of
   per-packet state is particularly useful on ultra-low RTT networks
   where the RTT is far less than the sender TCP clock granularity
   (e.g. inside data-centers).

   RACK can easily and optionally support the conventional approach in
   [RFC6675][RFC5681] by resetting the reordering window to zero when
   the threshold is met.  Note that this approach differs slightly
   from [RFC6675], which considers a packet lost when at least
   #DupThresh higher-sequence packets are SACKed.  RACK's approach
   considers a packet lost when at least one higher sequence packet is
   SACKed and the total number of SACKed packets is at least
   DupThresh.  For example, suppose a connection sends 10 packets, and
   packets 3, 5, 7 are SACKed.  [RFC6675] considers packets 1 and 2
   lost.  RACK considers packets 1, 2, 4, 6 lost.
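   The difference can be checked mechanically; the short non-normative
   Python sketch below reproduces the example above:

   # Non-normative sketch contrasting the two rules: 10 packets
   # sent, packets 3, 5 and 7 SACKed, DupThresh = 3.
   DUP_THRESH = 3
   sent = range(1, 11)
   sacked = {3, 5, 7}

   # [RFC6675]: lost if at least DupThresh higher-sequence packets
   # are SACKed.
   rfc6675_lost = [p for p in sent if p not in sacked
                   and sum(1 for s in sacked if s > p) >= DUP_THRESH]

   # RACK with a zero reordering window: lost if any higher-sequence
   # packet is SACKed and at least DupThresh packets are SACKed.
   rack_lost = [p for p in sent if p not in sacked
                and len(sacked) >= DUP_THRESH
                and any(s > p for s in sacked)]

   print(rfc6675_lost)  # [1, 2]
   print(rack_lost)     # [1, 2, 4, 6]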
8.3.  Adjusting the reordering window

   When the sender detects packet reordering, RACK uses a reordering
   window of min_rtt / 4.  It uses the minimum RTT to accommodate
   reordering introduced by packets traversing slightly different
   paths (e.g., router-based parallelism schemes) or out-of-order
   deliveries in the lower link layer (e.g., wireless links using
   link-layer retransmission).  RACK uses a quarter of minimum RTT
   because Linux TCP used the same factor in its implementation to
   delay Early Retransmit [RFC5827] to reduce spurious loss detections
   in the presence of reordering, and experience shows that this seems
   to work reasonably well.  We have evaluated using the smoothed RTT
   (SRTT from [RFC6298] RTT estimation) or the most recently measured
   RTT (RACK.rtt) in an experiment similar to that in the Performance
   Evaluation section.  Neither makes any significant difference in
   terms of total recovery latency.

8.4.  Relationships with other loss recovery algorithms

   The primary motivation of RACK is to ultimately provide a simple
   and general replacement for some of the standard loss recovery
   algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some
   nonstandard ones [FACK][THIN-STREAM].  While RACK can be a
   supplemental loss detection mechanism on top of these algorithms,
   this is not necessary, because RACK implicitly subsumes most of
   them.

   [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate
   ACK threshold based on the current or previous flight sizes.  RACK
   takes a different approach, by using only one ACK event and a
   reordering window.  RACK can be seen as an extended Early
   Retransmit [RFC5827] without a FlightSize limit but with an
   additional reordering window.  [FACK] considers an original packet
   to be lost when its sequence range is sufficiently far below the
   highest SACKed sequence.  In some sense RACK can be seen as a
   generalized form of FACK that operates in time space instead of
   sequence space, enabling it to better handle reordering,
   application-limited traffic, and lost retransmissions.
   Nevertheless RACK is still an experimental algorithm, while the
   oldest loss detection algorithm, the 3 duplicate ACK threshold
   [RFC5681], has been standardized and widely deployed.  RACK can
   therefore easily and optionally support the conventional approach
   for compatibility.

   RACK is compatible with and does not interfere with the standard
   RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel
   algorithms [RFC3522].  This is because RACK only detects loss by
   using ACK events.  It neither changes the RTO timer calculation nor
   detects spurious timeouts.

   Furthermore, RACK naturally works well with Tail Loss Probe [TLP]
   because a tail loss probe solicits either an ACK or SACK, which can
   be used by RACK to detect more losses.  RACK can be used to relax
   TLP's requirement for using FACK and retransmitting the
   highest-sequenced packet, because RACK is agnostic to packet
   sequence numbers, and uses transmission time instead.  Thus TLP
   could be modified to retransmit the first unacknowledged packet,
   which could improve application latency.

8.5.  Interaction with congestion control

   RACK intentionally decouples loss detection from congestion
   control.  RACK only detects losses; it does not modify the
   congestion control algorithm [RFC5681][RFC6937].  A packet marked
   lost by RACK SHOULD NOT be retransmitted until congestion control
   deems this appropriate.  Specifically, Proportional Rate Reduction
   [RFC6937] SHOULD be used when using RACK.

   RACK is applicable for both fast recovery and recovery after a
   retransmission timeout (RTO) in [RFC5681].  RACK applies equally to
   fast recovery and RTO recovery because RACK is purely based on the
   transmission time order of packets.  When a packet retransmitted by
   RTO is acknowledged, RACK will mark any unacked packet sent
   sufficiently prior to the RTO as lost, because at least one RTT has
   elapsed since these packets were sent.

   RACK may detect losses faster or slower than the conventional
   duplicate ACK threshold approach does.  RACK can detect losses
   faster by not requiring three DUPACKs, so congestion control may
   reduce the congestion window earlier.  When the network path has
   both reordering and losses, RACK detects losses slower by waiting
   for the reordering window to expire.  TCP may continue to increase
   the congestion window upon receiving ACKs during this time, making
   the sender more aggressive.  Certain congestion control algorithms
   can benefit from accounting for this increase in the congestion
   window during the reordering window.

8.5.1.  Example: interactions with congestion control

   The following simple example compares how RACK and non-RACK loss
   detection interact with congestion control: suppose a TCP sender
   has a congestion window (cwnd) of 20 packets on a SACK-enabled
   connection.  It sends 10 data packets and all of them are lost.

   Without RACK, the sender would time out, reset cwnd to 1, and
   retransmit the first packet.  It would take four round trips (1 + 2
   + 4 + 3 = 10) to retransmit all the 10 lost packets using slow
   start.  The recovery latency would be RTO + 4*RTT, with an ending
   cwnd of 4 packets due to congestion window validation.

   With RACK, a sender would send the TLP after 2*RTT and get a
   DUPACK.  If the sender implements Proportional Rate Reduction
   [RFC6937], it would slow start to retransmit the remaining 9 lost
   packets since the number of packets in flight (0) is lower than the
   slow start threshold (10).  The slow start would again take four
   round trips (1 + 2 + 4 + 3 = 10).  The recovery latency would be
   2*RTT + 4*RTT, with an ending cwnd set to the slow start threshold
   of 10 packets.
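   The latency difference can be made concrete with a small
   non-normative calculation, assuming a hypothetical RTT of 50ms and
   the 1-second minimum RTO of [RFC6298]:

   # Non-normative arithmetic for the example above.
   rtt, min_rto = 0.050, 1.0
   without_rack = min_rto + 4 * rtt  # RTO + 4*RTT   = 1.2 seconds
   with_rack = 2 * rtt + 4 * rtt     # 2*RTT + 4*RTT = 0.3 seconds
   print(without_rack, with_rack)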
   In both cases, the sender after the recovery would be in congestion
   avoidance.  The difference in recovery latency (RTO + 4*RTT vs
   6*RTT) can be significant if the RTT is much smaller than the
   minimum RTO (1 second in [RFC6298]) or if the RTT is large.  The
   former case is common in local area networks, data-center networks,
   or content distribution networks with deep deployments.  The latter
   case is more common in developing regions with highly congested
   and/or high-latency networks.  The ending congestion window after
   recovery also impacts subsequent data transfer.

8.6.  TLP recovery detection with delayed ACKs

   Delayed ACKs complicate the detection of repairs done by TLP, since
   with a delayed ACK the sender receives one fewer ACK than would
   normally be expected.  To mitigate this complication, before
   sending a TLP loss probe retransmission, the sender should attempt
   to wait long enough that the receiver has sent any delayed ACKs
   that it is withholding.  The sender algorithm described above
   features such a delay, in the form of WCDelAckT.  Furthermore, if
   the receiver supports duplicate selective acknowledgments (D-SACKs)
   [RFC2883] then in the case of a delayed ACK the sender's TLP
   recovery detection algorithm (see above) can use the D-SACK
   information to infer that the original and TLP retransmission both
   arrived at the receiver.

   If there is ACK loss or a delayed ACK without a D-SACK, then this
   algorithm is conservative, because the sender will reduce cwnd when
   in fact there was no packet loss.  In practice this is acceptable,
   and potentially even desirable: if there is reverse path congestion
   then reducing cwnd can be prudent.

8.7.  RACK for other transport protocols

   RACK can be implemented in other transport protocols.  The
   algorithm can be simplified by skipping step 3 if the protocol can
   support a unique transmission or packet identifier (e.g. TCP
   timestamp options [RFC7323]).  For example, the QUIC protocol
   implements RACK [QUIC-LR].  The [Sprout] loss detection algorithm
   was also independently designed to use a 10ms reordering window to
   improve its loss detection.

9.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, for both connections to
   users in the Internet and internally.  We conducted a performance
   evaluation experiment for RACK and TLP on a small set of Google Web
   servers in Western Europe that serve mostly European and some
   African countries.  The experiment lasted three days in March 2017.
   The servers were divided evenly into four groups of roughly 5.3
   million flows each:

   Group 1 (control): RACK off, TLP off, RFC 6675 on

   Group 2: RACK on, TLP off, RFC 6675 on

   Group 3: RACK on, TLP on, RFC 6675 on

   Group 4: RACK on, TLP on, RFC 6675 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.
9.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, both for connections to
   users in the Internet and for internal traffic.  We conducted a
   performance evaluation experiment for RACK and TLP on a small set of
   Google Web servers in Western Europe that serve mostly European and
   some African countries.  The experiment lasted three days in March
   2017.  The servers were divided evenly into four groups of roughly
   5.3 million flows each:

   Group 1 (control): RACK off, TLP off, RFC 6675 on

   Group 2: RACK on, TLP off, RFC 6675 on

   Group 3: RACK on, TLP on, RFC 6675 on

   Group 4: RACK on, TLP on, RFC 6675 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.  In terms
   of specific recovery features, all groups enabled F-RTO [RFC5682]
   but disabled FACK, because FACK is not specified in an IETF RFC and
   the goal of this setup is to compare RACK and TLP with RFC-based
   loss recoveries.  Since TLP depends on either FACK or RACK, we could
   not run a group that enables TLP alone (with both RACK and FACK
   disabled).  Group 4 tests whether RACK plus TLP can completely
   replace the DupThresh-based loss detection of [RFC6675].

   The servers sit behind a load balancer that distributes the
   connections evenly across the four groups.

   Each group handles a similar number of connections and sends and
   receives similar amounts of data.  We compare the total time spent
   in loss recovery across groups.  The recovery time is measured from
   when the recovery and retransmission start until the remote host has
   acknowledged the highest sequence (SND.NXT) at the time the recovery
   started.  Therefore the recovery time includes both fast recoveries
   and timeout recoveries.

   Our data shows that Group 2's recovery latency is only 0.3% lower
   than Group 1's.  But Group 3's recovery latency is 25% lower than
   Group 1's, due to a 40% reduction in RTO-triggered recoveries.
   Therefore it is important to implement both TLP and RACK for
   performance.  Group 4's total recovery latency is 0.02% lower than
   Group 3's, indicating that RACK plus TLP can successfully replace
   RFC 6675 as a standalone recovery mechanism.

   We want to emphasize that the current experiment is limited in terms
   of network coverage.  The connectivity in Western Europe is fairly
   good, so loss recovery is not a major performance bottleneck there.
   We plan to expand our experiments to regions with worse
   connectivity, in particular to networks with strong traffic
   policing.

10.  Security Considerations

   RACK does not change the risk profile for TCP.

   An interesting scenario is ACK-splitting attacks [SCWA99]: for an
   MSS-size packet sent, the receiver or the attacker might send MSS
   ACKs that SACK or acknowledge one additional byte per ACK.  This
   would not fool RACK.  RACK.xmit_ts would not advance, because all
   the sequences of the packet were transmitted at the same time and
   therefore carry the same transmission timestamp.  In other words,
   SACKing only one byte of a packet and SACKing the packet in its
   entirety have the same effect on RACK.
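
   To make the ACK-splitting argument concrete, the Python sketch below
   (with hypothetical names of our own; not part of this document)
   shows that one-byte SACKs advance RACK.xmit_ts no further than a
   single SACK covering the entire packet would.

   <CODE BEGINS>
   class Packet:
       def __init__(self, start_seq, end_seq, xmit_ts):
           self.start_seq = start_seq
           self.end_seq = end_seq
           self.xmit_ts = xmit_ts  # one send time for the whole packet

   rack_xmit_ts = 0.0

   def on_sack(packet):
       """Update RACK.xmit_ts for a SACK covering any part of a
       packet; the number of bytes SACKed is irrelevant."""
       global rack_xmit_ts
       rack_xmit_ts = max(rack_xmit_ts, packet.xmit_ts)

   pkt = Packet(start_seq=1, end_seq=1461, xmit_ts=100.0)
   for _ in range(1460):         # 1460 one-byte ACK-splitting SACKs...
       on_sack(pkt)
   assert rack_xmit_ts == 100.0  # ...have the same effect as one SACK
   <CODE ENDS>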
11.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

12.  Acknowledgments

   The authors thank Matt Mathis for his insights on FACK and Michael
   Welzl for his per-packet timer idea that inspired this work.  Eric
   Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
   Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
   Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, and Martin
   Duke contributed to the draft or the implementations in Linux,
   FreeBSD, Windows, and QUIC.

13.  References

13.1.  Normative References

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
              Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC4737]  Morton, A., Ciavattone, L., Ramachandran, G., Shalunov,
              S., and J. Perser, "Packet Reordering Metrics", RFC 4737,
              November 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
              Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, April 2010.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
              M., and Y. Nishida, "A Conservative Loss Recovery
              Algorithm Based on Selective Acknowledgment (SACK) for
              TCP", RFC 6675, August 2012.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, May 2013.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              RFC 7323, September 2014.

   [RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
              September 1981.

13.2.  Informative References

   [FACK]     Mathis, M. and J. Mahdavi, "Forward acknowledgement:
              refining TCP congestion control", ACM SIGCOMM Computer
              Communication Review, Volume 26, Issue 4, October 1996.

   [POLICER16]
              Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
              Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
              "An Analysis of Traffic Policing in the Web", ACM
              SIGCOMM, 2016.

   [QUIC-LR]  Iyengar, J. and I. Swett, "QUIC Loss Recovery And
              Congestion Control", draft-tsvwg-quic-loss-recovery-01
              (work in progress), June 2016.

   [RFC7765]  Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl,
              "TCP and SCTP RTO Restart", RFC 7765, February 2016.

   [SCWA99]   Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
              "TCP Congestion Control With a Misbehaving Receiver", ACM
              Computer Communication Review, 29(5), 1999.

   [Sprout]   Winstein, K., Sivaraman, A., and H. Balakrishnan,
              "Stochastic Forecasts Achieve High Throughput and Low
              Delay over Cellular Networks", USENIX Symposium on
              Networked Systems Design and Implementation (NSDI), 2013.

   [THIN-STREAM]
              Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
              "TCP enhancements for interactive thin-stream
              applications", NOSSDAV, 2008.

   [TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
              "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of
              Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work
              in progress), August 2013.

Authors' Addresses

   Yuchung Cheng
   Google, Inc

   Email: ycheng@google.com

   Neal Cardwell
   Google, Inc

   Email: ncardwell@google.com

   Nandita Dukkipati
   Google, Inc

   Email: nanditad@google.com

   Priyaranjan Jha
   Google, Inc

   Email: priyarjha@google.com