2 TCP Maintenance Working Group                                  Y. Cheng
3 Internet-Draft                                               N. Cardwell
4 Intended status: Standards Track                            N. Dukkipati
5 Expires: September 10, 2020                                       P. Jha
6                                                               Google, Inc
7                                                             March 9, 2020

9       RACK: a time-based fast loss detection algorithm for TCP
10                        draft-ietf-tcpm-rack-08

12 Abstract

14 This document presents a new TCP loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time, instead of packet or sequence counts, to detect losses, for modern TCP implementations that can support per-packet timestamps and the selective acknowledgment (SACK) option. It is intended to be an alternative to the DUPACK threshold approach of RFC 6675, as well as other nonstandard approaches such as Forward Acknowledgment (FACK).
22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at https://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on September 10, 2020. 39 Copyright Notice 41 Copyright (c) 2020 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (https://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 1. Terminology 56 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 57 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 58 "OPTIONAL" in this document are to be interpreted as described in BCP 59 14 [RFC2119] [RFC8174] when, and only when, they appear in all 60 capitals, as shown here. In this document, these words will appear 61 with that interpretation only when in UPPER CASE. Lower case uses of 62 these words are not to be interpreted as carrying [RFC2119] 63 significance. 65 2. 
Introduction

67 This document presents a new loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time instead of the conventional packet or sequence counting approaches for detecting losses. RACK deems a packet lost if it has not been delivered and some packet sent sufficiently later has been delivered. It does this by recording packet transmission times and inferring losses using cumulative acknowledgments or selective acknowledgment (SACK) TCP options.

76 In recent years we have observed several increasingly common loss and reordering patterns in the Internet:

79 1. Slow recovery due to lost retransmissions. Traffic policers [POLICER16] and burst losses often cause retransmissions to be lost again. This severely increases latency because the lost retransmissions can only be recovered by retransmission timeouts (RTOs).

85 2. Tail drops. Structured request-response traffic turns more losses into tail drops. In such cases, TCP is application-limited, so it cannot send new data to probe losses and has to rely on retransmission timeouts (RTOs).

90 3. Reordering. Link-layer protocols (e.g., 802.11 block ACK), link bonding, or routers' internal load-balancing can deliver TCP packets out of order. The degree of such reordering is usually within the order of the path round trip time.

95 Although TCP stacks (e.g., Linux) implement many of the standard and proposed loss detection algorithms [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM], we have found that together they do not perform well. The main reason is that many of them are based on the classic rule of counting duplicate acknowledgments [RFC5681]. They can detect loss either quickly or accurately, but not both, especially when the sender is application-limited or when reordering is unpredictable.
And under these conditions none of them can detect lost retransmissions well.

107 Also, these algorithms, including the RFCs, rarely address interactions with other algorithms. For example, FACK may consider a packet lost while RFC6675 may not. Implementing N algorithms while dealing with N^2 interactions is a daunting and error-prone task.

113 The goal of RACK is to solve these problems by replacing many of the loss detection algorithms above with a single, more effective algorithm that handles both loss and reordering.

117 3. Overview

119 The main idea behind RACK is that if a packet has been delivered out of order, then the packets sent chronologically before it were either lost or reordered. This concept is not fundamentally different from [RFC5681][RFC6675][FACK]. But the key innovation in RACK is to use a per-packet transmission timestamp and the widely deployed SACK options to conduct time-based inferences, instead of inferring losses with packet or sequence counting approaches.

127 Using a threshold for counting duplicate acknowledgments (i.e., DupThresh) alone is no longer reliable because of today's prevalent reordering patterns. A common type of reordering is that the last "runt" packet of a window's worth of packet bursts gets delivered first, then the rest arrive shortly after in order. To handle this effectively, a sender would need to constantly adjust the DupThresh to the burst size; but this would risk increasing the frequency of RTOs on real losses.

136 Today's prevalent lost retransmissions also cause problems with packet-counting approaches [RFC5681][RFC6675][FACK], since those approaches depend on reasoning in sequence number space. Retransmissions break the direct correspondence between ordering in sequence space and ordering in time.
So when retransmissions are lost, sequence-based approaches are often unable to infer and quickly repair losses that can be deduced with time-based approaches.

144 Instead of counting packets, RACK uses the most recently delivered packet's transmission time to judge if some packets sent prior to that time have "expired" by passing a certain reordering settling window. On each ACK, RACK marks any already-expired packets lost, and for any packets that have not yet expired it waits until the reordering window passes and then marks those lost as well. In either case, RACK can repair the loss without waiting for a (long) RTO. RACK can be applied to both fast recovery and timeout recovery, and can detect losses on both originally transmitted and retransmitted packets, making it a great all-weather loss detection mechanism.

156 4. Design Rationale for Reordering Tolerance

158 The reordering behavior of networks can evolve (over years) in response to the behavior of transport protocols and applications, as well as the needs of network designers and operators. From a network or link designer's viewpoint, parallelization (e.g., link bonding) is the easiest way to get a network to go faster. Therefore the designer's main constraint on speed is reordering, and there is pressure to relax that constraint. If RACK becomes widely deployed, the underlying networks may introduce more reordering for higher throughput. But this may result in excessive reordering that hurts end-to-end performance:

169 1. End host packet processing: extreme reordering on high-speed networks would incur high CPU cost by greatly reducing the effectiveness of aggregation mechanisms, such as large receive offload (LRO) and generic receive offload (GRO), and significantly increasing the number of ACKs.

175 2.
Congestion control: TCP congestion control implicitly assumes that the feedback from ACKs is from the same bottleneck. Therefore it cannot handle scenarios well in which packets traverse largely disjoint paths.

180 3. Loss recovery: Having an excessively large reordering window to accommodate widely different latencies from different paths would increase the latency of loss recovery.

184 An end-to-end transport protocol cannot tell immediately whether a hole is reordering or loss. It can only distinguish between the two in hindsight, if the hole in the sequence space gets filled later without a retransmission. How long the sender waits for such potential reordering events to settle is determined by the current reordering window.

191 Given these considerations, a core design philosophy of RACK is to adapt to the measured duration of reordering events, within reasonable and specific bounds. To accomplish this, RACK places the following mandates on the reordering window:

196 1. The initial RACK reordering window SHOULD be set to a small fraction of the round-trip time.

199 2. If no reordering has been observed, then RACK SHOULD honor the classic 3-DUPACK rule for initiating fast recovery. One simple way to implement this is to temporarily override the reorder window to 0.

204 3. The RACK reordering window SHOULD leverage Duplicate Selective Acknowledgement (DSACK) information [RFC3708] to adaptively estimate the duration of reordering events.

208 4. The RACK reordering window MUST be bounded, and this bound SHOULD be one round trip.

211 As a flow starts, either condition 1 or condition 2 or both would trigger RACK to start the recovery process quickly. The low initial reordering window and use of the 3-DUPACK rule are key to achieving low-latency loss recovery for short flows by risking spurious retransmissions to recover losses quickly.
The rationale is that spurious retransmissions for short flows are not expected to produce excessive network traffic.

219 For long flows the design tolerates reordering within a round trip. This handles reordering caused by path divergence in small time scales (reordering within the round-trip time of the shortest path), which should tolerate much of the reordering from link bonding, multipath routing, or link-layer out-of-order delivery. It also relaxes ordering constraints to allow sending flights of TCP packets on different paths dynamically for better load-balancing (e.g. flowlets).

228 However, the fact that the initial RACK reordering window is low, and the RACK reordering window's adaptive growth is bounded, means that there will continue to be a cost to reordering and a limit to RACK's adaptation to reordering. This maintains a disincentive for network designers and operators to introduce needless or excessive reordering, particularly because they have to allow for low round-trip-time paths. This means RACK will not encourage networks to perform inconsiderate fine-grained packet-spraying over highly disjoint paths with very different characteristics. There are good alternative solutions, such as MPTCP, for such networks.

239 To conclude, the RACK algorithm aims to adapt to small degrees of reordering, quickly recover most losses within one to two round trips, and avoid costly retransmission timeouts (RTOs). In the presence of reordering, the adaptation algorithm can impose sometimes-needless delays when it waits to disambiguate loss from reordering, but the penalty for waiting is bounded to one round trip, and such delays are confined to longer-running flows.

247 5. Requirements

249 The reader is expected to be familiar with the definitions given in the TCP congestion control [RFC5681] and selective acknowledgment [RFC2018] RFCs.
Familiarity with the conservative SACK-based recovery for TCP [RFC6675] is not required but helpful.

254 RACK has three requirements:

256 1. The connection MUST use selective acknowledgment (SACK) options [RFC2018].

259 2. For each packet sent, the sender MUST store its most recent transmission time with (at least) millisecond granularity. For round-trip times lower than a millisecond (e.g., intra-datacenter communications), microsecond granularity would significantly reduce the detection latency, but it is not required.

265 3. For each packet sent, the sender MUST remember whether the packet has been retransmitted or not.

268 We assume that requirement 1 implies the sender keeps a SACK scoreboard, which is a data structure to store selective acknowledgment information on a per-connection basis ([RFC6675] section 3). For ease of explaining the algorithm, we use a pseudo-scoreboard that manages the data in sequence number ranges. But the specifics of the data structure are left to the implementor.

275 RACK does not require any changes at the receiver.

277 6. Definitions

279 The reader is expected to be familiar with the definitions given in [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

282 6.1. Definitions of variables

284 A sender implementing RACK needs to store these new RACK variables:

286 "Packet.xmit_ts" is the time of the last transmission of a data packet, including retransmissions, if any. The sender needs to record the transmission time for each packet sent and not yet acknowledged. The time MUST be stored at millisecond granularity or finer.

292 "RACK.packet". Among all the packets that have been either selectively or cumulatively acknowledged, RACK.packet is the one that was sent most recently, including retransmissions.

296 "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

298 "RACK.end_seq" is the ending TCP sequence number of RACK.packet.
300 "RACK.rtt" is the RTT of the most recently delivered packet on the connection (either cumulatively acknowledged or selectively acknowledged) that was not marked invalid as a possible spurious retransmission.

305 "RACK.rtt_seq" is the value of SND.NXT when RACK.rtt was last updated.

307 "RACK.reo_wnd" is a reordering window computed in the unit of time used for recording packet transmission times. It is used to defer the moment at which RACK marks a packet lost.

311 "RACK.dupthresh" is a constant specifying the number of duplicate acknowledgments, or selectively acknowledged segments, that can (under certain conditions) trigger fast recovery, similar to [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined to be 3.

317 "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the connection.

320 "RACK.ack_ts" is the time when all the sequences in RACK.packet were selectively or cumulatively acknowledged.

323 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd.

325 "RACK.reo_wnd_persist" is the number of loss recoveries before resetting RACK.reo_wnd.

328 "RACK.dsack" indicates whether a DSACK option has been received since the last RACK.reo_wnd change.

331 "RACK.pkts_sacked" is the total number of packets selectively acknowledged in the SACK scoreboard.

334 "RACK.reord" indicates whether the connection has detected a packet reordering event.

336 "RACK.fack" is the highest selectively or cumulatively acknowledged sequence.

339 Note that the Packet.xmit_ts variable is per packet in flight. The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT variables are kept in the per-connection TCP control block. RACK.packet and RACK.ack_ts are used as local variables in the algorithm.

345 7. Algorithm Details

347 7.1. Transmitting a data packet

349 Upon transmitting a new packet or retransmitting an old packet, record the time in Packet.xmit_ts.
RACK does not care if the 351 retransmission is triggered by an ACK, new application data, an RTO, 352 or any other means. 354 7.2. Upon receiving an ACK 356 Step 1: Update RACK.min_RTT. 358 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 359 update the estimated minimum RTT in RACK.min_RTT. The sender can 360 track a simple global minimum of all RTT measurements from the 361 connection, or a windowed min-filtered value of recent RTT 362 measurements. This document does not specify an exact approach. 364 Step 2: Update RACK stats 366 Given the information provided in an ACK, each packet cumulatively 367 ACKed or SACKed is marked as delivered in the scoreboard. Among all 368 the packets newly ACKed or SACKed in the connection, record the most 369 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 370 Sometimes the timestamps of RACK.Packet and Packet could carry the 371 same transmit timestamps due to clock granularity or segmentation 372 offloading (i.e. the two packets were handed to the NIC as a single 373 unit). In that case the sequence numbers of RACK.end_seq and 374 Packet.end_seq are compared to break the tie. 376 Since an ACK can also acknowledge retransmitted data packets, and 377 retransmissions can be spurious, the sender must take care to avoid 378 spurious inferences. For example, if the sender were to use timing 379 information from a spurious retransmission, the RACK.rtt could be 380 vastly underestimated. 382 To avoid spurious inferences, ignore a packet as invalid if any of 383 its TCP sequences have been retransmitted before and either of two 384 conditions is true: 386 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 387 option [RFC7323], if available, indicates the ACK was not 388 acknowledging the last retransmission of the packet. 390 2. The packet was last retransmitted less than RACK.min_rtt ago. 
392 If the ACK is not ignored as invalid, update the RACK.rtt to be the 393 RTT sample calculated using this ACK, and continue. If this ACK or 394 SACK was for the most recently sent packet, then record the 395 RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK. 396 Otherwise exit here and omit the following steps. 398 Notice that the second condition above is a heuristic. This 399 heuristic would fail to update RACK stats if a data packet is 400 spuriously retransmitted because of a recent minimum RTT decrease 401 (e.g. path change). For example, in cases with a TCP connection 402 without TCP timestamps, and where the first M packets in a flight of 403 data packets travel an old (longer) original path, and the remaining 404 N packets in that flight travel a new (shorter) path and arrive out 405 of order and elicit SACKs, then those SACKs for the N packets can 406 initiate a spurious retransmission of the first M packets. In such 407 scenarios, the sender would not be able to update its RACK.min_rtt 408 using the (ambiguous) RTT samples from retransmissions, so during 409 recovery all RTT samples may be less than RACK.min_rtt, and thus meet 410 the second condition. In such cases RACK may not detect losses from 411 ACK events and the recovery would then resort to the (slower) TLP or 412 RTO timer-based recovery. However, such events should be rare and 413 the connection would pick up the new minimum RTT when the recovery 414 ends, so the sender can avoid repeated similar failures. 
416 Step 2 may be summarized in pseudocode as:

418 RACK_sent_after(t1, seq1, t2, seq2):
419     If t1 > t2:
420         Return true
421     Else if t1 == t2 AND seq1 > seq2:
422         Return true
423     Else:
424         Return false

426 RACK_update():
427     For each Packet newly acknowledged cumulatively or selectively:
428         rtt = Now() - Packet.xmit_ts
429         If Packet.retransmitted is TRUE:
430             If ACK.ts_option.echo_reply < Packet.xmit_ts:
431                 Continue
432             If rtt < RACK.min_rtt:
433                 Continue
435         RACK.rtt = rtt
436         If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
437                            RACK.xmit_ts, RACK.end_seq):
438             RACK.xmit_ts = Packet.xmit_ts
                RACK.end_seq = Packet.end_seq

440 Step 3: Detect packet reordering

442 To detect reordering, the sender looks for original data packets being delivered out of order in sequence space. The sender tracks the highest sequence selectively or cumulatively acknowledged in the RACK.fack variable. The name "fack" stands for the most "forward ACK", and originated in the [FACK] draft. If the ACK selectively or cumulatively acknowledges an unacknowledged and also never retransmitted sequence below RACK.fack, then the corresponding packet has been reordered and RACK.reord is set to TRUE.

451 The heuristic above only detects reordering if the reordered packet has not yet been retransmitted. This is a major drawback, because if RACK has a low reordering window and the network is reordering packets, RACK may falsely retransmit frequently. Consequently RACK may fail to detect the reordering, and thus fail to increase the reordering window, because the reordered packets were already (falsely) retransmitted.

458 DSACK [RFC3708] can help mitigate this issue. A false retransmission would solicit a DSACK option in the ACK. Therefore, if the ACK has a DSACK option covering some sequences that were both acknowledged and retransmitted, this implies the original packet was reordered but RACK retransmitted the packet too quickly; the sender should then set RACK.reord to TRUE.
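Returning briefly to Step 2: the RACK_sent_after() comparison is the one place where sequence numbers are still consulted, and only as a tie-breaker. A direct Python transcription (illustrative only, not normative) makes the tie-breaking behavior easy to verify:

```python
def rack_sent_after(t1, seq1, t2, seq2):
    """True if a segment with last-transmit time t1 and ending sequence
    seq1 was sent after one with time t2 and ending sequence seq2.
    Equal timestamps (clock granularity, TSO) are broken by sequence."""
    return t1 > t2 or (t1 == t2 and seq1 > seq2)
```

With TSO, several packets handed to the NIC as one unit share a transmit timestamp; the sequence comparison ensures the packet later in sequence space wins the tie.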
465 RACK_detect_reordering():
466     For each Packet newly acknowledged cumulatively or selectively:
467         If Packet.end_seq > RACK.fack:
468             RACK.fack = Packet.end_seq
469         Else if Packet.end_seq < RACK.fack AND
470                 Packet.retransmitted is FALSE:
471             RACK.reord = TRUE

473     For each Packet covered by the DSACK option:
474         If Packet.retransmitted is TRUE:
475             RACK.reord = TRUE

477 Step 4: Update RACK reordering window

479 To handle the prevalent small degree of reordering, RACK.reo_wnd serves as an allowance for settling time before marking a packet lost. This section documents a detailed algorithm following the design rationale section. RACK starts initially with a conservative window of min_RTT/4. If no reordering has been observed, RACK uses a RACK.reo_wnd of 0 during loss recovery, or when the number of duplicate ACKs exceeds the classic DUPACK threshold, in order to retransmit quickly. The subtle difference between this approach and the conventional one [RFC5681][RFC6675] is discussed later in the section "RACK and TLP Discussion".

490 Further, if DSACK is supported, RACK MAY use DSACK [RFC3708] to adapt the reordering window to higher degrees of reordering. Receiving an ACK with a DSACK indicates a spurious retransmission, which in turn suggests that the RACK reordering window, RACK.reo_wnd, is likely too small. The sender MAY increase the RACK.reo_wnd window linearly for every round trip in which the sender receives a DSACK, so that after N distinct round trips in which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an upper bound of SRTT. The inflated RACK.reo_wnd persists for 16 loss recoveries, after which it resets to its starting value, min_RTT / 4.

501 The following pseudocode implements the above algorithm. Note that extensions that require additional TCP features (e.g. DSACK) would work if the feature functions simply return false.
505 RACK_update_reo_wnd():
506     RACK.min_RTT = TCP_min_RTT()
507     If DSACK option is present:
508         RACK.dsack = TRUE

510     If SND.UNA < RACK.rtt_seq:
511         RACK.dsack = FALSE /* React to DSACK once per round trip */

513     If RACK.dsack:
514         RACK.reo_wnd_incr += 1
515         RACK.dsack = FALSE
516         RACK.rtt_seq = SND.NXT
517         RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
518     Else if exiting loss recovery:
519         RACK.reo_wnd_persist -= 1
520         If RACK.reo_wnd_persist <= 0:
521             RACK.reo_wnd_incr = 1

523     If RACK.reord is FALSE:
524         If in loss recovery: /* If in fast or timeout recovery */
525             RACK.reo_wnd = 0
526             Return
527         Else if RACK.pkts_sacked >= RACK.dupthresh:
528             RACK.reo_wnd = 0
529             Return
530     RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
531     RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

533 Step 5: Detect losses.

535 For each packet that has not been SACKed, if RACK.xmit_ts is after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its corresponding sequence range) lost in the scoreboard. The rationale is that if another packet that was sent later has been delivered, and the reordering window or "reordering settling time" has already passed, then the packet was likely lost.

542 If another packet that was sent later has been delivered, but the reordering window has not passed, then it is not yet safe to deem the unacked packet lost. Using the basic algorithm above, the sender would wait for the next ACK to further advance RACK.xmit_ts; but this risks a timeout (RTO) if no more ACKs come back (e.g., due to losses or application limit). For timely loss detection, the sender MAY install a "reordering settling" timer set to fire at the earliest moment at which it is safe to conclude that some packet is lost. The earliest moment is the time it takes to expire the reordering window of the earliest unacked packet in flight.

553 This timer expiration value can be derived as follows.
As a starting 554 point, we consider that the reordering window has passed if the 555 RACK.packet was sent sufficiently after the packet in question, or a 556 sufficient time has elapsed since the RACK.packet was S/ACKed, or 557 some combination of the two. More precisely, RACK marks a packet as 558 lost if the reordering window for a packet has elapsed through the 559 sum of: 561 1. delta in transmit time between a packet and the RACK.packet 563 2. delta in time between RACK.ack_ts and now 565 So we mark a packet as lost if: 567 RACK.xmit_ts >= Packet.xmit_ts 568 AND 569 (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd 571 If we solve this second condition for "now", the moment at which we 572 can declare a packet lost, then we get: 574 now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts) 576 Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the sender 577 used to set RACK.xmit_ts: the round trip time of the most recently 578 (re)transmitted packet that's been delivered. To be more robust to 579 reordering, RACK uses a more conservative RTT value to decide if an 580 unacknowledged packet should be considered lost, RACK.rtt: the round 581 trip time of the most recently delivered packet on the connection 582 that was not marked invalid as a possible spurious retransmission. 584 When packets are delivered in order, the most recently 585 (re)transmitted packet that's been delivered is also the most 586 recently delivered, hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts. 587 But if packets were reordered, then the packet delivered most 588 recently was sent before the most recently (re)transmitted packet. 589 Hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts). 
591 Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous equation reduces to saying that the sender can declare a packet lost when:

594     now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

596 In turn, that is equivalent to stating that a RACK sender should declare a packet lost when:

599     Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

600 The following pseudocode implements the algorithm above. When an ACK is received or the RACK timer expires, call RACK_detect_loss(). The algorithm includes an additional optimization to break timestamp ties by using the TCP sequence space. The optimization is particularly useful to detect losses in a timely manner with TCP Segmentation Offload (TSO), where multiple packets in one TSO blob have identical timestamps.

608 RACK_detect_loss():
609     timeout = 0

611     For each packet, Packet, not acknowledged yet:
612         If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
613             Continue /* Packet lost but not yet retransmitted */

615         If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
616                            Packet.xmit_ts, Packet.end_seq):
617             remaining = Packet.xmit_ts + RACK.rtt +
618                         RACK.reo_wnd - Now()
619             If remaining <= 0:
620                 Packet.lost = TRUE
621             Else:
622                 timeout = max(remaining, timeout)

624     If timeout != 0:
625         Arm a timer to call RACK_detect_loss() after timeout

627 Implementation optimization: looping through packets in the SACK scoreboard above could be very costly on large-BDP networks since the inflight could be very large. If the implementation can organize the scoreboard data structures to have packets sorted by the last (re)transmission time, then the loop can start on the least recently sent packet and abort on the first packet sent after RACK.xmit_ts. This can be implemented by using a separate doubly-linked list sorted in time order.
The implementation inserts the packet at the tail of 635 the list when it is (re)transmitted, and removes a packet from the 636 list when it is delivered or marked lost. We RECOMMEND such an 637 optimization because it enables implementations to support high-BDP 638 networks. This optimization is implemented in Linux and sees orders 639 of magnitude improvement in CPU usage on high-speed WAN networks. 641 7.3. Tail Loss Probe: fast recovery for tail losses 643 This section describes a supplemental algorithm, Tail Loss Probe 644 (TLP), which leverages RACK to further reduce RTO recoveries. TLP 645 triggers fast recovery to quickly repair tail losses that can 646 otherwise be recovered only via RTOs. After an original data 647 transmission, TLP sends a probe data segment within one to two RTTs. 649 The probe data segment can either be new, previously unsent data, or 650 a retransmission of previously sent data just below SND.NXT. In 651 either case the goal is to elicit more feedback from the receiver, in 652 the form of an ACK (potentially with SACK blocks), to allow RACK to 653 trigger fast recovery instead of an RTO. 655 An RTO occurs when the first unacknowledged sequence number is not 656 acknowledged after a conservative period of time has elapsed 657 [RFC6298]. Common causes of RTOs include: 659 1. The entire flight of data is lost. 661 2. Tail losses of data segments at the end of an application 662 transaction. 664 3. Tail losses of ACKs at the end of an application transaction. 666 4. Lost retransmits, which can halt fast recovery based on [RFC6675] 667 if the ACK stream completely dries up. For example, consider a 668 window of three data packets (P1, P2, P3) that are sent; P1 and 669 P2 are dropped. On receipt of a SACK for P3, RACK marks P1 and 670 P2 as lost and retransmits them as R1 and R2. Suppose R1 and R2 671 are lost as well, so there are no more returning ACKs to detect 672 R1 and R2 as lost. Recovery stalls. 674 5. 
An unexpectedly long round-trip time (RTT). This can cause ACKs 675 to arrive after the RTO timer expires. The F-RTO algorithm 676 [RFC5682] is designed to detect such spurious retransmission 677 timeouts and at least partially undo the consequences of such 678 events, but F-RTO cannot be used in many situations. 680 7.4. Tail Loss Probe: An Example 682 Following is an example of TLP. All events listed are at a TCP 683 sender. 685 1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is 686 no more new data to transmit. A TLP is scheduled to be sent 2 687 RTTs after the transmission of the 10th segment. 689 2. Sender receives acknowledgements (ACKs) for segments 1-5; 690 segments 6-10 are lost and no ACKs are received. The sender 691 reschedules its TLP at a time relative to the last received ACK, 692 which is the ACK for segment 5 in this case. The sender sets the 693 time for the TLP using the calculation described in step (2) of 694 the algorithm. 696 3. When the TLP timer fires, sender retransmits segment 10. 698 4. After an RTT, a SACK for packet 10 arrives. The ACK also carries 699 SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based 700 loss recovery. 702 5. The connection enters fast recovery and retransmits the remaining 703 lost segments. 705 7.5. Tail Loss Probe Algorithm Details 707 We define the terminology used in specifying the TLP algorithm: 709 FlightSize: amount of outstanding data in the network, as defined in 710 [RFC5681]. 712 RTO: The transport's retransmission timeout (RTO) is based on 713 measured round-trip times (RTT) between the sender and receiver, as 714 specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer 715 event indicating that an ACK is overdue and the sender should try to 716 transmit a TLP. Its value is constrained to be smaller than or equal 717 to an RTO. 719 SRTT: smoothed round-trip time, computed as specified in [RFC6298]. 
720 TLPRxtOut: a boolean indicating whether there is an unacknowledged 721 TLP retransmission. 723 TLPHighRxt: the value of SND.NXT at the time of sending a TLP 724 retransmission. 726 WCDelAckT: maximum delayed ACK timer value. 728 The TLP algorithm has three phases, which we discuss in turn. 730 7.5.1. Phase 1: Scheduling a loss probe 732 Step 1: Check conditions for scheduling a PTO. 734 A sender should check to see if it should schedule a PTO in the 735 following situations: 737 1. After transmitting new data that was not itself a TLP probe 739 2. Upon receiving an ACK that cumulatively acknowledges data 741 A sender should schedule a PTO only if all of the following 742 conditions are met: 744 1. The connection supports SACK [RFC2018] 745 2. The connection has no SACKed sequences in the SACK scoreboard 747 3. The connection is not in loss recovery 749 If a PTO can be scheduled according to these conditions, the sender 750 should schedule a PTO. If there was a previously scheduled PTO or 751 RTO pending, then that pending PTO or RTO should first be cancelled, 752 and then the new PTO should be scheduled. 754 If a PTO cannot be scheduled according to these conditions, then the 755 sender MUST arm the RTO timer if there is unacknowledged data in 756 flight. 758 Step 2: Select the duration of the PTO. 760 A sender SHOULD use the following logic to select the duration of a 761 PTO: 763 TLP_timeout(): 764 If SRTT is available: 765 PTO = 2 * SRTT 766 If FlightSize = 1: 767 PTO += WCDelAckT 768 Else: 769 PTO = 1 sec 771 If Now() + PTO > TCP_RTO_expire(): 772 PTO = TCP_RTO_expire() - Now() 774 Aiming for a PTO value of 2*SRTT allows a sender to wait long enough 775 to know that an ACK is overdue. Under normal circumstances, i.e. no 776 losses, an ACK typically arrives in one SRTT. 
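As a non-normative illustration, TLP_timeout() above maps to the following Python sketch. The absolute rto_expire time stands in for TCP_RTO_expire(), and the 200ms WCDelAckT default comes from this section.

```python
WCDELACKT = 0.2  # worst-case delayed ACK timer; RECOMMENDED 200ms

def tlp_timeout(srtt, flight_size, now, rto_expire):
    """Illustrative PTO selection following TLP_timeout() above.

    srtt is None when no RTT sample is available yet; rto_expire is
    the absolute time at which the RTO timer would fire.
    """
    if srtt is not None:
        pto = 2 * srtt
        if flight_size == 1:
            # A single outstanding segment may sit behind the
            # receiver's delayed ACK timer; wait it out.
            pto += WCDELACKT
    else:
        pto = 1.0  # 1 second when no SRTT is available
    # Never schedule the probe later than the RTO would fire.
    if now + pto > rto_expire:
        pto = rto_expire - now
    return pto
```

The final clamp implements the rule that a probe scheduled after the RTO would be useless, so the earlier of the two deadlines wins.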
But choosing PTO to be 777 exactly an SRTT is likely to generate spurious probes given that 778 network delay variance and even end-system timings can easily push an 779 ACK to be above an SRTT. We chose PTO to be the next integral 780 multiple of SRTT. 782 WCDelAckT stands for worst case delayed ACK timer. When FlightSize 783 is 1, PTO is inflated by WCDelAckT time to compensate for a potential 784 long delayed ACK timer at the receiver. The RECOMMENDED value for 785 WCDelAckT is 200ms. 787 Finally, if the time at which an RTO would fire (here denoted 788 "TCP_RTO_expire") is sooner than the computed time for the PTO, then 789 a probe is scheduled to be sent at that earlier time. 791 7.5.2. Phase 2: Sending a loss probe 793 When the PTO fires, transmit a probe data segment: 795 TLP_send_probe(): 796 If an unsent segment exists AND 797 the receive window allows new data to be sent: 798 Transmit the lowest-sequence unsent segment of up to SMSS 799 Increment FlightSize by the size of the newly-sent segment 801 Else if TLPRxtOut is not set: 802 Retransmit the highest-sequence segment sent so far 803 TLPRxtOut = true 804 TLPHighRxt = SND.NXT 805 The cwnd remains unchanged 807 When the loss probe is a retransmission, the sender uses the highest- 808 sequence segment sent so far. This is in order to deal with the 809 retransmission ambiguity problem in TCP. Suppose a sender sends N 810 segments, and then retransmits the last segment (segment N) as a loss 811 probe, and then the sender receives a SACK for segment N. As long as 812 the sender waits for any required RACK reordering settling timer to 813 then expire, it doesn't matter if that SACK was for the original 814 transmission of segment N or the TLP retransmission; in either case 815 the arrival of the SACK for segment N provides evidence that the N-1 816 segments preceding segment N were likely lost. 
In the case where 817 there is only one original outstanding segment of data (N=1), the 818 same logic (trivially) applies: an ACK for a single outstanding 819 segment tells the sender the N-1=0 segments preceding that segment 820 were lost. Furthermore, whether there are N>1 or N=1 outstanding 821 segments, there is a question about whether the original last segment 822 or its TLP retransmission was lost; the sender estimates this using 823 TLP recovery detection (see below). 825 Note that whether or not a probe was sent in TLP_send_probe(), the 826 sender MUST arm the RTO timer, not the PTO timer, at the end of 827 TLP_send_probe() if FlightSize is not zero. This ensures that the 828 sender does not send repeated, back-to-back TLP probes. Checking 829 TLPRxtOut prior to sending the loss probe is also critical to avoid 830 TLP loops if an application writes periodically at an interval less 831 than PTO. 833 7.5.3. Phase 3: ACK processing 835 On each incoming ACK, the sender should check the conditions in Step 836 1 of Phase 1 to see if it should schedule (or reschedule) the loss 837 probe timer. 839 7.6. TLP recovery detection 841 If the only loss in an outstanding window of data was the last 842 segment, then a TLP loss probe retransmission of that data segment 843 might repair the loss. TLP recovery detection examines ACKs to 844 detect when the probe might have repaired a loss, and thus allows 845 congestion control to properly reduce the congestion window (cwnd) 846 [RFC5681]. 848 Consider a TLP retransmission episode where a sender retransmits a 849 tail packet in a flight. The TLP retransmission episode ends when 850 the sender receives an ACK with a SEG.ACK above the SND.NXT at the 851 time the episode started (i.e. TLPHighRxt).
During the TLP 852 retransmission episode the sender checks for a duplicate ACK or 853 D-SACK indicating that both the original segment and TLP 854 retransmission arrived at the receiver, meaning there was no loss 855 that needed repairing. If the TLP sender does not receive such an 856 indication before the end of the TLP retransmission episode, then it 857 MUST estimate that either the original data segment or the TLP 858 retransmission was lost, and congestion control MUST react 859 appropriately to that loss as it would any other loss. 861 Since a significant fraction of the hosts that support SACK do not 862 support duplicate selective acknowledgments (D-SACKs) [RFC2883], the 863 TLP algorithm for detecting such lost segments relies only on basic 864 SACK support [RFC2018]. 866 7.6.1. Initializing and resetting state 868 When a connection is created, or suffers a retransmission timeout, or 869 enters fast recovery, it executes the following: 871 TLPRxtOut = false 873 7.6.2. Recording loss probe states 875 Senders MUST only send a TLP loss probe retransmission if TLPRxtOut 876 is false. This ensures that at any given time a connection has at 877 most one outstanding TLP retransmission. This allows the sender to 878 use the algorithm described in this section to estimate whether any 879 data segments were lost. 881 Note that this condition only restricts TLP loss probes that are 882 retransmissions. There may be an arbitrary number of outstanding 883 unacknowledged TLP loss probes that consist of new, previously-unsent 884 data, since the retransmission timeout and fast recovery algorithms 885 are sufficient to detect losses of such probe segments. 887 Upon sending a TLP probe that is a retransmission, the sender sets 888 TLPRxtOut to true and TLPHighRxt to SND.NXT. 890 7.6.3.
Detecting recoveries accomplished by loss probes 892 Step 1: Track ACKs indicating receipt of original and retransmitted 893 segments 895 A sender considers both the original segment and TLP probe 896 retransmission segment as acknowledged if either 1 or 2 is true: 898 1. This is a duplicate acknowledgment (as defined in [RFC5681], 899 section 2), and all of the following conditions are met: 901 1. TLPRxtOut is true 903 2. SEG.ACK == TLPHighRxt 905 3. SEG.ACK == SND.UNA 907 4. the segment contains no SACK blocks for sequence ranges above 908 TLPHighRxt 910 5. the segment contains no data 912 6. the segment is not a window update 914 2. This is an ACK acknowledging a sequence number at or above 915 TLPHighRxt and it contains a D-SACK; i.e. all of the following 916 conditions are met: 918 1. TLPRxtOut is true 920 2. SEG.ACK >= TLPHighRxt 922 3. the ACK contains a D-SACK block 924 If either of the conditions is met, then the sender estimates that 925 the receiver received both the original data segment and the TLP 926 probe retransmission, and so the sender considers the TLP episode to 927 be done, and records that fact by setting TLPRxtOut to false. 929 Step 2: Mark the end of a TLP retransmission episode and detect 930 losses 932 If the sender receives a cumulative ACK for data beyond the TLP loss 933 probe retransmission then, in the absence of reordering on the return 934 path of ACKs, it should have received any ACKs for the original 935 segment and TLP probe retransmission segment. At that time, if the 936 TLPRxtOut flag is still true and thus indicates that the TLP probe 937 retransmission remains unacknowledged, then the sender should presume 938 that at least one of its data segments was lost, so it SHOULD invoke 939 a congestion control response equivalent to fast recovery.
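The two acknowledgment checks in Step 1 can be illustrated as a single non-normative predicate. The state and ack field names below are stand-ins for the sender's TLP state and the parsed incoming ACK, not a prescribed API.

```python
def tlp_probe_acked(state, ack):
    """Return True if Step 1 concludes that both the original segment
    and the TLP retransmission were received (non-normative sketch)."""
    if not state['tlp_rxt_out']:
        return False
    # Condition 1: a pure duplicate ACK at TLPHighRxt.
    dupack = (ack['seg_ack'] == state['tlp_high_rxt'] and
              ack['seg_ack'] == state['snd_una'] and
              not ack['sack_above_tlp_high_rxt'] and
              not ack['has_data'] and
              not ack['is_window_update'])
    # Condition 2: an ACK at/above TLPHighRxt carrying a D-SACK.
    dsack = (ack['seg_ack'] >= state['tlp_high_rxt'] and
             ack['has_dsack'])
    return dupack or dsack

# When the predicate holds, the sender ends the episode:
#   state['tlp_rxt_out'] = False
```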
941 More precisely, on each ACK the sender executes the following: 943 if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { 944 TLPRxtOut = false 945 EnterRecovery() 946 ExitRecovery() 947 } 949 8. RACK and TLP Discussion 951 8.1. Advantages 953 The biggest advantage of RACK is that every data packet, whether it 954 is an original data transmission or a retransmission, can be used to 955 detect losses of the packets sent chronologically prior to it. 957 Example: TAIL DROP. Consider a sender that transmits a window of 958 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 959 transmission of each packet is at least RACK.reo_wnd (1 millisecond 960 by default) after the transmission of the previous packet. RACK will 961 mark P1 as lost when the SACK of P2 is received, and this will 962 trigger the retransmission of P1 as R1. When R1 is cumulatively 963 acknowledged, RACK will mark P3 as lost and the sender will 964 retransmit P3 as R3. This example illustrates how RACK is able to 965 repair certain drops at the tail of a transaction without any timer. 966 Notice that neither the conventional duplicate ACK threshold 967 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 968 algorithm can detect such losses, because of the required packet or 969 sequence count. 971 Example: LOST RETRANSMIT. Consider a window of three data packets 972 (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the 973 transmission of each packet is at least RACK.reo_wnd (1 millisecond 974 by default) after the transmission of the previous packet. When P3 975 is SACKed, RACK will mark P1 and P2 lost and they will be 976 retransmitted as R1 and R2. Suppose R1 is lost again but R2 is 977 SACKed; RACK will mark R1 lost for retransmission again. Again, 978 neither the conventional three duplicate ACK threshold approach, nor 979 [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect 980 such losses. 
And such a lost retransmission is very common when TCP 981 is being rate-limited, particularly by token bucket policers with 982 large bucket depth and low rate limit. Retransmissions are often 983 lost repeatedly because standard congestion control requires multiple 984 round trips to reduce the rate below the policed rate. 986 Example: SMALL DEGREE OF REORDERING. Consider a common reordering 987 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 988 a full payload of MSS octets, but P3 has only a 1-octet payload. 989 Suppose the sender has detected reordering previously and thus 990 RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, 991 before P1 and P2. As long as P1 and P2 are delivered within 992 min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 993 are delivered outside the reordering window, then RACK will still 994 falsely mark P1 and P2 lost. We discuss how to reduce false 995 positives at the end of this section. 997 The examples above show that RACK is particularly useful when the 998 sender is limited by the application, which is common for 999 interactive, request/response traffic. Similarly, RACK still works 1000 when the sender is limited by the receive window, which is common for 1001 applications that use the receive window to throttle the sender. 1003 For some implementations (e.g., Linux), RACK works quite efficiently 1004 with TCP Segmentation Offload (TSO). RACK always marks the entire 1005 TSO blob lost because the packets in the same TSO blob have the same 1006 transmission timestamp. By contrast, the algorithms based on 1007 sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset 1008 of packets in the TSO blob lost, forcing the stack to perform 1009 expensive fragmentation of the TSO blob, or to selectively tag 1010 individual packets lost in the scoreboard. 1012 8.2.
Disadvantages 1014 RACK requires the sender to record the transmission time of each 1015 packet sent at a clock granularity of one millisecond or finer. TCP 1016 implementations that record this already for RTT estimation do not 1017 require any new per-packet state. But implementations that are not 1018 yet recording packet transmission times will need to add per-packet 1019 internal state (commonly either 4 or 8 octets per packet or TSO blob) 1020 to track transmission times. In contrast, the conventional [RFC6675] 1021 loss detection approach does not require any per-packet state beyond 1022 the SACK scoreboard. This is particularly useful on ultra-low RTT 1023 networks where the RTT is far less than the sender TCP clock 1024 granularity (e.g. inside data-centers). 1026 RACK can easily and optionally support the conventional approach in 1027 [RFC6675][RFC5681] by resetting the reordering window to zero when 1028 the threshold is met. Note that this approach differs slightly from 1029 [RFC6675] which considers a packet lost when at least DupThresh 1030 higher-sequence packets are SACKed. RACK's approach considers a 1031 packet lost when at least one higher sequence packet is SACKed and 1032 the total number of SACKed packets is at least DupThresh. For 1033 example, suppose a connection sends 10 packets, and packets 3, 5, 7 1034 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK 1035 considers packets 1, 2, 4, 6 lost. 1037 8.3. Adjusting the reordering window 1039 When the sender detects packet reordering, RACK uses a reordering 1040 window of min_rtt / 4. It uses the minimum RTT to accommodate 1041 reordering introduced by packets traversing slightly different paths 1042 (e.g., router-based parallelism schemes) or out-of-order deliveries 1043 in the lower link layer (e.g., wireless links using link-layer 1044 retransmission). 
RACK uses a quarter of minimum RTT because Linux 1045 TCP used the same factor in its implementation to delay Early 1046 Retransmit [RFC5827] to reduce spurious loss detections in the 1047 presence of reordering, and experience shows that this seems to work 1048 reasonably well. We have evaluated using the smoothed RTT (SRTT from 1049 [RFC6298] RTT estimation) or the most recently measured RTT 1050 (RACK.rtt) using an experiment similar to that in the Performance 1051 Evaluation section. Neither choice makes a significant difference in 1052 terms of total recovery latency. 1054 8.4. Relationships with other loss recovery algorithms 1056 The primary motivation of RACK is to ultimately provide a simple and 1057 general replacement for some of the standard loss recovery algorithms 1058 [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard 1059 ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss 1060 detection mechanism on top of these algorithms, this is not 1061 necessary, because RACK implicitly subsumes most of them. 1063 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 1064 threshold based on the current or previous flight sizes. RACK takes 1065 a different approach, by using only one ACK event and a reordering 1066 window. RACK can be seen as an extended Early Retransmit [RFC5827] 1067 without a FlightSize limit but with an additional reordering window. 1068 [FACK] considers an original packet to be lost when its sequence 1069 range is sufficiently far below the highest SACKed sequence. In some 1070 sense RACK can be seen as a generalized form of FACK that operates in 1071 time space instead of sequence space, enabling it to better handle 1072 reordering, application-limited traffic, and lost retransmissions.
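The contrast can be illustrated with two non-normative predicates: a FACK-style check in sequence space and the RACK check in time space. DupThresh and all parameter names here are illustrative stand-ins.

```python
DUP_THRESH = 3  # illustrative sequence-space threshold (in packets)

def fack_lost(pkt_end_seq, highest_sacked_seq):
    # FACK-style: a packet is lost when it sits sufficiently far
    # below the highest SACKed sequence (distance in packets).
    return highest_sacked_seq - pkt_end_seq >= DUP_THRESH

def rack_lost(pkt_xmit_ts, rack_xmit_ts, rtt, reo_wnd, now):
    # RACK: the analogous test in time space, so retransmissions and
    # application-limited flights still advance the "distance".
    return (pkt_xmit_ts < rack_xmit_ts and
            now - pkt_xmit_ts >= rtt + reo_wnd)
```

The sequence-space test stalls when no new sequences are SACKed (e.g. lost retransmits), while the time-space test keeps advancing with the clock.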
1074 Since the 3 duplicate ACK threshold for triggering fast recovery 1075 [RFC5681] has been widely deployed and usually works well in the 1076 absence of reordering, RACK uses this signal to trigger fast recovery 1077 if a connection has not observed reordering. 1079 RACK is compatible with and does not interfere with the standard RTO 1080 [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 1081 algorithms [RFC3522]. This is because RACK only detects loss by 1082 using ACK events. It neither changes the RTO timer calculation nor 1083 detects spurious timeouts. 1085 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 1086 because a tail loss probe solicits either an ACK or SACK, which can 1087 be used by RACK to detect more losses. RACK can be used to relax 1088 TLP's requirement for using FACK and retransmitting the highest- 1089 sequenced packet, because RACK is agnostic to packet sequence 1090 numbers, and uses transmission time instead. Thus TLP could be 1091 modified to retransmit the first unacknowledged packet, which could 1092 improve application latency. 1094 8.5. Interaction with congestion control 1096 RACK intentionally decouples loss detection from congestion control. 1097 RACK only detects losses; it does not modify the congestion control 1098 algorithm [RFC5681][RFC6937]. A packet marked lost by RACK SHOULD 1099 NOT be retransmitted until congestion control deems this appropriate. 1101 RACK is applicable for both fast recovery and recovery after a 1102 retransmission timeout (RTO) in [RFC5681]. RACK applies equally to 1103 fast recovery and RTO recovery because RACK is purely based on the 1104 transmission time order of packets. When a packet retransmitted by 1105 RTO is acknowledged, RACK will mark any unacked packet sent 1106 sufficiently prior to the RTO as lost, because at least one RTT has 1107 elapsed since these packets were sent.
1109 RACK may detect losses faster or slower than the conventional 1110 duplicate ACK threshold approach does. RACK can detect losses faster 1111 by not requiring three DUPACKs, so congestion control may reduce the 1112 congestion window earlier. When the network path has both reordering 1113 and losses, RACK detects losses slower by waiting for the reordering 1114 window to expire. TCP may continue to increase the congestion window 1115 upon receiving ACKs during this time, making the sender more 1116 aggressive. Certain congestion control algorithms can benefit from 1117 accounting for this increase in the congestion window during the 1118 reordering window. 1120 8.5.1. Example: interactions with congestion control 1122 The following simple example compares how RACK and non-RACK loss 1123 detection interact with congestion control: suppose a TCP sender has 1124 a congestion window (cwnd) of 20 packets on a SACK-enabled 1125 connection. It sends 10 data packets and all of them are lost. 1127 Without RACK, the sender would time out, reset cwnd to 1, and 1128 retransmit the first packet. It would take four round trips (1 + 2 + 1129 4 + 3 = 10) to retransmit all 10 lost packets using slow start. 1130 The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 1131 packets due to congestion window validation. 1133 With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. 1134 If the sender implements Proportional Rate Reduction [RFC6937] it 1135 would slow start to retransmit the remaining 9 lost packets since the 1136 number of packets in flight (0) is lower than the slow start 1137 threshold (10). The slow start would again take four round trips (1 1138 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with 1139 an ending cwnd set to the slow start threshold of 10 packets. 1141 In both cases, the sender after the recovery would be in congestion 1142 avoidance.
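The four-round slow start count used in both cases can be checked with a short non-normative helper. Classic slow start doubling is assumed; congestion window validation and PRR details are omitted.

```python
def slow_start_rounds(cwnd, to_send):
    """Round trips needed to (re)transmit to_send packets when cwnd
    doubles each round (classic slow start, no further losses)."""
    rounds = 0
    sent = 0
    while sent < to_send:
        sent += min(cwnd, to_send - sent)
        cwnd *= 2
        rounds += 1
    return rounds

# Starting from cwnd = 1, ten packets take 1 + 2 + 4 + 3 across four
# round trips, so recovery costs RTO + 4*RTT without RACK versus
# 2*RTT + 4*RTT = 6*RTT with TLP plus RACK.
```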
The difference in recovery latency (RTO + 4*RTT vs 6*RTT) 1143 can be significant if the RTT is much smaller than the minimum RTO (1 1144 second in RFC6298) or if the RTT is large. The former case is common 1145 in local area networks, data-center networks, or content distribution 1146 networks with deep deployments. The latter case is more common in 1147 developing regions with highly congested and/or high-latency 1148 networks. 1150 8.6. TLP recovery detection with delayed ACKs 1152 Delayed ACKs complicate the detection of repairs done by TLP, since 1153 with a delayed ACK the sender receives one fewer ACK than would 1154 normally be expected. To mitigate this complication, before sending 1155 a TLP loss probe retransmission, the sender should attempt to wait 1156 long enough that the receiver has sent any delayed ACKs that it is 1157 withholding. The sender algorithm described above features such a 1158 delay, in the form of WCDelAckT. Furthermore, if the receiver 1159 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1160 in the case of a delayed ACK the sender's TLP recovery detection 1161 algorithm (see above) can use the D-SACK information to infer that 1162 the original and TLP retransmission both arrived at the receiver. 1164 If there is ACK loss or a delayed ACK without a D-SACK, then this 1165 algorithm is conservative, because the sender will reduce cwnd when 1166 in fact there was no packet loss. In practice this is acceptable, 1167 and potentially even desirable: if there is reverse path congestion 1168 then reducing cwnd can be prudent. 1170 8.7. RACK for other transport protocols 1172 RACK can be implemented in other transport protocols. The algorithm 1173 can be simplified by skipping step 3 if the protocol can support a 1174 unique transmission or packet identifier (e.g. TCP timestamp options 1176 [RFC7323]). For example, the QUIC protocol implements RACK [QUIC- 1177 LR]. 
The [Sprout] loss detection algorithm was also independently 1178 designed to use a 10ms reordering window to improve its loss 1179 detection. 1181 9. Experiments and Performance Evaluations 1183 RACK and TLP have been deployed at Google, for both connections to 1184 users in the Internet and internally. We conducted a performance 1185 evaluation experiment for RACK and TLP on a small set of Google Web 1186 servers in Western Europe that serve mostly European and some African 1187 countries. The experiment lasted three days in March 2017. The 1188 servers were divided evenly into four groups of roughly 5.3 million 1189 flows each: 1191 Group 1 (control): RACK off, TLP off, RFC 6675 on 1193 Group 2: RACK on, TLP off, RFC 6675 on 1195 Group 3: RACK on, TLP on, RFC 6675 on 1197 Group 4: RACK on, TLP on, RFC 6675 off 1199 All groups used Linux with CUBIC congestion control, an initial 1200 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1201 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1202 disabled FACK because it is not an IETF RFC. FACK was excluded 1203 because the goal of this setup is to compare RACK and TLP to RFC- 1204 based loss recoveries. Since TLP depends on either FACK or RACK, we 1205 could not run another group that enables TLP only (with both RACK and 1206 FACK disabled). Group 4 is to test whether RACK plus TLP can 1207 completely replace the DupThresh-based [RFC6675]. 1209 The servers sit behind a load balancer that distributes the 1210 connections evenly across the four groups. 1212 Each group handles a similar number of connections and sends and 1213 receives similar amounts of data. We compare total time spent in 1214 loss recovery across groups. The recovery time is measured from when 1215 the recovery and retransmission starts, until the remote host has 1216 acknowledged the highest sequence (SND.NXT) at the time the recovery 1217 started. 
Therefore the recovery includes both fast recoveries and 1218 timeout recoveries. 1220 Our data shows that Group 2 recovery latency is only 0.3% lower than 1221 the Group 1 recovery latency. But Group 3 recovery latency is 25% 1222 lower than Group 1 due to a 40% reduction in RTO-triggered 1223 recoveries! Therefore it is important to implement both TLP and RACK 1224 for performance. Group 4's total recovery latency is 0.02% lower 1225 than Group 3's, indicating that RACK plus TLP can successfully 1226 replace RFC6675 as a standalone recovery mechanism. 1228 We want to emphasize that the current experiment is limited in terms 1229 of network coverage. The connectivity in Western Europe is fairly 1230 good, therefore loss recovery is not a major performance bottleneck. 1231 We plan to expand our experiments to regions with worse connectivity, 1232 in particular on networks with strong traffic policing. 1234 10. Security Considerations 1236 RACK does not change the risk profile for TCP. 1238 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 1239 MSS-size packet sent, the receiver or the attacker might send MSS 1240 ACKs that SACK or acknowledge one additional byte per ACK. This 1241 would not fool RACK. RACK.xmit_ts would not advance because all the 1242 sequences of the packet are transmitted at the same time (carry the 1243 same transmission timestamp). In other words, SACKing only one byte 1244 of a packet or SACKing the packet in entirety have the same effect on 1245 RACK. 1247 11. IANA Considerations 1249 This document makes no request of IANA. 1251 Note to RFC Editor: this section may be removed on publication as an 1252 RFC. 1254 12. Acknowledgments 1256 The authors thank Matt Mathis for his insights in FACK and Michael 1257 Welzl for his per-packet timer idea that inspired this work. 
Eric 1258 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana 1259 Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi 1260 Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, Martin Duke, 1261 and Ilpo Jarvinen contributed to the draft or the implementations in 1262 Linux, FreeBSD, Windows and QUIC. 1264 13. References 1266 13.1. Normative References 1268 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 1269 Options", RFC 2018, October 1996. 1271 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1272 Requirement Levels", RFC 2119, March 1997. 1274 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1275 Extension to the Selective Acknowledgement (SACK) Option 1276 for TCP", RFC 2883, July 2000. 1278 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 1279 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 1280 November 2006. 1282 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1283 Control", RFC 5681, September 2009. 1285 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 1286 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1287 Spurious Retransmission Timeouts with TCP", RFC 5682, 1288 September 2009. 1290 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 1291 Hurtig, "Early Retransmit for TCP and Stream Control 1292 Transmission Protocol (SCTP)", RFC 5827, April 2010. 1294 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1295 "Computing TCP's Retransmission Timer", RFC 6298, June 1296 2011. 1298 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1299 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1300 Based on Selective Acknowledgment (SACK) for TCP", 1301 RFC 6675, August 2012. 1303 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 1304 Rate Reduction for TCP", RFC 6937, May 2013. 1306 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
1307 Scheffenegger, "TCP Extensions for High Performance", RFC 7323, 1308 September 2014. 1310 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, September 1311 1981. 1313 13.2. Informative References 1315 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 1316 refining TCP congestion control", ACM SIGCOMM Computer 1317 Communication Review, Volume 26, Issue 4, October 1996. 1320 [POLICER16] 1321 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1322 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1323 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 1324 2016. 1326 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1327 Congestion Control", draft-ietf-quic-recovery-latest (work 1328 in progress), March 2020. 1330 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1331 and SCTP RTO Restart", RFC 7765, February 2016. 1333 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1334 "TCP Congestion Control With a Misbehaving Receiver", ACM 1335 Computer Communication Review, 29(5), 1999. 1337 [Sprout] Winstein, K., Sivaraman, A., and H. Balakrishnan, 1338 "Stochastic Forecasts Achieve High Throughput and Low 1339 Delay over Cellular Networks", USENIX Symposium on 1340 Networked Systems Design and Implementation (NSDI), 2013. 1342 [THIN-STREAM] 1343 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1344 "TCP enhancements for interactive thin-stream 1345 applications", NOSSDAV, 2008. 1347 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1348 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1349 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1350 in progress), August 2013. 1352 Authors' Addresses 1354 Yuchung Cheng 1355 Google, Inc 1357 Email: ycheng@google.com 1359 Neal Cardwell 1360 Google, Inc 1362 Email: ncardwell@google.com 1363 Nandita Dukkipati 1364 Google, Inc 1366 Email: nanditad@google.com 1368 Priyaranjan Jha 1369 Google, Inc 1371 Email: priyarjha@google.com