TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: January 3, 2019                                          P. Jha
                                                             Google, Inc
                                                            July 2, 2018

        RACK: a time-based fast loss detection algorithm for TCP
                        draft-ietf-tcpm-rack-04

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").  RACK uses the notion of time,
   instead of packet or sequence counts, to detect losses, for modern
   TCP implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as
   well as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if some packet sent
   sufficiently later has been delivered.  It does this by recording
   packet transmission times and inferring losses using cumulative
   acknowledgments or selective acknowledgment (SACK) TCP options.

   In the last couple of years we have been observing several
   increasingly common loss and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK) or
       routers' internal load balancing can deliver TCP packets out of
       order.  The degree of such reordering is usually within the
       order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard
   and proposed loss detection algorithms
   [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM],
   we've found that together they do not perform well.  The main
   reason is that many of them are based on the classic rule of
   counting duplicate acknowledgments [RFC5681].  They can either
   detect loss quickly or accurately, but not both, especially when
   the sender is application-limited or subject to unpredictable
   reordering.  And under these conditions none of them can detect
   lost retransmissions well.

   Also, these algorithms, including RFCs, rarely address the
   interactions with other algorithms.  For example, FACK may consider
   a packet lost while RFC 6675 may not.  Implementing N algorithms
   while dealing with N^2 interactions is a daunting and error-prone
   task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one more effective
   algorithm to handle loss and reordering.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before that
   were either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC6675][FACK].  But the key innovation in
   RACK is to use a per-packet transmission timestamp and widely
   deployed SACK options to conduct time-based inferences, instead of
   inferring losses with packet or sequence counting approaches.
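To make the time-based inference concrete, the following sketch is a deliberately simplified illustration, not the normative algorithm: the names (e.g. `mark_lost`) and the flat packet list are hypothetical, and the full algorithm in Section 6 adds RTT terms, sequence-number tie-breaking, and a reordering timer.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    end_seq: int
    xmit_ts: float       # time of the last (re)transmission
    delivered: bool = False
    lost: bool = False

def mark_lost(packets, reo_wnd):
    """Mark a packet lost if some packet sent at least reo_wnd
    later has already been delivered (simplified RACK inference)."""
    delivered_ts = [p.xmit_ts for p in packets if p.delivered]
    if not delivered_ts:
        return
    # Transmission time of the most recently sent delivered packet
    rack_xmit_ts = max(delivered_ts)
    for p in packets:
        if not p.delivered and rack_xmit_ts - p.xmit_ts >= reo_wnd:
            p.lost = True
```

For example, if packets sent at times 0..4 are in flight and only the one sent at t=4 is delivered, a reordering window of 2 marks the packets sent at t=0, 1, and 2 as lost, while the packet sent at t=3 is still within the settling window.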
   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) alone is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems with
   packet-counting approaches [RFC5681][RFC6675][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge whether some packets sent prior
   to that time have "expired" by passing a certain reordering
   settling window.  On each ACK, RACK marks any already-expired
   packets lost, and for any packets that have not yet expired it
   waits until the reordering window passes and then marks those lost
   as well.  In either case, RACK can repair the loss without waiting
   for a (long) RTO.  RACK can be applied to both fast recovery and
   timeout recovery, and can detect losses on both originally
   transmitted and retransmitted packets, making it a great
   all-weather loss detection mechanism.

3.  Design Rationale for Reordering Tolerance

   The reordering behavior of networks can evolve (over years) in
   response to the behavior of transport protocols and applications,
   as well as the needs of network designers and operators.
   From a network or link designer's viewpoint, parallelization (e.g.,
   link bonding) is the easiest way to make a network go faster.
   Therefore their main constraint on speed is reordering, and there
   is pressure to relax that constraint.  If RACK becomes widely
   deployed, the underlying networks may introduce more reordering for
   higher throughput.  But this may result in excessive reordering
   that hurts end-to-end performance:

   1.  End-host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.

   2.  Congestion control: TCP congestion control implicitly assumes
       that the feedback from ACKs is from the same bottleneck.
       Therefore it cannot handle well scenarios where packets are
       traversing largely disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths
       would increase the latency of loss recovery.

   An end-to-end transport protocol cannot tell immediately whether a
   hole is reordering or loss.  It can only distinguish between the
   two in hindsight, if the hole in the sequence space gets filled
   later without a retransmission.  How long the sender waits for such
   potential reordering events to settle is determined by the current
   reordering window.

   Given these considerations, a core design philosophy of RACK is to
   adapt to the measured duration of reordering events, within
   reasonable and specific bounds.  To accomplish this RACK places the
   following mandates on the reordering window:

   1.  The initial RACK reordering window SHOULD be set to a small
       fraction of the round-trip time.

   2.  If no reordering has been observed, then RACK SHOULD honor the
       classic 3-DUPACK rule for initiating fast recovery.  One simple
       way to implement this is to temporarily override the reorder
       window to 0.

   3.  The RACK reordering window SHOULD leverage Duplicate Selective
       Acknowledgement (DSACK) information [RFC3708] to adaptively
       estimate the duration of reordering events.

   4.  The RACK reordering window MUST be bounded, and this bound
       SHOULD be one round trip.

   As a flow starts, either condition 1 or condition 2, or both, would
   trigger RACK to start the recovery process quickly.  The low
   initial reordering window and use of the 3-DUPACK rule are key to
   achieving low-latency loss recovery for short flows, by risking
   spurious retransmissions to recover losses quickly.  The rationale
   is that spurious retransmissions for short flows are not expected
   to produce excessive network traffic.

   For long flows the design tolerates reordering within a round trip.
   This handles reordering caused by path divergence on small time
   scales (reordering within the round-trip time of the shortest
   path), which should tolerate much of the reordering from link
   bonding, multipath routing, or link-layer out-of-order delivery.
   It also relaxes ordering constraints to allow sending flights of
   TCP packets on different paths dynamically for better
   load-balancing (e.g., flowlets).

   However, the fact that the initial RACK reordering window is low,
   and the RACK reordering window's adaptive growth is bounded, means
   that there will continue to be a cost to reordering and a limit to
   RACK's adaptation to reordering.  This maintains a disincentive for
   network designers and operators to introduce needless or excessive
   reordering, particularly because they have to allow for low
   round-trip-time paths.
   This means RACK will not encourage networks to perform
   inconsiderate fine-grained packet-spraying over highly disjoint
   paths with very different characteristics.  There are good
   alternative solutions, such as MPTCP, for such networks.

   To conclude, the RACK algorithm aims to adapt to small degrees of
   reordering, quickly recover most losses within one to two round
   trips, and avoid costly retransmission timeouts (RTOs).  In the
   presence of reordering, the adaptation algorithm can impose
   sometimes-needless delays when it waits to disambiguate loss from
   reordering, but the penalty for waiting is bounded to one round
   trip and such delays are confined to longer-running flows.

   This document provides a concrete and detailed reordering window
   adaptation algorithm for implementors.  We note that the specifics
   of the algorithm are likely to evolve over time.  But that is a
   separate engineering optimization that's out of scope for this
   document.

4.  Requirements

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not expected but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g.,
       intra-datacenter communications), microsecond granularity would
       significantly improve detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.
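As an illustration of the state these requirements imply, here is one possible per-packet record a sender might keep.  This is a sketch only: the field and function names are hypothetical, and the document deliberately leaves the data structure to the implementor.

```python
import time
from dataclasses import dataclass

@dataclass
class SentPacketState:
    """Hypothetical per-packet state implied by requirements 2 and 3."""
    end_seq: int
    xmit_ts_ms: int           # most recent (re)transmission time, ms granularity
    retransmitted: bool = False

def on_transmit(state_table, end_seq, is_retransmit):
    """Record or refresh per-packet state when a packet is (re)transmitted."""
    now_ms = int(time.monotonic() * 1000)
    st = state_table.get(end_seq)
    if st is None:
        state_table[end_seq] = SentPacketState(end_seq, now_ms, is_retransmit)
    else:
        st.xmit_ts_ms = now_ms                                # requirement 2
        st.retransmitted = st.retransmitted or is_retransmit  # requirement 3
```

In practice this state lives in the SACK scoreboard itself (per sequence range), but any structure that preserves the most recent transmission time and the retransmitted flag per packet satisfies the requirements.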
   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explaining the algorithm, we use a
   pseudo-scoreboard that manages the data in sequence number ranges.
   But the specifics of the data structure are left to the
   implementor.

   RACK does not require any change on the receiver.

5.  Definitions of variables

   A sender needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet" is, among all the packets that have been either
   selectively or cumulatively acknowledged, the one that was sent
   most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.rtt" is the RTT of the most recently transmitted packet that
   has been delivered (either cumulatively acknowledged or selectively
   acknowledged) on the connection.

   "RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

   "RACK.reo_wnd" is a reordering window computed in the unit of time
   used for recording packet transmission times.  It is used to defer
   the moment at which RACK marks a packet lost.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.
   "RACK.reo_wnd_incr" is the multiplier applied to adjust
   RACK.reo_wnd.

   "RACK.reo_wnd_persist" is the number of loss recoveries before
   resetting RACK.reo_wnd.

   "RACK.dsack" indicates whether a DSACK option has been received
   since the last RACK.reo_wnd change.

   "RACK.pkts_sacked" is the total number of packets selectively
   acknowledged in the SACK scoreboard.

   "RACK.reord" indicates that the connection has detected packet
   reordering event(s).

   "RACK.fack" is the highest selectively or cumulatively acknowledged
   sequence.

   Note that the Packet.xmit_ts variable is per packet in flight.  The
   RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

6.  Algorithm Details

6.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

6.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.  The sender can
   track a simple global minimum of all RTT measurements from the
   connection, or a windowed min-filtered value of recent RTT
   measurements.  This document does not specify an exact approach.

   Step 2: Update RACK stats

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record the
   most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.
   Sometimes RACK.packet and Packet could carry the same transmit
   timestamp due to clock granularity or segmentation offloading
   (i.e., the two packets were sent as a jumbo frame into the NIC).
   In that case the sequence numbers of RACK.end_seq and
   Packet.end_seq are compared to break the tie.

   Since an ACK can also acknowledge retransmitted data packets,
   RACK.rtt can be vastly underestimated if the retransmission was
   spurious.  To avoid that, ignore a packet if any of its TCP
   sequences have been retransmitted before and either of two
   conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_rtt ago.

   If the ACK is not ignored as invalid, update RACK.rtt to be the RTT
   sample calculated using this ACK, and continue.  If this ACK or
   SACK was for the most recently sent packet, then record the
   RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this
   ACK.  Otherwise exit here and omit the following steps.

   Notice that the second condition above is a heuristic.  This
   heuristic would fail to update RACK stats if the packet is
   spuriously retransmitted because of a recent minimum RTT decrease
   (e.g., a path change).  Consequently RACK may not detect losses
   from ACK events, and the recovery resorts to the (slower) TLP or
   RTO timer-based events.  However, such events should be rare and
   the connection would pick up the new minimum RTT when the recovery
   ends, avoiding repeated similar failures.
   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet.retransmitted is TRUE:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Continue
               If rtt < RACK.min_rtt:
                   Continue

           RACK.rtt = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts
               RACK.end_seq = Packet.end_seq

   Step 3: Detect packet reordering

   To detect reordering, the sender looks for original data packets
   being delivered out of order in sequence space.  The sender tracks
   the highest sequence selectively or cumulatively acknowledged in
   the RACK.fack variable.  The name "fack" stands for the most
   "forward ACK" and originated in the [FACK] draft.  If the ACK
   selectively or cumulatively acknowledges an unacknowledged and also
   never retransmitted sequence below RACK.fack, then the
   corresponding packet has been reordered and RACK.reord is set to 1.

   The heuristic above only detects reordering if the reordered packet
   has not yet been retransmitted.  This is a major drawback, because
   if RACK has a low reordering window and the network is reordering
   packets, RACK may falsely retransmit frequently.  Consequently RACK
   may fail to detect reordering and thus never increase the
   reordering window, because the reordered packets were already
   (falsely) retransmitted.

   DSACK [RFC3708] can help mitigate this issue.  A false
   retransmission would solicit a DSACK option in the ACK.  Therefore,
   if the ACK has a DSACK option covering sequences that were both
   acknowledged and retransmitted, this implies the original packet
   was reordered but RACK retransmitted the packet too quickly; the
   sender should then set RACK.reord to 1.
   RACK_detect_reordering():
       For each Packet newly acknowledged cumulatively or selectively:
           If Packet.end_seq > RACK.fack:
               RACK.fack = Packet.end_seq
           Else if Packet.end_seq < RACK.fack AND
                   Packet.retransmitted is FALSE:
               RACK.reord = TRUE

       For each Packet covered by the DSACK option:
           If Packet.retransmitted is TRUE:
               RACK.reord = TRUE

   Step 4: Update RACK reordering window

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  This section documents a detailed algorithm following the
   design rationale section.  RACK starts initially with a
   conservative window of min_RTT/4.  If no reordering has been
   observed, RACK uses a RACK.reo_wnd of 0 during loss recovery, or
   when the number of DUPACKs exceeds the classic DUPACK threshold, in
   order to retransmit quickly.  The subtle difference between this
   approach and the conventional one [RFC5681][RFC6675] is discussed
   later, in the section "RACK and TLP discussions".

   Further, RACK MAY use DSACK [RFC3708] to adapt the reordering
   window to higher degrees of reordering, if DSACK is supported.
   Receiving an ACK with a DSACK indicates a spurious retransmission,
   which in turn suggests that the RACK reordering window,
   RACK.reo_wnd, is likely too small.  The sender MAY increase the
   RACK.reo_wnd window linearly for every round trip in which the
   sender receives a DSACK, so that after N distinct round trips in
   which a DSACK is received, RACK.reo_wnd becomes
   (N+1) * min_RTT / 4, with an upper bound of SRTT.  The inflated
   RACK.reo_wnd persists for 16 loss recoveries, after which it resets
   to its starting value, min_RTT / 4.

   The following pseudocode implements the above algorithm.  Note that
   extensions that require additional TCP features (e.g., DSACK) would
   work if the feature functions simply return false.
   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If DSACK option is present:
           RACK.dsack = true

       If SND.UNA < RACK.rtt_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.rtt_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If RACK.reord is FALSE:
           If in loss recovery: /* If in fast or timeout recovery */
               RACK.reo_wnd = 0
               Return
           Else if RACK.pkts_sacked >= RACK.dupthresh:
               RACK.reo_wnd = 0
               Return
       RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
       RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

   Step 5: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.
   As a starting point, we consider that the reordering window has
   passed if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for that packet has
   elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.  the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

   RACK.xmit_ts >= Packet.xmit_ts
   AND
   (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

   now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

   Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful to detect losses in a timely manner with TCP
   Segmentation Offload, where multiple packets in one TSO blob have
   identical timestamps.  It is also useful when the timestamp clock
   granularity is close to or longer than the actual round-trip time.
   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, not acknowledged yet:
           If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
               Continue /* Packet lost but not yet retransmitted */

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.rtt +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Packet.lost = TRUE
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through packets in the SACK
   scoreboard above could be very costly on large-BDP networks, since
   the inflight could be very large.  If the implementation can
   organize the scoreboard data structures to have packets sorted by
   the last (re)transmission time, then the loop can start at the
   least recently sent packet and abort on the first packet sent after
   RACK.xmit_ts.  This can be implemented by using a separate list
   sorted in time order.  The implementation inserts the packet at the
   tail of the list when it is (re)transmitted, and removes a packet
   from the list when it is delivered or marked lost.  We RECOMMEND
   such an optimization because it enables implementations to support
   high-BDP networks.  This optimization is implemented in Linux and
   yields orders-of-magnitude improvements in CPU usage on high-speed
   WAN networks.

6.3.  Tail Loss Probe: fast recovery for tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that can
   otherwise be recovered by RTOs only.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.
   In either case the goal is to elicit more feedback from the
   receiver, in the form of an ACK (potentially with SACK blocks), to
   allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight of data is lost.

   2.  Tail losses of data segments at the end of an application
       transaction.

   3.  Tail losses of ACKs at the end of an application transaction.

   4.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

6.4.  Tail Loss Probe: An Example

   The following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.
   3.  When the PTO fires, the sender retransmits segment 10.

   4.  After an RTT, a SACK for packet 10 arrives.  The ACK also
       carries SACK holes for segments 6, 7, 8 and 9.  This triggers
       RACK-based loss recovery.

   5.  The connection enters fast recovery and retransmits the
       remaining lost segments.

6.5.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined
   in [RFC5681].

   RTO: The transport's retransmission timeout (RTO) is based on
   measured round-trip times (RTT) between the sender and receiver, as
   specified in [RFC6298] for TCP.

   PTO: Probe timeout (PTO) is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in [RFC6298].

   The TLP algorithm has three phases, which we discuss in turn.

6.5.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should check to see if it should schedule a PTO in the
   following situations:

   1.  After transmitting new data that was not itself a TLP probe

   2.  Upon receiving an ACK that cumulatively acknowledges data

   3.  Upon receiving a SACK that selectively acknowledges data that
       was last sent before the segment with SEG.SEQ=SND.UNA was last
       (re)transmitted

   A sender should schedule a PTO only if all of the following
   conditions are met:

   1.  The connection supports SACK [RFC2018]

   2.  The connection has no SACKed sequences in the SACK scoreboard

   3.  The connection is not in loss recovery

   If a PTO can be scheduled according to these conditions, the sender
   should schedule a PTO.  If there was a previously scheduled PTO or
   RTO pending, then that pending PTO or RTO should first be
   cancelled, and then the new PTO should be scheduled.
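The Step 1 gating logic can be sketched as follows.  This is an illustrative sketch only: the helper names and the connection object's fields are hypothetical, and the timer plumbing is left to the implementation.

```python
def should_schedule_pto(sack_enabled, sacked_count, in_loss_recovery):
    """A PTO may be scheduled only if all three Step 1 conditions hold."""
    return sack_enabled and sacked_count == 0 and not in_loss_recovery

def reschedule_probe_timer(conn):
    """Hypothetical hook run at the three check points of Step 1."""
    if should_schedule_pto(conn.sack_enabled,
                           conn.pkts_sacked,
                           conn.in_loss_recovery):
        conn.cancel_timer()   # cancel any pending PTO or RTO first
        conn.arm_pto()        # then schedule the new PTO
    elif conn.flight_size > 0:
        conn.arm_rto()        # otherwise fall back to the RTO timer
```

The `elif` branch reflects the fallback described next: when a PTO cannot be scheduled and unacknowledged data remains in flight, the sender arms the RTO timer instead.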
692 If a PTO cannot be scheduled according to these conditions, then the 693 sender MUST arm the RTO timer if there is unacknowledged data in 694 flight. 696 Step 2: Select the duration of the PTO. 698 A sender SHOULD use the following logic to select the duration of a 699 PTO: 701 TLP_timeout(): 702 If SRTT is available: 703 PTO = 2 * SRTT 704 If FlightSize = 1: 705 PTO += WCDelAckT 706 Else: 707 PTO += 2ms 708 Else: 709 PTO = 1 sec 711 If Now() + PTO > TCP_RTO_expire(): 712 PTO = TCP_RTO_expire() - Now() 714 Aiming for a PTO value of 2*SRTT allows a sender to wait long enough 715 to know that an ACK is overdue. Under normal circumstances, i.e. no 716 losses, an ACK typically arrives in one SRTT. But choosing PTO to be 717 exactly an SRTT is likely to generate spurious probes given that 718 network delay variance and even end-system timings can easily push an 719 ACK to be above an SRTT. We chose PTO to be the next integral 720 multiple of SRTT. 722 Similarly, network delay variations and end-system processing 723 latencies and timer granularities can easily delay ACKs beyond 724 2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value 725 (and MAY add more if the sending host OS timer granularity is more 726 coarse than 1ms). 728 WCDelAckT stands for worst case delayed ACK timer. When FlightSize 729 is 1, PTO is inflated by WCDelAckT time to compensate for a potential 730 long delayed ACK timer at the receiver. The RECOMMENDED value for 731 WCDelAckT is 200ms. 733 Finally, if the time at which an RTO would fire (here denoted 734 "TCP_RTO_expire") is sooner than the computed time for the PTO, then 735 a probe is scheduled to be sent at that earlier time. 737 6.5.2. 
Phase 2: Sending a loss probe 739 When the PTO fires, transmit a probe data segment: 741 TLP_send_probe(): 742 If an unsent segment exists AND 743 the receive window allows new data to be sent: 744 Transmit the lowest-sequence unsent segment of up to SMSS 745 Increment FlightSize by the size of the newly-sent segment 746 Else: 747 Retransmit the highest-sequence segment sent so far 748 The cwnd remains unchanged 750 When the loss probe is a retransmission, the sender uses the highest- 751 sequence segment sent so far. This is done to deal with the 752 retransmission ambiguity problem in TCP. Suppose a sender sends N 753 segments, and then retransmits the last segment (segment N) as a loss 754 probe, and then the sender receives a SACK for segment N. As long as 755 the sender waits for any required RACK reordering settling timer to 756 then expire, it doesn't matter if that SACK was for the original 757 transmission of segment N or the TLP retransmission; in either case 758 the arrival of the SACK for segment N provides evidence that the N-1 759 segments preceding segment N were likely lost. In the case where 760 there is only one original outstanding segment of data (N=1), the 761 same logic (trivially) applies: an ACK for a single outstanding 762 segment tells the sender the N-1=0 segments preceding that segment 763 were lost. Furthermore, whether there are N>1 or N=1 outstanding 764 segments, there is a question about whether the original last segment 765 or its TLP retransmission was lost; the sender estimates this using 766 TLP recovery detection (see below). 768 Note that after transmitting a TLP, the sender MUST arm an RTO timer, 769 and not the PTO timer. This ensures that the sender does not send 770 repeated, back-to-back TLP probes. 772 6.5.3. Phase 3: ACK processing 774 On each incoming ACK, the sender should check the conditions in Step 775 1 of Phase 1 to see if it should schedule (or reschedule) the loss 776 probe timer. 778 6.6.
TLP recovery detection 780 If the only loss in an outstanding window of data was the last 781 segment, then a TLP loss probe retransmission of that data segment 782 might repair the loss. TLP recovery detection examines ACKs to 783 detect when the probe might have repaired a loss, and thus allows 784 congestion control to properly reduce the congestion window (cwnd) 785 [RFC5681]. 787 Consider a TLP retransmission episode where a sender retransmits a 788 tail packet in a flight. The TLP retransmission episode ends when 789 the sender receives an ACK with a SEG.ACK above the SND.NXT at the 790 time the episode started. During the TLP retransmission episode the 791 sender checks for a duplicate ACK or D-SACK indicating that both the 792 original segment and TLP retransmission arrived at the receiver, 793 meaning there was no loss that needed repairing. If the TLP sender 794 does not receive such an indication before the end of the TLP 795 retransmission episode, then it MUST estimate that either the 796 original data segment or the TLP retransmission was lost, and 797 congestion control MUST react appropriately to that loss as it would 798 any other loss. 800 Since a significant fraction of the hosts that support SACK do not 801 support duplicate selective acknowledgments (D-SACKs) [RFC2883], the 802 TLP algorithm for detecting such lost segments relies only on basic 803 SACK support [RFC2018]. 805 Definitions of variables 807 TLPRxtOut: a boolean indicating whether there is an unacknowledged 808 TLP retransmission. 810 TLPHighRxt: the value of SND.NXT at the time of sending a TLP 811 retransmission. 813 6.6.1. Initializing and resetting state 815 When a connection is created, or suffers a retransmission timeout, or 816 enters fast recovery, it executes the following: 818 TLPRxtOut = false 820 6.6.2. Recording loss probe states 822 Senders MUST only send a TLP loss probe retransmission if TLPRxtOut 823 is false.
This ensures that at any given time a connection has at 824 most one outstanding TLP retransmission. This allows the sender to 825 use the algorithm described in this section to estimate whether any 826 data segments were lost. 828 Note that this condition only restricts TLP loss probes that are 829 retransmissions. There may be an arbitrary number of outstanding 830 unacknowledged TLP loss probes that consist of new, previously-unsent 831 data, since the retransmission timeout and fast recovery algorithms 832 are sufficient to detect losses of such probe segments. 834 Upon sending a TLP probe that is a retransmission, the sender sets 835 TLPRxtOut to true and TLPHighRxt to SND.NXT. 837 Detecting recoveries accomplished by loss probes 839 Step 1: Track ACKs indicating receipt of original and retransmitted 840 segments 842 A sender considers both the original segment and TLP probe 843 retransmission segment as acknowledged if either 1 or 2 is true: 845 1. This is a duplicate acknowledgment (as defined in [RFC5681], 846 section 2), and all of the following conditions are met: 848 1. TLPRxtOut is true 850 2. SEG.ACK == TLPHighRxt 852 3. SEG.ACK == SND.UNA 854 4. the segment contains no SACK blocks for sequence ranges above 855 TLPHighRxt 857 5. the segment contains no data 859 6. the segment is not a window update 861 2. This is an ACK acknowledging a sequence number at or above 862 TLPHighRxt and it contains a D-SACK; i.e. all of the following 863 conditions are met: 865 1. TLPRxtOut is true 867 2. SEG.ACK >= TLPHighRxt 869 3. the ACK contains a D-SACK block 871 If either condition is met, then the sender estimates that the 872 receiver received both the original data segment and the TLP probe 873 retransmission, and so the sender considers the TLP episode to be 874 done, and records that fact by setting TLPRxtOut to false.
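As a non-normative sketch, the two indications above might be checked
as follows (the Ack record and its field names are illustrative, not
part of this specification):

```python
# Hedged, non-normative sketch of Step 1's "both arrived" indications.
# The Ack record and all field names are illustrative only.

from dataclasses import dataclass

@dataclass
class Ack:
    seg_ack: int                 # SEG.ACK of the incoming segment
    snd_una: int                 # sender's SND.UNA when the ACK arrives
    sack_blocks: list            # list of (start, end) SACKed ranges
    has_dsack: bool = False      # ACK carries a D-SACK block [RFC2883]
    data_len: int = 0            # payload length of the ACK segment
    is_window_update: bool = False

def tlp_both_arrived(tlp_rxt_out: bool, tlp_high_rxt: int, ack: Ack) -> bool:
    """True if the ACK indicates both the original segment and the
    TLP retransmission reached the receiver."""
    if not tlp_rxt_out:
        return False
    # Condition 1: a pure duplicate ACK matching TLPHighRxt
    dup_ack = (ack.seg_ack == tlp_high_rxt
               and ack.seg_ack == ack.snd_una
               and not any(end > tlp_high_rxt for _, end in ack.sack_blocks)
               and ack.data_len == 0
               and not ack.is_window_update)
    # Condition 2: an ACK at or above TLPHighRxt carrying a D-SACK
    dsack = (ack.seg_ack >= tlp_high_rxt and ack.has_dsack)
    return dup_ack or dsack
```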
876 Step 2: Mark the end of a TLP retransmission episode and detect 877 losses 878 If the sender receives a cumulative ACK for data beyond the TLP loss 879 probe retransmission then, in the absence of reordering on the return 880 path of ACKs, it should have received any ACKs for the original 881 segment and TLP probe retransmission segment. At that time, if the 882 TLPRxtOut flag is still true and thus indicates that the TLP probe 883 retransmission remains unacknowledged, then the sender should presume 884 that at least one of its data segments was lost, so it SHOULD invoke 885 a congestion control response equivalent to fast recovery. 887 More precisely, on each ACK the sender executes the following: 889 if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { 890 TLPRxtOut = false 891 EnterRecovery() 892 ExitRecovery() 893 } 895 7. RACK and TLP discussions 897 7.1. Advantages 899 The biggest advantage of RACK is that every data packet, whether it 900 is an original data transmission or a retransmission, can be used to 901 detect losses of the packets sent chronologically prior to it. 903 Example: TAIL DROP. Consider a sender that transmits a window of 904 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 905 transmission of each packet is at least RACK.reo_wnd (1 millisecond 906 by default) after the transmission of the previous packet. RACK will 907 mark P1 as lost when the SACK of P2 is received, and this will 908 trigger the retransmission of P1 as R1. When R1 is cumulatively 909 acknowledged, RACK will mark P3 as lost and the sender will 910 retransmit P3 as R3. This example illustrates how RACK is able to 911 repair certain drops at the tail of a transaction without any timer. 912 Notice that neither the conventional duplicate ACK threshold 913 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 914 algorithm can detect such losses, because of the required packet or 915 sequence count. 917 Example: LOST RETRANSMIT. 
Consider a window of three data packets 918 (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the 919 transmission of each packet is at least RACK.reo_wnd (1 millisecond 920 by default) after the transmission of the previous packet. When P3 921 is SACKed, RACK will mark P1 and P2 lost and they will be 922 retransmitted as R1 and R2. Suppose R1 is lost again but R2 is 923 SACKed; RACK will mark R1 lost for retransmission again. Again, 924 neither the conventional three duplicate ACK threshold approach, nor 925 [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect 926 such losses. Such a lost retransmission is very common when TCP 927 is being rate-limited, particularly by token bucket policers with 928 large bucket depth and low rate limit. Retransmissions are often 929 lost repeatedly because standard congestion control requires multiple 930 round trips to reduce the rate below the policed rate. 932 Example: SMALL DEGREE OF REORDERING. Consider a common reordering 933 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 934 a full payload of MSS octets, but P3 has only a 1-octet payload. 935 Suppose the sender has detected reordering previously and thus 936 RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, 937 before P1 and P2. As long as P1 and P2 are delivered within 938 min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 939 are delivered outside the reordering window, then RACK will still 940 falsely mark P1 and P2 lost. We discuss how to reduce false 941 positives at the end of this section. 943 The examples above show that RACK is particularly useful when the 944 sender is limited by the application, which is common for 945 interactive, request/response traffic. Similarly, RACK still works 946 when the sender is limited by the receive window, which is common for 947 applications that use the receive window to throttle the sender.
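The time-based marking used in these examples can be sketched as
follows (a simplified, non-normative model that ignores RTT checks and
retransmissions; all names are illustrative):

```python
# Simplified, non-normative sketch of RACK's time-based marking: when
# a packet is cumulatively or selectively acknowledged, any packet
# that was sent at least reo_wnd earlier and remains unacknowledged is
# marked lost. The real algorithm also tracks RACK.rtt and handles
# retransmitted packets; this sketch omits those details.

def rack_mark_lost(xmit_times, delivered, newly_acked, reo_wnd):
    """xmit_times: {pkt: send_time}; delivered: set of already-acked
    packets; newly_acked: packet whose (S)ACK just arrived."""
    ack_xmit_ts = xmit_times[newly_acked]
    lost = set()
    for pkt, ts in xmit_times.items():
        if pkt in delivered or pkt == newly_acked:
            continue
        # Sent at least reo_wnd before the delivered packet: mark lost
        if ts <= ack_xmit_ts - reo_wnd:
            lost.add(pkt)
    return lost
```

With the TAIL DROP example (packets sent 1 ms apart, reo_wnd of 1 ms),
a SACK of P2 marks P1 lost, while P3 remains outstanding.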
949 For some implementations (e.g., Linux), RACK works quite efficiently 950 with TCP Segmentation Offload (TSO). RACK always marks the entire 951 TSO blob lost because the packets in the same TSO blob have the same 952 transmission timestamp. By contrast, the algorithms based on 953 sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset 954 of packets in the TSO blob lost, forcing the stack to perform 955 expensive fragmentation of the TSO blob, or to selectively tag 956 individual packets lost in the scoreboard. 958 7.2. Disadvantages 960 RACK requires the sender to record the transmission time of each 961 packet sent at a clock granularity of one millisecond or finer. TCP 962 implementations that record this already for RTT estimation do not 963 require any new per-packet state. But implementations that are not 964 yet recording packet transmission times will need to add per-packet 965 internal state (commonly either 4 or 8 octets per packet or TSO blob) 966 to track transmission times. In contrast, the conventional [RFC6675] 967 loss detection approach does not require any per-packet state beyond 968 the SACK scoreboard. This is particularly useful on ultra-low RTT 969 networks where the RTT is far less than the sender TCP clock 970 granularity (e.g. inside data-centers). 972 RACK can easily and optionally support the conventional approach in 973 [RFC6675][RFC5681] by resetting the reordering window to zero when 974 the threshold is met. Note that this approach differs slightly from 975 [RFC6675], which considers a packet lost when at least #DupThresh 976 higher-sequence packets are SACKed. RACK's approach considers a 977 packet lost when at least one higher sequence packet is SACKed and 978 the total number of SACKed packets is at least DupThresh. For 979 example, suppose a connection sends 10 packets, and packets 3, 5, 7 980 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK 981 considers packets 1, 2, 4, 6 lost. 983 7.3.
Adjusting the reordering window 985 When the sender detects packet reordering, RACK uses a reordering 986 window of min_RTT/4. It uses the minimum RTT to accommodate 987 reordering introduced by packets traversing slightly different paths 988 (e.g., router-based parallelism schemes) or out-of-order deliveries 989 in the lower link layer (e.g., wireless links using link-layer 990 retransmission). RACK uses a quarter of minimum RTT because Linux 991 TCP used the same factor in its implementation to delay Early 992 Retransmit [RFC5827] to reduce spurious loss detections in the 993 presence of reordering, and experience shows that this seems to work 994 reasonably well. We have evaluated using the smoothed RTT (SRTT from 995 [RFC6298] RTT estimation) or the most recently measured RTT 996 (RACK.rtt) using an experiment similar to that in the Performance 997 Evaluation section. They do not make any significant difference in 998 terms of total recovery latency. 1000 7.4. Relationships with other loss recovery algorithms 1002 The primary motivation of RACK is to ultimately provide a simple and 1003 general replacement for some of the standard loss recovery algorithms 1004 [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard 1005 ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss 1006 detection mechanism on top of these algorithms, this is not 1007 necessary, because RACK implicitly subsumes most of them. 1009 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 1010 threshold based on the current or previous flight sizes. RACK takes 1011 a different approach, by using only one ACK event and a reordering 1012 window. RACK can be seen as an extended Early Retransmit [RFC5827] 1013 without a FlightSize limit but with an additional reordering window. 1014 [FACK] considers an original packet to be lost when its sequence 1015 range is sufficiently far below the highest SACKed sequence.
In some 1016 sense RACK can be seen as a generalized form of FACK that operates in 1017 time space instead of sequence space, enabling it to better handle 1018 reordering, application-limited traffic, and lost retransmissions. 1020 Nevertheless, RACK is still an experimental algorithm, whereas the 1021 oldest loss detection algorithm, the 3 duplicate ACK threshold 1023 [RFC5681], has long been standardized and widely deployed. Therefore 1024 RACK can easily and optionally support the conventional approach for 1025 compatibility. 1027 RACK is compatible with and does not interfere with the standard 1028 RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 1029 algorithms [RFC3522]. This is because RACK only detects loss by 1030 using ACK events. It neither changes the RTO timer calculation nor 1031 detects spurious timeouts. 1033 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 1034 because a tail loss probe solicits either an ACK or SACK, which can 1035 be used by RACK to detect more losses. RACK can be used to relax 1036 TLP's requirement for using FACK and retransmitting the highest- 1037 sequenced packet, because RACK is agnostic to packet sequence 1038 numbers, and uses transmission time instead. Thus TLP could be 1039 modified to retransmit the first unacknowledged packet, which could 1040 improve application latency. 1042 7.5. Interaction with congestion control 1044 RACK intentionally decouples loss detection from congestion control. 1045 RACK only detects losses; it does not modify the congestion control 1046 algorithm [RFC5681][RFC6937]. However, RACK may detect losses 1047 earlier or later than the conventional duplicate ACK threshold 1048 approach does. A packet marked lost by RACK SHOULD NOT be 1049 retransmitted until congestion control deems this appropriate. 1050 Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used 1051 when using RACK.
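As a non-normative illustration of this decoupling, a sender might
separate RACK's lost-packet marking from a PRR-like pacing decision;
the sketch below is a crude, packet-based simplification of [RFC6937],
not the full algorithm, and all names are illustrative:

```python
# Non-normative sketch: loss detection only *marks* packets lost; a
# separate congestion-control step (here a crude packet-based
# PRR-like gate, not the full [RFC6937] algorithm) decides how many
# of the marked packets may actually be (re)transmitted on this ACK.

import math

def prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs, pipe):
    """Simplified packet-based Proportional Rate Reduction pacing.
    prr_delivered/prr_out: packets delivered/sent during recovery;
    recover_fs: FlightSize at the start of recovery."""
    if pipe > ssthresh:
        # Spread the cwnd reduction proportionally across the recovery
        return max(0, math.ceil(prr_delivered * ssthresh / recover_fs)
                   - prr_out)
    # Otherwise grow back toward ssthresh (simplified slow-start part)
    return max(0, ssthresh - pipe)

def retransmit_budget(lost_packets, sndcnt):
    """Congestion control, not loss detection, bounds retransmissions."""
    return lost_packets[:sndcnt]
```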
1053 RACK is applicable for both fast recovery and recovery after a 1054 retransmission timeout (RTO) in [RFC5681]. RACK applies equally to 1055 fast recovery and RTO recovery because RACK is purely based on the 1056 transmission time order of packets. When a packet retransmitted by 1057 RTO is acknowledged, RACK will mark any unacked packet sent 1058 sufficiently prior to the RTO as lost, because at least one RTT has 1059 elapsed since these packets were sent. 1061 The following simple example compares how RACK and non-RACK loss 1062 detection interacts with congestion control: suppose a TCP sender has 1063 a congestion window (cwnd) of 20 packets on a SACK-enabled 1064 connection. It sends 10 data packets and all of them are lost. 1066 Without RACK, the sender would time out, reset cwnd to 1, and 1067 retransmit the first packet. It would take four round trips (1 + 2 + 1068 4 + 3 = 10) to retransmit all the 10 lost packets using slow start. 1069 The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 1070 packets due to congestion window validation. 1072 With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. 1073 If the sender implements Proportional Rate Reduction [RFC6937] it 1074 would slow start to retransmit the remaining 9 lost packets since the 1075 number of packets in flight (0) is lower than the slow start 1076 threshold (10). The slow start would again take four round trips (1 1077 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with 1078 an ending cwnd set to the slow start threshold of 10 packets. 1080 In both cases, the sender after the recovery would be in congestion 1081 avoidance. The difference in recovery latency (RTO + 4*RTT vs 6*RTT) 1082 can be significant if the RTT is much smaller than the minimum RTO (1 1083 second in RFC6298) or if the RTT is large. The former case is common 1084 in local area networks, data-center networks, or content distribution 1085 networks with deep deployments. 
The latter case is more common in 1086 developing regions with highly congested and/or high-latency 1087 networks. The ending congestion window after recovery also impacts 1088 subsequent data transfer. 1090 7.6. TLP recovery detection with delayed ACKs 1092 Delayed ACKs complicate the detection of repairs done by TLP, since 1093 with a delayed ACK the sender receives one fewer ACK than would 1094 normally be expected. To mitigate this complication, before sending 1095 a TLP loss probe retransmission, the sender should attempt to wait 1096 long enough that the receiver has sent any delayed ACKs that it is 1097 withholding. The sender algorithm described above features such a 1098 delay, in the form of WCDelAckT. Furthermore, if the receiver 1099 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1100 in the case of a delayed ACK the sender's TLP recovery detection 1101 algorithm (see above) can use the D-SACK information to infer that 1102 the original and TLP retransmission both arrived at the receiver. 1104 If there is ACK loss or a delayed ACK without a D-SACK, then this 1105 algorithm is conservative, because the sender will reduce cwnd when 1106 in fact there was no packet loss. In practice this is acceptable, 1107 and potentially even desirable: if there is reverse path congestion 1108 then reducing cwnd can be prudent. 1110 7.7. RACK for other transport protocols 1112 RACK can be implemented in other transport protocols. The algorithm 1113 can be simplified by skipping step 3 if the protocol can support a 1114 unique transmission or packet identifier (e.g. TCP timestamp options 1115 [RFC7323]). For example, the QUIC protocol implements RACK [QUIC- 1116 LR]. 1118 8. Experiments and Performance Evaluations 1120 RACK and TLP have been deployed at Google, for both connections to 1121 users in the Internet and internally. 
We conducted a performance 1122 evaluation experiment for RACK and TLP on a small set of Google Web 1123 servers in Western Europe that serve mostly European and some African 1124 countries. The experiment lasted three days in March 2017. The 1125 servers were divided evenly into four groups of roughly 5.3 million 1126 flows each: 1128 Group 1 (control): RACK off, TLP off, RFC 6675 on 1130 Group 2: RACK on, TLP off, RFC 6675 on 1132 Group 3: RACK on, TLP on, RFC 6675 on 1134 Group 4: RACK on, TLP on, RFC 6675 off 1136 All groups used Linux with CUBIC congestion control, an initial 1137 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1138 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1139 disabled FACK, because the goal of this setup is to compare RACK and 1140 TLP to RFC-based loss recoveries and FACK is not an IETF RFC. Since 1141 TLP depends on either FACK or RACK, we 1142 could not run another group that enables TLP only (with both RACK and 1143 FACK disabled). Group 4 tests whether RACK plus TLP can 1144 completely replace the DupThresh-based loss detection of [RFC6675]. 1146 The servers sit behind a load balancer that distributes the 1147 connections evenly across the four groups. 1149 Each group handles a similar number of connections and sends and 1150 receives similar amounts of data. We compare total time spent in 1151 loss recovery across groups. The recovery time is measured from when 1152 the recovery and retransmission starts, until the remote host has 1153 acknowledged the highest sequence (SND.NXT) at the time the recovery 1154 started. Therefore the recovery includes both fast recoveries and 1155 timeout recoveries. 1157 Our data shows that Group 2 recovery latency is only 0.3% lower than 1158 the Group 1 recovery latency. But Group 3 recovery latency is 25% 1159 lower than Group 1 due to a 40% reduction in RTO-triggered 1160 recoveries!
Therefore it is important to implement both TLP and RACK 1161 for performance. Group 4's total recovery latency is 0.02% lower 1162 than Group 3's, indicating that RACK plus TLP can successfully 1163 replace RFC6675 as a standalone recovery mechanism. 1165 We want to emphasize that the current experiment is limited in terms 1166 of network coverage. The connectivity in Western Europe is fairly 1167 good, so loss recovery is not a major performance bottleneck. 1168 We plan to expand our experiments to regions with worse connectivity, 1169 in particular on networks with strong traffic policing. 1171 9. Security Considerations 1173 RACK does not change the risk profile for TCP. 1175 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 1176 MSS-size packet sent, the receiver or the attacker might send MSS 1177 ACKs that SACK or acknowledge one additional byte per ACK. This 1178 would not fool RACK. RACK.xmit_ts would not advance because all the 1179 sequences of the packet are transmitted at the same time (carry the 1180 same transmission timestamp). In other words, SACKing only one byte 1181 of a packet or SACKing the packet in its entirety has the same effect 1182 on RACK. 1184 10. IANA Considerations 1186 This document makes no request of IANA. 1188 Note to RFC Editor: this section may be removed on publication as an 1189 RFC. 1191 11. Acknowledgments 1193 The authors thank Matt Mathis for his insights on FACK and Michael 1194 Welzl for his per-packet timer idea that inspired this work. Eric 1195 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana 1196 Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi 1197 Nishida, and Bob Briscoe contributed to the draft and the 1198 implementations in Linux, FreeBSD and QUIC. 1200 12. References 1202 12.1. Normative References 1204 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 1205 Options", RFC 2018, October 1996.
1207 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1208 Requirement Levels", RFC 2119, March 1997. 1210 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1211 Extension to the Selective Acknowledgement (SACK) Option 1212 for TCP", RFC 2883, July 2000. 1214 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 1215 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 1216 November 2006. 1218 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1219 Control", RFC 5681, September 2009. 1221 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 1222 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1223 Spurious Retransmission Timeouts with TCP", RFC 5682, 1224 September 2009. 1226 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 1227 Hurtig, "Early Retransmit for TCP and Stream Control 1228 Transmission Protocol (SCTP)", RFC 5827, April 2010. 1230 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1231 "Computing TCP's Retransmission Timer", RFC 6298, June 1232 2011. 1234 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1235 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1236 Based on Selective Acknowledgment (SACK) for TCP", 1237 RFC 6675, August 2012. 1239 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 1240 Rate Reduction for TCP", RFC 6937, May 2013. 1242 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R. 1243 Scheffenegger, "TCP Extensions for High Performance", 1244 RFC 7323, September 2014. 1246 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, 1247 September 1981. 1249 12.2. Informative References 1251 [FACK] Mathis, M. and M. Jamshid, "Forward acknowledgement: 1252 refining TCP congestion control", ACM SIGCOMM Computer 1253 Communication Review, Volume 26, Issue 4, Oct. 1996. 1256 [POLICER16] 1257 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1258 Y., Karim, T., Katz-Bassett, E., and R.
Govindan, "An 1259 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 1260 2016. 1262 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1263 Congestion Control", draft-tsvwg-quic-loss-recovery-01 1264 (work in progress), June 2016. 1266 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1267 and SCTP RTO Restart", RFC 7765, February 2016. 1269 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1270 "TCP Congestion Control With a Misbehaving Receiver", ACM 1271 Computer Communication Review, 29(5), 1999. 1273 [THIN-STREAM] 1274 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1275 "TCP enhancements for interactive thin-stream 1276 applications", NOSSDAV, 2008. 1278 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1279 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1280 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1281 in progress), August 2013. 1283 Authors' Addresses 1285 Yuchung Cheng 1286 Google, Inc 1287 1600 Amphitheater Parkway 1288 Mountain View, California 94043 1289 USA 1291 Email: ycheng@google.com 1293 Neal Cardwell 1294 Google, Inc 1295 76 Ninth Avenue 1296 New York, NY 10011 1297 USA 1299 Email: ncardwell@google.com 1301 Nandita Dukkipati 1302 Google, Inc 1303 1600 Amphitheater Parkway 1304 Mountain View, California 94043 1306 Email: nanditad@google.com 1307 Priyaranjan Jha 1308 Google, Inc 1309 1600 Amphitheater Parkway 1310 Mountain View, California 94043 1312 Email: priyarjha@google.com