TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: May 4, 2017                                         Google, Inc
                                                        October 31, 2016

        RACK: a time-based fast loss detection algorithm for TCP
                        draft-ietf-tcpm-rack-01

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").
   RACK uses the notion of time, instead of packet or sequence counts,
   to detect losses, for modern TCP implementations that can support
   per-packet timestamps and the selective acknowledgment (SACK)
   option.  It is intended to replace the conventional DUPACK
   threshold approach and its variants, as well as other nonstandard
   approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet- or sequence-counting approaches for
   detecting losses.
   RACK deems a packet lost if some packet sent sufficiently later has
   been delivered.  It does this by recording packet transmission
   times and inferring losses using cumulative acknowledgments or
   selective acknowledgment (SACK) TCP options.

   In the last couple of years we have observed several increasingly
   common loss and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK) or
       routers' internal load balancing can deliver TCP packets out
       of order.  The degree of such reordering is usually on the
       order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard
   and proposed loss detection algorithms
   [RFC3517][RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-
   STREAM][TLP], we have found that together they do not perform
   well.  The main reason is that many of them are based on the
   classic rule of counting duplicate acknowledgments [RFC5681].
   They can detect loss either quickly or accurately, but not both,
   especially when the sender is application-limited or the
   reordering is unpredictable.  And under these conditions none of
   them can detect lost retransmissions well.

   Also, these algorithms, including the RFCs, rarely address their
   interactions with one another.  For example, FACK may consider a
   packet lost while RFC3517 may not.  Implementing N algorithms
   while dealing with N^2 interactions is a daunting and error-prone
   task.
   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one simpler, and
   also more effective, algorithm.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before it were
   either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC3517][FACK].  But the key innovation
   in RACK is to use per-packet transmission timestamps and the
   widely deployed SACK option to make time-based inferences, instead
   of inferring losses with packet- or sequence-counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   dupthresh) is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, and the rest arrive shortly after, in order.  To handle
   this effectively, a sender would need to constantly adjust the
   dupthresh to the burst size; but this would risk increasing the
   frequency of RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems for
   packet-counting approaches [RFC5681][RFC3517][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering
   in sequence space and ordering in time.  So when retransmissions
   are lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge whether some packets sent
   prior to that time have "expired" by passing a certain reordering
   settling window.
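To make the notion of "expired" concrete, here is a minimal Python sketch of the inference (an illustrative model only, not the normative algorithm of Section 5; the `Packet` fields and `reo_wnd` mirror the variables this document defines later):

```python
from dataclasses import dataclass

@dataclass
class Packet:
    end_seq: int
    xmit_ts: float        # time of last (re)transmission, in seconds
    delivered: bool = False
    lost: bool = False

def rack_mark_losses(packets, rack_xmit_ts, reo_wnd):
    """Mark as lost every undelivered packet that was sent more than
    reo_wnd before the most recently sent packet known to be delivered
    (rack_xmit_ts).  This is the time-based analogue of dupthresh."""
    for pkt in packets:
        if pkt.delivered or pkt.lost:
            continue
        if rack_xmit_ts - pkt.xmit_ts > reo_wnd:
            pkt.lost = True
    return [p.end_seq for p in packets if p.lost]
```

For example, if P1 (sent at t=0) is still unacknowledged when P2 (sent 2 ms later) is SACKed, and the reordering window is 1 ms, P1 is marked lost on that ACK.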
   On each ACK, RACK marks any already-expired packets lost, and for
   any packets that have not yet expired it waits until the
   reordering window passes and then marks them lost as well.  In
   either case, RACK can repair the loss without waiting for a (long)
   RTO.  RACK can be applied to both fast recovery and timeout
   recovery, and can detect losses on both originally transmitted and
   retransmitted packets, making it a great all-weather recovery
   mechanism.

3.  Requirements

   The reader is expected to be familiar with the definitions given
   in the TCP congestion control [RFC5681] and selective
   acknowledgment [RFC2018] RFCs.  Familiarity with the conservative
   SACK-based recovery for TCP [RFC6675] is not required but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK)
       options [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.
       For round-trip times lower than a millisecond (e.g., intra-
       datacenter communications) microsecond granularity would
       significantly improve detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure that stores selective
   acknowledgment information on a per-connection basis.  For ease of
   explaining the algorithm, we use a pseudo-scoreboard that manages
   the data in sequence number ranges.  But the specifics of the data
   structure are left to the implementor.

   RACK does not require any change on the receiver.

4.  Definitions of variables

   A sender needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.
   The sender needs to record the transmission time for each packet
   sent and not yet acknowledged.  The time MUST be stored at
   millisecond granularity or finer.

   "RACK.packet".  Among all the packets that have been either
   selectively or cumulatively acknowledged, RACK.packet is the one
   that was sent most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of
   RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.RTT" is the RTT measured when RACK.xmit_ts, above, was last
   changed.  It is the RTT of the most recently transmitted packet
   that has been delivered (either cumulatively or selectively
   acknowledged) on the connection.

   "RACK.reo_wnd" is a reordering window for the connection, computed
   in the unit of time used for recording packet transmission times.
   It is used to defer the moment at which RACK marks a packet lost.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.

   Note that the Packet.xmit_ts variable is kept per packet in
   flight.  The RACK.xmit_ts, RACK.RTT, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

5.  Algorithm Details

5.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

5.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained as in [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.
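For illustration, a windowed minimum filter might be sketched as follows (the filter choice and the window length are assumptions for the sketch; this document does not mandate a particular filter):

```python
from collections import deque

class WindowedMinRTT:
    """Track the minimum RTT seen over the last `window` seconds.
    Old samples expire so the estimate can rise after a path change."""
    def __init__(self, window=300.0):
        self.window = window
        self.samples = deque()   # (timestamp, rtt) pairs, rtts increasing

    def update(self, now, rtt):
        # Drop samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        # Drop samples that can never be the minimum again.
        while self.samples and self.samples[-1][1] >= rtt:
            self.samples.pop()
        self.samples.append((now, rtt))

    def min_rtt(self):
        # Assumes at least one sample has been recorded.
        return self.samples[0][1]
```

Because the deque keeps RTTs in increasing order, the front entry is always the windowed minimum, and each sample is inserted and removed at most once.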
   The sender can track a simple global minimum of all RTT
   measurements from the connection, or a windowed min-filtered value
   of recent RTT measurements.  This document does not specify an
   exact approach.

   Step 2: Update RACK.reo_wnd.

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  By default it is 1 millisecond.  We RECOMMEND implementing
   the reordering detection in [REORDER-DETECT][RFC4737] to
   dynamically adjust the reordering window.  When the sender detects
   packet reordering, RACK.reo_wnd MAY be changed to RACK.min_RTT/4.
   We discuss the reordering window further in the next section.

   Step 3: Advance RACK.xmit_ts and update RACK.RTT and RACK.end_seq.

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record
   the most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.  Ignore the packet if any of its TCP sequences has
   been retransmitted before and either of two conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_RTT ago.
       While it is still possible the packet was spuriously
       retransmitted because of a recent RTT decrease, our experience
       suggests this is a reasonable heuristic.

   If this ACK causes a change to RACK.xmit_ts, then record the RTT
   and sequence implied by this ACK:

      RACK.RTT = Now() - RACK.xmit_ts
      RACK.end_seq = Packet.end_seq

   Exit here and omit the following steps if RACK.xmit_ts has not
   changed.

   Step 4: Detect losses.
   For each packet that has not been fully SACKed, if RACK.xmit_ts is
   after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, the packet was likely lost.

   If a packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the given packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance
   RACK.xmit_ts; but this risks a timeout (RTO) if no more ACKs come
   back (e.g., due to losses or application limits).  For timely loss
   detection, the sender MAY install a "reordering settling" timer
   set to fire at the earliest moment at which it is safe to conclude
   that some packet is lost.  The earliest such moment is when the
   reordering window of the earliest unacked packet in flight
   expires.

   This timer expiration value can be derived as follows.  As a
   starting point, we consider that the reordering window has passed
   if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for that packet
   has elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.
       the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

      RACK.xmit_ts > Packet.xmit_ts
      AND
      (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) > RACK.reo_wnd

   If we solve this second condition for "now", the moment at which
   we can declare a packet lost, then we get:

      now > Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

      now > Packet.xmit_ts + RACK.RTT + RACK.reo_wnd

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call
   RACK_detect_loss().  The algorithm includes an additional
   optimization to break timestamp ties by using the TCP sequence
   space.  The optimization is particularly useful for detecting
   losses in a timely manner with TCP Segmentation Offload (TSO),
   where multiple packets in one TSO blob have identical timestamps.
   It is also useful when the timestamp clock granularity is close to
   or longer than the actual round-trip time.

   RACK_detect_loss():
       min_timeout = 0

       For each packet, Packet, in the scoreboard:
           If Packet is already SACKed, ACKed,
              or marked lost and not yet retransmitted:
               Skip to the next packet

           If Packet.xmit_ts > RACK.xmit_ts:
               Skip to the next packet
           /* Timestamp tie breaker */
           If Packet.xmit_ts == RACK.xmit_ts AND
              Packet.end_seq > RACK.end_seq:
               Skip to the next packet

           timeout = Packet.xmit_ts + RACK.RTT + RACK.reo_wnd + 1
           If Now() >= timeout:
               Mark Packet lost
           Else If (min_timeout == 0) or (timeout is before min_timeout):
               min_timeout = timeout

       If min_timeout != 0:
           Arm a timer to call RACK_detect_loss() after min_timeout

6.  Tail Loss Probe: fast recovery on tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.
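Before turning to TLP, the RACK_detect_loss() pseudocode above can also be rendered as a small Python sketch (simplified: the scoreboard is a flat list, times are in milliseconds, and the pseudocode's "+1" clock-granularity allowance is omitted):

```python
from dataclasses import dataclass

@dataclass
class Pkt:                    # one simplified scoreboard entry
    end_seq: int
    xmit_ts: int              # ms
    delivered: bool = False
    lost: bool = False

@dataclass
class RackState:              # the RACK.* variables of Section 4
    xmit_ts: int
    end_seq: int
    rtt: int
    reo_wnd: int

def rack_detect_loss(packets, rack, now):
    """Mark expired packets lost; return the earliest time at which
    the remaining packets should be re-checked (None if none).  The
    'lost and not yet retransmitted' skip from the pseudocode is
    approximated here by skipping any already-lost packet."""
    min_timeout = None
    for pkt in packets:
        if pkt.delivered or pkt.lost:
            continue
        if pkt.xmit_ts > rack.xmit_ts:
            continue
        # Timestamp tie breaker: same xmit_ts but later sequence.
        if pkt.xmit_ts == rack.xmit_ts and pkt.end_seq > rack.end_seq:
            continue
        timeout = pkt.xmit_ts + rack.rtt + rack.reo_wnd
        if now >= timeout:
            pkt.lost = True
        elif min_timeout is None or timeout < min_timeout:
            min_timeout = timeout
    return min_timeout   # caller arms the reordering timer for this time
```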
   TLP triggers fast recovery to quickly repair tail losses that
   otherwise could be recovered only by RTOs.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can be either new, previously unsent
   data or a retransmission.  In either case the goal is to elicit
   more feedback from the receiver, in the form of an ACK
   (potentially with SACK blocks), to allow RACK to trigger fast
   recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  Tail losses at the end of an application transaction.

   2.  Lost retransmits, which can halt fast recovery if the ACK
       stream completely dries up.  For example, consider a window of
       three data packets (P1, P2, P3) that are sent; P1 and P2 are
       dropped.  On receipt of a SACK for P3, RACK marks P1 and P2 as
       lost and retransmits them as R1 and R2.  Suppose R1 and R2 are
       lost as well, so there are no more returning ACKs to detect R1
       and R2 as lost.  Recovery stalls.

   3.  Tail losses of ACKs.

   4.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo their
       consequences (though F-RTO cannot be used in many situations).

6.1.  Tail Loss Probe: An Example

   The following is an example of TLP.  All events listed are at a
   TCP sender.

   (1)  The sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.
        There is no more new data to transmit.  A PTO is scheduled to
        fire in 2 RTTs, after the transmission of the 10th segment.

   (2)  The sender receives acknowledgements (ACKs) for segments 1-5;
        segments 6-10 are lost and no ACKs are received.
        The sender reschedules its PTO timer relative to the last
        received ACK, which is the ACK for segment 5 in this case.
        The sender sets the PTO interval using the calculation
        described in step (2) of the algorithm.

   (3)  When the PTO fires, the sender retransmits segment 10.

   (4)  After an RTT, a SACK for packet 10 arrives.  The ACK also
        carries SACK holes for segments 6, 7, 8 and 9.  This triggers
        RACK-based loss recovery.

   (5)  The connection enters fast recovery and retransmits the
        remaining lost segments.

6.2.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: the amount of outstanding data in the network, as
   defined in [RFC5681].

   RTO: the transport's retransmission timeout (RTO), based on
   measured round-trip times (RTT) between the sender and receiver,
   as specified in [RFC6298] for TCP.

   PTO: probe timeout, a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in
   [RFC6298].

   Open state: the sender has so far received in-sequence ACKs with
   no SACK blocks, and no other indications (such as a retransmission
   timeout) that a loss may have occurred.

   The TLP algorithm has three phases, which we discuss in turn.

6.2.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should schedule a PTO after transmitting new data or
   receiving an ACK if the following conditions are met:

   (a) The connection is in Open state.

   (b) The connection is either cwnd-limited (the data in flight
       matches or exceeds the cwnd) or application-limited (there is
       no unsent data that the receiver window allows to be sent).

   (c) SACK is enabled for the connection.

   (d) The most recently transmitted data was not itself a TLP probe
       (i.e.
       a sender MUST NOT send consecutive or back-to-back TLP
       probes).

   (e) TLPRtxOut is false, indicating there is no TLP retransmission
       episode in progress (see below).

   Step 2: Select the duration of the PTO.

   A sender SHOULD use the following logic to select the duration of
   a PTO:

      If an SRTT estimate is available:
          PTO = 2 * SRTT
      Else:
          PTO = initial RTO of 1 sec
      If FlightSize == 1:
          PTO = max(PTO, 1.5 * SRTT + WCDelAckT)
      PTO = max(10ms, PTO)
      PTO = min(RTO, PTO)

   Aiming for a PTO value of 2*SRTT allows a sender to wait long
   enough to know that an ACK is overdue.  Under normal
   circumstances, i.e. no losses, an ACK typically arrives in one
   SRTT.  But choosing the PTO to be exactly an SRTT is likely to
   generate spurious probes, given that network delay variance and
   even end-system timings can easily push an ACK beyond one SRTT.
   We therefore chose the PTO to be the next integral multiple of
   SRTT.  Similarly, current end-system processing latencies and
   timer granularities can easily push an ACK beyond 10ms, so senders
   SHOULD use a minimum PTO value of 10ms.  If the RTO is smaller
   than the computed value for the PTO, then a probe is scheduled to
   be sent at the RTO time.

   WCDelAckT stands for worst-case delayed ACK timer.  When
   FlightSize is 1, the PTO is additionally inflated by WCDelAckT to
   compensate for a potentially long delayed ACK timer at the
   receiver.  The RECOMMENDED value for WCDelAckT is 200ms, or the
   delayed ACK interval value explicitly negotiated by the sender and
   receiver, if one is available.

6.2.2.  Phase 2: Sending a loss probe

   When the PTO fires, transmit a probe data segment:

      If a previously unsent segment exists AND
         the receive window allows new data to be sent:
          Transmit that new segment
          FlightSize += SMSS
          The cwnd remains unchanged
          Record Packet.xmit_ts
      Else:
          Retransmit the last segment
          The cwnd remains unchanged

6.2.3.
   Phase 3: ACK processing

   On each incoming ACK, the sender should cancel any existing loss
   probe timer.  The timer will be rescheduled if appropriate.

6.3.  TLP recovery detection

   If the only loss in an outstanding window of data was the last
   segment, then a TLP loss probe retransmission of that data segment
   might repair the loss.  TLP loss detection examines ACKs to detect
   when the probe might have repaired a loss, and thus allows
   congestion control to properly reduce the congestion window (cwnd)
   [RFC5681].

   Consider a TLP retransmission episode where a sender retransmits a
   tail packet in a flight.  The TLP retransmission episode ends when
   the sender receives an ACK with a SEG.ACK above the SND.NXT at the
   time the episode started.  During the TLP retransmission episode
   the sender checks for a duplicate ACK or D-SACK indicating that
   both the original segment and the TLP retransmission arrived at
   the receiver, meaning there was no loss that needed repairing.  If
   the TLP sender does not receive such an indication before the end
   of the TLP retransmission episode, then it MUST estimate that
   either the original data segment or the TLP retransmission was
   lost, and congestion control MUST react appropriately to that loss
   as it would to any other loss.

   Since a significant fraction of the hosts that support SACK do not
   support duplicate selective acknowledgments (D-SACKs) [RFC2883],
   the TLP algorithm for detecting such lost segments relies only on
   basic SACK support [RFC2018].

   Definitions of variables

   TLPRtxOut: a boolean indicating whether there is an unacknowledged
   TLP retransmission.

   TLPHighRxt: the value of SND.NXT at the time of sending a TLP
   retransmission.

6.3.1.
   Initializing and resetting state

   When a connection is created, or suffers a retransmission timeout,
   or enters fast recovery, it should reset TLPRtxOut to false.

6.3.2.  Recording loss probe states

   Senders must only send a TLP loss probe retransmission if
   TLPRtxOut is false.  This ensures that at any given time a
   connection has at most one outstanding TLP retransmission.  This
   allows the sender to use the algorithm described in this section
   to estimate whether any data segments were lost.

   Note that this condition only restricts TLP loss probes that are
   retransmissions.  There may be an arbitrary number of outstanding
   unacknowledged TLP loss probes that consist of new, previously-
   unsent data, since the retransmission timeout and fast recovery
   algorithms are sufficient to detect losses of such probe segments.

   Upon sending a TLP probe that is a retransmission, the sender sets
   TLPRtxOut to true and TLPHighRxt to SND.NXT.

   Detecting recoveries done by loss probes

   Step 1: Track ACKs indicating receipt of original and
   retransmitted segments.

   A sender considers both the original segment and the TLP probe
   retransmission segment as acknowledged if either (i) or (ii) is
   true:

   (i) This is a duplicate acknowledgment (as defined in [RFC5681],
       section 2), and all of the following conditions are met:

       (a) TLPRtxOut is true

       (b) SEG.ACK == TLPHighRxt

       (c) SEG.ACK == SND.UNA

       (d) the segment contains no SACK blocks for sequence ranges
           above TLPHighRxt

       (e) the segment contains no data

       (f) the segment is not a window update

   (ii) This is an ACK acknowledging a sequence number at or above
        TLPHighRxt and it contains a D-SACK; i.e.
        all of the following conditions are met:

        (a) TLPRtxOut is true

        (b) SEG.ACK >= TLPHighRxt

        (c) the ACK contains a D-SACK block

   If either condition (i) or (ii) is met, then the sender estimates
   that the receiver received both the original data segment and the
   TLP probe retransmission, and so the sender considers the TLP
   episode to be done, and records that fact by setting TLPRtxOut to
   false.

   Step 2: Mark the end of a TLP retransmission episode and detect
   losses.

   If the sender receives a cumulative ACK for data beyond the TLP
   loss probe retransmission then, in the absence of reordering on
   the return path of ACKs, it should have received any ACKs for the
   original segment and the TLP probe retransmission segment.  At
   that time, if the TLPRtxOut flag is still true, and thus indicates
   that the TLP probe retransmission remains unacknowledged, then the
   sender should presume that at least one of its data segments was
   lost, so it SHOULD invoke a congestion control response equivalent
   to the response to any other loss.

   More precisely, on each ACK, after executing step (5a), the sender
   SHOULD reset TLPRtxOut to false and invoke the congestion control
   response for the loss event, unless TLP has successfully repaired
   it.

7.  RACK and TLP discussions

7.1.  Advantages

   The biggest advantage of RACK is that every data packet, whether
   it is an original data transmission or a retransmission, can be
   used to detect losses of the packets sent prior to it.

   Example: tail drop.  Consider a sender that transmits a window of
   three data packets (P1, P2, P3), and P1 and P3 are lost.  Suppose
   the transmission of each packet is at least RACK.reo_wnd (1
   millisecond by default) after the transmission of the previous
   packet.  RACK will mark P1 as lost when the SACK of P2 is
   received, and this will trigger the retransmission of P1 as R1.
   When R1 is cumulatively acknowledged, RACK will mark P3 as lost
   and the sender will retransmit P3 as R3.  This example illustrates
   how RACK is able to repair certain drops at the tail of a
   transaction without any timer.  Notice that neither the
   conventional duplicate ACK threshold [RFC5681], nor [RFC6675], nor
   the Forward Acknowledgment [FACK] algorithm can detect such
   losses, because of their packet- or sequence-counting
   requirements.

   Example: lost retransmit.  Consider a window of three data packets
   (P1, P2, P3) that are sent; P1 and P2 are dropped.  Suppose the
   transmission of each packet is at least RACK.reo_wnd (1
   millisecond by default) after the transmission of the previous
   packet.  When P3 is SACKed, RACK will mark P1 and P2 lost and they
   will be retransmitted as R1 and R2.  Suppose R1 is lost again (as
   a tail drop) but R2 is SACKed; RACK will mark R1 lost again for
   retransmission.  Again, neither the conventional three duplicate
   ACK threshold approach, nor [RFC6675], nor the Forward
   Acknowledgment [FACK] algorithm can detect such losses.  And such
   lost retransmissions are very common when TCP is being rate-
   limited, particularly by token bucket policers with a large bucket
   depth and a low rate limit.  Retransmissions are often lost
   repeatedly because standard congestion control requires multiple
   round trips to reduce the rate below the policed rate.

   Example: (small) degree of reordering.  Consider a common
   reordering event: a window of packets is sent as (P1, P2, P3).  P1
   and P2 carry a full payload of MSS octets, but P3 has only a
   1-octet payload due to application-limited behavior.  Suppose the
   sender has detected reordering previously (e.g., by implementing
   the algorithm in [REORDER-DETECT]) and thus RACK.reo_wnd is
   min_RTT/4.  Now P3 is reordered and delivered first, before P1 and
   P2.
   As long as P1 and P2 are delivered within min_RTT/4, RACK will not
   consider P1 and P2 lost.  But if P1 and P2 are delivered outside
   the reordering window, then RACK will still falsely mark P1 and P2
   lost.  We discuss how to reduce such false positives at the end of
   this section.

   The examples above show that RACK is particularly useful when the
   sender is limited by the application, which is common for
   interactive, request/response traffic.  Similarly, RACK still
   works when the sender is limited by the receive window, which is
   common for applications that use the receive window to throttle
   the sender.

   For some implementations (e.g., Linux), RACK works quite
   efficiently with TCP Segmentation Offload (TSO).  RACK always
   marks the entire TSO blob lost because the packets in the same TSO
   blob have the same transmission timestamp.  By contrast, the
   counting-based algorithms (e.g., [RFC3517][RFC5681]) may mark only
   a subset of the packets in the TSO blob lost, forcing the stack to
   perform expensive fragmentation of the TSO blob, or to selectively
   tag individual packets lost in the scoreboard.

7.2.  Disadvantages

   RACK requires the sender to record the transmission time of each
   packet sent at a clock granularity of one millisecond or finer.
   TCP implementations that already record this for RTT estimation do
   not require any new per-packet state.  But implementations that
   are not yet recording packet transmission times will need to add
   per-packet internal state (commonly either 4 or 8 octets per
   packet) to track transmission times.  In contrast, the
   conventional approach requires only a single variable per
   connection to count duplicate ACKs against the threshold.

7.3.  Adjusting the reordering window

   RACK uses a reordering window of min_RTT / 4.
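As a sketch, the window update from Step 2 of Section 5.2 amounts to the following (the function name is illustrative; the boolean reordering signal is assumed to come from a detector such as [REORDER-DETECT]):

```python
def update_reo_wnd(min_rtt_ms, reordering_seen):
    """Return the RACK reordering window in milliseconds.
    Per this document: default to 1 ms; once reordering has been
    detected on the connection, widen the window to min_RTT / 4."""
    if reordering_seen:
        return min_rtt_ms / 4.0
    return 1.0
```

For example, on a 40 ms path the window grows from the 1 ms default to 10 ms once reordering is observed.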
It uses the minimum 682 RTT to accommodate reordering introduced by packets traversing 683 slightly different paths (e.g., router-based parallelism schemes) or 684 out-of-order deliveries in the lower link layer (e.g., wireless links 685 using link-layer retransmission). Alternatively, RACK can use the 686 smoothed RTT used in RTT estimation [RFC6298]. However, smoothed RTT 687 can be inflated by orders of magnitude due to 688 congestion and bufferbloat, which would result in an overly 689 conservative reordering window and slow loss detection. Furthermore, 690 RACK uses a quarter of the minimum RTT because Linux TCP uses the same 691 factor in its implementation to delay Early Retransmit [RFC5827] to 692 reduce spurious loss detections in the presence of reordering, and 693 experience shows that this seems to work reasonably well. 695 One potential improvement is to further adapt the reordering window 696 by measuring the degree of reordering in time, instead of in packet 697 distances. But that requires storing the delivery timestamp of each 698 packet. Some scoreboard implementations currently merge SACKed 699 packets together to support TSO (TCP Segmentation Offload) for faster 700 scoreboard indexing. Supporting per-packet delivery timestamps is 701 difficult in such implementations. However, we acknowledge that the 702 current metric can be improved by further research. 704 7.4. Relationships with other loss recovery algorithms 706 The primary motivation of RACK is to ultimately provide a simple and 707 general replacement for some of the standard loss recovery algorithms 708 [RFC5681][RFC6675][RFC5827][RFC4653] and nonstandard ones 709 [FACK][THIN-STREAM]. While RACK can be a supplemental loss detection 710 mechanism on top of these algorithms, this is not necessary, because RACK 711 implicitly subsumes most of them. 713 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 714 threshold based on the current or previous flight sizes.
RACK takes 715 a different approach, by using only one ACK event and a reordering 716 window. RACK can be seen as an extended Early Retransmit [RFC5827] 717 without a FlightSize limit but with an additional reordering window. 718 [FACK] considers an original packet to be lost when its sequence 719 range is sufficiently far below the highest SACKed sequence. In some 720 sense RACK can be seen as a generalized form of FACK that operates in 721 time space instead of sequence space, enabling it to better handle 722 reordering, application-limited traffic, and lost retransmissions. 724 Nevertheless, RACK is still an experimental algorithm. Since the 725 oldest loss detection algorithm, the 3 duplicate ACK threshold 726 [RFC5681], has been standardized and widely deployed, we RECOMMEND 727 that TCP implementations use both RACK and the algorithm specified in 728 Section 3.2 of [RFC5681] for compatibility. 730 RACK is compatible with and does not interfere with the standard 731 RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 732 algorithms [RFC3522]. This is because RACK only detects loss by 733 using ACK events. It neither changes the timer calculation nor 734 detects spurious timeouts. 736 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 737 because a tail loss probe solicits either an ACK or a SACK, which can 738 be used by RACK to detect more losses. RACK can be used to relax 739 TLP's requirement for using FACK and retransmitting the highest- 740 sequenced packet, because RACK is agnostic to packet sequence 741 numbers, and uses transmission time instead. Thus TLP can be 742 modified to retransmit the first unacknowledged packet, which can 743 improve application latency. 745 7.5. Interaction with congestion control 747 RACK intentionally decouples loss detection from congestion control. 748 RACK only detects losses; it does not modify the congestion control 749 algorithm [RFC5681][RFC6937].
However, RACK may detect losses 750 earlier or later than the conventional duplicate ACK threshold 751 approach does. A packet marked lost by RACK SHOULD NOT be 752 retransmitted until congestion control deems this appropriate (e.g., 753 using [RFC6937]). 755 RACK is applicable for both fast recovery and recovery after a 756 retransmission timeout (RTO) in [RFC5681]. The distinction between 757 fast recovery and RTO recovery is not necessary because RACK is purely 758 based on the transmission time order of packets. When a packet 759 retransmitted by RTO is acknowledged, RACK will mark any unacked 760 packet sent sufficiently prior to the RTO as lost, because at least 761 one RTT has elapsed since these packets were sent. 763 7.6. TLP recovery detection with delayed ACKs 765 Delayed ACKs complicate the detection of repairs done by TLP, since 766 with a delayed ACK the sender receives one fewer ACK than would 767 normally be expected. To mitigate this complication, before sending 768 a TLP loss probe retransmission, the sender should attempt to wait 769 long enough that the receiver has sent any delayed ACKs that it is 770 withholding. The sender algorithm described above features such a 771 delay, in the form of WCDelAckT. Furthermore, if the receiver 772 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 773 in the case of a delayed ACK the sender's TLP loss detection 774 algorithm (in step (4)(a)(ii), above) can use the D-SACK information 775 to infer that the original and TLP retransmission both arrived at the 776 receiver. 778 If there is ACK loss or a delayed ACK without a D-SACK, then this 779 algorithm is conservative, because the sender will reduce cwnd when 780 in fact there was no packet loss. In practice this is acceptable, 781 and potentially even desirable: if there is reverse-path congestion 782 then reducing cwnd is prudent.
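The D-SACK-based inference described above can be sketched roughly as follows. This is an illustrative sketch only, not part of this specification; the names (tlp_probe_was_spurious, dsack_blocks) are hypothetical, and real TCP code would additionally use modulo-2^32 sequence comparisons:

```python
def tlp_probe_was_spurious(probe_start, probe_end, dsack_blocks):
    """Return True if some D-SACK block [start, end) reported by the
    receiver covers the TLP probe's sequence range, implying that both
    the original segment and the TLP retransmission arrived (so the
    probe repaired nothing and no cwnd reduction is needed)."""
    for start, end in dsack_blocks:
        if start <= probe_start and probe_end <= end:
            return True
    return False

# Delayed ACK carrying a D-SACK for the probed range: probe was a
# duplicate, so the sender infers there was no tail loss.
print(tlp_probe_was_spurious(1000, 2000, [(1000, 2000)]))  # True

# No D-SACK (e.g., the ACK was lost): conservatively assume the probe
# repaired a real loss and reduce cwnd.
print(tlp_probe_was_spurious(1000, 2000, []))              # False
```

As the text notes, the no-D-SACK branch errs on the conservative side, which is acceptable because reverse-path congestion also justifies reducing cwnd.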
784 However, in practice sending a single byte of data turned out to be 785 problematic to implement and more fragile than necessary. Instead we 786 use a full segment to probe but have to add complexity to compensate 787 for the probe itself masking losses. 789 7.7. RACK for other transport protocols 791 RACK can be implemented in other transport protocols. The algorithm 792 can be simplified by skipping step 3 if the protocol supports a unique 793 transmission or packet identifier (e.g., TCP echo options). For 794 example, the QUIC protocol implements RACK [QUIC-LR]. 796 8. Experiments and Performance Evaluations 798 RACK and TLP have been deployed at Google, both for connections 799 to users on the Internet and for internal traffic. We conducted a 800 performance evaluation of RACK and TLP on a small set of 801 Google Web servers in western Europe that serve most European and 802 some African countries. The length of the experiments was five days 803 (one weekend plus three weekdays) in October 2016, during which the servers were 804 divided evenly into three groups. 806 Group 1 (control): RACK off, TLP off 808 Group 2: RACK on, TLP off 810 Group 3: RACK on, TLP on 812 All groups run Linux with the CUBIC congestion control, an 813 initial window of 10 packets, and the fq/pacing qdisc. In terms of 814 specific recovery features, all of them enable RFC3517 (Conservative 815 SACK-based recovery) and RFC5682 (F-RTO) but disable FACK because it 816 is not an IETF RFC. The goal of this setup is to compare RACK and 817 TLP to RFC-based loss recoveries instead of Linux-based recoveries. 819 The servers sit behind a load balancer that distributes the 820 connections evenly across the three groups. 822 Each group handles a similar number of connections and sends and receives 823 similar amounts of data. We compare the total amount of time spent in 824 loss recovery across groups.
The recovery time is measured from when 825 recovery and retransmission start until the remote host has acknowledged 826 beyond the highest sequence at the time recovery started. 827 Thus the measurement includes both fast recoveries and timeout 828 recoveries. Our data show that Group 2 recovery latency is only 2% 829 lower than the Group 1 recovery latency, but Group 3 recovery 830 latency is 25% lower than Group 1, because the number of RTO- 831 triggered recoveries is reduced by 40%. Therefore, implementing 832 both TLP and RACK is important for performance. 834 We want to emphasize that the current experiment is limited in terms 835 of network coverage. Connectivity in western Europe is fairly 836 good, so loss recovery is not a performance bottleneck there. We 837 plan to expand our experiments to regions with worse connectivity, 838 in particular on networks with strong traffic policing. We also plan 839 to add a fourth group that disables RFC3517 and relies solely on RACK and TLP, 840 to see whether RACK plus TLP can completely replace all other SACK- 841 based recoveries. 843 9. Security Considerations 845 RACK does not change the risk profile for TCP. 847 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 848 MSS-sized packet sent, the receiver or the attacker might send MSS 849 ACKs that SACK or acknowledge one additional byte per ACK. This 850 would not fool RACK. RACK.xmit_ts would not advance because all the 851 sequences of the packet are transmitted at the same time (carry the 852 same transmission timestamp). In other words, SACKing only one byte 853 of a packet or SACKing the packet in its entirety has the same effect on 854 RACK. 856 10. IANA Considerations 858 This document makes no request of IANA. 860 Note to RFC Editor: this section may be removed on publication as an 861 RFC. 863 11. Acknowledgments 865 The authors thank Matt Mathis for his insights into FACK and Michael 866 Welzl for his per-packet timer idea that inspired this work.
Eric 867 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, and Jana Iyengar 868 contributed to the algorithm and the implementations in Linux, 869 FreeBSD, and QUIC. 871 12. References 873 12.1. Normative References 875 [RFC793] Postel, J., "Transmission Control Protocol", September 876 1981. 878 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 879 Options", RFC 2018, October 1996. 881 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 882 Rate Reduction for TCP", May 2013. 884 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 885 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 886 November 2006. 888 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 889 and Y. Nishida, "A Conservative Loss Recovery Algorithm 890 Based on Selective Acknowledgment (SACK) for TCP", 891 RFC 6675, August 2012. 893 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 894 "Computing TCP's Retransmission Timer", RFC 6298, June 895 2011. 897 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 898 Hurtig, "Early Retransmit for TCP and Stream Control 899 Transmission Protocol (SCTP)", RFC 5827, April 2010. 901 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 902 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 903 Spurious Retransmission Timeouts with TCP", RFC 5682, 904 September 2009. 906 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 907 Requirement Levels", RFC 2119, March 1997. 909 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 910 Control", RFC 5681, September 2009. 912 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 913 Extension to the Selective Acknowledgement (SACK) Option 914 for TCP", RFC 2883, July 2000. 916 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R. 917 Scheffenegger, "TCP Extensions for High Performance", 918 September 2014. 920 12.2. Informative References 922 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 923 refining TCP congestion control", ACM SIGCOMM Computer 924 Communication Review, Volume 26, Issue 4, October 925 1996. 927 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 928 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 929 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 930 in progress), August 2013. 932 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 933 and SCTP RTO Restart", February 2016. 935 [REORDER-DETECT] 936 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 937 "Detection and Quantification of Packet Reordering with 938 TCP", draft-zimmermann-tcpm-reordering-detection-02 (work 939 in progress), November 2014. 941 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 942 Congestion Control", draft-tsvwg-quic-loss-recovery-01 943 (work in progress), June 2016. 945 [THIN-STREAM] 946 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 947 "TCP enhancements for interactive thin-stream 948 applications", NOSSDAV, 2008. 950 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 951 "TCP Congestion Control With a Misbehaving Receiver", ACM 952 Computer Communication Review, 29(5), 1999. 954 [POLICER16] 955 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 956 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 957 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 958 2016. 960 12.3.
URIs 962 [1] https://tools.ietf.org/html/rfc6298 964 [2] https://tools.ietf.org/html/rfc5682 966 [3] https://tools.ietf.org/html/rfc5681 968 [4] https://tools.ietf.org/html/rfc6298 970 [5] https://tools.ietf.org/html/rfc6298 972 [6] https://tools.ietf.org/html/rfc5681 974 [7] https://tools.ietf.org/html/rfc2883 976 [8] https://tools.ietf.org/html/rfc2018 978 [9] https://tools.ietf.org/html/rfc2018 980 [10] https://tools.ietf.org/html/rfc5681 982 Authors' Addresses 984 Yuchung Cheng 985 Google, Inc 986 1600 Amphitheater Parkway 987 Mountain View, California 94043 988 USA 990 Email: ycheng@google.com 992 Neal Cardwell 993 Google, Inc 994 76 Ninth Avenue 995 New York, NY 10011 996 USA 998 Email: ncardwell@google.com 1000 Nandita Dukkipati 1001 Google, Inc 1002 1600 Amphitheater Parkway 1003 Mountain View, California 94043 1004 USA 1006 Email: nanditad@google.com