TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: September 6, 2018                                        P. Jha
                                                             Google, Inc
                                                           March 5, 2018

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-03

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").  RACK uses the notion of time,
   instead of packet or sequence counts, to detect losses, for modern
   TCP implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as
   well as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 6, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if some packet sent
   sufficiently later has been delivered.  It does this by recording
   packet transmission times and inferring losses using cumulative
   acknowledgments or selective acknowledgment (SACK) TCP options.

   In recent years we have observed several increasingly common loss
   and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK) or
       routers' internal load balancing can deliver TCP packets out of
       order.  The degree of such reordering is usually within the
       order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard
   and proposed loss detection algorithms
   [RFC3517][RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-
   STREAM][TLP], we have found that together they do not perform
   well.  The main reason is that many of them are based on the
   classic rule of counting duplicate acknowledgments [RFC5681].  They
   can detect losses either quickly or accurately, but not both,
   especially when the sender is application-limited or when
   reordering is unpredictable.  And under these conditions none of
   them can detect lost retransmissions well.

   Also, these algorithms, including RFCs, rarely address their
   interactions with one another.  For example, FACK may consider a
   packet lost while RFC3517 may not.  Implementing N algorithms while
   dealing with N^2 interactions is a daunting and error-prone task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one simpler, and
   also more effective, algorithm.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before it were
   either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC3517][FACK].  But the key innovation in
   RACK is to use per-packet transmission timestamps and the widely
   deployed SACK option to conduct time-based inferences, instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems for
   packet-counting approaches [RFC5681][RFC3517][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge whether some packets sent prior
   to that time have "expired" by passing a certain reordering
   settling window.  On each ACK, RACK marks any already-expired
   packets lost, and for any packets that have not yet expired it
   waits until the reordering window passes and then marks those lost
   as well.  In either case, RACK can repair the loss without waiting
   for a (long) RTO.  RACK can be applied to both fast recovery and
   timeout recovery, and can detect losses on both originally
   transmitted and retransmitted packets, making it an all-weather
   loss detection mechanism.
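   The following sketch captures the essence of this time-based
   inference (illustrative only; the normative rules, including how
   RACK.xmit_ts and RACK.reo_wnd are maintained, are specified in
   Section 5):

   RACK_deems_lost(Packet):  /* conceptual sketch, not normative */
       /* Some packet sent after this one has been delivered, and
          the reordering settling window has already passed. */
       Return RACK.xmit_ts > Packet.xmit_ts + RACK.reo_wnd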
3.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not required but helpful.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g., intra-
       datacenter communications) microsecond granularity would
       significantly reduce detection latency, but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explanation, we use a pseudo-scoreboard
   that manages the data in sequence number ranges.  But the specifics
   of the data structure are left to the implementor.

   RACK requires no changes at the receiver.
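   For illustration, the per-packet state implied by requirements 2
   and 3 could be kept alongside each scoreboard entry, along the
   following lines (an informal sketch; the field names and layout are
   examples, not part of this specification):

   Scoreboard entry (one per packet or sequence number range):
       Packet.start_seq      /* first sequence number in the range */
       Packet.end_seq        /* ending sequence number */
       Packet.xmit_ts        /* last (re)transmission time, stored at
                                millisecond granularity or finer */
       Packet.retransmitted  /* true if the range was ever
                                retransmitted */
       Packet.sacked         /* true if the range has been SACKed */
       Packet.lost           /* true if RACK has marked it lost */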
4.  Definitions of variables

   A sender needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet" is, among all the packets that have been either
   selectively or cumulatively acknowledged, the one that was sent
   most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.RTT" is the RTT measured when RACK.xmit_ts, above, was last
   changed.  It is the RTT of the most recently transmitted packet
   that has been delivered (either cumulatively or selectively
   acknowledged) on the connection.

   "RACK.reo_wnd" is a reordering window for the connection, computed
   in the unit of time used for recording packet transmission times.
   It is used to defer the moment at which RACK marks a packet lost.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.

   "RACK.reo_wnd_incr" is the multiplier applied to adjust
   RACK.reo_wnd.

   "RACK.reo_wnd_persist" is the number of loss recoveries before
   resetting RACK.reo_wnd.

   "RACK.dsack" indicates whether RACK.reo_wnd has been adjusted upon
   receiving a DSACK option.

   "RACK.roundtrip_seq" is the value of SND.NXT recorded when the
   reordering window was last adjusted upon receiving a DSACK; it is
   used to limit that adjustment to once per round trip.

   Note that the Packet.xmit_ts variable is per packet in flight.  The
   RACK.xmit_ts, RACK.end_seq, RACK.RTT, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

5.  Algorithm Details

5.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

5.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.  The sender can
   track a simple global minimum of all RTT measurements from the
   connection, or a windowed min-filtered value of recent RTT
   measurements.  This document does not specify an exact approach.
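   For example, a sender might implement Step 1 as follows (an
   informal sketch; the choice between a global minimum and a windowed
   minimum, and the filter window length, are implementation
   decisions, not part of this specification):

   RACK_update_min_RTT(rtt):
       If using a simple global minimum:
           RACK.min_RTT = min(RACK.min_RTT, rtt)
       Else:  /* windowed min filter over recent measurements */
           Discard samples older than the filter window
           Insert rtt into the sample window
           RACK.min_RTT = minimum of the samples in the window

   A windowed minimum adapts faster when the path RTT increases
   permanently (e.g., after a route change), at the cost of keeping a
   few extra samples per connection.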
   Step 2: Update RACK stats.

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record the
   most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.  Sometimes RACK.packet and Packet could carry the
   same transmit timestamp due to clock granularity or segmentation
   offloading (i.e., the two packets were sent as a jumbo frame into
   the NIC).  In that case the sequence numbers RACK.end_seq and
   Packet.end_seq are compared to break the tie.

   Since an ACK can also acknowledge retransmitted data packets,
   RACK.RTT can be vastly underestimated if the retransmission was
   spurious.  To avoid that, ignore a packet if any of its TCP
   sequences have been retransmitted before and either of two
   conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_RTT ago.
       While it is still possible the packet was spuriously
       retransmitted because of a recent RTT decrease, our experience
       suggests this is a reasonable heuristic.

   If the ACK is not ignored as invalid, update RACK.RTT to the RTT
   sample calculated using this ACK, and continue.  If this ACK or
   SACK was for the most recently sent packet, then record the
   RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this
   ACK.  Otherwise exit here and omit the following steps.

   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet has been retransmitted:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Continue
               If rtt < RACK.min_RTT:
                   Continue

           RACK.RTT = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts
               RACK.end_seq = Packet.end_seq
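   As a concrete illustration of the tie-breaking rule (the values are
   examples only): suppose two segments ending at sequence numbers
   2000 and 3000 were sent in one TSO blob and therefore share the
   transmit timestamp t.  If the segment ending at 3000 is delivered
   first:

   RACK_sent_after(t, 3000, t, 2000) => true   /* advances RACK.packet */
   RACK_sent_after(t, 2000, t, 3000) => false  /* a later SACK of the
                                                  lower range does not
                                                  move RACK.packet
                                                  backward */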
   Step 3: Update RACK reordering window.

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  Use a conservative window of min_RTT / 4 if the connection
   is not currently in loss recovery.  When in loss recovery, use a
   RACK.reo_wnd of zero in order to retransmit quickly.

   Extension 1: Optionally size the window based on DSACK.

   Further, the sender MAY leverage DSACK [RFC3708] to adapt the
   reordering window to higher degrees of reordering.  Receiving an
   ACK with a DSACK indicates a spurious retransmission, which in turn
   suggests that the RACK reordering window, RACK.reo_wnd, is likely
   too small.  The sender MAY increase the RACK.reo_wnd window
   linearly for every round trip in which the sender receives a DSACK,
   so that after N distinct round trips in which a DSACK is received,
   the RACK.reo_wnd is N * min_RTT / 4.  The inflated RACK.reo_wnd
   would persist for 16 loss recoveries and then reset to its starting
   value, min_RTT / 4.

   Extension 2: Optionally size the window if reordering has been
   observed.

   If the reordering window is too small or the connection does not
   support DSACK, then RACK can trigger spurious loss recoveries and
   reduce the congestion window unnecessarily.  If the implementation
   supports reordering detection such as [REORDER-DETECT], then the
   sender MAY use a dynamically sized reordering window based on
   min_RTT during loss recovery, instead of a zero reordering window,
   to compensate.

   Extension 3: Optionally size the window with the classic DUPACK
   threshold heuristic.

   The DUPACK threshold approach in the current standards
   [RFC5681][RFC6675] is simple, and for decades has been effective in
   quickly detecting losses, despite the drawbacks discussed earlier.
   RACK can easily retain the DUPACK threshold's advantage of quick
   detection by resetting the reordering window to zero (using
   RACK.reo_wnd = 0) when the DUPACK threshold is met (i.e., when at
   least three packets have been selectively acknowledged).  The
   subtle differences are discussed in the section "RACK and TLP
   discussions".

   The following algorithm includes the basic mechanism and all the
   extensions mentioned above.  Note that individual extensions that
   require additional TCP features (e.g., DSACK) still work if the
   corresponding feature functions simply return false.

   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If RACK_ext_TCP_ACK_has_DSACK_option():
           RACK.dsack = true

       If SND.UNA < RACK.roundtrip_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.roundtrip_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If in loss recovery and not RACK_ext_TCP_seen_reordering():
           RACK.reo_wnd = 0
       Else if RACK_ext_TCP_dupack_threshold_hit():
           RACK.reo_wnd = 0  /* DUPTHRESH emulation mode */
       Else:
           RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
           RACK.reo_wnd = min(RACK.reo_wnd, SRTT)
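   As a worked example of the pseudocode above (with illustrative
   numbers): suppose RACK.min_RTT is 40 ms, so the base reordering
   window is 40 / 4 = 10 ms.  If the sender then receives DSACKs in
   two distinct round trips, RACK.reo_wnd_incr grows from 1 to 3, and
   outside of loss recovery RACK.reo_wnd becomes 3 * 10 ms = 30 ms
   (capped at SRTT).  The inflated window persists for 16 loss
   recoveries, after which RACK.reo_wnd_incr resets to 1 and the
   window returns to 10 ms.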
   Step 4: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.  As a
   starting point, we consider that the reordering window has passed
   if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.  the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

   RACK.xmit_ts >= Packet.xmit_ts
   AND
   (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

   now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

   Packet.xmit_ts + RACK.RTT + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful for detecting losses in a timely manner with
   TCP Segmentation Offload, where multiple packets in one TSO blob
   have identical timestamps.  It is also useful when the timestamp
   clock granularity is close to or longer than the actual round-trip
   time.

   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, in the scoreboard:
           If Packet is already SACKed
              or marked lost and not yet retransmitted:
               Continue

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.RTT +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Mark Packet lost
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through all the packets in the
   SACK scoreboard above could be very costly on large-BDP networks,
   since the amount of data in flight could be very large.  If the
   implementation can organize the scoreboard data structures so that
   packets are sorted by their last (re)transmission time, then the
   loop can start at the least recently sent packet and abort at the
   first packet sent after RACK.xmit_ts.  This can be implemented with
   a separate list sorted in time order.  The implementation inserts a
   packet at the tail of the list when it is (re)transmitted, and
   removes a packet from the list when it is delivered or marked lost.
   We RECOMMEND such an optimization for implementations that support
   high-BDP networks.  The optimization is implemented in Linux and
   yields orders-of-magnitude improvements in CPU usage on high-speed
   WAN networks.
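   The optimization might be sketched as follows (illustrative only;
   "tx_list" and the helper names are examples, not part of this
   specification):

   On (re)transmission of Packet:
       Remove Packet from tx_list, if present
       Append Packet to the tail of tx_list

   RACK_detect_loss_fast():
       For each Packet in tx_list, head (least recently sent) first:
           If NOT RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                                  Packet.xmit_ts, Packet.end_seq):
               Break  /* every later entry was sent after
                         RACK.packet, so none can be marked lost */
           remaining = Packet.xmit_ts + RACK.RTT +
                       RACK.reo_wnd - Now()
           If remaining <= 0:
               Mark Packet lost
               Remove Packet from tx_list
           Else:
               Arm a timer to call RACK_detect_loss_fast()
               after remaining
               Break  /* later entries expire even later */

   Because the list is ordered by transmission time, the scan touches
   only the packets that can actually be marked lost, plus one more.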
   Tail Loss Probe: fast recovery on tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that otherwise
   could be recovered only by RTOs.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.  In either case the goal is to elicit more feedback from
   the receiver, in the form of an ACK (potentially with SACK blocks),
   to allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight is lost.

   2.  Tail losses at the end of an application transaction.

   3.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   4.  Tail losses of ACKs.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

5.3.  Tail Loss Probe: An Example

   Following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.

   3.  When the PTO fires, the sender retransmits segment 10.

   4.  After an RTT, a SACK for packet 10 arrives.  The ACK also
       carries SACK holes for segments 6, 7, 8 and 9.  This triggers
       RACK-based loss recovery.

   5.  The connection enters fast recovery and retransmits the
       remaining lost segments.

5.4.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined
   in [RFC5681].

   RTO: The transport's retransmission timeout (RTO) is based on
   measured round-trip times (RTT) between the sender and receiver, as
   specified in [RFC6298] for TCP.

   PTO: Probe timeout (PTO) is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in [RFC6298].

   Open state: the sender's loss recovery state machine is in its
   normal, default state: there are no SACKed sequence ranges in the
   SACK scoreboard, and neither fast recovery, timeout-based recovery,
   nor ECN-based cwnd reduction are underway.

   The TLP algorithm has three phases, which we discuss in turn.

5.4.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should check whether it should schedule a PTO in two
   situations:

   1.  After transmitting new data.

   2.  Upon receiving an ACK that cumulatively acknowledges data.

   A sender should schedule a PTO only if all of the following
   conditions are met:

   1.  The connection supports SACK [RFC2018].

   2.  The connection is not in loss recovery.

   3.  The connection is either limited by the congestion window (the
       data in flight matches or exceeds the cwnd) or application-
       limited (there is no unsent data that the receiver window
       allows to be sent).

   4.  The most recently transmitted data was not itself a TLP probe
       (i.e., a sender MUST NOT send consecutive or back-to-back TLP
       probes).

   If a PTO cannot be scheduled according to these conditions, then
   the sender MUST arm the RTO timer if there is unacknowledged data
   in flight.

   Step 2: Select the duration of the PTO.

   A sender SHOULD use the following logic to select the duration of a
   PTO:

   TLP_timeout():
       If SRTT is available:
           PTO = 2 * SRTT
           If FlightSize == 1:
               PTO += WCDelAckT
           Else:
               PTO += 2ms
       Else:
           PTO = 1 sec

       If Now() + PTO > TCP_RTO_expire():
           PTO = TCP_RTO_expire() - Now()

   Aiming for a PTO value of 2*SRTT allows a sender to wait long
   enough to know that an ACK is overdue.  Under normal circumstances,
   i.e., no losses, an ACK typically arrives within one SRTT.  But
   choosing PTO to be exactly an SRTT is likely to generate spurious
   probes, given that network delay variance and even end-system
   timings can easily push an ACK above an SRTT.  We chose PTO to be
   the next integral multiple of SRTT.

   Similarly, current end-system processing latencies and timer
   granularities can easily delay ACKs, so senders SHOULD add at least
   2 ms to a computed PTO value (and MAY add more if the sending host
   OS timer granularity is coarser than 1 ms).

   WCDelAckT stands for worst-case delayed ACK timer.  When FlightSize
   is 1, PTO is inflated by WCDelAckT to compensate for a potentially
   long delayed ACK timer at the receiver.  The RECOMMENDED value for
   WCDelAckT is 200 ms.

   Finally, if the time at which an RTO would fire (here denoted
   "TCP_RTO_expire") is sooner than the computed time for the PTO,
   then a probe is scheduled to be sent at that earlier time.
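   As a worked example (illustrative values): with SRTT = 100 ms and
   FlightSize = 3, TLP_timeout() computes PTO = 2 * 100 ms + 2 ms =
   202 ms.  With FlightSize = 1, it instead computes PTO = 200 ms +
   WCDelAckT = 400 ms.  In either case, if the RTO timer would expire
   150 ms from now, the PTO is clamped to 150 ms.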
5.4.2.  Phase 2: Sending a loss probe

   When the PTO fires, transmit a probe data segment:

   TLP_send_probe():
       If a previously unsent segment exists AND
          the receive window allows new data to be sent:
           Transmit that new segment
           FlightSize += SMSS
       Else:
           Retransmit the last segment
           The cwnd remains unchanged

5.4.3.  Phase 3: ACK processing

   On each incoming ACK, the sender should cancel any existing loss
   probe timer.  The sender should then reschedule the loss probe
   timer if the conditions in Step 1 of Phase 1 allow.

5.5.  TLP recovery detection

   If the only loss in an outstanding window of data was the last
   segment, then a TLP loss probe retransmission of that data segment
   might repair the loss.  TLP recovery detection examines ACKs to
   detect when the probe might have repaired a loss, and thus allows
   congestion control to properly reduce the congestion window (cwnd)
   [RFC5681].

   Consider a TLP retransmission episode where a sender retransmits a
   tail packet in a flight.  The TLP retransmission episode ends when
   the sender receives an ACK with a SEG.ACK above the SND.NXT at the
   time the episode started.  During the TLP retransmission episode
   the sender checks for a duplicate ACK or D-SACK indicating that
   both the original segment and the TLP retransmission arrived at the
   receiver, meaning there was no loss that needed repairing.  If the
   TLP sender does not receive such an indication before the end of
   the TLP retransmission episode, then it MUST estimate that either
   the original data segment or the TLP retransmission was lost, and
   congestion control MUST react appropriately to that loss as it
   would to any other loss.

   Since a significant fraction of the hosts that support SACK do not
   support duplicate selective acknowledgments (D-SACKs) [RFC2883],
   the TLP algorithm for detecting such lost segments relies only on
   basic SACK support [RFC2018].

   Definitions of variables

   TLPRxtOut: a boolean indicating whether there is an unacknowledged
   TLP retransmission.

   TLPHighRxt: the value of SND.NXT at the time of sending a TLP
   retransmission.

5.5.1.  Initializing and resetting state

   When a connection is created, or suffers a retransmission timeout,
   or enters fast recovery, it executes the following:

   TLPRxtOut = false

5.5.2.  Recording loss probe states

   Senders must only send a TLP loss probe retransmission if TLPRxtOut
   is false.  This ensures that at any given time a connection has at
   most one outstanding TLP retransmission.  This allows the sender to
   use the algorithm described in this section to estimate whether any
   data segments were lost.

   Note that this condition only restricts TLP loss probes that are
   retransmissions.  There may be an arbitrary number of outstanding
   unacknowledged TLP loss probes that consist of new, previously-
   unsent data, since the retransmission timeout and fast recovery
   algorithms are sufficient to detect losses of such probe segments.

   Upon sending a TLP probe that is a retransmission, the sender sets
   TLPRxtOut to true and TLPHighRxt to SND.NXT.

5.5.3.  Detecting recoveries accomplished by loss probes

   Step 1: Track ACKs indicating receipt of original and retransmitted
   segments.

   A sender considers both the original segment and the TLP probe
   retransmission segment as acknowledged if either condition 1 or
   condition 2 is true:

   1.  This is a duplicate acknowledgment (as defined in [RFC5681],
       section 2), and all of the following conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK == TLPHighRxt

       3.  SEG.ACK == SND.UNA

       4.  the segment contains no SACK blocks for sequence ranges
           above TLPHighRxt

       5.  the segment contains no data

       6.  the segment is not a window update

   2.  This is an ACK acknowledging a sequence number at or above
       TLPHighRxt and it contains a D-SACK; i.e., all of the following
       conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK >= TLPHighRxt

       3.  the ACK contains a D-SACK block

   If either condition is met, then the sender estimates that the
   receiver received both the original data segment and the TLP probe
   retransmission, and so the sender considers the TLP episode to be
   done, and records that fact by setting TLPRxtOut to false.
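   The checks in Step 1 may be summarized in pseudocode as follows (an
   informal sketch consistent with the conditions above;
   TLP_process_ack is an example name, not part of this
   specification):

   TLP_process_ack(ACK):
       If NOT TLPRxtOut:
           Return
       /* Condition 1: duplicate ACK implying the original segment
          also arrived at the receiver */
       If ACK is a duplicate acknowledgment AND
          SEG.ACK == TLPHighRxt AND
          SEG.ACK == SND.UNA AND
          ACK carries no SACK block above TLPHighRxt AND
          ACK carries no data AND
          ACK is not a window update:
           TLPRxtOut = false
       /* Condition 2: D-SACK showing the probe duplicated data the
          receiver already had */
       Else if SEG.ACK >= TLPHighRxt AND
               ACK contains a D-SACK block:
           TLPRxtOut = false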
   Step 2: Mark the end of a TLP retransmission episode and detect
   losses.

   If the sender receives a cumulative ACK for data beyond the TLP
   loss probe retransmission then, in the absence of reordering on the
   return path of ACKs, it should have received any ACKs for the
   original segment and the TLP probe retransmission segment.  At that
   time, if the TLPRxtOut flag is still true, and thus indicates that
   the TLP probe retransmission remains unacknowledged, then the
   sender should presume that at least one of its data segments was
   lost, so it SHOULD invoke a congestion control response equivalent
   to fast recovery.

   More precisely, on each ACK the sender executes the following:

   If TLPRxtOut AND SEG.ACK >= TLPHighRxt:
       TLPRxtOut = false
       EnterRecovery()
       ExitRecovery()

6.  RACK and TLP discussions

6.1.  Advantages

   The biggest advantage of RACK is that every data packet, whether it
   is an original data transmission or a retransmission, can be used
   to detect losses of the packets sent chronologically prior to it.

   Example: TAIL DROP.  Consider a sender that transmits a window of
   three data packets (P1, P2, P3), and P1 and P3 are lost.  Suppose
   the transmission of each packet is at least RACK.reo_wnd (min_RTT /
   4 by default) after the transmission of the previous packet.  RACK
   will mark P1 as lost when the SACK of P2 is received, and this will
   trigger the retransmission of P1 as R1.  When R1 is cumulatively
   acknowledged, RACK will mark P3 as lost and the sender will
   retransmit P3 as R3.
   This example illustrates how RACK is able to repair certain drops
   at the tail of a transaction without any timer.  Notice that
   neither the conventional duplicate ACK threshold [RFC5681], nor
   [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses, because of the required packet or sequence
   count.

   Example: LOST RETRANSMIT.  Consider a window of three data packets
   (P1, P2, P3) that are sent; P1 and P2 are dropped.  Suppose the
   transmission of each packet is at least RACK.reo_wnd (min_RTT / 4
   by default) after the transmission of the previous packet.  When P3
   is SACKed, RACK will mark P1 and P2 lost and they will be
   retransmitted as R1 and R2.  Suppose R1 is lost again but R2 is
   SACKed; RACK will mark R1 lost for retransmission again.  Again,
   neither the conventional three-duplicate-ACK threshold approach,
   nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses.  And such lost retransmissions are very common
   when TCP is being rate-limited, particularly by token bucket
   policers with a large bucket depth and a low rate limit.
   Retransmissions are often lost repeatedly because standard
   congestion control requires multiple round trips to reduce the rate
   below the policed rate.

   Example: SMALL DEGREE OF REORDERING.  Consider a common reordering
   event: a window of packets is sent as (P1, P2, P3).  P1 and P2
   carry a full payload of MSS octets, but P3 has only a 1-octet
   payload.  Suppose the sender has detected reordering previously
   (e.g., by implementing the algorithm in [REORDER-DETECT]) and thus
   RACK.reo_wnd is min_RTT / 4.  Now P3 is reordered and delivered
   first, before P1 and P2.  As long as P1 and P2 are delivered within
   min_RTT / 4, RACK will not consider P1 and P2 lost.  But if P1 and
   P2 are delivered outside the reordering window, then RACK will
   still falsely mark P1 and P2 lost.  We discuss how to reduce such
   false positives at the end of this section.

   The examples above show that RACK is particularly useful when the
   sender is limited by the application, which is common for
   interactive, request/response traffic.  Similarly, RACK still works
   when the sender is limited by the receive window, which is common
   for applications that use the receive window to throttle the
   sender.

   For some implementations (e.g., Linux), RACK works quite
   efficiently with TCP Segmentation Offload (TSO).  RACK always marks
   the entire TSO blob lost because the packets in the same TSO blob
   have the same transmission timestamp.  By contrast, the counting-
   based algorithms (e.g., [RFC3517][RFC5681]) may mark only a subset
   of the packets in the TSO blob lost, forcing the stack to perform
   expensive fragmentation of the TSO blob, or to selectively tag
   individual packets lost in the scoreboard.

6.2.  Disadvantages

   RACK requires the sender to record the transmission time of each
   packet sent at a clock granularity of one millisecond or finer.
   TCP implementations that record this already for RTT estimation do
   not require any new per-packet state.  But implementations that are
   not yet recording packet transmission times will need to add per-
   packet internal state (commonly either 4 or 8 octets per packet or
   TSO blob) to track transmission times.  In contrast, the
   conventional [RFC6675] loss detection approach does not require any
   per-packet state beyond the SACK scoreboard.
   The conventional approach is particularly useful on ultra-low-RTT
   networks where the RTT is far less than the sender's TCP clock
   granularity (e.g., inside data centers).

   RACK can easily and optionally support the conventional approach in
   [RFC6675][RFC5681] by resetting the reordering window to zero when
   the threshold is met.  Note that this approach differs slightly
   from [RFC6675], which considers a packet lost when at least
   DupThresh higher-sequence packets are SACKed.  RACK's approach
   considers a packet lost when at least one higher-sequence packet is
   SACKed and the total number of SACKed packets is at least
   DupThresh.  For example, suppose a connection sends 10 packets, and
   packets 3, 5, 7 are SACKed.  [RFC6675] considers packets 1 and 2
   lost.  RACK considers packets 1, 2, 4, 6 lost.

6.3.  Adjusting the reordering window

   When the sender detects packet reordering, RACK uses a reordering
   window of min_RTT / 4.  It uses the minimum RTT to accommodate
   reordering introduced by packets traversing slightly different
   paths (e.g., router-based parallelism schemes) or out-of-order
   deliveries in the lower link layer (e.g., wireless links using
   link-layer retransmission).  RACK uses a quarter of the minimum RTT
   because Linux TCP used the same factor in its implementation to
   delay Early Retransmit [RFC5827] to reduce spurious loss detections
   in the presence of reordering, and experience shows that this works
   reasonably well.  We have evaluated using the smoothed RTT (SRTT
   from [RFC6298] RTT estimation) or the most recently measured RTT
   (RACK.RTT) in an experiment similar to that in the Performance
   Evaluation section.  They do not make any significant difference in
   terms of total recovery latency.

6.4.  Relationships with other loss recovery algorithms

   The primary motivation of RACK is to ultimately provide a simple
   and general replacement for some of the standard loss recovery
   algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some
   nonstandard ones [FACK][THIN-STREAM].  While RACK can be a
   supplemental loss detection mechanism on top of these algorithms,
   this is not necessary, because RACK implicitly subsumes most of
   them.

   [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate
   ACK threshold based on the current or previous flight sizes.  RACK
   takes a different approach, using only one ACK event and a
   reordering window.  RACK can be seen as an extended Early
   Retransmit [RFC5827] without a FlightSize limit but with an
   additional reordering window.  [FACK] considers an original packet
   to be lost when its sequence range is sufficiently far below the
   highest SACKed sequence.  In some sense RACK can be seen as a
   generalized form of FACK that operates in time space instead of
   sequence space, enabling it to better handle reordering,
   application-limited traffic, and lost retransmissions.

   Nevertheless, RACK is still an experimental algorithm, whereas the
   oldest loss detection algorithm, the three-duplicate-ACK threshold
   [RFC5681], has been standardized and widely deployed.  RACK can
   easily and optionally support the conventional approach for
   compatibility.

   RACK is compatible with, and does not interfere with, the standard
   RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel
   algorithms [RFC3522].  This is because RACK only detects loss by
   using ACK events.
   It neither changes the RTO timer calculation nor detects spurious
   timeouts.

   Furthermore, RACK naturally works well with Tail Loss Probe [TLP],
   because a tail loss probe solicits either an ACK or a SACK, which
   can be used by RACK to detect more losses.  RACK can be used to
   relax TLP's requirement for using FACK and retransmitting the
   highest-sequenced packet, because RACK is agnostic to packet
   sequence numbers and uses transmission time instead.  Thus TLP
   could be modified to retransmit the first unacknowledged packet,
   which could improve application latency.

6.5.  Interaction with congestion control

   RACK intentionally decouples loss detection from congestion
   control.  RACK only detects losses; it does not modify the
   congestion control algorithm [RFC5681][RFC6937].  However, RACK may
   detect losses earlier or later than the conventional duplicate ACK
   threshold approach does.  A packet marked lost by RACK SHOULD NOT
   be retransmitted until congestion control deems this appropriate.
   Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used
   when using RACK.

   RACK is applicable for both fast recovery and recovery after a
   retransmission timeout (RTO) in [RFC5681].  RACK applies equally to
   fast recovery and RTO recovery because RACK is purely based on the
   transmission time order of packets.  When a packet retransmitted by
   RTO is acknowledged, RACK will mark any unacked packet sent
   sufficiently prior to the RTO as lost, because at least one RTT has
   elapsed since these packets were sent.

   The following simple example compares how RACK and non-RACK loss
   detection interact with congestion control: suppose a TCP sender
   has a congestion window (cwnd) of 20 packets on a SACK-enabled
   connection.  It sends 10 data packets and all of them are lost.

   Without RACK, the sender would time out, reset cwnd to 1, and
   retransmit the first packet.  It would take four round trips (1 + 2
   + 4 + 3 = 10) to retransmit all the 10 lost packets using slow
   start.  The recovery latency would be RTO + 4*RTT, with an ending
   cwnd of 4 packets due to congestion window validation.

   With RACK, a sender would send the TLP after 2*RTT and get a
   DUPACK.  If the sender implements Proportional Rate Reduction
   [RFC6937] it would slow start to retransmit the remaining 9 lost
   packets, since the number of packets in flight (0) is lower than
   the slow start threshold (10).  The slow start would again take
   four round trips (1 + 2 + 4 + 3 = 10).  The recovery latency would
   be 2*RTT + 4*RTT, with an ending cwnd set to the slow start
   threshold of 10 packets.

   In both cases, the sender after the recovery would be in congestion
   avoidance.  The difference in recovery latency (RTO + 4*RTT vs
   6*RTT) can be significant if the RTT is much smaller than the
   minimum RTO (1 second in RFC6298) or if the RTT is large.  The
   former case is common in local area networks, data-center networks,
   or content distribution networks with deep deployments.  The latter
   case is more common in developing regions with highly congested
   and/or high-latency networks.  The ending congestion window after
   recovery also impacts subsequent data transfers.

6.6.  TLP recovery detection with delayed ACKs

   Delayed ACKs complicate the detection of repairs done by TLP, since
   with a delayed ACK the sender receives one fewer ACK than would
   normally be expected.
   To mitigate this complication, before sending a TLP loss probe
   retransmission, the sender should attempt to wait long enough that
   the receiver has sent any delayed ACKs that it is withholding.  The
   sender algorithm described above features such a delay, in the form
   of WCDelAckT.  Furthermore, if the receiver supports duplicate
   selective acknowledgments (D-SACKs) [RFC2883], then in the case of
   a delayed ACK the sender's TLP recovery detection algorithm (see
   above) can use the D-SACK information to infer that the original
   and TLP retransmission both arrived at the receiver.

   If there is ACK loss or a delayed ACK without a D-SACK, then this
   algorithm is conservative, because the sender will reduce cwnd when
   in fact there was no packet loss.  In practice this is acceptable,
   and potentially even desirable: if there is reverse path congestion
   then reducing cwnd can be prudent.

6.7.  RACK for other transport protocols

   RACK can be implemented in other transport protocols.  The
   algorithm can be simplified by skipping step 3 if the protocol can
   support a unique transmission or packet identifier (e.g., TCP echo
   options).  For example, the QUIC protocol implements RACK
   [QUIC-LR].

7.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, both for connections to
   users in the Internet and internally.  We conducted a performance
   evaluation experiment for RACK and TLP on a small set of Google Web
   servers in Western Europe that serve mostly European and some
   African countries.  The experiment lasted three days in March 2017.
   The servers were divided evenly into four groups of roughly 5.3
   million flows each:

   Group 1 (control): RACK off, TLP off, RFC 3517 on

   Group 2: RACK on, TLP off, RFC 3517 on

   Group 3: RACK on, TLP on, RFC 3517 on

   Group 4: RACK on, TLP on, RFC 3517 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.  In terms
   of specific recovery features, all groups enabled RFC5682 (F-RTO)
   but disabled FACK, because FACK is not an IETF RFC and the goal of
   this setup is to compare RACK and TLP with RFC-based loss
   recoveries.  Since TLP depends on either FACK or RACK, we could not
   run another group that enables TLP only (with both RACK and FACK
   disabled).  Group 4 tests whether RACK plus TLP can completely
   replace the DupThresh-based loss detection of [RFC3517].

   The servers sit behind a load balancer that distributes the
   connections evenly across the four groups.

   Each group handles a similar number of connections and sends and
   receives similar amounts of data.  We compare the total time spent
   in loss recovery across groups.  The recovery time is measured from
   when the recovery and retransmission start, until the remote host
   has acknowledged the highest sequence (SND.NXT) at the time the
   recovery started.  Therefore the recovery time includes both fast
   recoveries and timeout recoveries.

   Our data shows that Group 2 recovery latency is only 0.3% lower
   than the Group 1 recovery latency.  But Group 3 recovery latency is
   25% lower than Group 1, due to a 40% reduction in RTO-triggered
   recoveries.  Therefore it is important to implement both TLP and
   RACK for performance.
   Group 4's total recovery latency is 0.02% lower than Group 3's,
   indicating that RACK plus TLP can successfully replace RFC3517 as a
   standalone recovery mechanism.

   We want to emphasize that the current experiment is limited in
   terms of network coverage.  The connectivity in Western Europe is
   fairly good, so loss recovery is not a major performance bottleneck
   there.  We plan to expand our experiments to regions with worse
   connectivity, in particular to networks with strong traffic
   policing.

8.  Security Considerations

   RACK does not change the risk profile for TCP.

   An interesting scenario is ACK-splitting attacks [SCWA99]: for an
   MSS-size packet sent, the receiver or the attacker might send MSS
   separate ACKs that each SACK or acknowledge one additional byte.
   This would not fool RACK.  RACK.xmit_ts would not advance, because
   all the sequences of the packet were transmitted at the same time
   (and so carry the same transmission timestamp).  In other words,
   SACKing only one byte of a packet or SACKing the packet in its
   entirety has the same effect on RACK.

9.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as
   an RFC.

10.  Acknowledgments

   The authors thank Matt Mathis for his insights on FACK and Michael
   Welzl for his per-packet timer idea that inspired this work.  Eric
   Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
   Iyengar, and Hiren Panchasara contributed to the draft and to the
   implementations in Linux, FreeBSD and QUIC.

11.  References

11.1.  Normative References

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
              Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC4737]  Morton, A., Ciavattone, L., Ramachandran, G., Shalunov,
              S., and J. Perser, "Packet Reordering Metrics",
              RFC 4737, November 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for
              Detecting Spurious Retransmission Timeouts with TCP",
              RFC 5682, September 2009.

   [RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
              Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, April 2010.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
              M., and Y. Nishida, "A Conservative Loss Recovery
              Algorithm Based on Selective Acknowledgment (SACK) for
              TCP", RFC 6675, August 2012.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, May 2013.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              RFC 7323, September 2014.

   [RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
              September 1981.

11.2.  Informative References

   [FACK]     Mathis, M. and J. Mahdavi, "Forward acknowledgement:
              refining TCP congestion control", ACM SIGCOMM Computer
              Communication Review, Volume 26, Issue 4, October 1996.
Jamshid, "Forward acknowledgement: 1097 refining TCP congestion control", ACM SIGCOMM Computer 1098 Communication Review, Volume 26, Issue 4, Oct. 1996. , 1099 1996. 1101 [POLICER16] 1102 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1103 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1104 Analysis of Traffic Policing in the Web", ACM SIGCOMM , 1105 2016. 1107 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1108 Congestion Control", draft-tsvwg-quic-loss-recovery-01 1109 (work in progress), June 2016. 1111 [REORDER-DETECT] 1112 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 1113 "Detection and Quantification of Packet Reordering with 1114 TCP", draft-zimmermann-tcpm-reordering-detection-02 (work 1115 in progress), November 2014. 1117 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1118 and SCTP RTO Restart", February 2016. 1120 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1121 "TCP Congestion Control With a Misbehaving Receiver", ACM 1122 Computer Communication Review, 29(5) , 1999. 1124 [THIN-STREAM] 1125 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1126 "TCP enhancements for interactive thin-stream 1127 applications", NOSSDAV , 2008. 1129 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1130 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1131 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1132 in progress), August 2013. 1134 Authors' Addresses 1136 Yuchung Cheng 1137 Google, Inc 1138 1600 Amphitheater Parkway 1139 Mountain View, California 94043 1140 USA 1142 Email: ycheng@google.com 1144 Neal Cardwell 1145 Google, Inc 1146 76 Ninth Avenue 1147 New York, NY 10011 1148 USA 1150 Email: ncardwell@google.com 1152 Nandita Dukkipati 1153 Google, Inc 1154 1600 Amphitheater Parkway 1155 Mountain View, California 94043 1157 Email: nanditad@google.com 1159 Priyaranjan Jha 1160 Google, Inc 1161 1600 Amphitheater Parkway 1162 Mountain View, California 94043 1164 Email: priyarjha@google.com