TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: September 6, 2018                                        P. Jha
                                                             Google, Inc
                                                           March 5, 2018

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-03

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").  RACK uses the notion of time,
   instead of packet or sequence counts, to detect losses, for modern
   TCP implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as
   well as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 6, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if some packet sent
   sufficiently later has been delivered.  It does this by recording
   packet transmission times and inferring losses using cumulative
   acknowledgments or selective acknowledgment (SACK) TCP options.

   In recent years we have observed several increasingly common loss
   and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK) or
       routers' internal load balancing can deliver TCP packets out of
       order.  The degree of such reordering is usually within the
       order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard
   and proposed loss detection algorithms
   [RFC3517][RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-
   STREAM][TLP], we have found that together they do not perform
   well.  The main reason is that many of them are based on the
   classic rule of counting duplicate acknowledgments [RFC5681].  They
   can detect losses either quickly or accurately, but not both,
   especially when the sender is application-limited or when
   reordering is unpredictable.  And under these conditions none of
   them can detect lost retransmissions well.

   Also, these algorithms, including RFCs, rarely address their
   interactions with one another.  For example, FACK may consider a
   packet lost while RFC3517 may not.  Implementing N algorithms while
   dealing with N^2 interactions is a daunting and error-prone task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one simpler, and
   also more effective, algorithm.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before it were
   either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC3517][FACK].  But the key innovation in
   RACK is to use per-packet transmission timestamps and the widely
   deployed SACK option to conduct time-based inferences, instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems for
   packet-counting approaches [RFC5681][RFC3517][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge whether some packets sent prior
   to that time have "expired" by passing a certain reordering
   settling window.  On each ACK, RACK marks any already-expired
   packets lost, and for any packets that have not yet expired it
   waits until the reordering window passes and then marks those lost
   as well.  In either case, RACK can repair the loss without waiting
   for a (long) RTO.  RACK can be applied to both fast recovery and
   timeout recovery, and can detect losses on both originally
   transmitted and retransmitted packets, making it an all-weather
   loss detection mechanism.
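   The following sketch captures the essence of this time-based
   inference (illustrative only; the normative rules, including how
   RACK.xmit_ts and RACK.reo_wnd are maintained, are specified in
   Section 5):

   RACK_deems_lost(Packet):  /* conceptual sketch, not normative */
       /* Some packet sent after this one has been delivered, and
          the reordering settling window has already passed. */
       Return RACK.xmit_ts > Packet.xmit_ts + RACK.reo_wnd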
3.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not required but helpful.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g., intra-
       datacenter communications) microsecond granularity would
       significantly reduce detection latency, but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explanation, we use a pseudo-scoreboard
   that manages the data in sequence number ranges.  But the specifics
   of the data structure are left to the implementor.

   RACK requires no changes at the receiver.
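   For illustration, the per-packet state implied by requirements 2
   and 3 could be kept alongside each scoreboard entry, along the
   following lines (an informal sketch; the field names and layout are
   examples, not part of this specification):

   Scoreboard entry (one per packet or sequence number range):
       Packet.start_seq      /* first sequence number in the range */
       Packet.end_seq        /* ending sequence number */
       Packet.xmit_ts        /* last (re)transmission time, stored at
                                millisecond granularity or finer */
       Packet.retransmitted  /* true if the range was ever
                                retransmitted */
       Packet.sacked         /* true if the range has been SACKed */
       Packet.lost           /* true if RACK has marked it lost */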
4.  Definitions of variables

   A sender needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet" is, among all the packets that have been either
   selectively or cumulatively acknowledged, the one that was sent
   most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.RTT" is the RTT measured when RACK.xmit_ts, above, was last
   changed.  It is the RTT of the most recently transmitted packet
   that has been delivered (either cumulatively or selectively
   acknowledged) on the connection.

   "RACK.reo_wnd" is a reordering window for the connection, computed
   in the unit of time used for recording packet transmission times.
   It is used to defer the moment at which RACK marks a packet lost.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.

   "RACK.reo_wnd_incr" is the multiplier applied to adjust
   RACK.reo_wnd.

   "RACK.reo_wnd_persist" is the number of loss recoveries before
   resetting RACK.reo_wnd.

   "RACK.dsack" indicates whether RACK.reo_wnd has been adjusted upon
   receiving a DSACK option.

   "RACK.roundtrip_seq" is the value of SND.NXT recorded when the
   reordering window was last adjusted upon receiving a DSACK; it is
   used to limit that adjustment to once per round trip.

   Note that the Packet.xmit_ts variable is per packet in flight.  The
   RACK.xmit_ts, RACK.end_seq, RACK.RTT, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

5.  Algorithm Details

5.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

5.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.  The sender can
   track a simple global minimum of all RTT measurements from the
   connection, or a windowed min-filtered value of recent RTT
   measurements.  This document does not specify an exact approach.
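   For example, a sender might implement Step 1 as follows (an
   informal sketch; the choice between a global minimum and a windowed
   minimum, and the filter window length, are implementation
   decisions, not part of this specification):

   RACK_update_min_RTT(rtt):
       If using a simple global minimum:
           RACK.min_RTT = min(RACK.min_RTT, rtt)
       Else:  /* windowed min filter over recent measurements */
           Discard samples older than the filter window
           Insert rtt into the sample window
           RACK.min_RTT = minimum of the samples in the window

   A windowed minimum adapts faster when the path RTT increases
   permanently (e.g., after a route change), at the cost of keeping a
   few extra samples per connection.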
   Step 2: Update RACK stats.

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record the
   most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.  Sometimes RACK.packet and Packet could carry the
   same transmit timestamp due to clock granularity or segmentation
   offloading (i.e., the two packets were sent as a jumbo frame into
   the NIC).  In that case the sequence numbers RACK.end_seq and
   Packet.end_seq are compared to break the tie.

   Since an ACK can also acknowledge retransmitted data packets,
   RACK.RTT can be vastly underestimated if the retransmission was
   spurious.  To avoid that, ignore a packet if any of its TCP
   sequences have been retransmitted before and either of two
   conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_RTT ago.
       While it is still possible the packet was spuriously
       retransmitted because of a recent RTT decrease, our experience
       suggests this is a reasonable heuristic.

   If the ACK is not ignored as invalid, update RACK.RTT to the RTT
   sample calculated using this ACK, and continue.  If this ACK or
   SACK was for the most recently sent packet, then record the
   RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this
   ACK.  Otherwise exit here and omit the following steps.

   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet has been retransmitted:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Continue
               If rtt < RACK.min_RTT:
                   Continue

           RACK.RTT = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts
               RACK.end_seq = Packet.end_seq
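   As a concrete illustration of the tie-breaking rule (the values are
   examples only): suppose two segments ending at sequence numbers
   2000 and 3000 were sent in one TSO blob and therefore share the
   transmit timestamp t.  If the segment ending at 3000 is delivered
   first:

   RACK_sent_after(t, 3000, t, 2000) => true   /* advances RACK.packet */
   RACK_sent_after(t, 2000, t, 3000) => false  /* a later SACK of the
                                                  lower range does not
                                                  move RACK.packet
                                                  backward */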
   Step 3: Update RACK reordering window.

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  Use a conservative window of min_RTT / 4 if the connection
   is not currently in loss recovery.  When in loss recovery, use a
   RACK.reo_wnd of zero in order to retransmit quickly.

   Extension 1: Optionally size the window based on DSACK.

   Further, the sender MAY leverage DSACK [RFC3708] to adapt the
   reordering window to higher degrees of reordering.  Receiving an
   ACK with a DSACK indicates a spurious retransmission, which in turn
   suggests that the RACK reordering window, RACK.reo_wnd, is likely
   too small.  The sender MAY increase the RACK.reo_wnd window
   linearly for every round trip in which the sender receives a DSACK,
   so that after N distinct round trips in which a DSACK is received,
   the RACK.reo_wnd is N * min_RTT / 4.  The inflated RACK.reo_wnd
   would persist for 16 loss recoveries and then reset to its starting
   value, min_RTT / 4.

   Extension 2: Optionally size the window if reordering has been
   observed.

   If the reordering window is too small or the connection does not
   support DSACK, then RACK can trigger spurious loss recoveries and
   reduce the congestion window unnecessarily.  If the implementation
   supports reordering detection such as [REORDER-DETECT], then the
   sender MAY use a dynamically sized reordering window based on
   min_RTT during loss recovery, instead of a zero reordering window,
   to compensate.

   Extension 3: Optionally size the window with the classic DUPACK
   threshold heuristic.

   The DUPACK threshold approach in the current standards
   [RFC5681][RFC6675] is simple, and for decades has been effective in
   quickly detecting losses, despite the drawbacks discussed earlier.
   RACK can easily retain the DUPACK threshold's advantage of quick
   detection by resetting the reordering window to zero (using
   RACK.reo_wnd = 0) when the DUPACK threshold is met (i.e., when at
   least three packets have been selectively acknowledged).  The
   subtle differences are discussed in the section "RACK and TLP
   discussions".

   The following algorithm includes the basic mechanism and all the
   extensions mentioned above.  Note that individual extensions that
   require additional TCP features (e.g., DSACK) still work if the
   corresponding feature functions simply return false.

   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If RACK_ext_TCP_ACK_has_DSACK_option():
           RACK.dsack = true

       If SND.UNA < RACK.roundtrip_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.roundtrip_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If in loss recovery and not RACK_ext_TCP_seen_reordering():
           RACK.reo_wnd = 0
       Else if RACK_ext_TCP_dupack_threshold_hit():
           RACK.reo_wnd = 0  /* DUPTHRESH emulation mode */
       Else:
           RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
           RACK.reo_wnd = min(RACK.reo_wnd, SRTT)
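   As a worked example of the pseudocode above (with illustrative
   numbers): suppose RACK.min_RTT is 40 ms, so the base reordering
   window is 40 / 4 = 10 ms.  If the sender then receives DSACKs in
   two distinct round trips, RACK.reo_wnd_incr grows from 1 to 3, and
   outside of loss recovery RACK.reo_wnd becomes 3 * 10 ms = 30 ms
   (capped at SRTT).  The inflated window persists for 16 loss
   recoveries, after which RACK.reo_wnd_incr resets to 1 and the
   window returns to 10 ms.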
   Step 4: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.  As a
   starting point, we consider that the reordering window has passed
   if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.  the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

   RACK.xmit_ts >= Packet.xmit_ts
   AND
   (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

   now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

   Packet.xmit_ts + RACK.RTT + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful for detecting losses in a timely manner with
   TCP Segmentation Offload, where multiple packets in one TSO blob
   have identical timestamps.  It is also useful when the timestamp
   clock granularity is close to or longer than the actual round-trip
   time.

   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, in the scoreboard:
           If Packet is already SACKed
              or marked lost and not yet retransmitted:
               Continue

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.RTT +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Mark Packet lost
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through all the packets in the
   SACK scoreboard above could be very costly on large-BDP networks,
   since the amount of data in flight could be very large.  If the
   implementation can organize the scoreboard data structures so that
   packets are sorted by their last (re)transmission time, then the
   loop can start at the least recently sent packet and abort at the
   first packet sent after RACK.xmit_ts.  This can be implemented with
   a separate list sorted in time order.  The implementation inserts a
   packet at the tail of the list when it is (re)transmitted, and
   removes a packet from the list when it is delivered or marked lost.
   We RECOMMEND such an optimization for implementations that support
   high-BDP networks.  The optimization is implemented in Linux and
   yields orders-of-magnitude improvements in CPU usage on high-speed
   WAN networks.
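   The optimization might be sketched as follows (illustrative only;
   "tx_list" and the helper names are examples, not part of this
   specification):

   On (re)transmission of Packet:
       Remove Packet from tx_list, if present
       Append Packet to the tail of tx_list

   RACK_detect_loss_fast():
       For each Packet in tx_list, head (least recently sent) first:
           If NOT RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                                  Packet.xmit_ts, Packet.end_seq):
               Break  /* every later entry was sent after
                         RACK.packet, so none can be marked lost */
           remaining = Packet.xmit_ts + RACK.RTT +
                       RACK.reo_wnd - Now()
           If remaining <= 0:
               Mark Packet lost
               Remove Packet from tx_list
           Else:
               Arm a timer to call RACK_detect_loss_fast()
               after remaining
               Break  /* later entries expire even later */

   Because the list is ordered by transmission time, the scan touches
   only the packets that can actually be marked lost, plus one more.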
   Tail Loss Probe: fast recovery on tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that otherwise
   could be recovered only by RTOs.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.  In either case the goal is to elicit more feedback from
   the receiver, in the form of an ACK (potentially with SACK blocks),
   to allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight is lost.

   2.  Tail losses at the end of an application transaction.

   3.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   4.  Tail losses of ACKs.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

5.3.  Tail Loss Probe: An Example

   Following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.

   3.  When the PTO fires, the sender retransmits segment 10.

   4.  After an RTT, a SACK for packet 10 arrives.  The ACK also
       carries SACK holes for segments 6, 7, 8 and 9.  This triggers
       RACK-based loss recovery.

   5.  The connection enters fast recovery and retransmits the
       remaining lost segments.

5.4.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined
   in [RFC5681].

   RTO: The transport's retransmission timeout (RTO) is based on
   measured round-trip times (RTT) between the sender and receiver, as
   specified in [RFC6298] for TCP.

   PTO: Probe timeout (PTO) is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in [RFC6298].

   Open state: the sender's loss recovery state machine is in its
   normal, default state: there are no SACKed sequence ranges in the
   SACK scoreboard, and neither fast recovery, timeout-based recovery,
   nor ECN-based cwnd reduction are underway.

   The TLP algorithm has three phases, which we discuss in turn.

5.4.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should check whether it should schedule a PTO in two
   situations:

   1.  After transmitting new data.

   2.  Upon receiving an ACK that cumulatively acknowledges data.

   A sender should schedule a PTO only if all of the following
   conditions are met:

   1.  The connection supports SACK [RFC2018].

   2.  The connection is not in loss recovery.

   3.  The connection is either limited by the congestion window (the
       data in flight matches or exceeds the cwnd) or application-
       limited (there is no unsent data that the receiver window
       allows to be sent).

   4.  The most recently transmitted data was not itself a TLP probe
       (i.e., a sender MUST NOT send consecutive or back-to-back TLP
       probes).

   If a PTO cannot be scheduled according to these conditions, then
   the sender MUST arm the RTO timer if there is unacknowledged data
   in flight.

   Step 2: Select the duration of the PTO.

   A sender SHOULD use the following logic to select the duration of a
   PTO:

   TLP_timeout():
       If SRTT is available:
           PTO = 2 * SRTT
           If FlightSize == 1:
               PTO += WCDelAckT
           Else:
               PTO += 2ms
       Else:
           PTO = 1 sec

       If Now() + PTO > TCP_RTO_expire():
           PTO = TCP_RTO_expire() - Now()

   Aiming for a PTO value of 2*SRTT allows a sender to wait long
   enough to know that an ACK is overdue.  Under normal circumstances,
   i.e., no losses, an ACK typically arrives within one SRTT.  But
   choosing PTO to be exactly an SRTT is likely to generate spurious
   probes, given that network delay variance and even end-system
   timings can easily push an ACK above an SRTT.  We chose PTO to be
   the next integral multiple of SRTT.

   Similarly, current end-system processing latencies and timer
   granularities can easily delay ACKs, so senders SHOULD add at least
   2 ms to a computed PTO value (and MAY add more if the sending host
   OS timer granularity is coarser than 1 ms).

   WCDelAckT stands for worst-case delayed ACK timer.  When FlightSize
   is 1, PTO is inflated by WCDelAckT to compensate for a potentially
   long delayed ACK timer at the receiver.  The RECOMMENDED value for
   WCDelAckT is 200 ms.

   Finally, if the time at which an RTO would fire (here denoted
   "TCP_RTO_expire") is sooner than the computed time for the PTO,
   then a probe is scheduled to be sent at that earlier time.
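   As a worked example (illustrative values): with SRTT = 100 ms and
   FlightSize = 3, TLP_timeout() computes PTO = 2 * 100 ms + 2 ms =
   202 ms.  With FlightSize = 1, it instead computes PTO = 200 ms +
   WCDelAckT = 400 ms.  In either case, if the RTO timer would expire
   150 ms from now, the PTO is clamped to 150 ms.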
5.4.2.  Phase 2: Sending a loss probe

   When the PTO fires, transmit a probe data segment:

   TLP_send_probe():
       If a previously unsent segment exists AND
          the receive window allows new data to be sent:
           Transmit that new segment
           FlightSize += SMSS
       Else:
           Retransmit the last segment
           The cwnd remains unchanged

5.4.3.  Phase 3: ACK processing

   On each incoming ACK, the sender should cancel any existing loss
   probe timer.  The sender should then reschedule the loss probe
   timer if the conditions in Step 1 of Phase 1 allow.

5.5.  TLP recovery detection

   If the only loss in an outstanding window of data was the last
   segment, then a TLP loss probe retransmission of that data segment
   might repair the loss.  TLP recovery detection examines ACKs to
   detect when the probe might have repaired a loss, and thus allows
   congestion control to properly reduce the congestion window (cwnd)
   [RFC5681].

   Consider a TLP retransmission episode where a sender retransmits a
   tail packet in a flight.  The TLP retransmission episode ends when
   the sender receives an ACK with a SEG.ACK above the SND.NXT at the
   time the episode started.  During the TLP retransmission episode
   the sender checks for a duplicate ACK or D-SACK indicating that
   both the original segment and the TLP retransmission arrived at the
   receiver, meaning there was no loss that needed repairing.  If the
   TLP sender does not receive such an indication before the end of
   the TLP retransmission episode, then it MUST estimate that either
   the original data segment or the TLP retransmission was lost, and
   congestion control MUST react appropriately to that loss as it
   would to any other loss.

   Since a significant fraction of the hosts that support SACK do not
   support duplicate selective acknowledgments (D-SACKs) [RFC2883],
   the TLP algorithm for detecting such lost segments relies only on
   basic SACK support [RFC2018].

   Definitions of variables

   TLPRxtOut: a boolean indicating whether there is an unacknowledged
   TLP retransmission.

   TLPHighRxt: the value of SND.NXT at the time of sending a TLP
   retransmission.

5.5.1.  Initializing and resetting state

   When a connection is created, or suffers a retransmission timeout,
   or enters fast recovery, it executes the following:

   TLPRxtOut = false

5.5.2.  Recording loss probe states

   Senders must only send a TLP loss probe retransmission if TLPRxtOut
   is false.  This ensures that at any given time a connection has at
   most one outstanding TLP retransmission.  This allows the sender to
   use the algorithm described in this section to estimate whether any
   data segments were lost.

   Note that this condition only restricts TLP loss probes that are
   retransmissions.  There may be an arbitrary number of outstanding
   unacknowledged TLP loss probes that consist of new, previously-
   unsent data, since the retransmission timeout and fast recovery
   algorithms are sufficient to detect losses of such probe segments.

   Upon sending a TLP probe that is a retransmission, the sender sets
   TLPRxtOut to true and TLPHighRxt to SND.NXT.

5.5.3.  Detecting recoveries accomplished by loss probes

   Step 1: Track ACKs indicating receipt of original and retransmitted
   segments.

   A sender considers both the original segment and the TLP probe
   retransmission segment as acknowledged if either condition 1 or
   condition 2 is true:

   1.  This is a duplicate acknowledgment (as defined in [RFC5681],
       section 2), and all of the following conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK == TLPHighRxt

       3.  SEG.ACK == SND.UNA

       4.  the segment contains no SACK blocks for sequence ranges
           above TLPHighRxt

       5.  the segment contains no data

       6.  the segment is not a window update

   2.  This is an ACK acknowledging a sequence number at or above
       TLPHighRxt and it contains a D-SACK; i.e., all of the following
       conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK >= TLPHighRxt

       3.  the ACK contains a D-SACK block

   If either condition is met, then the sender estimates that the
   receiver received both the original data segment and the TLP probe
   retransmission, and so the sender considers the TLP episode to be
   done, and records that fact by setting TLPRxtOut to false.
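   The checks in Step 1 may be summarized in pseudocode as follows (an
   informal sketch consistent with the conditions above;
   TLP_process_ack is an example name, not part of this
   specification):

   TLP_process_ack(ACK):
       If NOT TLPRxtOut:
           Return
       /* Condition 1: duplicate ACK implying the original segment
          also arrived at the receiver */
       If ACK is a duplicate acknowledgment AND
          SEG.ACK == TLPHighRxt AND
          SEG.ACK == SND.UNA AND
          ACK carries no SACK block above TLPHighRxt AND
          ACK carries no data AND
          ACK is not a window update:
           TLPRxtOut = false
       /* Condition 2: D-SACK showing the probe duplicated data the
          receiver already had */
       Else if SEG.ACK >= TLPHighRxt AND
               ACK contains a D-SACK block:
           TLPRxtOut = false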
   Step 2: Mark the end of a TLP retransmission episode and detect
   losses.

   If the sender receives a cumulative ACK for data beyond the TLP
   loss probe retransmission then, in the absence of reordering on the
   return path of ACKs, it should have received any ACKs for the
   original segment and the TLP probe retransmission segment.  At that
   time, if the TLPRxtOut flag is still true, and thus indicates that
   the TLP probe retransmission remains unacknowledged, then the
   sender should presume that at least one of its data segments was
   lost, so it SHOULD invoke a congestion control response equivalent
   to fast recovery.

   More precisely, on each ACK the sender executes the following:

   If TLPRxtOut AND SEG.ACK >= TLPHighRxt:
       TLPRxtOut = false
       EnterRecovery()
       ExitRecovery()

6.  RACK and TLP discussions

6.1.  Advantages

   The biggest advantage of RACK is that every data packet, whether it
   is an original data transmission or a retransmission, can be used
   to detect losses of the packets sent chronologically prior to it.

   Example: TAIL DROP.  Consider a sender that transmits a window of
   three data packets (P1, P2, P3), and P1 and P3 are lost.  Suppose
   the transmission of each packet is at least RACK.reo_wnd (min_RTT /
   4 by default) after the transmission of the previous packet.  RACK
   will mark P1 as lost when the SACK of P2 is received, and this will
   trigger the retransmission of P1 as R1.  When R1 is cumulatively
   acknowledged, RACK will mark P3 as lost and the sender will
   retransmit P3 as R3.
   This example illustrates how RACK is able to repair certain drops
   at the tail of a transaction without any timer.  Notice that
   neither the conventional duplicate ACK threshold [RFC5681], nor
   [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses, because of the required packet or sequence
   count.

   Example: LOST RETRANSMIT.  Consider a window of three data packets
   (P1, P2, P3) that are sent; P1 and P2 are dropped.  Suppose the
   transmission of each packet is at least RACK.reo_wnd (min_RTT / 4
   by default) after the transmission of the previous packet.  When P3
   is SACKed, RACK will mark P1 and P2 lost and they will be
   retransmitted as R1 and R2.  Suppose R1 is lost again but R2 is
   SACKed; RACK will mark R1 lost for retransmission again.  Again,
   neither the conventional three-duplicate-ACK threshold approach,
   nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses.  And such lost retransmissions are very common
   when TCP is being rate-limited, particularly by token bucket
   policers with a large bucket depth and a low rate limit.
   Retransmissions are often lost repeatedly because standard
   congestion control requires multiple round trips to reduce the rate
   below the policed rate.

   Example: SMALL DEGREE OF REORDERING.  Consider a common reordering
   event: a window of packets is sent as (P1, P2, P3).  P1 and P2
   carry a full payload of MSS octets, but P3 has only a 1-octet
   payload.  Suppose the sender has detected reordering previously
   (e.g., by implementing the algorithm in [REORDER-DETECT]) and thus
   RACK.reo_wnd is min_RTT / 4.  Now P3 is reordered and delivered
   first, before P1 and P2.  As long as P1 and P2 are delivered within
   min_RTT / 4, RACK will not consider P1 and P2 lost.  But if P1 and
   P2 are delivered outside the reordering window, then RACK will
   still falsely mark P1 and P2 lost.  We discuss how to reduce such
   false positives at the end of this section.

   The examples above show that RACK is particularly useful when the
   sender is limited by the application, which is common for
   interactive, request/response traffic.  Similarly, RACK still works
   when the sender is limited by the receive window, which is common
   for applications that use the receive window to throttle the
   sender.

   For some implementations (e.g., Linux), RACK works quite
   efficiently with TCP Segmentation Offload (TSO).  RACK always marks
   the entire TSO blob lost because the packets in the same TSO blob
   have the same transmission timestamp.  By contrast, the counting-
   based algorithms (e.g., [RFC3517][RFC5681]) may mark only a subset
   of the packets in the TSO blob lost, forcing the stack to perform
   expensive fragmentation of the TSO blob, or to selectively tag
   individual packets lost in the scoreboard.

6.2.  Disadvantages

   RACK requires the sender to record the transmission time of each
   packet sent at a clock granularity of one millisecond or finer.
   TCP implementations that record this already for RTT estimation do
   not require any new per-packet state.  But implementations that are
   not yet recording packet transmission times will need to add per-
   packet internal state (commonly either 4 or 8 octets per packet or
   TSO blob) to track transmission times.  In contrast, the
   conventional [RFC6675] loss detection approach does not require any
   per-packet state beyond the SACK scoreboard.
   The conventional approach is particularly useful on ultra-low-RTT
   networks where the RTT is far less than the sender's TCP clock
   granularity (e.g., inside data centers).

   RACK can easily and optionally support the conventional approach in
   [RFC6675][RFC5681] by resetting the reordering window to zero when
   the threshold is met.  Note that this approach differs slightly
   from [RFC6675], which considers a packet lost when at least
   DupThresh higher-sequence packets are SACKed.  RACK's approach
   considers a packet lost when at least one higher-sequence packet is
   SACKed and the total number of SACKed packets is at least
   DupThresh.  For example, suppose a connection sends 10 packets, and
   packets 3, 5, 7 are SACKed.  [RFC6675] considers packets 1 and 2
   lost.  RACK considers packets 1, 2, 4, 6 lost.

6.3.  Adjusting the reordering window

   When the sender detects packet reordering, RACK uses a reordering
   window of min_RTT / 4.  It uses the minimum RTT to accommodate
   reordering introduced by packets traversing slightly different
   paths (e.g., router-based parallelism schemes) or out-of-order
   deliveries in the lower link layer (e.g., wireless links using
   link-layer retransmission).  RACK uses a quarter of the minimum RTT
   because Linux TCP used the same factor in its implementation to
   delay Early Retransmit [RFC5827] to reduce spurious loss detections
   in the presence of reordering, and experience shows that this works
   reasonably well.  We have evaluated using the smoothed RTT (SRTT
   from [RFC6298] RTT estimation) or the most recently measured RTT
   (RACK.RTT) in an experiment similar to that in the Performance
   Evaluation section.  They do not make any significant difference in
   terms of total recovery latency.

6.4.  Relationships with other loss recovery algorithms

   The primary motivation of RACK is to ultimately provide a simple
   and general replacement for some of the standard loss recovery
   algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some
   nonstandard ones [FACK][THIN-STREAM].  While RACK can be a
   supplemental loss detection mechanism on top of these algorithms,
   this is not necessary, because RACK implicitly subsumes most of
   them.

   [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate
   ACK threshold based on the current or previous flight sizes.  RACK
   takes a different approach, using only one ACK event and a
   reordering window.  RACK can be seen as an extended Early
   Retransmit [RFC5827] without a FlightSize limit but with an
   additional reordering window.  [FACK] considers an original packet
   to be lost when its sequence range is sufficiently far below the
   highest SACKed sequence.  In some sense RACK can be seen as a
   generalized form of FACK that operates in time space instead of
   sequence space, enabling it to better handle reordering,
   application-limited traffic, and lost retransmissions.

   Nevertheless, RACK is still an experimental algorithm, whereas the
   oldest loss detection algorithm, the three-duplicate-ACK threshold
   [RFC5681], has been standardized and widely deployed.  RACK can
   easily and optionally support the conventional approach for
   compatibility.

   RACK is compatible with, and does not interfere with, the standard
   RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel
   algorithms [RFC3522].  This is because RACK only detects loss by
   using ACK events.
   It neither changes the RTO timer calculation nor detects spurious
   timeouts.

   Furthermore, RACK naturally works well with Tail Loss Probe [TLP],
   because a tail loss probe solicits either an ACK or a SACK, which
   can be used by RACK to detect more losses.  RACK can be used to
   relax TLP's requirement for using FACK and retransmitting the
   highest-sequenced packet, because RACK is agnostic to packet
   sequence numbers and uses transmission time instead.  Thus TLP
   could be modified to retransmit the first unacknowledged packet,
   which could improve application latency.

6.5.  Interaction with congestion control

   RACK intentionally decouples loss detection from congestion
   control.  RACK only detects losses; it does not modify the
   congestion control algorithm [RFC5681][RFC6937].  However, RACK may
   detect losses earlier or later than the conventional duplicate ACK
   threshold approach does.  A packet marked lost by RACK SHOULD NOT
   be retransmitted until congestion control deems this appropriate.
   Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used
   when using RACK.

   RACK is applicable for both fast recovery and recovery after a
   retransmission timeout (RTO) in [RFC5681].  RACK applies equally to
   fast recovery and RTO recovery because RACK is purely based on the
   transmission time order of packets.  When a packet retransmitted by
   RTO is acknowledged, RACK will mark any unacked packet sent
   sufficiently prior to the RTO as lost, because at least one RTT has
   elapsed since these packets were sent.

   The following simple example compares how RACK and non-RACK loss
   detection interact with congestion control: suppose a TCP sender
   has a congestion window (cwnd) of 20 packets on a SACK-enabled
   connection.  It sends 10 data packets and all of them are lost.

   Without RACK, the sender would time out, reset cwnd to 1, and
   retransmit the first packet.  It would take four round trips (1 + 2
   + 4 + 3 = 10) to retransmit all the 10 lost packets using slow
   start.  The recovery latency would be RTO + 4*RTT, with an ending
   cwnd of 4 packets due to congestion window validation.

   With RACK, a sender would send the TLP after 2*RTT and get a
   DUPACK.  If the sender implements Proportional Rate Reduction
   [RFC6937] it would slow start to retransmit the remaining 9 lost
   packets, since the number of packets in flight (0) is lower than
   the slow start threshold (10).  The slow start would again take
   four round trips (1 + 2 + 4 + 3 = 10).  The recovery latency would
   be 2*RTT + 4*RTT, with an ending cwnd set to the slow start
   threshold of 10 packets.

   In both cases, the sender after the recovery would be in congestion
   avoidance.  The difference in recovery latency (RTO + 4*RTT vs
   6*RTT) can be significant if the RTT is much smaller than the
   minimum RTO (1 second in RFC6298) or if the RTT is large.  The
   former case is common in local area networks, data-center networks,
   or content distribution networks with deep deployments.  The latter
   case is more common in developing regions with highly congested
   and/or high-latency networks.  The ending congestion window after
   recovery also impacts subsequent data transfers.

6.6.  TLP recovery detection with delayed ACKs

   Delayed ACKs complicate the detection of repairs done by TLP, since
   with a delayed ACK the sender receives one fewer ACK than would
   normally be expected.
   To mitigate this complication, before sending a TLP loss probe
   retransmission, the sender should attempt to wait long enough that
   the receiver has sent any delayed ACKs that it is withholding.  The
   sender algorithm described above features such a delay, in the form
   of WCDelAckT.  Furthermore, if the receiver supports duplicate
   selective acknowledgments (D-SACKs) [RFC2883], then in the case of
   a delayed ACK the sender's TLP recovery detection algorithm (see
   above) can use the D-SACK information to infer that the original
   and TLP retransmission both arrived at the receiver.

   If there is ACK loss or a delayed ACK without a D-SACK, then this
   algorithm is conservative, because the sender will reduce cwnd when
   in fact there was no packet loss.  In practice this is acceptable,
   and potentially even desirable: if there is reverse path congestion
   then reducing cwnd can be prudent.

6.7.  RACK for other transport protocols

   RACK can be implemented in other transport protocols.  The
   algorithm can be simplified by skipping step 3 if the protocol can
   support a unique transmission or packet identifier (e.g., TCP echo
   options).  For example, the QUIC protocol implements RACK
   [QUIC-LR].

7.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, both for connections to
   users in the Internet and internally.  We conducted a performance
   evaluation experiment for RACK and TLP on a small set of Google Web
   servers in Western Europe that serve mostly European and some
   African countries.  The experiment lasted three days in March 2017.
   The servers were divided evenly into four groups of roughly 5.3
   million flows each:

   Group 1 (control): RACK off, TLP off, RFC 3517 on

   Group 2: RACK on, TLP off, RFC 3517 on

   Group 3: RACK on, TLP on, RFC 3517 on

   Group 4: RACK on, TLP on, RFC 3517 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.  In terms
   of specific recovery features, all groups enabled RFC5682 (F-RTO)
   but disabled FACK, because FACK is not an IETF RFC and the goal of
   this setup is to compare RACK and TLP with RFC-based loss
   recoveries.  Since TLP depends on either FACK or RACK, we could not
   run another group that enables TLP only (with both RACK and FACK
   disabled).  Group 4 tests whether RACK plus TLP can completely
   replace the DupThresh-based loss detection of [RFC3517].

   The servers sit behind a load balancer that distributes the
   connections evenly across the four groups.

   Each group handles a similar number of connections and sends and
   receives similar amounts of data.  We compare the total time spent
   in loss recovery across groups.  The recovery time is measured from
   when the recovery and retransmission start, until the remote host
   has acknowledged the highest sequence (SND.NXT) at the time the
   recovery started.  Therefore the recovery time includes both fast
   recoveries and timeout recoveries.

   Our data shows that Group 2 recovery latency is only 0.3% lower
   than the Group 1 recovery latency.  But Group 3 recovery latency is
   25% lower than Group 1, due to a 40% reduction in RTO-triggered
   recoveries.  Therefore it is important to implement both TLP and
   RACK for performance.
   Group 4's total recovery latency is 0.02% lower than Group 3's,
   indicating that RACK plus TLP can successfully replace RFC3517 as a
   standalone recovery mechanism.

   We want to emphasize that the current experiment is limited in
   terms of network coverage.  The connectivity in Western Europe is
   fairly good, so loss recovery is not a major performance bottleneck
   there.  We plan to expand our experiments to regions with worse
   connectivity, in particular to networks with strong traffic
   policing.

8.  Security Considerations

   RACK does not change the risk profile for TCP.

   An interesting scenario is ACK-splitting attacks [SCWA99]: for an
   MSS-size packet sent, the receiver or the attacker might send MSS
   separate ACKs that each SACK or acknowledge one additional byte.
   This would not fool RACK.  RACK.xmit_ts would not advance, because
   all the sequences of the packet were transmitted at the same time
   (and so carry the same transmission timestamp).  In other words,
   SACKing only one byte of a packet or SACKing the packet in its
   entirety has the same effect on RACK.

9.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as
   an RFC.

10.  Acknowledgments

   The authors thank Matt Mathis for his insights on FACK and Michael
   Welzl for his per-packet timer idea that inspired this work.  Eric
   Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
   Iyengar, and Hiren Panchasara contributed to the draft and to the
   implementations in Linux, FreeBSD and QUIC.

11.  References

11.1.  Normative References

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
              Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC4737]  Morton, A., Ciavattone, L., Ramachandran, G., Shalunov,
              S., and J. Perser, "Packet Reordering Metrics",
              RFC 4737, November 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for
              Detecting Spurious Retransmission Timeouts with TCP",
              RFC 5682, September 2009.

   [RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
              Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, April 2010.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
              M., and Y. Nishida, "A Conservative Loss Recovery
              Algorithm Based on Selective Acknowledgment (SACK) for
              TCP", RFC 6675, August 2012.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, May 2013.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              RFC 7323, September 2014.

   [RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
              September 1981.

11.2.  Informative References

   [FACK]     Mathis, M. and J. Mahdavi, "Forward acknowledgement:
              refining TCP congestion control", ACM SIGCOMM Computer
              Communication Review, Volume 26, Issue 4, October 1996.
Jamshid, "Forward acknowledgement: 1097 refining TCP congestion control", ACM SIGCOMM Computer 1098 Communication Review, Volume 26, Issue 4, Oct. 1996. , 1099 1996. 1101 [POLICER16] 1102 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1103 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1104 Analysis of Traffic Policing in the Web", ACM SIGCOMM , 1105 2016. 1107 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1108 Congestion Control", draft-tsvwg-quic-loss-recovery-01 1109 (work in progress), June 2016. 1111 [REORDER-DETECT] 1112 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 1113 "Detection and Quantification of Packet Reordering with 1114 TCP", draft-zimmermann-tcpm-reordering-detection-02 (work 1115 in progress), November 2014. 1117 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1118 and SCTP RTO Restart", February 2016. 1120 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1121 "TCP Congestion Control With a Misbehaving Receiver", ACM 1122 Computer Communication Review, 29(5) , 1999. 1124 [THIN-STREAM] 1125 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1126 "TCP enhancements for interactive thin-stream 1127 applications", NOSSDAV , 2008. 1129 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1130 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1131 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1132 in progress), August 2013. 1134 Authors' Addresses 1136 Yuchung Cheng 1137 Google, Inc 1138 1600 Amphitheater Parkway 1139 Mountain View, California 94043 1140 USA 1142 Email: ycheng@google.com 1144 Neal Cardwell 1145 Google, Inc 1146 76 Ninth Avenue 1147 New York, NY 10011 1148 USA 1150 Email: ncardwell@google.com 1152 Nandita Dukkipati 1153 Google, Inc 1154 1600 Amphitheater Parkway 1155 Mountain View, California 94043 1157 Email: nanditad@google.com 1159 Priyaranjan Jha 1160 Google, Inc 1161 1600 Amphitheater Parkway 1162 Mountain View, California 94043 1164 Email: priyarjha@google.com