TCP Maintenance Working Group                                  Y. Cheng
Internet-Draft                                              N. Cardwell
Intended status: Experimental                              N. Dukkipati
Expires: May 4, 2020                                             P. Jha
                                                             Google, Inc
                                                        November 1, 2019

       RACK: a time-based fast loss detection algorithm for TCP
                        draft-ietf-tcpm-rack-06

Abstract

This document presents a new TCP loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time, instead of packet or sequence counts, to detect losses, for modern TCP implementations that can support per-packet timestamps and the selective acknowledgment (SACK) option. It is intended to replace the conventional DUPACK threshold approach and its variants, as well as other nonstandard approaches.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 4, 2020.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors.
All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. In this document, these words will appear with that interpretation only when in UPPER CASE. Lower case uses of these words are not to be interpreted as carrying [RFC2119] significance.

2. Introduction

This document presents a new loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time instead of the conventional packet or sequence counting approaches for detecting losses. RACK deems a packet lost if some packet sent sufficiently later has been delivered. It does this by recording packet transmission times and inferring losses using cumulative acknowledgments or selective acknowledgment (SACK) TCP options.

In the last couple of years we have been observing several increasingly common loss and reordering patterns in the Internet:

1. Lost retransmissions. Traffic policers [POLICER16] and burst losses often cause retransmissions to be lost again, severely increasing TCP latency.

2. Tail drops. Structured request-response traffic turns more losses into tail drops. In such cases, TCP is application-limited, so it cannot send new data to probe losses and has to rely on retransmission timeouts (RTOs).

3. Reordering. Link-layer protocols (e.g., 802.11 block ACK), link bonding, or routers' internal load-balancing can deliver TCP packets out of order. The degree of such reordering is usually within the order of the path round-trip time.

Although TCP stacks (e.g., Linux) implement many of the standard and proposed loss detection algorithms [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM], we've found that together they do not perform well. The main reason is that many of them are based on the classic rule of counting duplicate acknowledgments [RFC5681]. They can detect loss either quickly or accurately, but not both, especially when the sender is application-limited or under unpredictable reordering. And under these conditions none of them can detect lost retransmissions well.

Also, these algorithms, including RFCs, rarely address their interactions with other algorithms. For example, FACK may consider a packet lost while RFC6675 may not. Implementing N algorithms while dealing with N^2 interactions is a daunting and error-prone task.

The goal of RACK is to solve all of the problems above by replacing many of these loss detection algorithms with a single, more effective algorithm that handles both loss and reordering.

3. Overview

The main idea behind RACK is that if a packet has been delivered out of order, then the packets sent chronologically before it were either lost or reordered. This concept is not fundamentally different from [RFC5681][RFC6675][FACK]. But the key innovation in RACK is to use a per-packet transmission timestamp and widely deployed SACK options to conduct time-based inferences, instead of inferring losses with packet or sequence counting approaches.

Using a threshold for counting duplicate acknowledgments (i.e., DupThresh) alone is no longer reliable because of today's prevalent reordering patterns. A common type of reordering is that the last "runt" packet of a window's worth of packet bursts gets delivered first, then the rest arrive shortly after in order. To handle this effectively, a sender would need to constantly adjust the DupThresh to the burst size; but this would risk increasing the frequency of RTOs on real losses.

Today's prevalent lost retransmissions also cause problems with packet-counting approaches [RFC5681][RFC6675][FACK], since those approaches depend on reasoning in sequence number space. Retransmissions break the direct correspondence between ordering in sequence space and ordering in time. So when retransmissions are lost, sequence-based approaches are often unable to infer and quickly repair losses that can be deduced with time-based approaches.

Instead of counting packets, RACK uses the most recently delivered packet's transmission time to judge if some packets sent prior to that time have "expired" by passing a certain reordering settling window. On each ACK, RACK marks any already-expired packets lost, and for any packets that have not yet expired it waits until the reordering window passes and then marks those lost as well. In either case, RACK can repair the loss without waiting for a (long) RTO. RACK can be applied to both fast recovery and timeout recovery, and can detect losses on both originally transmitted and retransmitted packets, making it a great all-weather loss detection mechanism.

4. Design Rationale for Reordering Tolerance

The reordering behavior of networks can evolve (over years) in response to the behavior of transport protocols and applications, as well as the needs of network designers and operators. From a network or link designer's viewpoint, parallelization (e.g., link bonding) is the easiest way to get a network to go faster. Therefore reordering is their main constraint on speed, and there is pressure to relax that constraint. If RACK becomes widely deployed, the underlying networks may introduce more reordering for higher throughput. But this may result in excessive reordering that hurts end-to-end performance:

1. End host packet processing: extreme reordering on high-speed networks would incur high CPU cost by greatly reducing the effectiveness of aggregation mechanisms, such as large receive offload (LRO) and generic receive offload (GRO), and significantly increasing the number of ACKs.

2. Congestion control: TCP congestion control implicitly assumes that the feedback from ACKs comes from the same bottleneck. Therefore it cannot handle well scenarios where packets traverse largely disjoint paths.

3. Loss recovery: Having an excessively large reordering window to accommodate widely different latencies from different paths would increase the latency of loss recovery.

An end-to-end transport protocol cannot tell immediately whether a hole is reordering or loss. It can only distinguish between the two in hindsight, if the hole in the sequence space gets filled later without a retransmission. How long the sender waits for such potential reordering events to settle is determined by the current reordering window.

Given these considerations, a core design philosophy of RACK is to adapt to the measured duration of reordering events, within reasonable and specific bounds. To accomplish this, RACK places the following mandates on the reordering window:

1. The initial RACK reordering window SHOULD be set to a small fraction of the round-trip time.

2. If no reordering has been observed, then RACK SHOULD honor the classic 3-DUPACK rule for initiating fast recovery. One simple way to implement this is to temporarily override the reordering window to 0.

3. The RACK reordering window SHOULD leverage Duplicate Selective Acknowledgement (DSACK) information [RFC3708] to adaptively estimate the duration of reordering events.

4. The RACK reordering window MUST be bounded, and this bound SHOULD be one round trip.

As a flow starts, either condition 1 or condition 2 or both would trigger RACK to start the recovery process quickly. The low initial reordering window and the use of the 3-DUPACK rule are key to achieving low-latency loss recovery for short flows, by risking spurious retransmissions to recover losses quickly. The rationale is that spurious retransmissions for short flows are not expected to produce excessive network traffic.

For long flows the design tolerates reordering within a round trip. This handles reordering caused by path divergence on small time scales (reordering within the round-trip time of the shortest path), which should tolerate much of the reordering from link bonding, multipath routing, or link-layer out-of-order delivery. It also relaxes ordering constraints to allow sending flights of TCP packets on different paths dynamically for better load-balancing (e.g., flowlets).

However, the fact that the initial RACK reordering window is low, and the RACK reordering window's adaptive growth is bounded, means that there will continue to be a cost to reordering and a limit to RACK's adaptation to reordering. This maintains a disincentive for network designers and operators to introduce needless or excessive reordering, particularly because they have to allow for low-round-trip-time paths. This means RACK will not encourage networks to perform inconsiderate fine-grained packet-spraying over highly disjoint paths with very different characteristics. There are good alternative solutions, such as MPTCP, for such networks.

To conclude, the RACK algorithm aims to adapt to small degrees of reordering, quickly recover most losses within one to two round trips, and avoid costly retransmission timeouts (RTOs). In the presence of reordering, the adaptation algorithm can impose sometimes-needless delays when it waits to disambiguate loss from reordering, but the penalty for waiting is bounded to one round trip and such delays are confined to longer-running flows.

This document provides a concrete and detailed reordering window adaptation algorithm for implementors. We note that the specifics of the algorithm are likely to evolve over time. But that is a separate engineering optimization that is out of scope for this document.

5. Requirements

The reader is expected to be familiar with the definitions given in the TCP congestion control [RFC5681] and selective acknowledgment [RFC2018] RFCs. Familiarity with the conservative SACK-based recovery for TCP [RFC6675] is not expected, but is helpful.

RACK has three requirements:

1. The connection MUST use selective acknowledgment (SACK) options [RFC2018].

2. For each packet sent, the sender MUST store its most recent transmission time with (at least) millisecond granularity. For round-trip times lower than a millisecond (e.g., intra-datacenter communications), microsecond granularity would significantly help the detection latency, but is not required.

3. For each packet sent, the sender MUST remember whether the packet has been retransmitted or not.

We assume that requirement 1 implies the sender keeps a SACK scoreboard, which is a data structure to store selective acknowledgment information on a per-connection basis ([RFC6675] section 3). For ease of explaining the algorithm, we use a pseudo-scoreboard that manages the data in sequence number ranges. But the specifics of the data structure are left to the implementor.

RACK requires no changes at the receiver.

6. Definitions

The reader is expected to be familiar with the definitions given in [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

6.1. Definitions of variables

A sender implementing RACK needs to store these new RACK variables:

"Packet.xmit_ts" is the time of the last transmission of a data packet, including retransmissions, if any. The sender needs to record the transmission time for each packet sent and not yet acknowledged. The time MUST be stored at millisecond granularity or finer.

"RACK.packet" is the packet that was sent most recently (including retransmissions), among all the packets that have been either selectively or cumulatively acknowledged.

"RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

"RACK.end_seq" is the ending TCP sequence number of RACK.packet.

"RACK.rtt" is the RTT of the most recently delivered packet on the connection (either cumulatively acknowledged or selectively acknowledged) that was not marked invalid as a possible spurious retransmission.

"RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

"RACK.reo_wnd" is a reordering window computed in the unit of time used for recording packet transmission times. It is used to defer the moment at which RACK marks a packet lost.

"RACK.dupthresh" is a constant specifying the number of duplicate acknowledgments, or selectively acknowledged segments, that can (under certain conditions) trigger fast recovery, similar to [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined to be 3.

"RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the connection.

"RACK.ack_ts" is the time when all the sequences in RACK.packet were selectively or cumulatively acknowledged.

"RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd.

"RACK.reo_wnd_persist" is the number of loss recoveries before resetting RACK.reo_wnd.

"RACK.dsack" indicates whether a DSACK option has been received since the last RACK.reo_wnd change.

"RACK.pkts_sacked" is the total number of packets selectively acknowledged in the SACK scoreboard.

"RACK.reord" indicates whether the connection has detected packet reordering event(s).

"RACK.fack" is the highest selectively or cumulatively acknowledged sequence.

Note that the Packet.xmit_ts variable is per packet in flight. The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT variables are kept in the per-connection TCP control block. RACK.packet and RACK.ack_ts are used as local variables in the algorithm.

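As a non-normative illustration, the following Python sketch shows one way to lay out the per-packet and per-connection state defined above. The class and field names are illustrative only (they mirror the variable names above); times are floats in seconds. The sketches in later sections build on these classes:

    from dataclasses import dataclass

    @dataclass
    class Packet:
        start_seq: int               # first sequence number of the packet
        end_seq: int                 # ending sequence number of the packet
        xmit_ts: float = 0.0         # Packet.xmit_ts
        retransmitted: bool = False  # ever retransmitted?
        lost: bool = False           # marked lost in the scoreboard

    @dataclass
    class RackState:
        # Kept in the per-connection TCP control block:
        xmit_ts: float = 0.0            # RACK.xmit_ts
        end_seq: int = 0                # RACK.end_seq
        rtt: float = 0.0                # RACK.rtt
        rtt_seq: int = 0                # RACK.rtt_seq
        reo_wnd: float = 0.0            # RACK.reo_wnd
        min_rtt: float = float("inf")   # RACK.min_RTT
        dupthresh: int = 3              # RACK.dupthresh
        reo_wnd_incr: int = 1           # RACK.reo_wnd_incr
        reo_wnd_persist: int = 16       # RACK.reo_wnd_persist
        dsack: bool = False             # RACK.dsack
        pkts_sacked: int = 0            # RACK.pkts_sacked
        reord: bool = False             # RACK.reord
        fack: int = 0                   # RACK.fack
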
324 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd 326 "RACK.reo_wnd_persist" is the number of loss recoveries before 327 resetting RACK.reo_wnd 329 "RACK.dsack" indicates if a DSACK option has been received since last 330 RACK.reo_wnd change "RACK.pkts_sacked" returns the total number of 331 packets selectively acknowledged in the SACK scoreboard. 333 "RACK.reord" indicates the connection has detected packet reordering 334 event(s) 335 "RACK.fack" is the highest selectively or cumulatively acknowledged 336 sequence 338 Note that the Packet.xmit_ts variable is per packet in flight. The 339 RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT 340 variables are kept in the per-connection TCP control block. 341 RACK.packet and RACK.ack_ts are used as local variables in the 342 algorithm. 344 7. Algorithm Details 346 7.1. Transmitting a data packet 348 Upon transmitting a new packet or retransmitting an old packet, 349 record the time in Packet.xmit_ts. RACK does not care if the 350 retransmission is triggered by an ACK, new application data, an RTO, 351 or any other means. 353 7.2. Upon receiving an ACK 355 Step 1: Update RACK.min_RTT. 357 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 358 update the estimated minimum RTT in RACK.min_RTT. The sender can 359 track a simple global minimum of all RTT measurements from the 360 connection, or a windowed min-filtered value of recent RTT 361 measurements. This document does not specify an exact approach. 363 Step 2: Update RACK stats 365 Given the information provided in an ACK, each packet cumulatively 366 ACKed or SACKed is marked as delivered in the scoreboard. Among all 367 the packets newly ACKed or SACKed in the connection, record the most 368 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 369 Sometimes the timestamps of RACK.Packet and Packet could carry the 370 same transmit timestamps due to clock granularity or segmentation 371 offloading (i.e. the two packets were handed to the NIC as a single 372 unit). In that case the sequence numbers of RACK.end_seq and 373 Packet.end_seq are compared to break the tie. 375 Since an ACK can also acknowledge retransmitted data packets, and 376 retransmissions can be spurious, the sender must take care to avoid 377 spurious inferences. For example, if the sender were to use timing 378 information from a spurious retransmission, the RACK.rtt could be 379 vastly underestimated. 381 To avoid spurious inferences, ignore a packet as invalid if any of 382 its TCP sequences have been retransmitted before and either of two 383 conditions is true: 385 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 386 option [RFC7323], if available, indicates the ACK was not 387 acknowledging the last retransmission of the packet. 389 2. The packet was last retransmitted less than RACK.min_rtt ago. 391 If the ACK is not ignored as invalid, update the RACK.rtt to be the 392 RTT sample calculated using this ACK, and continue. If this ACK or 393 SACK was for the most recently sent packet, then record the 394 RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK. 395 Otherwise exit here and omit the following steps. 397 Notice that the second condition above is a heuristic. This 398 heuristic would fail to update RACK stats if the packet is spuriously 399 retransmitted because of a recent minimum RTT decrease (e.g. path 400 change). 
Consequently, RACK may not detect losses from ACK events, and the recovery would resort to the (slower) TLP or RTO timer-based events. However, such events should be rare, and the connection would pick up the new minimum RTT when the recovery ends, avoiding repeated similar failures.

Step 2 may be summarized in pseudocode as:

    RACK_sent_after(t1, seq1, t2, seq2):
        If t1 > t2:
            Return true
        Else if t1 == t2 AND seq1 > seq2:
            Return true
        Else:
            Return false

    RACK_update():
        For each Packet newly acknowledged cumulatively or selectively:
            rtt = Now() - Packet.xmit_ts
            If Packet.retransmitted is TRUE:
                If ACK.ts_option.echo_reply < Packet.xmit_ts:
                    Return
                If rtt < RACK.min_rtt:
                    Return

            RACK.rtt = rtt
            If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                               RACK.xmit_ts, RACK.end_seq):
                RACK.xmit_ts = Packet.xmit_ts
                RACK.end_seq = Packet.end_seq

Step 3: Detect packet reordering.

To detect reordering, the sender looks for original data packets being delivered out of order in sequence space. The sender tracks the highest sequence selectively or cumulatively acknowledged in the RACK.fack variable. The name "fack" stands for the most forward ACK and originated in the [FACK] draft. If the ACK selectively or cumulatively acknowledges an unacknowledged and also never retransmitted sequence below RACK.fack, then the corresponding packet has been reordered and RACK.reord is set to TRUE.

The heuristic above only detects reordering if the reordered packet has not yet been retransmitted. This is a major drawback, because if RACK has a low reordering window and the network is reordering packets, RACK may falsely retransmit frequently. Consequently, RACK may fail to detect reordering and thus fail to increase the reordering window, because the reordered packets were already (falsely) retransmitted.

DSACK [RFC3708] can help mitigate this issue. A false retransmission would solicit a DSACK option in the ACK. Therefore, if the ACK has a DSACK option covering some sequences that were both acknowledged and retransmitted, this implies the original packet was reordered but RACK retransmitted the packet too quickly; the sender should then set RACK.reord to TRUE.

    RACK_detect_reordering():
        For each Packet newly acknowledged cumulatively or selectively:
            If Packet.end_seq > RACK.fack:
                RACK.fack = Packet.end_seq
            Else if Packet.end_seq < RACK.fack AND
                    Packet.retransmitted is FALSE:
                RACK.reord = TRUE

        For each Packet covered by the DSACK option:
            If Packet.retransmitted is TRUE:
                RACK.reord = TRUE

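As a non-normative companion to the pseudocode above, the following runnable Python sketch implements Steps 2 and 3 using the Packet and RackState classes sketched in Section 6.1. The tsecr argument (the ACK's Timestamp Echo Reply, or None if unavailable) and the dsacked list (packets covered by a DSACK option) are assumptions of this sketch:

    def rack_sent_after(t1, seq1, t2, seq2):
        # Sequence numbers break ties between packets carrying the same
        # transmit timestamp (e.g., segments of a single TSO blob).
        return t1 > t2 or (t1 == t2 and seq1 > seq2)

    def rack_update(rack, newly_acked, now, tsecr=None):
        for pkt in newly_acked:
            rtt = now - pkt.xmit_ts
            if pkt.retransmitted:
                # Mirror the pseudocode: abort on a sample that may stem
                # from a spurious retransmission.
                if tsecr is not None and tsecr < pkt.xmit_ts:
                    return
                if rtt < rack.min_rtt:
                    return
            rack.rtt = rtt
            if rack_sent_after(pkt.xmit_ts, pkt.end_seq,
                               rack.xmit_ts, rack.end_seq):
                rack.xmit_ts = pkt.xmit_ts
                rack.end_seq = pkt.end_seq

    def rack_detect_reordering(rack, newly_acked, dsacked):
        for pkt in newly_acked:
            if pkt.end_seq > rack.fack:
                rack.fack = pkt.end_seq
            elif pkt.end_seq < rack.fack and not pkt.retransmitted:
                rack.reord = True
        # A DSACK for a retransmitted packet implies the original was
        # reordered but was retransmitted too quickly.
        for pkt in dsacked:
            if pkt.retransmitted:
                rack.reord = True
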
Step 4: Update RACK reordering window.

To handle the prevalent small degree of reordering, RACK.reo_wnd serves as an allowance for settling time before marking a packet lost. This section documents a detailed algorithm following the design rationale section. RACK starts initially with a conservative window of min_RTT/4. If no reordering has been observed, RACK uses a RACK.reo_wnd of 0 during loss recovery or when the number of DUPACKs exceeds the classic DUPACK threshold, in order to retransmit quickly. The subtle difference between this approach and the conventional one [RFC5681][RFC6675] is discussed later in the section "RACK and TLP discussions".

Further, RACK MAY use DSACK [RFC3708] to adapt the reordering window to higher degrees of reordering, if DSACK is supported. Receiving an ACK with a DSACK indicates a spurious retransmission, which in turn suggests that the RACK reordering window, RACK.reo_wnd, is likely too small. The sender MAY increase the RACK.reo_wnd window linearly for every round trip in which the sender receives a DSACK, so that after N distinct round trips in which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an upper bound of SRTT. The inflated RACK.reo_wnd persists for 16 loss recoveries, after which it resets to its starting value, min_RTT / 4.

The following pseudocode implements the above algorithm. Note that extensions that require additional TCP features (e.g., DSACK) would still work if the feature functions simply return false.

    RACK_update_reo_wnd():
        RACK.min_RTT = TCP_min_RTT()
        If DSACK option is present:
            RACK.dsack = true

        If SND.UNA < RACK.rtt_seq:
            RACK.dsack = false /* React to DSACK once per round trip */

        If RACK.dsack:
            RACK.reo_wnd_incr += 1
            RACK.dsack = false
            RACK.rtt_seq = SND.NXT
            RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
        Else if exiting loss recovery:
            RACK.reo_wnd_persist -= 1
            If RACK.reo_wnd_persist <= 0:
                RACK.reo_wnd_incr = 1

        If RACK.reord is FALSE:
            If in loss recovery: /* If in fast or timeout recovery */
                RACK.reo_wnd = 0
                Return
            Else if RACK.pkts_sacked >= RACK.dupthresh:
                RACK.reo_wnd = 0
                Return
        RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
        RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

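The reordering window adaptation in Step 4 can likewise be sketched as runnable Python (same assumptions as the previous sketch; srtt, snd_una, snd_nxt, and the two recovery flags are supplied by the caller, and rack.min_rtt is assumed to be up to date from Step 1):

    def rack_update_reo_wnd(rack, dsack_seen, snd_una, snd_nxt, srtt,
                            in_recovery, exiting_recovery):
        if dsack_seen:
            rack.dsack = True
        if snd_una < rack.rtt_seq:
            rack.dsack = False      # react to DSACK once per round trip
        if rack.dsack:
            rack.reo_wnd_incr += 1
            rack.dsack = False
            rack.rtt_seq = snd_nxt
            rack.reo_wnd_persist = 16   # keep window for 16 recoveries
        elif exiting_recovery:
            rack.reo_wnd_persist -= 1
            if rack.reo_wnd_persist <= 0:
                rack.reo_wnd_incr = 1
        if not rack.reord and (in_recovery or
                               rack.pkts_sacked >= rack.dupthresh):
            rack.reo_wnd = 0.0      # no reordering seen: recover quickly
            return
        rack.reo_wnd = min(rack.min_rtt / 4 * rack.reo_wnd_incr, srtt)

After N round trips in which a DSACK was received, reo_wnd_incr is N+1 and the window is (N+1) * min_RTT / 4, capped at SRTT, exactly as described above.
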
Step 5: Detect losses.

For each packet that has not been SACKed, if RACK.xmit_ts is after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its corresponding sequence range) lost in the scoreboard. The rationale is that if another packet that was sent later has been delivered, and the reordering window or "reordering settling time" has already passed, then the packet was likely lost.

If another packet that was sent later has been delivered, but the reordering window has not passed, then it is not yet safe to deem the unacked packet lost. Using the basic algorithm above, the sender would wait for the next ACK to further advance RACK.xmit_ts; but this risks a timeout (RTO) if no more ACKs come back (e.g., due to losses or application limit). For timely loss detection, the sender MAY install a "reordering settling" timer set to fire at the earliest moment at which it is safe to conclude that some packet is lost. The earliest moment is the time it takes to expire the reordering window of the earliest unacked packet in flight.

This timer expiration value can be derived as follows. As a starting point, we consider that the reordering window has passed if the RACK.packet was sent sufficiently after the packet in question, or a sufficient time has elapsed since the RACK.packet was S/ACKed, or some combination of the two. More precisely, RACK marks a packet as lost if the reordering window for a packet has elapsed through the sum of:

1. the delta in transmit time between the packet and the RACK.packet

2. the delta in time between RACK.ack_ts and now

So we mark a packet as lost if:

    RACK.xmit_ts >= Packet.xmit_ts
    AND
    (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

If we solve this second condition for "now", the moment at which we can declare a packet lost, then we get:

    now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the sender used to set RACK.xmit_ts: the round-trip time of the most recently (re)transmitted packet that has been delivered. To be more robust to reordering, RACK uses a more conservative RTT value to decide if an unacknowledged packet should be considered lost, RACK.rtt: the round-trip time of the most recently delivered packet on the connection that was not marked invalid as a possible spurious retransmission.

When packets are delivered in order, the most recently (re)transmitted packet that has been delivered is also the most recently delivered; hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts. But if packets were reordered, then the packet delivered most recently was sent before the most recently (re)transmitted packet; hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts).

Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous equation reduces to saying that the sender can declare a packet lost when:

    now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

In turn, that is equivalent to stating that a RACK sender should declare a packet lost when:

    Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

The following pseudocode implements the algorithm above. When an ACK is received or the RACK timer expires, call RACK_detect_loss(). The algorithm includes an additional optimization to break timestamp ties by using the TCP sequence space. The optimization is particularly useful to detect losses in a timely manner with TCP Segmentation Offload, where multiple packets in one TSO blob have identical timestamps. It is also useful when the timestamp clock granularity is close to or longer than the actual round-trip time.

    RACK_detect_loss():
        timeout = 0

        For each packet, Packet, not acknowledged yet:
            If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
                Continue /* Packet lost but not yet retransmitted */

            If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                               Packet.xmit_ts, Packet.end_seq):
                remaining = Packet.xmit_ts + RACK.rtt +
                            RACK.reo_wnd - Now()
                If remaining <= 0:
                    Packet.lost = TRUE
                Else:
                    timeout = max(remaining, timeout)

        If timeout != 0:
            Arm a timer to call RACK_detect_loss() after timeout

Implementation optimization: looping through packets in the SACK scoreboard above could be very costly on large-BDP networks, since the inflight could be very large. If the implementation can organize the scoreboard data structures to have packets sorted by the last (re)transmission time, then the loop can start at the least recently sent packet and abort at the first packet sent after RACK.xmit_ts. This can be implemented by using a separate list sorted in time order. The implementation inserts the packet at the tail of the list when it is (re)transmitted, and removes a packet from the list when it is delivered or marked lost. We RECOMMEND such an optimization because it enables implementations to support high-BDP networks. This optimization is implemented in Linux and sees orders of magnitude improvement in CPU usage on high-speed WAN networks.

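A non-normative sketch of the RECOMMENDED time-sorted structure follows; collections.OrderedDict stands in for a list supporting O(1) tail insertion and removal, and rack_sent_after() and the Packet class are from the earlier sketches. Unlike the unsorted pseudocode above, the time ordering lets this variant stop at the first unexpired packet and return its remaining time, the earliest moment at which some packet can be declared lost:

    from collections import OrderedDict

    class XmitOrderedScoreboard:
        def __init__(self):
            self.by_xmit_time = OrderedDict()   # start_seq -> Packet

        def on_transmit(self, pkt, now):
            pkt.xmit_ts = now
            # A (re)transmission moves the packet to the tail.
            self.by_xmit_time.pop(pkt.start_seq, None)
            self.by_xmit_time[pkt.start_seq] = pkt

        def on_delivered_or_lost(self, pkt):
            self.by_xmit_time.pop(pkt.start_seq, None)

        def rack_detect_loss(self, rack, now):
            # Walk from the least recently sent packet.
            for pkt in list(self.by_xmit_time.values()):
                if not rack_sent_after(rack.xmit_ts, rack.end_seq,
                                       pkt.xmit_ts, pkt.end_seq):
                    break   # sent after RACK.packet; so were the rest
                remaining = pkt.xmit_ts + rack.rtt + rack.reo_wnd - now
                if remaining <= 0:
                    pkt.lost = True
                    self.on_delivered_or_lost(pkt)
                else:
                    return remaining   # arm reordering settling timer
            return 0.0                 # nothing pending

Note that removing a lost packet from the list (and re-adding it upon retransmission) has the same effect as the "lost but not yet retransmitted" skip in the pseudocode above.
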
7.3. Tail Loss Probe: fast recovery for tail losses

This section describes a supplemental algorithm, Tail Loss Probe (TLP), which leverages RACK to further reduce RTO recoveries. TLP triggers fast recovery to quickly repair tail losses that can otherwise be recovered only by RTOs. After an original data transmission, TLP sends a probe data segment within one to two RTTs. The probe data segment can either be new, previously unsent data, or a retransmission of previously sent data just below SND.NXT. In either case the goal is to elicit more feedback from the receiver, in the form of an ACK (potentially with SACK blocks), to allow RACK to trigger fast recovery instead of an RTO.

An RTO occurs when the first unacknowledged sequence number is not acknowledged after a conservative period of time has elapsed [RFC6298]. Common causes of RTOs include:

1. The entire flight of data is lost.

2. Tail losses of data segments at the end of an application transaction.

3. Tail losses of ACKs at the end of an application transaction.

4. Lost retransmits, which can halt fast recovery based on [RFC6675] if the ACK stream completely dries up. For example, consider a window of three data packets (P1, P2, P3) that are sent; P1 and P2 are dropped. On receipt of a SACK for P3, RACK marks P1 and P2 as lost and retransmits them as R1 and R2. Suppose R1 and R2 are lost as well, so there are no more returning ACKs to detect R1 and R2 as lost. Recovery stalls.

5. An unexpectedly long round-trip time (RTT). This can cause ACKs to arrive after the RTO timer expires. The F-RTO algorithm [RFC5682] is designed to detect such spurious retransmission timeouts and at least partially undo the consequences of such events, but F-RTO cannot be used in many situations.

7.4. Tail Loss Probe: An Example

The following is an example of TLP. All events listed are at a TCP sender.

1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is no more new data to transmit. A PTO is scheduled to fire in 2 RTTs, after the transmission of the 10th segment.

2. Sender receives acknowledgements (ACKs) for segments 1-5; segments 6-10 are lost and no ACKs are received. The sender reschedules its PTO timer relative to the last received ACK, which is the ACK for segment 5 in this case. The sender sets the PTO interval using the calculation described in step (2) of the algorithm.

3. When the PTO fires, the sender retransmits segment 10.

4. After an RTT, a SACK for packet 10 arrives. The ACK also carries SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based loss recovery.

5. The connection enters fast recovery and retransmits the remaining lost segments.

7.5. Tail Loss Probe Algorithm Details

We define the terminology used in specifying the TLP algorithm:

FlightSize: amount of outstanding data in the network, as defined in [RFC5681].

RTO: The transport's retransmission timeout (RTO) is based on measured round-trip times (RTT) between the sender and receiver, as specified in [RFC6298] for TCP.

PTO: Probe timeout (PTO) is a timer event indicating that an ACK is overdue. Its value is constrained to be smaller than or equal to an RTO.

SRTT: smoothed round-trip time, computed as specified in [RFC6298].

TLPRxtOut: a boolean indicating whether there is an unacknowledged TLP retransmission.

TLPHighRxt: the value of SND.NXT at the time of sending a TLP retransmission.

The TLP algorithm has three phases, which we discuss in turn.

7.5.1. Phase 1: Scheduling a loss probe

Step 1: Check conditions for scheduling a PTO.

A sender should check to see if it should schedule a PTO in the following situations:

1. After transmitting new data that was not itself a TLP probe

2. Upon receiving an ACK that cumulatively acknowledges data

3. Upon receiving a SACK that selectively acknowledges data that was last sent before the segment with SEG.SEQ=SND.UNA was last (re)transmitted

A sender should schedule a PTO only if all of the following conditions are met:

1. The connection supports SACK [RFC2018]

2. The connection has no SACKed sequences in the SACK scoreboard

3. The connection is not in loss recovery

If a PTO can be scheduled according to these conditions, the sender should schedule a PTO. If there was a previously scheduled PTO or RTO pending, then that pending PTO or RTO should first be cancelled, and then the new PTO should be scheduled.

If a PTO cannot be scheduled according to these conditions, then the sender MUST arm the RTO timer if there is unacknowledged data in flight.

Step 2: Select the duration of the PTO.

A sender SHOULD use the following logic to select the duration of a PTO:

    TLP_timeout():
        If SRTT is available:
            PTO = 2 * SRTT
            If FlightSize = 1:
                PTO += WCDelAckT
            Else:
                PTO += 2ms
        Else:
            PTO = 1 sec

        If Now() + PTO > TCP_RTO_expire():
            PTO = TCP_RTO_expire() - Now()

Aiming for a PTO value of 2*SRTT allows a sender to wait long enough to know that an ACK is overdue. Under normal circumstances, i.e., no losses, an ACK typically arrives within one SRTT. But choosing PTO to be exactly an SRTT is likely to generate spurious probes, given that network delay variance and even end-system timings can easily push an ACK to be above an SRTT. We chose PTO to be the next integral multiple of SRTT.

Similarly, network delay variations and end-system processing latencies and timer granularities can easily delay ACKs beyond 2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value (and MAY add more if the sending host OS timer granularity is coarser than 1ms).

WCDelAckT stands for worst-case delayed ACK timer. When FlightSize is 1, PTO is inflated by WCDelAckT time to compensate for a potentially long delayed ACK timer at the receiver. The RECOMMENDED value for WCDelAckT is 200ms.

Finally, if the time at which an RTO would fire (here denoted "TCP_RTO_expire") is sooner than the computed time for the PTO, then a probe is scheduled to be sent at that earlier time.

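The PTO computation can be expressed as a small runnable Python function (a non-normative sketch; srtt is in seconds, or None when no RTT sample exists yet, and rto_expire_time is the absolute time at which the RTO would fire):

    WCDELACKT = 0.200     # worst-case delayed ACK timer: RECOMMENDED 200ms
    MIN_PTO_PAD = 0.002   # at least 2ms of slack beyond 2*SRTT

    def tlp_timeout(srtt, flight_size, now, rto_expire_time):
        if srtt is not None:
            pto = 2 * srtt
            # A single outstanding segment may be held by a delayed ACK.
            pto += WCDELACKT if flight_size == 1 else MIN_PTO_PAD
        else:
            pto = 1.0     # no RTT sample yet: fall back to 1 second
        # Never schedule the probe later than the RTO would fire.
        return min(pto, rto_expire_time - now)
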
7.5.2. Phase 2: Sending a loss probe

When the PTO fires, transmit a probe data segment:

    TLP_send_probe():
        If an unsent segment exists AND
           the receive window allows new data to be sent:
            Transmit the lowest-sequence unsent segment of up to SMSS
            Increment FlightSize by the size of the newly-sent segment

        Else if TLPRxtOut is not set:
            Retransmit the highest-sequence segment sent so far
            The cwnd remains unchanged

When the loss probe is a retransmission, the sender uses the highest-sequence segment sent so far. This is in order to deal with the retransmission ambiguity problem in TCP. Suppose a sender sends N segments, then retransmits the last segment (segment N) as a loss probe, and then receives a SACK for segment N. As long as the sender waits for any required RACK reordering settling timer to then expire, it does not matter whether that SACK was for the original transmission of segment N or the TLP retransmission; in either case the arrival of the SACK for segment N provides evidence that the N-1 segments preceding segment N were likely lost. In the case where there is only one original outstanding segment of data (N=1), the same logic (trivially) applies: an ACK for a single outstanding segment tells the sender the N-1=0 segments preceding that segment were lost. Furthermore, whether there are N>1 or N=1 outstanding segments, there is a question about whether the original last segment or its TLP retransmission were lost; the sender estimates this using TLP recovery detection (see below).

Note that whether or not a probe was sent in TLP_send_probe(), the sender MUST arm the RTO timer, not the PTO timer, at the end of TLP_send_probe() if FlightSize is not zero. This ensures that the sender does not send repeated, back-to-back TLP probes. Checking TLPRxtOut prior to sending the loss probe is also critical to avoid TLP loops if an application writes periodically at an interval less than PTO.

7.5.3. Phase 3: ACK processing

On each incoming ACK, the sender should check the conditions in Step 1 of Phase 1 to see if it should schedule (or reschedule) the loss probe timer.

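The probe transmission decision in Phase 2, together with the TLPRxtOut/TLPHighRxt bookkeeping described in Section 7.6.2 below, could look as follows in Python. This is a non-normative sketch: conn bundles the connection state, and unsent_segment(), rwnd_allows(), highest_sent_segment(), transmit(), and arm_rto_timer() are hypothetical helpers, not a real TCP stack API:

    def tlp_send_probe(conn, now):
        seg = conn.unsent_segment()      # lowest-sequence unsent segment
        if seg is not None and conn.rwnd_allows(seg):
            conn.transmit(seg, now)      # new data; normal accounting
            conn.flight_size += seg.length
        elif not conn.tlp_rxt_out:
            seg = conn.highest_sent_segment()
            conn.transmit(seg, now)      # retransmission; cwnd unchanged
            conn.tlp_rxt_out = True      # at most one TLP retransmission
            conn.tlp_high_rxt = conn.snd_nxt
        # Whether or not a probe went out, fall back to the RTO timer so
        # that back-to-back TLP probes are never sent.
        if conn.flight_size > 0:
            conn.arm_rto_timer(now)
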
7.6. TLP recovery detection

If the only loss in an outstanding window of data was the last segment, then a TLP loss probe retransmission of that data segment might repair the loss. TLP recovery detection examines ACKs to detect when the probe might have repaired a loss, and thus allows congestion control to properly reduce the congestion window (cwnd) [RFC5681].

Consider a TLP retransmission episode where a sender retransmits a tail packet in a flight. The TLP retransmission episode ends when the sender receives an ACK with a SEG.ACK above the SND.NXT at the time the episode started. During the TLP retransmission episode the sender checks for a duplicate ACK or D-SACK indicating that both the original segment and the TLP retransmission arrived at the receiver, meaning there was no loss that needed repairing. If the TLP sender does not receive such an indication before the end of the TLP retransmission episode, then it MUST estimate that either the original data segment or the TLP retransmission were lost, and congestion control MUST react appropriately to that loss as it would to any other loss.

Since a significant fraction of the hosts that support SACK do not support duplicate selective acknowledgments (D-SACKs) [RFC2883], the TLP algorithm for detecting such lost segments relies only on basic SACK support [RFC2018].

7.6.1. Initializing and resetting state

When a connection is created, or suffers a retransmission timeout, or enters fast recovery, it executes the following:

    TLPRxtOut = false

7.6.2. Recording loss probe states

Senders MUST only send a TLP loss probe retransmission if TLPRxtOut is false. This ensures that at any given time a connection has at most one outstanding TLP retransmission. This allows the sender to use the algorithm described in this section to estimate whether any data segments were lost.

Note that this condition only restricts TLP loss probes that are retransmissions. There may be an arbitrary number of outstanding unacknowledged TLP loss probes that consist of new, previously-unsent data, since the retransmission timeout and fast recovery algorithms are sufficient to detect losses of such probe segments.

Upon sending a TLP probe that is a retransmission, the sender sets TLPRxtOut to true and TLPHighRxt to SND.NXT.

7.6.3. Detecting recoveries accomplished by loss probes

Step 1: Track ACKs indicating receipt of original and retransmitted segments.

A sender considers both the original segment and the TLP probe retransmission segment as acknowledged if either condition 1 or condition 2 is true:

1. This is a duplicate acknowledgment (as defined in [RFC5681], section 2), and all of the following conditions are met:

   1. TLPRxtOut is true

   2. SEG.ACK == TLPHighRxt

   3. SEG.ACK == SND.UNA

   4. the segment contains no SACK blocks for sequence ranges above TLPHighRxt

   5. the segment contains no data

   6. the segment is not a window update

2. This is an ACK acknowledging a sequence number at or above TLPHighRxt and it contains a D-SACK; i.e., all of the following conditions are met:

   1. TLPRxtOut is true

   2. SEG.ACK >= TLPHighRxt

   3. the ACK contains a D-SACK block

If either of the conditions is met, then the sender estimates that the receiver received both the original data segment and the TLP probe retransmission, and so the sender considers the TLP episode to be done, and records that fact by setting TLPRxtOut to false.

Step 2: Mark the end of a TLP retransmission episode and detect losses.

If the sender receives a cumulative ACK for data beyond the TLP loss probe retransmission then, in the absence of reordering on the return path of ACKs, it should have received any ACKs for the original segment and the TLP probe retransmission segment. At that time, if the TLPRxtOut flag is still true and thus indicates that the TLP probe retransmission remains unacknowledged, then the sender should presume that at least one of its data segments was lost, so it SHOULD invoke a congestion control response equivalent to fast recovery.

More precisely, on each ACK the sender executes the following:

    if (TLPRxtOut and SEG.ACK >= TLPHighRxt) {
        TLPRxtOut = false
        EnterRecovery()
        ExitRecovery()
    }

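Putting the two steps together, TLP recovery detection on each incoming ACK can be sketched in Python as follows (non-normative; ack is a hypothetical parsed ACK with fields seg_ack, has_dsack, sack_blocks, data_len, and is_window_update, and conn is as in the previous sketch):

    def tlp_process_ack(conn, ack):
        if not conn.tlp_rxt_out:
            return
        # Step 1, condition 1: a pure duplicate ACK at TLPHighRxt with
        # no SACK blocks above it...
        is_dupack = (ack.seg_ack == conn.snd_una and
                     ack.data_len == 0 and not ack.is_window_update and
                     not any(end > conn.tlp_high_rxt
                             for _, end in ack.sack_blocks))
        if is_dupack and ack.seg_ack == conn.tlp_high_rxt:
            conn.tlp_rxt_out = False   # both copies arrived; episode done
            return
        # ...or condition 2: a D-SACK with SEG.ACK at or above TLPHighRxt.
        if ack.has_dsack and ack.seg_ack >= conn.tlp_high_rxt:
            conn.tlp_rxt_out = False
            return
        # Step 2: a cumulative ACK beyond the probe while TLPRxtOut is
        # still set means a segment was lost; react like fast recovery.
        if ack.seg_ack >= conn.tlp_high_rxt:
            conn.tlp_rxt_out = False
            conn.enter_recovery()
            conn.exit_recovery()
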
8. RACK and TLP discussions

8.1. Advantages

The biggest advantage of RACK is that every data packet, whether it is an original data transmission or a retransmission, can be used to detect losses of the packets sent chronologically prior to it.

Example: TAIL DROP. Consider a sender that transmits a window of three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the transmission of each packet is at least RACK.reo_wnd after the transmission of the previous packet. RACK will mark P1 as lost when the SACK of P2 is received, and this will trigger the retransmission of P1 as R1. When R1 is cumulatively acknowledged, RACK will mark P3 as lost and the sender will retransmit P3 as R3. This example illustrates how RACK is able to repair certain drops at the tail of a transaction without any timer. Notice that neither the conventional duplicate ACK threshold [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect such losses, because of the required packet or sequence count.

Example: LOST RETRANSMIT. Consider a window of three data packets (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the transmission of each packet is at least RACK.reo_wnd after the transmission of the previous packet. When P3 is SACKed, RACK will mark P1 and P2 lost and they will be retransmitted as R1 and R2. Suppose R1 is lost again but R2 is SACKed; RACK will mark R1 lost for retransmission again. Again, neither the conventional three duplicate ACK threshold approach, nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect such losses. And such a lost retransmission is very common when TCP is being rate-limited, particularly by token bucket policers with large bucket depth and low rate limit. Retransmissions are often lost repeatedly because standard congestion control requires multiple round trips to reduce the rate below the policed rate.

Example: SMALL DEGREE OF REORDERING. Consider a common reordering event: a window of packets are sent as (P1, P2, P3). P1 and P2 carry a full payload of MSS octets, but P3 has only a 1-octet payload. Suppose the sender has detected reordering previously and thus RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, before P1 and P2. As long as P1 and P2 are delivered within min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 are delivered outside the reordering window, then RACK will still falsely mark P1 and P2 lost. We discuss how to reduce such false positives at the end of this section.

The examples above show that RACK is particularly useful when the sender is limited by the application, which is common for interactive, request/response traffic. Similarly, RACK still works when the sender is limited by the receive window, which is common for applications that use the receive window to throttle the sender.

For some implementations (e.g., Linux), RACK works quite efficiently with TCP Segmentation Offload (TSO). RACK always marks the entire TSO blob lost because the packets in the same TSO blob have the same transmission timestamp. By contrast, the algorithms based on sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset of packets in the TSO blob lost, forcing the stack to perform expensive fragmentation of the TSO blob, or to selectively tag individual packets lost in the scoreboard.

8.2. Disadvantages

RACK requires the sender to record the transmission time of each packet sent at a clock granularity of one millisecond or finer. TCP implementations that record this already for RTT estimation do not require any new per-packet state. But implementations that are not yet recording packet transmission times will need to add per-packet internal state (commonly either 4 or 8 octets per packet or TSO blob) to track transmission times. In contrast, the conventional [RFC6675] loss detection approach does not require any per-packet state beyond the SACK scoreboard; this is particularly useful on ultra-low-RTT networks where the RTT is far less than the sender TCP clock granularity (e.g., inside data centers).

RACK can easily and optionally support the conventional approach in [RFC6675][RFC5681] by resetting the reordering window to zero when the threshold is met. Note that this approach differs slightly from [RFC6675], which considers a packet lost when at least DupThresh higher-sequence packets are SACKed. RACK's approach considers a packet lost when at least one higher-sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection sends 10 packets, and packets 3, 5, 7 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost.

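The difference is small enough to check with a toy script. The following non-normative Python snippet reproduces the example above (10 packets sent, packets 3, 5 and 7 SACKed), comparing the [RFC6675] rule with RACK's DupThresh-compatibility rule:

    def rfc6675_lost(sent, sacked, dupthresh=3):
        # Lost if at least DupThresh higher-sequence packets are SACKed.
        return [p for p in sent if p not in sacked and
                sum(1 for s in sacked if s > p) >= dupthresh]

    def rack_dupthresh_lost(sent, sacked, dupthresh=3):
        # Lost if one higher-sequence packet is SACKed and the total
        # number of SACKed packets is at least DupThresh.
        if len(sacked) < dupthresh:
            return []
        return [p for p in sent if p not in sacked and
                any(s > p for s in sacked)]

    sent, sacked = list(range(1, 11)), {3, 5, 7}
    print(rfc6675_lost(sent, sacked))         # [1, 2]
    print(rack_dupthresh_lost(sent, sacked))  # [1, 2, 4, 6]
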
8.3. Adjusting the reordering window

When the sender detects packet reordering, RACK uses a reordering window of min_rtt / 4. It uses the minimum RTT to accommodate reordering introduced by packets traversing slightly different paths (e.g., router-based parallelism schemes) or out-of-order deliveries in the lower link layer (e.g., wireless links using link-layer retransmission). RACK uses a quarter of the minimum RTT because Linux TCP used the same factor in its implementation to delay Early Retransmit [RFC5827] to reduce spurious loss detections in the presence of reordering, and experience shows that this seems to work reasonably well. We have evaluated using the smoothed RTT (SRTT from [RFC6298] RTT estimation) or the most recently measured RTT (RACK.rtt) using an experiment similar to that in the Performance Evaluation section. They do not make any significant difference in terms of total recovery latency.

8.4. Relationships with other loss recovery algorithms

The primary motivation of RACK is to ultimately provide a simple and general replacement for some of the standard loss recovery algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss detection mechanism on top of these algorithms, this is not necessary, because RACK implicitly subsumes most of them.

[RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK threshold based on the current or previous flight sizes. RACK takes a different approach, by using only one ACK event and a reordering window. RACK can be seen as an extended Early Retransmit [RFC5827] without a FlightSize limit but with an additional reordering window. [FACK] considers an original packet to be lost when its sequence range is sufficiently far below the highest SACKed sequence. In some sense RACK can be seen as a generalized form of FACK that operates in time space instead of sequence space, enabling it to better handle reordering, application-limited traffic, and lost retransmissions.

Nevertheless, RACK is still an experimental algorithm, since the oldest loss detection algorithm, the 3 duplicate ACK threshold [RFC5681], has been standardized and widely deployed. RACK can easily and optionally support the conventional approach for compatibility.

RACK is compatible with and does not interfere with the standard RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel algorithms [RFC3522]. This is because RACK only detects loss by using ACK events. It neither changes the RTO timer calculation nor detects spurious timeouts.

Furthermore, RACK naturally works well with Tail Loss Probe [TLP], because a tail loss probe solicits either an ACK or SACK, which can be used by RACK to detect more losses. RACK can be used to relax TLP's requirement for using FACK and retransmitting the highest-sequenced packet, because RACK is agnostic to packet sequence numbers and uses transmission time instead. Thus TLP could be modified to retransmit the first unacknowledged packet, which could improve application latency.

8.5. Interaction with congestion control

RACK intentionally decouples loss detection from congestion control. RACK only detects losses; it does not modify the congestion control algorithm [RFC5681][RFC6937]. However, RACK may detect losses earlier or later than the conventional duplicate ACK threshold approach does. A packet marked lost by RACK SHOULD NOT be retransmitted until congestion control deems this appropriate. Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used when using RACK.

RACK is applicable for both fast recovery and recovery after a retransmission timeout (RTO) in [RFC5681]. RACK applies equally to fast recovery and RTO recovery, because RACK is purely based on the transmission time order of packets. When a packet retransmitted by RTO is acknowledged, RACK will mark any unacked packet sent sufficiently prior to the RTO as lost, because at least one RTT has elapsed since these packets were sent.

The following simple example compares how RACK and non-RACK loss detection interact with congestion control: suppose a TCP sender has a congestion window (cwnd) of 20 packets on a SACK-enabled connection. It sends 10 data packets and all of them are lost.

Without RACK, the sender would time out, reset cwnd to 1, and retransmit the first packet. It would take four round trips (1 + 2 + 4 + 3 = 10) to retransmit all 10 lost packets using slow start. The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 packets due to congestion window validation.

With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. If the sender implements Proportional Rate Reduction [RFC6937], it would slow start to retransmit the remaining 9 lost packets, since the number of packets in flight (0) is lower than the slow start threshold (10). The slow start would again take four round trips (1 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with an ending cwnd set to the slow start threshold of 10 packets.

In both cases, the sender after the recovery would be in congestion avoidance. The difference in recovery latency (RTO + 4*RTT vs 6*RTT) can be significant if the RTT is much smaller than the minimum RTO (1 second in RFC6298) or if the RTT is large. The former case is common in local area networks, data-center networks, or content distribution networks with deep deployments. The latter case is more common in developing regions with highly congested and/or high-latency networks. The ending congestion window after recovery also impacts subsequent data transfer.

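The arithmetic in the comparison above can be made concrete with a short non-normative calculation (the 50ms RTT below is an arbitrary example value):

    def slow_start_rounds(losses):
        # Rounds to retransmit `losses` packets in slow start:
        # 1 + 2 + 4 + 3 = 10 packets takes four round trips.
        rounds, cwnd, sent = 0, 1, 0
        while sent < losses:
            sent += min(cwnd, losses - sent)
            cwnd *= 2
            rounds += 1
        return rounds

    rtt, min_rto = 0.05, 1.0    # 50ms RTT; RFC 6298 minimum RTO of 1s
    rounds = slow_start_rounds(10)                 # -> 4
    print(round(min_rto + rounds * rtt, 3))        # without RACK/TLP: 1.2s
    print(round(2 * rtt + rounds * rtt, 3))        # with RACK + TLP:  0.3s
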
The former case is common 1136 in local area networks, data-center networks, or content distribution 1137 networks with deep deployments. The latter case is more common in 1138 developing regions with highly congested and/or high-latency 1139 networks. The ending congestion window after recovery also impacts 1140 subsequent data transfer. 1142 8.6. TLP recovery detection with delayed ACKs 1144 Delayed ACKs complicate the detection of repairs done by TLP, since 1145 with a delayed ACK the sender receives one fewer ACK than would 1146 normally be expected. To mitigate this complication, before sending 1147 a TLP loss probe retransmission, the sender should attempt to wait 1148 long enough that the receiver has sent any delayed ACKs that it is 1149 withholding. The sender algorithm described above features such a 1150 delay, in the form of WCDelAckT. Furthermore, if the receiver 1151 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1152 in the case of a delayed ACK the sender's TLP recovery detection 1153 algorithm (see above) can use the D-SACK information to infer that 1154 the original and TLP retransmission both arrived at the receiver. 1156 If there is ACK loss or a delayed ACK without a D-SACK, then this 1157 algorithm is conservative, because the sender will reduce cwnd when 1158 in fact there was no packet loss. In practice this is acceptable, 1159 and potentially even desirable: if there is reverse path congestion 1160 then reducing cwnd can be prudent. 1162 8.7. RACK for other transport protocols 1164 RACK can be implemented in other transport protocols. The algorithm 1165 can be simplified by skipping step 3 if the protocol can support a 1166 unique transmission or packet identifier (e.g. TCP timestamp options 1167 [RFC7323]). For example, the QUIC protocol implements RACK [QUIC- 1168 LR]. The [Sprout] congestion control also independently designed to 1169 use a 10ms reordering window to improve its loss detection. 1171 9. Experiments and Performance Evaluations 1173 RACK and TLP have been deployed at Google, for both connections to 1174 users in the Internet and internally. We conducted a performance 1175 evaluation experiment for RACK and TLP on a small set of Google Web 1176 servers in Western Europe that serve mostly European and some African 1177 countries. The experiment lasted three days in March 2017. The 1178 servers were divided evenly into four groups of roughly 5.3 million 1179 flows each: 1181 Group 1 (control): RACK off, TLP off, RFC 6675 on 1183 Group 2: RACK on, TLP off, RFC 6675 on 1185 Group 3: RACK on, TLP on, RFC 6675 on 1187 Group 4: RACK on, TLP on, RFC 6675 off 1189 All groups used Linux with CUBIC congestion control, an initial 1190 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1191 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1192 disabled FACK because it is not an IETF RFC. FACK was excluded 1193 because the goal of this setup is to compare RACK and TLP to RFC- 1194 based loss recoveries. Since TLP depends on either FACK or RACK, we 1195 could not run another group that enables TLP only (with both RACK and 1196 FACK disabled). Group 4 is to test whether RACK plus TLP can 1197 completely replace the DupThresh-based [RFC6675]. 1199 The servers sit behind a load balancer that distributes the 1200 connections evenly across the four groups. 1202 Each group handles a similar number of connections and sends and 1203 receives similar amounts of data. 
9.  Experiments and Performance Evaluations

RACK and TLP have been deployed at Google, both for connections to
users on the Internet and for internal traffic.  We conducted a
performance evaluation experiment for RACK and TLP on a small set
of Google Web servers in Western Europe that serve mostly European
and some African countries.  The experiment lasted three days in
March 2017.  The servers were divided evenly into four groups of
roughly 5.3 million flows each:

Group 1 (control): RACK off, TLP off, RFC 6675 on

Group 2: RACK on, TLP off, RFC 6675 on

Group 3: RACK on, TLP on, RFC 6675 on

Group 4: RACK on, TLP on, RFC 6675 off

All groups used Linux with CUBIC congestion control, an initial
congestion window of 10 packets, and the fq/pacing qdisc.  In terms
of specific recovery features, all groups enabled RFC5682 (F-RTO)
but disabled FACK, because FACK is not an IETF RFC and the goal of
this setup is to compare RACK and TLP against RFC-based loss
recovery.  Since TLP depends on either FACK or RACK, we could not
run another group that enables TLP only (with both RACK and FACK
disabled).  Group 4 tests whether RACK plus TLP can completely
replace the DupThresh-based loss detection of [RFC6675].

The servers sit behind a load balancer that distributes the
connections evenly across the four groups.  Each group handles a
similar number of connections and sends and receives similar
amounts of data.

We compare the total time spent in loss recovery across groups.
The recovery time is measured from when recovery and retransmission
start until the remote host has acknowledged the highest sequence
(SND.NXT) outstanding when the recovery started.  Therefore the
measurement includes both fast recoveries and timeout recoveries.

Our data shows that Group 2's recovery latency is only 0.3% lower
than Group 1's, but Group 3's recovery latency is 25% lower than
Group 1's, due to a 40% reduction in RTO-triggered recoveries.
Therefore it is important to implement both TLP and RACK for
performance.  Group 4's total recovery latency is 0.02% lower than
Group 3's, indicating that RACK plus TLP can successfully replace
RFC6675 as a standalone recovery mechanism.

We want to emphasize that the current experiment is limited in
terms of network coverage.  The connectivity in Western Europe is
fairly good; therefore, loss recovery is not a major performance
bottleneck there.  We plan to expand our experiments to regions
with worse connectivity, in particular to networks with strong
traffic policing.

10.  Security Considerations

RACK does not change the risk profile for TCP.

An interesting scenario is ACK-splitting attacks [SCWA99]: for each
MSS-sized packet sent, the receiver or the attacker might send MSS
separate ACKs, each SACKing or acknowledging one additional byte.
This would not fool RACK.  RACK.xmit_ts would not advance, because
all the bytes of the packet were transmitted at the same time (they
carry the same transmission timestamp).  In other words, SACKing
only one byte of a packet or SACKing the packet in its entirety has
the same effect on RACK.
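A minimal sketch of this property (illustrative names, not the
draft's pseudocode): because the sender keeps one transmission
timestamp per packet, the RACK state update is independent of how
many bytes of the packet an ACK covers:

   #include <assert.h>
   #include <stdint.h>

   struct tx_record {
       uint32_t start_seq, end_seq;  /* byte range of the packet */
       uint64_t xmit_ts;             /* one timestamp, all bytes */
   };

   static uint64_t rack_xmit_ts;     /* latest delivered send time */

   /* Called for each newly (S)ACKed byte range of a packet. */
   static void rack_on_sacked_range(const struct tx_record *pkt,
                                    uint32_t sacked_start,
                                    uint32_t sacked_end)
   {
       (void)sacked_start;
       (void)sacked_end;
       /* The update depends only on the packet's send time, never
        * on how many of its bytes this ACK covers. */
       if (pkt->xmit_ts > rack_xmit_ts)
           rack_xmit_ts = pkt->xmit_ts;
   }

   int main(void)
   {
       struct tx_record pkt = { 1000, 2460, 42 };  /* 1460B, ts 42 */

       /* MSS split ACKs, one byte each, leave the same state ... */
       for (uint32_t b = pkt.start_seq; b < pkt.end_seq; b++)
           rack_on_sacked_range(&pkt, b, b + 1);
       uint64_t split = rack_xmit_ts;

       /* ... as a single ACK covering the entire packet. */
       rack_on_sacked_range(&pkt, pkt.start_seq, pkt.end_seq);
       assert(split == rack_xmit_ts);
       return 0;
   }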
11.  IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as
an RFC.

12.  Acknowledgments

The authors thank Matt Mathis for his insights in FACK and Michael
Welzl for his per-packet timer idea that inspired this work.  Eric
Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, and Martin
Duke contributed to the draft or the implementations in Linux,
FreeBSD, Windows, and QUIC.

13.  References

13.1.  Normative References

[RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
           Selective Acknowledgment Options", RFC 2018, October
           1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", RFC 2119, March 1997.

[RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
           Extension to the Selective Acknowledgement (SACK) Option
           for TCP", RFC 2883, July 2000.

[RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
           Control", RFC 5681, September 2009.

[RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
           "Forward RTO-Recovery (F-RTO): An Algorithm for
           Detecting Spurious Retransmission Timeouts with TCP",
           RFC 5682, September 2009.

[RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
           Hurtig, "Early Retransmit for TCP and Stream Control
           Transmission Protocol (SCTP)", RFC 5827, April 2010.

[RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
           "Computing TCP's Retransmission Timer", RFC 6298, June
           2011.

[RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
           M., and Y. Nishida, "A Conservative Loss Recovery
           Algorithm Based on Selective Acknowledgment (SACK) for
           TCP", RFC 6675, August 2012.

[RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
           Rate Reduction for TCP", RFC 6937, May 2013.

[RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
           Scheffenegger, "TCP Extensions for High Performance",
           RFC 7323, September 2014.

[RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
           September 1981.

13.2.  Informative References

[FACK]     Mathis, M. and J. Mahdavi, "Forward Acknowledgement:
           Refining TCP Congestion Control", ACM SIGCOMM Computer
           Communication Review, Volume 26, Issue 4, October 1996.

[POLICER16]
           Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
           Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
           "An Analysis of Traffic Policing in the Web", ACM
           SIGCOMM, 2016.

[QUIC-LR]  Iyengar, J. and I. Swett, "QUIC Loss Recovery And
           Congestion Control", draft-tsvwg-quic-loss-recovery-01
           (work in progress), June 2016.

[RFC3522]  Ludwig, R. and M. Meggers, "The Eifel Detection
           Algorithm for TCP", RFC 3522, April 2003.

[RFC7765]  Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl,
           "TCP and SCTP RTO Restart", RFC 7765, February 2016.

[SCWA99]   Savage, S., Cardwell, N., Wetherall, D., and T.
           Anderson, "TCP Congestion Control With a Misbehaving
           Receiver", ACM Computer Communication Review, 29(5),
           1999.

[Sprout]   Winstein, K., Sivaraman, A., and H. Balakrishnan,
           "Stochastic Forecasts Achieve High Throughput and Low
           Delay over Cellular Networks", USENIX Symposium on
           Networked Systems Design and Implementation (NSDI),
           2013.

[THIN-STREAM]
           Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
           "TCP enhancements for interactive thin-stream
           applications", NOSSDAV, 2008.

[TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
           "Tail Loss Probe (TLP): An Algorithm for Fast Recovery
           of Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01
           (work in progress), August 2013.

Authors' Addresses

Yuchung Cheng
Google, Inc

Email: ycheng@google.com

Neal Cardwell
Google, Inc

Email: ncardwell@google.com

Nandita Dukkipati
Google, Inc

Email: nanditad@google.com

Priyaranjan Jha
Google, Inc

Email: priyarjha@google.com