2 TCP Maintenance Working Group                                  Y. Cheng
3 Internet-Draft                                               N. Cardwell
4 Intended status: Standards Track                            N. Dukkipati
5 Expires: September 10, 2020                                       P. Jha
6                                                               Google, Inc
7                                                             March 9, 2020

9       RACK: a time-based fast loss detection algorithm for TCP
10                        draft-ietf-tcpm-rack-08

12 Abstract

14 This document presents a new TCP loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time, instead of packet or sequence counts, to detect losses, for modern TCP implementations that can support per-packet timestamps and the selective acknowledgment (SACK) option. It is intended to be an alternative to the DUPACK threshold approach of RFC 6675, as well as other nonstandard approaches such as Forward Acknowledgment (FACK).
22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at https://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on September 10, 2020. 39 Copyright Notice 41 Copyright (c) 2020 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (https://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 1. Terminology 56 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 57 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 58 "OPTIONAL" in this document are to be interpreted as described in BCP 59 14 [RFC2119] [RFC8174] when, and only when, they appear in all 60 capitals, as shown here. In this document, these words will appear 61 with that interpretation only when in UPPER CASE. Lower case uses of 62 these words are not to be interpreted as carrying [RFC2119] 63 significance. 65 2. 
Introduction

67 This document presents a new loss detection algorithm called RACK ("Recent ACKnowledgment"). RACK uses the notion of time instead of the conventional packet or sequence counting approaches for detecting losses. RACK deems a packet lost if it has not been delivered and some packet sent sufficiently later has been delivered. It does this by recording packet transmission times and inferring losses using cumulative acknowledgments or selective acknowledgment (SACK) TCP options.

76 In recent years we have observed several increasingly common loss and reordering patterns in the Internet:

79 1. Slow recovery due to lost retransmissions. Traffic policers [POLICER16] and burst losses often cause retransmissions to be lost again. This severely increases latency because the lost retransmissions can only be recovered by retransmission timeouts (RTOs).

85 2. Tail drops. Structured request-response traffic turns more losses into tail drops. In such cases, TCP is application-limited, so it cannot send new data to probe losses and has to rely on retransmission timeouts (RTOs).

90 3. Reordering. Link-layer protocols (e.g., 802.11 block ACK), link bonding, or routers' internal load-balancing can deliver TCP packets out of order. The degree of such reordering is usually within the order of the path round trip time.

95 Although TCP stacks (e.g., Linux) implement many of the standard and proposed loss detection algorithms [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM], we have found that together they do not perform well. The main reason is that many of them are based on the classic rule of counting duplicate acknowledgments [RFC5681]. They can detect loss either quickly or accurately, but not both, especially when the sender is application-limited or when reordering is unpredictable.
And under these conditions none of them can detect lost retransmissions well.

107 Also, these algorithms, including the RFCs, rarely address interactions with other algorithms. For example, FACK may consider a packet lost while RFC6675 may not. Implementing N algorithms while dealing with N^2 interactions is a daunting and error-prone task.

113 The goal of RACK is to solve these problems by replacing many of the loss detection algorithms above with a single, more effective algorithm that handles both loss and reordering.

117 3. Overview

119 The main idea behind RACK is that if a packet has been delivered out of order, then the packets sent chronologically before it were either lost or reordered. This concept is not fundamentally different from [RFC5681][RFC6675][FACK]. But the key innovation in RACK is to use a per-packet transmission timestamp and the widely deployed SACK options to conduct time-based inferences, instead of inferring losses with packet or sequence counting approaches.

127 Using a threshold for counting duplicate acknowledgments (i.e., DupThresh) alone is no longer reliable because of today's prevalent reordering patterns. A common type of reordering is that the last "runt" packet of a window's worth of packet bursts gets delivered first, then the rest arrive shortly after in order. To handle this effectively, a sender would need to constantly adjust the DupThresh to the burst size; but this would risk increasing the frequency of RTOs on real losses.

136 Today's prevalent lost retransmissions also cause problems with packet-counting approaches [RFC5681][RFC6675][FACK], since those approaches depend on reasoning in sequence number space. Retransmissions break the direct correspondence between ordering in sequence space and ordering in time.
So when retransmissions are lost, sequence-based approaches are often unable to infer and quickly repair losses that can be deduced with time-based approaches.

144 Instead of counting packets, RACK uses the most recently delivered packet's transmission time to judge if some packets sent prior to that time have "expired" by passing a certain reordering settling window. On each ACK, RACK marks any already-expired packets lost, and for any packets that have not yet expired it waits until the reordering window passes and then marks those lost as well. In either case, RACK can repair the loss without waiting for a (long) RTO. RACK can be applied to both fast recovery and timeout recovery, and can detect losses on both originally transmitted and retransmitted packets, making it a great all-weather loss detection mechanism.

156 4. Design Rationale for Reordering Tolerance

158 The reordering behavior of networks can evolve (over years) in response to the behavior of transport protocols and applications, as well as the needs of network designers and operators. From a network or link designer's viewpoint, parallelization (e.g., link bonding) is the easiest way to get a network to go faster. Therefore the designer's main constraint on speed is reordering, and there is pressure to relax that constraint. If RACK becomes widely deployed, the underlying networks may introduce more reordering for higher throughput. But this may result in excessive reordering that hurts end-to-end performance:

169 1. End host packet processing: extreme reordering on high-speed networks would incur high CPU cost by greatly reducing the effectiveness of aggregation mechanisms, such as large receive offload (LRO) and generic receive offload (GRO), and significantly increasing the number of ACKs.

175 2.
Congestion control: TCP congestion control implicitly assumes that the feedback from ACKs is from the same bottleneck. Therefore it cannot handle scenarios well in which packets traverse largely disjoint paths.

180 3. Loss recovery: Having an excessively large reordering window to accommodate widely different latencies from different paths would increase the latency of loss recovery.

184 An end-to-end transport protocol cannot tell immediately whether a hole is reordering or loss. It can only distinguish between the two in hindsight, if the hole in the sequence space gets filled later without a retransmission. How long the sender waits for such potential reordering events to settle is determined by the current reordering window.

191 Given these considerations, a core design philosophy of RACK is to adapt to the measured duration of reordering events, within reasonable and specific bounds. To accomplish this, RACK places the following mandates on the reordering window:

196 1. The initial RACK reordering window SHOULD be set to a small fraction of the round-trip time.

199 2. If no reordering has been observed, then RACK SHOULD honor the classic 3-DUPACK rule for initiating fast recovery. One simple way to implement this is to temporarily override the reorder window to 0.

204 3. The RACK reordering window SHOULD leverage Duplicate Selective Acknowledgement (DSACK) information [RFC3708] to adaptively estimate the duration of reordering events.

208 4. The RACK reordering window MUST be bounded, and this bound SHOULD be one round trip.

211 As a flow starts, either condition 1 or condition 2 or both would trigger RACK to start the recovery process quickly. The low initial reordering window and use of the 3-DUPACK rule are key to achieving low-latency loss recovery for short flows by risking spurious retransmissions to recover losses quickly.
The rationale is that spurious retransmissions for short flows are not expected to produce excessive network traffic.

219 For long flows the design tolerates reordering within a round trip. This handles reordering caused by path divergence in small time scales (reordering within the round-trip time of the shortest path), which should tolerate much of the reordering from link bonding, multipath routing, or link-layer out-of-order delivery. It also relaxes ordering constraints to allow sending flights of TCP packets on different paths dynamically for better load-balancing (e.g. flowlets).

228 However, the fact that the initial RACK reordering window is low, and the RACK reordering window's adaptive growth is bounded, means that there will continue to be a cost to reordering and a limit to RACK's adaptation to reordering. This maintains a disincentive for network designers and operators to introduce needless or excessive reordering, particularly because they have to allow for low round-trip-time paths. This means RACK will not encourage networks to perform inconsiderate fine-grained packet-spraying over highly disjoint paths with very different characteristics. There are good alternative solutions, such as MPTCP, for such networks.

239 To conclude, the RACK algorithm aims to adapt to small degrees of reordering, quickly recover most losses within one to two round trips, and avoid costly retransmission timeouts (RTOs). In the presence of reordering, the adaptation algorithm can impose sometimes-needless delays when it waits to disambiguate loss from reordering, but the penalty for waiting is bounded to one round trip, and such delays are confined to longer-running flows.

247 5. Requirements

249 The reader is expected to be familiar with the definitions given in the TCP congestion control [RFC5681] and selective acknowledgment [RFC2018] RFCs.
Familiarity with the conservative SACK-based recovery for TCP [RFC6675] is not required but helpful.

254 RACK has three requirements:

256 1. The connection MUST use selective acknowledgment (SACK) options [RFC2018].

259 2. For each packet sent, the sender MUST store its most recent transmission time with (at least) millisecond granularity. For round-trip times lower than a millisecond (e.g., intra-datacenter communications), microsecond granularity would significantly reduce the detection latency, but it is not required.

265 3. For each packet sent, the sender MUST remember whether the packet has been retransmitted or not.

268 We assume that requirement 1 implies the sender keeps a SACK scoreboard, which is a data structure to store selective acknowledgment information on a per-connection basis ([RFC6675] section 3). For ease of explaining the algorithm, we use a pseudo-scoreboard that manages the data in sequence number ranges. But the specifics of the data structure are left to the implementor.

275 RACK does not require any changes at the receiver.

277 6. Definitions

279 The reader is expected to be familiar with the definitions given in [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

282 6.1. Definitions of variables

284 A sender implementing RACK needs to store these new RACK variables:

286 "Packet.xmit_ts" is the time of the last transmission of a data packet, including retransmissions, if any. The sender needs to record the transmission time for each packet sent and not yet acknowledged. The time MUST be stored at millisecond granularity or finer.

292 "RACK.packet". Among all the packets that have been either selectively or cumulatively acknowledged, RACK.packet is the one that was sent most recently, including retransmissions.

296 "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

298 "RACK.end_seq" is the ending TCP sequence number of RACK.packet.
300 "RACK.rtt" is the RTT of the most recently delivered packet on the connection (either cumulatively acknowledged or selectively acknowledged) that was not marked invalid as a possible spurious retransmission.

305 "RACK.rtt_seq" is the value of SND.NXT when RACK.rtt was last updated.

307 "RACK.reo_wnd" is a reordering window computed in the unit of time used for recording packet transmission times. It is used to defer the moment at which RACK marks a packet lost.

311 "RACK.dupthresh" is a constant specifying the number of duplicate acknowledgments, or selectively acknowledged segments, that can (under certain conditions) trigger fast recovery, similar to [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined to be 3.

317 "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the connection.

320 "RACK.ack_ts" is the time when all the sequences in RACK.packet were selectively or cumulatively acknowledged.

323 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd.

325 "RACK.reo_wnd_persist" is the number of loss recoveries before resetting RACK.reo_wnd.

328 "RACK.dsack" indicates whether a DSACK option has been received since the last RACK.reo_wnd change.

331 "RACK.pkts_sacked" is the total number of packets selectively acknowledged in the SACK scoreboard.

334 "RACK.reord" indicates whether the connection has detected a packet reordering event.

336 "RACK.fack" is the highest selectively or cumulatively acknowledged sequence.

339 Note that the Packet.xmit_ts variable is per packet in flight. The RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT variables are kept in the per-connection TCP control block. RACK.packet and RACK.ack_ts are used as local variables in the algorithm.

345 7. Algorithm Details

347 7.1. Transmitting a data packet

349 Upon transmitting a new packet or retransmitting an old packet, record the time in Packet.xmit_ts.
RACK does not care if the 351 retransmission is triggered by an ACK, new application data, an RTO, 352 or any other means. 354 7.2. Upon receiving an ACK 356 Step 1: Update RACK.min_RTT. 358 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 359 update the estimated minimum RTT in RACK.min_RTT. The sender can 360 track a simple global minimum of all RTT measurements from the 361 connection, or a windowed min-filtered value of recent RTT 362 measurements. This document does not specify an exact approach. 364 Step 2: Update RACK stats 366 Given the information provided in an ACK, each packet cumulatively 367 ACKed or SACKed is marked as delivered in the scoreboard. Among all 368 the packets newly ACKed or SACKed in the connection, record the most 369 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 370 Sometimes the timestamps of RACK.Packet and Packet could carry the 371 same transmit timestamps due to clock granularity or segmentation 372 offloading (i.e. the two packets were handed to the NIC as a single 373 unit). In that case the sequence numbers of RACK.end_seq and 374 Packet.end_seq are compared to break the tie. 376 Since an ACK can also acknowledge retransmitted data packets, and 377 retransmissions can be spurious, the sender must take care to avoid 378 spurious inferences. For example, if the sender were to use timing 379 information from a spurious retransmission, the RACK.rtt could be 380 vastly underestimated. 382 To avoid spurious inferences, ignore a packet as invalid if any of 383 its TCP sequences have been retransmitted before and either of two 384 conditions is true: 386 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 387 option [RFC7323], if available, indicates the ACK was not 388 acknowledging the last retransmission of the packet. 390 2. The packet was last retransmitted less than RACK.min_rtt ago. 
392 If the ACK is not ignored as invalid, update the RACK.rtt to be the 393 RTT sample calculated using this ACK, and continue. If this ACK or 394 SACK was for the most recently sent packet, then record the 395 RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this ACK. 396 Otherwise exit here and omit the following steps. 398 Notice that the second condition above is a heuristic. This 399 heuristic would fail to update RACK stats if a data packet is 400 spuriously retransmitted because of a recent minimum RTT decrease 401 (e.g. path change). For example, in cases with a TCP connection 402 without TCP timestamps, and where the first M packets in a flight of 403 data packets travel an old (longer) original path, and the remaining 404 N packets in that flight travel a new (shorter) path and arrive out 405 of order and elicit SACKs, then those SACKs for the N packets can 406 initiate a spurious retransmission of the first M packets. In such 407 scenarios, the sender would not be able to update its RACK.min_rtt 408 using the (ambiguous) RTT samples from retransmissions, so during 409 recovery all RTT samples may be less than RACK.min_rtt, and thus meet 410 the second condition. In such cases RACK may not detect losses from 411 ACK events and the recovery would then resort to the (slower) TLP or 412 RTO timer-based recovery. However, such events should be rare and 413 the connection would pick up the new minimum RTT when the recovery 414 ends, so the sender can avoid repeated similar failures. 
416 Step 2 may be summarized in pseudocode as:

418 RACK_sent_after(t1, seq1, t2, seq2):
419     If t1 > t2:
420         Return true
421     Else if t1 == t2 AND seq1 > seq2:
422         Return true
423     Else:
424         Return false

426 RACK_update():
427     For each Packet newly acknowledged cumulatively or selectively:
428         rtt = Now() - Packet.xmit_ts
429         If Packet.retransmitted is TRUE:
430             If ACK.ts_option.echo_reply < Packet.xmit_ts:
431                 Continue
432             If rtt < RACK.min_rtt:
433                 Continue
435         RACK.rtt = rtt
436         If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
437                            RACK.xmit_ts, RACK.end_seq):
438             RACK.xmit_ts = Packet.xmit_ts
                RACK.end_seq = Packet.end_seq

440 Step 3: Detect packet reordering

442 To detect reordering, the sender looks for original data packets being delivered out of order in sequence space. The sender tracks the highest sequence selectively or cumulatively acknowledged in the RACK.fack variable. The name "fack" stands for the most "forward ACK", and originated in the [FACK] draft. If the ACK selectively or cumulatively acknowledges an unacknowledged and also never retransmitted sequence below RACK.fack, then the corresponding packet has been reordered and RACK.reord is set to TRUE.

451 The heuristic above only detects reordering if the reordered packet has not yet been retransmitted. This is a major drawback, because if RACK has a low reordering window and the network is reordering packets, RACK may falsely retransmit frequently. Consequently RACK may fail to detect the reordering, and thus fail to increase the reordering window, because the reordered packets were already (falsely) retransmitted.

458 DSACK [RFC3708] can help mitigate this issue. A false retransmission would solicit a DSACK option in the ACK. Therefore, if the ACK has a DSACK option covering some sequences that were both acknowledged and retransmitted, this implies the original packet was reordered but RACK retransmitted the packet too quickly; the sender should then set RACK.reord to TRUE.
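Returning briefly to Step 2: the RACK_sent_after() comparison is the one place where sequence numbers are still consulted, and only as a tie-breaker. A direct Python transcription (illustrative only, not normative) makes the tie-breaking behavior easy to verify:

```python
def rack_sent_after(t1, seq1, t2, seq2):
    """True if a segment with last-transmit time t1 and ending sequence
    seq1 was sent after one with time t2 and ending sequence seq2.
    Equal timestamps (clock granularity, TSO) are broken by sequence."""
    return t1 > t2 or (t1 == t2 and seq1 > seq2)
```

With TSO, several packets handed to the NIC as one unit share a transmit timestamp; the sequence comparison ensures the packet later in sequence space wins the tie.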
465 RACK_detect_reordering():
466     For each Packet newly acknowledged cumulatively or selectively:
467         If Packet.end_seq > RACK.fack:
468             RACK.fack = Packet.end_seq
469         Else if Packet.end_seq < RACK.fack AND
470                 Packet.retransmitted is FALSE:
471             RACK.reord = TRUE

473     For each Packet covered by the DSACK option:
474         If Packet.retransmitted is TRUE:
475             RACK.reord = TRUE

477 Step 4: Update RACK reordering window

479 To handle the prevalent small degree of reordering, RACK.reo_wnd serves as an allowance for settling time before marking a packet lost. This section documents a detailed algorithm following the design rationale section. RACK starts initially with a conservative window of min_RTT/4. If no reordering has been observed, RACK uses a RACK.reo_wnd of 0 during loss recovery, or when the number of duplicate ACKs exceeds the classic DUPACK threshold, in order to retransmit quickly. The subtle difference between this approach and the conventional one [RFC5681][RFC6675] is discussed later in the section "RACK and TLP Discussion".

490 Further, if DSACK is supported, RACK MAY use DSACK [RFC3708] to adapt the reordering window to higher degrees of reordering. Receiving an ACK with a DSACK indicates a spurious retransmission, which in turn suggests that the RACK reordering window, RACK.reo_wnd, is likely too small. The sender MAY increase the RACK.reo_wnd window linearly for every round trip in which the sender receives a DSACK, so that after N distinct round trips in which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an upper bound of SRTT. The inflated RACK.reo_wnd persists for 16 loss recoveries, after which it resets to its starting value, min_RTT / 4.

501 The following pseudocode implements the above algorithm. Note that extensions that require additional TCP features (e.g. DSACK) would work if the feature functions simply return false.
505 RACK_update_reo_wnd():
506     RACK.min_RTT = TCP_min_RTT()
507     If DSACK option is present:
508         RACK.dsack = TRUE

510     If SND.UNA < RACK.rtt_seq:
511         RACK.dsack = FALSE /* React to DSACK once per round trip */

513     If RACK.dsack:
514         RACK.reo_wnd_incr += 1
515         RACK.dsack = FALSE
516         RACK.rtt_seq = SND.NXT
517         RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
518     Else if exiting loss recovery:
519         RACK.reo_wnd_persist -= 1
520         If RACK.reo_wnd_persist <= 0:
521             RACK.reo_wnd_incr = 1

523     If RACK.reord is FALSE:
524         If in loss recovery: /* If in fast or timeout recovery */
525             RACK.reo_wnd = 0
526             Return
527         Else if RACK.pkts_sacked >= RACK.dupthresh:
528             RACK.reo_wnd = 0
529             Return
530     RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
531     RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

533 Step 5: Detect losses.

535 For each packet that has not been SACKed, if RACK.xmit_ts is after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its corresponding sequence range) lost in the scoreboard. The rationale is that if another packet that was sent later has been delivered, and the reordering window or "reordering settling time" has already passed, then the packet was likely lost.

542 If another packet that was sent later has been delivered, but the reordering window has not passed, then it is not yet safe to deem the unacked packet lost. Using the basic algorithm above, the sender would wait for the next ACK to further advance RACK.xmit_ts; but this risks a timeout (RTO) if no more ACKs come back (e.g., due to losses or application limit). For timely loss detection, the sender MAY install a "reordering settling" timer set to fire at the earliest moment at which it is safe to conclude that some packet is lost. The earliest moment is the time it takes to expire the reordering window of the earliest unacked packet in flight.

553 This timer expiration value can be derived as follows.
As a starting 554 point, we consider that the reordering window has passed if the 555 RACK.packet was sent sufficiently after the packet in question, or a 556 sufficient time has elapsed since the RACK.packet was S/ACKed, or 557 some combination of the two. More precisely, RACK marks a packet as 558 lost if the reordering window for a packet has elapsed through the 559 sum of: 561 1. delta in transmit time between a packet and the RACK.packet 563 2. delta in time between RACK.ack_ts and now 565 So we mark a packet as lost if: 567 RACK.xmit_ts >= Packet.xmit_ts 568 AND 569 (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd 571 If we solve this second condition for "now", the moment at which we 572 can declare a packet lost, then we get: 574 now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts) 576 Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the sender 577 used to set RACK.xmit_ts: the round trip time of the most recently 578 (re)transmitted packet that's been delivered. To be more robust to 579 reordering, RACK uses a more conservative RTT value to decide if an 580 unacknowledged packet should be considered lost, RACK.rtt: the round 581 trip time of the most recently delivered packet on the connection 582 that was not marked invalid as a possible spurious retransmission. 584 When packets are delivered in order, the most recently 585 (re)transmitted packet that's been delivered is also the most 586 recently delivered, hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts. 587 But if packets were reordered, then the packet delivered most 588 recently was sent before the most recently (re)transmitted packet. 589 Hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts). 
591 Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous equation reduces to saying that the sender can declare a packet lost when:

594     now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

596 In turn, that is equivalent to stating that a RACK sender should declare a packet lost when:

599     Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

600 The following pseudocode implements the algorithm above. When an ACK is received or the RACK timer expires, call RACK_detect_loss(). The algorithm includes an additional optimization to break timestamp ties by using the TCP sequence space. The optimization is particularly useful to detect losses in a timely manner with TCP Segmentation Offload (TSO), where multiple packets in one TSO blob have identical timestamps.

608 RACK_detect_loss():
609     timeout = 0

611     For each packet, Packet, not acknowledged yet:
612         If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
613             Continue /* Packet lost but not yet retransmitted */

615         If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
616                            Packet.xmit_ts, Packet.end_seq):
617             remaining = Packet.xmit_ts + RACK.rtt +
618                         RACK.reo_wnd - Now()
619             If remaining <= 0:
620                 Packet.lost = TRUE
621             Else:
622                 timeout = max(remaining, timeout)

624     If timeout != 0:
625         Arm a timer to call RACK_detect_loss() after timeout

627 Implementation optimization: looping through packets in the SACK scoreboard above could be very costly on large-BDP networks since the inflight could be very large. If the implementation can organize the scoreboard data structures to have packets sorted by the last (re)transmission time, then the loop can start on the least recently sent packet and abort on the first packet sent after RACK.xmit_ts. This can be implemented by using a separate doubly-linked list sorted in time order.
The implementation inserts the packet at the tail of 635 the list when it is (re)transmitted, and removes a packet from the 636 list when it is delivered or marked lost. We RECOMMEND such an 637 optimization because it enables implementations to support high-BDP 638 networks. This optimization is implemented in Linux and sees orders 639 of magnitude improvement in CPU usage on high-speed WAN networks. 641 7.3. Tail Loss Probe: fast recovery for tail losses 643 This section describes a supplemental algorithm, Tail Loss Probe 644 (TLP), which leverages RACK to further reduce RTO recoveries. TLP 645 triggers fast recovery to quickly repair tail losses that can 646 otherwise be recovered only via RTOs. After an original data 647 transmission, TLP sends a probe data segment within one to two RTTs. 649 The probe data segment can either be new, previously unsent data, or 650 a retransmission of previously sent data just below SND.NXT. In 651 either case the goal is to elicit more feedback from the receiver, in 652 the form of an ACK (potentially with SACK blocks), to allow RACK to 653 trigger fast recovery instead of an RTO. 655 An RTO occurs when the first unacknowledged sequence number is not 656 acknowledged after a conservative period of time has elapsed 657 [RFC6298]. Common causes of RTOs include: 659 1. The entire flight of data is lost. 661 2. Tail losses of data segments at the end of an application 662 transaction. 664 3. Tail losses of ACKs at the end of an application transaction. 666 4. Lost retransmits, which can halt fast recovery based on [RFC6675] 667 if the ACK stream completely dries up. For example, consider a 668 window of three data packets (P1, P2, P3) that are sent; P1 and 669 P2 are dropped. On receipt of a SACK for P3, RACK marks P1 and 670 P2 as lost and retransmits them as R1 and R2. Suppose R1 and R2 671 are lost as well, so there are no more returning ACKs to detect 672 R1 and R2 as lost. Recovery stalls. 674 5. 
An unexpectedly long round-trip time (RTT). This can cause ACKs 675 to arrive after the RTO timer expires. The F-RTO algorithm 676 [RFC5682] is designed to detect such spurious retransmission 677 timeouts and at least partially undo the consequences of such 678 events, but F-RTO cannot be used in many situations. 680 7.4. Tail Loss Probe: An Example 682 Following is an example of TLP. All events listed are at a TCP 683 sender. 685 1. Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is 686 no more new data to transmit. A TLP is scheduled to be sent 2 687 RTTs after the transmission of the 10th segment. 689 2. Sender receives acknowledgements (ACKs) for segments 1-5; 690 segments 6-10 are lost and no ACKs are received. The sender 691 reschedules its TLP at a time relative to the last received ACK, 692 which is the ACK for segment 5 in this case. The sender sets the 693 time for the TLP using the calculation described in step (2) of 694 the algorithm. 696 3. When the TLP timer fires, sender retransmits segment 10. 698 4. After an RTT, a SACK for packet 10 arrives. The ACK also carries 699 SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based 700 loss recovery. 702 5. The connection enters fast recovery and retransmits the remaining 703 lost segments. 705 7.5. Tail Loss Probe Algorithm Details 707 We define the terminology used in specifying the TLP algorithm: 709 FlightSize: amount of outstanding data in the network, as defined in 710 [RFC5681]. 712 RTO: The transport's retransmission timeout (RTO) is based on 713 measured round-trip times (RTT) between the sender and receiver, as 714 specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer 715 event indicating that an ACK is overdue and the sender should try to 716 transmit a TLP. Its value is constrained to be smaller than or equal 717 to an RTO. 719 SRTT: smoothed round-trip time, computed as specified in [RFC6298]. 
720 TLPRxtOut: a boolean indicating whether there is an unacknowledged 721 TLP retransmission. 723 TLPHighRxt: the value of SND.NXT at the time of sending a TLP 724 retransmission. 726 WCDelAckT: maximum delayed ACK timer value. 728 The TLP algorithm has three phases, which we discuss in turn. 730 7.5.1. Phase 1: Scheduling a loss probe 732 Step 1: Check conditions for scheduling a PTO. 734 A sender should check to see if it should schedule a PTO in the 735 following situations: 737 1. After transmitting new data that was not itself a TLP probe 739 2. Upon receiving an ACK that cumulatively acknowledges data 741 A sender should schedule a PTO only if all of the following 742 conditions are met: 744 1. The connection supports SACK [RFC2018] 745 2. The connection has no SACKed sequences in the SACK scoreboard 747 3. The connection is not in loss recovery 749 If a PTO can be scheduled according to these conditions, the sender 750 should schedule a PTO. If there was a previously scheduled PTO or 751 RTO pending, then that pending PTO or RTO should first be cancelled, 752 and then the new PTO should be scheduled. 754 If a PTO cannot be scheduled according to these conditions, then the 755 sender MUST arm the RTO timer if there is unacknowledged data in 756 flight. 758 Step 2: Select the duration of the PTO. 760 A sender SHOULD use the following logic to select the duration of a 761 PTO: 763 TLP_timeout(): 764 If SRTT is available: 765 PTO = 2 * SRTT 766 If FlightSize = 1: 767 PTO += WCDelAckT 768 Else: 769 PTO = 1 sec 771 If Now() + PTO > TCP_RTO_expire(): 772 PTO = TCP_RTO_expire() - Now() 774 Aiming for a PTO value of 2*SRTT allows a sender to wait long enough 775 to know that an ACK is overdue. Under normal circumstances, i.e. no 776 losses, an ACK typically arrives in one SRTT. 
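As a non-normative illustration, TLP_timeout() above maps to the following Python sketch. The absolute rto_expire time stands in for TCP_RTO_expire(), and the 200ms WCDelAckT default comes from this section.

```python
WCDELACKT = 0.2  # worst-case delayed ACK timer; RECOMMENDED 200ms

def tlp_timeout(srtt, flight_size, now, rto_expire):
    """Illustrative PTO selection following TLP_timeout() above.

    srtt is None when no RTT sample is available yet; rto_expire is
    the absolute time at which the RTO timer would fire.
    """
    if srtt is not None:
        pto = 2 * srtt
        if flight_size == 1:
            # A single outstanding segment may sit behind the
            # receiver's delayed ACK timer; wait it out.
            pto += WCDELACKT
    else:
        pto = 1.0  # 1 second when no SRTT is available
    # Never schedule the probe later than the RTO would fire.
    if now + pto > rto_expire:
        pto = rto_expire - now
    return pto
```

The final clamp implements the rule that a probe scheduled after the RTO would be useless, so the earlier of the two deadlines wins.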
But choosing PTO to be 777 exactly an SRTT is likely to generate spurious probes given that 778 network delay variance and even end-system timings can easily push an 779 ACK to be above an SRTT. We chose PTO to be the next integral 780 multiple of SRTT. 782 WCDelAckT stands for worst case delayed ACK timer. When FlightSize 783 is 1, PTO is inflated by WCDelAckT time to compensate for a potential 784 long delayed ACK timer at the receiver. The RECOMMENDED value for 785 WCDelAckT is 200ms. 787 Finally, if the time at which an RTO would fire (here denoted 788 "TCP_RTO_expire") is sooner than the computed time for the PTO, then 789 a probe is scheduled to be sent at that earlier time. 791 7.5.2. Phase 2: Sending a loss probe 793 When the PTO fires, transmit a probe data segment: 795 TLP_send_probe(): 796 If an unsent segment exists AND 797 the receive window allows new data to be sent: 798 Transmit the lowest-sequence unsent segment of up to SMSS 799 Increment FlightSize by the size of the newly-sent segment 801 Else if TLPRxtOut is not set: 802 Retransmit the highest-sequence segment sent so far 803 TLPRxtOut = true 804 TLPHighRxt = SND.NXT 805 The cwnd remains unchanged 807 When the loss probe is a retransmission, the sender uses the highest- 808 sequence segment sent so far. This is in order to deal with the 809 retransmission ambiguity problem in TCP. Suppose a sender sends N 810 segments, and then retransmits the last segment (segment N) as a loss 811 probe, and then the sender receives a SACK for segment N. As long as 812 the sender waits for any required RACK reordering settling timer to 813 then expire, it doesn't matter if that SACK was for the original 814 transmission of segment N or the TLP retransmission; in either case 815 the arrival of the SACK for segment N provides evidence that the N-1 816 segments preceding segment N were likely lost. 
In the case where 817 there is only one original outstanding segment of data (N=1), the 818 same logic (trivially) applies: an ACK for a single outstanding 819 segment tells the sender the N-1=0 segments preceding that segment 820 were lost. Furthermore, whether there are N>1 or N=1 outstanding 821 segments, there is a question about whether the original last segment 822 or its TLP retransmission was lost; the sender estimates this using 823 TLP recovery detection (see below). 825 Note that whether or not a probe was sent in TLP_send_probe(), the 826 sender MUST arm the RTO timer, not the PTO timer, at the end of 827 TLP_send_probe() if FlightSize is not zero. This ensures that the 828 sender does not send repeated, back-to-back TLP probes. Checking 829 TLPRxtOut prior to sending the loss probe is also critical to avoid 830 TLP loops if an application writes periodically at an interval less 831 than PTO. 833 7.5.3. Phase 3: ACK processing 835 On each incoming ACK, the sender should check the conditions in Step 836 1 of Phase 1 to see if it should schedule (or reschedule) the loss 837 probe timer. 839 7.6. TLP recovery detection 841 If the only loss in an outstanding window of data was the last 842 segment, then a TLP loss probe retransmission of that data segment 843 might repair the loss. TLP recovery detection examines ACKs to 844 detect when the probe might have repaired a loss, and thus allows 845 congestion control to properly reduce the congestion window (cwnd) 846 [RFC5681]. 848 Consider a TLP retransmission episode where a sender retransmits a 849 tail packet in a flight. The TLP retransmission episode ends when 850 the sender receives an ACK with a SEG.ACK above the SND.NXT at the 851 time the episode started (i.e. TLPHighRxt).
During the TLP 852 retransmission episode the sender checks for a duplicate ACK or 853 D-SACK indicating that both the original segment and TLP 854 retransmission arrived at the receiver, meaning there was no loss 855 that needed repairing. If the TLP sender does not receive such an 856 indication before the end of the TLP retransmission episode, then it 857 MUST estimate that either the original data segment or the TLP 858 retransmission was lost, and congestion control MUST react 859 appropriately to that loss as it would any other loss. 861 Since a significant fraction of the hosts that support SACK do not 862 support duplicate selective acknowledgments (D-SACKs) [RFC2883], the 863 TLP algorithm for detecting such lost segments relies only on basic 864 SACK support [RFC2018]. 866 7.6.1. Initializing and resetting state 868 When a connection is created, or suffers a retransmission timeout, or 869 enters fast recovery, it executes the following: 871 TLPRxtOut = false 873 7.6.2. Recording loss probe states 875 Senders MUST only send a TLP loss probe retransmission if TLPRxtOut 876 is false. This ensures that at any given time a connection has at 877 most one outstanding TLP retransmission. This allows the sender to 878 use the algorithm described in this section to estimate whether any 879 data segments were lost. 881 Note that this condition only restricts TLP loss probes that are 882 retransmissions. There may be an arbitrary number of outstanding 883 unacknowledged TLP loss probes that consist of new, previously-unsent 884 data, since the retransmission timeout and fast recovery algorithms 885 are sufficient to detect losses of such probe segments. 887 Upon sending a TLP probe that is a retransmission, the sender sets 888 TLPRxtOut to true and TLPHighRxt to SND.NXT. 890 7.6.3.
Detecting recoveries accomplished by loss probes 892 Step 1: Track ACKs indicating receipt of original and retransmitted 893 segments 895 A sender considers both the original segment and TLP probe 896 retransmission segment as acknowledged if either 1 or 2 is true: 898 1. This is a duplicate acknowledgment (as defined in [RFC5681], 899 section 2), and all of the following conditions are met: 901 1. TLPRxtOut is true 903 2. SEG.ACK == TLPHighRxt 905 3. SEG.ACK == SND.UNA 907 4. the segment contains no SACK blocks for sequence ranges above 908 TLPHighRxt 910 5. the segment contains no data 912 6. the segment is not a window update 914 2. This is an ACK acknowledging a sequence number at or above 915 TLPHighRxt and it contains a D-SACK; i.e. all of the following 916 conditions are met: 918 1. TLPRxtOut is true 920 2. SEG.ACK >= TLPHighRxt 922 3. the ACK contains a D-SACK block 924 If either of the conditions is met, then the sender estimates that 925 the receiver received both the original data segment and the TLP 926 probe retransmission, and so the sender considers the TLP episode to 927 be done, and records that fact by setting TLPRxtOut to false. 929 Step 2: Mark the end of a TLP retransmission episode and detect 930 losses 932 If the sender receives a cumulative ACK for data beyond the TLP loss 933 probe retransmission then, in the absence of reordering on the return 934 path of ACKs, it should have received any ACKs for the original 935 segment and TLP probe retransmission segment. At that time, if the 936 TLPRxtOut flag is still true and thus indicates that the TLP probe 937 retransmission remains unacknowledged, then the sender should presume 938 that at least one of its data segments was lost, so it SHOULD invoke 939 a congestion control response equivalent to fast recovery.
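The two acknowledgment checks in Step 1 can be illustrated as a single non-normative predicate. The state and ack field names below are stand-ins for the sender's TLP state and the parsed incoming ACK, not a prescribed API.

```python
def tlp_probe_acked(state, ack):
    """Return True if Step 1 concludes that both the original segment
    and the TLP retransmission were received (non-normative sketch)."""
    if not state['tlp_rxt_out']:
        return False
    # Condition 1: a pure duplicate ACK at TLPHighRxt.
    dupack = (ack['seg_ack'] == state['tlp_high_rxt'] and
              ack['seg_ack'] == state['snd_una'] and
              not ack['sack_above_tlp_high_rxt'] and
              not ack['has_data'] and
              not ack['is_window_update'])
    # Condition 2: an ACK at/above TLPHighRxt carrying a D-SACK.
    dsack = (ack['seg_ack'] >= state['tlp_high_rxt'] and
             ack['has_dsack'])
    return dupack or dsack

# When the predicate holds, the sender ends the episode:
#   state['tlp_rxt_out'] = False
```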
941 More precisely, on each ACK the sender executes the following: 943 if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { 944 TLPRxtOut = false 945 EnterRecovery() 946 ExitRecovery() 947 } 949 8. RACK and TLP Discussion 951 8.1. Advantages 953 The biggest advantage of RACK is that every data packet, whether it 954 is an original data transmission or a retransmission, can be used to 955 detect losses of the packets sent chronologically prior to it. 957 Example: TAIL DROP. Consider a sender that transmits a window of 958 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 959 transmission of each packet is at least RACK.reo_wnd (1 millisecond 960 by default) after the transmission of the previous packet. RACK will 961 mark P1 as lost when the SACK of P2 is received, and this will 962 trigger the retransmission of P1 as R1. When R1 is cumulatively 963 acknowledged, RACK will mark P3 as lost and the sender will 964 retransmit P3 as R3. This example illustrates how RACK is able to 965 repair certain drops at the tail of a transaction without any timer. 966 Notice that neither the conventional duplicate ACK threshold 967 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 968 algorithm can detect such losses, because of the required packet or 969 sequence count. 971 Example: LOST RETRANSMIT. Consider a window of three data packets 972 (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the 973 transmission of each packet is at least RACK.reo_wnd (1 millisecond 974 by default) after the transmission of the previous packet. When P3 975 is SACKed, RACK will mark P1 and P2 lost and they will be 976 retransmitted as R1 and R2. Suppose R1 is lost again but R2 is 977 SACKed; RACK will mark R1 lost for retransmission again. Again, 978 neither the conventional three duplicate ACK threshold approach, nor 979 [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect 980 such losses. 
And such a lost retransmission is very common when TCP 981 is being rate-limited, particularly by token bucket policers with 982 large bucket depth and low rate limit. Retransmissions are often 983 lost repeatedly because standard congestion control requires multiple 984 round trips to reduce the rate below the policed rate. 986 Example: SMALL DEGREE OF REORDERING. Consider a common reordering 987 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 988 a full payload of MSS octets, but P3 has only a 1-octet payload. 989 Suppose the sender has detected reordering previously and thus 990 RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, 991 before P1 and P2. As long as P1 and P2 are delivered within 992 min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 993 are delivered outside the reordering window, then RACK will still 994 falsely mark P1 and P2 lost. We discuss how to reduce false 995 positives at the end of this section. 997 The examples above show that RACK is particularly useful when the 998 sender is limited by the application, which is common for 999 interactive, request/response traffic. Similarly, RACK still works 1000 when the sender is limited by the receive window, which is common for 1001 applications that use the receive window to throttle the sender. 1003 For some implementations (e.g., Linux), RACK works quite efficiently 1004 with TCP Segmentation Offload (TSO). RACK always marks the entire 1005 TSO blob lost because the packets in the same TSO blob have the same 1006 transmission timestamp. By contrast, the algorithms based on 1007 sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset 1008 of packets in the TSO blob lost, forcing the stack to perform 1009 expensive fragmentation of the TSO blob, or to selectively tag 1010 individual packets lost in the scoreboard. 1012 8.2.
Disadvantages 1014 RACK requires the sender to record the transmission time of each 1015 packet sent at a clock granularity of one millisecond or finer. TCP 1016 implementations that record this already for RTT estimation do not 1017 require any new per-packet state. But implementations that are not 1018 yet recording packet transmission times will need to add per-packet 1019 internal state (commonly either 4 or 8 octets per packet or TSO blob) 1020 to track transmission times. In contrast, the conventional [RFC6675] 1021 loss detection approach does not require any per-packet state beyond 1022 the SACK scoreboard. This is particularly useful on ultra-low RTT 1023 networks where the RTT is far less than the sender TCP clock 1024 granularity (e.g. inside data-centers). 1026 RACK can easily and optionally support the conventional approach in 1027 [RFC6675][RFC5681] by resetting the reordering window to zero when 1028 the threshold is met. Note that this approach differs slightly from 1029 [RFC6675] which considers a packet lost when at least DupThresh 1030 higher-sequence packets are SACKed. RACK's approach considers a 1031 packet lost when at least one higher sequence packet is SACKed and 1032 the total number of SACKed packets is at least DupThresh. For 1033 example, suppose a connection sends 10 packets, and packets 3, 5, 7 1034 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK 1035 considers packets 1, 2, 4, 6 lost. 1037 8.3. Adjusting the reordering window 1039 When the sender detects packet reordering, RACK uses a reordering 1040 window of min_rtt / 4. It uses the minimum RTT to accommodate 1041 reordering introduced by packets traversing slightly different paths 1042 (e.g., router-based parallelism schemes) or out-of-order deliveries 1043 in the lower link layer (e.g., wireless links using link-layer 1044 retransmission). 
RACK uses a quarter of minimum RTT because Linux 1045 TCP used the same factor in its implementation to delay Early 1046 Retransmit [RFC5827] to reduce spurious loss detections in the 1047 presence of reordering, and experience shows that this seems to work 1048 reasonably well. We have evaluated using the smoothed RTT (SRTT from 1049 [RFC6298] RTT estimation) or the most recently measured RTT 1050 (RACK.rtt) using an experiment similar to that in the Performance 1051 Evaluation section. Neither choice makes a significant difference in 1052 terms of total recovery latency. 1054 8.4. Relationships with other loss recovery algorithms 1056 The primary motivation of RACK is to ultimately provide a simple and 1057 general replacement for some of the standard loss recovery algorithms 1058 [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard 1059 ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss 1060 detection mechanism on top of these algorithms, this is not 1061 necessary, because RACK implicitly subsumes most of them. 1063 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 1064 threshold based on the current or previous flight sizes. RACK takes 1065 a different approach, by using only one ACK event and a reordering 1066 window. RACK can be seen as an extended Early Retransmit [RFC5827] 1067 without a FlightSize limit but with an additional reordering window. 1068 [FACK] considers an original packet to be lost when its sequence 1069 range is sufficiently far below the highest SACKed sequence. In some 1070 sense RACK can be seen as a generalized form of FACK that operates in 1071 time space instead of sequence space, enabling it to better handle 1072 reordering, application-limited traffic, and lost retransmissions.
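The contrast can be illustrated with two non-normative predicates: a FACK-style check in sequence space and the RACK check in time space. DupThresh and all parameter names here are illustrative stand-ins.

```python
DUP_THRESH = 3  # illustrative sequence-space threshold (in packets)

def fack_lost(pkt_end_seq, highest_sacked_seq):
    # FACK-style: a packet is lost when it sits sufficiently far
    # below the highest SACKed sequence (distance in packets).
    return highest_sacked_seq - pkt_end_seq >= DUP_THRESH

def rack_lost(pkt_xmit_ts, rack_xmit_ts, rtt, reo_wnd, now):
    # RACK: the analogous test in time space, so retransmissions and
    # application-limited flights still advance the "distance".
    return (pkt_xmit_ts < rack_xmit_ts and
            now - pkt_xmit_ts >= rtt + reo_wnd)
```

The sequence-space test stalls when no new sequences are SACKed (e.g. lost retransmits), while the time-space test keeps advancing with the clock.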
1074 Since the 3 duplicate ACK threshold for triggering fast recovery 1075 [RFC5681] has been widely deployed and usually works well in the 1076 absence of reordering, RACK uses this signal to trigger fast recovery 1077 if a connection has not observed reordering. 1079 RACK is compatible with and does not interfere with the standard RTO 1080 [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 1081 algorithms [RFC3522]. This is because RACK only detects loss by 1082 using ACK events. It neither changes the RTO timer calculation nor 1083 detects spurious timeouts. 1085 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 1086 because a tail loss probe solicits either an ACK or SACK, which can 1087 be used by RACK to detect more losses. RACK can be used to relax 1088 TLP's requirement for using FACK and retransmitting the highest- 1089 sequenced packet, because RACK is agnostic to packet sequence 1090 numbers, and uses transmission time instead. Thus TLP could be 1091 modified to retransmit the first unacknowledged packet, which could 1092 improve application latency. 1094 8.5. Interaction with congestion control 1096 RACK intentionally decouples loss detection from congestion control. 1097 RACK only detects losses; it does not modify the congestion control 1098 algorithm [RFC5681][RFC6937]. A packet marked lost by RACK SHOULD 1099 NOT be retransmitted until congestion control deems this appropriate. 1101 RACK is applicable for both fast recovery and recovery after a 1102 retransmission timeout (RTO) in [RFC5681]. RACK applies equally to 1103 fast recovery and RTO recovery because RACK is purely based on the 1104 transmission time order of packets. When a packet retransmitted by 1105 RTO is acknowledged, RACK will mark any unacked packet sent 1106 sufficiently prior to the RTO as lost, because at least one RTT has 1107 elapsed since these packets were sent.
1109 RACK may detect losses faster or slower than the conventional 1110 duplicate ACK threshold approach does. RACK can detect losses faster 1111 by not requiring three DUPACKs, so congestion control may reduce the 1112 congestion window earlier. When the network path has both reordering 1113 and losses, RACK detects losses slower by waiting for the reordering 1114 window to expire. TCP may continue to increase the congestion window 1115 upon receiving ACKs during this time, making the sender more 1116 aggressive. Certain congestion control algorithms can benefit from 1117 accounting for this increase in the congestion window during the 1118 reordering window. 1120 8.5.1. Example: interactions with congestion control 1122 The following simple example compares how RACK and non-RACK loss 1123 detection interact with congestion control: suppose a TCP sender has 1124 a congestion window (cwnd) of 20 packets on a SACK-enabled 1125 connection. It sends 10 data packets and all of them are lost. 1127 Without RACK, the sender would time out, reset cwnd to 1, and 1128 retransmit the first packet. It would take four round trips (1 + 2 + 1129 4 + 3 = 10) to retransmit all 10 lost packets using slow start. 1130 The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 1131 packets due to congestion window validation. 1133 With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. 1134 If the sender implements Proportional Rate Reduction [RFC6937] it 1135 would slow start to retransmit the remaining 9 lost packets since the 1136 number of packets in flight (0) is lower than the slow start 1137 threshold (10). The slow start would again take four round trips (1 1138 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with 1139 an ending cwnd set to the slow start threshold of 10 packets. 1141 In both cases, the sender after the recovery would be in congestion 1142 avoidance.
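The four-round slow start count used in both cases can be checked with a short non-normative helper. Classic slow start doubling is assumed; congestion window validation and PRR details are omitted.

```python
def slow_start_rounds(cwnd, to_send):
    """Round trips needed to (re)transmit to_send packets when cwnd
    doubles each round (classic slow start, no further losses)."""
    rounds = 0
    sent = 0
    while sent < to_send:
        sent += min(cwnd, to_send - sent)
        cwnd *= 2
        rounds += 1
    return rounds

# Starting from cwnd = 1, ten packets take 1 + 2 + 4 + 3 across four
# round trips, so recovery costs RTO + 4*RTT without RACK versus
# 2*RTT + 4*RTT = 6*RTT with TLP plus RACK.
```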
The difference in recovery latency (RTO + 4*RTT vs 6*RTT) 1143 can be significant if the RTT is much smaller than the minimum RTO (1 1144 second in RFC6298) or if the RTT is large. The former case is common 1145 in local area networks, data-center networks, or content distribution 1146 networks with deep deployments. The latter case is more common in 1147 developing regions with highly congested and/or high-latency 1148 networks. 1150 8.6. TLP recovery detection with delayed ACKs 1152 Delayed ACKs complicate the detection of repairs done by TLP, since 1153 with a delayed ACK the sender receives one fewer ACK than would 1154 normally be expected. To mitigate this complication, before sending 1155 a TLP loss probe retransmission, the sender should attempt to wait 1156 long enough that the receiver has sent any delayed ACKs that it is 1157 withholding. The sender algorithm described above features such a 1158 delay, in the form of WCDelAckT. Furthermore, if the receiver 1159 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1160 in the case of a delayed ACK the sender's TLP recovery detection 1161 algorithm (see above) can use the D-SACK information to infer that 1162 the original and TLP retransmission both arrived at the receiver. 1164 If there is ACK loss or a delayed ACK without a D-SACK, then this 1165 algorithm is conservative, because the sender will reduce cwnd when 1166 in fact there was no packet loss. In practice this is acceptable, 1167 and potentially even desirable: if there is reverse path congestion 1168 then reducing cwnd can be prudent. 1170 8.7. RACK for other transport protocols 1172 RACK can be implemented in other transport protocols. The algorithm 1173 can be simplified by skipping step 3 if the protocol can support a 1174 unique transmission or packet identifier (e.g. TCP timestamp options 1176 [RFC7323]). For example, the QUIC protocol implements RACK [QUIC- 1177 LR]. 
The [Sprout] loss detection algorithm was also independently 1178 designed to use a 10ms reordering window to improve its loss 1179 detection. 1181 9. Experiments and Performance Evaluations 1183 RACK and TLP have been deployed at Google, for both connections to 1184 users in the Internet and internally. We conducted a performance 1185 evaluation experiment for RACK and TLP on a small set of Google Web 1186 servers in Western Europe that serve mostly European and some African 1187 countries. The experiment lasted three days in March 2017. The 1188 servers were divided evenly into four groups of roughly 5.3 million 1189 flows each: 1191 Group 1 (control): RACK off, TLP off, RFC 6675 on 1193 Group 2: RACK on, TLP off, RFC 6675 on 1195 Group 3: RACK on, TLP on, RFC 6675 on 1197 Group 4: RACK on, TLP on, RFC 6675 off 1199 All groups used Linux with CUBIC congestion control, an initial 1200 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1201 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1202 disabled FACK because it is not an IETF RFC. FACK was excluded 1203 because the goal of this setup is to compare RACK and TLP to RFC- 1204 based loss recoveries. Since TLP depends on either FACK or RACK, we 1205 could not run another group that enables TLP only (with both RACK and 1206 FACK disabled). Group 4 is to test whether RACK plus TLP can 1207 completely replace the DupThresh-based [RFC6675]. 1209 The servers sit behind a load balancer that distributes the 1210 connections evenly across the four groups. 1212 Each group handles a similar number of connections and sends and 1213 receives similar amounts of data. We compare total time spent in 1214 loss recovery across groups. The recovery time is measured from when 1215 the recovery and retransmission starts, until the remote host has 1216 acknowledged the highest sequence (SND.NXT) at the time the recovery 1217 started. 
Therefore the recovery includes both fast recoveries and 1218 timeout recoveries. 1220 Our data shows that Group 2 recovery latency is only 0.3% lower than 1221 the Group 1 recovery latency. But Group 3 recovery latency is 25% 1222 lower than Group 1 due to a 40% reduction in RTO-triggered 1223 recoveries! Therefore it is important to implement both TLP and RACK 1224 for performance. Group 4's total recovery latency is 0.02% lower 1225 than Group 3's, indicating that RACK plus TLP can successfully 1226 replace RFC6675 as a standalone recovery mechanism. 1228 We want to emphasize that the current experiment is limited in terms 1229 of network coverage. The connectivity in Western Europe is fairly 1230 good, therefore loss recovery is not a major performance bottleneck. 1231 We plan to expand our experiments to regions with worse connectivity, 1232 in particular on networks with strong traffic policing. 1234 10. Security Considerations 1236 RACK does not change the risk profile for TCP. 1238 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 1239 MSS-size packet sent, the receiver or the attacker might send MSS 1240 ACKs that SACK or acknowledge one additional byte per ACK. This 1241 would not fool RACK. RACK.xmit_ts would not advance because all the 1242 sequences of the packet are transmitted at the same time (carry the 1243 same transmission timestamp). In other words, SACKing only one byte 1244 of a packet or SACKing the packet in entirety have the same effect on 1245 RACK. 1247 11. IANA Considerations 1249 This document makes no request of IANA. 1251 Note to RFC Editor: this section may be removed on publication as an 1252 RFC. 1254 12. Acknowledgments 1256 The authors thank Matt Mathis for his insights in FACK and Michael 1257 Welzl for his per-packet timer idea that inspired this work. 
Eric 1258 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana 1259 Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi 1260 Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, Martin Duke, 1261 and Ilpo Jarvinen contributed to the draft or the implementations in 1262 Linux, FreeBSD, Windows and QUIC. 1264 13. References 1266 13.1. Normative References 1268 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 1269 Options", RFC 2018, October 1996. 1271 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1272 Requirement Levels", RFC 2119, March 1997. 1274 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1275 Extension to the Selective Acknowledgement (SACK) Option 1276 for TCP", RFC 2883, July 2000. 1278 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 1279 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 1280 November 2006. 1282 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1283 Control", RFC 5681, September 2009. 1285 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 1286 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1287 Spurious Retransmission Timeouts with TCP", RFC 5682, 1288 September 2009. 1290 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 1291 Hurtig, "Early Retransmit for TCP and Stream Control 1292 Transmission Protocol (SCTP)", RFC 5827, April 2010. 1294 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1295 "Computing TCP's Retransmission Timer", RFC 6298, June 1296 2011. 1298 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1299 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1300 Based on Selective Acknowledgment (SACK) for TCP", 1301 RFC 6675, August 2012. 1303 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 1304 Rate Reduction for TCP", RFC 6937, May 2013. 1306 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
1307 Scheffenegger, "TCP Extensions for High Performance", RFC 7323, 1308 September 2014. 1310 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, September 1311 1981. 1313 13.2. Informative References 1315 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 1316 refining TCP congestion control", ACM SIGCOMM Computer 1317 Communication Review, Volume 26, Issue 4, October 1996. 1320 [POLICER16] 1321 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1322 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1323 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 1324 2016. 1326 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1327 Congestion Control", draft-ietf-quic-recovery-latest (work 1328 in progress), March 2020. 1330 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1331 and SCTP RTO Restart", RFC 7765, February 2016. 1333 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1334 "TCP Congestion Control With a Misbehaving Receiver", ACM 1335 Computer Communication Review, 29(5), 1999. 1337 [Sprout] Winstein, K., Sivaraman, A., and H. Balakrishnan, 1338 "Stochastic Forecasts Achieve High Throughput and Low 1339 Delay over Cellular Networks", USENIX Symposium on 1340 Networked Systems Design and Implementation (NSDI), 2013. 1342 [THIN-STREAM] 1343 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1344 "TCP enhancements for interactive thin-stream 1345 applications", NOSSDAV, 2008. 1347 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1348 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1349 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1350 in progress), August 2013. 1352 Authors' Addresses 1354 Yuchung Cheng 1355 Google, Inc 1357 Email: ycheng@google.com 1359 Neal Cardwell 1360 Google, Inc 1362 Email: ncardwell@google.com 1363 Nandita Dukkipati 1364 Google, Inc 1366 Email: nanditad@google.com 1368 Priyaranjan Jha 1369 Google, Inc 1371 Email: priyarjha@google.com