TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Standards Track                             N. Dukkipati
Expires: July 20, 2020                                            P. Jha
                                                              Google, Inc
                                                             Jan 17, 2020

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-07

Abstract

   This document presents a new TCP loss detection algorithm called
   RACK ("Recent ACKnowledgment").  RACK uses the notion of time,
   instead of packet or sequence counts, to detect losses, for modern
   TCP implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as
   well as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on July 20, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.  In this document, these words will
   appear with that interpretation only when in UPPER CASE.  Lower
   case uses of these words are not to be interpreted as carrying
   [RFC2119] significance.

2.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if it has not been
   delivered and some packet sent sufficiently later has been
   delivered.  It does this by recording packet transmission times and
   inferring losses using cumulative acknowledgments or selective
   acknowledgment (SACK) TCP options.

   In recent years we have been observing several increasingly common
   loss and reordering patterns in the Internet:

   1.  Slow recovery due to lost retransmissions.  Traffic policers
       [POLICER16] and burst losses often cause retransmissions to be
       lost again.  This severely increases latency because the lost
       retransmissions can only be recovered by retransmission
       timeouts (RTOs).

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK),
       link bonding, or routers' internal load-balancing can deliver
       TCP packets out of order.  The degree of such reordering is
       usually within the order of the path round trip time.

   Although TCP stacks (e.g. Linux) implement many of the standard and
   proposed loss detection algorithms
   [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM],
   we've found that together they do not perform well.  The main
   reason is that many of them are based on the classic rule of
   counting duplicate acknowledgments [RFC5681].  They can either
   detect loss quickly or accurately, but not both, especially when
   the sender is application-limited or the reordering is
   unpredictable.  And under these conditions none of them can detect
   lost retransmissions well.

   Also, these algorithms, including those in RFCs, rarely address
   their interactions with other algorithms.
   For example, FACK may consider a packet lost while RFC6675 may not.
   Implementing N algorithms while dealing with N^2 interactions is a
   daunting and error-prone task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one more effective
   algorithm to handle loss and reordering.

3.  Overview

   The main idea behind RACK is that if a packet has been delivered
   out of order, then the packets sent chronologically before that
   were either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC6675][FACK].  But the key innovation in
   RACK is to use a per-packet transmission timestamp and widely
   deployed SACK options to conduct time-based inferences instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) alone is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems with
   packet-counting approaches [RFC5681][RFC6675][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge if some packets sent prior to
   that time have "expired" by passing a certain reordering settling
   window.  On each ACK, RACK marks any already-expired packets lost,
   and for any packets that have not yet expired it waits until the
   reordering window passes and then marks those lost as well.  In
   either case, RACK can repair the loss without waiting for a (long)
   RTO.  RACK can be applied to both fast recovery and timeout
   recovery, and can detect losses on both originally transmitted and
   retransmitted packets, making it a great all-weather loss detection
   mechanism.

4.  Design Rationale for Reordering Tolerance

   The reordering behavior of networks can evolve (over years) in
   response to the behavior of transport protocols and applications,
   as well as the needs of network designers and operators.  From a
   network or link designer's viewpoint, parallelization (e.g. link
   bonding) is the easiest way to get a network to go faster.
   Therefore their main constraint on speed is reordering, and there
   is pressure to relax that constraint.  If RACK becomes widely
   deployed, the underlying networks may introduce more reordering for
   higher throughput.  But this may result in excessive reordering
   that hurts end-to-end performance:
   1.  End host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.

   2.  Congestion control: TCP congestion control implicitly assumes
       that the feedback from ACKs is from the same bottleneck.
       Therefore it copes poorly with scenarios where packets traverse
       largely disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths
       would increase the latency of loss recovery.

   An end-to-end transport protocol cannot tell immediately whether a
   hole is reordering or loss.  It can only distinguish between the
   two in hindsight if the hole in the sequence space gets filled
   later without a retransmission.  How long the sender waits for such
   potential reordering events to settle is determined by the current
   reordering window.

   Given these considerations, a core design philosophy of RACK is to
   adapt to the measured duration of reordering events, within
   reasonable and specific bounds.  To accomplish this RACK places the
   following mandates on the reordering window:

   1.  The initial RACK reordering window SHOULD be set to a small
       fraction of the round-trip time.

   2.  If no reordering has been observed, then RACK SHOULD honor the
       classic 3-DUPACK rule for initiating fast recovery.  One simple
       way to implement this is to temporarily override the reorder
       window to 0.

   3.  The RACK reordering window SHOULD leverage Duplicate Selective
       Acknowledgement (DSACK) information [RFC3708] to adaptively
       estimate the duration of reordering events.

   4.  The RACK reordering window MUST be bounded and this bound
       SHOULD be one round trip.

   As a flow starts, either condition 1 or condition 2 or both would
   trigger RACK to start the recovery process quickly.  The low
   initial reordering window and use of the 3-DUPACK rule are key to
   achieving low-latency loss recovery for short flows by risking
   spurious retransmissions to recover losses quickly.  The rationale
   is that spurious retransmissions for short flows are not expected
   to produce excessive network traffic.

   For long flows the design tolerates reordering within a round trip.
   This handles reordering caused by path divergence in small time
   scales (reordering within the round-trip time of the shortest
   path), which should tolerate much of the reordering from link
   bonding, multipath routing, or link-layer out-of-order delivery.
   It also relaxes ordering constraints to allow sending flights of
   TCP packets on different paths dynamically for better
   load-balancing (e.g. flowlets).

   However, the fact that the initial RACK reordering window is low,
   and the RACK reordering window's adaptive growth is bounded, means
   that there will continue to be a cost to reordering and a limit to
   RACK's adaptation to reordering.  This maintains a disincentive for
   network designers and operators to introduce needless or excessive
   reordering, particularly because they have to allow for low round
   trip time paths.  This means RACK will not encourage networks to
   perform inconsiderate fine-grained packet-spraying over highly
   disjoint paths with very different characteristics.
   There are good alternative solutions, such as MPTCP, for such
   networks.

   To conclude, the RACK algorithm aims to adapt to small degrees of
   reordering, quickly recover most losses within one to two round
   trips, and avoid costly retransmission timeouts (RTOs).  In the
   presence of reordering, the adaptation algorithm can impose
   sometimes-needless delays when it waits to disambiguate loss from
   reordering, but the penalty for waiting is bounded to one round
   trip and such delays are confined to longer-running flows.

   This document provides a concrete and detailed reordering window
   adaptation algorithm for implementers.  We note that the specifics
   of the algorithm are likely to evolve over time.  But that is a
   separate engineering optimization that's out of scope for this
   document.

5.  Requirements

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not expected but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g.,
       intra-datacenter communications), microsecond granularity would
       significantly help the detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explaining the algorithm, we use a
   pseudo-scoreboard that manages the data in sequence number ranges.
   But the specifics of the data structure are left to the
   implementer.

   RACK does not require any change on the receiver.

6.  Definitions

   The reader is expected to be familiar with the definitions given in
   [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

6.1.  Definitions of variables

   A sender implementing RACK needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet" is, among all the packets that have been either
   selectively or cumulatively acknowledged, the one that was sent
   most recently, including retransmissions.

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.rtt" is the RTT of the most recently delivered packet on the
   connection (either cumulatively acknowledged or selectively
   acknowledged) that was not marked invalid as a possible spurious
   retransmission.

   "RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

   "RACK.reo_wnd" is a reordering window computed in the unit of time
   used for recording packet transmission times.  It is used to defer
   the moment at which RACK marks a packet lost.
316 "RACK.dupthresh" is a constant specifying the number of duplicate 317 acknowledgments, or selectively acknowledged segments, that can 318 (under certain conditions) trigger fast recovery, similar to 319 [RFC6675]. As in [RFC5681] and [RFC6675], this threshold is defined 320 to be 3. 322 "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the 323 connection. 325 "RACK.ack_ts" is the time when all the sequences in RACK.packet were 326 selectively or cumulatively acknowledged. 328 "RACK.reo_wnd_incr" is the multiplier applied to adjust RACK.reo_wnd 330 "RACK.reo_wnd_persist" is the number of loss recoveries before 331 resetting RACK.reo_wnd 332 "RACK.dsack" indicates if a DSACK option has been received since last 333 RACK.reo_wnd change "RACK.pkts_sacked" returns the total number of 334 packets selectively acknowledged in the SACK scoreboard. 336 "RACK.reord" indicates the connection has detected packet reordering 337 event(s) 339 "RACK.fack" is the highest selectively or cumulatively acknowledged 340 sequence 342 Note that the Packet.xmit_ts variable is per packet in flight. The 343 RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and RACK.min_RTT 344 variables are kept in the per-connection TCP control block. 345 RACK.packet and RACK.ack_ts are used as local variables in the 346 algorithm. 348 7. Algorithm Details 350 7.1. Transmitting a data packet 352 Upon transmitting a new packet or retransmitting an old packet, 353 record the time in Packet.xmit_ts. RACK does not care if the 354 retransmission is triggered by an ACK, new application data, an RTO, 355 or any other means. 357 7.2. Upon receiving an ACK 359 Step 1: Update RACK.min_RTT. 361 Use the RTT measurements obtained via [RFC6298] or [RFC7323] to 362 update the estimated minimum RTT in RACK.min_RTT. The sender can 363 track a simple global minimum of all RTT measurements from the 364 connection, or a windowed min-filtered value of recent RTT 365 measurements. This document does not specify an exact approach. 367 Step 2: Update RACK stats 369 Given the information provided in an ACK, each packet cumulatively 370 ACKed or SACKed is marked as delivered in the scoreboard. Among all 371 the packets newly ACKed or SACKed in the connection, record the most 372 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 373 Sometimes the timestamps of RACK.Packet and Packet could carry the 374 same transmit timestamps due to clock granularity or segmentation 375 offloading (i.e. the two packets were handed to the NIC as a single 376 unit). In that case the sequence numbers of RACK.end_seq and 377 Packet.end_seq are compared to break the tie. 379 Since an ACK can also acknowledge retransmitted data packets, and 380 retransmissions can be spurious, the sender must take care to avoid 381 spurious inferences. For example, if the sender were to use timing 382 information from a spurious retransmission, the RACK.rtt could be 383 vastly underestimated. 385 To avoid spurious inferences, ignore a packet as invalid if any of 386 its TCP sequences have been retransmitted before and either of two 387 conditions is true: 389 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 390 option [RFC7323], if available, indicates the ACK was not 391 acknowledging the last retransmission of the packet. 393 2. The packet was last retransmitted less than RACK.min_rtt ago. 395 If the ACK is not ignored as invalid, update the RACK.rtt to be the 396 RTT sample calculated using this ACK, and continue. 
   If this ACK or SACK was for the most recently sent packet, then
   record the RACK.xmit_ts timestamp and RACK.end_seq sequence implied
   by this ACK.  Otherwise exit here and omit the following steps.

   Notice that the second condition above is a heuristic.  This
   heuristic would fail to update RACK stats if the packet is
   spuriously retransmitted because of a recent minimum RTT decrease
   (e.g. path change).  Consequently RACK may not detect losses from
   ACK events and the recovery resorts to the (slower) TLP or RTO
   timer-based events.  However such events should be rare and the
   connection would pick up the new minimum RTT when the recovery ends
   to avoid repeated similar failures.

   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet.retransmitted is TRUE:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Continue /* Skip this invalid packet */
               If rtt < RACK.min_rtt:
                   Continue /* Skip this invalid packet */

           RACK.rtt = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts
               RACK.end_seq = Packet.end_seq
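   As a non-normative illustration, the comparison rule above can be
   expressed in runnable form.  The following Python sketch (names and
   values are illustrative only) shows how the ending sequence number
   breaks the tie when two segments carry the same transmit timestamp,
   e.g. two segments from one TSO blob:

   # Non-normative Python sketch of the Step 2 comparison rule.
   def rack_sent_after(t1, seq1, t2, seq2):
       # (t1, seq1) was sent after (t2, seq2) if it has a later
       # transmit time, or the same time but a higher ending sequence.
       return t1 > t2 or (t1 == t2 and seq1 > seq2)

   # Two segments handed to the NIC as a single unit share a transmit
   # timestamp; the ending sequence number breaks the tie.
   assert rack_sent_after(100, 3000, 100, 2000)
   assert not rack_sent_after(99, 3000, 100, 2000)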
   Step 3: Detect packet reordering

   To detect reordering, the sender looks for original data packets
   being delivered out of order in sequence space.  The sender tracks
   the highest sequence selectively or cumulatively acknowledged in
   the RACK.fack variable.  The name "fack" stands for the most
   forward ACK, a term that originated in the [FACK] draft.  If the
   ACK selectively or cumulatively acknowledges an unacknowledged and
   also never retransmitted sequence below RACK.fack, then the
   corresponding packet has been reordered and RACK.reord is set to
   TRUE.

   The heuristic above only detects reordering if the re-ordered
   packet has not yet been retransmitted.  This is a major drawback
   because if RACK has a low reordering window and the network is
   reordering packets, RACK may falsely retransmit frequently.
   Consequently RACK may fail to detect the reordering and thus fail
   to increase the reordering window, because the reordered packets
   were already (falsely) retransmitted.

   DSACK [RFC3708] can help mitigate this issue.  The false
   retransmission would solicit a DSACK option in the ACK.  Therefore
   if the ACK has a DSACK option covering some sequences that were
   both acknowledged and retransmitted, this implies that the original
   packet was reordered but RACK retransmitted it too quickly; the
   sender should then set RACK.reord to TRUE.

   RACK_detect_reordering():
       For each Packet newly acknowledged cumulatively or selectively:
           If Packet.end_seq > RACK.fack:
               RACK.fack = Packet.end_seq
           Else if Packet.end_seq < RACK.fack AND
                   Packet.retransmitted is FALSE:
               RACK.reord = TRUE

       For each Packet covered by the DSACK option:
           If Packet.retransmitted is TRUE:
               RACK.reord = TRUE

   Step 4: Update RACK reordering window

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  This section documents a detailed algorithm following the
   design rationale section.  RACK starts initially with a
   conservative window of min_RTT/4.  If no reordering has been
   observed, RACK uses a RACK.reo_wnd of 0 during loss recovery or
   when the number of DUPACKs exceeds the classic DUPACK threshold, in
   order to retransmit quickly.  The subtle difference between this
   approach and the conventional one [RFC5681][RFC6675] is discussed
   later, in the section "RACK and TLP discussions".

   Further, RACK MAY use DSACK [RFC3708] to adapt the reordering
   window to higher degrees of reordering, if DSACK is supported.
   Receiving an ACK with a DSACK indicates a spurious retransmission,
   which in turn suggests that the RACK reordering window,
   RACK.reo_wnd, is likely too small.  The sender MAY increase the
   RACK.reo_wnd window linearly for every round trip in which the
   sender receives a DSACK, so that after N distinct round trips in
   which a DSACK is received, the RACK.reo_wnd becomes (N+1) * min_RTT
   / 4, with an upper bound of SRTT.  The inflated RACK.reo_wnd
   persists for 16 loss recoveries, after which it resets to its
   starting value, min_RTT / 4.

   The following pseudocode implements the algorithm above.  Note that
   extensions that require additional TCP features (e.g. DSACK) would
   work if the feature functions simply return false.

   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If DSACK option is present:
           RACK.dsack = true

       If SND.UNA < RACK.rtt_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.rtt_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If RACK.reord is FALSE:
           If in loss recovery:  /* If in fast or timeout recovery */
               RACK.reo_wnd = 0
               Return
           Else if RACK.pkts_sacked >= RACK.dupthresh:
               RACK.reo_wnd = 0
               Return
       RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
       RACK.reo_wnd = min(RACK.reo_wnd, SRTT)
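   To make the adaptation concrete, the following non-normative Python
   sketch (the RTT values are hypothetical) computes RACK.reo_wnd
   after successive round trips with DSACKs, showing the linear growth
   and the SRTT cap:

   # Non-normative sketch: reordering window growth under repeated
   # DSACK events, assuming min_RTT = 40 ms and SRTT = 50 ms.
   min_rtt_ms, srtt_ms = 40.0, 50.0
   for n in range(6):  # n = round trips in which a DSACK was received
       reo_wnd = min((n + 1) * min_rtt_ms / 4, srtt_ms)
       print(n, reo_wnd)  # 10, 20, 30, 40, then capped at 50 ms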
   Step 5: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.  As a
   starting point, we consider that the reordering window has passed
   if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  delta in transmit time between a packet and the RACK.packet

   2.  delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

   RACK.xmit_ts >= Packet.xmit_ts
   AND
   (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

   now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is the RTT of the packet the
   sender used to set RACK.xmit_ts: the round trip time of the most
   recently (re)transmitted packet that's been delivered.  To be more
   robust to reordering, RACK uses a more conservative RTT value to
   decide if an unacknowledged packet should be considered lost,
   RACK.rtt: the round trip time of the most recently delivered packet
   on the connection that was not marked invalid as a possible
   spurious retransmission.

   When packets are delivered in order, the most recently
   (re)transmitted packet that's been delivered is also the most
   recently delivered, hence RACK.rtt == RACK.ack_ts - RACK.xmit_ts.
   But if packets were reordered, then the packet delivered most
   recently was sent before the most recently (re)transmitted packet.
   Hence RACK.rtt > (RACK.ack_ts - RACK.xmit_ts).

   Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous
   equation reduces to saying that the sender can declare a packet
   lost when:

   now >= Packet.xmit_ts + RACK.reo_wnd + RACK.rtt

   In turn, that is equivalent to stating that a RACK sender should
   declare a packet lost when:

   Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful to detect losses in a timely manner with TCP
   Segmentation Offload, where multiple packets in one TSO blob have
   identical timestamps.  It is also useful when the timestamp clock
   granularity is close to or longer than the actual round trip time.

   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, not acknowledged yet:
           If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
               Continue /* Packet lost but not yet retransmitted */

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.rtt +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Packet.lost = TRUE
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through packets in the SACK
   scoreboard above could be very costly on large-BDP networks since
   the inflight could be very large.  If the implementation can
   organize the scoreboard data structures to have packets sorted by
   the last (re)transmission time, then the loop can start on the
   least recently sent packet and abort on the first packet sent after
   RACK.xmit_ts.  This can be implemented by using a separate
   doubly-linked list sorted in time order.  The implementation
   inserts the packet at the tail of the list when it is
   (re)transmitted, and removes a packet from the list when it is
   delivered or marked lost.  We RECOMMEND such an optimization
   because it enables implementations to support high-BDP networks.
   This optimization is implemented in Linux and sees orders of
   magnitude improvement in CPU usage on high-speed WAN networks.
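   As a non-normative illustration of this optimization, the Python
   sketch below scans a scoreboard kept in transmission-time order and
   aborts at the first packet sent after RACK.packet.  All names,
   types, and values are illustrative, and arming the timer is left to
   the caller:

   import collections

   # Non-normative sketch of RACK_detect_loss() over a scoreboard
   # sorted by last (re)transmission time, oldest first.  Times are
   # in milliseconds.
   Packet = collections.namedtuple("Packet", "xmit_ts end_seq")
   RackState = collections.namedtuple(
       "RackState", "xmit_ts end_seq rtt reo_wnd")

   def sent_after(t1, seq1, t2, seq2):
       return t1 > t2 or (t1 == t2 and seq1 > seq2)

   def rack_detect_loss(unacked_by_xmit_time, rack, now):
       """Return (packets to mark lost, reordering-settling timeout)."""
       lost, timeout = [], 0
       for pkt in unacked_by_xmit_time:  # least recently sent first
           if not sent_after(rack.xmit_ts, rack.end_seq,
                             pkt.xmit_ts, pkt.end_seq):
               break  # sent after RACK.packet: abort the scan early
           remaining = pkt.xmit_ts + rack.rtt + rack.reo_wnd - now
           if remaining <= 0:
               lost.append(pkt)
           else:
               timeout = max(remaining, timeout)
       return lost, timeout

   # Example: RACK.packet was sent at t=120; the two packets sent
   # before it have expired their reordering windows, and the scan
   # stops at the packet sent at t=130.
   rack = RackState(xmit_ts=120, end_seq=5000, rtt=40, reo_wnd=10)
   unacked = [Packet(100, 2000), Packet(110, 3000), Packet(130, 6000)]
   print(rack_detect_loss(unacked, rack, now=170))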
7.3.  Tail Loss Probe: fast recovery for tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that can
   otherwise be recovered by RTOs only.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.  In either case the goal is to elicit more feedback from
   the receiver, in the form of an ACK (potentially with SACK blocks),
   to allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight of data is lost.

   2.  Tail losses of data segments at the end of an application
       transaction.

   3.  Tail losses of ACKs at the end of an application transaction.

   4.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

7.4.  Tail Loss Probe: An Example

   The following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.

   3.  When PTO fires, sender retransmits segment 10.

   4.  After an RTT, a SACK for packet 10 arrives.  The ACK also
       carries SACK holes for segments 6, 7, 8 and 9.  This triggers
       RACK-based loss recovery.

   5.  The connection enters fast recovery and retransmits the
       remaining lost segments.

7.5.  Tail Loss Probe Algorithm Details

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network, as defined
   in [RFC5681].
   RTO: The transport's retransmission timeout (RTO) is based on
   measured round-trip times (RTT) between the sender and receiver, as
   specified in [RFC6298] for TCP.

   PTO: Probe timeout (PTO) is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to
   an RTO.

   SRTT: smoothed round-trip time, computed as specified in [RFC6298].

   TLPRxtOut: a boolean indicating whether there is an unacknowledged
   TLP retransmission.

   TLPHighRxt: the value of SND.NXT at the time of sending a TLP
   retransmission.

   The TLP algorithm has three phases, which we discuss in turn.

7.5.1.  Phase 1: Scheduling a loss probe

   Step 1: Check conditions for scheduling a PTO.

   A sender should check to see if it should schedule a PTO in the
   following situations:

   1.  After transmitting new data that was not itself a TLP probe

   2.  Upon receiving an ACK that cumulatively acknowledges data

   3.  Upon receiving a SACK that selectively acknowledges data that
       was last sent before the segment with SEG.SEQ=SND.UNA was last
       (re)transmitted

   A sender should schedule a PTO only if all of the following
   conditions are met:

   1.  The connection supports SACK [RFC2018]

   2.  The connection has no SACKed sequences in the SACK scoreboard

   3.  The connection is not in loss recovery

   If a PTO can be scheduled according to these conditions, the sender
   should schedule a PTO.  If there was a previously scheduled PTO or
   RTO pending, then that pending PTO or RTO should first be
   cancelled, and then the new PTO should be scheduled.

   If a PTO cannot be scheduled according to these conditions, then
   the sender MUST arm the RTO timer if there is unacknowledged data
   in flight.

   Step 2: Select the duration of the PTO.

   A sender SHOULD use the following logic to select the duration of a
   PTO:

   TLP_timeout():
       If SRTT is available:
           PTO = 2 * SRTT
           If FlightSize = 1:
               PTO += WCDelAckT
           Else:
               PTO += 2ms
       Else:
           PTO = 1 sec

       If Now() + PTO > TCP_RTO_expire():
           PTO = TCP_RTO_expire() - Now()

   Aiming for a PTO value of 2*SRTT allows a sender to wait long
   enough to know that an ACK is overdue.  Under normal circumstances,
   i.e. no losses, an ACK typically arrives in one SRTT.  But choosing
   PTO to be exactly an SRTT is likely to generate spurious probes
   given that network delay variance and even end-system timings can
   easily push an ACK to be above an SRTT.  We chose PTO to be the
   next integral multiple of SRTT.

   Similarly, network delay variations and end-system processing
   latencies and timer granularities can easily delay ACKs beyond
   2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value
   (and MAY add more if the sending host OS timer granularity is
   coarser than 1ms).

   WCDelAckT stands for worst case delayed ACK timer.  When FlightSize
   is 1, PTO is inflated by WCDelAckT time to compensate for a
   potential long delayed ACK timer at the receiver.  The RECOMMENDED
   value for WCDelAckT is 200ms.

   Finally, if the time at which an RTO would fire (here denoted
   "TCP_RTO_expire") is sooner than the computed time for the PTO,
   then a probe is scheduled to be sent at that earlier time.
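   As a non-normative illustration, the PTO selection can be written
   as the following runnable Python sketch (times are in seconds;
   parameter names are illustrative, and rto_expire_at is the absolute
   time at which the RTO would fire):

   WC_DEL_ACK_T = 0.200  # worst-case delayed ACK timer: 200ms

   def tlp_timeout(srtt, flight_size, now, rto_expire_at):
       if srtt is not None:
           pto = 2 * srtt
           # Compensate for a potential delayed ACK when only one
           # segment is outstanding; otherwise add a 2ms cushion.
           pto += WC_DEL_ACK_T if flight_size == 1 else 0.002
       else:
           pto = 1.0  # no RTT sample available yet
       # Never schedule the probe later than the RTO would fire.
       return min(pto, rto_expire_at - now)

   # With SRTT = 100ms and several segments in flight, the PTO is
   # 202ms, unless the RTO would fire sooner.
   print(tlp_timeout(0.100, 10, now=0.0, rto_expire_at=1.0))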
7.5.2.  Phase 2: Sending a loss probe

   When the PTO fires, transmit a probe data segment:

   TLP_send_probe():
       If an unsent segment exists AND
          the receive window allows new data to be sent:
           Transmit the lowest-sequence unsent segment of up to SMSS
           Increment FlightSize by the size of the newly-sent segment

       Else if TLPRxtOut is not set:
           Retransmit the highest-sequence segment sent so far
           The cwnd remains unchanged

   When the loss probe is a retransmission, the sender uses the
   highest-sequence segment sent so far.  This is in order to deal
   with the retransmission ambiguity problem in TCP.  Suppose a sender
   sends N segments, and then retransmits the last segment (segment N)
   as a loss probe, and then the sender receives a SACK for segment N.
   As long as the sender waits for any required RACK reordering
   settling timer to then expire, it doesn't matter if that SACK was
   for the original transmission of segment N or the TLP
   retransmission; in either case the arrival of the SACK for segment
   N provides evidence that the N-1 segments preceding segment N were
   likely lost.  In the case where there is only one original
   outstanding segment of data (N=1), the same logic (trivially)
   applies: an ACK for a single outstanding segment tells the sender
   the N-1=0 segments preceding that segment were lost.  Furthermore,
   whether there are N>1 or N=1 outstanding segments, there is a
   question about whether the original last segment or its TLP
   retransmission was lost; the sender estimates this using TLP
   recovery detection (see below).

   Note that whether or not a probe was sent in TLP_send_probe(), the
   sender MUST arm the RTO timer, not the PTO timer, at the end of
   TLP_send_probe() if FlightSize is not zero.  This ensures that the
   sender does not send repeated, back-to-back TLP probes.  Checking
   TLPRxtOut prior to sending the loss probe is also critical to avoid
   TLP loops if an application writes periodically at an interval less
   than PTO.

7.5.3.  Phase 3: ACK processing

   On each incoming ACK, the sender should check the conditions in
   Step 1 of Phase 1 to see if it should schedule (or reschedule) the
   loss probe timer.

7.6.  TLP recovery detection

   If the only loss in an outstanding window of data was the last
   segment, then a TLP loss probe retransmission of that data segment
   might repair the loss.  TLP recovery detection examines ACKs to
   detect when the probe might have repaired a loss, and thus allows
   congestion control to properly reduce the congestion window (cwnd)
   [RFC5681].

   Consider a TLP retransmission episode where a sender retransmits a
   tail packet in a flight.  The TLP retransmission episode ends when
   the sender receives an ACK with a SEG.ACK above the SND.NXT at the
   time the episode started.  During the TLP retransmission episode
   the sender checks for a duplicate ACK or D-SACK indicating that
   both the original segment and TLP retransmission arrived at the
   receiver, meaning there was no loss that needed repairing.  If the
   TLP sender does not receive such an indication before the end of
   the TLP retransmission episode, then it MUST estimate that either
   the original data segment or the TLP retransmission was lost, and
   congestion control MUST react appropriately to that loss as it
   would any other loss.
   Since a significant fraction of the hosts that support SACK do not
   support duplicate selective acknowledgments (D-SACKs) [RFC2883],
   the TLP algorithm for detecting such lost segments relies only on
   basic SACK support [RFC2018].

7.6.1.  Initializing and resetting state

   When a connection is created, or suffers a retransmission timeout,
   or enters fast recovery, it executes the following:

   TLPRxtOut = false

7.6.2.  Recording loss probe states

   Senders MUST only send a TLP loss probe retransmission if TLPRxtOut
   is false.  This ensures that at any given time a connection has at
   most one outstanding TLP retransmission.  This allows the sender to
   use the algorithm described in this section to estimate whether any
   data segments were lost.

   Note that this condition only restricts TLP loss probes that are
   retransmissions.  There may be an arbitrary number of outstanding
   unacknowledged TLP loss probes that consist of new, previously-
   unsent data, since the retransmission timeout and fast recovery
   algorithms are sufficient to detect losses of such probe segments.

   Upon sending a TLP probe that is a retransmission, the sender sets
   TLPRxtOut to true and TLPHighRxt to SND.NXT.

   Detecting recoveries accomplished by loss probes

   Step 1: Track ACKs indicating receipt of original and retransmitted
   segments

   A sender considers both the original segment and TLP probe
   retransmission segment as acknowledged if either condition 1 or 2
   is true:

   1.  This is a duplicate acknowledgment (as defined in [RFC5681],
       section 2), and all of the following conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK == TLPHighRxt

       3.  SEG.ACK == SND.UNA

       4.  the segment contains no SACK blocks for sequence ranges
           above TLPHighRxt

       5.  the segment contains no data

       6.  the segment is not a window update

   2.  This is an ACK acknowledging a sequence number at or above
       TLPHighRxt and it contains a D-SACK; i.e. all of the following
       conditions are met:

       1.  TLPRxtOut is true

       2.  SEG.ACK >= TLPHighRxt

       3.  the ACK contains a D-SACK block

   If either of the conditions is met, then the sender estimates that
   the receiver received both the original data segment and the TLP
   probe retransmission, and so the sender considers the TLP episode
   to be done, and records that fact by setting TLPRxtOut to false.

   Step 2: Mark the end of a TLP retransmission episode and detect
   losses

   If the sender receives a cumulative ACK for data beyond the TLP
   loss probe retransmission then, in the absence of reordering on the
   return path of ACKs, it should have received any ACKs for the
   original segment and TLP probe retransmission segment.  At that
   time, if the TLPRxtOut flag is still true and thus indicates that
   the TLP probe retransmission remains unacknowledged, then the
   sender should presume that at least one of its data segments was
   lost, so it SHOULD invoke a congestion control response equivalent
   to fast recovery.

   More precisely, on each ACK the sender executes the following:

   if (TLPRxtOut and SEG.ACK >= TLPHighRxt) {
       TLPRxtOut = false
       EnterRecovery()
       ExitRecovery()
   }
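   As a non-normative illustration, the two steps above can be
   combined into a single per-ACK hook, sketched below in Python.  The
   state and parameter names are illustrative, and on_loss() stands in
   for the fast-recovery-equivalent congestion control response:

   from dataclasses import dataclass

   # Non-normative sketch of TLP recovery detection on each ACK.
   @dataclass
   class TlpState:
       tlp_rxt_out: bool = False
       tlp_high_rxt: int = 0

   def tlp_process_ack(state, seg_ack, snd_una, is_dup_ack,
                       sack_above_high_rxt, has_data,
                       is_window_update, has_dsack, on_loss):
       # Step 1: evidence that both the original segment and the TLP
       # retransmission arrived -> episode done, no loss to repair.
       if state.tlp_rxt_out:
           dup_ack_done = (is_dup_ack
                           and seg_ack == state.tlp_high_rxt
                           and seg_ack == snd_una
                           and not sack_above_high_rxt
                           and not has_data
                           and not is_window_update)
           dsack_done = seg_ack >= state.tlp_high_rxt and has_dsack
           if dup_ack_done or dsack_done:
               state.tlp_rxt_out = False
               return
       # Step 2: the episode ends once SEG.ACK reaches TLPHighRxt; if
       # TLPRxtOut is still set, presume a loss and let congestion
       # control react as it would to any other loss.
       if state.tlp_rxt_out and seg_ack >= state.tlp_high_rxt:
           state.tlp_rxt_out = False
           on_loss()

   # Example: a D-SACK arrives at/above TLPHighRxt during an episode.
   st = TlpState(tlp_rxt_out=True, tlp_high_rxt=1000)
   tlp_process_ack(st, 1000, 1000, False, False, False, False, True,
                   on_loss=lambda: print("loss"))
   assert st.tlp_rxt_out is False  # episode done, no loss response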
8.  RACK and TLP discussions

8.1.  Advantages

   The biggest advantage of RACK is that every data packet, whether it
   is an original data transmission or a retransmission, can be used
   to detect losses of the packets sent chronologically prior to it.

   Example: TAIL DROP.  Consider a sender that transmits a window of
   three data packets (P1, P2, P3), and P1 and P3 are lost.  Suppose
   the transmission of each packet is at least RACK.reo_wnd (1
   millisecond by default) after the transmission of the previous
   packet.  RACK will mark P1 as lost when the SACK of P2 is received,
   and this will trigger the retransmission of P1 as R1.  When R1 is
   cumulatively acknowledged, RACK will mark P3 as lost and the sender
   will retransmit P3 as R3.  This example illustrates how RACK is
   able to repair certain drops at the tail of a transaction without
   any timer.  Notice that neither the conventional duplicate ACK
   threshold [RFC5681], nor [RFC6675], nor the Forward Acknowledgment
   [FACK] algorithm can detect such losses, because of the required
   packet or sequence count.

   Example: LOST RETRANSMIT.  Consider a window of three data packets
   (P1, P2, P3) that are sent; P1 and P2 are dropped.  Suppose the
   transmission of each packet is at least RACK.reo_wnd (1 millisecond
   by default) after the transmission of the previous packet.  When P3
   is SACKed, RACK will mark P1 and P2 lost and they will be
   retransmitted as R1 and R2.  Suppose R1 is lost again but R2 is
   SACKed; RACK will mark R1 lost for retransmission again.  Again,
   neither the conventional three duplicate ACK threshold approach,
   nor [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can
   detect such losses.  And such a lost retransmission is very common
   when TCP is being rate-limited, particularly by token bucket
   policers with large bucket depth and low rate limit.
   Retransmissions are often lost repeatedly because standard
   congestion control requires multiple round trips to reduce the rate
   below the policed rate.

   Example: SMALL DEGREE OF REORDERING.  Consider a common reordering
   event: a window of packets are sent as (P1, P2, P3).  P1 and P2
   carry a full payload of MSS octets, but P3 has only a 1-octet
   payload.  Suppose the sender has detected reordering previously and
   thus RACK.reo_wnd is min_RTT/4.  Now P3 is reordered and delivered
   first, before P1 and P2.  As long as P1 and P2 are delivered within
   min_RTT/4, RACK will not consider P1 and P2 lost.  But if P1 and P2
   are delivered outside the reordering window, then RACK will still
   falsely mark P1 and P2 lost.  We discuss how to reduce false
   positives at the end of this section.

   The examples above show that RACK is particularly useful when the
   sender is limited by the application, which is common for
   interactive, request/response traffic.  Similarly, RACK still works
   when the sender is limited by the receive window, which is common
   for applications that use the receive window to throttle the
   sender.

   For some implementations (e.g., Linux), RACK works quite
   efficiently with TCP Segmentation Offload (TSO).  RACK always marks
   the entire TSO blob lost because the packets in the same TSO blob
   have the same transmission timestamp.  By contrast, the algorithms
   based on sequence counting (e.g., [RFC6675][RFC5681]) may mark only
   a subset of packets in the TSO blob lost, forcing the stack to
   perform expensive fragmentation of the TSO blob, or to selectively
   tag individual packets lost in the scoreboard.

8.2.  Disadvantages

   RACK requires the sender to record the transmission time of each
   packet sent at a clock granularity of one millisecond or finer.
   TCP implementations that record this already for RTT estimation do
   not require any new per-packet state.  But implementations that are
   not yet recording packet transmission times will need to add
   per-packet internal state (commonly either 4 or 8 octets per packet
   or TSO blob) to track transmission times.  In contrast, the
   conventional [RFC6675] loss detection approach does not require any
   per-packet state beyond the SACK scoreboard.  This absence of
   per-packet state is particularly useful on ultra-low RTT networks
   where the RTT is far less than the sender TCP clock granularity
   (e.g. inside data-centers).

   RACK can easily and optionally support the conventional approach in
   [RFC6675][RFC5681] by resetting the reordering window to zero when
   the threshold is met.  Note that this approach differs slightly
   from [RFC6675], which considers a packet lost when at least
   #DupThresh higher-sequence packets are SACKed.  RACK's approach
   considers a packet lost when at least one higher sequence packet is
   SACKed and the total number of SACKed packets is at least
   DupThresh.  For example, suppose a connection sends 10 packets, and
   packets 3, 5, 7 are SACKed.  [RFC6675] considers packets 1 and 2
   lost.  RACK considers packets 1, 2, 4, 6 lost.
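   The difference can be checked mechanically; the short non-normative
   Python sketch below reproduces the example above:

   # Non-normative sketch contrasting the two rules: 10 packets
   # sent, packets 3, 5 and 7 SACKed, DupThresh = 3.
   DUP_THRESH = 3
   sent = range(1, 11)
   sacked = {3, 5, 7}

   # [RFC6675]: lost if at least DupThresh higher-sequence packets
   # are SACKed.
   rfc6675_lost = [p for p in sent if p not in sacked
                   and sum(1 for s in sacked if s > p) >= DUP_THRESH]

   # RACK with a zero reordering window: lost if any higher-sequence
   # packet is SACKed and at least DupThresh packets are SACKed.
   rack_lost = [p for p in sent if p not in sacked
                and len(sacked) >= DUP_THRESH
                and any(s > p for s in sacked)]

   print(rfc6675_lost)  # [1, 2]
   print(rack_lost)     # [1, 2, 4, 6]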
8.3.  Adjusting the reordering window

   When the sender detects packet reordering, RACK uses a reordering
   window of min_rtt / 4.  It uses the minimum RTT to accommodate
   reordering introduced by packets traversing slightly different
   paths (e.g., router-based parallelism schemes) or out-of-order
   deliveries in the lower link layer (e.g., wireless links using
   link-layer retransmission).  RACK uses a quarter of minimum RTT
   because Linux TCP used the same factor in its implementation to
   delay Early Retransmit [RFC5827] to reduce spurious loss detections
   in the presence of reordering, and experience shows that this seems
   to work reasonably well.  We have evaluated using the smoothed RTT
   (SRTT from [RFC6298] RTT estimation) or the most recently measured
   RTT (RACK.rtt) in an experiment similar to that in the Performance
   Evaluation section.  Neither makes any significant difference in
   terms of total recovery latency.

8.4.  Relationships with other loss recovery algorithms

   The primary motivation of RACK is to ultimately provide a simple
   and general replacement for some of the standard loss recovery
   algorithms [RFC5681][RFC6675][RFC5827][RFC4653], as well as some
   nonstandard ones [FACK][THIN-STREAM].  While RACK can be a
   supplemental loss detection mechanism on top of these algorithms,
   this is not necessary, because RACK implicitly subsumes most of
   them.

   [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate
   ACK threshold based on the current or previous flight sizes.  RACK
   takes a different approach, by using only one ACK event and a
   reordering window.  RACK can be seen as an extended Early
   Retransmit [RFC5827] without a FlightSize limit but with an
   additional reordering window.  [FACK] considers an original packet
   to be lost when its sequence range is sufficiently far below the
   highest SACKed sequence.  In some sense RACK can be seen as a
   generalized form of FACK that operates in time space instead of
   sequence space, enabling it to better handle reordering,
   application-limited traffic, and lost retransmissions.
   Nevertheless RACK is still an experimental algorithm, while the
   oldest loss detection algorithm, the 3 duplicate ACK threshold
   [RFC5681], has been standardized and widely deployed.  RACK can
   therefore easily and optionally support the conventional approach
   for compatibility.

   RACK is compatible with and does not interfere with the standard
   RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel
   algorithms [RFC3522].  This is because RACK only detects loss by
   using ACK events.  It neither changes the RTO timer calculation nor
   detects spurious timeouts.

   Furthermore, RACK naturally works well with Tail Loss Probe [TLP]
   because a tail loss probe solicits either an ACK or SACK, which can
   be used by RACK to detect more losses.  RACK can be used to relax
   TLP's requirement for using FACK and retransmitting the
   highest-sequenced packet, because RACK is agnostic to packet
   sequence numbers, and uses transmission time instead.  Thus TLP
   could be modified to retransmit the first unacknowledged packet,
   which could improve application latency.

8.5.  Interaction with congestion control

   RACK intentionally decouples loss detection from congestion
   control.  RACK only detects losses; it does not modify the
   congestion control algorithm [RFC5681][RFC6937].  A packet marked
   lost by RACK SHOULD NOT be retransmitted until congestion control
   deems this appropriate.  Specifically, Proportional Rate Reduction
   [RFC6937] SHOULD be used when using RACK.

   RACK is applicable for both fast recovery and recovery after a
   retransmission timeout (RTO) in [RFC5681].  RACK applies equally to
   fast recovery and RTO recovery because RACK is purely based on the
   transmission time order of packets.  When a packet retransmitted by
   RTO is acknowledged, RACK will mark any unacked packet sent
   sufficiently prior to the RTO as lost, because at least one RTT has
   elapsed since these packets were sent.

   RACK may detect losses faster or slower than the conventional
   duplicate ACK threshold approach does.  RACK can detect losses
   faster by not requiring three DUPACKs, so congestion control may
   reduce the congestion window earlier.  When the network path has
   both reordering and losses, RACK detects losses slower by waiting
   for the reordering window to expire.  TCP may continue to increase
   the congestion window upon receiving ACKs during this time, making
   the sender more aggressive.  Certain congestion control algorithms
   can benefit from accounting for this increase in the congestion
   window during the reordering window.

8.5.1.  Example: interactions with congestion control

   The following simple example compares how RACK and non-RACK loss
   detection interact with congestion control: suppose a TCP sender
   has a congestion window (cwnd) of 20 packets on a SACK-enabled
   connection.  It sends 10 data packets and all of them are lost.

   Without RACK, the sender would time out, reset cwnd to 1, and
   retransmit the first packet.  It would take four round trips (1 + 2
   + 4 + 3 = 10) to retransmit all the 10 lost packets using slow
   start.  The recovery latency would be RTO + 4*RTT, with an ending
   cwnd of 4 packets due to congestion window validation.

   With RACK, a sender would send the TLP after 2*RTT and get a
   DUPACK.  If the sender implements Proportional Rate Reduction
   [RFC6937], it would slow start to retransmit the remaining 9 lost
   packets since the number of packets in flight (0) is lower than the
   slow start threshold (10).  The slow start would again take four
   round trips (1 + 2 + 4 + 3 = 10).  The recovery latency would be
   2*RTT + 4*RTT, with an ending cwnd set to the slow start threshold
   of 10 packets.
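   The latency difference can be made concrete with a small
   non-normative calculation, assuming a hypothetical RTT of 50ms and
   the 1-second minimum RTO of [RFC6298]:

   # Non-normative arithmetic for the example above.
   rtt, min_rto = 0.050, 1.0
   without_rack = min_rto + 4 * rtt  # RTO + 4*RTT   = 1.2 seconds
   with_rack = 2 * rtt + 4 * rtt     # 2*RTT + 4*RTT = 0.3 seconds
   print(without_rack, with_rack)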
   In both cases, the sender after the recovery would be in congestion
   avoidance.  The difference in recovery latency (RTO + 4*RTT vs
   6*RTT) can be significant if the RTT is much smaller than the
   minimum RTO (1 second in [RFC6298]) or if the RTT is large.  The
   former case is common in local area networks, data-center networks,
   or content distribution networks with deep deployments.  The latter
   case is more common in developing regions with highly congested
   and/or high-latency networks.  The ending congestion window after
   recovery also impacts subsequent data transfer.

8.6.  TLP recovery detection with delayed ACKs

   Delayed ACKs complicate the detection of repairs done by TLP, since
   with a delayed ACK the sender receives one fewer ACK than would
   normally be expected.  To mitigate this complication, before
   sending a TLP loss probe retransmission, the sender should attempt
   to wait long enough that the receiver has sent any delayed ACKs
   that it is withholding.  The sender algorithm described above
   features such a delay, in the form of WCDelAckT.  Furthermore, if
   the receiver supports duplicate selective acknowledgments (D-SACKs)
   [RFC2883] then in the case of a delayed ACK the sender's TLP
   recovery detection algorithm (see above) can use the D-SACK
   information to infer that the original and TLP retransmission both
   arrived at the receiver.

   If there is ACK loss or a delayed ACK without a D-SACK, then this
   algorithm is conservative, because the sender will reduce cwnd when
   in fact there was no packet loss.  In practice this is acceptable,
   and potentially even desirable: if there is reverse path congestion
   then reducing cwnd can be prudent.

8.7.  RACK for other transport protocols

   RACK can be implemented in other transport protocols.  The
   algorithm can be simplified by skipping step 3 if the protocol can
   support a unique transmission or packet identifier (e.g. TCP
   timestamp options [RFC7323]).  For example, the QUIC protocol
   implements RACK [QUIC-LR].  The [Sprout] loss detection algorithm
   was also independently designed to use a 10ms reordering window to
   improve its loss detection.

9.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, for both connections to
   users in the Internet and internally.  We conducted a performance
   evaluation experiment for RACK and TLP on a small set of Google Web
   servers in Western Europe that serve mostly European and some
   African countries.  The experiment lasted three days in March 2017.
   The servers were divided evenly into four groups of roughly 5.3
   million flows each:

   Group 1 (control): RACK off, TLP off, RFC 6675 on

   Group 2: RACK on, TLP off, RFC 6675 on

   Group 3: RACK on, TLP on, RFC 6675 on

   Group 4: RACK on, TLP on, RFC 6675 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.
9.  Experiments and Performance Evaluations

   RACK and TLP have been deployed at Google, both for connections to
   users in the Internet and for internal traffic.  We conducted a
   performance evaluation experiment for RACK and TLP on a small set of
   Google Web servers in Western Europe that serve mostly European and
   some African countries.  The experiment lasted three days in March
   2017.  The servers were divided evenly into four groups of roughly
   5.3 million flows each:

   Group 1 (control): RACK off, TLP off, RFC 6675 on

   Group 2: RACK on, TLP off, RFC 6675 on

   Group 3: RACK on, TLP on, RFC 6675 on

   Group 4: RACK on, TLP on, RFC 6675 off

   All groups used Linux with CUBIC congestion control, an initial
   congestion window of 10 packets, and the fq/pacing qdisc.  In terms
   of specific recovery features, all groups enabled F-RTO [RFC5682]
   but disabled FACK, because FACK is not specified in an IETF RFC and
   the goal of this setup is to compare RACK and TLP with RFC-based
   loss recoveries.  Since TLP depends on either FACK or RACK, we could
   not run a group that enables TLP alone (with both RACK and FACK
   disabled).  Group 4 tests whether RACK plus TLP can completely
   replace the DupThresh-based loss detection of [RFC6675].

   The servers sit behind a load balancer that distributes the
   connections evenly across the four groups.

   Each group handles a similar number of connections and sends and
   receives similar amounts of data.  We compare the total time spent
   in loss recovery across groups.  The recovery time is measured from
   when the recovery and retransmission start until the remote host has
   acknowledged the highest sequence (SND.NXT) at the time the recovery
   started.  Therefore the recovery time includes both fast recoveries
   and timeout recoveries.

   Our data shows that Group 2's recovery latency is only 0.3% lower
   than Group 1's.  But Group 3's recovery latency is 25% lower than
   Group 1's, due to a 40% reduction in RTO-triggered recoveries.
   Therefore it is important to implement both TLP and RACK for
   performance.  Group 4's total recovery latency is 0.02% lower than
   Group 3's, indicating that RACK plus TLP can successfully replace
   RFC 6675 as a standalone recovery mechanism.

   We want to emphasize that the current experiment is limited in terms
   of network coverage.  The connectivity in Western Europe is fairly
   good, so loss recovery is not a major performance bottleneck there.
   We plan to expand our experiments to regions with worse
   connectivity, in particular to networks with strong traffic
   policing.

10.  Security Considerations

   RACK does not change the risk profile for TCP.

   An interesting scenario is ACK-splitting attacks [SCWA99]: for an
   MSS-size packet sent, the receiver or the attacker might send MSS
   ACKs that SACK or acknowledge one additional byte per ACK.  This
   would not fool RACK.  RACK.xmit_ts would not advance, because all
   the sequences of the packet were transmitted at the same time and
   therefore carry the same transmission timestamp.  In other words,
   SACKing only one byte of a packet and SACKing the packet in its
   entirety have the same effect on RACK.
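
   To make the ACK-splitting argument concrete, the Python sketch below
   (with hypothetical names of our own; not part of this document)
   shows that one-byte SACKs advance RACK.xmit_ts no further than a
   single SACK covering the entire packet would.

   <CODE BEGINS>
   class Packet:
       def __init__(self, start_seq, end_seq, xmit_ts):
           self.start_seq = start_seq
           self.end_seq = end_seq
           self.xmit_ts = xmit_ts  # one send time for the whole packet

   rack_xmit_ts = 0.0

   def on_sack(packet):
       """Update RACK.xmit_ts for a SACK covering any part of a
       packet; the number of bytes SACKed is irrelevant."""
       global rack_xmit_ts
       rack_xmit_ts = max(rack_xmit_ts, packet.xmit_ts)

   pkt = Packet(start_seq=1, end_seq=1461, xmit_ts=100.0)
   for _ in range(1460):         # 1460 one-byte ACK-splitting SACKs...
       on_sack(pkt)
   assert rack_xmit_ts == 100.0  # ...have the same effect as one SACK
   <CODE ENDS>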
11.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

12.  Acknowledgments

   The authors thank Matt Mathis for his insights on FACK and Michael
   Welzl for his per-packet timer idea that inspired this work.  Eric
   Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
   Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
   Nishida, Bob Briscoe, Felix Weinrank, Michael Tuexen, and Martin
   Duke contributed to the draft or the implementations in Linux,
   FreeBSD, Windows, and QUIC.

13.  References

13.1.  Normative References

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
              Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC4737]  Morton, A., Ciavattone, L., Ramachandran, G., Shalunov,
              S., and J. Perser, "Packet Reordering Metrics", RFC 4737,
              November 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
              Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, April 2010.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
              M., and Y. Nishida, "A Conservative Loss Recovery
              Algorithm Based on Selective Acknowledgment (SACK) for
              TCP", RFC 6675, August 2012.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, May 2013.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              RFC 7323, September 2014.

   [RFC793]   Postel, J., "Transmission Control Protocol", RFC 793,
              September 1981.

13.2.  Informative References

   [FACK]     Mathis, M. and J. Mahdavi, "Forward acknowledgement:
              refining TCP congestion control", ACM SIGCOMM Computer
              Communication Review, Volume 26, Issue 4, October 1996.

   [POLICER16]
              Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
              Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
              "An Analysis of Traffic Policing in the Web", ACM
              SIGCOMM, 2016.

   [QUIC-LR]  Iyengar, J. and I. Swett, "QUIC Loss Recovery And
              Congestion Control", draft-tsvwg-quic-loss-recovery-01
              (work in progress), June 2016.

   [RFC7765]  Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl,
              "TCP and SCTP RTO Restart", RFC 7765, February 2016.

   [SCWA99]   Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
              "TCP Congestion Control With a Misbehaving Receiver", ACM
              Computer Communication Review, 29(5), 1999.

   [Sprout]   Winstein, K., Sivaraman, A., and H. Balakrishnan,
              "Stochastic Forecasts Achieve High Throughput and Low
              Delay over Cellular Networks", USENIX Symposium on
              Networked Systems Design and Implementation (NSDI), 2013.

   [THIN-STREAM]
              Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
              "TCP enhancements for interactive thin-stream
              applications", NOSSDAV, 2008.

   [TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
              "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of
              Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work
              in progress), August 2013.

Authors' Addresses

   Yuchung Cheng
   Google, Inc

   Email: ycheng@google.com

   Neal Cardwell
   Google, Inc

   Email: ncardwell@google.com

   Nandita Dukkipati
   Google, Inc

   Email: nanditad@google.com

   Priyaranjan Jha
   Google, Inc

   Email: priyarjha@google.com