TCP Maintenance Working Group                                   Y. Cheng
Internet-Draft                                               N. Cardwell
Intended status: Experimental                               N. Dukkipati
Expires: October 28, 2019                                         P. Jha
                                                             Google, Inc
                                                          April 26, 2019

        RACK: a time-based fast loss detection algorithm for TCP
                         draft-ietf-tcpm-rack-05

Abstract

   This document presents a new TCP loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time, instead of
   packet or sequence counts, to detect losses, for modern TCP
   implementations that can support per-packet timestamps and the
   selective acknowledgment (SACK) option.  It is intended to replace
   the conventional DUPACK threshold approach and its variants, as well
   as other nonstandard approaches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 28, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

1.  Introduction

   This document presents a new loss detection algorithm called RACK
   ("Recent ACKnowledgment").  RACK uses the notion of time instead of
   the conventional packet or sequence counting approaches for
   detecting losses.  RACK deems a packet lost if some packet sent
   sufficiently later has been delivered.  It does this by recording
   packet transmission times and inferring losses using cumulative
   acknowledgments or selective acknowledgment (SACK) TCP options.

   In the last couple of years we have been observing several
   increasingly common loss and reordering patterns in the Internet:

   1.  Lost retransmissions.  Traffic policers [POLICER16] and burst
       losses often cause retransmissions to be lost again, severely
       increasing TCP latency.

   2.  Tail drops.  Structured request-response traffic turns more
       losses into tail drops.  In such cases, TCP is application-
       limited, so it cannot send new data to probe losses and has to
       rely on retransmission timeouts (RTOs).

   3.  Reordering.  Link-layer protocols (e.g., 802.11 block ACK), link
       bonding, or routers' internal load-balancing can deliver TCP
       packets out of order.  The degree of such reordering is usually
       within the order of the path round-trip time.

   Although TCP stacks (e.g., Linux) implement many of the standard and
   proposed loss detection algorithms
   [RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK][THIN-STREAM],
   we've found that together they do not perform well.  The main reason
   is that many of them are based on the classic rule of counting
   duplicate acknowledgments [RFC5681].
   They can either detect loss quickly or accurately, but not both,
   especially when the sender is application-limited or subject to
   unpredictable reordering.  And under these conditions none of them
   can detect lost retransmissions well.

   Also, these algorithms, including RFCs, rarely address the
   interactions with other algorithms.  For example, FACK may consider
   a packet lost while RFC6675 may not.  Implementing N algorithms
   while dealing with N^2 interactions is a daunting and error-prone
   task.

   The goal of RACK is to solve all the problems above by replacing
   many of the loss detection algorithms above with one more effective
   algorithm to handle loss and reordering.

2.  Overview

   The main idea behind RACK is that if a packet has been delivered out
   of order, then the packets sent chronologically before it were
   either lost or reordered.  This concept is not fundamentally
   different from [RFC5681][RFC6675][FACK].  But the key innovation in
   RACK is to use a per-packet transmission timestamp and widely
   deployed SACK options to conduct time-based inferences, instead of
   inferring losses with packet or sequence counting approaches.

   Using a threshold for counting duplicate acknowledgments (i.e.,
   DupThresh) alone is no longer reliable because of today's prevalent
   reordering patterns.  A common type of reordering is that the last
   "runt" packet of a window's worth of packet bursts gets delivered
   first, then the rest arrive shortly after in order.  To handle this
   effectively, a sender would need to constantly adjust the DupThresh
   to the burst size; but this would risk increasing the frequency of
   RTOs on real losses.

   Today's prevalent lost retransmissions also cause problems with
   packet-counting approaches [RFC5681][RFC6675][FACK], since those
   approaches depend on reasoning in sequence number space.
   Retransmissions break the direct correspondence between ordering in
   sequence space and ordering in time.  So when retransmissions are
   lost, sequence-based approaches are often unable to infer and
   quickly repair losses that can be deduced with time-based
   approaches.

   Instead of counting packets, RACK uses the most recently delivered
   packet's transmission time to judge if some packets sent prior to
   that time have "expired" by passing a certain reordering settling
   window.  On each ACK, RACK marks any already-expired packets lost,
   and for any packets that have not yet expired it waits until the
   reordering window passes and then marks those lost as well.  In
   either case, RACK can repair the loss without waiting for a (long)
   RTO.  RACK can be applied to both fast recovery and timeout
   recovery, and can detect losses on both originally transmitted and
   retransmitted packets, making it a great all-weather loss detection
   mechanism.

3.  Design Rationale for Reordering Tolerance

   The reordering behavior of networks can evolve (over years) in
   response to the behavior of transport protocols and applications, as
   well as the needs of network designers and operators.  From a
   network or link designer's viewpoint, parallelization (e.g., link
   bonding) is the easiest way to get a network to go faster.
   Therefore their main constraint on speed is reordering, and there is
   pressure to relax that constraint.  If RACK becomes widely deployed,
   the underlying networks may introduce more reordering for higher
   throughput.  But this may result in excessive reordering that hurts
   end-to-end performance:

   1.  End host packet processing: extreme reordering on high-speed
       networks would incur high CPU cost by greatly reducing the
       effectiveness of aggregation mechanisms, such as large receive
       offload (LRO) and generic receive offload (GRO), and
       significantly increasing the number of ACKs.
   2.  Congestion control: TCP congestion control implicitly assumes
       that the feedback from ACKs is from the same bottleneck.
       Therefore it cannot handle well scenarios where packets are
       traversing largely disjoint paths.

   3.  Loss recovery: Having an excessively large reordering window to
       accommodate widely different latencies from different paths
       would increase the latency of loss recovery.

   An end-to-end transport protocol cannot tell immediately whether a
   hole is reordering or loss.  It can only distinguish between the
   two in hindsight, if the hole in the sequence space gets filled
   later without a retransmission.  How long the sender waits for such
   potential reordering events to settle is determined by the current
   reordering window.

   Given these considerations, a core design philosophy of RACK is to
   adapt to the measured duration of reordering events, within
   reasonable and specific bounds.  To accomplish this RACK places the
   following mandates on the reordering window:

   1.  The initial RACK reordering window SHOULD be set to a small
       fraction of the round-trip time.

   2.  If no reordering has been observed, then RACK SHOULD honor the
       classic 3-DUPACK rule for initiating fast recovery.  One simple
       way to implement this is to temporarily override the reorder
       window to 0.

   3.  The RACK reordering window SHOULD leverage Duplicate Selective
       Acknowledgement (DSACK) information [RFC3708] to adaptively
       estimate the duration of reordering events.

   4.  The RACK reordering window MUST be bounded, and this bound
       SHOULD be one round trip.

   As a flow starts, either condition 1 or condition 2, or both, would
   trigger RACK to start the recovery process quickly.  The low
   initial reordering window and use of the 3-DUPACK rule are key to
   achieving low-latency loss recovery for short flows, by risking
   spurious retransmissions to recover losses quickly.
   The rationale is that spurious retransmissions for short flows are
   not expected to produce excessive network traffic.

   For long flows the design tolerates reordering within a round trip.
   This handles reordering caused by path divergence on small time
   scales (reordering within the round-trip time of the shortest
   path), which should tolerate much of the reordering from link
   bonding, multipath routing, or link-layer out-of-order delivery.
   It also relaxes ordering constraints to allow sending flights of
   TCP packets on different paths dynamically for better
   load-balancing (e.g., flowlets).

   However, the fact that the initial RACK reordering window is low,
   and the RACK reordering window's adaptive growth is bounded, means
   that there will continue to be a cost to reordering and a limit to
   RACK's adaptation to reordering.  This maintains a disincentive for
   network designers and operators to introduce needless or excessive
   reordering, particularly because they have to allow for low
   round-trip time paths.  This means RACK will not encourage networks
   to perform inconsiderate fine-grained packet-spraying over highly
   disjoint paths with very different characteristics.  There are good
   alternative solutions, such as MPTCP, for such networks.

   To conclude, the RACK algorithm aims to adapt to small degrees of
   reordering, quickly recover most losses within one to two round
   trips, and avoid costly retransmission timeouts (RTOs).  In the
   presence of reordering, the adaptation algorithm can impose
   sometimes-needless delays when it waits to disambiguate loss from
   reordering, but the penalty for waiting is bounded to one round
   trip and such delays are confined to longer-running flows.

   This document provides a concrete and detailed reordering window
   adaptation algorithm for implementors.  We note that the specifics
   of the algorithm are likely to evolve over time.
   But that is a separate engineering optimization that is out of
   scope for this document.

4.  Requirements

   The reader is expected to be familiar with the definitions given in
   the TCP congestion control [RFC5681] and selective acknowledgment
   [RFC2018] RFCs.  Familiarity with the conservative SACK-based
   recovery for TCP [RFC6675] is not expected but helps.

   RACK has three requirements:

   1.  The connection MUST use selective acknowledgment (SACK) options
       [RFC2018].

   2.  For each packet sent, the sender MUST store its most recent
       transmission time with (at least) millisecond granularity.  For
       round-trip times lower than a millisecond (e.g.,
       intra-datacenter communications) microsecond granularity would
       significantly help the detection latency but is not required.

   3.  For each packet sent, the sender MUST remember whether the
       packet has been retransmitted or not.

   We assume that requirement 1 implies the sender keeps a SACK
   scoreboard, which is a data structure to store selective
   acknowledgment information on a per-connection basis ([RFC6675]
   section 3).  For ease of explaining the algorithm, we use a
   pseudo-scoreboard that manages the data in sequence number ranges.
   But the specifics of the data structure are left to the
   implementor.

   RACK does not require any change on the receiver.

5.  Definitions

5.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].  In
   this document, these words will appear with that interpretation
   only when in UPPER CASE.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

   The reader is expected to be familiar with the definitions given in
   [RFC793], including SND.UNA, SND.NXT, SEG.ACK, and SEG.SEQ.

5.2.  Definitions of variables

   A sender implementing RACK needs to store these new RACK variables:

   "Packet.xmit_ts" is the time of the last transmission of a data
   packet, including retransmissions, if any.  The sender needs to
   record the transmission time for each packet sent and not yet
   acknowledged.  The time MUST be stored at millisecond granularity
   or finer.

   "RACK.packet": among all the packets that have been either
   selectively or cumulatively acknowledged, RACK.packet is the one
   that was sent most recently (including retransmissions).

   "RACK.xmit_ts" is the latest transmission timestamp of RACK.packet.

   "RACK.end_seq" is the ending TCP sequence number of RACK.packet.

   "RACK.rtt" is the RTT of the most recently transmitted packet that
   has been delivered (either cumulatively acknowledged or selectively
   acknowledged) on the connection.

   "RACK.rtt_seq" is the SND.NXT when RACK.rtt is updated.

   "RACK.reo_wnd" is a reordering window computed in the unit of time
   used for recording packet transmission times.  It is used to defer
   the moment at which RACK marks a packet lost.

   "RACK.dupthresh" is a constant specifying the number of duplicate
   acknowledgments, or selectively acknowledged segments, that can
   (under certain conditions) trigger fast recovery, similar to
   [RFC6675].  As in [RFC5681] and [RFC6675], this threshold is
   defined to be 3.

   "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of
   the connection.

   "RACK.ack_ts" is the time when all the sequences in RACK.packet
   were selectively or cumulatively acknowledged.
   "RACK.reo_wnd_incr" is the multiplier applied to adjust
   RACK.reo_wnd.

   "RACK.reo_wnd_persist" is the number of loss recoveries before
   resetting RACK.reo_wnd.

   "RACK.dsack" indicates whether a DSACK option has been received
   since the last RACK.reo_wnd change.

   "RACK.pkts_sacked" is the total number of packets selectively
   acknowledged in the SACK scoreboard.

   "RACK.reord" indicates whether the connection has detected packet
   reordering event(s).

   "RACK.fack" is the highest selectively or cumulatively acknowledged
   sequence.

   Note that the Packet.xmit_ts variable is per packet in flight.  The
   RACK.xmit_ts, RACK.end_seq, RACK.rtt, RACK.reo_wnd, and
   RACK.min_RTT variables are kept in the per-connection TCP control
   block.  RACK.packet and RACK.ack_ts are used as local variables in
   the algorithm.

6.  Algorithm Details

6.1.  Transmitting a data packet

   Upon transmitting a new packet or retransmitting an old packet,
   record the time in Packet.xmit_ts.  RACK does not care whether the
   retransmission is triggered by an ACK, new application data, an
   RTO, or any other means.

6.2.  Upon receiving an ACK

   Step 1: Update RACK.min_RTT.

   Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
   update the estimated minimum RTT in RACK.min_RTT.  The sender can
   track a simple global minimum of all RTT measurements from the
   connection, or a windowed min-filtered value of recent RTT
   measurements.  This document does not specify an exact approach.

   Step 2: Update RACK stats.

   Given the information provided in an ACK, each packet cumulatively
   ACKed or SACKed is marked as delivered in the scoreboard.  Among
   all the packets newly ACKed or SACKed in the connection, record the
   most recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of
   RACK.xmit_ts.
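   For concreteness, the per-packet and per-connection RACK state
   defined above can be sketched as follows.  This is a hypothetical
   Python sketch, not a normative layout: the draft leaves the actual
   data structures to the implementor, and the field names simply
   mirror the draft's variable names.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Per-packet state kept for each packet in flight."""
    end_seq: int            # ending TCP sequence number of this packet
    xmit_ts: float = 0.0    # time of the last (re)transmission
    retransmitted: bool = False
    lost: bool = False


@dataclass
class RackState:
    """Per-connection RACK state kept in the TCP control block."""
    xmit_ts: float = 0.0          # transmission time of RACK.packet
    end_seq: int = 0              # ending sequence of RACK.packet
    rtt: float = 0.0              # RTT of most recent delivered packet
    rtt_seq: int = 0              # SND.NXT when rtt was updated
    reo_wnd: float = 0.0          # reordering window, in time units
    min_rtt: float = float("inf")  # estimated minimum RTT
    dupthresh: int = 3            # classic DUPACK threshold [RFC5681]
    reo_wnd_incr: int = 1         # multiplier applied to reo_wnd
    reo_wnd_persist: int = 16     # recoveries before resetting reo_wnd
    dsack: bool = False           # DSACK seen since last reo_wnd change
    pkts_sacked: int = 0          # packets SACKed in the scoreboard
    reord: bool = False           # reordering detected
    fack: int = 0                 # highest S/ACKed sequence
```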
   Sometimes RACK.packet and the newly delivered packet could carry
   the same transmit timestamp due to clock granularity or
   segmentation offloading (i.e., the two packets were sent as a jumbo
   frame into the NIC).  In that case the sequence numbers of
   RACK.end_seq and Packet.end_seq are compared to break the tie.

   Since an ACK can also acknowledge retransmitted data packets,
   RACK.rtt can be vastly underestimated if the retransmission was
   spurious.  To avoid that, ignore a packet if any of its TCP
   sequences have been retransmitted before and either of two
   conditions is true:

   1.  The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
       option [RFC7323], if available, indicates the ACK was not
       acknowledging the last retransmission of the packet.

   2.  The packet was last retransmitted less than RACK.min_rtt ago.

   If the ACK is not ignored as invalid, update RACK.rtt to be the RTT
   sample calculated using this ACK, and continue.  If this ACK or
   SACK was for the most recently sent packet, then record the
   RACK.xmit_ts timestamp and RACK.end_seq sequence implied by this
   ACK.  Otherwise exit here and omit the following steps.

   Note that the second condition above is a heuristic.  This
   heuristic would fail to update RACK stats if the packet is
   spuriously retransmitted because of a recent minimum RTT decrease
   (e.g., a path change).  Consequently RACK may not detect losses
   from ACK events, and the recovery resorts to the (slower) TLP or
   RTO timer-based events.  However, such events should be rare, and
   the connection would pick up the new minimum RTT when the recovery
   ends, to avoid repeated similar failures.
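   The two "ignore" conditions above can be sketched as a single
   predicate.  This is a hypothetical illustration (function and
   parameter names are ours); it assumes TSecr values are comparable
   with the sender's transmission timestamps, as in the pseudocode
   later in this step.

```python
def rtt_sample_ok(retransmitted, last_xmit_ts, tsecr, min_rtt, now):
    """Return True if an ACK for this packet may update RACK.rtt.

    A packet that was never retransmitted always yields a valid
    sample.  Otherwise the sample is rejected if the ACK's TSecr
    shows it does not acknowledge the last retransmission, or if the
    apparent RTT is implausibly below the connection's minimum RTT.
    """
    if not retransmitted:
        return True
    # Condition 1: TSecr (if available) predates the last
    # retransmission, so the ACK is for an earlier transmission.
    if tsecr is not None and tsecr < last_xmit_ts:
        return False
    # Condition 2: last retransmission was less than min_rtt ago, so
    # the ACK most likely acknowledges the original transmission.
    if now - last_xmit_ts < min_rtt:
        return False
    return True
```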
   Step 2 may be summarized in pseudocode as:

   RACK_sent_after(t1, seq1, t2, seq2):
       If t1 > t2:
           Return true
       Else if t1 == t2 AND seq1 > seq2:
           Return true
       Else:
           Return false

   RACK_update():
       For each Packet newly acknowledged cumulatively or selectively:
           rtt = Now() - Packet.xmit_ts
           If Packet.retransmitted is TRUE:
               If ACK.ts_option.echo_reply < Packet.xmit_ts:
                   Return
               If rtt < RACK.min_rtt:
                   Return

           RACK.rtt = rtt
           If RACK_sent_after(Packet.xmit_ts, Packet.end_seq,
                              RACK.xmit_ts, RACK.end_seq):
               RACK.xmit_ts = Packet.xmit_ts

   Step 3: Detect packet reordering.

   To detect reordering, the sender looks for original data packets
   being delivered out of order in sequence space.  The sender tracks
   the highest sequence selectively or cumulatively acknowledged in
   the RACK.fack variable.  The name "fack" stands for the most
   "forward ACK" and originated from the [FACK] draft.  If the ACK
   selectively or cumulatively acknowledges an unacknowledged and also
   never retransmitted sequence below RACK.fack, then the
   corresponding packet has been reordered and RACK.reord is set to
   TRUE.

   The heuristic above only detects reordering if the reordered packet
   has not yet been retransmitted.  This is a major drawback because
   if RACK has a low reordering window and the network is reordering
   packets, RACK may falsely retransmit frequently.  Consequently RACK
   may fail to detect reordering, and thus fail to increase the
   reordering window, because the reordered packets were already
   (falsely) retransmitted.

   DSACK [RFC3708] can help mitigate this issue.  The false
   retransmission would solicit a DSACK option in the ACK.  Therefore
   if the ACK has a DSACK option covering sequences that were both
   acknowledged and retransmitted, this implies the original packet
   was reordered but RACK retransmitted the packet too quickly; the
   sender should then set RACK.reord to TRUE.
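   The reordering check just described can also be sketched in
   runnable form.  This is a hypothetical sketch (names are ours), in
   which newly acknowledged and DSACKed packets are reduced to
   (end_seq, retransmitted) pairs:

```python
def detect_reordering(acked_packets, dsack_packets, fack, reord):
    """Update (fack, reord) from newly S/ACKed and DSACKed packets.

    acked_packets and dsack_packets are (end_seq, retransmitted)
    pairs.  A never-retransmitted packet S/ACKed below the current
    forward ACK point implies out-of-order delivery.
    """
    for end_seq, retransmitted in acked_packets:
        if end_seq > fack:
            fack = end_seq              # advance the forward ACK point
        elif end_seq < fack and not retransmitted:
            reord = True                # original packet delivered late
    # A DSACK covering a retransmitted packet means the retransmission
    # was spurious: the original packet was merely reordered.
    for _end_seq, retransmitted in dsack_packets:
        if retransmitted:
            reord = True
    return fack, reord
```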
   RACK_detect_reordering():
       For each Packet newly acknowledged cumulatively or selectively:
           If Packet.end_seq > RACK.fack:
               RACK.fack = Packet.end_seq
           Else if Packet.end_seq < RACK.fack AND
                   Packet.retransmitted is FALSE:
               RACK.reord = TRUE

       For each Packet covered by the DSACK option:
           If Packet.retransmitted is TRUE:
               RACK.reord = TRUE

   Step 4: Update RACK reordering window.

   To handle the prevalent small degree of reordering, RACK.reo_wnd
   serves as an allowance for settling time before marking a packet
   lost.  This section documents a detailed algorithm following the
   design rationale section.  RACK starts initially with a
   conservative window of min_RTT/4.  If no reordering has been
   observed, RACK uses a RACK.reo_wnd of 0 during loss recovery, or
   when the number of DUPACKs exceeds the classic DUPACK threshold, in
   order to retransmit quickly.  The subtle difference between this
   approach and the conventional one [RFC5681][RFC6675] is discussed
   later in the section "RACK and TLP discussions".

   Further, RACK MAY use DSACK [RFC3708] to adapt the reordering
   window to higher degrees of reordering, if DSACK is supported.
   Receiving an ACK with a DSACK indicates a spurious retransmission,
   which in turn suggests that the RACK reordering window,
   RACK.reo_wnd, is likely too small.  The sender MAY increase the
   RACK.reo_wnd window linearly for every round trip in which the
   sender receives a DSACK, so that after N distinct round trips in
   which a DSACK is received, the RACK.reo_wnd becomes
   (N+1) * min_RTT / 4, with an upper bound of SRTT.  The inflated
   RACK.reo_wnd persists for 16 loss recoveries, after which it resets
   to its starting value, min_RTT / 4.

   The following pseudocode implements the above algorithm.  Note that
   extensions that require additional TCP features (e.g., DSACK) would
   work if the feature functions simply return false.
   RACK_update_reo_wnd():
       RACK.min_RTT = TCP_min_RTT()
       If DSACK option is present:
           RACK.dsack = true

       If SND.UNA < RACK.rtt_seq:
           RACK.dsack = false /* React to DSACK once per round trip */

       If RACK.dsack:
           RACK.reo_wnd_incr += 1
           RACK.dsack = false
           RACK.rtt_seq = SND.NXT
           RACK.reo_wnd_persist = 16 /* Keep window for 16 recoveries */
       Else if exiting loss recovery:
           RACK.reo_wnd_persist -= 1
           If RACK.reo_wnd_persist <= 0:
               RACK.reo_wnd_incr = 1

       If RACK.reord is FALSE:
           If in loss recovery: /* If in fast or timeout recovery */
               RACK.reo_wnd = 0
               Return
           Else if RACK.pkts_sacked >= RACK.dupthresh:
               RACK.reo_wnd = 0
               Return
       RACK.reo_wnd = RACK.min_RTT / 4 * RACK.reo_wnd_incr
       RACK.reo_wnd = min(RACK.reo_wnd, SRTT)

   Step 5: Detect losses.

   For each packet that has not been SACKed, if RACK.xmit_ts is after
   Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its
   corresponding sequence range) lost in the scoreboard.  The
   rationale is that if another packet that was sent later has been
   delivered, and the reordering window or "reordering settling time"
   has already passed, then the packet was likely lost.

   If another packet that was sent later has been delivered, but the
   reordering window has not passed, then it is not yet safe to deem
   the unacked packet lost.  Using the basic algorithm above, the
   sender would wait for the next ACK to further advance RACK.xmit_ts;
   but this risks a timeout (RTO) if no more ACKs come back (e.g., due
   to losses or application limit).  For timely loss detection, the
   sender MAY install a "reordering settling" timer set to fire at the
   earliest moment at which it is safe to conclude that some packet is
   lost.  The earliest moment is the time it takes to expire the
   reordering window of the earliest unacked packet in flight.

   This timer expiration value can be derived as follows.
   As a starting point, we consider that the reordering window has
   passed if the RACK.packet was sent sufficiently after the packet in
   question, or a sufficient time has elapsed since the RACK.packet
   was S/ACKed, or some combination of the two.  More precisely, RACK
   marks a packet as lost if the reordering window for a packet has
   elapsed through the sum of:

   1.  the delta in transmit time between the packet and RACK.packet

   2.  the delta in time between RACK.ack_ts and now

   So we mark a packet as lost if:

      RACK.xmit_ts >= Packet.xmit_ts
      AND
      (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) >= RACK.reo_wnd

   If we solve this second condition for "now", the moment at which we
   can declare a packet lost, then we get:

      now >= Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)

   Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we
   used to set RACK.xmit_ts, so this reduces to:

      Packet.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0

   The following pseudocode implements the algorithm above.  When an
   ACK is received or the RACK timer expires, call RACK_detect_loss().
   The algorithm includes an additional optimization to break
   timestamp ties by using the TCP sequence space.  The optimization
   is particularly useful to detect losses in a timely manner with TCP
   Segmentation Offload, where multiple packets in one TSO blob have
   identical timestamps.  It is also useful when the timestamp clock
   granularity is close to or longer than the actual round-trip time.
   RACK_detect_loss():
       timeout = 0

       For each packet, Packet, not acknowledged yet:
           If Packet.lost is TRUE AND Packet.retransmitted is FALSE:
               Continue /* Packet lost but not yet retransmitted */

           If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
                              Packet.xmit_ts, Packet.end_seq):
               remaining = Packet.xmit_ts + RACK.rtt +
                           RACK.reo_wnd - Now()
               If remaining <= 0:
                   Packet.lost = TRUE
               Else:
                   timeout = max(remaining, timeout)

       If timeout != 0:
           Arm a timer to call RACK_detect_loss() after timeout

   Implementation optimization: looping through packets in the SACK
   scoreboard above could be very costly on large-BDP networks, since
   the inflight could be very large.  If the implementation can
   organize the scoreboard data structures to have packets sorted by
   the last (re)transmission time, then the loop can start on the
   least recently sent packet and abort on the first packet sent after
   RACK.xmit_ts.  This can be implemented by using a separate list
   sorted in time order.  The implementation inserts the packet at the
   tail of the list when it is (re)transmitted, and removes a packet
   from the list when it is delivered or marked lost.  We RECOMMEND
   such an optimization because it enables implementations to support
   high-BDP networks.  This optimization is implemented in Linux and
   sees orders of magnitude improvement in CPU usage on high-speed WAN
   networks.

6.3.  Tail Loss Probe: fast recovery for tail losses

   This section describes a supplemental algorithm, Tail Loss Probe
   (TLP), which leverages RACK to further reduce RTO recoveries.  TLP
   triggers fast recovery to quickly repair tail losses that can
   otherwise be recovered by RTOs only.  After an original data
   transmission, TLP sends a probe data segment within one to two
   RTTs.  The probe data segment can either be new, previously unsent
   data, or a retransmission of previously sent data just below
   SND.NXT.
   In either case the goal is to elicit more feedback from the
   receiver, in the form of an ACK (potentially with SACK blocks), to
   allow RACK to trigger fast recovery instead of an RTO.

   An RTO occurs when the first unacknowledged sequence number is not
   acknowledged after a conservative period of time has elapsed
   [RFC6298].  Common causes of RTOs include:

   1.  The entire flight of data is lost.

   2.  Tail losses of data segments at the end of an application
       transaction.

   3.  Tail losses of ACKs at the end of an application transaction.

   4.  Lost retransmits, which can halt fast recovery based on
       [RFC6675] if the ACK stream completely dries up.  For example,
       consider a window of three data packets (P1, P2, P3) that are
       sent; P1 and P2 are dropped.  On receipt of a SACK for P3, RACK
       marks P1 and P2 as lost and retransmits them as R1 and R2.
       Suppose R1 and R2 are lost as well, so there are no more
       returning ACKs to detect R1 and R2 as lost.  Recovery stalls.

   5.  An unexpectedly long round-trip time (RTT).  This can cause
       ACKs to arrive after the RTO timer expires.  The F-RTO
       algorithm [RFC5682] is designed to detect such spurious
       retransmission timeouts and at least partially undo the
       consequences of such events, but F-RTO cannot be used in many
       situations.

6.4.  Tail Loss Probe: An Example

   Following is an example of TLP.  All events listed are at a TCP
   sender.

   1.  Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There
       is no more new data to transmit.  A PTO is scheduled to fire in
       2 RTTs, after the transmission of the 10th segment.

   2.  Sender receives acknowledgements (ACKs) for segments 1-5;
       segments 6-10 are lost and no ACKs are received.  The sender
       reschedules its PTO timer relative to the last received ACK,
       which is the ACK for segment 5 in this case.  The sender sets
       the PTO interval using the calculation described in step (2) of
       the algorithm.
659 3. When PTO fires, sender retransmits segment 10. 661 4. After an RTT, a SACK for packet 10 arrives. The ACK also carries 662 SACK holes for segments 6, 7, 8 and 9. This triggers RACK-based 663 loss recovery. 665 5. The connection enters fast recovery and retransmits the remaining 666 lost segments. 668 6.5. Tail Loss Probe Algorithm Details 670 We define the terminology used in specifying the TLP algorithm: 672 FlightSize: amount of outstanding data in the network, as defined in 673 [RFC5681]. 675 RTO: The transport's retransmission timeout (RTO) is based on 676 measured round-trip times (RTT) between the sender and receiver, as 677 specified in [RFC6298] for TCP. PTO: Probe timeout (PTO) is a timer 678 event indicating that an ACK is overdue. Its value is constrained to 679 be smaller than or equal to an RTO. 681 SRTT: smoothed round-trip time, computed as specified in [RFC6298]. 683 The TLP algorithm has three phases, which we discuss in turn. 685 6.5.1. Phase 1: Scheduling a loss probe 687 Step 1: Check conditions for scheduling a PTO. 689 A sender should check to see if it should schedule a PTO in the 690 following situations: 692 1. After transmitting new data that was not itself a TLP probe 694 2. Upon receiving an ACK that cumulatively acknowledges data 696 3. Upon receiving a SACK that selectively acknowledges data that was 697 last sent before the segment with SEG.SEQ=SND.UNA was last 698 (re)transmitted 700 A sender should schedule a PTO only if all of the following 701 conditions are met: 703 1. The connection supports SACK [RFC2018] 705 2. The connection has no SACKed sequences in the SACK scoreboard 707 3. The connection is not in loss recovery 708 If a PTO can be scheduled according to these conditions, the sender 709 should schedule a PTO. If there was a previously scheduled PTO or 710 RTO pending, then that pending PTO or RTO should first be cancelled, 711 and then the new PTO should be scheduled. 
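As a non-normative illustration, the three scheduling conditions above can be expressed in a short Python sketch; the `Conn` structure and its field names are hypothetical, not part of this specification:

```python
from dataclasses import dataclass

@dataclass
class Conn:
    sack_supported: bool    # SACK [RFC2018] negotiated on this connection
    sacked_out: int         # number of SACKed sequences in the SACK scoreboard
    in_loss_recovery: bool  # whether loss recovery is in progress

def can_schedule_pto(conn: Conn) -> bool:
    """A PTO may be scheduled only if all three conditions hold."""
    return (conn.sack_supported
            and conn.sacked_out == 0
            and not conn.in_loss_recovery)
```

A sender that passes this check cancels any pending PTO or RTO before arming the new PTO, as described above.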
713 If a PTO cannot be scheduled according to these conditions, then the 714 sender MUST arm the RTO timer if there is unacknowledged data in 715 flight. 717 Step 2: Select the duration of the PTO. 719 A sender SHOULD use the following logic to select the duration of a 720 PTO: 722 TLP_timeout(): 723 If SRTT is available: 724 PTO = 2 * SRTT 725 If FlightSize = 1: 726 PTO += WCDelAckT 727 Else: 728 PTO += 2ms 729 Else: 730 PTO = 1 sec 732 If Now() + PTO > TCP_RTO_expire(): 733 PTO = TCP_RTO_expire() - Now() 735 Aiming for a PTO value of 2*SRTT allows a sender to wait long enough 736 to know that an ACK is overdue. Under normal circumstances, i.e. no 737 losses, an ACK typically arrives in one SRTT. But choosing PTO to be 738 exactly an SRTT is likely to generate spurious probes given that 739 network delay variance and even end-system timings can easily push an 740 ACK to be above an SRTT. We chose PTO to be the next integral 741 multiple of SRTT. 743 Similarly, network delay variations and end-system processing 744 latencies and timer granularities can easily delay ACKs beyond 745 2*SRTT, so senders SHOULD add at least 2ms to a computed PTO value 746 (and MAY add more if the sending host OS timer granularity is more 747 coarse than 1ms). 749 WCDelAckT stands for worst case delayed ACK timer. When FlightSize 750 is 1, PTO is inflated by WCDelAckT time to compensate for a potential 751 long delayed ACK timer at the receiver. The RECOMMENDED value for 752 WCDelAckT is 200ms. 754 Finally, if the time at which an RTO would fire (here denoted 755 "TCP_RTO_expire") is sooner than the computed time for the PTO, then 756 a probe is scheduled to be sent at that earlier time. 758 6.5.2. 
Phase 2: Sending a loss probe 760 When the PTO fires, transmit a probe data segment: 762 TLP_send_probe(): 763 If an unsent segment exists AND 764 the receive window allows new data to be sent: 765 Transmit the lowest-sequence unsent segment of up to SMSS 766 Increment FlightSize by the size of the newly-sent segment 767 Else: 768 Retransmit the highest-sequence segment sent so far 769 The cwnd remains unchanged 771 When the loss probe is a retransmission, the sender uses the highest- 772 sequence segment sent so far. This is in order to deal with the 773 retransmission ambiguity problem in TCP. Suppose a sender sends N 774 segments, and then retransmits the last segment (segment N) as a loss 775 probe, and then the sender receives a SACK for segment N. As long as 776 the sender waits for any required RACK reordering settling timer to 777 then expire, it does not matter if that SACK was for the original 778 transmission of segment N or the TLP retransmission; in either case 779 the arrival of the SACK for segment N provides evidence that the N-1 780 segments preceding segment N were likely lost. In the case where 781 there is only one original outstanding segment of data (N=1), the 782 same logic (trivially) applies: an ACK for a single outstanding 783 segment tells the sender the N-1=0 segments preceding that segment 784 were lost. Furthermore, whether there are N>1 or N=1 outstanding 785 segments, there is a question about whether the original last segment 786 or its TLP retransmission was lost; the sender estimates this using 787 TLP recovery detection (see below). 789 Note that after transmitting a TLP, the sender MUST arm an RTO timer, 790 and not the PTO timer. This ensures that the sender does not send 791 repeated, back-to-back TLP probes. This is important to avoid TLP 792 loops if an application writes periodically at an interval less than 793 PTO. 795 6.5.3.
Phase 3: ACK processing 797 On each incoming ACK, the sender should check the conditions in Step 798 1 of Phase 1 to see if it should schedule (or reschedule) the loss 799 probe timer. 801 6.6. TLP recovery detection 803 If the only loss in an outstanding window of data was the last 804 segment, then a TLP loss probe retransmission of that data segment 805 might repair the loss. TLP recovery detection examines ACKs to 806 detect when the probe might have repaired a loss, and thus allows 807 congestion control to properly reduce the congestion window (cwnd) 808 [RFC5681]. 810 Consider a TLP retransmission episode where a sender retransmits a 811 tail packet in a flight. The TLP retransmission episode ends when 812 the sender receives an ACK with a SEG.ACK above the SND.NXT at the 813 time the episode started. During the TLP retransmission episode the 814 sender checks for a duplicate ACK or D-SACK indicating that both the 815 original segment and TLP retransmission arrived at the receiver, 816 meaning there was no loss that needed repairing. If the TLP sender 817 does not receive such an indication before the end of the TLP 818 retransmission episode, then it MUST estimate that either the 819 original data segment or the TLP retransmission was lost, and 820 congestion control MUST react appropriately to that loss as it would 821 any other loss. 823 Since a significant fraction of the hosts that support SACK do not 824 support duplicate selective acknowledgments (D-SACKs) [RFC2883], the 825 TLP algorithm for detecting such lost segments relies only on basic 826 SACK support [RFC2018]. 828 Definitions of variables 830 TLPRxtOut: a boolean indicating whether there is an unacknowledged 831 TLP retransmission. 833 TLPHighRxt: the value of SND.NXT at the time of sending a TLP 834 retransmission. 836 6.6.1.
Initializing and resetting state 838 When a connection is created, or suffers a retransmission timeout, or 839 enters fast recovery, it executes the following: 841 TLPRxtOut = false 843 6.6.2. Recording loss probe states 845 Senders MUST only send a TLP loss probe retransmission if TLPRxtOut 846 is false. This ensures that at any given time a connection has at 847 most one outstanding TLP retransmission. This allows the sender to 848 use the algorithm described in this section to estimate whether any 849 data segments were lost. 851 Note that this condition only restricts TLP loss probes that are 852 retransmissions. There may be an arbitrary number of outstanding 853 unacknowledged TLP loss probes that consist of new, previously-unsent 854 data, since the retransmission timeout and fast recovery algorithms 855 are sufficient to detect losses of such probe segments. 857 Upon sending a TLP probe that is a retransmission, the sender sets 858 TLPRxtOut to true and TLPHighRxt to SND.NXT. 860 Detecting recoveries accomplished by loss probes 862 Step 1: Track ACKs indicating receipt of original and retransmitted 863 segments 865 A sender considers both the original segment and TLP probe 866 retransmission segment as acknowledged if either 1 or 2 are true: 868 1. This is a duplicate acknowledgment (as defined in [RFC5681], 869 section 2), and all of the following conditions are met: 871 1. TLPRxtOut is true 873 2. SEG.ACK == TLPHighRxt 875 3. SEG.ACK == SND.UNA 877 4. the segment contains no SACK blocks for sequence ranges above 878 TLPHighRxt 880 5. the segment contains no data 882 6. the segment is not a window update 884 2. This is an ACK acknowledging a sequence number at or above 885 TLPHighRxt and it contains a D-SACK; i.e. all of the following 886 conditions are met: 888 1. TLPRxtOut is true 890 2. SEG.ACK >= TLPHighRxt 892 3. 
the ACK contains a D-SACK block 894 If either of the conditions is met, then the sender estimates that 895 the receiver received both the original data segment and the TLP 896 probe retransmission, and so the sender considers the TLP episode to 897 be done, and records that fact by setting TLPRxtOut to false. 899 Step 2: Mark the end of a TLP retransmission episode and detect 900 losses 902 If the sender receives a cumulative ACK for data beyond the TLP loss 903 probe retransmission then, in the absence of reordering on the return 904 path of ACKs, it should have received any ACKs for the original 905 segment and TLP probe retransmission segment. At that time, if the 906 TLPRxtOut flag is still true and thus indicates that the TLP probe 907 retransmission remains unacknowledged, then the sender should presume 908 that at least one of its data segments was lost, so it SHOULD invoke 909 a congestion control response equivalent to fast recovery. 911 More precisely, on each ACK the sender executes the following: 913 if (TLPRxtOut and SEG.ACK >= TLPHighRxt) { 914 TLPRxtOut = false 915 EnterRecovery() 916 ExitRecovery() 917 } 919 7. RACK and TLP discussions 921 7.1. Advantages 923 The biggest advantage of RACK is that every data packet, whether it 924 is an original data transmission or a retransmission, can be used to 925 detect losses of the packets sent chronologically prior to it. 927 Example: TAIL DROP. Consider a sender that transmits a window of 928 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 929 transmission of each packet is at least RACK.reo_wnd (1 millisecond 930 by default) after the transmission of the previous packet. RACK will 931 mark P1 as lost when the SACK of P2 is received, and this will 932 trigger the retransmission of P1 as R1. When R1 is cumulatively 933 acknowledged, RACK will mark P3 as lost and the sender will 934 retransmit P3 as R3. 
This example illustrates how RACK is able to 935 repair certain drops at the tail of a transaction without any timer. 936 Notice that neither the conventional duplicate ACK threshold 937 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 938 algorithm can detect such losses, because of the required packet or 939 sequence count. 941 Example: LOST RETRANSMIT. Consider a window of three data packets 942 (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the 943 transmission of each packet is at least RACK.reo_wnd (1 millisecond 944 by default) after the transmission of the previous packet. When P3 945 is SACKed, RACK will mark P1 and P2 lost and they will be 946 retransmitted as R1 and R2. Suppose R1 is lost again but R2 is 947 SACKed; RACK will mark R1 lost for retransmission again. Again, 948 neither the conventional three duplicate ACK threshold approach, nor 949 [RFC6675], nor the Forward Acknowledgment [FACK] algorithm can detect 950 such losses. Such a lost retransmission is very common when TCP 951 is being rate-limited, particularly by token bucket policers with 952 large bucket depth and low rate limit. Retransmissions are often 953 lost repeatedly because standard congestion control requires multiple 954 round trips to reduce the rate below the policed rate. 956 Example: SMALL DEGREE OF REORDERING. Consider a common reordering 957 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 958 a full payload of MSS octets, but P3 has only a 1-octet payload. 959 Suppose the sender has detected reordering previously and thus 960 RACK.reo_wnd is min_RTT/4. Now P3 is reordered and delivered first, 961 before P1 and P2. As long as P1 and P2 are delivered within 962 min_RTT/4, RACK will not consider P1 and P2 lost. But if P1 and P2 963 are delivered outside the reordering window, then RACK will still 964 falsely mark P1 and P2 lost. We discuss how to reduce false 965 positives at the end of this section.
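The per-packet decision behind these examples can be sketched in Python. This is a simplification of the RACK_detect_loss pseudocode (it compares only transmission times, omitting the sequence-number tiebreak and the reordering timer), and all names are illustrative:

```python
def rack_lost(pkt_xmit_ts, rack_xmit_ts, rack_rtt, now, reo_wnd=1.0):
    """Decide whether an unacked packet should be marked lost.

    pkt_xmit_ts  -- (re)transmission time of the unacked packet (ms)
    rack_xmit_ts -- transmission time of the most recently delivered
                    packet (RACK.xmit_ts)
    rack_rtt     -- RTT of the most recently delivered packet (ms)
    reo_wnd      -- reordering window (RACK.reo_wnd), 1 ms by default
    """
    sent_before_newest_delivered = pkt_xmit_ts < rack_xmit_ts
    wait_expired = now - pkt_xmit_ts >= rack_rtt + reo_wnd
    return sent_before_newest_delivered and wait_expired

# TAIL DROP example above: P1, P2, P3 sent at t = 0, 1, 2 ms with
# RTT = 100 ms; the SACK of P2 arrives at t = 101 ms.
assert rack_lost(0, 1, 100, 101)        # P1 marked lost without a timer
assert not rack_lost(2, 1, 100, 101)    # P3 was sent after P2
```

In the reordering example, raising reo_wnd to min_RTT/4 simply enlarges the time budget before wait_expired becomes true.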
967 The examples above show that RACK is particularly useful when the 968 sender is limited by the application, which is common for 969 interactive, request/response traffic. Similarly, RACK still works 970 when the sender is limited by the receive window, which is common for 971 applications that use the receive window to throttle the sender. 973 For some implementations (e.g., Linux), RACK works quite efficiently 974 with TCP Segmentation Offload (TSO). RACK always marks the entire 975 TSO blob lost because the packets in the same TSO blob have the same 976 transmission timestamp. By contrast, the algorithms based on 977 sequence counting (e.g., [RFC6675][RFC5681]) may mark only a subset 978 of packets in the TSO blob lost, forcing the stack to perform 979 expensive fragmentation of the TSO blob, or to selectively tag 980 individual packets lost in the scoreboard. 982 7.2. Disadvantages 984 RACK requires the sender to record the transmission time of each 985 packet sent at a clock granularity of one millisecond or finer. TCP 986 implementations that record this already for RTT estimation do not 987 require any new per-packet state. But implementations that are not 988 yet recording packet transmission times will need to add per-packet 989 internal state (commonly either 4 or 8 octets per packet or TSO blob) 990 to track transmission times. In contrast, the conventional [RFC6675] 991 loss detection approach does not require any per-packet state beyond 992 the SACK scoreboard. This lower overhead is particularly useful on ultra-low RTT 993 networks where the RTT is far less than the sender TCP clock 994 granularity (e.g. inside data-centers). 996 RACK can easily and optionally support the conventional approach in 997 [RFC6675][RFC5681] by resetting the reordering window to zero when 998 the threshold is met. Note that this approach differs slightly from 999 [RFC6675] which considers a packet lost when at least #DupThresh 1000 higher-sequence packets are SACKed.
RACK's approach considers a 1001 packet lost when at least one higher sequence packet is SACKed and 1002 the total number of SACKed packets is at least DupThresh. For 1003 example, suppose a connection sends 10 packets, and packets 3, 5, 7 1004 are SACKed. [RFC6675] considers packets 1 and 2 lost. RACK 1005 considers packets 1, 2, 4, 6 lost. 1007 7.3. Adjusting the reordering window 1009 When the sender detects packet reordering, RACK uses a reordering 1010 window of min_RTT/4. It uses the minimum RTT to accommodate 1011 reordering introduced by packets traversing slightly different paths 1012 (e.g., router-based parallelism schemes) or out-of-order deliveries 1013 in the lower link layer (e.g., wireless links using link-layer 1014 retransmission). RACK uses a quarter of minimum RTT because Linux 1015 TCP used the same factor in its implementation to delay Early 1016 Retransmit [RFC5827] to reduce spurious loss detections in the 1017 presence of reordering, and experience shows that this seems to work 1018 reasonably well. We have evaluated using the smoothed RTT (SRTT from 1019 [RFC6298] RTT estimation) or the most recently measured RTT 1020 (RACK.rtt) using an experiment similar to that in the Performance 1021 Evaluation section. They do not make any significant difference in 1022 terms of total recovery latency. 1024 7.4. Relationships with other loss recovery algorithms 1026 The primary motivation of RACK is to ultimately provide a simple and 1027 general replacement for some of the standard loss recovery algorithms 1028 [RFC5681][RFC6675][RFC5827][RFC4653], as well as some nonstandard 1029 ones [FACK][THIN-STREAM]. While RACK can be a supplemental loss 1030 detection mechanism on top of these algorithms, this is not 1031 necessary, because RACK implicitly subsumes most of them. 1033 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 1034 threshold based on the current or previous flight sizes.
RACK takes 1035 a different approach, by using only one ACK event and a reordering 1036 window. RACK can be seen as an extended Early Retransmit [RFC5827] 1037 without a FlightSize limit but with an additional reordering window. 1038 [FACK] considers an original packet to be lost when its sequence 1039 range is sufficiently far below the highest SACKed sequence. In some 1040 sense RACK can be seen as a generalized form of FACK that operates in 1041 time space instead of sequence space, enabling it to better handle 1042 reordering, application-limited traffic, and lost retransmissions. 1044 Nevertheless, RACK is still an experimental algorithm, since the 1045 oldest loss detection algorithm, the 3 duplicate ACK threshold 1046 [RFC5681], has been standardized and widely deployed. RACK can 1047 easily and optionally support the conventional approach for 1048 compatibility. 1050 RACK is compatible with and does not interfere with the standard 1051 RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 1052 algorithms [RFC3522]. This is because RACK only detects loss by 1053 using ACK events. It neither changes the RTO timer calculation nor 1054 detects spurious timeouts. 1056 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 1057 because a tail loss probe solicits either an ACK or SACK, which can 1058 be used by RACK to detect more losses. RACK can be used to relax 1059 TLP's requirement for using FACK and retransmitting the highest- 1060 sequenced packet, because RACK is agnostic to packet sequence 1061 numbers, and uses transmission time instead. Thus TLP could be 1062 modified to retransmit the first unacknowledged packet, which could 1063 improve application latency. 1065 7.5. Interaction with congestion control 1067 RACK intentionally decouples loss detection from congestion control. 1068 RACK only detects losses; it does not modify the congestion control 1069 algorithm [RFC5681][RFC6937].
However, RACK may detect losses 1070 earlier or later than the conventional duplicate ACK threshold 1071 approach does. A packet marked lost by RACK SHOULD NOT be 1072 retransmitted until congestion control deems this appropriate. 1073 Specifically, Proportional Rate Reduction [RFC6937] SHOULD be used 1074 when using RACK. 1076 RACK is applicable for both fast recovery and recovery after a 1077 retransmission timeout (RTO) in [RFC5681]. RACK applies equally to 1078 fast recovery and RTO recovery because RACK is purely based on the 1079 transmission time order of packets. When a packet retransmitted by 1080 RTO is acknowledged, RACK will mark any unacked packet sent 1081 sufficiently prior to the RTO as lost, because at least one RTT has 1082 elapsed since these packets were sent. 1084 The following simple example compares how RACK and non-RACK loss 1085 detection interact with congestion control: suppose a TCP sender has 1086 a congestion window (cwnd) of 20 packets on a SACK-enabled 1087 connection. It sends 10 data packets and all of them are lost. 1089 Without RACK, the sender would time out, reset cwnd to 1, and 1090 retransmit the first packet. It would take four round trips (1 + 2 + 1091 4 + 3 = 10) to retransmit all 10 lost packets using slow start. 1092 The recovery latency would be RTO + 4*RTT, with an ending cwnd of 4 1093 packets due to congestion window validation. 1095 With RACK, a sender would send the TLP after 2*RTT and get a DUPACK. 1096 If the sender implements Proportional Rate Reduction [RFC6937], it 1097 would slow start to retransmit the remaining 9 lost packets since the 1098 number of packets in flight (0) is lower than the slow start 1099 threshold (10). The slow start would again take four round trips (1 1100 + 2 + 4 + 3 = 10). The recovery latency would be 2*RTT + 4*RTT, with 1101 an ending cwnd set to the slow start threshold of 10 packets. 1103 In both cases, the sender after the recovery would be in congestion 1104 avoidance.
The difference in recovery latency (RTO + 4*RTT vs 6*RTT) 1105 can be significant if the RTT is much smaller than the minimum RTO (1 1106 second in RFC6298) or if the RTT is large. The former case is common 1107 in local area networks, data-center networks, or content distribution 1108 networks with deep deployments. The latter case is more common in 1109 developing regions with highly congested and/or high-latency 1110 networks. The ending congestion window after recovery also impacts 1111 subsequent data transfer. 1113 7.6. TLP recovery detection with delayed ACKs 1115 Delayed ACKs complicate the detection of repairs done by TLP, since 1116 with a delayed ACK the sender receives one fewer ACK than would 1117 normally be expected. To mitigate this complication, before sending 1118 a TLP loss probe retransmission, the sender should attempt to wait 1119 long enough that the receiver has sent any delayed ACKs that it is 1120 withholding. The sender algorithm described above features such a 1121 delay, in the form of WCDelAckT. Furthermore, if the receiver 1122 supports duplicate selective acknowledgments (D-SACKs) [RFC2883] then 1123 in the case of a delayed ACK the sender's TLP recovery detection 1124 algorithm (see above) can use the D-SACK information to infer that 1125 the original and TLP retransmission both arrived at the receiver. 1127 If there is ACK loss or a delayed ACK without a D-SACK, then this 1128 algorithm is conservative, because the sender will reduce cwnd when 1129 in fact there was no packet loss. In practice this is acceptable, 1130 and potentially even desirable: if there is reverse path congestion 1131 then reducing cwnd can be prudent. 1133 7.7. RACK for other transport protocols 1135 RACK can be implemented in other transport protocols. The algorithm 1136 can be simplified by skipping step 3 if the protocol can support a 1137 unique transmission or packet identifier (e.g. TCP timestamp options 1138 [RFC7323]). 
For example, the QUIC protocol implements RACK [QUIC- 1138 LR]. 1141 8. Experiments and Performance Evaluations 1143 RACK and TLP have been deployed at Google, for both connections to 1144 users in the Internet and internally. We conducted a performance 1145 evaluation experiment for RACK and TLP on a small set of Google Web 1146 servers in Western Europe that serve mostly European and some African 1147 countries. The experiment lasted three days in March 2017. The 1148 servers were divided evenly into four groups of roughly 5.3 million 1149 flows each: 1151 Group 1 (control): RACK off, TLP off, RFC 6675 on 1153 Group 2: RACK on, TLP off, RFC 6675 on 1155 Group 3: RACK on, TLP on, RFC 6675 on 1157 Group 4: RACK on, TLP on, RFC 6675 off 1159 All groups used Linux with CUBIC congestion control, an initial 1160 congestion window of 10 packets, and the fq/pacing qdisc. In terms 1161 of specific recovery features, all groups enabled RFC5682 (F-RTO) but 1162 disabled FACK, because it is not an IETF RFC and the goal of this 1163 setup is to compare RACK and TLP to RFC-based loss recoveries. 1164 Since TLP depends on either FACK or RACK, we 1165 could not run another group that enables TLP only (with both RACK and 1166 FACK disabled). Group 4 tests whether RACK plus TLP can 1167 completely replace the DupThresh-based loss detection of [RFC6675]. 1169 The servers sit behind a load balancer that distributes the 1170 connections evenly across the four groups. 1172 Each group handles a similar number of connections and sends and 1173 receives similar amounts of data. We compare total time spent in 1174 loss recovery across groups. The recovery time is measured from when 1175 the recovery and retransmission start until the remote host has 1176 acknowledged the highest sequence (SND.NXT) at the time the recovery 1177 started. Therefore the measurement includes both fast recoveries and 1178 timeout recoveries.
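The recovery-time bookkeeping described above can be stated precisely with a small sketch (a hypothetical helper, not the actual measurement tooling used in the experiment):

```python
def recovery_time(acks, recovery_start, snd_nxt_at_start):
    """Elapsed time from the start of recovery until the peer has
    cumulatively acknowledged the SND.NXT recorded when recovery
    began; this covers both fast recoveries and timeout recoveries.

    acks -- (time, cumulative_ack) pairs in arrival order
    """
    for t, cum_ack in acks:
        if t >= recovery_start and cum_ack >= snd_nxt_at_start:
            return t - recovery_start
    return None  # recovery has not completed yet

# Recovery starts at t=10 with SND.NXT=1000; it ends at the first
# ACK that covers sequence 1000.
acks = [(12, 400), (15, 900), (18, 1000)]
```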
1180 Our data shows that Group 2 recovery latency is only 0.3% lower than 1181 the Group 1 recovery latency. But Group 3 recovery latency is 25% 1182 lower than Group 1 due to a 40% reduction in RTO-triggered 1183 recoveries! Therefore it is important to implement both TLP and RACK 1184 for performance. Group 4's total recovery latency is 0.02% lower 1185 than Group 3's, indicating that RACK plus TLP can successfully 1186 replace RFC6675 as a standalone recovery mechanism. 1188 We want to emphasize that the current experiment is limited in terms 1189 of network coverage. The connectivity in Western Europe is fairly 1190 good, therefore loss recovery is not a major performance bottleneck. 1191 We plan to expand our experiments to regions with worse connectivity, 1192 in particular on networks with strong traffic policing. 1194 9. Security Considerations 1196 RACK does not change the risk profile for TCP. 1198 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 1199 MSS-size packet sent, the receiver or the attacker might send MSS 1200 ACKs that SACK or acknowledge one additional byte per ACK. This 1201 would not fool RACK. RACK.xmit_ts would not advance because all the 1202 sequences of the packet are transmitted at the same time (carry the 1203 same transmission timestamp). In other words, SACKing only one byte 1204 of a packet or SACKing the packet in entirety have the same effect on 1205 RACK. 1207 10. IANA Considerations 1209 This document makes no request of IANA. 1211 Note to RFC Editor: this section may be removed on publication as an 1212 RFC. 1214 11. Acknowledgments 1216 The authors thank Matt Mathis for his insights in FACK and Michael 1217 Welzl for his per-packet timer idea that inspired this work. 
Eric 1218 Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana 1219 Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi 1220 Nishida, Bob Briscoe, Felix Weinrank, and Michael Tuexen contributed 1221 to the draft or the implementations in Linux, FreeBSD, Windows and 1222 QUIC. 1224 12. References 1226 12.1. Normative References 1228 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 1229 Options", RFC 2018, October 1996. 1231 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1232 Requirement Levels", RFC 2119, March 1997. 1234 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1235 Extension to the Selective Acknowledgement (SACK) Option 1236 for TCP", RFC 2883, July 2000. 1238 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 1239 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 1240 November 2006. 1242 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1243 Control", RFC 5681, September 2009. 1245 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 1246 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1247 Spurious Retransmission Timeouts with TCP", RFC 5682, 1248 September 2009. 1250 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 1251 Hurtig, "Early Retransmit for TCP and Stream Control 1252 Transmission Protocol (SCTP)", RFC 5827, April 2010. 1254 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1255 "Computing TCP's Retransmission Timer", RFC 6298, June 1256 2011. 1258 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1259 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1260 Based on Selective Acknowledgment (SACK) for TCP", 1261 RFC 6675, August 2012. 1263 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 1264 Rate Reduction for TCP", RFC 6937, May 2013. 1266 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
1267 Scheffenegger, "TCP Extensions for High Performance", 1268 RFC 7323, September 2014. 1270 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, 1271 September 1981. 1273 12.2. Informative References 1275 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 1276 refining TCP congestion control", ACM SIGCOMM Computer 1277 Communication Review, Volume 26, Issue 4, October 1996. 1280 [POLICER16] 1281 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 1282 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 1283 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 1284 2016. 1286 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 1287 Congestion Control", draft-tsvwg-quic-loss-recovery-01 1288 (work in progress), June 2016. 1290 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 1291 and SCTP RTO Restart", RFC 7765, February 2016. 1293 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 1294 "TCP Congestion Control With a Misbehaving Receiver", ACM 1295 Computer Communication Review, 29(5), 1999. 1297 [THIN-STREAM] 1298 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 1299 "TCP enhancements for interactive thin-stream 1300 applications", NOSSDAV, 2008. 1302 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 1303 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 1304 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 1305 in progress), August 2013. 1313 Authors' Addresses 1314 Yuchung Cheng 1315 Google, Inc 1316 1600 Amphitheater Parkway 1317 Mountain View, California 94043 1318 USA 1320 Email: ycheng@google.com 1322 Neal Cardwell 1323 Google, Inc 1324 76 Ninth Avenue 1325 New York, NY 10011 1326 USA 1328 Email: ncardwell@google.com 1330 Nandita Dukkipati 1331 Google, Inc 1332 1600 Amphitheater Parkway 1333 Mountain View, California 94043 1335 Email: nanditad@google.com 1337 Priyaranjan Jha 1338 Google, Inc 1339 1600 Amphitheater Parkway 1340 Mountain View, California 94043 1342 Email: priyarjha@google.com