2 TCP Maintenance Working Group Y. Cheng 3 Internet-Draft N. Cardwell 4 Intended status: Experimental Google, Inc 5 Expires: January 7, 2017 July 6, 2016 7 RACK: a time-based fast loss detection algorithm for TCP 8 draft-cheng-tcpm-rack-01 10 Abstract 12 This document presents a new TCP loss detection algorithm called RACK 13 ("Recent ACKnowledgment"). RACK uses the notion of time, instead of 14 packet or sequence counts, to detect losses, for modern TCP 15 implementations that can support per-packet timestamps and the 16 selective acknowledgment (SACK) option. It is intended to replace 17 the conventional DUPACK threshold approach and its variants, as well 18 as other nonstandard approaches. 20 Status of This Memo 22 This Internet-Draft is submitted in full conformance with the 23 provisions of BCP 78 and BCP 79. 25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF).
Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/. 30 Internet-Drafts are draft documents valid for a maximum of six months 31 and may be updated, replaced, or obsoleted by other documents at any 32 time. It is inappropriate to use Internet-Drafts as reference 33 material or to cite them other than as "work in progress." 35 This Internet-Draft will expire on January 7, 2017. 37 Copyright Notice 39 Copyright (c) 2016 IETF Trust and the persons identified as the 40 document authors. All rights reserved. 42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with respect 47 to this document. Code Components extracted from this document must 48 include Simplified BSD License text as described in Section 4.e of 49 the Trust Legal Provisions and are provided without warranty as 50 described in the Simplified BSD License. 52 1. Introduction 54 This document presents a new loss detection algorithm called RACK 55 ("Recent ACKnowledgment"). RACK uses the notion of time instead of 56 the conventional packet or sequence counting approaches for detecting 57 losses. RACK deems a packet lost if some packet sent sufficiently 58 later has been delivered. It does this by recording packet 59 transmission times and inferring losses using cumulative 60 acknowledgments or selective acknowledgment (SACK) TCP options. 62 In the last couple of years we have been observing several 63 increasingly common loss and reordering patterns in the Internet: 65 1. Lost retransmissions. Traffic policers [POLICER16] and burst 66 losses often cause retransmissions to be lost again, severely 67 increasing TCP latency. 69 2. Tail drops. 
Structured request-response traffic turns more 70 losses into tail drops. In such cases, TCP is 71 application-limited, so it cannot send new data to probe losses and has to 72 rely on retransmission timeouts (RTOs). 74 3. Reordering. Link layer protocols (e.g., 802.11 block ACK) or 75 routers' internal load-balancing can deliver TCP packets out of 76 order. The degree of such reordering is usually within the order 77 of the path round trip time. 79 Although TCP stacks (e.g., Linux) implement many of the standard 80 and proposed loss detection algorithms 81 [RFC3517][RFC4653][RFC5827][RFC5681][RFC6675][RFC7765][FACK] 82 [THIN-STREAM][TLP], we have found that together they do not perform well. 83 The main reason is that many of them are based on the classic rule of 84 counting duplicate acknowledgments [RFC5681]. They can detect 85 loss either quickly or accurately, but not both, especially when the sender 86 is application-limited or the reordering is unpredictable. 87 And under these conditions none of them can detect lost 88 retransmissions well. 90 Also, these algorithms, including RFCs, rarely address their 91 interactions with other algorithms. For example, FACK may consider a 92 packet lost while RFC3517 may not. Implementing N algorithms 93 while dealing with N^2 interactions is a daunting and 94 error-prone task. 96 The goal of RACK is to solve all the problems above by replacing many 97 of the loss detection algorithms above with one simpler, and also 98 more effective, algorithm. 100 2. Overview 102 The main idea behind RACK is that if a packet has been delivered out 103 of order, then the packets sent chronologically before that were 104 either lost or reordered. This concept is not fundamentally 105 different from [RFC5681][RFC3517][FACK].
But the key innovation in 106 RACK is to use a per-packet transmission timestamp and widely 107 deployed SACK options to conduct time-based inferences instead of 108 inferring losses with packet or sequence counting approaches. 110 Using a threshold for counting duplicate acknowledgments (i.e., 111 dupthresh) is no longer reliable because of today's prevalent 112 reordering patterns. A common type of reordering is that the last 113 "runt" packet of a window's worth of packet bursts gets delivered 114 first, then the rest arrive shortly after in order. To handle this 115 effectively, a sender would need to constantly adjust the dupthresh 116 to the burst size; but this would risk increasing the frequency of 117 RTOs on real losses. 119 Today's prevalent lost retransmissions also cause problems with 120 packet-counting approaches [RFC5681][RFC3517][FACK], since those 121 approaches depend on reasoning in sequence number space. 122 Retransmissions break the direct correspondence between ordering in 123 sequence space and ordering in time. So when retransmissions are 124 lost, sequence-based approaches are often unable to infer and quickly 125 repair losses that can be deduced with time-based approaches. 127 Instead of counting packets, RACK uses the most recently delivered 128 packet's transmission time to judge if some packets sent previous to 129 that time have "expired" by passing a certain reordering settling 130 window. On each ACK, RACK marks any already-expired packets lost, 131 and for any packets that have not yet expired it waits until the 132 reordering window passes and then marks those lost as well. In 133 either case, RACK can repair the loss without waiting for a (long) 134 RTO. RACK can be applied to both fast recovery and timeout recovery, 135 and can detect losses on both originally transmitted and 136 retransmitted packets, making it a great all-weather recovery 137 mechanism. 139 3. 
Requirements 141 The reader is expected to be familiar with the definitions given in 142 the TCP congestion control [RFC5681] and selective acknowledgment 144 [RFC2018] RFCs. Familiarity with the conservative SACK-based 145 recovery for TCP [RFC6675] is not required but is helpful. 147 RACK has three requirements: 149 1. The connection MUST use selective acknowledgment (SACK) options 150 [RFC2018]. 152 2. For each packet sent, the sender MUST store its most recent 153 transmission time with (at least) millisecond granularity. For 154 round-trip times lower than a millisecond (e.g., intra-datacenter 155 communications) microsecond granularity would significantly reduce 156 the detection latency but is not required. 158 3. For each packet sent, the sender MUST store whether the packet 159 has been retransmitted or not. 161 We assume that requirement 1 implies the sender keeps a SACK 162 scoreboard, which is a data structure to store selective 163 acknowledgment information on a per-connection basis. For ease 164 of explaining the algorithm, we use a pseudo-scoreboard that manages 165 the data in sequence number ranges. But the specifics of the data 166 structure are left to the implementor. 168 RACK requires no changes on the receiver. 170 4. Definitions of variables 172 A sender needs to store these new RACK variables: 174 "Packet.xmit_ts" is the time of the last transmission of a data 175 packet, including retransmissions, if any. The sender needs to 176 record the transmission time for each packet sent and not yet 177 acknowledged. The time MUST be stored at millisecond granularity or 178 finer. 180 "RACK.xmit_ts" is the most recent Packet.xmit_ts among all the 181 packets that were delivered (either cumulatively acknowledged or 182 selectively acknowledged) on the connection. 184 "RACK.end_seq" is the ending TCP sequence number of the packet that 185 was used to record the RACK.xmit_ts above.
187 "RACK.RTT" is the RTT measured when RACK.xmit_ts, above, 188 was last updated. It is the RTT of the most recently transmitted packet 189 that has been delivered (either cumulatively acknowledged or 190 selectively acknowledged) on the connection. 192 "RACK.reo_wnd" is a reordering window for the connection, computed in 193 the unit of time used for recording packet transmission times. It is 194 used to defer the moment at which RACK marks a packet lost. 196 "RACK.min_RTT" is the estimated minimum round-trip time (RTT) of the 197 connection. 199 Note that the Packet.xmit_ts variable is per packet in flight. The 200 RACK.xmit_ts, RACK.RTT, RACK.reo_wnd, and RACK.min_RTT variables are 201 per connection. 203 5. Algorithm Details 205 5.1. Transmitting a data packet 207 Upon transmitting a new packet or retransmitting an old packet, 208 record the time in Packet.xmit_ts. RACK does not care if the 209 retransmission is triggered by an ACK, new application data, an RTO, 210 or any other means. 212 5.2. Upon receiving an ACK 214 Step 1: Update RACK.min_RTT. 216 Use the RTT measurements obtained in [RFC6298] or [RFC7323] to update 217 the estimated minimum RTT in RACK.min_RTT. The sender can track a 218 simple global minimum of all RTT measurements from the connection, or 219 a windowed min-filtered value of recent RTT measurements. This 220 document does not specify an exact approach. 222 Step 2: Update RACK.reo_wnd. 224 To handle the prevalent small degree of reordering, RACK.reo_wnd 225 serves as an allowance for settling time before marking a packet 226 lost. By default it is 1 millisecond. We RECOMMEND implementing the 227 reordering detection in [REORDER-DETECT][RFC4737] to dynamically 228 adjust the reordering window. When the sender detects packet 229 reordering, RACK.reo_wnd MAY be changed to RACK.min_RTT/4. We discuss 230 the reordering window further in the next section.
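To make Steps 1 and 2 concrete, the per-connection update can be sketched as follows. This is a non-normative illustration: the class name RackState and method on_rtt_sample() are hypothetical, and a simple global minimum is used for RACK.min_RTT (the draft allows a windowed min-filter instead).

```python
# Non-normative sketch of Steps 1 and 2. RackState and on_rtt_sample()
# are hypothetical names; the draft leaves the exact min-RTT estimator
# to the implementor (a global minimum is used here).

REO_WND_DEFAULT_MS = 1.0  # default RACK.reo_wnd: 1 millisecond

class RackState:
    def __init__(self):
        self.min_rtt = float("inf")        # RACK.min_RTT
        self.reo_wnd = REO_WND_DEFAULT_MS  # RACK.reo_wnd
        self.reordering_seen = False       # set by a reordering detector

    def on_rtt_sample(self, rtt_ms):
        # Step 1: update RACK.min_RTT (simple global minimum; a windowed
        # min-filter is an equally valid choice per the draft).
        self.min_rtt = min(self.min_rtt, rtt_ms)
        # Step 2: once reordering has been detected, RACK.reo_wnd MAY be
        # set to RACK.min_RTT / 4; otherwise keep the 1 ms default.
        if self.reordering_seen:
            self.reo_wnd = self.min_rtt / 4

state = RackState()
state.on_rtt_sample(40.0)     # min_rtt becomes 40 ms, reo_wnd stays 1 ms
state.reordering_seen = True  # e.g., reported by [REORDER-DETECT] logic
state.on_rtt_sample(48.0)     # min_rtt stays 40 ms, reo_wnd becomes 10 ms
```

Note how the reordering window widens only after reordering has actually been observed, so a connection with no reordering keeps the fast 1 ms default.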
232 Step 3: Advance RACK.xmit_ts and update RACK.RTT and RACK.end_seq 234 Given the information provided in an ACK, each packet cumulatively 235 ACKed or SACKed is marked as delivered in the scoreboard. Among all 236 the packets newly ACKed or SACKed in the connection, record the most 237 recent Packet.xmit_ts in RACK.xmit_ts if it is ahead of RACK.xmit_ts. 238 Ignore the packet if any of its TCP sequences has been retransmitted 239 before and either of two conditions is true: 241 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp 242 option [RFC7323], if available, indicates the ACK was not 243 acknowledging the last retransmission of the packet. 245 2. The packet was last retransmitted less than RACK.min_rtt ago. 246 While it is still possible the packet was spuriously retransmitted 247 because of a recent RTT decrease, our experience 248 suggests this is a reasonable heuristic. 250 If this ACK causes a change to RACK.xmit_ts then record the RTT and 251 sequence implied by this ACK: 253 RACK.RTT = Now() - RACK.xmit_ts 254 RACK.end_seq = Packet.end_seq 256 Exit here and omit the following steps if RACK.xmit_ts has not 257 changed. 259 Step 4: Detect losses. 261 For each packet that has not been fully SACKed, if RACK.xmit_ts is 262 after Packet.xmit_ts + RACK.reo_wnd, then mark the packet (or its 263 corresponding sequence range) lost in the scoreboard. The rationale 264 is that if another packet that was sent later has been delivered, and 265 the reordering window or "reordering settling time" has already 266 passed, the packet was likely lost. 268 If a packet that was sent later has been delivered, but the 269 reordering window has not passed, then it is not yet safe to deem the 270 given packet lost. Using the basic algorithm above, the sender would 271 wait for the next ACK to further advance RACK.xmit_ts; but this risks 272 a timeout (RTO) if no more ACKs come back (e.g., due to losses or 273 application limits).
For timely loss detection, the sender MAY 274 install a "reordering settling" timer set to fire at the earliest 275 moment at which it is safe to conclude that some packet is lost. The 276 earliest moment is the time it takes to expire the reordering window 277 of the earliest unacked packet in flight. 279 This timer expiration value can be derived as follows. As a starting 280 point, we consider that the reordering window has passed if the RACK 281 packet was sent sufficiently after the packet in question, or a 282 sufficient time has elapsed since the RACK packet was S/ACKed, or 283 some combination of the two. More precisely, RACK marks a packet as 284 lost if the reordering window for a packet has elapsed through the 285 sum of: 287 1. delta in transmit time between a packet and the RACK packet 288 2. delta in time between the S/ACK of the RACK packet (RACK.ack_ts) 289 and now 291 So we mark a packet as lost if: 293 RACK.xmit_ts > Packet.xmit_ts AND 294 (RACK.xmit_ts - Packet.xmit_ts) + (now - RACK.ack_ts) > RACK.reo_wnd 296 If we solve this second condition for "now", the moment at which we 297 can declare a packet lost, then we get: 299 now > Packet.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts) 301 Then (RACK.ack_ts - RACK.xmit_ts) is just the RTT of the packet we 302 used to set RACK.xmit_ts, so this reduces to: 304 now > Packet.xmit_ts + RACK.RTT + RACK.reo_wnd 306 The following pseudocode implements the algorithm above. When an ACK 307 is received or the RACK timer expires, call RACK_detect_loss(). The 308 algorithm includes an additional optimization to break timestamp ties 309 by using the TCP sequence space. The optimization is particularly 310 useful to detect losses in a timely manner with TCP Segmentation 311 Offload, where multiple packets in one TSO blob have identical 312 timestamps. It is also useful when the timestamp clock granularity 313 is close to or longer than the actual round trip time. 
315 RACK_detect_loss(): 316 min_timeout = 0 318 For each packet, Packet, in the scoreboard: 319 If Packet is already SACKed, ACKed, 320 or marked lost and not yet retransmitted: 321 Skip to the next packet 323 If Packet.xmit_ts > RACK.xmit_ts: 324 Skip to the next packet 325 If Packet.xmit_ts == RACK.xmit_ts AND Packet.end_seq > RACK.end_seq: // Timestamp tie breaker 326 Skip to the next packet 328 timeout = Packet.xmit_ts + RACK.RTT + RACK.reo_wnd + 1 329 If Now() >= timeout: 330 Mark Packet lost 331 Else If (min_timeout == 0) or (timeout is before min_timeout): 332 min_timeout = timeout 334 If min_timeout != 0: 335 Arm a timer to call RACK_detect_loss() at min_timeout 337 6. Analysis and Discussion 339 6.1. Advantages 341 The biggest advantage of RACK is that every data packet, whether it 342 is an original data transmission or a retransmission, can be used to 343 detect losses of the packets sent prior to it. 345 Example: tail drop. Consider a sender that transmits a window of 346 three data packets (P1, P2, P3), and P1 and P3 are lost. Suppose the 347 transmission of each packet is at least RACK.reo_wnd (1 millisecond 348 by default) after the transmission of the previous packet. RACK will 349 mark P1 as lost when the SACK of P2 is received, and this will 350 trigger the retransmission of P1 as R1. When R1 is cumulatively 351 acknowledged, RACK will mark P3 as lost and the sender will 352 retransmit P3 as R3. This example illustrates how RACK is able to 353 repair certain drops at the tail of a transaction without any timer. 354 Notice that neither the conventional duplicate ACK threshold 355 [RFC5681], nor [RFC6675], nor the Forward Acknowledgment [FACK] 356 algorithm can detect such losses, because of the required packet or 357 sequence count. 359 Example: lost retransmit. Consider a window of three data packets 360 (P1, P2, P3) that are sent; P1 and P2 are dropped.
Suppose the 361 transmission of each packet is at least RACK.reo_wnd (1 millisecond 362 by default) after the transmission of the previous packet. When P3 363 is SACKed, RACK will mark P1 and P2 lost and they will be 364 retransmitted as R1 and R2. Suppose R1 is lost again (as a tail 365 drop) but R2 is SACKed; RACK will mark R1 lost for retransmission 366 again. Again, neither the conventional three duplicate ACK threshold 367 approach, nor [RFC6675], nor the Forward Acknowledgment [FACK] 368 algorithm can detect such losses. And such a lost retransmission is 369 very common when TCP is being rate-limited, particularly by token 370 bucket policers with large bucket depth and low rate limit. 371 Retransmissions are often lost repeatedly because standard congestion 372 control requires multiple round trips to reduce the rate below the 373 policed rate. 375 Example: (small) degree of reordering. Consider a common reordering 376 event: a window of packets is sent as (P1, P2, P3). P1 and P2 carry 377 a full payload of MSS octets, but P3 has only a 1-octet payload due 378 to application-limited behavior. Suppose the sender has detected 379 reordering previously (e.g., by implementing the algorithm in 380 [REORDER-DETECT]) and thus RACK.reo_wnd is min_RTT/4. Now P3 is 381 reordered and delivered first, before P1 and P2. As long as P1 and 382 P2 are delivered within min_RTT/4, RACK will not consider P1 and P2 383 lost. But if P1 and P2 are delivered outside the reordering window, 384 then RACK will still falsely mark P1 and P2 lost. We discuss how to 385 reduce the false positives at the end of this section. 387 The examples above show that RACK is particularly useful when the 388 sender is limited by the application, which is common for 389 interactive, request/response traffic. Similarly, RACK still works 390 when the sender is limited by the receive window, which is common for 391 applications that use the receive window to throttle the sender.
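The tail-drop example above can be replayed with a small, non-normative Python sketch of the marking step. All names here (Packet, rack_detect_loss) are illustrative simplifications of the draft's variables, and the timer-arming part of Step 4 is omitted for brevity.

```python
# Non-normative sketch replaying a tail-drop scenario: P1 and P3 are
# lost, P2 is SACKed. Packet and rack_detect_loss() are hypothetical
# simplifications of the draft's scoreboard and Step 4.

RACK_RTT_MS = 100.0  # assumed RTT of the SACKed (RACK) packet
REO_WND_MS = 1.0     # default RACK.reo_wnd

class Packet:
    def __init__(self, name, xmit_ts):
        self.name = name        # for readability only
        self.xmit_ts = xmit_ts  # Packet.xmit_ts, in ms
        self.sacked = False
        self.lost = False

def rack_detect_loss(scoreboard, rack_xmit_ts, now):
    # Mark every unSACKed packet lost whose reordering window, anchored
    # at the RACK packet's transmission time, has already passed.
    for pkt in scoreboard:
        if pkt.sacked or pkt.xmit_ts > rack_xmit_ts:
            continue  # delivered, or sent after the RACK packet
        if now >= pkt.xmit_ts + RACK_RTT_MS + REO_WND_MS:
            pkt.lost = True

# P1, P2, P3 sent 2 ms apart; the SACK of P2 arrives one RTT after P2.
p1, p2, p3 = Packet("P1", 0.0), Packet("P2", 2.0), Packet("P3", 4.0)
p2.sacked = True
rack_detect_loss([p1, p2, p3], rack_xmit_ts=p2.xmit_ts, now=2.0 + RACK_RTT_MS)
# P1 (sent 2 ms, i.e. more than reo_wnd, before P2) is marked lost
# without any duplicate ACK threshold; P3 was sent after the RACK
# packet, so it is left pending.
```

A dupthresh-based detector would need two more duplicate ACKs that never arrive here; RACK needs only the single SACK of P2.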
393 For some implementations (e.g., Linux), RACK works quite efficiently 394 with TCP Segmentation Offload (TSO). RACK always marks the entire 395 TSO blob lost because the packets in the same TSO blob have the same 396 transmission timestamp. By contrast, the counting-based algorithms 397 (e.g., [RFC3517][RFC5681]) may mark only a subset of packets in the 398 TSO blob lost, forcing the stack to perform expensive fragmentation 399 of the TSO blob, or to selectively tag individual packets lost in the 400 scoreboard. 402 6.2. Disadvantages 404 RACK requires the sender to record the transmission time of each 405 packet sent at a clock granularity of one millisecond or finer. TCP 406 implementations that record this already for RTT estimation do not 407 require any new per-packet state. But implementations that are not 408 yet recording packet transmission times will need to add per-packet 409 internal state (commonly either 4 or 8 octets per packet) to track 410 transmission times. In contrast, the conventional approach requires only 411 one variable to track the duplicate ACK count. 413 6.3. Adjusting the reordering window 415 RACK uses a reordering window of min_rtt / 4. It uses the minimum 416 RTT to accommodate reordering introduced by packets traversing 417 slightly different paths (e.g., router-based parallelism schemes) or 418 out-of-order deliveries in the lower link layer (e.g., wireless links 419 using link-layer retransmission). Alternatively, RACK can use the 420 smoothed RTT used in RTT estimation [RFC6298]. However, smoothed RTT 421 can be inflated by orders of magnitude due to 422 congestion and buffer-bloat, which would result in an overly 423 conservative reordering window and slow loss detection.
Furthermore, 424 RACK uses a quarter of minimum RTT because Linux TCP uses the same 425 factor in its implementation to delay Early Retransmit [RFC5827] to 426 reduce spurious loss detections in the presence of reordering, and 427 experience shows that this works reasonably well. 429 One potential improvement is to further adapt the reordering window 430 by measuring the degree of reordering in time, instead of packet 431 distances. But that requires storing the delivery timestamp of each 432 packet. Some scoreboard implementations currently merge SACKed 433 packets together to support TSO (TCP Segmentation Offload) for faster 434 scoreboard indexing. Supporting per-packet delivery timestamps is 435 difficult in such implementations. However, we acknowledge that the 436 current metric can be improved by further research. 438 6.4. Relationships with other loss recovery algorithms 440 The primary motivation of RACK is to ultimately provide a simple and 441 general replacement for some of the standard loss recovery algorithms 442 [RFC5681][RFC6675][RFC5827][RFC4653] and nonstandard ones 443 [FACK][THIN-STREAM]. While RACK can be a supplemental loss detection 444 mechanism on top of these algorithms, this is not necessary, because RACK 445 implicitly subsumes most of them. 447 [RFC5827][RFC4653][THIN-STREAM] dynamically adjust the duplicate ACK 448 threshold based on the current or previous flight sizes. RACK takes 449 a different approach, using only one ACK event and a reordering 450 window. RACK can be seen as an extended Early Retransmit [RFC5827] 451 without a FlightSize limit but with an additional reordering window. 452 [FACK] considers an original packet to be lost when its sequence 453 range is sufficiently far below the highest SACKed sequence.
In some 454 sense RACK can be seen as a generalized form of FACK that operates in 455 time space instead of sequence space, enabling it to better handle 456 reordering, application-limited traffic, and lost retransmissions. 458 Nevertheless, RACK is still an experimental algorithm. Since the 459 oldest loss detection algorithm, the 3 duplicate ACK threshold 460 [RFC5681], has been standardized and widely deployed, we RECOMMEND 461 TCP implementations use both RACK and the algorithm specified in 462 Section 3.2 of [RFC5681] for compatibility. 464 RACK is compatible with and does not interfere with the standard 465 RTO [RFC6298], RTO-restart [RFC7765], F-RTO [RFC5682] and Eifel 466 algorithms [RFC3522]. This is because RACK only detects loss by 467 using ACK events. It neither changes the timer calculation nor 468 detects spurious timeouts. 470 Furthermore, RACK naturally works well with Tail Loss Probe [TLP] 471 because a tail loss probe solicits either an ACK or a SACK, which can 472 be used by RACK to detect more losses. RACK can be used to relax 473 TLP's requirement for using FACK and retransmitting the 474 highest-sequenced packet, because RACK is agnostic to packet sequence 475 numbers, and uses transmission time instead. Thus TLP can be 476 modified to retransmit the first unacknowledged packet, which can 477 improve application latency. 479 6.5. Interaction with congestion control 481 RACK intentionally decouples loss detection from congestion control. 482 RACK only detects losses; it does not modify the congestion control 483 algorithm [RFC5681][RFC6937]. However, RACK may detect losses 484 earlier or later than the conventional duplicate ACK threshold 485 approach does. A packet marked lost by RACK SHOULD NOT be 486 retransmitted until congestion control deems this appropriate (e.g., 487 using [RFC6937]). 489 RACK is applicable for both fast recovery and recovery after a 490 retransmission timeout (RTO) in [RFC5681].
The distinction between 491 fast recovery and RTO recovery is not necessary because RACK is purely 492 based on the transmission time order of packets. When a packet 493 retransmitted by RTO is acknowledged, RACK will mark any unacked 494 packet sent sufficiently prior to the RTO as lost, because at least 495 one RTT has elapsed since these packets were sent. 497 6.6. RACK for other transport protocols 499 RACK can be implemented in other transport protocols. The algorithm 500 can be simplified by skipping step 3 if the protocol supports a unique 501 transmission or packet identifier (e.g., TCP echo options). For 502 example, the QUIC protocol implements RACK [QUIC-LR]. 504 7. Security Considerations 506 RACK does not change the risk profile for TCP. 508 An interesting scenario is ACK-splitting attacks [SCWA99]: for an 509 MSS-size packet sent, the receiver or the attacker might send MSS 510 ACKs that SACK or acknowledge one additional byte per ACK. This 511 would not fool RACK. RACK.xmit_ts would not advance because all the 512 sequences of the packet are transmitted at the same time (carry the 513 same transmission timestamp). In other words, SACKing only one byte 514 of a packet or SACKing the packet in its entirety has the same effect on 515 RACK. 517 8. IANA Considerations 519 This document makes no request of IANA. 521 Note to RFC Editor: this section may be removed on publication as an 522 RFC. 524 9. Acknowledgments 526 The authors thank Matt Mathis for his insights in FACK and Michael 527 Welzl for his per-packet timer idea that inspired this work. Nandita 528 Dukkipati, Eric Dumazet, Randy Stewart, Van Jacobson, Ian Swett, and 529 Jana Iyengar contributed to the algorithm and the implementations in 530 Linux, FreeBSD and QUIC. 532 10. References 534 10.1. Normative References 536 [RFC793] Postel, J., "Transmission Control Protocol", RFC 793, September 537 1981. 539 [RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment 540 Options", RFC 2018, October 1996.
542 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 543 Rate Reduction for TCP", RFC 6937, May 2013. 545 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 546 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 547 November 2006. 549 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 550 and Y. Nishida, "A Conservative Loss Recovery Algorithm 551 Based on Selective Acknowledgment (SACK) for TCP", 552 RFC 6675, August 2012. 554 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 555 "Computing TCP's Retransmission Timer", RFC 6298, June 556 2011. 558 [RFC5827] Allman, M., Ayesta, U., Wang, L., Blanton, J., and P. 559 Hurtig, "Early Retransmit for TCP and Stream Control 560 Transmission Protocol (SCTP)", RFC 5827, April 2010. 562 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 563 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 564 Spurious Retransmission Timeouts with TCP", RFC 5682, 565 September 2009. 567 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 568 Requirement Levels", RFC 2119, March 1997. 570 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 571 Control", RFC 5681, September 2009. 573 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 574 Extension to the Selective Acknowledgement (SACK) Option 575 for TCP", RFC 2883, July 2000. 577 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R. 578 Scheffenegger, "TCP Extensions for High Performance", RFC 7323, 579 September 2014. 581 10.2. Informative References 583 [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement: 584 refining TCP congestion control", ACM SIGCOMM Computer 585 Communication Review, Volume 26, Issue 4, Oct. 1996. 588 [TLP] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, 589 "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of 590 Tail Drops", draft-dukkipati-tcpm-tcp-loss-probe-01 (work 591 in progress), August 2013.
593 [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP 594 and SCTP RTO Restart", RFC 7765, February 2016. 596 [REORDER-DETECT] 597 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 598 "Detection and Quantification of Packet Reordering with 599 TCP", draft-zimmermann-tcpm-reordering-detection-02 (work 600 in progress), November 2014. 602 [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Recovery And 603 Congestion Control", draft-tsvwg-quic-loss-recovery-01 604 (work in progress), June 2016. 606 [THIN-STREAM] 607 Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen, 608 "TCP enhancements for interactive thin-stream 609 applications", NOSSDAV, 2008. 611 [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 612 "TCP Congestion Control With a Misbehaving Receiver", ACM 613 Computer Communication Review, 29(5), 1999. 615 [POLICER16] 616 Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng, 617 Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An 618 Analysis of Traffic Policing in the Web", ACM SIGCOMM, 619 2016. 621 Authors' Addresses 623 Yuchung Cheng 624 Google, Inc 625 1600 Amphitheater Parkway 626 Mountain View, California 94043 627 USA 629 Email: ycheng@google.com 631 Neal Cardwell 632 Google, Inc 633 76 Ninth Avenue 634 New York, NY 10011 635 USA 637 Email: ncardwell@google.com