idnits 2.17.1 

draft-zimmermann-tcpm-reordering-reaction-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (May 20, 2014) is 3626 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Outdated reference: A later version (-02) exists of
     draft-zimmermann-tcpm-reordering-detection-01

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2861
     (Obsoleted by RFC 7661)

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)


     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	TCP Maintenance and Minor Extensions                       A. Zimmermann
3	(TCPM) WG                                                   NetApp, Inc.
4	Internet-Draft                                                L. Schulte
5	Intended status: Experimental                           Aalto University
6	Expires: November 21, 2014                                      C. Wolff
7	                                                            A. Hannemann
8	                                                           credativ GmbH
9	                                                            May 20, 2014

11	         Making TCP Adaptively Robust to Non-Congestion Events
12	              draft-zimmermann-tcpm-reordering-reaction-01

14	Abstract

16	   This document specifies an adaptive Non-Congestion Robustness (aNCR)
17	   mechanism for TCP.  In the absence of explicit congestion
18	   notification from the network, TCP uses only packet loss as an
19	   indication of congestion.  One of the signals TCP uses to determine
20	   loss is the arrival of three duplicate acknowledgments.  However,
21	   this heuristic is not always correct, notably in the case when paths
22	   reorder packets.  This results in degraded performance.

24	   TCP-aNCR is designed to mitigate this performance degradation by
25	   adaptively increasing the number of duplicate acknowledgments
26	   required to trigger loss recovery, based on the current state of the
27	   connection, in an effort to better disambiguate true segment loss
28	   from segment reordering.  This document specifies the changes to TCP
29	   and TCP-NCR (on which this specification is built) and discusses
30	   the costs and benefits of these modifications.

32	Status of this Memo

34	   This Internet-Draft is submitted in full conformance with the
35	   provisions of BCP 78 and BCP 79.

37	   Internet-Drafts are working documents of the Internet Engineering
38	   Task Force (IETF).  Note that other groups may also distribute
39	   working documents as Internet-Drafts.  The list of current Internet-
40	   Drafts is at http://datatracker.ietf.org/drafts/current/.

42	   Internet-Drafts are draft documents valid for a maximum of six months
43	   and may be updated, replaced, or obsoleted by other documents at any
44	   time.  It is inappropriate to use Internet-Drafts as reference
45	   material or to cite them other than as "work in progress."

47	   This Internet-Draft will expire on November 21, 2014.

49	Copyright Notice

51	   Copyright (c) 2014 IETF Trust and the persons identified as the
52	   document authors.  All rights reserved.

54	   This document is subject to BCP 78 and the IETF Trust's Legal
55	   Provisions Relating to IETF Documents
56	   (http://trustee.ietf.org/license-info) in effect on the date of
57	   publication of this document.  Please review these documents
58	   carefully, as they describe your rights and restrictions with respect
59	   to this document.  Code Components extracted from this document must
60	   include Simplified BSD License text as described in Section 4.e of
61	   the Trust Legal Provisions and are provided without warranty as
62	   described in the Simplified BSD License.

64	Table of Contents

66	   1.  Introduction
67	   2.  Terminology
68	   3.  Basic Concept
69	   4.  Appropriate Detection and Quantification Algorithms
70	   5.  The TCP-aNCR Algorithm
71	     5.1.  Initialization during Connection Establishment
72	     5.2.  Initializing Extended Limited Transmit
73	     5.3.  Executing Extended Limited Transmit
74	     5.4.  Terminating Extended Limited Transmit
75	     5.5.  Entering Loss Recovery
76	     5.6.  Reordering Extent
77	     5.7.  Retransmission Timeout
78	   6.  Protocol Steps in Detail
79	   7.  Discussion of TCP-aNCR
80	     7.1.  Variable Duplicate Acknowledgment Threshold
81	     7.2.  Relative Reordering Extent
82	     7.3.  Reordering during Slow Start
83	     7.4.  Preventing Bursts
84	     7.5.  Persistent receiving of Selective Acknowledgments
85	   8.  Interoperability Issues
86	     8.1.  Early Retransmit
87	     8.2.  Congestion Window Validation
88	     8.3.  Reactive Response to Packet Reordering
89	     8.4.  Buffer Auto-Tuning
90	   9.  Related Work
91	   10. IANA Considerations
92	   11. Security Considerations
93	   12. Acknowledgments
94	   13. References
95	     13.1. Normative References
96	     13.2. Informative References
97	   Appendix A.  Changes from previous versions of the draft
98	     A.1.  Changes from
99	           draft-zimmermann-tcpm-reordering-reaction-00
100	   Authors' Addresses

102	1.  Introduction

104	   One strength of the Transmission Control Protocol (TCP) [RFC0793]
105	   lies in its ability to adjust its sending rate according to the
106	   perceived congestion in the network [RFC5681].  In the absence of
107	   explicit notification of congestion from the network, TCP uses
108	   segment loss as an indication of congestion (i.e., assuming queue
109	   overflow).  A TCP receiver sends cumulative acknowledgments (ACKs)
110	   indicating the next sequence number expected from the sender for
111	   arriving segments [RFC0793].  When segments arrive out of order,
112	   duplicate ACKs are generated.  As specified in [RFC5681], a TCP
113	   sender uses the arrival of three duplicate ACKs as an indication of
114	   segment loss.  The TCP sender retransmits the segment assumed lost
115	   and reduces the sending rate, based on the assumption that the loss
116	   was caused by resource contention on the path.  The TCP sender does
117	   not assume loss on the first or second duplicate ACK, but waits for
118	   three duplicate ACKs to account for minor packet reordering.
119	   However, the use of this constant threshold of duplicate ACKs leads
120	   to performance degradation if the extent of the packet reordering in
121	   the network increases [RFC4653].

123	   Whenever interoperability with the TCP congestion control and loss
124	   recovery standard [RFC5681] is a prerequisite, increasing the
125	   duplicate acknowledgment threshold (DupThresh) is the method of
126	   choice to a priori prevent any negative impact - in particular, a
127	   spurious Fast Retransmit and Fast Recovery phase - that packet
128	   reordering has on TCP.  However, this procedure also delays a Fast
129	   Retransmit by increasing the DupThresh, and therefore has costs and
130	   risks, too.  According to [Zha+03], these are: (1) a delayed response
131	   to congestion in the network, (2) a potential expiration of the
132	   retransmission timer, and (3) a significant increase in the end-to-
133	   end delay for lost segments.

135	   In the current TCP standard, congestion control and loss recovery are
136	   tightly coupled: when the oldest outstanding segment is declared
137	   lost, a retransmission is triggered, and the sending rate is reduced
138	   on the assumption that the loss is due to resource contention
139	   [RFC5681].  Therefore, any change to DupThresh causes not only a
140	   change to the loss recovery, but also to the congestion control
141	   response.  TCP-NCR [RFC4653] addresses this problem by defining two
142	   extensions to TCP's Limited Transmit [RFC3042] scheme: Careful and
143	   Aggressive Extended Limited Transmit.

145	   The first variant of the two, Careful Limited Transmit, sends one
146	   previously unsent segment in response to duplicate acknowledgments
147	   for every two segments that are known to have left the network.  This
148	   effectively halves the sending rate, since normal TCP operation sends
149	   one new segment for every segment that has left the network.

151	   Further, the halving starts immediately and is not delayed until a
152	   retransmission is triggered.  In the case of packet reordering (i.e.,
153	   not segment loss), TCP-NCR restores the congestion control state to
154	   its previous state after the event.

156	   The second variant, Aggressive Limited Transmit, transmits one
157	   previously unsent data segment in response to duplicate
158	   acknowledgments for every segment known to have left the network.
159	   With this variant, while waiting to disambiguate the loss from a
160	   reordering event, ACK-clocked transmission continues at roughly the
161	   same rate as before the event started.  Retransmission and the
162	   sending rate reduction happen per [RFC5681] [RFC6675], albeit after a
163	   delay caused by the increased DupThresh.  Although this approach
164	   delays legitimate rate reductions (possibly slightly, and temporarily
165	   aggravating overall congestion on the network), the scheme has the
166	   advantage of not reducing the transmission rate in the face of packet
167	   reordering.

169	   A basic requirement for preventing an avoidable expiration of the
170	   retransmission timer is to generally ensure that an increased
171	   DupThresh can potentially be reached in time so that Fast Retransmit
172	   is triggered and Fast Recovery is completed before the RTO expires.
173	   Simply increasing DupThresh before retransmitting a segment can make
174	   TCP brittle to packet or ACK loss, since such loss reduces the number
175	   of duplicate ACKs that will arrive at the sender from the receiver.
176	   For instance, if cwnd is 10 segments and one segment is lost, a
177	   DupThresh of 10 will never be met, because duplicate ACKs
178	   corresponding to at most 9 segments will arrive at the sender.  To
179	   mitigate this issue, the TCP-NCR [RFC4653] modification makes two
180	   fundamental changes to the way [RFC5681] [RFC6675] currently
181	   operates.

183	   First, as mentioned above, TCP-NCR [RFC4653] extends TCP's Limited
184	   Transmit [RFC3042] scheme to allow for the sending of new data
185	   segments while the TCP sender stays in the 'disorder' state and
186	   disambiguates loss from reordering.  This new data serves to increase
187	   the likelihood that enough duplicate ACKs arrive at the sender to
188	   trigger loss recovery, if it is appropriate.  Second, DupThresh is
189	   increased from the current fixed value of three [RFC5681] to a value
190	   indicating that approximately a congestion window's worth of data has
191	   left the network.  Since cwnd represents the amount of data a TCP
192	   sender can transmit in one round-trip time (RTT), this corresponds to
193	   approximately the largest amount of time a TCP sender can wait before
194	   the costly retransmission timeout may be triggered.

196	   Of vital importance is that TCP-NCR [RFC4653] does not hold DupThresh
197	   constant, but dynamically adjusts it on each SACK to the current
198	   amount of outstanding data, which depends not only on the congestion
199	   window, but also on the receiver's advertised window.  Thus, it is
200	   guaranteed that the outstanding data generates a sufficient number of
201	   duplicate ACKs for reaching DupThresh and a transition to the
202	   'recovery' state.  This is important in cases where there is no new
203	   data available to send.

205	   Regarding the problem of packet reordering, TCP-NCR's [RFC4653]
206	   decision of waiting to receive notice that cwnd bytes have left the
207	   network before deciding whether the root cause is loss or reordering
208	   is essentially a trade-off between making the best decision regarding
209	   the cause of the duplicate ACKs and responsiveness, and represents a
210	   good compromise between avoiding spurious Fast Retransmits and
211	   avoiding unnecessary RTOs.  On the other hand, if there is no visible
212	   packet reordering on the network path - which today is the rule and
213	   not the exception - or the delay caused by the reordering is very
214	   low, delaying Fast Retransmit is unnecessary in the case of
215	   congestion, and data is delivered to the application up to one RTT
216	   later.  Especially for delay-sensitive applications, such as a
217	   terminal session over SSH, this is generally undesirable.  By
218	   dynamically adapting DupThresh not only to the amount of outstanding
219	   data but also to the perceived packet reordering on the network path,
220	   this issue can be offset.  This is the key idea behind the TCP-aNCR
221	   algorithm.

223	   This document specifies a set of TCP modifications to provide an
224	   adaptive Non-Congestion Robustness (aNCR) mechanism for TCP.  The
225	   TCP-aNCR modifications lend themselves to incremental deployment.
226	   Only the TCP implementation on the sender side requires modification.
227	   The changes themselves are modest.  TCP-aNCR is built on top of the
228	   TCP Selective Acknowledgments Option [RFC2018] and the SACK-based
229	   loss recovery scheme given in [RFC6675] and represents an enhancement
230	   of the original TCP-NCR mechanism [RFC4653].  Currently, TCP-aNCR is
231	   an independent approach to making TCP more robust to packet
232	   reordering.  It is not yet clear whether TCP-aNCR, in upcoming
233	   versions of this draft, will obsolete TCP-NCR.

235	   It should be noted that the TCP-aNCR algorithm in this document could
236	   be easily adapted to the Stream Control Transmission Protocol (SCTP)
237	   [RFC2960], since SCTP uses congestion control algorithms similar to
238	   TCP (and thus has the same reordering robustness issues).

240	   The remainder of this document is organized as follows.  Section 3
241	   provides a high-level description of the TCP-aNCR mechanism.
242	   Section 4 defines TCP-aNCR's requirements for an appropriate
243	   detection and quantification algorithm.  Section 5 specifies the TCP-
244	   aNCR algorithm and Section 6 discusses each step of the algorithm in
245	   detail.  Section 7 provides a discussion of several design decisions
246	   behind TCP-aNCR.  Section 8 discusses interoperability issues related
247	   to introducing TCP-aNCR.  Finally, related work is presented in
248	   Section 9 and security concerns in Section 11.

250	2.  Terminology

252	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
253	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
254	   document are to be interpreted as described in [RFC2119].

256	   The reader is expected to be familiar with the TCP state variables
257	   described in [RFC0793] (SND.NXT), [RFC5681] (cwnd, rwnd, ssthresh,
258	   FlightSize, IW), [RFC6675] (pipe, DupThresh, SACK scoreboard), and
259	   [RFC6582] (recover).  Further, the term 'acceptable acknowledgment'
260	   is used as defined in [RFC0793].  That is, an ACK that increases the
261	   connection's cumulative ACK point by acknowledging previously
262	   unacknowledged data.  The term 'duplicate acknowledgment' is used as
263	   defined in [RFC6675], which is different from the definition of
264	   duplicate acknowledgment in [RFC5681].

266	   This specification defines the four TCP sender states 'open',
267	   'disorder', 'recovery', and 'loss' as follows.  As long as no
268	   duplicate ACK is received and no segment is considered lost, the TCP
269	   sender is in the 'open' state.  Upon the reception of the first
270	   consecutive duplicate ACK, TCP will enter the 'disorder' state.
271	   After receiving DupThresh duplicate ACKs, the TCP sender switches to
272	   the 'recovery' state and executes standard loss recovery procedures
273	   like Fast Retransmit and Fast Recovery [RFC5681].  Upon a
274	   retransmission timeout, the TCP sender enters the 'loss' state.  The
275	   'recovery' state can only be reached by a transition from the
276	   'disorder' state; the 'loss' state can be reached from any other
277	   state.

279	   The following specification depends on the standard TCP congestion
280	   control and loss recovery algorithms and the SACK-based loss recovery
281	   scheme given in [RFC5681] and [RFC6675], respectively.  The algorithm
282	   is an enhancement of TCP-NCR [RFC4653].  The reader is assumed
283	   to be familiar with the algorithms specified in these documents.

285	3.  Basic Concept

287	   The general idea behind the TCP-aNCR algorithm is to extend the TCP-
288	   NCR algorithm [RFC4653], so that - based on an appropriate packet
289	   reordering detection and quantification algorithm (see Section 4) -
290	   TCP congestion control and loss recovery [RFC5681] is adaptively
291	   adjusted to the actual perceived packet reordering on the network
292	   path.

294	   TCP-NCR [RFC4653] increases DupThresh from the current fixed value of
295	   three duplicate ACKs [RFC5681] to a value reached roughly when a
296	   cwnd's worth of data has left the network.  Since cwnd represents the
297	   amount of data a TCP sender can transmit in one RTT, the choice to
298	   trigger a retransmission only after a cwnd's worth of data is known
299	   to have left the network represents roughly the largest amount of
300	   time a TCP sender can wait before the RTO may be triggered.  The
301	   approach chosen in TCP-aNCR is to take TCP-NCR's DupThresh as an
302	   upper bound for an adjustment of the DupThresh that is adaptive to
303	   the actual packet reordering on the network path.

305	   Using TCP-NCR's DupThresh as an upper bound decouples the avoidance
306	   of spurious Fast Retransmits from the avoidance of unnecessary
307	   retransmission timeouts.  Therefore, the adaptive adjustment of the
308	   DupThresh to current perceived packet reordering can be conducted
309	   without taking any retransmission timeout avoidance strategy into
310	   account.  This independence allows TCP-aNCR to quickly respond to
311	   perceived packet reordering by setting its DupThresh so that it
312	   always corresponds to the minimum of the maximum possible (TCP-NCR's
313	   DupThresh) and the maximum measured reordering extent since the last
314	   RTO.  The reordering extent used by TCP-aNCR is by itself not a
315	   static absolute reordering extent, but a relative reordering extent
316	   (see Section 4).

318	4.  Appropriate Detection and Quantification Algorithms

320	   If the TCP-aNCR algorithm is implemented at the TCP sender, it MUST
321	   be implemented together with an appropriate packet reordering
322	   detection and quantification algorithm that is specified in a
323	   standards track or experimental RFC.

325	   Designers of reordering detection algorithms who want their
326	   algorithms to work together with the TCP-aNCR algorithm SHOULD reuse
327	   the variable 'ReorExtR' (relative reordering extent) with the
328	   semantics and defined values specified in
329	   [I-D.zimmermann-tcpm-reordering-detection].  A 'ReorExtR' given by
330	   the detection algorithm holds a value ranging from 0 to 1 that gives
331	   the newly measured reordering sample as a fraction of the data in
332	   flight.  TCP-aNCR then saves this new fraction if it is greater than
333	   the current value.
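
   The rule for saving a new sample can be sketched as follows.  This is
   a minimal illustration, not part of the specification: the function
   name and the range check are assumptions, and the value -1 is the
   "adaptation disabled" marker introduced in Section 5.1.

```python
def update_reor_ext_r(current: float, sample: float) -> float:
    """Return the stored maximum relative reordering extent.

    `current` is the value TCP-aNCR holds; `sample` is the new
    relative reordering extent reported by the detection algorithm.
    Both names are hypothetical.
    """
    if current == -1:
        # Adaptation is disabled; the stored value stays untouched.
        return current
    if not 0.0 <= sample <= 1.0:
        # The sample is defined as a fraction of the data in flight.
        raise ValueError("sample must lie in the range 0 to 1")
    # Save the new fraction only if it exceeds the current value.
    return max(current, sample)
```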

335	5.  The TCP-aNCR Algorithm

337	   When both the Nagle algorithm [RFC0896] [RFC1122] and the TCP
338	   Selective Acknowledgment Option [RFC2018] are enabled for a
339	   connection, a TCP sender MAY employ the following TCP-aNCR algorithm
340	   to dynamically adapt TCP's congestion control and loss recovery
341	   [RFC5681] to the currently perceived packet reordering on the network
342	   path.

344	   Without the Nagle algorithm, there is no straightforward way to
345	   accurately calculate the number of outstanding segments in the
346	   network (and, therefore, no good way to derive an appropriate
347	   DupThresh) without adding state to the TCP sender.  A TCP connection
348	   that does not use the Nagle algorithm SHOULD NOT use TCP-aNCR.  The
349	   adaptation of TCP-aNCR to an implementation that carefully tracks the
350	   sequence numbers transmitted in each segment is considered future
351	   work.

353	   A necessary prerequisite for TCP-aNCR's adaptability is that a TCP
354	   sender has enabled an appropriate detection and quantification
355	   algorithm that complies with the requirements defined in Section 4.
356	   If such an algorithm is either non-existent or not used, the behavior
357	   of TCP-aNCR is completely analogous to the TCP-NCR algorithm as
358	   defined in [RFC4653].  If a TCP sender does implement TCP-aNCR, the
359	   implementation MUST follow the various specifications provided in
360	   Sections 5.1 to 5.7.

362	5.1.  Initialization during Connection Establishment

364	   After the completion of the TCP connection establishment, the
365	   following state constants and variables MUST be initialized in the
366	   TCP transmission control block for the given TCP connection:

368	   (C.1)  Depending on which variant of Extended Limited Transmit should
369	          be executed, the constant LT_F MUST be initialized as follows.
370	          For Careful Extended Limited Transmit:

372	             LT_F = 2/3

374	          For Aggressive Extended Limited Transmit:

376	             LT_F = 1/2

378	          This constant reflects the fraction of outstanding data
379	          (including data sent during Extended Limited Transmit) that
380	          must be SACKed before a retransmission is triggered at the
381	          latest.

383	   (C.2)  If TCP-aNCR should adaptively adjust the DupThresh to the
384	          current perceived packet reordering on the network path, then
385	          the variable 'ReorExtR', which stores the maximum relative
386	          reordering extent, MUST be initialized as:

388	             ReorExtR = 0

390	          Otherwise the dynamic adaptation of TCP-aNCR SHOULD be
391	          disabled by setting

393	             ReorExtR = -1

395	          A relative reordering extent of 0 results in the standard
396	          DupThresh of three duplicate ACKs, as defined in [RFC5681].  A
397	          fixed relative reordering extent of -1 results in the TCP-NCR
398	          behavior from [RFC4653].
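
   Steps (C.1) and (C.2) could be captured in code roughly as shown
   below.  The class and function names are hypothetical, and using an
   exact fraction for LT_F is an implementation choice made only for
   this sketch.

```python
from dataclasses import dataclass
from fractions import Fraction

LT_F_CAREFUL = Fraction(2, 3)     # Careful Extended Limited Transmit
LT_F_AGGRESSIVE = Fraction(1, 2)  # Aggressive Extended Limited Transmit


@dataclass
class AncrConfig:
    """Hypothetical per-connection TCP-aNCR state set at establishment."""
    lt_f: Fraction
    reor_ext_r: float


def init_ancr(careful: bool, adaptive: bool) -> AncrConfig:
    # (C.1) pick LT_F according to the Extended Limited Transmit variant.
    lt_f = LT_F_CAREFUL if careful else LT_F_AGGRESSIVE
    # (C.2) 0 enables adaptation; -1 disables it (TCP-NCR behavior).
    reor_ext_r = 0.0 if adaptive else -1.0
    return AncrConfig(lt_f=lt_f, reor_ext_r=reor_ext_r)
```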

400	5.2.  Initializing Extended Limited Transmit

402	   If the SACK scoreboard is empty upon the receipt of a duplicate ACK
403	   (i.e., the TCP sender has received no SACK information from the
404	   receiver), a TCP sender MUST enter Extended Limited Transmit by
405	   initializing the following state variables in the TCP Transmission
406	   Control Block:

408	   (I.1)  The TCP sender MUST save the current outstanding data:

410	             FlightSizePrev = FlightSize

412	   (I.2)  The TCP sender MUST save the highest sequence number
413	          transmitted so far:

415	             recover = SND.NXT - 1

417	          Note: The state variable 'recover' from [RFC6582] can be
418	          reused, since NewReno TCP uses 'recover' at the initialization
419	          of a loss recovery procedure, whereas TCP-aNCR uses 'recover'
420	          *before* loss recovery.

422	   (I.3)  The TCP sender MUST initialize the variable 'skipped' that
423	          tracks the number of segments for which an ACK does not
424	          trigger a transmission during Careful Limited Transmit:

426	             skipped = 0

428	          During Aggressive Limited Transmit, 'skipped' is not used.

430	   (I.4)  The TCP sender MUST set DupThresh based on the current
431	          FlightSize:

433	             DupThresh = max (LT_F * (FlightSize / SMSS), 3)

435	          The lower bound of DupThresh = 3 is kept from [RFC5681]
437	          [RFC6675].

439	   (I.5)  If (ReorExtR != -1) holds, then the TCP sender MUST set
440	          DupThresh based on the relative reordering extent 'ReorExtR':

442	             DupThresh = max (min (DupThresh,
443	                                   ReorExtR * (FlightSize / SMSS)), 3)

445	   In addition to the above steps, the incoming ACK MUST be processed
446	   with the (E) series of steps in Section 5.3.
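
   A minimal sketch of steps (I.1) to (I.5) follows.  The Sender class is
   a hypothetical stand-in for the transmission control block, and the
   sketch leaves DupThresh unrounded, since the steps above do not
   state a rounding rule.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Hypothetical subset of a TCP transmission control block."""
    flight_size: int      # FlightSize, bytes currently outstanding
    snd_nxt: int          # SND.NXT
    smss: int             # sender maximum segment size, in bytes
    lt_f: float           # LT_F from step (C.1)
    reor_ext_r: float     # ReorExtR from step (C.2); -1 disables adaptation
    flight_size_prev: int = 0
    recover: int = 0
    skipped: int = 0
    dup_thresh: float = 3.0


def compute_dup_thresh(s: Sender) -> float:
    segments = s.flight_size / s.smss
    # (I.4) upper bound inherited from TCP-NCR, never below 3.
    thresh = max(s.lt_f * segments, 3)
    if s.reor_ext_r != -1:
        # (I.5) adapt to the measured relative reordering extent.
        thresh = max(min(thresh, s.reor_ext_r * segments), 3)
    return thresh


def enter_extended_limited_transmit(s: Sender) -> None:
    s.flight_size_prev = s.flight_size    # (I.1)
    s.recover = s.snd_nxt - 1             # (I.2)
    s.skipped = 0                         # (I.3)
    s.dup_thresh = compute_dup_thresh(s)  # (I.4) and (I.5)
```

   For example, with 30 segments outstanding, LT_F = 1/2, and
   ReorExtR = 0.25, the sketch yields a DupThresh of 7.5 segments; with
   ReorExtR = -1 it yields the TCP-NCR value of 15.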

448	5.3.  Executing Extended Limited Transmit

450	   On each ACK that a) arrives after TCP-aNCR has entered the Extended
451	   Limited Transmit phase (as outlined in Section 5.2) *and* b) carries
452	   new SACK information, *and* c) does *not* advance the cumulative ACK
453	   point, the TCP sender MUST use the following procedure.

455	   (E.1)  The TCP sender MUST update the SACK scoreboard and use the
456	          SetPipe() procedure from [RFC6675] to set the 'pipe' variable
457	          (which represents the number of bytes still considered "in the
458	          network").  Note: the current value of DupThresh MUST be used
459	          by SetPipe() to produce an accurate assessment of the amount
460	          of data still considered in the network.

462	   (E.2)  The TCP sender MUST initialize the variable 'burst' that
463	          tracks the number of segments that can at most be sent per ACK
464	          to the size of the Initial Window (IW) [RFC5681]:

466	             burst = IW

468	   (E.3)  If a) (cwnd - pipe - skipped >= 1 * SMSS) holds, *and* b) the
469	          receive window (rwnd) allows sending SMSS bytes of previously
470	          unsent data, *and* c) there are SMSS bytes of previously
471	          unsent data available for transmission, then the TCP sender
472	          MUST transmit one segment of SMSS bytes.  Otherwise, the TCP
473	          sender MUST skip to step (E.7).

475	   (E.4)  The TCP sender MUST increment 'pipe' by SMSS bytes and MUST
476	          decrement 'burst' by SMSS bytes to reflect the newly
477	          transmitted segment:

479	             pipe = pipe + SMSS
480	             burst = burst - SMSS

482	   (E.5)  If Careful Limited Transmit is used, 'skipped' MUST be
483	          incremented by SMSS bytes to ensure that the next SMSS bytes
484	          of SACKed data processed do not trigger a Limited Transmit
485	          transmission.

487	             skipped = skipped + SMSS

489	   (E.6)  If (burst > 0) holds, the TCP sender MUST return to step (E.3)
490	          to ensure that as many bytes as appropriate are transmitted.
491	          Otherwise, if more than IW bytes were SACKed by a single ACK,
492	          the TCP sender MUST skip to step (E.7).  The additional amount
493	          of data becomes available again on the next received duplicate
494	          ACK and the re-execution of SetPipe().

496	   (E.7)  The TCP sender MUST save the maximum amount of data that is
497	          considered to have been in the network during the last RTT:

499	             pipe_max = max (pipe, pipe_max)

501	   (E.8)  The TCP sender MUST set DupThresh based on the current
502	          FlightSize:

504	             DupThresh = max (LT_F * (FlightSize / SMSS), 3)

506	          The lower bound of DupThresh = 3 is kept from [RFC5681]
507	          [RFC6675].

509	   (E.9)  If (ReorExtR != -1) holds, then the TCP sender MUST set
510	          DupThresh based on the relative reordering extent 'ReorExtR':

512	             DupThresh = max (min (DupThresh,
513	                                   ReorExtR * (FlightSize / SMSS)), 3)
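
   The transmission loop formed by steps (E.2) to (E.7) can be sketched
   as below.  The sketch assumes that 'pipe' has already been set by
   SetPipe() in step (E.1); the parameters rwnd_room and unsent are
   hypothetical byte counts standing in for conditions b) and c) of
   step (E.3).  Steps (E.8) and (E.9) repeat the DupThresh computation
   of Section 5.2 and are omitted here.

```python
def execute_elt_step(cwnd, pipe, skipped, iw, smss,
                     rwnd_room, unsent, careful, pipe_max):
    """Return (segments_sent, pipe, skipped, pipe_max) for one ACK.

    All quantities except segments_sent are in bytes.
    """
    burst = iw                                  # (E.2)
    sent = 0
    while (cwnd - pipe - skipped >= smss        # (E.3) condition a)
           and rwnd_room >= smss                # (E.3) condition b)
           and unsent >= smss):                 # (E.3) condition c)
        sent += 1                               # transmit one SMSS segment
        rwnd_room -= smss
        unsent -= smss
        pipe += smss                            # (E.4)
        burst -= smss
        if careful:
            skipped += smss                     # (E.5)
        if burst <= 0:                          # (E.6) burst limit reached
            break
    pipe_max = max(pipe, pipe_max)              # (E.7)
    return sent, pipe, skipped, pipe_max
```

   With cwnd = 10000, pipe = 8000, and SMSS = 1000, the aggressive
   variant sends two segments on this ACK, while the careful variant
   sends one, because 'skipped' absorbs the second opportunity.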

515	5.4.  Terminating Extended Limited Transmit

517	   On the receipt of an ACK that a) arrives after TCP-aNCR has
518	   entered the Extended Limited Transmit phase (as outlined in
519	   Section 5.2) *and* b) advances the cumulative ACK point, the TCP
520	   sender MUST use the following procedure.

522	   The arrival of an acceptable ACK that advances the cumulative ACK
523	   point while in Extended Limited Transmit, but before loss recovery is
524	   triggered, signals that a series of duplicate ACKs was caused by
525	   reordering and not congestion.  Therefore, Extended Limited Transmit
526	   will be either terminated or re-entered.

528	   (T.1)  If the received ACK extends not only the cumulative ACK point,
529	          but *also* carries new SACK information (i.e., the ACK is both
530	          an acceptable ACK and a duplicate ACK), the TCP sender MUST
531	          restart Extended Limited Transmit and MUST go to step (T.2).
532	          Otherwise, the TCP sender MUST terminate it and MUST skip to
533	          step (T.3).

535	   (T.2)  If the Cumulative Acknowledgment field of the received ACK
536	          covers more than 'recover' (i.e., SEG.ACK > recover), Extended
537	          Limited Transmit has transmitted one cwnd worth of data
538	          without any losses and the TCP sender MUST update the
539	          following state variables by

541	             FlightSizePrev = pipe_max
542	             pipe_max = 0

544	          and MUST go to step (I.2) to re-start Extended Limited
545	          Transmit.  Otherwise if (SEG.ACK <= recover) holds, the TCP
546	          sender MUST go to step (I.3).  This ensures that in the event
547	          of a loss the cwnd reduction is based on a current value of
548	          FlightSizePrev.

550	   The following steps are executed only if the received ACK does *not*
551	   carry SACK information.  Extended Limited Transmit will be
552	   terminated.

554	   (T.3)  A TCP sender MUST set ssthresh to:

556	             ssthresh = max (cwnd, ssthresh)

558	          This step provides TCP-aNCR with a sense of "history".  If the
559	          next step (T.4) reduces the congestion window, this step
560	          ensures that TCP-aNCR will slow-start back to the operating
561	          point that was in effect before Extended Limited Transmit.

563	   (T.4)  A TCP sender MUST reset cwnd to:

565	             cwnd = FlightSize + SMSS

567	          This step ensures that cwnd is not significantly larger than
568	          the amount of data outstanding, a situation that would cause a
569	          line rate burst.

571	   (T.5)  A TCP is now permitted to transmit previously unsent data as
572	          allowed by cwnd, FlightSize, application data availability,
573	          and the receiver's advertised window.
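
   The decision logic of steps (T.1) to (T.4) can be sketched as
   follows.  The state container, the function name, and the returned
   step labels are assumptions made for illustration; sequence number
   wrap-around is ignored for brevity, and 'flight_size' is assumed
   to already reflect the received ACK.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    cwnd: int              # congestion window (bytes)
    ssthresh: int          # slow start threshold (bytes)
    flight_size: int       # outstanding data after processing the ACK
    flight_size_prev: int  # FlightSizePrev
    pipe_max: int          # pipe_max
    recover: int           # 'recover'
    smss: int = 1460       # illustrative SMSS


def on_acceptable_ack(state, seg_ack, has_new_sack):
    """Return the step to continue with: "I.2", "I.3", or "T.5"."""
    if has_new_sack:
        # (T.1) The ACK is also a duplicate ACK: restart.
        if seg_ack > state.recover:
            # (T.2) One cwnd of data was sent without loss.
            state.flight_size_prev = state.pipe_max
            state.pipe_max = 0
            return "I.2"
        return "I.3"
    # (T.1) No SACK information: terminate Extended Limited Transmit.
    state.ssthresh = max(state.cwnd, state.ssthresh)  # (T.3)
    state.cwnd = state.flight_size + state.smss       # (T.4)
    return "T.5"
```

   Note that cwnd and ssthresh are only touched on the terminating
   path; a restart leaves both unchanged.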

575	5.5.  Entering Loss Recovery

577	   The receipt of an ACK that causes the oldest outstanding segment to
578	   be deemed lost via the algorithms in [RFC6675] terminates Extended
579	   Limited Transmit and initializes the loss recovery according to
580	   [RFC6675].  One slight change to [RFC6675] MUST be made, however.

582	   (Ret)  In Section 5, step (4.2) of [RFC6675] MUST be changed to:

584	                 ssthresh = cwnd = (FlightSizePrev / 2)

586	          This ensures that the congestion control modifications are
587	          made with respect to the amount of data in the network before
588	          FlightSize was increased by Extended Limited Transmit.

590	   Once the algorithm in [RFC6675] takes over from Extended Limited
591	   Transmit, the DupThresh value MUST be held constant until the loss
592	   recovery phase terminates.
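
   The effect of step (Ret) can be illustrated with a short sketch.
   The function name and the use of integer division are assumptions
   made for the example; the numbers are hypothetical.

```python
def enter_loss_recovery(flight_size_prev):
    """Modified step (4.2): ssthresh = cwnd = FlightSizePrev / 2."""
    ssthresh = cwnd = flight_size_prev // 2  # integer bytes (assumption)
    return ssthresh, cwnd


# Hypothetical numbers: 20 segments of 1460 bytes were outstanding
# when Extended Limited Transmit started.
flight_size_prev = 20 * 1460
# Extended Limited Transmit may have roughly doubled FlightSize.
flight_size_now = 2 * flight_size_prev

ssthresh, cwnd = enter_loss_recovery(flight_size_prev)
# Halving the inflated FlightSize instead would yield no reduction.
unmodified_cwnd = flight_size_now // 2
```

   In this example the modified step yields a cwnd of 14600 bytes,
   whereas halving the current FlightSize would leave cwnd at the
   original 29200 bytes.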

594	5.6.  Reordering Extent

596	   Whenever the additional detection and quantification algorithm (see
597	   Section 4) detects and quantifies a new reordering event, the TCP
598	   sender MUST update the state variable 'ReorExtR'.

600	   (Ext)  Let 'ReorExtR_New' be the newly determined relative
601	          reordering extent:

603	                 ReorExtR = min (max (ReorExtR, ReorExtR_New), 1)

605	5.7.  Retransmission Timeout

607	   The expiration of the retransmission timer SHOULD be interpreted as
608	   an indication of a path characteristics change, and the TCP sender
609	   SHOULD reset DupThresh to the default value of three.

611	   (RTO)  If an RTO occurs and (ReorExtR != -1) (i.e., TCP-aNCR is used
612	          and not TCP-NCR), then a TCP sender SHOULD reset 'ReorExtR':

614	                 ReorExtR = 0
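
   Steps (Ext) and (RTO) together amount to a running maximum that is
   clamped to 1 and reset on a timeout.  The following sketch shows
   this; the function names and the constant 'DISABLED' are
   illustrative, and update_reor_ext_r() assumes that the adaptive
   behavior is enabled (ReorExtR != -1).

```python
DISABLED = -1  # ReorExtR value meaning "behave like TCP-NCR"


def update_reor_ext_r(reor_ext_r, reor_ext_r_new):
    """Step (Ext): keep the largest extent seen so far, capped at 1."""
    return min(max(reor_ext_r, reor_ext_r_new), 1)


def on_rto(reor_ext_r):
    """Step (RTO): forget the saved extent unless adaptation is off."""
    if reor_ext_r != DISABLED:
        return 0
    return reor_ext_r
```

   For example, starting from ReorExtR = 0.2, a new extent of 0.5
   raises the value to 0.5, a later extent of 0.3 leaves it
   unchanged, and an extent of 1.7 is clamped to 1.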

616	6.  Protocol Steps in Detail

618	   Upon the receipt of the first duplicate ACK in the 'open' state (the
619	   SACK scoreboard is empty), the TCP sender starts to execute TCP-aNCR
620	   by entering the 'disorder' state and the initialization of Extended
621	   Limited Transmit.  First, the TCP sender saves the current amount of
622	   outstanding data as well as the highest sequence number transmitted
623	   so far (SND.NXT - 1) (steps (I.1) and (I.2)).  In addition, if the
624	   TCP connection uses the Careful variant of Extended Limited
625	   Transmit (step (C.1)), the 'skipped' variable, which tracks
626	   the number of segments for which an ACK does not trigger a
627	   transmission during Careful Limited Transmit, is initialized with
628	   zero (step (I.3)).  The last step during the initialization is the
629	   determination of DupThresh.  Depending on whether TCP-aNCR has been
630	   configured during the connection establishment to adaptively adjust
631	   to the currently perceived packet reordering on the path (step
632	   (C.2)), DupThresh is either determined exclusively based on the
633	   current FlightSize (as TCP-NCR [RFC4653] does) or, in addition, also
634	   based on the relative reordering extent (steps (I.4) and (I.5)).

636	   Depending on which variant of Extended Limited Transmit should be
637	   executed, the constant LT_F must be set accordingly (step (C.1)).
638	   This constant reflects the fraction of outstanding data (including
639	   data sent during Extended Limited Transmit) that must be SACKed
640	   before a retransmission is triggered at the latest (which is the case
641	   when a DupThresh that is based on relative reordering extent is
642	   larger than TCP-NCR's DupThresh).  Since Aggressive Limited Transmit
643	   sends a new segment for every segment known to have left the network,
644	   a total of approximately cwnd segments will be sent, and therefore
645	   ideally a total of approximately 2*cwnd segments will be outstanding
646	   when a retransmission is finally triggered.  DupThresh is then set to
647	   LT_F = 1/2 of 2*cwnd (or about 1 RTT's worth of data) (see step
648	   (I.4)).  The factor is different for Careful Limited Transmit,
649	   because the sender only transmits one new segment for every two
650	   segments that are SACKed and therefore will ideally have a
651	   maximum of 1.5*cwnd segments outstanding when the retransmission is
652	   triggered.  Hence, the required threshold is LT_F=2/3 of 1.5*cwnd to
653	   delay the retransmission by roughly 1 RTT.
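
   The arithmetic behind the two values of LT_F can be checked with a
   few lines of code.  The helper below is purely illustrative; it
   expresses the outstanding data in units of the cwnd at the start
   of Extended Limited Transmit and uses exact fractions to avoid
   rounding.

```python
from fractions import Fraction


def retransmit_delay_in_cwnds(lt_f, new_segments_per_sacked_segment):
    """LT_F times the ideal amount of outstanding data, in cwnd units."""
    # One cwnd was outstanding at the start; Limited Transmit adds
    # new data for (a fraction of) every SACKed segment.
    outstanding = 1 + Fraction(new_segments_per_sacked_segment)
    return lt_f * outstanding


# Aggressive: one new segment per SACKed segment, LT_F = 1/2.
aggressive = retransmit_delay_in_cwnds(Fraction(1, 2), 1)
# Careful: one new segment per two SACKed segments, LT_F = 2/3.
careful = retransmit_delay_in_cwnds(Fraction(2, 3), Fraction(1, 2))
```

   Both variants evaluate to exactly one cwnd, that is, a delay of
   roughly one RTT before the retransmission is triggered.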

655	   For each duplicate ACK received in the 'disorder' state, which is not
656	   an acceptable ACK, i.e., it carries new SACK information, but does
657	   not advance the cumulative ACK point, Extended Limited Transmit is
658	   executed.  First, the SACK scoreboard is updated and, based on the
659	   current value of DupThresh, the amount of outstanding data is
660	   determined (step (E.1)).  Furthermore, the state variable 'burst',
661	   which indicates the maximum number of segments that can be sent for
662	   each received ACK, is initialized to the size of the initial window
663	   [RFC6928] (step (E.2)).  If more than IW bytes were SACKed by a
664	   single ACK, the additional amount of data becomes available again
665	   with the next received duplicate ACK and the re-execution of
666	   SetPipe() (step (E.1)).

668	   Next, if new data is available for transmission and both the
669	   congestion window and the receiver window allow SMSS bytes of
670	   previously unsent data to be sent, a segment of SMSS bytes is sent
671	   (step (E.3)).  Subsequently, the corresponding state variables
672	   'pipe', 'burst' and - optionally - 'skipped' are updated (steps
673	   (E.4) and (E.5)).  If, due to the current size of the congestion and
674	   receiver windows (step (E.2)) or due to the current value of 'burst'
675	   (step (E.5)), no further segment may be sent, the processing of the
676	   ACK is terminated.  Provided that the amount of data that is
677	   currently considered to be in the network is greater than the
678	   previously stored one, this new value is stored for later use (step
679	   (E.7)).  Finally, to take into account the new data sent, DupThresh
680	   is updated (steps (E.6) and (E.7)).

682	   The arrival of an acceptable ACK in the 'disorder' state that
683	   advances the cumulative ACK point during Extended Limited Transmit
684	   signals that a series of duplicate ACKs was caused by reordering and
685	   not congestion.  Therefore, the receipt of an acceptable ACK that
686	   does not carry any SACK information terminates Extended Limited
687	   Transmit (step (T.1)).  The slow start threshold is set to the
688	   maximum of its current value and the current value of cwnd (step
689	   (T.3)).  Cwnd itself is set to the current value of FlightSize plus
690	   one segment (step (T.4)).  As a result, the congestion window is not
691	   significantly larger than the current amount of outstanding data, so
692	   that a burst of data is effectively prevented.  If new data is
693	   available for transmission and both the new values of cwnd and rwnd
694	   allow SMSS bytes of previously unsent data to be sent, a segment is
695	   sent (step (T.5)).

697	   On the other hand, if the received ACK acknowledges new data not only
698	   cumulatively but also selectively - the ACK carries new SACK
699	   information - Extended Limited Transmit is not terminated but re-
700	   entered (step (T.1)).  If the Cumulative Acknowledgment field of the
701	   received ACK covers more than 'recover', one cwnd worth of data has
702	   been transmitted during Extended Limited Transmit without any packet
703	   loss.  Therefore, FlightSizePrev, the amount of outstanding data
704	   saved at the beginning of Extended Limited Transmit (step (I.1)), is
705	   considered outdated (step (T.2)).  This step ensures that in the
706	   event of packet loss, the reduction of the cwnd is based on an up-to-
707	   date value, which reflects the number of bytes outstanding in the
708	   network (see Section 7).  Finally, regardless of whether or not
709	   'recover' is covered, Extended Limited Transmit is re-entered.

711	   The second case that leads to a termination of Extended Limited
712	   Transmit is the receipt of an ACK that signals via the algorithm in
713	   [RFC6675] that the oldest outstanding segment is considered lost.  If
714	   either DupThresh or more duplicate ACKs are received, or the oldest
715	   outstanding segment is deemed lost via the function IsLost() of
716	   [RFC6675], Extended Limited Transmit is terminated and SACK-based
717	   loss recovery is entered [RFC6675].  Once the algorithm in [RFC6675]
718	   takes over from Extended Limited Transmit, the DupThresh value MUST
719	   be held constant until loss recovery is terminated.  The process of
720	   loss recovery itself is not changed by TCP-aNCR.  The only exception
721	   is a slight change of the step (4.2) of RFC 6675 [RFC6675], which
722	   ensures that the adjustment made by the congestion control - halving
723	   the congestion window - is made with respect to the initial amount of
724	   outstanding data while Extended Limited Transmit is executed (step
725	   (Ret)).  The use of FlightSize at this point would no longer be valid
726	   since the amount of outstanding data may double by executing Extended
727	   Limited Transmit.

729	7.  Discussion of TCP-aNCR

731	   The specification of TCP-aNCR represents an incremental update of RFC
732	   4653 [RFC4653].  All changes made by TCP-aNCR can be divided into two
733	   categories.  On one hand, they implement TCP-aNCR's ability to
734	   dynamically adapt TCP congestion control and loss recovery
735	   [RFC5681] to the currently perceived packet reordering on the network
736	   path.  These include the use of a variable DupThresh and the use of a
737	   relative reordering extent.  On the other hand, there are changes
738	   that correct weaknesses of the original TCP-NCR algorithm and that
739	   are independent of TCP-aNCR's adaptability.  These include packet
740	   reordering during slow start, the prevention of bursts, and the
741	   persistent receipt of SACKs.

743	7.1.  Variable Duplicate Acknowledgment Threshold

745	   The central point of the TCP-aNCR algorithm is the usage of a
746	   DupThresh that is adaptable to the perceived packet reordering on the
747	   network path.  Based on the actual amount of outstanding data, TCP-
748	   NCR's DupThresh represents roughly the largest amount of time a Fast
749	   Retransmit can safely be delayed before a costly retransmission
750	   timeout may be triggered.  Therefore, to avoid an RTO, TCP-aNCR's
751	   reordering-aware DupThresh is bounded above by the one calculated in
752	   TCP-NCR (steps (I.5) and (E.9)).  This decouples the avoidance of
753	   spurious Fast Retransmits from the avoidance of RTOs.  It allows TCP-
754	   aNCR to react fast and efficiently to packet reordering.  The
755	   DupThresh always corresponds to the minimum of the largest possible
756	   and largest detected reordering.  With constant packet reordering in
757	   terms of the rate and delay, TCP-aNCR gives a DupThresh based on the
758	   relative reordering extent with an optimal delay for every bandwidth-
759	   delay-product.  If TCP-aNCR should not adaptively adjust the
760	   DupThresh to the current perceived packet reordering on the network
761	   path (because for example an appropriate detection and quantification
762	   algorithm is not implemented), the dynamic adaptation of TCP-aNCR
763	   can be disabled, so that TCP-aNCR behaves like TCP-NCR [RFC4653].

765	7.2.  Relative Reordering Extent

767	   Whenever a new reordering event is detected and presented to TCP-aNCR
768	   in the form of a relative reordering extent 'ReorExtR', TCP-aNCR
769	   saves and uses the new 'ReorExtR' if it is larger than the old one
770	   (step (Ext)).  The upper bound of 1 ensures that no excessively large
771	   value is used.  A 'ReorExtR' larger than one means that more than
772	   FlightSize bytes would have been received out-of-order before the
773	   reordered segment is received.  The delay caused by the reordering is
774	   thus longer than the RTT of the TCP connection.  Since the RTT is
775	   roughly the time a Fast Retransmit can safely be delayed before the
776	   retransmission has to be sent to avoid an RTO, a maximum 'ReorExtR' of
777	   one seems to be a suitable value.

779	   The expiration of the retransmission timer is interpreted by TCP-aNCR
780	   as an indication of a change in path characteristics, hence, the
781	   saved 'ReorExtR' is assumed to be outdated and will be invalidated
782	   (step (RTO)).  As a consequence, the relative reordering extent
783	   'ReorExtR' increases monotonically between two successive
784	   retransmission timeouts and corresponds to the maximum measured
785	   reordering extent since the last RTO.  Other approaches would be an
786	   exponentially-weighted moving average (EWMA) or a histogram of the
787	   last n reordering extents.  The main drawback of an EWMA is, however,
788	   that on average half of the detected reordering events would be
789	   larger than the saved reordering extent.  Thus, only half of the
790	   spurious retransmits could be avoided.  Applying a histogram could
791	   largely avoid the disadvantages of an EWMA; however, it would result
792	   in an unacceptable increase in memory usage.
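
   The difference between a running maximum and an EWMA can be made
   tangible with a small experiment.  The sketch below counts how
   often a newly measured extent exceeds the estimate held just
   before it arrives, which corresponds to a reordering event the
   saved value would not have covered.  The sample values and the
   EWMA gain of 1/8 are arbitrary assumptions chosen for
   illustration; the sketch does not prove the "half" figure.

```python
def count_underestimates(extents, estimator):
    """Count samples larger than the estimate held before they arrive."""
    estimate = 0.0
    misses = 0
    for sample in extents:
        if sample > estimate:
            misses += 1
        estimate = estimator(estimate, sample)
    return misses


def running_max(old, new):
    """Estimator in the style of step (Ext), without the cap at 1."""
    return max(old, new)


def ewma(old, new, gain=0.125):
    """Exponentially-weighted moving average; the gain is an assumption."""
    return (1 - gain) * old + gain * new


# Hypothetical relative reordering extents with a stable pattern.
extents = [0.30, 0.10, 0.35, 0.15, 0.30, 0.10, 0.35, 0.15] * 4

max_misses = count_underestimates(extents, running_max)
ewma_misses = count_underestimates(extents, ewma)
```

   For this input the running maximum is exceeded only while it
   converges (twice), whereas the EWMA keeps being exceeded by the
   larger samples throughout the sequence.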

794	   In combination with the invalidation after an RTO, the advantage of
795	   using a maximum is its low complexity as well as its fast convergence
796	   to the actual maximum reordering on the network path.  As a result,
797	   the negative impact that packet reordering has on TCP's congestion
798	   control and loss recovery can be avoided.  A disadvantage of using a
799	   maximum is that if the delay caused by the reordering decreases over
800	   the lifetime of the TCP connection, a Fast Retransmit is delayed
801	   unnecessarily long.  Nevertheless, since the negative impact
802	   reordering has on TCP's congestion control and loss recovery is more
803	   substantial than the disadvantage of a longer delay, a decrease of
804	   the ReorExtR between RTOs is considered inappropriate.

806	7.3.  Reordering during Slow Start

808	   The arrival of an acceptable ACK during Extended Limited Transmit
809	   signals that previously received duplicate ACKs are the result of
810	   packet reordering and not congestion, so that Extended Limited
811	   Transmit is completed accordingly.  Upon the termination of Extended
812	   Limited Transmit, and especially when using the Careful variant, TCP-
813	   NCR (as well as TCP-aNCR) may be in a situation where the entire cwnd
814	   is not being utilized.  Therefore, to mitigate a potential burst of
815	   segments, in step (T.2) TCP-NCR sets the slow start threshold to the
816	   FlightSize that was saved at the beginning of Extended Limited
817	   Transmit [RFC4653].  This step should ensure that TCP-NCR slow starts
818	   back to the operating point in use before Extended Limited Transmit.

820	   Unfortunately, the assignment in step (T.2) is only correct if the
821	   TCP sender already was in congestion avoidance at the time Extended
822	   Limited Transmit was entered.  Otherwise, if the TCP sender was
823	   instead in slow start, the value of ssthresh is greater than the
824	   saved FlightSize so that slow start prematurely concludes.  This
825	   behavior can leave much of the network resources idle, and a long
826	   time may be needed in order to use the full capacity.  To mitigate this
827	   issue, TCP-aNCR sets the slow start threshold to the maximum of its
828	   current value and the current cwnd (step (T.3)).  This continues slow
829	   start after a reordering event happening during slow start.
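
   The difference between the two assignments can be seen in a small
   example.  The function names and the numbers are hypothetical;
   the three arguments are the values in effect when Extended
   Limited Transmit terminates, all in bytes.

```python
def ssthresh_tcp_ncr(ssthresh, cwnd, flight_size_prev):
    """TCP-NCR behavior as described above: use the saved FlightSize."""
    return flight_size_prev


def ssthresh_tcp_ancr(ssthresh, cwnd, flight_size_prev):
    """TCP-aNCR step (T.3): never lower ssthresh on reordering."""
    return max(cwnd, ssthresh)


SMSS = 1460
# Hypothetical sender still in slow start: cwnd is far below ssthresh.
old_ssthresh = 64 * SMSS
cwnd = 16 * SMSS
flight_size_prev = 16 * SMSS

ncr = ssthresh_tcp_ncr(old_ssthresh, cwnd, flight_size_prev)
ancr = ssthresh_tcp_ancr(old_ssthresh, cwnd, flight_size_prev)
```

   Here TCP-NCR lowers ssthresh from 64 to 16 segments, which ends
   slow start at the current operating point, while TCP-aNCR keeps
   the threshold of 64 segments and slow start continues.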

831	7.4.  Preventing Bursts

833	   In cases where a new single SACK covers more than one segment - this
834	   can happen either due to packet loss or packet reordering on the ACK
835	   path - TCP-NCR [RFC4653] sends an undesirable burst of data.  TCP-
836	   aNCR solves this problem by limiting the burst size - the maximum
837	   amount of data that can be sent in response to a single SACK - to
838	   the Initial Window [RFC5681] while executing Extended Limited
839	   Transmit (steps (E.2), (E.4), and (E.6)).  Since IW represents the
840	   amount of data that a TCP sender is able to send into the network
841	   safely without knowing its characteristics, it is a reasonable value
842	   for the burst size, too.  If more than IW bytes were SACKed by a
843	   single ACK, the additional amount of data becomes available again
844	   with the next received duplicate ACK.  Thus, the transmission of new
845	   segments is spread over the next received ACKs, so that micro
846	   bursts - a characteristic of packet reordering in the reverse path -
847	   are largely compensated.
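
   The per-ACK burst limit can be sketched as follows.  This is a
   simplified illustration under stated assumptions: the send
   condition is reduced to the check "cwnd - pipe >= SMSS" in the
   style of [RFC6675], the 'skipped' variable of the Careful variant
   is left out, and 'burst' is counted in segments as IW divided by
   SMSS.  All names are hypothetical.

```python
def segments_to_send(cwnd, pipe, smss, initial_window,
                     unsent_bytes, rwnd_free):
    """Number of SMSS-sized segments sent for one duplicate ACK."""
    burst = initial_window // smss  # per-ACK budget, reset on each ACK
    sent = 0
    while (burst > 0
           and cwnd - pipe >= smss      # congestion window has room
           and unsent_bytes >= smss     # application data is available
           and rwnd_free >= smss):      # receiver window has room
        pipe += smss
        unsent_bytes -= smss
        rwnd_free -= smss
        burst -= 1
        sent += 1
    return sent
```

   If a single ACK frees room for 50 segments and IW corresponds to
   10 segments, only 10 segments are sent in response to that ACK;
   the remaining room is used when the following duplicate ACKs
   arrive.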

849	   Another situation that causes undesired bursts of segments with TCP-
850	   NCR is the receipt of an acceptable ACK during Careful Extended
851	   Limited Transmit.  If multiple segments from a single window of data
852	   are delayed by packet reordering, typically the first acceptable ACK
853	   after entering the 'disorder' state acknowledges data not only
854	   cumulatively but also selectively.  Hence, Extended Limited Transmit
855	   is not terminated but re-started.  If the segments are delayed by the
856	   reordering for almost one RTT, then the amount of outstanding data in
857	   the network ('pipe') is approximately half the amount of data saved
858	   at the beginning of Extended Limited Transmit (FlightSizePrev).  If
859	   the sequence numbers of the delayed segments are close to each other
860	   in the sequence number space, the acceptable ACK acknowledges only a
861	   small amount of data, so that FlightSize is still large.  As a
862	   result, TCP-NCR sets the cwnd to FlightSizePrev in step (T.1).  Since
863	   'pipe' is only half of FlightSizePrev due to Careful Extended Limited
864	   Transmit, TCP-NCR sends a burst of almost half a cwnd worth of data
865	   in the subsequent step (T.3).

867	   Note: Even in the case the sequence numbers of the delayed segments
868	   are not close to each other in the sequence number space and cwnd is
869	   set in step (T.1) to FlightSize + SMSS, a burst of data will emerge
870	   due to re-entering Extended Limited Transmit, because TCP-NCR sets
871	   'skipped' to zero in step (I.2) and uses FlightSizePrev in step
872	   (E.2).

874	   TCP-aNCR prevents such a burst by making a clear differentiation
875	   between terminating and restarting Extended Limited Transmit (step
876	   (T.1)).  Only the first case causes the congestion window to be set
877	   to the current FlightSize plus one segment.  In the latter case,
878	   when re-entering Extended Limited Transmit, the congestion window is
879	   not adjusted and the original step (T.1) of the TCP-NCR
880	   specification is omitted.  The transmission of
881	   new data is then only performed after re-entering Extended Limited
882	   Transmit in step (E.2) of the TCP-aNCR specification, where the
883	   actual burst mitigation takes place.

885	7.5.  Persistent receiving of Selective Acknowledgments

887	   In some inconvenient cases it could happen that a TCP sender
888	   persistently receives SACK information due to reordering on the
889	   network path, e.g., if the segments are delayed often and/or for long
890	   by the packet reordering.  With TCP-NCR, the persistent reception of
891	   SACKs causes Extended Limited Transmit to be entered with the first
892	   received duplicate ACK but never to be terminated if no packet loss
893	   occurs - for every received ACK, TCP-NCR either follows steps (E.1)
894	   to (E.6) or steps (T.1) to (T.4).  In particular, TCP-NCR a) executes
895	   step (T.4) for every acceptable ACK and b) never executes step (I.1)
896	   again.  Hence, the amount of outstanding data saved at the beginning
897	   of Extended Limited Transmit, FlightSizePrev, is never updated.

899	   An emerging problem in this context is that during Extended Limited
900	   Transmit TCP-NCR determines the transmission of new segments in step
901	   (E.2) solely on the basis of FlightSizePrev, so that an interim
902	   increase of the cwnd is not considered (according to [RFC5681], the
903	   congestion window is increased for every received acceptable ACK that
904	   advances the cumulative ACK point, no matter if it carries SACK
905	   information or not).  As a result, TCP-NCR can only very slowly
906	   determine the available capacity of the communication path.

908	   TCP-aNCR addresses this problem by limiting the amount of data that
909	   is allowed to be sent into the network during Extended Limited
910	   Transmit not on the basis of FlightSizePrev, but on the size of the
911	   congestion window.  The equation in step (E.3) of the TCP-aNCR
912	   specification is therefore equal to the one used in [RFC6675] (except
913	   for the 'skipped' variable).  If an acceptable ACK is received during
914	   the execution of Extended Limited Transmit, re-entering Extended
915	   Limited Transmit makes any increase in cwnd immediately available.
916	   Hence, even in the case when persistently receiving SACKs, the
917	   available capacity of the communication path can be determined
918	   quickly.

920	   Another problem resulting from persistently receiving SACKs, and
921	   which is related to the increase in cwnd in response to received
922	   acceptable ACKs, is the reduction of cwnd due to a packet loss.  When
923	   a packet is considered lost, the congestion control adjustment is
924	   done with respect to the amount of outstanding data at the beginning
925	   of Extended Limited Transmit, FlightSizePrev (step (Ret)).  As in the
926	   previous case, an increase in cwnd is again not taken into account.
927	   A simple solution to the problem would be to perform the window
928	   reduction not on the basis of FlightSizePrev but analogous to step
929	   (E.2) based on the current size of cwnd.

931	   A problem with this solution is that cwnd can potentially be
932	   increased, although the TCP connection is limited by the application
933	   and not by cwnd.  Although [RFC2861] specifies that an increase of
934	   cwnd is only applicable if cwnd is fully utilized, this behavior is
935	   not specified by any standards track document.  But even this
936	   conservative increase behavior is not guaranteed to be conservative
937	   enough.  If, from a single window of data, some segments are delayed
938	   and others are lost, cwnd would first be increased in response to
939	   each received acceptable ACK and subsequently be reduced due to the
940	   lost segments, which would no longer result in a halving of the cwnd.

942	   The solution proposed by TCP-aNCR reuses the state variable 'recover'
943	   from [RFC6582] and adapts the approach taken by NewReno TCP and SACK
944	   TCP to detect, with help of the state variable, the end of one loss
945	   recovery phase properly, which allows multiple losses from a single
946	   window of data to be recovered efficiently.  Therefore, upon entering
947	   the 'disorder' state and starting Extended Limited Transmit, TCP-aNCR
948	   saves the highest sequence number sent so far in 'recover'.  If a
949	   received acceptable ACK covers more than 'recover', one cwnd's worth
950	   of data has been transmitted during Extended Limited Transmit without
951	   any packet loss.  Hence, FlightSizePrev can be updated by 'pipe_max',
952	   which reflects the maximum amount of data that is considered to have
953	   been in the network during the last RTT.  This update takes an
954	   interim increase in cwnd into account, so that in case of packet
955	   loss, the reduction in cwnd can be based on the current value of
956	   FlightSizePrev.

958	8.  Interoperability Issues

960	   TCP-aNCR requires that both the TCP Selective Acknowledgment Option
961	   [RFC2018] and a SACK-based loss recovery scheme compatible with the
962	   one given in [RFC6675] are used by the TCP sender.  Hence,
963	   compatibility with both specifications is REQUIRED.

965	8.1.  Early Retransmit

967	   The specification of TCP-aNCR in this document and the Early
968	   Retransmit algorithm specified in [RFC5827] define orthogonal methods
969	   to modify DupThresh.  Early Retransmit allows the TCP sender to
970	   reduce the number of duplicate ACKs required to trigger a Fast
971	   Retransmit below the standard DupThresh of three, if FlightSize is
972	   less than 4*SMSS and no new segment can be sent.  In contrast, TCP-
973	   aNCR allows, starting from the minimum of three duplicate ACKs, to
974	   increase the DupThresh beyond the standard of three duplicate ACKs to
975	   make TCP more robust to packet reordering, if the amount of
976	   outstanding data is sufficient to reach the increased DupThresh to
977	   trigger Fast Retransmit and Fast Recovery.

979	8.2.  Congestion Window Validation

981	   The increase of the congestion window during application-limited
982	   periods can lead to an invalidation of the congestion window, in that
983	   it no longer reflects current information about the state of the
984	   network, if the congestion window might never have been fully
985	   utilized during the last RTT.  According to [RFC2861], the congestion
986	   window should, first, only be increased during slow-start or
987	   congestion avoidance if the cwnd has been fully utilized by the TCP
988	   sender and, second, gradually be reduced during each RTT in which the
989	   cwnd was not fully used.

991	   A problem that arises in this context is that during Careful Extended
992	   Limited Transmit, cwnd is not fully utilized due to the variable
993	   'skipped' (see step (E.3)), so that - strictly following [RFC2861] -
994	   the congestion window should not be increased upon the receipt of an
995	   acceptable ACK.  A trivial solution of this problem is to include the
996	   variable 'skipped' in the calculation of [RFC2861] to determine
997	   whether the congestion window is fully utilized or not.

999	8.3.  Reactive Response to Packet Reordering

1001	   As a proactive scheme with the aim to a priori prevent the negative
1002	   impact that packet reordering has on TCP, TCP-aNCR can conceptually
1003	   be combined with any reactive response to packet reordering, which
1004	   attempts to mitigate the negative effects of reordering a posteriori.
1005	   This is because the modifications of TCP-aNCR to the standard TCP
1006	   congestion control and loss recovery [RFC6675] are implemented in the
1007	   'disorder' state and are performed by the TCP sender before it enters
1008	   loss recovery, while reactive responses to packet reordering operate
1009	   generally after entering loss recovery, by undoing the unnecessary
1010	   changes to the congestion control state.

1012	   If unnecessary changes to the congestion control state are undone
1013	   after loss recovery, which is typically the case if a spurious Fast
1014	   Retransmit is detected based on the DSACK option [RFC3708][RFC4015],
1015	   since the first ACK carrying a DSACK option usually arrives at a TCP
1016	   sender only after loss recovery has already terminated, it might
1017	   happen that the restoring of the original value of the congestion
1018	   window is done at a time at which the TCP sender is already back in
1019	   the 'disorder' state and executing Extended Limited
1020	   Transmit.  While this is basically compatible with the TCP-aNCR
1021	   specification - the undo simply represents an increase of the
1022	   congestion window - some care must be taken that the
1023	   combination of the algorithms does not lead to unwanted behavior.

1025	8.4.  Buffer Auto-Tuning

1027	   Although all modifications of the TCP-aNCR algorithm are implemented
1028	   in the TCP sender, the receiver also potentially has a part to play.
1029	   If some segments from a single window of data are delayed by the
1030	   packet reordering in the network, all segments that are received
1031	   out of order have to be queued in the receive buffer until the holes
1032	   in sequence number space have been closed and the data can be
1033	   delivered to the receiving application.  In the worst case, which
1034	   occurs if the TCP sender uses Aggressive Limited Transmit and the
1035	   reordering delay is close to the RTT, TCP-aNCR increases the
1036	   receiver's buffering requirement by up to an extra cwnd.  Therefore,
1037	   to maximize the benefits from TCP-aNCR, receivers should advertise a
1038	   large window - ideally by using buffer auto-tuning algorithms - to
1039	   absorb the extra out-of-order data.  In the case that the additional
1040	   buffer requirements are not met, the use of the above algorithm takes
1041	   into account the reduced advertised window - with a corresponding
1042	   loss in robustness to packet reordering.

1044	9.  Related Work

1046	   Over the past few years, several solutions have been proposed to
1047	   improve the performance of TCP in the face of packet reordering.
1048	   These schemes generally fall into one of two categories (with some
1049	   overlap): mechanisms that try to prevent spurious retransmits from
1050	   happening (proactive schemes) and mechanisms that try to detect
1051	   spurious retransmits and undo the needless congestion control state
1052	   changes that have been taken (reactive schemes).

1054	   [I-D.blanton-tcp-reordering], [Zha+03] and [LM05] attempt to prevent
1055	   packet reordering from triggering spurious retransmits by using
1056	   various algorithms to approximate the DupThresh required to
1057	   disambiguate loss and reordering over a given network path at a given
1058	   time.  This basic principle is also used in TCP-aNCR.  While
1059	   [I-D.blanton-tcp-reordering] describes four basic approaches on how
1060	   to increase the DupThresh and discusses pros and cons of these
1061	   approaches, [Zha+03] presents a relatively complex algorithm that
1062	   saves the reordering extents in a histogram and calculates the
1063	   DupThresh in a way that a certain percentage of samples is smaller
1064	   than the DupThresh.  [LM05] uses an EWMA for the same purpose.  By
1065	   design, neither algorithm prevents all spurious retransmissions.

1067	   In contrast to the above mentioned algorithms, Linux [Linux]
1068	   implements a proactive scheme by setting the DupThresh to the highest
1069	   detected reordering and resetting it only upon an RTO.  To avoid a
1070	   costly retransmission timeout due to the increased DupThresh, Linux
1071	   first implements an extension of the Limited Transmit algorithm,
1072	   second limits the DupThresh to an upper bound of 127 duplicate ACKs,
1073	   and third prematurely enters loss recovery if too few segments are
1074	   in-flight to reach the DupThresh and no additional segments can be
1075	   sent.  Especially the last change is commendable since, besides
1076	   TCP-NCR, none of the described algorithms in this section mentions a
1077	   similar concern.

1079	   [Boh+06] and [Bha+04] present proactive schemes based on timers, in
1080	   which the DupThresh is ignored altogether.  Once the timer has
1081	   expired, TCP initiates loss recovery.  In [Bha+04] this timer has
1082	   a length of one RTT and is started when the first duplicate ACK is
1083	   received, whereas the approach taken in [Boh+06] relies solely on
1084	   timers to detect packet loss without taking into account any other
1085	   congestion signals such as duplicate ACKs.  It assigns each segment
1086	   sent a timestamp and retransmits the segment if the corresponding
1087	   timer fires.

1089	   TCP-NCR [RFC4653] tries to prevent spurious retransmits similar to
1090	   [I-D.blanton-tcp-reordering] or [Zha+03] as it delays a
1091	   retransmission to disambiguate loss and reordering.  However, TCP-NCR
1092	   takes a simplified approach by simply delaying a retransmission by an
1093	   amount based on the current cwnd (in comparison to standard TCP),
1094	   while the other schemes use relatively complex algorithms in an
1095	   attempt to derive a more precise value for DupThresh that depends on
1096	   the current patterns of packet reordering.  Many of the features
1097	   offered by TCP-NCR have been taken into account while designing TCP-
1098	   aNCR.

1100	   Besides the proactive schemes, several other schemes have been
1101	   developed to detect and mitigate needless retransmissions after the
1102	   fact.  The Eifel detection algorithm [RFC3522], the detection based
1103	   on DSACKs [RFC3708], and the F-RTO scheme [RFC5682] are approaches
1104	   to detecting spurious retransmissions, while the Eifel response
1105	   algorithm [RFC4015], [I-D.blanton-tcp-reordering], and Linux [Linux]
1106	   present or implement algorithms to mitigate the changes
1107	   these events made to the congestion control state.  As discussed in
1108	   Section 8.3, TCP-aNCR could be used in conjunction with these
1109	   algorithms, with TCP-aNCR attempting to prevent spurious retransmits
1110	   and some other scheme kicking in if the prevention failed.

1112	10.  IANA Considerations

1114	   This memo includes no request to IANA.

1116	11.  Security Considerations

1118	   With regard to TCP's congestion control, an attacker or misbehaving
1119	   TCP receiver has two options to bias a TCP-aNCR sender: it can take
1120	   deliberate actions so that the packet reordering perceived in the
1121	   network, as derived from the relative and absolute reordering, is
1122	   either underestimated or overestimated.  An underestimation of the
1123	   packet reordering present in the network occurs if, for example, a
1124	   misbehaving TCP receiver acknowledges segments while they are
1125	   actually still in flight, causing holes in the sequence number
1126	   space of the SACK scoreboard to be closed prematurely.  For TCP-aNCR,
1127	   the result of underestimated packet reordering is a DupThresh that
1128	   is too small, which leads to premature loss recovery.  In the
1129	   context of TCP's congestion control, the effects of such attacks
1130	   are limited since the lower bound of TCP-aNCR's DupThresh is the
1131	   default value of three duplicate ACKs [RFC5681], so that in the
1132	   worst case TCP-aNCR behaves the same as TCP SACK [RFC6675].

1134	   In contrast to an underestimation, an overestimation of the packet
1135	   reordering in the network occurs if, for example, a misbehaving TCP
1136	   receiver keeps sending SACKs for subsequent segments before it
1137	   sends an acceptable ACK for a delayed segment that it has actually
1138	   already received, so that the hole in the sequence number space of
1139	   the SACK scoreboard is closed later.  For TCP-aNCR, the result of
1140	   such an overestimation is a DupThresh that is too large, so that in
1141	   the case of a packet loss TCP's loss recovery is executed later than
1142	   necessary.  As in the previous case, the effects of a delayed
1143	   entry into loss recovery are limited.  On the one hand, TCP-NCR's
1144	   DupThresh is used as an upper bound for TCP-aNCR's variable
1145	   DupThresh, so that the entry into loss recovery and the
1146	   adaptation of the congestion window are delayed by at most one RTT.
1147	   On the other hand, such a limited delay of the congestion control
1148	   adjustment has, even in the worst case, only a limited impact on the
1149	   performance of a TCP connection and has generally been regarded as
1150	   safe for use on the Internet [Ban+01].
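
   The two bounds that limit both attacks can be summarized in a short
   sketch.  The function and parameter names are hypothetical, and the
   TCP-NCR DupThresh is passed in as a value rather than computed, since
   its calculation is specified in [RFC4653] and not repeated here.

```python
# Hypothetical helper for illustration; names are not taken from the draft.

DEFAULT_DUPTHRESH = 3  # three duplicate ACKs, the lower bound [RFC5681]

def bounded_dupthresh(estimate, ncr_dupthresh):
    """Clamp a reordering-based DupThresh estimate between the default
    value and the DupThresh that TCP-NCR would use."""
    # Guard: the upper bound is never allowed to fall below the lower bound.
    upper = max(ncr_dupthresh, DEFAULT_DUPTHRESH)
    return min(max(estimate, DEFAULT_DUPTHRESH), upper)

# An underestimating receiver cannot push the threshold below three,
# and an overestimating receiver cannot push it above TCP-NCR's value.
print(bounded_dupthresh(0, 20))
print(bounded_dupthresh(1000, 20))
```

   Whatever estimate a misbehaving receiver provokes, the resulting
   threshold in this sketch stays within the range spanned by TCP SACK
   and TCP-NCR behavior.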

1152	12.  Acknowledgments

1154	   The authors would like to thank Daniel Slot for his TCP-NCR
1155	   implementation in Linux.  We also thank the flowgrind [Flowgrind]
1156	   authors and contributors for their performance measurement tool,
1157	   which gives us a powerful means to analyze TCP's congestion control
1158	   and loss recovery behavior in detail.

1160	13.  References

1162	13.1.  Normative References

1164	   [I-D.zimmermann-tcpm-reordering-detection]
1165	              Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann,
1166	              "Detection and Quantification of Packet Reordering with
1167	              TCP", draft-zimmermann-tcpm-reordering-detection-01 (work
1168	              in progress), November 2013.

1170	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1171	              RFC 793, September 1981.

1173	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
1174	              Selective Acknowledgment Options", RFC 2018, October 1996.

1176	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1177	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1179	   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
1180	              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
1181	              January 2001.

1183	   [RFC4653]  Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton,
1184	              "Improving the Robustness of TCP to Non-Congestion
1185	              Events", RFC 4653, August 2006.

1187	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1188	              Control", RFC 5681, September 2009.

1190	   [RFC6582]  Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The
1191	              NewReno Modification to TCP's Fast Recovery Algorithm",
1192	              RFC 6582, April 2012.

1194	   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
1195	              and Y. Nishida, "A Conservative Loss Recovery Algorithm
1196	              Based on Selective Acknowledgment (SACK) for TCP",
1197	              RFC 6675, August 2012.

1199	   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
1200	              "Increasing TCP's Initial Window", RFC 6928, April 2013.

1202	13.2.  Informative References

1204	   [Ban+01]   Bansal, D., Balakrishnan, H., Floyd, S., and S. Shenker,
1205	              "Dynamic Behavior of Slowly Responsive Congestion Control
1206	              Algorithms", Proceedings of the Conference on
1207	              Applications, Technologies, Architectures, and Protocols
1208	              for Computer Communication (SIGCOMM'01) pp. 263-274,
1209	              September 2001.

1211	   [Bha+04]   Bhandarkar, S., Sadry, N., Reddy, A., and N. Vaidya, "TCP-
1212	              DCR: A Novel Protocol for Tolerating Wireless Channel
1213	              Errors", IEEE Transactions on Mobile Computing vol. 4, no.
1214	              5, pp. 517-529, September 2005.

1216	   [Boh+06]   Bohacek, S., Hespanha, J., Lee, J., Lim, C., and K.
1217	              Obraczka, "A New TCP for Persistent Packet Reordering",
1218	              IEEE/ACM Transactions on Networking vol. 14, no. 2, pp.
1219	              369-382, April 2006.

1221	   [Flowgrind]
1222	              "Flowgrind Home Page", <http://www.flowgrind.net>.

1224	   [I-D.blanton-tcp-reordering]
1225	              Blanton, E., Dimond, R., and M. Allman, "Practices for TCP
1226	              Senders in the Face of Segment Reordering",
1227	              draft-blanton-tcp-reordering-00 (work in progress),
1228	              February 2003.

1230	   [LM05]     Leung, C. and C. Ma, "Enhancing TCP Performance to
1231	              Persistent Packet Reordering", KICS Journal of
1232	              Communications and Networks vol. 7, no. 3, pp. 385-393,
1233	              September 2005.

1235	   [Linux]    "The Linux Project", <http://www.kernel.org>.

1237	   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
1238	              RFC 896, January 1984.

1240	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1241	              Communication Layers", STD 3, RFC 1122, October 1989.

1243	   [RFC2861]  Handley, M., Padhye, J., and S. Floyd, "TCP Congestion
1244	              Window Validation", RFC 2861, June 2000.

1246	   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
1247	              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
1248	              Zhang, L., and V. Paxson, "Stream Control Transmission
1249	              Protocol", RFC 2960, October 2000.

1251	   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
1252	              for TCP", RFC 3522, April 2003.

1254	   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
1255	              Acknowledgement (DSACKs) and Stream Control Transmission
1256	              Protocol (SCTP) Duplicate Transmission Sequence Numbers
1257	              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
1258	              February 2004.

1260	   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
1261	              for TCP", RFC 4015, February 2005.

1263	   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
1264	              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
1265	              Spurious Retransmission Timeouts with TCP", RFC 5682,
1266	              September 2009.

1268	   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
1269	              P. Hurtig, "Early Retransmit for TCP and Stream Control
1270	              Transmission Protocol (SCTP)", RFC 5827, May 2010.

1272	   [Zha+03]   Zhang, M., Karp, B., Floyd, S., and L. Peterson, "RR-TCP:
1273	              A Reordering-Robust TCP with DSACK", Proceedings of the
1274	              11th IEEE International Conference on Network Protocols
1275	              (ICNP'03) pp. 95-106, November 2003.

1277	Appendix A.  Changes from previous versions of the draft

1279	   This appendix should be removed by the RFC Editor before publishing
1280	   this document as an RFC.

1282	A.1.  Changes from draft-zimmermann-tcpm-reordering-reaction-00

1284	   o  Improved the wording throughout the document.

1286	   o  Replaced and updated some references.

1288	Authors' Addresses

1290	   Alexander Zimmermann
1291	   NetApp, Inc.
1292	   Sonnenallee 1
1293	   Kirchheim  85551
1294	   Germany

1296	   Phone: +49 89 900594712
1297	   Email: alexander.zimmermann@netapp.com

1299	   Lennart Schulte
1300	   Aalto University
1301	   Otakaari 5 A
1302	   Espoo  02150
1303	   Finland

1305	   Phone: +358 50 4355233
1306	   Email: lennart.schulte@aalto.fi

1308	   Carsten Wolff
1309	   credativ GmbH
1310	   Hohenzollernstrasse 133
1311	   Moenchengladbach  41061
1312	   Germany

1314	   Phone: +49 2161 4643 182
1315	   Email: carsten.wolff@credativ.de

1317	   Arnd Hannemann
1318	   credativ GmbH
1319	   Hohenzollernstrasse 133
1320	   Moenchengladbach  41061
1321	   Germany

1323	   Phone: +49 2161 4643 134
1324	   Email: arnd.hannemann@credativ.de