idnits 2.17.1 

draft-ietf-tcpm-3517bis-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 4 instances of too long lines in the document, the longest one
     being 6 characters in excess of 72.

  ** There is 1 instance of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (January 26, 2012) is 4468 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'A' is mentioned on line 151, but not defined

  == Missing Reference: 'B' is mentioned on line 151, but not defined

  == Unused Reference: 'RFC2026' is defined on line 596, but no explicit
     reference was found in the text

  == Unused Reference: 'Jac90' is defined on line 628, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3042' is defined on line 644, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Duplicate reference: RFC2018, mentioned in 'Errata1610', was also
     mentioned in 'RFC2018'.

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)


     Summary: 4 errors (**), 0 flaws (~~), 7 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                               E. Blanton
2	INTERNET-DRAFT                                         Purdue University
3	draft-ietf-tcpm-3517bis-01.txt                                 M. Allman
4	                                                                    ICSI
5	                                                                 L. Wang
6	                                                        Juniper Networks
7	                                                             I. Jarvinen
8	                                                                 M. Kojo
9	                                                  University of Helsinki
10	                                                              Y. Nishida
11	                                                            WIDE Project
12	                                                        January 26, 2012

14	         A Conservative Selective Acknowledgment (SACK)-based
15	                    Loss Recovery Algorithm for TCP

17	Status of this Memo

19	    This Internet-Draft is submitted to IETF in full conformance with
20	    the provisions of BCP 78 and BCP 79.

22	    Internet-Drafts are working documents of the Internet Engineering
23	    Task Force (IETF), its areas, and its working groups.  Note that
24	    other groups may also distribute working documents as Internet-
25	    Drafts.

27	    Internet-Drafts are draft documents valid for a maximum of six
28	    months and may be updated, replaced, or obsoleted by other documents
29	    at any time.  It is inappropriate to use Internet-Drafts as
30	    reference material or to cite them other than as "work in progress."

32	    The list of current Internet-Drafts can be accessed at
33	    http://www.ietf.org/ietf/1id-abstracts.txt.

35	    The list of Internet-Draft Shadow Directories can be accessed at
36	    http://www.ietf.org/shadow.html.

38	    This Internet-Draft will expire on May 22, 2012.

40	Copyright Notice

42	    Copyright (c) 2012 IETF Trust and the persons identified as the
43	    document authors.  All rights reserved.

45	    This document is subject to BCP 78 and the IETF Trust's Legal
46	    Provisions Relating to IETF Documents
47	    (http://trustee.ietf.org/license-info) in effect on the date of
48	    publication of this document.  Please review these documents
49	    carefully, as they describe your rights and restrictions with
50	    respect to this document.  Code Components extracted from this
51	    document must include Simplified BSD License text as described in
52	    Section 4.e of the Trust Legal Provisions and are provided without
53	    warranty as described in the Simplified BSD License.

55	Abstract

57	   This document presents a conservative loss recovery algorithm for TCP
58	   that is based on the use of the selective acknowledgment (SACK) TCP
59	   option.  The algorithm presented in this document conforms to the
60	   spirit of the current congestion control specification (RFC 5681),
61	   but allows TCP senders to recover more effectively when multiple
62	   segments are lost from a single flight of data.

64	1   Introduction

66	   This document presents a conservative loss recovery algorithm for TCP
67	   that is based on the use of the selective acknowledgment (SACK) TCP
68	   option.  While the TCP SACK [RFC2018] is being steadily deployed in
69	   the Internet [All00], there is evidence that hosts are not using the
70	   SACK information when making retransmission and congestion control
71	   decisions [PF01].  The goal of this document is to outline one
72	   straightforward method for TCP implementations to use SACK
73	   information to increase performance.

75	   [RFC5681] allows advanced loss recovery algorithms to be used by TCP
76	   [RFC793] provided that they follow the spirit of TCP's congestion
77	   control algorithms [RFC5681, RFC2914].  [RFC3782] outlines one such
78	   advanced recovery algorithm called NewReno.  This document outlines a
79	   loss recovery algorithm that uses the SACK [RFC2018] TCP option to
80	   enhance TCP's loss recovery.  The algorithm outlined in this
81	   document, heavily based on the algorithm detailed in [FF96], is a
82	   conservative replacement of the fast recovery algorithm [Jac90,
83	   RFC5681].  The algorithm specified in this document is a
84	   straightforward SACK-based loss recovery strategy that follows the
85	   guidelines set in [RFC5681] and can safely be used in TCP
86	   implementations.  Alternate SACK-based loss recovery methods can be
87	   used in TCP as implementers see fit (as long as the alternate
88	   algorithms follow the guidelines provided in [RFC5681]).  Please
89	   note, however, that the SACK-based decisions in this document (such
90	   as what segments are to be sent at what time) are largely decoupled
91	   from the congestion control algorithms, and as such can be treated as
92	   separate issues if so desired.

94	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
95	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
96	   document are to be interpreted as described in BCP 14, RFC 2119
97	   [RFC2119].

99	2   Definitions

101	   The reader is expected to be familiar with the definitions given in
102	   [RFC5681].

104	   The reader is assumed to be familiar with selective acknowledgments
105	   as specified in [RFC2018].

107	   For the purposes of explaining the SACK-based loss recovery algorithm
108	   we define six variables that a TCP sender stores:

110	      "HighACK" is the sequence number of the highest byte of data that
111	      has been cumulatively ACKed at a given point.

113	      "HighData" is the highest sequence number transmitted at a given
114	      point.

116	      "HighRxt" is the highest sequence number which has been
117	      retransmitted during the current loss recovery phase.

119	      "RescueRxt" is the highest sequence number which has been
120	      retransmitted optimistically to prevent stalling of the ACK clock
121	      when there is loss at the end of the window and no new data is
122	      available for transmission.

124	      "Pipe" is a sender's estimate of the number of bytes outstanding
125	      in the network.  This is used during recovery for limiting the
126	      sender's sending rate.  The pipe variable allows TCP to use a
127	      fundamentally different congestion control than specified in
128	      [RFC5681].  The algorithm is often referred to as the "pipe
129	      algorithm".

131	      "DupAcks" is the number of duplicate acknowledgments received
132	      since the last cumulative acknowledgment.

134	   For the purposes of this specification we define a "duplicate
135	   acknowledgment" as a segment that arrives carrying a SACK block which
136	   identifies previously unacknowledged and un-SACKed octets between
137	   HighACK and HighData.  Note that an ACK which carries new
138	   SACK data is counted as a duplicate acknowledgment under this
139	   definition even if it carries new data, changes the advertised
140	   window, or moves the cumulative acknowledgment point, which is
141	   different from the definition of duplicate acknowledgment
142	   in [RFC5681].
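
   As an illustration only, the duplicate acknowledgment test above
   can be sketched as follows.  The function name, the use of
   inclusive (start, end) ranges, and the absence of sequence number
   wraparound are assumptions of this sketch, not part of the
   specification.

```python
def is_duplicate_ack(blocks, sacked, high_ack, high_data):
    """Return True if any SACK block reports octets between HighACK and
    HighData that were neither cumulatively ACKed nor SACKed before.

    blocks -- SACK blocks of the incoming ACK, inclusive (start, end)
    sacked -- previously recorded SACKed ranges, inclusive (start, end)
    Assumption: sequence numbers do not wrap in this sketch.
    """
    for start, end in blocks:
        lo = max(start, high_ack + 1)   # ignore cumulatively ACKed octets
        hi = min(end, high_data)        # ignore anything never sent
        seq = lo
        while seq <= hi:
            # End of the recorded range holding seq, if there is one.
            covering = next((e for s, e in sacked if s <= seq <= e), None)
            if covering is None:
                return True             # seq is newly SACKed
            seq = covering + 1          # skip the already-SACKed run
    return False
```

   An ACK whose SACK blocks only repeat information already recorded
   is not counted as a duplicate acknowledgment by this sketch.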

144	   We define a variable "DupThresh" that holds the number of duplicate
145	   acknowledgments required to trigger a retransmission.  Per [RFC5681]
146	   this threshold is defined to be 3 duplicate acknowledgments.
147	   However, implementers should consult any updates to [RFC5681] to
148	   determine the current value for DupThresh (or method for determining
149	   its value).

151	   Finally, a range of sequence numbers [A,B] is said to "cover"
152	   sequence number S if A <= S <= B.

154	3   Keeping Track of SACK Information

156	   For a TCP sender to implement the algorithm defined in the next
157	   section it must keep a data structure to store incoming selective
158	   acknowledgment information on a per connection basis.  Such a data
159	   structure is commonly called the "scoreboard".  The specifics of the
160	   scoreboard data structure are out of scope for this document (as long
161	   as the implementation can perform all functions required by this
162	   specification).

164	   Note that this document refers to keeping account of (marking)
165	   individual octets of data transferred across a TCP connection.  A
166	   real-world implementation of the scoreboard would likely prefer to
167	   manage this data as sequence number ranges.  The algorithms presented
168	   here allow this, but require the ability to mark arbitrary sequence
169	   number ranges as having been selectively acknowledged.
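
   One possible range-based representation is sketched below.  It is
   a minimal illustration, assuming inclusive (start, end) ranges, no
   sequence number wraparound, and hypothetical function names; real
   implementations are free to choose any structure that supports the
   functions in Section 4.

```python
def update(sacked, high_ack, blocks):
    """Return a sorted list of disjoint inclusive SACKed ranges.

    sacked   -- ranges recorded so far, inclusive (start, end)
    high_ack -- highest cumulatively ACKed sequence number
    blocks   -- SACK blocks from the incoming ACK, inclusive (start, end)
    """
    merged = []
    for start, end in sorted(list(sacked) + list(blocks)):
        # Octets at or below HighACK are cumulatively ACKed: trim them.
        start = max(start, high_ack + 1)
        if start > end:
            continue
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def sacked_octets(sacked):
    """Total number of octets currently marked as SACKed."""
    return sum(end - start + 1 for start, end in sacked)
```

   Because the ranges are merged on every update, arbitrary sequence
   number ranges can be marked without tracking segment boundaries.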

171	   Finally, note that the algorithm in this document assumes a
172	   sender that is not keeping track of segment boundaries after
173	   transmitting a segment.  It is possible that a sender that did
174	   keep this extra state may be able to use a more refined and
175	   precise algorithm than the one presented herein; however, we
176	   leave this as future work.

178	4   Processing and Acting Upon SACK Information

180	   For the purposes of the algorithm defined in this document the
181	   scoreboard SHOULD implement the following functions:

183	   Update ():

185	      Given the information provided in an ACK, each octet that is
186	      cumulatively ACKed or SACKed should be marked accordingly in the
187	      scoreboard data structure, and the total number of octets SACKed
188	      should be recorded.

190	      Note: SACK information is advisory and therefore SACKed data MUST
191	      NOT be removed from TCP's retransmission buffer until the data is
192	      cumulatively acknowledged [RFC2018].

194	   IsLost (SeqNum):

196	      This routine returns whether the given sequence number is
197	      considered to be lost.  The routine returns true when either
198	      DupThresh discontiguous SACKed sequences have arrived above
199	      'SeqNum' or more than (DupThresh - 1) * SMSS bytes with sequence
200	      numbers greater than 'SeqNum' have been SACKed.  Otherwise, the
201	      routine returns false.
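
      A minimal sketch of this test follows.  It assumes the
      scoreboard is a list of disjoint, non-adjacent inclusive
      (start, end) ranges and that sequence numbers do not wrap; the
      default SMSS value is purely illustrative.

```python
def is_lost(seq, sacked, smss=1000, dup_thresh=3):
    """Return True if 'seq' is considered lost.

    sacked -- disjoint, non-adjacent inclusive (start, end) SACKed ranges
    smss   -- sender maximum segment size in octets (illustrative default)
    """
    runs = 0      # discontiguous SACKed sequences above seq
    octets = 0    # SACKed octets with sequence numbers greater than seq
    for start, end in sacked:
        if end <= seq:
            continue
        runs += 1
        octets += end - max(start, seq + 1) + 1
    return runs >= dup_thresh or octets > (dup_thresh - 1) * smss
```

      With SMSS = 1000 and DupThresh = 3, exactly 2000 SACKed octets
      above 'seq' in one contiguous run are not enough; one more
      octet, or a third discontiguous run, is.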

203	   SetPipe ():

205	      This routine traverses the sequence space from HighACK to HighData
206	      and MUST set the "pipe" variable to an estimate of the number of
207	      octets that are currently in transit between the TCP sender and
208	      the TCP receiver.  After initializing pipe to zero the following
209	      steps are taken for each octet 'S1' in the sequence space between
210	      HighACK and HighData that has not been SACKed:

212	      (a) If IsLost (S1) returns false:

214	         Pipe is incremented by 1 octet.

216	         The effect of this condition is that pipe is incremented for
217	         packets that have not been SACKed and have not been determined
218	         to have been lost (i.e., those segments that are still assumed
219	         to be in the network).

221	      (b) If S1 <= HighRxt:

223	         Pipe is incremented by 1 octet.

225	         The effect of this condition is that pipe is incremented for
226	         the retransmission of the octet.

228	      Note that octets retransmitted without being considered lost are
229	      counted twice by the above mechanism.
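
      The octet-by-octet walk above can be sketched as follows.  This
      is an illustration under stated assumptions (inclusive ranges,
      no sequence number wraparound, hypothetical names), written for
      clarity rather than efficiency.

```python
def set_pipe(high_ack, high_data, high_rxt, sacked, smss, dup_thresh=3):
    """Estimate the number of octets in transit (the "pipe" variable).

    sacked -- disjoint, non-adjacent inclusive (start, end) SACKed ranges
    """
    def is_sacked(seq):
        return any(s <= seq <= e for s, e in sacked)

    def is_lost(seq):
        runs = octets = 0
        for start, end in sacked:
            if end > seq:
                runs += 1
                octets += end - max(start, seq + 1) + 1
        return runs >= dup_thresh or octets > (dup_thresh - 1) * smss

    pipe = 0
    for s1 in range(high_ack + 1, high_data + 1):
        if is_sacked(s1):
            continue
        if not is_lost(s1):
            pipe += 1      # (a) original transmission assumed in flight
        if s1 <= high_rxt:
            pipe += 1      # (b) retransmission assumed in flight
    return pipe
```

      An octet that has been retransmitted but is not considered lost
      contributes twice, matching the note above.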

231	   NextSeg ():

233	      This routine uses the scoreboard data structure maintained by the
234	      Update() function to determine what to transmit based on the SACK
235	      information that has arrived from the data receiver (and hence
236	      been marked in the scoreboard).  NextSeg () MUST return the
237	      sequence number range of the next segment that is to be
238	      transmitted, per the following rules:

240	      (1) If there exists a smallest unSACKed sequence number 'S2' that
241	          meets the following three criteria for determining loss, the
242	          sequence range of one segment of up to SMSS octets starting
243	          with S2 MUST be returned.

245	          (1.a) S2 is greater than HighRxt.

247	          (1.b) S2 is less than the highest octet covered by any
248	                received SACK.

250	          (1.c) IsLost (S2) returns true.

252	      (2) If no sequence number 'S2' per rule (1) exists but there
253	          exists available unsent data and the receiver's advertised
254	          window allows, the sequence range of one segment of up to SMSS
255	          octets of previously unsent data starting with sequence number
256	          HighData+1 MUST be returned.

258	      (3) If the conditions for rules (1) and (2) fail, but there exists
259	          an unSACKed sequence number 'S3' that meets the criteria for
260	          detecting loss given in steps (1.a) and (1.b) above
261	          (specifically excluding step (1.c)) then one segment of up to
262	          SMSS octets starting with S3 SHOULD be returned.

264	      (4) If the conditions for (1), (2), and (3) fail, but there
265	          exists outstanding unSACKed data, we provide the
266	          opportunity for a single "rescue" retransmission per entry
267	          into loss recovery.  If HighACK is greater than RescueRxt
268	          (or RescueRxt is undefined), then one segment of up to
269	          SMSS octets which MUST include the highest outstanding
270	          unSACKed sequence number SHOULD be returned, and RescueRxt
271	          set to RecoveryPoint.  HighRxt MUST NOT be updated.

273	          Note that rules (3) and (4) are a sort of retransmission "last
274	          resort".  They allow for retransmission of sequence numbers
275	          even when the sender has less certainty a segment has been
276	          lost than with rule (1).  Retransmitting segments via rules
277	          (3) and (4) will help sustain TCP's ACK clock and therefore
278	          can potentially help avoid retransmission timeouts.  However,
279	          in sending these segments the sender has two copies of the
280	          same data considered to be in the network (and also in the
281	          Pipe estimate, in the case of (3)).  When an ACK or SACK
282	          arrives covering this retransmitted segment, the sender cannot
283	          be sure exactly how much data left the network (one of the two
284	          transmissions of the packet or both transmissions of the
285	          packet).  Therefore the sender may underestimate Pipe by
286	          considering both segments to have left the network when it is
287	          possible that only one of the two has.

289	      (5) If the conditions for each of (1), (2), (3), and (4) are not
290	          met, then NextSeg () MUST indicate failure, and no segment is
291	          returned.
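
      Rules (1) through (5) can be sketched as below.  Several
      choices here are assumptions of the sketch rather than
      requirements: ranges are inclusive, sequence numbers do not
      wrap, a returned segment stops before already-SACKed data, rule
      (3) picks the smallest qualifying sequence number, and the
      caller is responsible for setting RescueRxt to RecoveryPoint
      after a rule (4) result.

```python
def is_sacked(seq, sacked):
    return any(s <= seq <= e for s, e in sacked)


def is_lost(seq, sacked, smss, dup_thresh=3):
    runs = octets = 0
    for start, end in sacked:
        if end > seq:
            runs += 1
            octets += end - max(start, seq + 1) + 1
    return runs >= dup_thresh or octets > (dup_thresh - 1) * smss


def segment_from(start, limit, sacked, smss):
    """Up to SMSS octets from 'start', stopping at 'limit' or SACKed data."""
    end = start
    while (end + 1 <= limit and end + 1 - start < smss
           and not is_sacked(end + 1, sacked)):
        end += 1
    return (start, end)


def next_seg(high_ack, high_data, high_rxt, rescue_rxt, sacked, smss,
             unsent=0, rwnd_allows=True, dup_thresh=3):
    """Return (rule, start, end) for the next segment, or None (rule 5).

    rescue_rxt -- None when RescueRxt is undefined
    unsent     -- octets of previously unsent data available
    """
    top_sack = max((e for _, e in sacked), default=None)
    unsacked = [s for s in range(high_ack + 1, high_data + 1)
                if not is_sacked(s, sacked)]
    # Criteria (1.a) and (1.b): above HighRxt, below the highest SACKed octet.
    cands = [s for s in unsacked
             if top_sack is not None and high_rxt < s < top_sack]
    for s2 in cands:                                   # rule (1)
        if is_lost(s2, sacked, smss, dup_thresh):
            return (1,) + segment_from(s2, high_data, sacked, smss)
    if unsent > 0 and rwnd_allows:                     # rule (2)
        return (2, high_data + 1, high_data + min(smss, unsent))
    if cands:                                          # rule (3)
        return (3,) + segment_from(cands[0], high_data, sacked, smss)
    if unsacked and (rescue_rxt is None or high_ack > rescue_rxt):
        last = unsacked[-1]                            # rule (4)
        first = last
        while (first - 1 > high_ack and last - (first - 1) < smss
               and not is_sacked(first - 1, sacked)):
            first -= 1
        return (4, first, last)
    return None                                        # rule (5)
```

      The rule number is returned alongside the range so that the
      caller can tell a rescue retransmission apart from the others.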

293	   Note: The SACK-based loss recovery algorithm outlined in this
294	   document requires more computational resources than previous TCP loss
295	   recovery strategies.  However, we believe the scoreboard data
296	   structure can be implemented in a reasonably efficient manner (both
297	   in terms of computation complexity and memory usage) in most TCP
298	   implementations.

300	5   Algorithm Details

302	   Upon the receipt of any ACK containing SACK information, the
303	   scoreboard MUST be updated via the Update () routine.

305	   If the incoming ACK is a cumulative acknowledgment, the TCP MUST
306	   reset DupAcks to zero.

308	   If the incoming ACK is a duplicate acknowledgment per the definition
309	   in Section 2 (regardless of its status as a cumulative acknowledgment),
310	   and the TCP is not currently in loss recovery, the TCP MUST increase
311	   DupAcks by one and take the following steps:

313	   (1) If DupAcks >= DupThresh, go to step (4).

315	       Note: This check covers the case when a TCP receives SACK
316	       information for multiple segments smaller than SMSS, which can
317	       potentially prevent IsLost() (next step) from declaring a segment
318	       as lost.

320	   (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns
321	       true---indicating at least three segments have arrived above
322	       the current cumulative acknowledgment point, which is taken
323	       to indicate loss---go to step (4).

325	   (3) The TCP MAY transmit previously unsent data segments as per
326	       Limited Transmit [RFC5681], except that the number of octets
327	       which may be sent is governed by Pipe and cwnd as follows:

329	       (3.1) Set HighRxt to HighACK.

331	       (3.2) Run SetPipe ().

333	       (3.3) If (cwnd - pipe) >= 1 SMSS, there exists previously
334	             unsent data, and the receiver's advertised window
335	             allows, transmit up to 1 SMSS of data starting with the
336	             octet HighData+1 and update HighData to reflect this
337	             transmission, then return to (3.2).

339	       (3.4) Terminate processing of this ACK.

341	   (4) Invoke Fast Retransmit and enter loss recovery as follows:

343	       (4.1) RecoveryPoint = HighData

345	             When the TCP sender receives a cumulative ACK for this
346	             data octet the loss recovery phase is terminated.

348	       (4.2) ssthresh = cwnd = (FlightSize / 2)

350	             The congestion window (cwnd) and slow start threshold
351	             (ssthresh) are reduced to half of FlightSize per
352	             [RFC5681].  Additionally, note that [RFC5681] requires
353	             any segments sent as part of the Limited Transmit
354	             mechanism not be counted in FlightSize for the purpose
355	             of the above equation.

357	       (4.3) Retransmit the first data segment presumed dropped -- the
358	             segment starting with sequence number HighACK + 1.  To
359	             prevent repeated retransmission of the same data or a
360	             premature rescue retransmission, set both HighRxt and
361	             RescueRxt to the highest sequence number in the
362	             retransmitted segment.

364	       (4.4) Run SetPipe ()

366	             Set a "pipe" variable to the number of outstanding
367	             octets currently "in the pipe"; this is the data which
368	             has been sent by the TCP sender but for which no
369	             cumulative or selective acknowledgment has been
370	             received and the data has not been determined to have
371	             been dropped in the network.  It is assumed that the
372	             data is still traversing the network path.

374	       (4.5) In order to take advantage of potential additional
375	             available cwnd, proceed to step (C) below.
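
   The decision in steps (1) and (2) and the state changes in steps
   (4.1) and (4.2) can be sketched as follows.  The function names
   are hypothetical, FlightSize is assumed to be supplied by the
   caller (excluding segments sent under Limited Transmit), and
   integer division is an assumption of the sketch.

```python
def on_duplicate_ack(dup_acks, dup_thresh, first_unacked_is_lost):
    """Count a duplicate ACK and decide whether to enter loss recovery.

    first_unacked_is_lost -- result of IsLost (HighACK + 1)
    Returns (new DupAcks value, True if step (4) should be taken).
    """
    dup_acks += 1
    enter = dup_acks >= dup_thresh or first_unacked_is_lost
    return dup_acks, enter


def enter_recovery(high_data, flight_size):
    """Apply steps (4.1) and (4.2)."""
    half = flight_size // 2
    return {
        "recovery_point": high_data,   # (4.1)
        "ssthresh": half,              # (4.2)
        "cwnd": half,                  # (4.2)
    }
```

   Steps (4.3) through (4.5) then retransmit the first presumed-lost
   segment, run SetPipe (), and continue at step (C).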

377	   Once a TCP is in the loss recovery phase the following procedure MUST
378	   be used for each arriving ACK:

380	   (A) An incoming cumulative ACK for a sequence number greater than
381	       RecoveryPoint signals the end of loss recovery and the loss
382	       recovery phase MUST be terminated.  Any information contained in
383	       the scoreboard for sequence numbers greater than the new value of
384	       HighACK SHOULD NOT be cleared when leaving the loss recovery
385	       phase.

387	   (B) Upon receipt of an ACK that does not cover RecoveryPoint the
388	       following actions MUST be taken:

390	       (B.1) Use Update () to record the new SACK information conveyed
391	       by the incoming ACK.

393	       (B.2) Use SetPipe () to re-calculate the number of octets still
394	       in the network.

396	   (C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more
397	       segments as follows:

399	       (C.1) The scoreboard MUST be queried via NextSeg () for the
400	       sequence number range of the next segment to transmit (if any),
401	       and the given segment sent.  If NextSeg () returns failure (no
402	       data to send) return without sending anything (i.e., terminate
403	       steps C.1 -- C.5).

405	       (C.2) If any of the data octets sent in (C.1) are below HighData,
406	       HighRxt MUST be set to the highest sequence number of the
407	       retransmitted segment unless NextSeg () rule (4) was invoked for
408	       this retransmission.

410	       (C.3) If any of the data octets sent in (C.1) are above HighData,
411	       HighData must be updated to reflect the transmission of
412	       previously unsent data.

414	       (C.4) The estimate of the amount of data outstanding in the
415	       network must be updated by incrementing pipe by the number of
416	       octets transmitted in (C.1).

418	       (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
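
   Steps (C.1) through (C.5) amount to a loop, sketched here.  The
   callbacks 'next_seg' and 'send' are hypothetical stand-ins for
   NextSeg () and the actual transmission; 'next_seg' is assumed to
   return an inclusive (start, end, is_rescue) tuple or None.  A
   segment starting at or below HighData is treated as a
   retransmission, since HighData itself has already been sent.

```python
def recovery_send_loop(cwnd, pipe, smss, high_data, high_rxt,
                       next_seg, send):
    """Send segments while cwnd - pipe >= 1 SMSS.

    Returns the updated (pipe, high_data, high_rxt).
    """
    while cwnd - pipe >= smss:                    # (C) and (C.5)
        seg = next_seg()                          # (C.1)
        if seg is None:
            break                                 # nothing to send
        start, end, is_rescue = seg
        send(start, end)
        if start <= high_data and not is_rescue:  # (C.2)
            high_rxt = end
        if end > high_data:                       # (C.3)
            high_data = end
        pipe += end - start + 1                   # (C.4)
    return pipe, high_data, high_rxt
```

   A rescue retransmission (NextSeg () rule (4)) increases pipe but
   leaves HighRxt unchanged, as required by (C.2).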

420	   Note that steps (A) and (C) can potentially send a burst of
421	   back-to-back segments into the network if the incoming cumulative
422	   acknowledgment is for more than SMSS octets of data, or if incoming
423	   SACK blocks indicate that more than SMSS octets of data have been
424	   lost in the second half of the window.

426	5.1 Retransmission Timeouts

428	   In order to avoid memory deadlocks, the TCP receiver is allowed
429	   to discard data that has already been selectively acknowledged.
430	   As a result, [RFC2018] suggests that a TCP sender SHOULD expunge
431	   the SACK information gathered from a receiver upon a
432	   retransmission timeout "since the timeout might indicate that the
433	   data receiver has reneged."  Additionally, a TCP sender MUST
434	   "ignore prior SACK information in determining which data to
435	   retransmit."  However, since the publication of [RFC2018] this
436	   has come to be viewed by some as too strong.  It has been
437	   suggested that, as long as robust tests for reneging are present,
438	   an implementation can retain and use SACK information across a
439	   timeout event [Errata1610].  While this document does not change
440	   the specification in [RFC2018], we note that implementers should
441	   consult any updates to [RFC2018] on this subject.  Further, a
442	   SACK TCP sender SHOULD utilize all SACK information made
443	   available during the loss recovery following an RTO.

445	   If an RTO occurs during loss recovery as specified in this document,
446	   RecoveryPoint MUST be set to HighData.  Further, the new value of
447	   RecoveryPoint MUST be preserved and the loss recovery algorithm
448	   outlined in this document MUST be terminated.  In addition, a new
449	   recovery phase (as described in section 5) MUST NOT be initiated
450	   until HighACK is greater than or equal to the new value of
451	   RecoveryPoint.

453	   As described in Sections 4 and 5, Update () SHOULD continue to be
454	   used appropriately upon receipt of ACKs.  This will allow the
455	   recovery period after an RTO to benefit from all available
456	   information provided by the receiver, even if SACK information
457	   was expunged due to the RTO.

459	   If there are segments missing from the receiver's buffer
460	   following processing of the retransmitted segment, the
461	   corresponding ACK will contain SACK information.  In this case, a
462	   TCP sender SHOULD use this SACK information when determining what
463	   data should be sent in each segment following an RTO.  The exact
464	   algorithm for this selection is not specified in this document
465	   (specifically NextSeg () is inappropriate during loss recovery
466	   after an RTO).  A relatively straightforward method of "filling
467	   in" the sequence space reported as missing should be a reasonable
468	   approach.

470	6   Managing the RTO Timer

472	   The standard TCP RTO estimator is defined in [RFC6298].  Due to the
473	   fact that the SACK algorithm in this document can have an impact on
474	   the behavior of the estimator, implementers may wish to consider how
475	   the timer is managed.  [RFC6298] calls for the RTO timer to be
476	   re-armed each time an ACK arrives that advances the cumulative ACK
477	   point.  Because the algorithm presented in this document can keep the
478	   ACK clock going through a fairly significant loss event
479	   (comparatively longer than the algorithm described in [RFC5681]), on
480	   some networks the loss event could last longer than the RTO.  In this
481	   case the RTO timer would expire prematurely and a segment that need
482	   not be retransmitted would be resent.

484	   Therefore implementers MAY use either the standard [RFC6298]-style
486	   RTO management or, optionally, a more careful variant that re-arms
487	   the RTO timer on each retransmission that is sent during
488	   recovery.  This provides a more conservative timer than
489	   specified in [RFC6298], and so may not always be an attractive
490	   alternative.  However, in some cases it may prevent needless
491	   retransmissions, go-back-N transmission and further reduction of the
492	   congestion window.

494	7   Research

496	   The algorithm specified in this document is analyzed in [FF96], which
497	   shows that the above algorithm is effective in reducing transfer time
498	   over standard TCP Reno [RFC5681] when multiple segments are dropped
499	   from a window of data (especially as the number of drops increases).
500	   [AHKO97] shows that the algorithm defined in this document can
501	   greatly improve throughput in connections traversing satellite
502	   channels.

504	8   Security Considerations

506	   The algorithm presented in this document shares security
507	   considerations with [RFC5681].  A key difference is that an
508	   algorithm based on SACKs is more robust against attackers forging
509	   duplicate ACKs to force the TCP sender to reduce cwnd.  With SACKs,
510	   TCP senders have an additional check on whether or not a particular
511	   ACK is legitimate.  While not fool-proof, SACK does provide some
512	   amount of protection in this area.

514	   Similarly, [CPNI309] sketches a variant of a blind attack [RFC5961]
515	   whereby an attacker can spoof out-of-window data to a TCP endpoint,
516	   causing it to respond to the legitimate peer with a duplicate
517	   cumulative ACK, per [RFC793].  Adding a SACK-based requirement to
518	   trigger loss recovery effectively mitigates this attack, as the
519	   duplicate ACKs caused by out-of-window segments will not contain SACK
520	   information indicating reception of previously un-SACKed in-window
521	   data.

523	9   Changes Relative to RFC 3517

525	   The state variable "DupAcks" has been added to the list of variables
526	   maintained by this algorithm, and its usage specified.

528	   The function IsLost () has been modified to require that more than
529	   (DupThresh - 1) * SMSS octets have been SACKed above a given sequence
530	   number as indication that it is lost, changed from at least
531	   (DupThresh * SMSS).  This retains the requirement that at least three
532	   segments following the sequence number in question have been SACKed,
533	   while improving detection in the event that the sender has
534	   outstanding segments which are smaller than SMSS.

536	   The definition of a "duplicate acknowledgment" has been modified to
537	   utilize the SACK information in detecting loss.  Duplicate cumulative
538	   acknowledgments can be caused by either loss or reordering in the
539	   network.  To disambiguate loss and reordering TCP's fast retransmit
540	   algorithm [RFC5681] waits until three duplicate ACKs arrive to
541	   trigger loss recovery.  This notion was then the basis for the
542	   algorithm specified in [RFC3517].  However, with SACK information
543	   there is no need to rely blindly on the cumulative acknowledgment
544	   field.  We can leverage the additional information present in the
545	   SACK blocks to understand that three segments have arrived at the
546	   receiver which lie above a gap in the sequence space, and can use
547	   that to trigger loss recovery.  This notion was used in [RFC3517]
548	   during loss recovery, and the change in this document is that the
549	   notion is also used to enter a loss recovery phase.

551	   The state variable "RescueRxt" has been added to the list of
552	   variables maintained by the algorithm, and its usage specified.  This
553	   variable is used to allow for one extra retransmission per entry into
554	   loss recovery, in order to keep the ACK clock going under certain
555	   circumstances involving loss at the end of the window.  This
556	   mechanism allows for no more than one segment of no larger than 1
557	   SMSS to be optimistically retransmitted per loss recovery.

559	   Rule (3) of NextSeg() has been changed from MAY to SHOULD, to
560	   appropriately reflect the opinion of the authors and working group
561	   that it should be left in, rather than out, if an implementer does
562	   not have a compelling reason to do otherwise.

564	Acknowledgments

566	   The authors wish to thank Sally Floyd for encouraging [RFC3517]
567	   and commenting on early drafts.  The algorithm described in this
568	   document is loosely based on an algorithm outlined by Kevin Fall
569	   and Sally Floyd in [FF96], although the authors of this document
570	   assume responsibility for any mistakes in the above text.

572	   [RFC3517] was co-authored by Kevin Fall, who provided crucial input
573	   to that document and hence this follow-on work.

575	   Murali Bashyam, Ken Calvert, Tom Henderson, Reiner Ludwig,
576	   Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, Vern Paxson and
577	   Venkat Venkatsubra provided valuable feedback on earlier versions
578	   of this document.

580	   We thank Matt Mathis and Jamshid Mahdavi for implementing the
581	   scoreboard in ns and hence guiding our thinking in keeping track
582	   of SACK state.

584	   The first author would like to thank Ohio University and the Ohio
585	   University Internetworking Research Group for supporting the bulk of
586	   his work on this project.

588	Normative References

590	   [RFC793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
591	             793, September 1981.

593	   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
594	             Selective Acknowledgment Options", RFC 2018, October 1996.

596	   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
597	             3", BCP 9, RFC 2026, October 1996.

599	   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
600	             Requirement Levels", BCP 14, RFC 2119, March 1997.

602	   [RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion
603	             Control", RFC 5681, September 2009.

605	Informative References

607	   [AHKO97]  Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann, "TCP
608	             Performance Over Satellite Links", Proceedings of the Fifth
609	             International Conference on Telecommunications Systems,
610	             Nashville, TN, March 1997.

612	   [All00]   Mark Allman, "A Web Server's View of the Transport Layer",
613	             ACM Computer Communication Review, 30(5), October 2000.

615	   [CPNI309] Fernando Gont, "Security Assessment of the Transmission
616	             Control Protocol (TCP)", CPNI Technical Note 3/2009,
617	             http://www.cpni.gov.uk/Docs/tn-03-09-security-assessment-TCP.pdf,
618	             February 2009.

620	   [Errata1610] Matt Mathis, "RFC Errata Report 1610 for RFC 2018",
621	             http://www.rfc-editor.org/errata_search.php?eid=1610,
622	             Verified 2008-12-09.

624	   [FF96]    Kevin Fall and Sally Floyd, "Simulation-based Comparisons
625	             of Tahoe, Reno and SACK TCP", Computer Communication
626	             Review, July 1996.

628	   [Jac90]   Van Jacobson, "Modified TCP Congestion Avoidance Algorithm",
629	             Technical Report, LBL, April 1990.

631	   [PF01]    Jitendra Padhye, Sally Floyd, "Identifying the TCP Behavior
632	             of Web Servers", ACM SIGCOMM, August 2001.

634	   [RFC3782] Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
635	             Modification to TCP's Fast Recovery Algorithm", RFC 3782,
636	             April 2004.

638	   [RFC2914] Floyd, S., "Congestion Control Principles", BCP 41, RFC
639	             2914, September 2000.

641	   [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, "Computing
642	             TCP's Retransmission Timer", RFC 6298, June 2011.

644	   [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
645	             TCP's Loss Recovery Using Limited Transmit", RFC 3042,
646	             January 2001.

648	   [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A
649	             Conservative Selective Acknowledgment (SACK)-based Loss
650	             Recovery Algorithm for TCP", RFC 3517, April 2003.

652	   [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's
653	             Robustness to Blind In-Window Attacks", RFC 5961, August
654	             2010.

656	Authors' Addresses

658	   Ethan Blanton
659	   Purdue University Computer Sciences
660	   305 N. University St.
661	   West Lafayette, IN  47907

663	   EMail: eblanton@cs.purdue.edu

665	   Mark Allman
666	   International Computer Science Institute
667	   1947 Center St.  Suite 600
668	   Berkeley, CA  94704

670	   Phone: 440-235-1792
671	   EMail: mallman@icir.org
672	   http://www.icir.org/mallman

674	   Lili Wang
675	   Juniper Networks
676	   10 Technology Park Drive
677	   Westford, MA  01886

679	   EMail: liliw@juniper.net

681	   Ilpo Jarvinen
682	   University of Helsinki
683	   P.O. Box 68
684	   FI-00014 UNIVERSITY OF HELSINKI
685	   Finland

687	   EMail: ilpo.jarvinen@helsinki.fi

689	   Markku Kojo
690	   University of Helsinki
691	   P.O. Box 68
692	   FI-00014 UNIVERSITY OF HELSINKI
693	   Finland

695	   EMail: kojo@cs.helsinki.fi

696	   Yoshifumi Nishida
697	   WIDE Project
698	   Endo 5322
699	   Fujisawa, Kanagawa  252-8520
700	   Japan

702	   EMail: nishida@wide.ad.jp