idnits 2.17.1 

draft-allman-tcp-sack-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-24) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  ** There are 26 instances of lines with control characters in the document.

  ** The abstract seems to contain references ([RFC2119], [RFC2581]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'A' is mentioned on line 117, but not defined

  == Missing Reference: 'B' is mentioned on line 117, but not defined

  -- Possible downref: Non-RFC (?) normative reference: ref. 'AHKO97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'All00'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'FF96'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Jac90'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PF01'

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782)


     Summary: 9 errors (**), 0 flaws (~~), 3 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                            Ethan Blanton
2	INTERNET DRAFT                                           Ohio University
3	File: draft-allman-tcp-sack-09.txt                           Mark Allman
4	                                                            BBN/NASA GRC
5	                                                              Kevin Fall
6	                                                          Intel Research
7	                                                          February, 2002
8	                                                   Expires: August, 2002

10	       A Conservative SACK-based Loss Recovery Algorithm for TCP

12	Status of this Memo

14	    This document is an Internet-Draft and is in full conformance with
15	    all provisions of Section 10 of [RFC2026].

17	    Internet-Drafts are working documents of the Internet Engineering
18	    Task Force (IETF), its areas, and its working groups.  Note that
19	    other groups may also distribute working documents as
20	    Internet-Drafts.

22	    Internet-Drafts are draft documents valid for a maximum of six
23	    months and may be updated, replaced, or obsoleted by other documents
24	    at any time.  It is inappropriate to use Internet-Drafts as
25	    reference material or to cite them other than as "work in progress."

27	    The list of current Internet-Drafts can be accessed at
28	    http://www.ietf.org/ietf/1id-abstracts.txt

30	    The list of Internet-Draft Shadow Directories can be accessed at
31	    http://www.ietf.org/shadow.html.

33	Abstract

35	    This document presents a conservative loss recovery algorithm
36	    for TCP that is based on the use of the selective acknowledgment
37	    TCP option.  The algorithm presented in this document conforms
38	    to the spirit of the current congestion control specification
39	    [RFC2581], but allows TCP senders to recover more effectively
40	    when multiple segments are lost from a single flight of data.

42	Terminology

44	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
45	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
46	    document are to be interpreted as described in RFC 2119 [RFC2119].

48	1   Introduction

50	    This document presents a conservative loss recovery algorithm for
51	    TCP that is based on the use of the selective acknowledgment TCP
52	    option.  While the TCP selective acknowledgment (SACK) option
53	    [RFC2018] is being steadily deployed in the Internet [All00], there
54	    is evidence that hosts are not using the SACK information when
55	    making retransmission and congestion control decisions [PF01].  The
56	    goal of this document is to outline one straightforward method for
57	    TCP implementations to use SACK information to increase performance.

59	    [RFC2581] allows advanced loss recovery algorithms to be used by TCP
60	    [RFC793] provided that they follow the spirit of TCP's congestion
61	    control algorithms [RFC2581,RFC2914].  [RFC2582] outlines one such
62	    advanced recovery algorithm called NewReno.  This document outlines
63	    a loss recovery algorithm that uses the selective acknowledgment
64	    (SACK) [RFC2018] TCP option to enhance TCP's loss recovery.  The
65	    algorithm outlined in this document, heavily based on the algorithm
66	    detailed in [FF96], is a conservative replacement of the fast
67	    recovery algorithm [Jac90,RFC2581].  The algorithm specified in this
68	    document is a straightforward SACK-based loss recovery strategy that
69	    follows the guidelines set in [RFC2581] and can safely be used in
70	    TCP implementations.  Alternate SACK-based loss recovery methods can
71	    be used in TCP as implementers see fit (as long as the alternate
72	    algorithms follow the guidelines provided in [RFC2581]).  Please
73	    note, however, that the SACK-based decisions in this document (such
74	    as what segments are to be sent at what time) are largely decoupled
75	    from the congestion control algorithms, and as such can be treated
76	    as separate issues if so desired.

78	2   Definitions

80	    The reader is expected to be familiar with the definitions given in
81	    [RFC2581].

83	    The reader is assumed to be familiar with selective acknowledgments
84	    as specified in [RFC2018].

86	    For the purposes of explaining the SACK-based loss recovery
87	    algorithm we define four variables that a TCP sender stores:

89	        ``HighACK'' is the sequence number of the highest byte of
90		data that has been cumulatively ACKed at a given point.

92	        ``HighData'' is the highest sequence number transmitted at a
93	        given point.

95	        ``HighRxt'' is the highest sequence number which has been
96	        retransmitted during the current loss recovery phase.

98		``Pipe'' is a sender's estimate of the number of bytes
99		outstanding in the network.  This is used during recovery
100		for limiting the sender's sending rate.  The pipe variable
101		allows TCP to use a fundamentally different congestion
102		control than specified in [RFC2581].  The algorithm is often
103		referred to as the ``pipe algorithm''.

105	    For the purposes of this specification we define a ``duplicate
106	    acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK
107	    number is equal to the current value of HighACK, as described in
108	    [RFC2581].

110	    We define a variable ``DupThresh'' that holds the number of
111	    duplicate acknowledgments required to trigger a retransmission.  Per
112	    [RFC2581] this threshold is defined to be 3 duplicate
113	    acknowledgments.  However, implementers should consult any updates
114	    to [RFC2581] to determine the current value for DupThresh (or method
115	    for determining its value).

117	    Finally, a range of sequence numbers [A,B] is said to ``cover''
118	    sequence number S if A <= S <= B.
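
    Note (illustrative, non-normative): the definitions of ``cover''
    and of a duplicate acknowledgment can be sketched as below.  The
    names covers and DupAckCounter are hypothetical helpers and not
    part of this specification.  The sketch assumes sequence numbers
    are plain integers (no wraparound) and that DupThresh is 3.

```python
def covers(a, b, s):
    """Return True if the inclusive range [a, b] covers sequence number s."""
    return a <= s <= b


class DupAckCounter:
    """Counts duplicate ACKs against HighACK (hypothetical helper)."""

    def __init__(self, high_ack, dup_thresh=3):
        self.high_ack = high_ack      # HighACK
        self.dup_thresh = dup_thresh  # DupThresh (assumed to be 3)
        self.count = 0

    def on_ack(self, cum_ack):
        """Return True when this ACK is the DupThresh-th duplicate."""
        if cum_ack == self.high_ack:
            # Cumulative ACK equals HighACK: a duplicate acknowledgment.
            self.count += 1
            return self.count == self.dup_thresh
        if cum_ack > self.high_ack:
            # New data acknowledged: advance HighACK and start over.
            self.high_ack = cum_ack
            self.count = 0
        return False
```

    A real sender would compare sequence numbers using modular
    arithmetic; that detail is left out here for brevity.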

120	3   Keeping Track of SACK Information

122	    For a TCP sender to implement the algorithm defined in the next
123	    section it must keep a data structure to store incoming
124	    selective acknowledgment information on a per connection basis.
125	    Such a data structure is commonly called the ``scoreboard''.
126	    The specifics of the scoreboard data structure are out of scope
127	    for this document (as long as the implementation can perform all
128	    functions required by this specification).

130	    Note that while this document speaks of marking and keeping
131	    track of octets, a real world implementation would probably want
132	    to keep track of octet ranges or otherwise collapse the data
133	    while ensuring that arbitrary ranges are still markable.
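
    Note (illustrative, non-normative): one possible way to collapse
    per-octet marks into octet ranges is sketched below.  The function
    name add_range and the choice of a sorted list of inclusive
    (low, high) tuples are assumptions made for this example only;
    the scoreboard layout remains out of scope for this document.

```python
def add_range(ranges, start, end):
    """Add the inclusive range [start, end] to a list of SACKed ranges.

    Returns a new sorted list in which overlapping or adjacent ranges
    have been merged, so arbitrary ranges remain markable.
    """
    merged = []
    for low, high in sorted(ranges + [(start, end)]):
        if merged and low <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged
```

    Re-sorting on every insertion keeps the sketch short; an
    implementation concerned with efficiency would likely insert in
    place instead.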

135	4   Processing and Acting Upon SACK Information

137	    For the purposes of the algorithm defined in this document the
138	    scoreboard SHOULD implement the following functions:

140	    Update ():

142	        Each octet that is cumulatively ACKed or SACKed should be marked
143	        accordingly in the scoreboard data structure, and the total
144	        number of octets SACKed should be recorded.

146	        Note: SACK information is advisory and therefore SACKed data
147	        MUST NOT be removed from TCP's retransmission buffer until the
148	        data is cumulatively acknowledged [RFC2018].
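
    Note (illustrative, non-normative): a minimal sketch of Update ()
    follows.  It marks individual octets in a set, mirroring the
    per-octet wording above rather than an efficient layout.  The
    class name Scoreboard, the inclusive (left, right) form of a SACK
    block, and the return value (the count of newly SACKed octets)
    are assumptions of this example.

```python
class Scoreboard:
    """Per-connection SACK state (hypothetical, per-octet sketch)."""

    def __init__(self, high_ack):
        self.high_ack = high_ack  # HighACK
        self.sacked = set()       # octets that have been SACKed

    def update(self, cum_ack, sack_blocks):
        """Record a cumulative ACK and SACK blocks.

        sack_blocks is a list of inclusive (left, right) octet ranges.
        Returns the number of octets SACKed for the first time.
        """
        self.high_ack = max(self.high_ack, cum_ack)
        newly_sacked = 0
        for left, right in sack_blocks:
            for seq in range(left, right + 1):
                if seq > self.high_ack and seq not in self.sacked:
                    self.sacked.add(seq)
                    newly_sacked += 1
        return newly_sacked

    def total_sacked(self):
        """Total number of SACKed octets above HighACK."""
        return sum(1 for seq in self.sacked if seq > self.high_ack)
```

    The sketch only tracks marks; it never touches the retransmission
    buffer, which stays intact until data is cumulatively ACKed.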

150	    NextSeg ():

152		This routine uses the scoreboard data structure maintained
153		by the Update() function to determine what to transmit based
154		on the SACK information that has arrived from the data
155		receiver (and, hence, marked in the scoreboard).  NextSeg ()
156		MUST return the sequence number range of the next
157	        segment that is to be transmitted, per the following rules:

159	        (1) If there exists a smallest unSACKed sequence number 'S1'
160	            such that HighRxt < S1 < HighData and there are either
161	            DupThresh * SMSS octets above S1 which have been SACKed or
162	            the number of discontiguous SACKed sequence spaces above S1
163	            is greater than DupThresh, S1 is presumed to have been lost
164	            and the sequence range of one segment of up to SMSS octets
165	            starting with S1 MUST be returned.

167	        (2) If no sequence number 'S1' per rule (1) exists but there
168	            exists available unsent data and the receiver's advertised
169	            window allows, the sequence range of one segment of up to
170	            SMSS octets of previously unsent data starting with sequence
171	            number HighData+1 MUST be returned.

173	        (3) If the conditions for rules (1) and (2) fail, but there
174	            exists an unSACKed sequence number 'S2' such that HighRxt <
175	            S2 < HighData, one segment of up to SMSS octets starting
176	            with S2 MUST be returned.  Note that this segment need not
177	            meet the additional requirements in (1).

179	        (4) If the conditions for each of (1), (2), and (3) are not
180	            met, then NextSeg () MUST indicate failure, and no segment
181	            is returned.
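
    Note (illustrative, non-normative): rules (1) through (4) can be
    sketched as below.  Assumptions of this example, none of which
    are stated by the rules: SACKed octets are held in a set, ranges
    are inclusive, candidate octets must also lie above HighACK, a
    retransmitted segment stops before the next SACKed octet, and
    failure is signalled by returning None.  The parameter names
    unsent and rwnd_allows are hypothetical.

```python
def _runs_above(sacked, s):
    """Count discontiguous runs of SACKed octets above s."""
    above = sorted(x for x in sacked if x > s)
    return sum(1 for i, x in enumerate(above)
               if i == 0 or above[i - 1] != x - 1)


def _segment(start, high_data, sacked, smss):
    """Up to SMSS unSACKed octets starting at start (inclusive range)."""
    end = start
    while (end < high_data and end - start + 1 < smss
           and end + 1 not in sacked):
        end += 1
    return (start, end)


def next_seg(sacked, high_ack, high_rxt, high_data, unsent, rwnd_allows,
             smss=100, dup_thresh=3):
    floor = max(high_ack, high_rxt)
    holes = [s for s in range(floor + 1, high_data) if s not in sacked]
    if holes:
        # Rule (1).  Only the smallest hole needs checking: the amount
        # of SACKed data above a hole cannot grow for larger holes.
        s1 = holes[0]
        octets_above = sum(1 for x in sacked if x > s1)
        if (octets_above >= dup_thresh * smss
                or _runs_above(sacked, s1) > dup_thresh):
            return _segment(s1, high_data, sacked, smss)
    if unsent > 0 and rwnd_allows:
        # Rule (2): previously unsent data starting at HighData+1.
        return (high_data + 1, high_data + min(smss, unsent))
    if holes:
        # Rule (3): any remaining unSACKed octet.
        return _segment(holes[0], high_data, sacked, smss)
    return None  # Rule (4): nothing to send.
```

    The small default SMSS of 100 octets is chosen only to keep
    example numbers readable.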

183	    AmountSACKed (RangeBegin,RangeEnd):

185	        This routine MUST return the total number of octets which fall
186		between RangeBegin and RangeEnd that have been selectively
187	        acknowledged by the receiver.
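
    Note (illustrative, non-normative): AmountSACKed () can be
    sketched over a list of SACKed ranges as below.  The sketch
    assumes that both the stored ranges and the RangeBegin/RangeEnd
    arguments are inclusive; the text above does not say whether the
    end points are included, so this is an interpretation.

```python
def amount_sacked(ranges, range_begin, range_end):
    """Octets SACKed within [range_begin, range_end], inclusive.

    ranges is a list of disjoint inclusive (low, high) SACKed ranges.
    """
    total = 0
    for low, high in ranges:
        overlap = min(high, range_end) - max(low, range_begin) + 1
        if overlap > 0:
            total += overlap
    return total
```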

189	    Note: The SACK-based loss recovery algorithm outlined in this
190	    document requires more computational resources than previous TCP
191	    loss recovery strategies.  However, we believe the scoreboard data
192	    structure can be implemented in a reasonably efficient manner (both
193	    in terms of computation complexity and memory usage) in most TCP
194	    implementations.

196	5   Algorithm Details

198	    Upon the receipt of any ACK containing SACK information, the
199	    scoreboard MUST be updated via the Update () routine.

201	    Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
202	    scoreboard is to be updated as normal.  Note: The first and second
203	    duplicate ACKs can also be used to trigger the transmission of
204	    previously unsent segments using the Limited Transmit algorithm
205	    [RFC3042].

207	    When a TCP sender receives the duplicate ACK corresponding to
208	    DupThresh ACKs, the scoreboard MUST be updated with the new SACK
209	    information (via Update ()) and a loss recovery phase SHOULD be
210	    initiated, per the fast retransmit algorithm outlined in [RFC2581],
211	    and in doing so the following steps MUST be taken:

213	    (1) pipe = HighData - HighACK - AmountSACKed (HighACK,HighData)

215		Set a ``pipe'' variable to the number of outstanding octets
216	        currently ``in the pipe''; this is the data which has been
217	        sent by the TCP sender but for which no cumulative or
218	        selective acknowledgment has been received.  This data is
219	        assumed to be still traversing the network path.

221	    (2) RecoveryPoint = HighData

223		When the TCP sender receives a cumulative ACK for this data
224	        octet the loss recovery phase is terminated.

226	    (3) ssthresh = cwnd = (FlightSize / 2)

228		The congestion window (cwnd) and slow start threshold
229		(ssthresh) are reduced to half of FlightSize per [RFC2581].

231	    (4) Retransmit the first data segment presumed dropped -- the
232		segment starting with sequence number HighACK + 1.  To
233		prevent repeated retransmission of the same data, set
234		HighRxt to the highest sequence number in the retransmitted
235		segment.

237	    (5) In order to take advantage of potential additional available
238	        cwnd, proceed to step (D) below.
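
    Note (illustrative, non-normative): the arithmetic of steps (1)
    through (4) is worked through in the sketch below.  It assumes
    FlightSize equals HighData - HighACK, uses integer division for
    the halving, and takes the AmountSACKed () result as an input.
    The names RecoveryState and enter_recovery are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    pipe: int
    recovery_point: int
    cwnd: int
    ssthresh: int
    high_rxt: int


def enter_recovery(high_ack, high_data, amount_sacked, smss):
    """Compute sender state on the DupThresh-th duplicate ACK."""
    # Step (1): octets sent but neither cumulatively ACKed nor SACKed.
    pipe = high_data - high_ack - amount_sacked
    # Step (2): recovery ends once this octet is cumulatively ACKed.
    recovery_point = high_data
    # Step (3): halve the window (FlightSize assumed to be
    # HighData - HighACK).
    half = (high_data - high_ack) // 2
    # Step (4): the first segment presumed lost starts at HighACK + 1;
    # HighRxt is the highest octet of that retransmitted segment.
    high_rxt = min(high_ack + smss, high_data)
    return RecoveryState(pipe, recovery_point, half, half, high_rxt)
```

    For example, with HighACK = 999, HighData = 1999, 300 SACKed
    octets and an SMSS of 100, pipe is 700 and cwnd is 500, so step
    (D) sends nothing further until pipe falls below cwnd.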

240	    Once a TCP is in the loss recovery phase the following procedure
241	    MUST be used for each arriving ACK:

243	    (A) An incoming cumulative ACK for a sequence number greater than
244	        RecoveryPoint signals the end of loss recovery and the loss
245	        recovery phase MUST be terminated.  Any information contained in
246	        the scoreboard for sequence numbers greater than the new value
247	        of HighACK SHOULD NOT be cleared when leaving the loss recovery
248	        phase.

250	    (B) Upon receipt of a duplicate ACK the following actions MUST be
251	        taken:

253	        (B.1) Use Update () to record the new SACK information conveyed
254	            by the incoming ACK.

256	        (B.2) The pipe variable is decremented by the number of newly
257	            SACKed data octets conveyed in the incoming ACK (i.e., those
258	            octets that are being SACKed for the first time), as that is
259	            the amount of new data presumed to have left the network.

261	    (C) When a ``partial ACK'' (an ACK that increases the HighACK point,
262	        but does not terminate loss recovery) arrives, the following
263	        actions MUST be performed:

265	        (C.1) Before updating HighACK based on the received cumulative
266	            ACK, save HighACK as OldHighACK.

268	        (C.2) The scoreboard MUST be updated based on the cumulative ACK
269	            and any new SACK information that is included in the ACK via
270	            the Update () routine.

272	        (C.3) The value of pipe MUST be decremented by the number of
273	            octets that have left the network path using the following
274		    equation:

276		    pipe = pipe - ((HighACK - OldHighACK) -
277			           AmountSACKed (OldHighACK + 1, HighACK))

279	        (C.4) The value of pipe MUST be decremented by the number of
280	            newly SACKed data octets conveyed in the incoming ACK (i.e.,
281	            those octets that are being SACKed for the first time), as
282	            these octets represent data that has left the network.

284	    (D) While pipe is less than cwnd the TCP sender SHOULD transmit one
285	        or more segments as follows:

287	        (D.1) The scoreboard MUST be queried via NextSeg () for the
288	            sequence number range of the next segment to transmit, and
289	            the given segment sent.

291	        (D.2) If any of the data octets sent in (D.1) are above
292	            HighData, the pipe variable MUST be incremented by the
293	            number of data octets previously unsent in (D.1).

295	        (D.3) If any of the data octets sent in (D.1) are below
296	            HighData, HighRxt MUST be set to the highest sequence number
297	            of the segment retransmitted.

299	        (D.4) If any of the data octets sent in (D.1) are above
300	            HighData MUST be updated to reflect the
301	            transmission of previously unsent data.

303		(D.5) If cwnd - pipe is greater than 1 SMSS, return to (D.1).
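
    Note (illustrative, non-normative): steps (A) through (D) can be
    sketched as two functions.  Assumptions of this example: cum_ack
    is the highest octet cumulatively acknowledged (the HighACK
    convention), so recovery ends once it reaches RecoveryPoint, per
    step (2); SACKed octets are held in a set with inclusive blocks;
    the loop stops if the segment chooser reports failure.  The
    names account_ack, send_loop, next_seg and transmit are
    hypothetical.

```python
def account_ack(pipe, sacked, high_ack, recovery_point, cum_ack,
                sack_blocks):
    """Apply (A)-(C).  Returns (pipe, high_ack, still_in_recovery)."""
    if cum_ack >= recovery_point:                  # (A)
        return pipe, cum_ack, False
    old_high_ack = high_ack                        # (C.1)
    high_ack = max(high_ack, cum_ack)
    newly_sacked = 0                               # (B.1) / (C.2)
    for left, right in sack_blocks:
        for seq in range(max(left, high_ack + 1), right + 1):
            if seq not in sacked:
                sacked.add(seq)
                newly_sacked += 1
    if high_ack > old_high_ack:                    # (C.3)
        already = sum(1 for s in sacked if old_high_ack < s <= high_ack)
        pipe -= (high_ack - old_high_ack) - already
    pipe -= newly_sacked                           # (B.2) / (C.4)
    return pipe, high_ack, True


def send_loop(pipe, cwnd, high_data, high_rxt, smss, next_seg, transmit):
    """Apply (D).  Returns (pipe, high_data, high_rxt)."""
    while pipe < cwnd:
        seg = next_seg(high_rxt, high_data)        # (D.1)
        if seg is None:
            break  # assumption: stop when no segment is available
        start, end = seg
        transmit(start, end)
        if end > high_data:                        # (D.2)
            pipe += end - max(start, high_data + 1) + 1
        if start < high_data:                      # (D.3)
            high_rxt = max(high_rxt, min(end, high_data))
        if end > high_data:                        # (D.4)
            high_data = end
        if not (cwnd - pipe > smss):               # (D.5)
            break
    return pipe, high_data, high_rxt
```

    Passing next_seg and transmit as callables keeps the sketch
    independent of any particular scoreboard layout.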

305	5.1 Retransmission Timeouts

307	    In order to avoid memory deadlocks, the TCP receiver is allowed to
308	    discard data that has already been acknowledged with a selective
309	    acknowledgment.  As a result [RFC2018] suggests that a TCP sender
310	    SHOULD expunge the SACK information gathered from a receiver upon a
311	    retransmission timeout ``since the timeout might indicate that the
312	    data receiver has reneged.''  Additionally, a TCP sender MUST
313	    ``ignore prior SACK information in determining which data to
314	    retransmit.''  However, a SACK TCP sender SHOULD still use all SACK
315	    information made available during the slow start phase of loss
316	    recovery following an RTO.

318	    As described in Sections 4 and 5, Update () MAY continue to be
319	    used appropriately upon receipt of ACKs.  This will allow the
320	    slow start recovery period to benefit from all available
321	    information provided by the receiver, despite the fact that SACK
322	    information was expunged due to the RTO.

324	    If there are segments missing from the receiver's buffer following
325	    processing of the retransmitted segment, the corresponding ACK will
326	    contain SACK information.  In this case, a TCP sender SHOULD use
327	    this SACK information by using the NextSeg () routine to determine
328	    what data should be sent in each segment of the slow start.
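
    Note (illustrative, non-normative): the handling described in
    this section can be sketched as below.  SACK state is expunged
    when the retransmission timer fires, and Update () then rebuilds
    it from ACKs that arrive during slow start.  The class name
    SackState and its per-octet set are assumptions of this example.

```python
class SackState:
    """Hypothetical holder for SACK marks across an RTO."""

    def __init__(self):
        self.sacked = set()

    def update(self, sack_blocks):
        """Mark octets from inclusive (left, right) SACK blocks."""
        for left, right in sack_blocks:
            self.sacked.update(range(left, right + 1))

    def on_rto(self):
        """Expunge SACK information: the receiver may have reneged."""
        self.sacked.clear()
```

    After on_rto (), only SACK blocks carried by subsequent ACKs
    influence which data is retransmitted.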

330	6   Research

332	    The algorithm specified in this document is analyzed in [FF96],
333	    which shows that the above algorithm is effective in reducing
334	    transfer time over standard TCP Reno [RFC2581] when multiple
335	    segments are dropped from a window of data (especially as the number
336	    of drops increases).  [AHKO97] shows that the algorithm defined in
337	    this document can greatly improve throughput in connections
338	    traversing satellite channels.

340	7   Security Considerations

342	    The algorithm presented in this document shares security
343	    considerations with [RFC2581].  A key difference is that an
344	    algorithm based on SACKs is more robust against attackers forging
345	    duplicate ACKs to force the TCP sender to reduce cwnd.  With SACKs,
346	    TCP senders have an additional check on whether or not a particular
347	    ACK is legitimate.  While not fool-proof, SACK does provide some
348	    amount of protection in this area.

350	Acknowledgments

352	    The authors wish to thank Sally Floyd for encouraging this document
353	    and commenting on an early draft.  The algorithm described in this
354	    document is largely based on an algorithm outlined by Kevin Fall and
355	    Sally Floyd in [FF96], although the authors of this document assume
356	    responsibility for any mistakes in the above text.  Murali Bashyam,
357	    Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, Vern
358	    Paxson and Venkat Venkatsubra provided valuable feedback on earlier
359	    versions of this document.  Finally, we thank Matt Mathis and
360	    Jamshid Mahdavi for implementing the scoreboard in ns and hence
361	    guiding our thinking in keeping track of SACK state.

363	References

365	    [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP
366	        Performance Over Satellite Links.  Proceedings of the Fifth
367	        International Conference on Telecommunications Systems,
368	        Nashville, TN, March, 1997.

370	    [All00] Mark Allman. A Web Server's View of the Transport Layer. ACM
371	        Computer Communication Review, 30(5), October 2000.

373	    [FF96] Kevin Fall and Sally Floyd.  Simulation-based Comparisons of
374	        Tahoe, Reno and SACK TCP.  Computer Communication Review, July
375	        1996.

377	    [Jac90] Van Jacobson.  Modified TCP Congestion Avoidance Algorithm.
378	        Technical Report, LBL, April 1990.

380	    [PF01] Jitendra Padhye, Sally Floyd.  Identifying the TCP Behavior
381	        of Web Servers, ACM SIGCOMM, August 2001.

383	    [RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793,
384	        September 1981.

386	    [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective
387	        Acknowledgment Options. RFC 2018, October 1996.

389	    [RFC2026] Scott Bradner. The Internet Standards Process -- Revision
390	        3, RFC 2026, October 1996.

392	    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
393	        Requirement Levels", BCP 14, RFC 2119, March 1997.

395	    [RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP
396	        Congestion Control, RFC 2581, April 1999.

398	    [RFC2582] Sally Floyd and Tom Henderson.  The NewReno Modification
399	        to TCP's Fast Recovery Algorithm, RFC 2582, April 1999.

401	    [RFC2914] Sally Floyd.  Congestion Control Principles, RFC 2914,
402	        September 2000.

404	    [RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd.  Enhancing
405	        TCP's Loss Recovery Using Limited Transmit.  RFC 3042,
406		January 2001.

408	Authors' Addresses:

410	    Ethan Blanton
411	    Ohio University Internetworking Research Lab
412	    Stocker Center
413	    Athens, OH  45701
414	    eblanton@irg.cs.ohiou.edu

416	    Mark Allman
417	    BBN Technologies/NASA Glenn Research Center
418	    Lewis Field
419	    21000 Brookpark Rd.  MS 54-5
420	    Cleveland, OH  44135
421	    Phone: 216-433-6586
422	    Fax: 216-433-8705
423	    mallman@bbn.com
424	    http://roland.grc.nasa.gov/~mallman

426	    Kevin Fall
427	    Intel Research
428	    2150 Shattuck Ave., PH Suite
429	    Berkeley, CA 94704
430	    kfall@intel-research.net