idnits 2.17.1 

draft-allman-tcp-sack-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-24) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  ** There are 31 instances of lines with control characters in the document.

  ** The abstract seems to contain references ([RFC2119], [RFC2581]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 262 has weird spacing: '...ariable  to th...'

  == Line 263 has weird spacing: '... is the  data ...'

  == Line 264 has weird spacing: '...	sent  by the ...'

  == Line 265 has weird spacing: '...ent has  been ...'

  == Line 266 has weird spacing: '...	not  been det...'

  == (1 more instance...)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'A' is mentioned on line 117, but not defined

  == Missing Reference: 'B' is mentioned on line 117, but not defined

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2582
     (Obsoleted by RFC 3782)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)


     Summary: 7 errors (**), 0 flaws (~~), 9 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                            Ethan Blanton
2	INTERNET DRAFT                                           Ohio University
3	File: draft-allman-tcp-sack-12.txt                           Mark Allman
4	                                                            BBN/NASA GRC
5	                                                              Kevin Fall
6	                                                          Intel Research
7	                                                              July, 2002
8	                                                  Expires: January, 2003

10	       A Conservative SACK-based Loss Recovery Algorithm for TCP

12	Status of this Memo

14	    This document is an Internet-Draft and is in full conformance with
15	    all provisions of Section 10 of [RFC2026].

17	    Internet-Drafts are working documents of the Internet Engineering
18	    Task Force (IETF), its areas, and its working groups.  Note that
19	    other groups may also distribute working documents as
20	    Internet-Drafts.

22	    Internet-Drafts are draft documents valid for a maximum of six
23	    months and may be updated, replaced, or obsoleted by other documents
24	    at any time.  It is inappropriate to use Internet-Drafts as
25	    reference material or to cite them other than as "work in progress."

27	    The list of current Internet-Drafts can be accessed at
28	    http://www.ietf.org/ietf/1id-abstracts.txt

30	    The list of Internet-Draft Shadow Directories can be accessed at
31	    http://www.ietf.org/shadow.html.

33	Abstract

35	    This document presents a conservative loss recovery algorithm
36	    for TCP that is based on the use of the selective acknowledgment
37	    TCP option.  The algorithm presented in this document conforms
38	    to the spirit of the current congestion control specification
39	    [RFC2581], but allows TCP senders to recover more effectively
40	    when multiple segments are lost from a single flight of data.

42	Terminology

44	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
45	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
46	    document are to be interpreted as described in RFC 2119 [RFC2119].

48	1   Introduction

50	    This document presents a conservative loss recovery algorithm for
51	    TCP that is based on the use of the selective acknowledgment TCP
52	    option.  While the TCP selective acknowledgment (SACK) option
53	    [RFC2018] is being steadily deployed in the Internet [All00], there
54	    is evidence that hosts are not using the SACK information when
55	    making retransmission and congestion control decisions [PF01].  The
56	    goal of this document is to outline one straightforward method for
57	    TCP implementations to use SACK information to increase performance.

59	    [RFC2581] allows advanced loss recovery algorithms to be used by TCP
60	    [RFC793] provided that they follow the spirit of TCP's congestion
61	    control algorithms [RFC2581,RFC2914].  [RFC2582] outlines one such
62	    advanced recovery algorithm called NewReno.  This document outlines
63	    a loss recovery algorithm that uses the selective acknowledgment
64	    (SACK) [RFC2018] TCP option to enhance TCP's loss recovery.  The
65	    algorithm outlined in this document, heavily based on the algorithm
66	    detailed in [FF96], is a conservative replacement of the fast
67	    recovery algorithm [Jac90,RFC2581].  The algorithm specified in this
68	    document is a straightforward SACK-based loss recovery strategy that
69	    follows the guidelines set in [RFC2581] and can safely be used in
70	    TCP implementations.  Alternate SACK-based loss recovery methods can
71	    be used in TCP as implementers see fit (as long as the alternate
72	    algorithms follow the guidelines provided in [RFC2581]).  Please
73	    note, however, that the SACK-based decisions in this document (such
74	    as what segments are to be sent at what time) are largely decoupled
75	    from the congestion control algorithms, and as such can be treated
76	    as separate issues if so desired.

78	2   Definitions

80	    The reader is expected to be familiar with the definitions given in
81	    [RFC2581].

83	    The reader is assumed to be familiar with selective acknowledgments
84	    as specified in [RFC2018].

86	    For the purposes of explaining the SACK-based loss recovery
87	    algorithm we define four variables that a TCP sender stores:

89	        ``HighACK'' is the sequence number of the highest byte of
90		data that has been cumulatively ACKed at a given point.

92	        ``HighData'' is the highest sequence number transmitted at a
93	        given point.

95	        ``HighRxt'' is the highest sequence number which has been
96	        retransmitted during the current loss recovery phase.

98		``Pipe'' is a sender's estimate of the number of bytes
99		outstanding in the network.  This is used during recovery
100		for limiting the sender's sending rate.  The pipe variable
101		allows TCP to use a fundamentally different congestion
102		control than specified in [RFC2581].  The algorithm is often
103		referred to as the ``pipe algorithm''.

105	    For the purposes of this specification we define a ``duplicate
106	    acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK
107	    number is equal to the current value of HighACK, as described in
108	    [RFC2581].

110	    We define a variable ``DupThresh'' that holds the number of
111	    duplicate acknowledgments required to trigger a retransmission.  Per
112	    [RFC2581] this threshold is defined to be 3 duplicate
113	    acknowledgments.  However, implementers should consult any updates
114	    to [RFC2581] to determine the current value for DupThresh (or method
115	    for determining its value).

117	    Finally, a range of sequence numbers [A,B] is said to ``cover''
118	    sequence number S if A <= S <= B.
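
    The definitions of a duplicate acknowledgment, DupThresh and
    ``cover'' given above can be sketched as follows.  This is an
    illustration only and not part of the specification: the name
    DupAckCounter and its fields are hypothetical, DupThresh is fixed
    at 3 as stated above, and sequence number wraparound is ignored.

```python
DUP_THRESH = 3  # value given above per [RFC2581]; assumed constant here


class DupAckCounter:
    """Counts ACKs whose cumulative ACK number equals HighACK."""

    def __init__(self, high_ack):
        self.high_ack = high_ack  # highest octet cumulatively ACKed so far
        self.count = 0            # duplicate ACKs seen for this HighACK

    def on_ack(self, ack_number):
        """Return True exactly when the DupThresh-th duplicate ACK arrives."""
        if ack_number == self.high_ack:
            self.count += 1
            return self.count == DUP_THRESH
        if ack_number > self.high_ack:
            # The cumulative ACK point advanced: start counting afresh.
            self.high_ack = ack_number
            self.count = 0
        return False


def covers(a, b, s):
    """True when the inclusive range [A,B] covers sequence number S."""
    return a <= s <= b
```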

120	3   Keeping Track of SACK Information

122	    For a TCP sender to implement the algorithm defined in the next
123	    section it must keep a data structure to store incoming
124	    selective acknowledgment information on a per connection basis.
125	    Such a data structure is commonly called the ``scoreboard''.
126	    The specifics of the scoreboard data structure are out of scope
127	    for this document (as long as the implementation can perform all
128	    functions required by this specification).

130	    Note that while this document speaks of marking and keeping
131	    track of octets, a real world implementation would probably want
132	    to keep track of octet ranges or otherwise collapse the data
133	    while ensuring that arbitrary ranges are still markable.
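
    One possible way to collapse per-octet marks into octet ranges, as
    suggested above, is sketched below.  The scoreboard layout is out
    of scope for this document, so everything here is an assumption
    made for illustration: the function names add_range and is_marked
    are hypothetical, ranges are inclusive (first, last) pairs, and
    sequence number wraparound is ignored.

```python
def add_range(ranges, first, last):
    """Mark the inclusive octet range [first, last].

    `ranges` is a list of disjoint inclusive (first, last) tuples.  A new
    sorted list is returned in which overlapping or adjacent ranges have
    been merged, so arbitrary ranges stay markable at low cost.
    """
    merged = []
    for lo, hi in sorted(ranges + [(first, last)]):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def is_marked(ranges, seq):
    """True when some stored range covers sequence number `seq`."""
    return any(lo <= seq <= hi for lo, hi in ranges)
```

    Marking [3000,3999] and [5000,5999] keeps two ranges; marking
    [4000,4999] afterwards collapses all three into [3000,5999].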

135	4   Processing and Acting Upon SACK Information

137	    For the purposes of the algorithm defined in this document the
138	    scoreboard SHOULD implement the following functions:

140	    Update ():

142	        Given the information provided in an ACK, each octet that is
143	        cumulatively ACKed or SACKed should be marked accordingly in
144	        the scoreboard data structure, and the total number of
145	        octets SACKed should be recorded.

147	        Note: SACK information is advisory and therefore SACKed data
148	        MUST NOT be removed from TCP's retransmission buffer until the
149	        data is cumulatively acknowledged [RFC2018].
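
    A minimal sketch of Update () follows.  It tracks individual
    octets in a set to mirror the wording above, which a real
    implementation would likely replace with ranges.  The class name
    Scoreboard, its fields, and the argument format (an inclusive
    cumulative ACK point and inclusive SACK blocks, already converted
    from the on-the-wire encoding) are assumptions for illustration.

```python
class Scoreboard:
    """Per-connection record of SACK information (illustrative only)."""

    def __init__(self, high_ack):
        self.high_ack = high_ack  # highest octet cumulatively ACKed
        self.sacked = set()       # SACKed octets above high_ack

    def update(self, cum_ack, sack_blocks):
        """Record one ACK and return the total number of octets SACKed.

        cum_ack     -- highest octet cumulatively ACKed by this ACK
        sack_blocks -- iterable of inclusive (first, last) octet pairs
        """
        if cum_ack > self.high_ack:
            self.high_ack = cum_ack
        for first, last in sack_blocks:
            self.sacked.update(range(first, last + 1))
        # Octets at or below the cumulative ACK point no longer need a
        # SACK mark.  Marking an octet as SACKed does NOT release it from
        # the retransmission buffer; only a cumulative ACK may do that.
        self.sacked = {s for s in self.sacked if s > self.high_ack}
        return len(self.sacked)
```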

151	    IsLost (SeqNum):

153	        This routine returns whether the given sequence number is
154	        considered to be lost.  The routine returns true when either
155	        DupThresh discontiguous SACKed sequences have arrived above
156	        'SeqNum' or DupThresh * SMSS bytes with sequence numbers greater
157	        than 'SeqNum' have been SACKed.  Otherwise, the routine returns
158	        false.
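
    The IsLost () test above might be sketched as shown below.  This
    is one reading of the text, not a definitive implementation: a
    ``discontiguous SACKed sequence'' is taken to be a maximal run of
    consecutive SACKed octets, and ``DupThresh * SMSS bytes'' is read
    as ``at least that many''.  The set-of-octets representation and
    the small SMSS value are assumptions chosen to keep the example
    short.

```python
DUP_THRESH = 3  # per the Definitions section
SMSS = 100      # deliberately small, for illustration only


def is_lost(sacked, seq_num, dup_thresh=DUP_THRESH, smss=SMSS):
    """Return True when `seq_num` is considered lost.

    sacked -- set of SACKed octet sequence numbers
    """
    above = sorted(s for s in sacked if s > seq_num)

    # Count maximal runs of consecutive SACKed octets above seq_num.
    runs = 0
    previous = None
    for s in above:
        if previous is None or s != previous + 1:
            runs += 1
        previous = s

    return runs >= dup_thresh or len(above) >= dup_thresh * smss
```

    With SACKed ranges [200,299] and [400,499], octet 100 is not yet
    considered lost (two runs, 200 octets); a third range [600,699]
    or a single 300-octet range above it makes IsLost (100) true.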

160	    SetPipe ():

162	        This routine traverses the sequence space from HighACK to
163	        HighData and MUST set the ``pipe'' variable to an estimate of
164	        the number of octets that are currently in transit between the
165	        TCP sender and the TCP receiver.  After initializing pipe to
166	        zero the following steps are taken for each octet 'S1' in the
167	        sequence space between HighACK and HighData that has not been
168	        SACKed:

170		(a) The pipe variable is incremented by 1 octet.

172	        (b) If S1 <= HighRxt and IsLost (S1) returns false:

174	            Pipe is incremented by 1 octet.

176	            The effect of this condition is that pipe is incremented for
177	            both the original transmission and the retransmission of the
178	            octet because neither has been determined to have left the
179	            network at this point.
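
    Steps (a) and (b) of SetPipe () can be sketched as below.  The
    function signature is hypothetical: the SACKed octets are passed
    as a set, IsLost () is passed in as a callable, and the traversal
    is assumed to run from HighACK + 1 through HighData inclusive,
    since HighACK itself has already been cumulatively ACKed.

```python
def set_pipe(sacked, high_ack, high_data, high_rxt, is_lost):
    """Estimate the number of octets currently in transit.

    sacked  -- set of SACKed octet sequence numbers
    is_lost -- callable taking a sequence number, returning True/False
    """
    pipe = 0
    for s1 in range(high_ack + 1, high_data + 1):
        if s1 in sacked:
            continue  # SACKed octets have left the network
        # (a) every octet that has not been SACKed is counted once.
        pipe += 1
        # (b) a retransmitted octet not deemed lost is counted again,
        #     once for the original and once for the retransmission.
        if s1 <= high_rxt and not is_lost(s1):
            pipe += 1
    return pipe
```

    For example, with HighACK = 9, HighData = 49, HighRxt = 19 and
    octets 30 through 39 SACKed, 30 octets are counted by step (a).
    If none of them is considered lost, the 10 retransmitted octets
    are counted again by step (b), giving pipe = 40.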

181	    NextSeg ():

183		This routine uses the scoreboard data structure maintained by
184	        the Update() function to determine what to transmit based on
185	        the SACK information that has arrived from the data receiver
186	        (and hence been marked in the scoreboard). NextSeg () MUST
187	        return the sequence number range of the next segment that is
188	        to be transmitted, per the following rules:

190	        (1) If there exists a smallest unSACKed sequence number 'S2'
191	            that meets the following three criteria for determining
192	            loss, the sequence range of one segment of up to SMSS
193	            octets starting with S2 MUST be returned.

195		    (1.a) S2 is greater than HighRxt.

197	            (1.b) S2 is less than the highest octet covered by any
198	                received SACK.

200		    (1.c) IsLost (S2) returns true.

202	        (2) If no sequence number 'S2' per rule (1) exists but there
203	            exists available unsent data and the receiver's advertised
204	            window allows, the sequence range of one segment of up to
205	            SMSS octets of previously unsent data starting with sequence
206	            number HighData+1 MUST be returned.

208	        (3) If the conditions for rules (1) and (2) fail, but there
209	            exists an unSACKed sequence number 'S3' that meets the
210	            criteria for detecting loss given in steps (1.a) and (1.b)
211	            above (specifically excluding step (1.c)) then one segment
212	            of up to SMSS octets starting with S3 MUST be returned.

214	        (4) If the conditions for each of (1), (2), and (3) are not
215	            met, then NextSeg () MUST indicate failure, and no segment
216	            is returned.
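
    Rules (1) through (4) of NextSeg () can be sketched as follows.
    The signature is hypothetical: the amount of unsent data and the
    receiver window check are passed in as plain arguments, IsLost ()
    is a callable, and failure is signalled by returning None.  The
    sketch caps a retransmitted segment at HighData; a real
    implementation would probably also stop at the next SACKed octet.

```python
def next_seg(sacked, high_ack, high_data, high_rxt, is_lost,
             unsent_octets, rwnd_allows, smss):
    """Return the inclusive (first, last) range to send next, or None."""
    top = max(sacked, default=None)  # highest octet covered by any SACK
    holes = []
    if top is not None:
        # unSACKed octets satisfying (1.a) S > HighRxt and (1.b) S < top,
        # in increasing order.
        first = max(high_ack, high_rxt) + 1
        holes = [s for s in range(first, top) if s not in sacked]

    # Rule (1): smallest hole that also satisfies (1.c) IsLost.
    for s2 in holes:
        if is_lost(s2):
            return (s2, min(s2 + smss - 1, high_data))

    # Rule (2): previously unsent data, if the receiver window allows.
    if unsent_octets > 0 and rwnd_allows:
        return (high_data + 1, high_data + min(smss, unsent_octets))

    # Rule (3): a hole meeting (1.a) and (1.b) only.
    if holes:
        s3 = holes[0]
        return (s3, min(s3 + smss - 1, high_data))

    # Rule (4): nothing to send.
    return None
```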

218	    Note: The SACK-based loss recovery algorithm outlined in this
219	    document requires more computational resources than previous TCP
220	    loss recovery strategies.  However, we believe the scoreboard data
221	    structure can be implemented in a reasonably efficient manner (both
222	    in terms of computation complexity and memory usage) in most TCP
223	    implementations.

225	5   Algorithm Details

227	    Upon the receipt of any ACK containing SACK information, the
228	    scoreboard MUST be updated via the Update () routine.

230	    Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
231	    scoreboard is to be updated as normal.  Note: The first and second
232	    duplicate ACKs can also be used to trigger the transmission of
233	    previously unsent segments using the Limited Transmit algorithm
234	    [RFC3042].

236	    When a TCP sender receives the duplicate ACK corresponding to
237	    DupThresh ACKs, the scoreboard MUST be updated with the new SACK
238	    information (via Update ()).  If no previous loss event has
239	    occurred on the connection or the cumulative acknowledgment point
240	    is beyond the last value of RecoveryPoint, a loss recovery phase
241	    SHOULD be initiated, per the fast retransmit algorithm outlined in
242	    [RFC2581].  The following steps MUST be taken:

244	    (1) RecoveryPoint = HighData

246		When the TCP sender receives a cumulative ACK for this data
247	        octet the loss recovery phase is terminated.

249	    (2) ssthresh = cwnd = (FlightSize / 2)

251		The congestion window (cwnd) and slow start threshold
252		(ssthresh) are reduced to half of FlightSize per [RFC2581].

254	    (3) Retransmit the first data segment presumed dropped -- the
255		segment starting with sequence number HighACK + 1.  To
256		prevent repeated retransmission of the same data, set
257		HighRxt to the highest sequence number in the retransmitted
258		segment.

260	    (4) Run SetPipe ()

262	        Set a ``pipe'' variable to the number of outstanding octets
263	        currently ``in the pipe''; this is the data which has been
264	        sent by the TCP sender but for which no cumulative or
265	        selective acknowledgment has been received and the data has
266	        not been determined to have been dropped in the network.
267	        This data is assumed to be still traversing the network
268	        path.

270	    (5) In order to take advantage of potential additional available
271	        cwnd, proceed to step (C) below.
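
    Steps (1) through (5) above can be sketched as a single routine.
    The SenderState container and the retransmit and set_pipe
    callables are hypothetical stand-ins for the sender's own state
    and for the routines of section 4.  The halving in step (2)
    follows the text above literally and uses integer division.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SenderState:
    high_ack: int
    high_data: int
    high_rxt: int
    cwnd: int
    ssthresh: int
    pipe: int = 0
    recovery_point: Optional[int] = None
    in_recovery: bool = False


def enter_recovery(st, flight_size, smss, retransmit, set_pipe):
    """Start a loss recovery phase on the DupThresh-th duplicate ACK."""
    # (1) Recovery ends once this octet is cumulatively ACKed.
    st.recovery_point = st.high_data
    # (2) Halve the window based on the amount of data in flight.
    st.ssthresh = st.cwnd = flight_size // 2
    # (3) Retransmit the first segment presumed dropped.
    first = st.high_ack + 1
    last = min(first + smss - 1, st.high_data)
    retransmit(first, last)
    st.high_rxt = last
    # (4) Estimate the octets still in the network.
    st.pipe = set_pipe(st)
    st.in_recovery = True
    # (5) The caller now proceeds to step (C) to use any spare cwnd.
```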

273	    Once a TCP is in the loss recovery phase the following procedure
274	    MUST be used for each arriving ACK:

276	    (A) An incoming cumulative ACK for a sequence number greater than
277	        RecoveryPoint signals the end of loss recovery and the loss
278	        recovery phase MUST be terminated.  Any information contained in
279	        the scoreboard for sequence numbers greater than the new value
280	        of HighACK SHOULD NOT be cleared when leaving the loss recovery
281	        phase.

283	    (B) Upon receipt of an ACK that does not cover RecoveryPoint the
284		following actions MUST be taken:

286	        (B.1) Use Update () to record the new SACK information conveyed
287	            by the incoming ACK.

289	        (B.2) Use SetPipe () to re-calculate the number of octets still
290	            in the network.

292	    (C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more
293	        segments as follows:

295	        (C.1) The scoreboard MUST be queried via NextSeg () for the
296	            sequence number range of the next segment to transmit (if
297	            any), and the given segment sent.

299	        (C.2) If any of the data octets sent in (C.1) are below
300	            HighData, HighRxt MUST be set to the highest sequence number
301	            of the segment retransmitted.

303	        (C.3) If any of the data octets sent in (C.1) are above
304	            HighData, HighData must be updated to reflect the
305	            transmission of previously unsent data.

307	        (C.4) The estimate of the amount of data outstanding in the
308	            network must be updated by incrementing pipe by the
309		    number of octets transmitted in (C.1).

311		(C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
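
    The transmission loop of step (C) can be sketched as below.  The
    function name and signature are hypothetical, NextSeg () is
    passed in as a callable that returns an inclusive (first, last)
    range or None, and the segments are collected in a list rather
    than actually sent.  A segment starting at or below HighData is
    treated as a retransmission.

```python
def send_while_window_allows(cwnd, pipe, smss, high_data, high_rxt,
                             next_seg):
    """Run steps (C.1) to (C.5); return the updated state.

    Returns (pipe, high_data, high_rxt, sent), where `sent` lists the
    inclusive (first, last) ranges that would have been transmitted.
    """
    sent = []
    while cwnd - pipe >= smss:                # (C) and (C.5)
        seg = next_seg(high_data, high_rxt)   # (C.1)
        if seg is None:
            break                             # NextSeg () indicated failure
        first, last = seg
        sent.append(seg)
        if first <= high_data:                # (C.2) retransmission
            high_rxt = last
        if last > high_data:                  # (C.3) previously unsent data
            high_data = last
        pipe += last - first + 1              # (C.4)
    return pipe, high_data, high_rxt, sent
```

    With cwnd = 4500, pipe = 2000 and SMSS = 1000, two segments fit:
    after the second one pipe is 4000 and cwnd - pipe drops below one
    SMSS, so the loop stops.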

313	5.1 Retransmission Timeouts

315	    In order to avoid memory deadlocks, the TCP receiver is allowed to
316	    discard data that has already been selectively acknowledged.  As a
317	    result, [RFC2018] suggests that a TCP sender SHOULD expunge the
318	    SACK information gathered from a receiver upon a retransmission
319	    timeout ``since the timeout might indicate that the data receiver
320	    has reneged.'' Additionally, a TCP sender MUST ``ignore prior SACK
321	    information in determining which data to retransmit.'' However, a
322	    SACK TCP sender SHOULD still use all SACK information made
323	    available during the slow start phase of loss recovery following
324	    an RTO.

326	    If an RTO occurs during loss recovery as specified in this document,
327	    RecoveryPoint MUST be preserved and the loss recovery algorithm
328	    outlined in this document MUST be terminated.  In addition, a new
329	    recovery phase (as described in section 5) MUST NOT be initiated
330	    until HighACK is greater than or equal to RecoveryPoint.

332	    As described in Sections 4 and 5, Update () SHOULD continue to be
333	    used appropriately upon receipt of ACKs.  This will allow the slow
334	    start recovery period to benefit from all available information
335	    provided by the receiver, despite the fact that SACK information was
336	    expunged due to the RTO.

338	    If there are segments missing from the receiver's buffer following
339	    processing of the retransmitted segment, the corresponding ACK will
340	    contain SACK information.  In this case, a TCP sender SHOULD use
341	    this SACK information when determining what data should be sent in
342	    each segment of the slow start.  The exact algorithm for this
343	    selection is not specified in this document (specifically NextSeg ()
344	    is inappropriate during slow start after an RTO).  A relatively
345	    straightforward method of ``filling in'' the sequence space
346	    reported as missing should be reasonable.
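
    The RTO handling described in this section can be sketched as
    follows.  The RecoveryState container and the function names are
    hypothetical, SACKed octets are kept in a set for brevity, and
    the choice of which data to send during the slow start is left
    out, as it is in the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecoveryState:
    high_ack: int
    recovery_point: Optional[int] = None  # None: no loss event so far
    in_recovery: bool = False
    sacked: set = field(default_factory=set)


def on_rto(st):
    """Handle a retransmission timeout."""
    st.sacked.clear()       # expunge SACK information, as [RFC2018] suggests
    st.in_recovery = False  # the algorithm of section 5 is terminated
    # st.recovery_point is deliberately left untouched (preserved).


def may_start_recovery(st):
    """A new recovery phase waits until HighACK reaches RecoveryPoint."""
    return st.recovery_point is None or st.high_ack >= st.recovery_point


def on_ack_after_rto(st, cum_ack, sack_blocks):
    """Keep recording SACK information during the slow start."""
    st.high_ack = max(st.high_ack, cum_ack)
    for first, last in sack_blocks:
        st.sacked.update(range(first, last + 1))
    st.sacked = {s for s in st.sacked if s > st.high_ack}
```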

348	6   Managing the RTO Timer

350	    The standard TCP RTO estimator is defined in [RFC2988].  Due to
351	    the fact that the SACK algorithm in this document can have an
352	    impact on the behavior of the estimator, implementers may wish
353	    to consider how the timer is managed.  [RFC2988] calls for the
354	    RTO timer to be re-armed each time an ACK arrives that advances
355	    the cumulative ACK point.  Because the algorithm presented in
356	    this document can keep the ACK clock going through a fairly
357	    significant loss event (comparatively longer than the algorithm
358	    described in [RFC2581]), on some networks the loss event could
359	    last longer than the RTO.  In this case the RTO timer would
360	    expire prematurely and a segment that need not be retransmitted
361	    would be resent.

363	    Therefore implementers MAY use either the standard [RFC2988]
364	    style RTO management or a more careful variant that re-arms the
365	    RTO timer on each retransmission that is sent during recovery.
366	    The variant provides a more conservative timer than specified in
367	    [RFC2988], and so may not always be an attractive alternative.
368	    However, in some cases it may prevent needless retransmissions,
369	    go-back-N transmission and further reduction of the congestion
370	    window.
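
    The two timer management options can be contrasted with a small
    sketch.  The RtoTimer class, its method names and the
    rearm_on_retransmit flag are hypothetical; times are plain
    integers (for example milliseconds), and the RTO value is taken
    as given rather than computed per [RFC2988].

```python
class RtoTimer:
    """Retransmission timer with an optional re-arm on retransmission."""

    def __init__(self, rto, rearm_on_retransmit=False):
        self.rto = rto
        self.rearm_on_retransmit = rearm_on_retransmit
        self.deadline = None  # None means the timer is not running

    def arm(self, now):
        self.deadline = now + self.rto

    def on_cumulative_ack_advance(self, now):
        # Standard behavior: re-arm when the cumulative ACK point moves.
        self.arm(now)

    def on_recovery_retransmit(self, now):
        # Optional variant: also re-arm on each retransmission that is
        # sent during loss recovery.
        if self.rearm_on_retransmit:
            self.arm(now)

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline
```

    With an RTO of 1000 and a retransmission sent at time 800, the
    standard timer armed at time 0 has expired by time 1200, while
    the variant does not expire until time 1800.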

372	7   Research

374	    The algorithm specified in this document is analyzed in [FF96],
375	    which shows that the above algorithm is effective in reducing
376	    transfer time over standard TCP Reno [RFC2581] when multiple
377	    segments are dropped from a window of data (especially as the number
378	    of drops increases).  [AHKO97] shows that the algorithm defined in
379	    this document can greatly improve throughput in connections
380	    traversing satellite channels.

382	8   Security Considerations

384	    The algorithm presented in this document shares security considerations
385	    with [RFC2581].  A key difference is that an algorithm based on
386	    SACKs is more robust against attackers forging duplicate ACKs to
387	    force the TCP sender to reduce cwnd.  With SACKs, TCP senders have an
388	    additional check on whether or not a particular ACK is legitimate.
389	    While not fool-proof, SACK does provide some amount of protection in
390	    this area.

392	Acknowledgments

394	    The authors wish to thank Sally Floyd for encouraging this
395	    document and commenting on early drafts.  The algorithm
396	    described in this document is loosely based on an algorithm
397	    outlined by Kevin Fall and Sally Floyd in [FF96], although the
398	    authors of this document assume responsibility for any mistakes
399	    in the above text.  Murali Bashyam, Ken Calvert, Tom Henderson,
400	    Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann,
401	    Vern Paxson, Venkat Venkatsubra and Lili Wang provided valuable
402	    feedback on earlier versions of this document.  Finally, we
403	    thank Matt Mathis and Jamshid Mahdavi for implementing the
404	    scoreboard in ns and hence guiding our thinking in keeping track
405	    of SACK state.

407	Normative References

409	    [RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793,
410	        September 1981.

412	    [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective
413	        Acknowledgment Options. RFC 2018, October 1996.

415	    [RFC2026] Scott Bradner. The Internet Standards Process -- Revision
416	        3, RFC 2026, October 1996.

418	    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
419	        Requirement Levels", BCP 14, RFC 2119, March 1997.

421	    [RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP
422	        Congestion Control, RFC 2581, April 1999.

424	Non-Normative References

426	    [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP
427	        Performance Over Satellite Links.  Proceedings of the Fifth
428	        International Conference on Telecommunications Systems,
429	        Nashville, TN, March, 1997.

431	    [All00] Mark Allman. A Web Server's View of the Transport Layer. ACM
432	        Computer Communication Review, 30(5), October 2000.

434	    [FF96] Kevin Fall and Sally Floyd.  Simulation-based Comparisons of
435	        Tahoe, Reno and SACK TCP.  Computer Communication Review, July
436	        1996.

438	    [Jac90] Van Jacobson.  Modified TCP Congestion Avoidance Algorithm.
439	        Technical Report, LBL, April 1990.

441	    [PF01] Jitendra Padhye, Sally Floyd.  Identifying the TCP Behavior
442	        of Web Servers, ACM SIGCOMM, August 2001.

444	    [RFC2582] Sally Floyd and Tom Henderson.  The NewReno Modification
445	        to TCP's Fast Recovery Algorithm, RFC 2582, April 1999.

447	    [RFC2914] Sally Floyd.  Congestion Control Principles, RFC 2914,
448	        September 2000.

450	    [RFC2988] Vern Paxson, Mark Allman.  Computing TCP's Retransmission
451	        Timer, RFC 2988, November 2000.

453	    [RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd.  Enhancing
454	        TCP's Loss Recovery Using Limited Transmit.  RFC 3042,
455	        January 2001.

457	Authors' Addresses:

459	    Ethan Blanton
460	    Ohio University Internetworking Research Lab
461	    Stocker Center
462	    Athens, OH  45701
463	    eblanton@irg.cs.ohiou.edu

465	    Mark Allman
466	    BBN Technologies/NASA Glenn Research Center
467	    Lewis Field
468	    21000 Brookpark Rd.  MS 54-5
469	    Cleveland, OH  44135
470	    Phone: 216-433-6586
471	    Fax: 216-433-8705
472	    mallman@bbn.com
473	    http://roland.grc.nasa.gov/~mallman

475	    Kevin Fall
476	    Intel Research
477	    2150 Shattuck Ave., PH Suite
478	    Berkeley, CA 94704
479	    kfall@intel-research.net