idnits 2.17.1 

draft-swami-tsvwg-tcp-dclor-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 218: '..., the TCP sender MUST set its congesti...'
     RFC 2119 keyword, line 223: '...lue of SS_THRESH MUST be left unchange...'
     RFC 2119 keyword, line 227: '.... The TCP sender SHOULD also reset all...'
     RFC 2119 keyword, line 240: '...w data, the TCP sender SHOULD send the...'
     RFC 2119 keyword, line 245: '... 5. A TCP sender MUST repeat step-2 to...'
     (10 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 297 has weird spacing: '...-flight  shoul...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (September 24, 2003) is 7520 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2026' is mentioned on line 16, but not defined

  == Unused Reference: 'RFC3517' is defined on line 379, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 385, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3522' is defined on line 389, but no explicit
     reference was found in the text

  == Unused Reference: 'LG03' is defined on line 392, but no explicit
     reference was found in the text

  == Unused Reference: 'SK03' is defined on line 396, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2988' is defined on line 401, but no explicit
     reference was found in the text

  == Unused Reference: 'BA02' is defined on line 404, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 2861 (Obsoleted by RFC 7661)

  ** Obsolete normative reference: RFC 3517 (Obsoleted by RFC 6675)

  ** Downref: Normative reference to an Experimental RFC: RFC 3522

  == Outdated reference: A later version (-06) exists of
     draft-ietf-tsvwg-tcp-eifel-response-03

  == Outdated reference: A later version (-04) exists of
     draft-sarolahti-tsvwg-tcp-frto-03

  -- Possible downref: Normative reference to a draft: ref. 'SK03' 

  ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298)

  -- Possible downref: Normative reference to a draft: ref. 'BA02' 


     Summary: 10 errors (**), 0 flaws (~~), 12 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                             Yogesh Swami
3	INTERNET DRAFT                                                  Khiem Le
4	File: draft-swami-tsvwg-tcp-dclor-02.txt           Nokia Research Center
5	                                                                  Dallas
6	                                                      September 24, 2003
7	                                               Expires:   March 24, 2004

9	          DCLOR: De-correlated Loss Recovery using SACK option
10	                         for spurious timeouts.

12	Status of this Memo

14	   This document is an Internet-Draft and is in full conformance with
15	   all provisions of Section 10 of [RFC2026].

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups.  Note that
19	   other groups may also distribute working documents as Internet-
20	   Drafts.

22	   Internet-Drafts are draft documents valid for a maximum of six months
23	   and may be updated, replaced, or obsoleted by other documents at any
24	   time.  It is inappropriate to use Internet-Drafts as reference
25	   material or to cite them other than as "work in progress."

27	   The list of current Internet-Drafts can be accessed at
28	   http://www.ietf.org/ietf/1id-abstracts.txt

30	   The list of Internet-Draft Shadow Directories can be accessed at
31	   http://www.ietf.org/shadow.html

33	Abstract

35	   A spurious timeout in TCP forces the sender to unnecessarily
36	   retransmit one complete congestion window of data into the network.
37	   In addition, TCP uses the rate of arrival of ACKs as the basic
38	   criterion for congestion control. TCP makes the assumption that the
39	   rate at which ACKs are received reflects the end-to-end state of the
40	   network in terms of congestion. However, ACKs after a spurious
41	   timeout don't reflect the end-to-end congestion state of the network;
42	   they only reflect the congestion state of a part of the network. In
43	   these cases, the slow-start behavior after a timeout can further add
44	   to network congestion. In this draft we propose changes to the TCP
45	   sender that can be used to solve the problem of both redundant-
46	   retransmission and network congestion after a spurious timeout.

48	1. Introduction

50	   The response of a TCP sender after a retransmission timeout is
51	   governed by the underlying assumption that a mid-stream timeout can
52	   occur only if there is heavy congestion--manifested as packet
53	   loss--in the network. TCP therefore assumes that a timeout is a
54	   sufficient indication to a) recover all the packets in flight, and b)
55	   to initiate a congestion response (slow start in this case) suited
56	   for heavy congestion scenarios.

58	   Even though timeout is often a sufficient indication for recovering
59	   all the packets in flight and initiating slow start, the loss
60	   recovery algorithm should be separate from the congestion control
61	   decisions. The loss recovery algorithm should only answer the
62	   question of "what" data (i.e., what sequence numbers) to send. On the
63	   other hand, the congestion control algorithm should answer the
64	   question of "how much" data to send. But after a timeout, TCP
65	   addresses the issues of loss recovery and congestion control using a
66	   single mechanism--send one packet per round trip timeout (RTO)
67	   (answers the "how much" question) until an acknowledgment is
68	   received; the single segment sent is always the first unacknowledged
69	   outstanding packet in the retransmission queue (answers the "what"
70	   question).  Since the present TCP's loss recovery and congestion
71	   control algorithms are coupled together, we call this "Correlated
72	   Loss Recovery (CLOR)."

74	   Although the assumption that a timeout can occur only if there is
75	   severe congestion is valid for traditional wire-line networks, it
76	   does not hold good for some other types of networks--networks where
77	   packets can be stalled "in the network" for a significant duration
78	   without being discarded. Typical examples of such networks are
79	   cellular networks. In cellular networks, the link layer can
80	   experience a relatively long disruption due to errors, and the link
81	   layer protocol can keep these packets-in-error buffered as long as
82	   the link layer disruption lasts.

84	   In this document we present an alternative approach to loss recovery
85	   and congestion control that "De-Correlates" Loss Recovery from
86	   congestion control and allows independent choice on using a
87	   particular TCP sequence number without compromising on the congestion
88	   control principles of [RFC2581][RFC2914][RFC2861].

90	2. Problem Description

92	   Let us assume that a TCP sender has sent N packets, p(1) ...  p(N),
93	   into the network and it's waiting for the ACK of p(1) (Figure-1). Due
94	   to bad network conditions or some other problem, these packets are
95	   excessively delayed at some intermediary node RTR-1. Unlike
96	   standard IP routers, RTR-1 keeps these packets buffered for a
97	   relatively long period of time until these packets are forwarded to
98	   their intended recipient.  This excessive delay forces the TCP sender
99	   to timeout and enter slow start.

101	   As far as the sender is concerned, a timeout is always interpreted as
102	   heavy congestion. The TCP sender therefore makes the assumption that
103	   all packets between p(1) and p(N) were lost in the network. To
104	   recover from this misconstrued loss, the TCP sender retransmits p1(1)
105	   (px(k) represents the xth retransmission of the packet with sequence
106	   number k), and waits for the ACK a(1).

108	   After some period of time when the network conditions at RTR-1
109	   improve, the queued packets are finally dispatched to their
110	   intended recipient; in response to the packet the TCP receiver
111	   generates the ACK a(1). When the TCP sender receives a(1), it's
112	   fooled into believing that a(1) was generated in response to the
113	   retransmitted packet p1(1), while in reality a(1) was generated in
114	   response to the originally transmitted packet p(1). When the sender
115	   receives a(1), it increases its congestion window to two, and
116	   retransmits p1(2) and p1(3). As the sender receives more
117	   acknowledgments, it continues with retransmissions and finally starts
118	   sending new data.

120	   The following two subsections examine the problems associated with
121	   the above-mentioned TCP behavior.

123	2.1 Redundant Data Retransmission

125	   The obvious and relatively easy-to-solve inefficiency of the above
126	   algorithm is that the entire congestion window worth of data is
127	   unnecessarily retransmitted. Although such retransmissions are
128	   harmless to high-bandwidth, well-provisioned backbone links (so long
129	   as they are infrequent), they could severely degrade the
130	   performance of slow links.

132	   In cases where bandwidth is at a premium (e.g., cellular networks),
133	   unnecessary retransmission can also be costly.

135	2.2 Congestion after Spurious Timeout

137	   To analyze network congestion after spurious timeout, we compute the
138	   worst case scenario packet loss in the system--assuming only TCP
139	   connections to be present.

141	   After the spurious timeout, the TCP sender sets its SS_THRESH to N/2.
142	   Therefore, for the first N/2 ACKs received (i.e., ACKs a(1) to
143	   a(N/2)), the TCP sender will grow its congestion window by one per
144	   ACK and reach the SS_THRESH value of N/2.  For each of these ACKs,
145	   the TCP sender sends 2 packets. Therefore, by the end of the slow
146	   start, the TCP sender would have sent 2*(N/2) = N packets into the
147	   network. For the remaining N/2 ACKs (i.e., ACKs a(N/2+1) to a(N)),
148	   the TCP sender will remain in the congestion avoidance phase and send
149	   one packet for each ACK received--sending N/2 more data segments. The
150	   net amount of data sent is therefore N + N/2 = 3N/2.

152	   Please note that the entire 3N/2 packets are injected into the
153	   network within a time period less than or equal to RTT in most cases.
154	   The number of data segments that left the network during this time is
155	   only N. Therefore, N/2 packets out of 3N/2 packets will be lost with
156	   a very high probability. These N/2 lost packets, however, need not
157	   come from the same connection, and such a data-burst will
158	   unnecessarily penalize all the competing TCP connections that share
159	   the same bottleneck router.
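
   The arithmetic above can be checked with a short sketch. It is
   illustrative only: the function name is hypothetical, N is assumed
   to be even, and the per-ACK sending rates are the ones assumed in
   the analysis above rather than measured behavior.

```python
def burst_after_spurious_timeout(n):
    """Count packets sent in response to the N stale ACKs a(1)..a(N).

    Assumptions (taken from the analysis above): N is even, SS_THRESH is
    N/2 after the timeout, the sender emits two packets per ACK in slow
    start and one packet per ACK in congestion avoidance.
    """
    ss_thresh = n // 2
    slow_start_sent = 2 * ss_thresh      # ACKs a(1) .. a(N/2)
    cong_avoid_sent = n - ss_thresh      # ACKs a(N/2+1) .. a(N)
    sent = slow_start_sent + cong_avoid_sent
    drained = n                          # segments that left the network
    return sent, sent - drained


sent, excess = burst_after_spurious_timeout(10)
print(sent, excess)  # 15 5
```

   For N = 10 the sender injects 15 packets while only 10 leave the
   network, so 5 (that is, N/2) are candidates for loss.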

161	   Going further ahead, let us assume there are M competing TCP
162	   connections, numbered C(1) ... C(M), that share the same bottleneck
163	   router(s) with the stalled connection C(0). During the period of time
164	   while C(0) is stalled, the TCP sender does not use its network
165	   resources--the buffer space--on the bottleneck router(s). The
166	   competing connections, C(1)...  C(M), however see this lack of
167	   activity as resource availability and start growing their window by
168	   at least one segment per RTT during this time period (by virtue of
169	   linear window increase during congestion avoidance phase). For
170	   simplicity reasons, we assume that each of these connections has the
171	   same round trip time of RTT, and the idle time for C(0) is k*RTT
172	   (where k > RTO/RTT). Under these assumptions, each of these competing
173	   connections will increase their congestion window by k segments.
174	   Therefore the amount of packets lost in the network due to slow start
175	   can be as high as:

177	                   N/2 + M*k       ... (4)

179	   The first term in the above equation is the packet loss due to slow
180	   start, while the second term is the loss due to window growth of
181	   competing connections (if the competing connections were in slow
182	   start the response could have been worse).
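
   As a rough illustration, equation (4) can be evaluated for sample
   values. The function name and the sample numbers are hypothetical;
   the model (every competing connection grows by exactly one segment
   per RTT for k RTTs) is the simplifying assumption stated above.

```python
def worst_case_loss(n, m, k):
    """Upper bound on lost segments from equation (4): N/2 + M*k.

    n: packets in flight on C(0) when it stalled
    m: number of competing connections
    k: idle time of C(0), in units of RTT
    """
    slow_start_loss = n / 2   # excess burst from C(0)'s slow start
    growth_loss = m * k       # one segment per RTT per competing connection
    return slow_start_loss + growth_loss


print(worst_case_loss(n=20, m=4, k=3))  # 22.0
```

   The second term grows with the stall duration k, which is why the
   response described next depends on how long the timeout lasted.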

184	   Based on the above equation, we note that the congestion state of the
185	   network depends upon the duration of spurious timeout. In our response
186	   algorithm we therefore take the time duration of spurious timeout
187	   into account to reduce the data rate by half every RTO. Please note
188	   that this scheme works well only when the number of competing
189	   connections M does not vary too much while C(0) is stalled. A more
190	   conservative response algorithm should reduce the data rate to
191	   INIT_WINDOW if M is not bounded.

193	   In the following sections we describe an algorithm that solves the
194	   problem of both redundant retransmission and packet loss after a
195	   spurious timeout.

197	3. De-correlated Loss Recovery (DCLOR)

199	   The basic idea behind DCLOR is to send a new data segment from
200	   outside the sender's retransmission queue and wait for the ACK or
201	   SACK of the new data before initiating the response algorithm. Unlike
202	   slow-start where the response algorithm starts immediately after
203	   receiving the first ACK, DCLOR waits for the ACK/SACK of the new data
204	   sent after timeout before initiating loss recovery. The SACK block
205	   for new data contains sufficient information to determine all the
206	   packets that were lost in the network. Once the sequence numbers of
207	   the lost packets are determined, the TCP sender grows its congestion
208	   window as determined by SS_THRESH and its congestion window.

210	3.1  Probe phase after a timeout

212	   The following steps describe the response of a TCP sender on a
213	   timeout:

215	     1. If the timeout occurs before the 3 way handshake is complete,
216	        the TCP sender's behavior is unchanged.

218	     2. After each timeout, the TCP sender MUST set its congestion
219	        window to:

221	                        cwnd = max( cwnd/2, INIT_WINDOW).

223	        The value of SS_THRESH MUST be left unchanged at this point. The
224	        TCP sender should also count the number of packets in flight at
225	        this time, and keep it in a state variable stale_outstanding.

227	     3. The TCP sender SHOULD also reset all the SACK tag bits in its
228	        retransmission queue if this is the first timeout.

230	     4. Instead of sending the first unacknowledged packet P1 after a
231	        timeout, the TCP sender should *disregard* its congestion window
232	        and send ONE new MSS-sized data segment (Pn+1).

234	        The TCP sender should also store the sequence number of the new
235	        segment in a new state variable called SS_PTR (for slow start
236	        pointer).

238	        If the sender does not have any new data outside its
239	        retransmission queue, or if the receiver's flow control window
240	        cannot sustain any new data, the TCP sender SHOULD send the
241	        highest sequence numbered MSS sized data chunk from its
242	        retransmission queue (i.e., it should send the last packet from
243	        its retransmission queue).

245	     5. A TCP sender MUST repeat step-2 to step-4 until it enters the
246	        Timeout-Recovery state as described in step 6.
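
   Steps 1 to 4 can be sketched as a single timeout handler. This is a
   minimal sketch and not a reference implementation: the class and
   function names are hypothetical, sequence numbers count whole
   packets rather than bytes, "packets in flight" is approximated by
   the length of the retransmission queue, and storing SS_PTR for the
   fallback segment as well is our assumption. INIT_WINDOW is left to
   the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ProbeState:
    cwnd: int                    # congestion window, in packets
    ss_thresh: int               # left unchanged during the probe phase
    init_window: int             # INIT_WINDOW, chosen by the caller
    rtx_queue: List[int]         # unacknowledged packet numbers, ascending
    next_new_seq: int            # first packet number outside rtx_queue
    sacked: Set[int] = field(default_factory=set)
    timeouts: int = 0
    stale_outstanding: int = 0
    ss_ptr: Optional[int] = None


def on_timeout(s, handshake_done=True, new_data_available=True):
    """Apply steps 1-4 once; return the packet number to send, or None."""
    if not handshake_done:
        return None                               # step 1: behavior unchanged
    s.cwnd = max(s.cwnd // 2, s.init_window)      # step 2: halve, floor at IW
    s.stale_outstanding = len(s.rtx_queue)        # assumption: all in flight
    if s.timeouts == 0:
        s.sacked.clear()                          # step 3: first timeout only
    s.timeouts += 1
    if new_data_available:
        seq = s.next_new_seq                      # step 4: probe with new data
        s.next_new_seq += 1
        s.rtx_queue.append(seq)
    else:
        seq = s.rtx_queue[-1]                     # fallback: highest queued
    s.ss_ptr = seq                                # remember the probe
    return seq


s = ProbeState(cwnd=8, ss_thresh=16, init_window=1,
               rtx_queue=list(range(1, 9)), next_new_seq=9)
seq = on_timeout(s)
print(seq, s.cwnd, s.stale_outstanding)  # 9 4 8
```

   Calling on_timeout() again models step 5: the window is halved once
   more and another probe is sent, so the sending rate drops by half
   for every RTO that elapses.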

248	3.2 Congestion Control After the probe phase

250	     6. For each ACK received with ACK-sequence number less than
251	        SS_PTR, the TCP sender SHOULD NOT grow its congestion window.
252	        If the ACK contains a new SACK block, the SACK tag SHOULD be set
253	        in the corresponding data packet, and the number of packets in
254	        flight should be updated. If a pure ACK is received, the packet
255	        should be removed from the retransmission queue and the value of
256	        packets in flight should be updated.

258	        After making the above mentioned changes, the TCP sender SHOULD
259	        send new data (i.e., data from outside the retransmission queue)
260	        if the number of packets in flight is less than the congestion
261	        window. In addition, the TCP sender should keep a variable
262	        'new_packets' which counts the number of bytes (packets if
263	        congestion window is maintained as a count of packets) sent that
264	        have a sequence number greater than or equal to SS_PTR.

266	        In addition, the TCP sender SHOULD NOT take any timer sample for
267	        the stale ACKs. (NOTE: We do not attempt to change the RTT
268	        calculation in an ad-hoc manner; we believe that this is a
269	        research problem that needs better network modeling before an
270	        appropriate timer calculation can be found)

272	     7. Step-6 continues until the TCP sender receives an ACK
273	        with a sequence number greater than SS_PTR, or a SACK block
274	        covering a sequence number greater than SS_PTR.

276	        If the sender receives a SACK block containing SS_PTR, i.e., if
277	        there is a packet loss in the stalled window, it SHOULD follow
278	        step-8.

280	        If the sender receives an ACK that acknowledges SS_PTR, i.e., if
281	        no packets were lost from the stalled window, it SHOULD go to
282	        step-10.

284	NOTE: In our previous experiments we had set the congestion window
285	   to one MSS after a spurious timeout; however, this algorithm performs
286	   better if there is moderate load on the routers and the number of
287	   competing connections does not vary a lot during the stalling
288	   period. In case of heavy load, setting the congestion window to
289	   INIT_WINDOW still performs better. We believe that using the present
290	   congestion response makes a fair compromise for different scenarios.
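
   The ACK handling of steps 6 and 7 might look roughly as follows.
   This is a sketch under simplifying assumptions, not the authors'
   implementation: all names are hypothetical, sequence numbers count
   whole packets, ack_seq is taken to be the highest cumulatively
   acknowledged packet number, SACK blocks are inclusive (first, last)
   ranges, and new_packets starts at 1 because the probe itself
   carries sequence number SS_PTR.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Sender:
    cwnd: int
    ss_ptr: int
    rtx_queue: List[int]      # unacknowledged packet numbers, ascending
    next_new_seq: int
    sacked: Set[int] = field(default_factory=set)
    new_packets: int = 1      # the probe already counts as one new packet

    def in_flight(self):
        return sum(1 for q in self.rtx_queue if q not in self.sacked)


def on_ack(s, ack_seq, sack_blocks=()):
    """Process one ACK; return (next_step, new packet numbers sent)."""
    for first, last in sack_blocks:               # set SACK tags
        s.sacked.update(q for q in s.rtx_queue if first <= q <= last)
    s.rtx_queue = [q for q in s.rtx_queue if q > ack_seq]
    s.sacked.intersection_update(s.rtx_queue)
    if ack_seq >= s.ss_ptr:
        return "step-10", []                      # nothing lost in the window
    if s.ss_ptr in s.sacked:
        return "step-8", []                       # loss in the stalled window
    # Step 6: cwnd is NOT grown and no RTT sample is taken for stale ACKs.
    sent = []
    while s.in_flight() < s.cwnd:                 # send new data only
        sent.append(s.next_new_seq)
        s.rtx_queue.append(s.next_new_seq)
        s.next_new_seq += 1
        s.new_packets += 1
    return "step-6", sent


s = Sender(cwnd=4, ss_ptr=9, rtx_queue=list(range(1, 10)), next_new_seq=10)
for ack in range(1, 9):
    on_ack(s, ack)
print(s.new_packets)     # 4
print(on_ack(s, 9)[0])   # step-10
```

   In the example the halved window (4) keeps the sender silent for
   the first five stale ACKs; only after that does each further ACK
   release one new packet.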

292	3.3 Timeout-Recovery: recovering lost packets after timeout

294	     8. The TCP sender traverses the retransmission queue and marks
295	        all the packets without any SACK tag as lost. The TCP sender
296	        also updates its packets in flight based on the SACK tags and
297	        the lost segment information (the packets-in-flight  should be
298	        ZERO after the update).

300	        Please note that unlike Fast-Retransmit and Fast-recovery, DCLOR
301	        uses only one SACK block containing SS_PTR to mark packets as
302	        lost.  This is because we do not expect packet reordering to
303	        exist over the period of RTO.

305	     9. The TCP sender should update its SS_THRESH, as:

307	                        SS_THRESH= stale_outstanding/2

309	    10. The TCP sender SHOULD set cwnd=new_packets+1. (Note that if
310	        all packets were lost, the value of 'new_packets' will be 1, and
311	        therefore the congestion window will become 2, which is the
312	        value for a timeout due to congestion.)  If packets were lost in
313	        the network (i.e., if a SACK for SS_PTR was received), the TCP
314	        sender should start by sending packets with lowest sequence
315	        number; else it should continue with new data.

317	        The sender should follow the normal window growth strategy based
318	        on the value of SS_THRESH after this step.

320	   Please note that with a pure ACK acknowledging SS_PTR, the TCP sender
321	   does not update the SS_THRESH value (it directly enters step-10 from
322	   step-7). This prevents a TCP sender from setting its SS_THRESH to a
323	   very small value if the spurious timeout occurs at the start of the
324	   connection.
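
   Steps 8 to 10 reduce to a few assignments, sketched below. The
   function and parameter names are hypothetical, sequence numbers
   count whole packets, and integer division is assumed for
   stale_outstanding/2.

```python
def timeout_recovery(rtx_queue, sacked, stale_outstanding, ss_thresh,
                     new_packets, loss_detected):
    """Return (lost, ss_thresh, cwnd) after steps 8-10.

    loss_detected is True when a SACK block containing SS_PTR arrived
    (steps 8 and 9 apply) and False when a pure ACK acknowledged SS_PTR
    (the sender jumps from step 7 straight to step 10).
    """
    lost = []
    if loss_detected:
        # Step 8: every packet without a SACK tag is marked lost.
        lost = [q for q in rtx_queue if q not in sacked]
        # Step 9: SS_THRESH is derived from the stalled window.
        ss_thresh = stale_outstanding // 2
    # Step 10: the window is sized from the new data sent since the probe.
    cwnd = new_packets + 1
    return lost, ss_thresh, cwnd


# Entire stalled window lost: only the probe (packet 9) was SACKed.
print(timeout_recovery(list(range(1, 10)), {9}, 8, 16, 1, True))
# ([1, 2, 3, 4, 5, 6, 7, 8], 4, 2)

# Spurious timeout, nothing lost: SS_THRESH is left untouched.
print(timeout_recovery([10, 11, 12], set(), 8, 16, 4, False))
# ([], 16, 5)
```

   When the lost list is non-empty, retransmission starts with its
   first entry (the lowest sequence number); otherwise the sender
   continues with new data.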

326	4. Data Delivery To Upper Layers

328	   If a TCP sender loses its entire congestion window worth of data,
329	   sending new data after timeout prevents a TCP receiver from
330	   forwarding the new data to the upper layers immediately.  However,
331	   once the SACK for this new data is received, the TCP sender will send
332	   the first lost segment. This essentially means that data delivery to
333	   the upper layers could be delayed by at most one RTT when all the
334	   packets are lost in the network.

336	   This, however, does not affect the throughput of the connection in
337	   any way. If a timeout has occurred, then the data delivery to the
338	   upper layers has already been excessively delayed.  Delaying it by
339	   another round trip is not a serious problem. Please note that
340	   reliability and timeliness are two conflicting issues and one cannot
341	   gain on one without sacrificing something else on the other.

343	5. Security Considerations

345	   The TCP SACK information is meant to be advisory, and a TCP receiver
346	   is allowed--though strongly discouraged--to discard data blocks the
347	   receiver has already SACKed [RFC2018]. Please note however that even
348	   if the TCP receiver discards the data block it received, it MUST
349	   still send the SACK block for at least the most recent data received.
350	   Therefore in spite of SACK reneging, DCLOR will work without any
351	   deadlocks.

353	   A SACK implementation is also allowed not to send a SACK block even
354	   though the TCP sender and receiver might have agreed to SACK-
355	   Permitted option at the start of the connection. In these cases,
356	   however, if the receiver sends one SACK block, it must send SACK
357	   blocks for the rest of the connection. Because of the above mentioned
358	   leniency in implementation, it is possible that a TCP receiver may
359	   agree on SACK-Permitted option, and yet not send any SACK blocks. To
360	   make DCLOR robust under these circumstances, DCLOR SHOULD NOT be
361	   invoked unless the sender has seen at least one SACK block before
362	   timeout. We, however, believe that once the SACK-Permitted option is
363	   accepted, the TCP receiver MUST send a SACK block--even though that
364	   block might finally be discarded.  Otherwise, the SACK-Permitted
365	   option is completely redundant and serves little purpose. To the best
366	   of our knowledge, almost all SACK implementations send a SACK block
367	   if they have accepted the SACK-Permitted option.

369	6. References

371	     [RFC2581] M. Allman, V. Paxson, W. Stevens. "TCP Congestion
372	               Control," Apr, 1999.

374	     [RFC2914] S. Floyd, "Congestion Control Principles," Sep 2000.

376	     [RFC2861] M. Handley, J. Padhye, S. Floyd. "TCP Congestion
377	               Window Validation," Jun 2000.

379	     [RFC3517] E. Blanton, M. Allman, K. Fall, L. Wang, "Conservative
380	               SACK-based Loss Recovery Algorithm for TCP," Apr 2003.

382	     [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP
383	               Selective Acknowledgment Options," Oct 1996.

385	     [RFC2883] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, "An
386	               Extension to the Selective Acknowledgment (SACK) Option
387	               for TCP," Jul 2000.

389	     [RFC3522] R. Ludwig, M. Meyer. "The Eifel Detection Algorithm
390	               for TCP," Apr 2003.

392	     [LG03]    R. Ludwig, A. Gurtov, "The Eifel Response Algorithm for
393	               TCP." Internet draft; work in progress, draft-ietf-tsvwg-
394	               tcp-eifel-response-03.txt, Mar 2003.

396	     [SK03]    P. Sarolahti, M. Kojo. "F-RTO: A TCP RTO Recovery
397	               Algorithm for Avoiding Unnecessary Retransmissions."
398	               Internet draft; work in progress.  draft-sarolahti-tsvwg-
399	               tcp-frto-03.txt, Jan 2003.

401	     [RFC2988] V. Paxson, M. Allman. "Computing TCP's Retransmission
402	               Timer," Nov 2000.

404	     [BA02]    E. Blanton, M. Allman, "Using TCP DSACKs and SCTP
405	               Duplicate TSNs to Detect Spurious Retransmissions,"
406	               Internet draft; work in progress, draft-blanton-dsack-
407	               use-02.txt, Oct 2002.

409	7. IPR Statement

411	   The IETF has been notified of intellectual property rights claimed in
412	   regard to some or all of the specification contained in this
413	   document. For more information consult the on-line list of claimed
414	   rights at http://www.ietf.org/ipr.

416	Authors' Addresses:

418	   Yogesh Prem Swami                       Khiem Le
419	   Nokia Research Center                   Nokia Research Center
420	   6000 Connection Drive                   6000 Connection Drive
421	   Irving TX-75063                         Irving TX-75063
422	   USA                                     USA

424	   Phone: +1 972-374-0669                  Phone: +1 972-894-4882
425	   Email: yogesh.swami@nokia.com           Email: khiem.le@nokia.com