Internet Engineering Task Force                           Yogesh Swami
INTERNET DRAFT                                                Khiem Le
File: draft-swami-tsvwg-tcp-dclor-00.txt         Nokia Research Center
                                                                Dallas
                                                         November 2002
Expires: Apr 2003

          DCLOR: De-correlated Loss Recovery using SACK option
                         for spurious timeouts

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   A spurious timeout in TCP forces the sender to unnecessarily
   retransmit one complete congestion window of data into the network.
   In addition, TCP uses the rate of arrival of ACKs as the basic
   criterion for congestion control, on the assumption that the rate
   at which ACKs are received reflects the end-to-end congestion state
   of the network.
   After a spurious timeout, however, the ACKs do not reflect the
   end-to-end congestion state of the network, but only a part of it.
   In these cases, the slow start performed after a timeout can add to
   network congestion instead of relieving it.

   In this draft we propose changes to the TCP sender (no change is
   needed at the TCP receiver) that solve both problems--redundant
   retransmission and network congestion--after a spurious timeout.
   These changes preserve the sanctity of congestion control
   principles and are conservative in nature. The proposed algorithm,
   called DCLOR, separates congestion control from loss recovery and
   uses the TCP SACK option to achieve this. DCLOR can also be used
   purely as the congestion control response after a spurious timeout,
   in combination with other spurious timeout detection mechanisms
   such as the Eifel detection scheme.

Table of Contents

   Abstract
   1. Introduction
   2. Problem Description
      2.1 Redundant Data Retransmission
      2.2 Can Slow Start add to Congestion after Spurious Timeout
   3. Sources of spurious timeout
      3.1 Spurious timeout due to Excessive delay at a router
      3.2 Spurious timeout due to change of subnets
      3.3 Spurious Timeout due to Protocol Inefficiencies
   4. De-correlated Loss Recovery (DCLOR)
      4.1 Probe phase after a timeout
      4.2 Congestion Control After the probe phase
      4.3 Timeout-Recovery: recovering lost packets after timeout
   5. Data Delivery To Upper Layers
   6. Sack Reneging
   7. DCLOR Examples
      7.1 Timeout due to congestion
      7.2 Timeout due to pure packet stalling
      7.3 Comprehensive Scenario: Timeout due to stalling and loss
   8. Security Considerations
   9. References
   10. IPR Statement
   Author's Address
   Appendix - 1

1. Introduction

   The response of a TCP sender after a retransmission timeout is
   governed by the underlying assumption that a mid-stream timeout can
   occur only if there is heavy congestion--manifested as packet
   loss--in the network. Even though loss is often caused by
   congestion, the loss recovery algorithm itself should only answer
   the question of "what" data (i.e., which sequence numbers) to send,
   while the congestion control algorithm should answer the question
   of "how much" data to send. After a timeout, however, TCP addresses
   loss recovery and congestion control with a single mechanism: send
   one segment per retransmission timeout (RTO) until an
   acknowledgment is received (this answers the "how much" question).
   The single segment sent is always the first unacknowledged
   outstanding packet in the retransmission queue (this answers the
   "what" question).
   Since the present TCP's loss recovery and congestion control
   algorithms are coupled together, we call this "Correlated Loss
   Recovery (CLOR)."

   Although the assumption that a timeout can occur only if there is
   severe congestion is valid for traditional wire-line networks, it
   does not hold for some other types of networks--networks where
   packets can be stalled "in the network" for a significant duration
   without being discarded. Typical examples of such networks are
   cellular networks. In cellular networks, the link layer can
   experience a relatively long disruption due to errors, and the link
   layer protocol keeps the affected packets buffered for as long as
   the disruption lasts.

   In this document we present an alternative approach to loss
   recovery and congestion control that "de-correlates" loss recovery
   from congestion control and allows an independent choice of which
   TCP sequence number to send, without compromising the congestion
   control principles of [1][2][3][4]. In doing so, the algorithm
   mitigates the ill effects of spurious retransmission timeouts.
   Spurious timeouts are not only detrimental to the throughput of the
   defaulting connection, but they also add to the overall network
   congestion--paradoxically enough, due to the subsequent slow start.

   Although several drafts [7][8] have been presented on this topic in
   the IETF, we believe that none of them fully considers all the
   problems associated with spurious timeouts. In the following
   section we describe these problems in more detail; Section-4 then
   describes the DCLOR mechanism.

2. Problem Description

   Let us assume that a TCP sender has sent N packets, p(1) ... p(N),
   into the network and is waiting for the ACK of p(1) (Figure-1). Due
   to bad network conditions or some other problem, these packets are
   excessively delayed at some intermediary node NDN. Unlike standard
   IP routers, the NDN keeps these packets buffered for a relatively
   long period of time until they can be forwarded to their intended
   recipient. This excessive delay forces the TCP sender to time out
   and enter slow start.

   Figure-1

   TCP-Sender            NDN         TCP-Receiver

   ..... |----p(1)------>|               |
     ^   |----p(2)------>|               |
     :   |       .       |               |
   RTT=D |       .       |               |
     :   |       .       |               |
   ..... |----p(N)------>|               |
         |   ^           |               |
         |   :           |               |
         |  RTO          |               |
         |   :           |               |
         |   V           |----p(1)------>|
     ... |----p1(1)----->|<---a(1)-------|...
      L  |               |               |
     ... |<----a(1)------|----p(2)------>| Inter-ACK arrival time (IAT)
         |->p1(2),p1(3)->|<---a(2)-------|...
         |       .       |       .       |
         |       .       |       .       |
         |       .       |       .       |
         |               |<---a(N)-------|
         |               |---p1(1)------>|
         |               |<---a(N)-------|
         |               |               |

   As far as the TCP sender is concerned, a timeout is always
   interpreted as heavy congestion. The TCP sender therefore assumes
   that all packets between p(1) and p(N) were lost in the network. To
   recover from this misconstrued loss, the TCP sender retransmits
   p1(1) (px(k) denotes the xth retransmission of the packet with
   sequence number k) and waits for the ACK a(1).

   After some period of time the network conditions at NDN become
   amicable again, and the packets queued at NDN are finally
   dispatched to their intended recipient; the TCP receiver responds
   by generating the ACK a(1).
   When the TCP sender receives a(1), it is fooled into believing that
   this ACK was generated in response to the retransmitted packet
   p1(1), while in reality a(1) was generated in response to the
   originally transmitted packet p(1). When the sender receives a(1),
   it increases its congestion window to two and retransmits p1(2) and
   p1(3). As the sender receives more acknowledgments, it continues
   with retransmissions and finally starts sending new data.

   The following two subsections examine the problems associated with
   the above-mentioned TCP behavior.

2.1 Redundant Data Retransmission

   The obvious and relatively easy-to-solve inefficiency of the above
   algorithm is that an entire congestion window worth of data is
   unnecessarily retransmitted. Although such retransmissions are
   harmless to high-bandwidth, well-provisioned backbone links (so
   long as they are infrequent), they can severely degrade the
   performance of slow links.

   In cases where bandwidth is a commodity at a premium (e.g.,
   cellular networks), unnecessary retransmission can also be costly.

2.2 Can Slow Start add to Congestion after Spurious Timeout

   Paradoxically, slow start--the epitome of congestion control--can
   itself add to network congestion after a spurious timeout. Going
   back to the previous example, when the TCP sender receives a(1), it
   falsely assumes that a(1) was triggered by p1(1). But in reality,
   a(1) was triggered by p(1), not p1(1). This ambiguity exists
   because of the cumulative acknowledgment property of TCP. In this
   case, if the TCP sender were to take RTT samples (which it does not
   take because of the ACK ambiguity), it would register a value of L
   (Figure-1) instead of the real RTT, D. In addition, when the TCP
   sender sends p1(2) and p1(3), it will receive the ACKs a(2) and
   a(3) right after the inter-ACK arrival time IAT. Thereafter, the
   TCP sender will keep increasing its data rate every IAT as new ACKs
   arrive. Mathematically, one can show that after a spurious timeout
   the rate at which data is injected into the network is given by
   (1). But if we want to preserve the network probing capability [2]
   of a TCP sender, the rate should instead be given by (2):

       r(t) = 2 * ceil((t - t0)/IAT)    ... (1)

       r(t) = 2^ceil((t - t0)/D)        ... (2)

   where t > t0, t0 is the time when the first ACK after the spurious
   timeout is received, and IAT and D are related by (3):

       IAT <= D/N                       ... (3)

   (N is the flight size at the time of the spurious timeout.)
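   To make the difference between (1) and (2) concrete, the following
   minimal sketch (not part of the specification) evaluates both
   formulas for hypothetical example values of N, D, and t0; the
   function names r_stale_acks and r_probing are ours:

      # Illustrative comparison of formulas (1) and (2) above.
      # N, D, and t0 are hypothetical example values.
      import math

      N = 20          # flight size at the time of the spurious timeout
      D = 1.0         # round trip time, seconds
      IAT = D / N     # inter-ACK arrival time, upper bound from (3)
      t0 = 0.0        # arrival time of the first ACK after the timeout

      def r_stale_acks(t):
          """Formula (1): growth clocked by compressed stale ACKs."""
          return 2 * math.ceil((t - t0) / IAT)

      def r_probing(t):
          """Formula (2): growth clocked by the real RTT, D."""
          return 2 ** math.ceil((t - t0) / D)

      for t in (0.25, 0.5, 0.75, 1.0):
          print(f"t={t:5}  (1): {r_stale_acks(t):3}  (2): {r_probing(t):3}")

   Within the first quarter of an RTT the stale-ACK clock of (1) has
   already injected five times as much data as the RTT-paced probing
   of (2), which is exactly the burst analyzed next.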
   Figure-2
                  C(1)  C(2)  ...  C(M)
                   |     |    ...   |
                   V     V    ...   V
                    \     \        /
                     \     \      /
               +------X--X--X----+      +------------------+
   Defaulting  |                 |      |                  |
   C(0) ------>|   Bottleneck    |----->| Buffered packets |--->
   connection  |     router      |      |                  |
               +-----X--X---X----+      +------------------+
                     |  |    |
                     V  V    V
                  C(1) C(2) ... C(M)

   We now compute the worst-case packet loss due to this data burst.
   After the timeout, the TCP sender sets its SS_THRESH to N/2
   (packets-in-flight/2). Therefore, for the first N/2 ACKs received
   (i.e., ACKs a(1) to a(N/2)), the TCP sender grows its congestion
   window by one per ACK and reaches the SS_THRESH value of N/2. For
   each of these ACKs the TCP sender also sends 2 packets, so by the
   end of slow start it has sent 2*(N/2) = N packets into the network.
   For the remaining N/2 ACKs (i.e., ACKs a(N/2+1) to a(N)) the TCP
   sender remains in the congestion avoidance phase and sends one
   packet per ACK--N/2 more data segments. The net amount of data sent
   is therefore N + N/2 = 3N/2. Please note that the entire 3N/2
   packets are injected into the network within a time period less
   than or equal to one RTT, while only N data segments left the
   network during this time. There is therefore a high probability
   that N/2 of the 3N/2 packets will be lost. These N/2 lost packets,
   however, need not come from the same connection, and such a data
   burst will unnecessarily penalize all the competing TCP connections
   that share the same bottleneck router.

   Going further, let us assume there are M competing TCP connections
   that share the same bottleneck router as the defaulting connection
   C(0) (Figure-2). During the period while C(0) is stalled, the TCP
   sender of C(0) does not use its network resources--the buffer
   space--on the bottleneck router(s). The competing connections,
   C(1) ... C(M), however, see this lack of activity as resource
   availability and grow their windows by at least one segment per RTT
   during this period (by virtue of the linear window increase of the
   congestion avoidance phase). If, for simplicity, we assume that
   each of these TCP connections has the same round trip time RTT, and
   the idle time for C(0) is k*RTT (where k > RTO/RTT), then each of
   the competing connections will increase its congestion window by k
   segments. Therefore the number of packets lost in the network can
   be as high as:

       N/2 + M*k                        ... (4)

   where the first term is the packet loss due to slow start and the
   second term is the loss due to the window growth of competing
   connections.

   Please note that even if a TCP sender restores its congestion
   window [7] (or halves it [8]) to avoid slow start and redundant
   retransmission, it still cannot avoid the loss of the M*k segments
   (M*k - N/2 segments in case the window was halved) that were added
   to the network because of the sender's inactivity. Since a TCP
   sender knows neither the number of connections it is competing with
   nor the duration for which packets may be stalled in the network,
   the number of segments lost due to a spurious timeout can be large.

   In the following sections we describe an algorithm that solves both
   problems--redundant retransmission and packet loss--after a
   spurious timeout.

3. Sources of spurious timeout

   Since the response algorithm after a timeout depends on the event
   that triggered it, it is important to know the factors that can
   cause such timeouts (in addition to the timeouts caused by
   congestion). The following list enumerates a few broad categories
   of events that can cause spurious timeouts in TCP.

3.1 Spurious timeout due to Excessive delay at a router

   Many link layer technologies have their own loss recovery schemes
   that use link layer feedback mechanisms to recover from bit errors.
   These loss recovery schemes are often unaware of the transport
   layer loss recovery scheme. Therefore, if there is a burst of
   errors at the link layer, the TCP sender may time out while the
   link layer loss recovery scheme is still at work.
3.2 Spurious timeout due to change of subnets

   A TCP receiver or sender may change subnets due to sender or
   receiver mobility (Figure-3). While the change of subnets is taking
   place, a router may keep the data packets buffered until a new
   route is established (i.e., until Path-3 and Path-2 are established
   in Figure-3). Since establishing a new route can take a long time,
   a TCP sender may time out.

   Figure-3
                          <.... Path of new data and ACKs ....>
                              Path-2
                       +-----------> After
   +-------------+     |       ^     Subnet Change
   |             +-----+       .
   | TCP Sender  |             . Path-3
   |             +-----+       .
   +-------------+     |       .
                       +-----------> Before
                          Path-1     Subnet Change

   Once the new path is established, the router may forward all
   buffered data packets to the TCP receiver using Path-3. However,
   the ACKs triggered by these packets will reach the TCP sender over
   Path-2, and the TCP sender will send new data in response to these
   ACKs over Path-2 as well.

   When there is a change of subnets, the entire end-to-end path
   between the sender and receiver may change completely (for example,
   Path-1 and Path-2 may not have any routers in common). A complete
   change in the end-to-end path can take place, for example, if a TCP
   receiver uses Mobile-IPv6 with route optimization to move from one
   subnet to another. Since the TCP connection does not have any
   information about the bandwidth-delay product (BDP) of the new path
   (Path-2), the TCP connection should start with the same window size
   as a new connection, i.e., with a window of two segments.

   Please note that of all possible causes of spurious timeout, a
   change of subnets has the maximum impact on network congestion.
   Since the BDP of the new path can be orders of magnitude higher or
   lower than the BDP of the old path, an inappropriate congestion
   control scheme can severely add to network congestion on the new
   path (Path-2). For example, let us assume there are 100 TCP
   connections sharing the same bottleneck router, the average number
   of subnet changes per second per TCP receiver is 1/50, and the
   average congestion window of each TCP connection at the time of the
   spurious timeout is 5 segments. Then, in the worst case, the
   network may drop as many as (1/50)*(5/2)*100 = 5 segments every
   second if the congestion window is restored as done in [7].

3.3 Spurious Timeout due to Protocol Inefficiencies

   A TCP sender may also time out because of inherent protocol
   inefficiencies in TCP. For example, a spurious timeout may occur if
   multiple packets are lost and the fast-retransmit/fast-recovery
   algorithms cannot recover the lost packets within one RTO. A
   spurious timeout may also occur if so few packets are in flight
   that, in case of a packet loss, the receiver cannot generate 3
   duplicate ACKs.

4. De-correlated Loss Recovery (DCLOR)

   De-correlated loss recovery remedies the problems described in
   Section-2 by separating congestion control from the loss recovery
   mechanism. In doing so, we make the decision of "which packets to
   send" independent of "how much" data to send. In addition, the rate
   at which data is injected into the network preserves the congestion
   control principles described in [1][2][3].
   The basic idea behind DCLOR is to send a new data segment from
   outside the sender's retransmission queue--the queue of
   unacknowledged data segments--and wait for the ACK or SACK of this
   new data before initiating the response algorithm. Unlike slow
   start, where the response algorithm starts immediately after
   receiving the first ACK, DCLOR waits for the ACK/SACK of the new
   data sent after the timeout before initiating loss recovery. The
   SACK block for the new data contains sufficient information to
   determine all the packets that were lost in the network. Once the
   sequence numbers of the lost packets are determined, the TCP sender
   grows its congestion window as in slow start, but tries to recover
   only those segments that were actually lost. This not only allows
   the TCP sender to avoid the go-back-N problem, but also prevents
   the superfluous congestion window growth seen with slow start
   (Section-2.2).

   Before describing the algorithm itself, we first enumerate the
   design philosophy underlying DCLOR, as it gives more insight into
   the workings of the algorithm:

   1. A spurious timeout should not degrade the performance of other
      TCP connections sharing the same network path. As pointed out in
      Section-2, the slow start subsequent to a spurious timeout not
      only affects the performance of the defaulting connection, but
      can also confer data loss on other competing connections.

   2. Congestion control decisions should be made independent of loss
      recovery decisions. In other words, the congestion control
      algorithm should only determine the amount of data to be
      injected into the network, while the loss recovery algorithm
      should determine the sequence numbers of the data to be
      recovered.

   3. Since the state of the network can change dramatically after a
      long interruption--long with respect to RTT--(Section-3), the
      response algorithm should be conservative in the amount of data
      injected into the network after a timeout. This avoids packet
      loss (and congestion) due to an inappropriately inflated
      congestion window after a spurious timeout (Section-2.2).

   4. The TCP sender's estimate of network capacity--SS_THRESH--should
      be updated if and only if packets are lost in the network. A
      spurious timeout without packet loss should not affect
      SS_THRESH, since SS_THRESH is a measure of the network's
      buffering capacity and there is no need to update it unless
      packet loss is detected.

   5. If there is packet loss in addition to packet stalling, all the
      lost packets within that window should be recovered without any
      need for further timeouts. In other words, the algorithm should
      not require the TCP sender to time out again due to protocol
      limitations.

   We now describe the algorithm in more detail, keeping the
   above-mentioned points in mind. Please see Figure-4 for the
   relevant parameters.

4.1 Probe phase after a timeout

   The following steps describe the response of a TCP sender to a
   timeout:

   1. If the timeout occurs before the 3-way handshake is complete,
      the TCP sender's behavior is unchanged.

   2. After a timeout, the TCP sender MUST set its congestion window
      to ZERO (CWND = 0) and store the number of outstanding packets
      (N) in a state variable for future use (step-9). The value of
      SS_THRESH MUST be left UNCHANGED at this point.
   3. The TCP sender SHOULD also reset all the SACK tag bits in its
      retransmission queue (essentially reneging all the SACK
      information, in accordance with the recommendation in [5];
      please also see Section-6 on SACK reneging for more
      information).

   4. Instead of sending the first unacknowledged packet P1 after the
      timeout, the TCP sender disregards its congestion window and
      sends ONE NEW MSS-sized data packet, Pn+1.

   Figure-4
                             .  DCLOR:
                             .  a) At timeout send Pn+1 instead of P1;
                             .     store SS_PTR = seq-num(Pn+1); set CWND=0
   <.....CWND.....>          .  b) Don't update CWND for ACK/SACK <= SS_PTR
   +--+--+-...-+--+----+     .  c) Once ACK/SACK > SS_PTR is received,
   |P1|P2|     |Pn|Pn+1|     .     update SS_THRESH if a SACK was received
   +--+--+-...-+--+----+     .  d) Update lost-packet information
                    ^        .  e) Set CWND=2 and start loss recovery, or
                    |        .     start sending new data in case a pure
                  SS_PTR     .     ACK greater than SS_PTR is received

       TCP Sender's Retransmission Queue

      The TCP sender also stores the sequence number of this new data
      packet in a new state variable called SS_PTR (slow start
      pointer).

      If the sender does not have any new data outside its
      retransmission queue, or if the receiver's flow control window
      cannot sustain any new data, the TCP sender should send the
      highest sequence-numbered MSS-sized data chunk from its
      retransmission queue (i.e., it should send the last packet in
      its retransmission queue).

   The idea behind sending one packet on a timeout is to probe the
   network and determine whether the network is congested. From the
   network's point of view it does not matter what the sequence number
   of the data sent was; what matters is the "amount" of data sent
   into the network. Sending a new packet from outside the
   retransmission queue therefore has the same effect as sending an
   old packet, and this behavior is in accordance with the TCP
   congestion control and avoidance mechanisms described in [1].

   5. A TCP sender MUST repeat step-2 until it enters the
      Timeout-Recovery state as described in step-6.

4.2 Congestion Control After the probe phase

   6. After the timeout, for each ACK received with an ACK sequence
      number less than SS_PTR, the ACKed packet is removed from the
      retransmission queue. If the ACK contains a new SACK block, the
      SACK tag is set on the corresponding data packet. However, no
      data is sent in response to these ACKs.

   7. The TCP sender MUST NOT update its congestion window from the
      value of ZERO until an ACK sequence number greater than SS_PTR,
      or a SACK block containing SS_PTR, is received. In other words,
      until the ACK for Pn+1 is received, or a SACK block containing
      information about Pn+1 is received, NO new data segments (i.e.,
      segments Pn+2, ...) are added to the network. This allows the
      TCP sender to discard all the "stale ACKs"--the ACKs generated
      for stalled packets--and prevents superfluous growth of the
      congestion window.

      The TCP sender MUST NOT take RTT samples from stale ACKs. The
      SACK information present in stale ACKs SHOULD, however, be used
      to set SACK tags on the retransmission queue. A TCP sender MAY
      reset its retransmission timer with every stale ACK received.

      If the sender receives a SACK block containing SS_PTR, i.e., if
      there was packet loss in the stalled window, it SHOULD go to
      step-8.

      If the sender receives an ACK that acknowledges SS_PTR, i.e., if
      no packets were lost from the stalled window, it SHOULD go to
      step-10.
   The rationale for not sending any data segments in response to
   stale ACKs is to minimize congestion after a timeout. Since the
   duration of packet stalling can be large, the congestion state of
   the network may have changed dramatically. Not using the stale ACKs
   to send data therefore allows the TCP sender to recompute its
   congestion state from scratch, and also minimizes packet loss.

   NOTE-1: In our experiments, we have tried sending new data for
   stale ACKs in the spirit of reverting the congestion window as
   described in [7]. We have also tried setting the congestion window
   to half the packets-in-flight (pipe) as described in [8]. But we
   have found that not sending any new data, as described above, is
   the best scheme whenever there are multiple competing connections
   in the system. The results of these experiments, along with the
   experimental setup, are listed in Appendix-1.

   If a TCP sender has not received any ACK within an RTT, sending
   more than a few data segments (>4) will add to network congestion
   in one form or another (in varying degrees), and the overall
   performance of the system as a whole degrades on average. In
   addition, the DCLOR scheme works well in the case of a subnet
   change, where there can be orders of magnitude difference in BDP.
   DCLOR also has minimal effect on other competing connections.

   Based on our experiments, we do not recommend sending new data for
   stale ACKs, but it is possible to deploy a variety of congestion
   control schemes within the DCLOR framework. For example, on
   reception of an old ACK a TCP sender MAY revert (or halve) its
   congestion window to the old value of packets-in-flight (pipe) and
   MAY send new data as that congestion window allows. But the TCP
   sender SHOULD initiate its loss recovery ONLY after SS_PTR has been
   ACKed or SACKed. In other words, if the TCP sender receives
   duplicate ACKs with an ACK sequence number less than SS_PTR, it
   SHOULD NOT use these duplicate ACKs to initiate Fast-Retransmit and
   Fast-Recovery. Loss recovery SHOULD be initiated only after the
   ACK/SACK for SS_PTR is received. After that, the sender should
   follow the same steps as described in step-8, step-9, and step-10
   (with a different congestion window, of course).

   NOTE-2: For the congestion response of DCLOR, we have also
   considered an adaptive update of the congestion window after each
   timeout. In this scheme a TCP sender reduces its congestion window
   by a factor of 2 for every RTO period of inactivity. The SS_THRESH
   is not updated until a loss is detected. At the time of this
   writing we are still experimenting with the usefulness of such a
   scheme. It is, however, not well suited for spurious timeouts
   caused by a change of subnets as described in Section-3.2.

4.3 Timeout-Recovery: recovering lost packets after timeout

   8. The TCP sender traverses the retransmission queue and marks all
      the packets without a SACK tag as lost. The TCP sender also
      updates its packets-in-flight (pipe) based on the SACK tags and
      the lost-segment information (the packets-in-flight (pipe)
      should be ZERO after the update).

   Please note that unlike Fast-Retransmit and Fast-Recovery, DCLOR
   uses only the one SACK block containing SS_PTR to mark packets as
   lost. This is because we do not expect packet reordering to persist
   over a period as long as the RTO.
   9. The TCP sender updates its SS_THRESH as follows:

        SS_THRESH = packets-in-flight (pipe) at the time of timeout / 2

      Please note that the TCP sender stored the value of
      packets-in-flight at the time of the timeout in step-2.

   10. The TCP sender sets its congestion window to 2. If packets were
       lost in the network (i.e., if a SACK for SS_PTR was received),
       the TCP sender should start by sending the lowest
       sequence-numbered lost packets; otherwise it should continue by
       sending new data.

   Please note that on a pure ACK acknowledging SS_PTR, the TCP sender
   does not update the SS_THRESH value (it enters step-10 directly
   from step-7). This prevents a TCP sender from setting its SS_THRESH
   to a very small value if the spurious timeout occurs at the start
   of the connection.

   The rationale for not updating SS_THRESH when no loss is detected
   is that SS_THRESH represents the sender's estimate of the network
   capacity--the BDP. Unless a loss is detected, there is no reason to
   update this estimate.
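   As an illustration of steps 1 through 10, the following sketch
   shows one possible structure for a DCLOR sender's post-timeout
   state machine. It is an illustrative model only, not the Linux
   implementation mentioned in Appendix-1; the names DclorSender,
   on_timeout, and on_ack are ours, and segments are identified by
   packet number rather than byte sequence number for simplicity:

      # Illustrative sketch of the DCLOR post-timeout state machine
      # (steps 1-10). Packet numbers stand in for TCP sequence numbers.

      class DclorSender:
          def __init__(self, flight):
              self.flight = list(flight)   # unacknowledged packet numbers
              self.sacked = set()          # packets carrying a SACK tag
              self.next_new = max(self.flight) + 1 if self.flight else 1

          def on_timeout(self):
              # Steps 2-4: remember the old pipe, set CWND to zero,
              # renege SACK tags, and probe with ONE new segment whose
              # packet number becomes SS_PTR.
              self.old_pipe = len(self.flight)
              self.cwnd = 0
              self.sacked.clear()
              self.ss_ptr = self.next_new
              self.next_new += 1
              return self.ss_ptr              # the probe segment to send

          def on_ack(self, ack, sack_blocks=()):
              # Step 6: every ACK trims the queue and sets SACK tags.
              self.flight = [p for p in self.flight if p > ack]
              for lo, hi in sack_blocks:
                  self.sacked.update(range(lo, hi + 1))
              if ack < self.ss_ptr and self.ss_ptr not in self.sacked:
                  return []                   # step 7: stale ACK, send nothing
              if self.ss_ptr in self.sacked:
                  # SACK for SS_PTR: loss occurred in the stalled window.
                  lost = [p for p in self.flight if p not in self.sacked]
                  self.ss_thresh = self.old_pipe // 2   # steps 8-9
                  self.cwnd = 2                         # step 10
                  return lost[:self.cwnd]     # retransmit lost data first
              # Pure ACK covering SS_PTR: no loss, SS_THRESH unchanged.
              self.cwnd = 2                             # step 10
              out = [self.next_new, self.next_new + 1]  # continue with new data
              self.next_new += 2
              return out

   Note how the congestion control decision (CWND, SS_THRESH) and the
   loss recovery decision (which packet numbers to send) are made at
   different points, which is the de-correlation DCLOR is named after.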
5. Data Delivery To Upper Layers

   If a TCP sender loses its entire congestion window worth of data to
   congestion, sending new data after the timeout prevents the TCP
   receiver from forwarding that new data to the upper layers
   immediately. Only once the SACK for this new data is received does
   the TCP sender send the first lost segment. This essentially means
   that data delivery to the upper layers can be delayed by ONE RTT
   when all the packets are lost in the network.

   This, however, does not affect the throughput of the connection in
   any way. If a timeout has occurred, then data delivery to the upper
   layers has already been excessively delayed; delaying it by another
   round trip is not a serious problem. Please note that reliability
   and timeliness are conflicting goals, and one cannot gain on one
   without sacrificing something on the other.

6. Sack Reneging

   TCP SACK information is meant to be advisory, and a TCP receiver is
   allowed--though strongly discouraged--to discard data blocks it has
   already SACKed [5]. Please note, however, that even if the TCP
   receiver discards a data block it received, it MUST still send the
   SACK block for at least the most recent data received. Therefore,
   in spite of SACK reneging, DCLOR works without deadlock.

   A SACK implementation is also allowed not to send SACK blocks even
   though the TCP sender and receiver agreed on the SACK-Permitted
   option at the start of the connection. In that case, however, if
   the receiver sends one SACK block, it must send SACK blocks for the
   rest of the connection. Because of this leniency in implementation,
   it is possible for a TCP receiver to agree to the SACK-Permitted
   option and yet never send any SACK blocks. To make DCLOR robust
   under these circumstances, DCLOR SHOULD NOT be invoked unless the
   sender has seen at least one SACK block before the timeout. We,
   however, believe that once the SACK-Permitted option is accepted,
   the TCP receiver MUST send SACK blocks--even though the SACKed data
   might eventually be discarded; otherwise the SACK-Permitted option
   is completely redundant and serves little purpose. To the best of
   our knowledge, almost all SACK implementations send SACK blocks if
   they have accepted the SACK-Permitted option.

7. DCLOR Examples

   We now demonstrate the working of DCLOR with a few concrete
   examples. We first take the case in which the timeout is caused by
   congestion. We then consider the case in which all the packets are
   stalled in the network but no packet is lost. Finally, we take the
   example in which packets are both lost and stalled at the same
   time.

   In all these examples, the TCP sender has 20 MSS-sized packets in
   flight (pipe) at the time the timeout occurs, and each packet is
   denoted by P(i). (Note that we use packet numbers instead of TCP
   sequence numbers to keep the presentation simple; the examples can
   be modified to use TCP sequence numbers as well.) An ACK is denoted
   by A(i)[x,y][z,w], where i is the highest packet number
   acknowledged and the ordered pairs [x,y] and [z,w] represent the
   SACK blocks present in that ACK (note that i < x,y,z,w, and x=y is
   possible if just one packet is being SACKed).

7.1 Timeout due to congestion

   In this case, we assume that all packets P(1) to P(20) are lost in
   the network. The following time-line shows the sequence of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)-----------X               (Highest in-order seq = 0)
    .
    .                       LOST IN NETWORK
    .                       DUE TO CONGESTION
   P(20)----------X
     ^
     :
    RTO
     :
     V
   preserve old_pkt_in_flight (pipe) = 20
   send P(21)--------------------------> rec P(21)
   set SS_PTR = 21
   rcv A(0)[21,21] <-------------------- send A(0)[21,21]

   Since the SACK contains SS_PTR ==>
     Tag P(1) to P(20) as LOST
     Tag P(21) as SACKED
     Update pkt_in_flight (pipe) = 21 - 20 - 1 = 0
     Since a SACK was received,
       set SS_THRESH to old_pkt_in_flight/2 = 10
     set CWND = 2
     Restart RTO timer

   Send P(1)--------------------------->
   Send P(2)--------------------------->

7.2 Timeout due to pure packet stalling

   In this case, we assume that all the packets P(1) to P(20) are
   stalled in the network. The following time-line shows the sequence
   of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)-----------+               (Highest in-order seq = 0)
    .             |
    .             |
   P(20)----------+
     ^            |
    RTO           |   Packets delayed
     V            |
   old_pkt_in_flight (pipe) = 20
   send P(21)----------+
   set SS_PTR = 21     |
   CWND = 0            +---------------> rec P(1)
   rcv A(1) <--------------------------- send A(1)
   SS_PTR > A(1) ==>   |
     no change in CWND = 0
     Remove P(1) from  .
     retransmission queue
     No RTT sample taken   {1 < i <= 20}
                       +---------------> rec P(i)
   rcv A(i) <--------------------------- send A(i)
   SS_PTR > A(i) ==>   |
     no change in CWND = 0
     Remove P(i) from  |
     retransmission queue
     No RTT sample taken
                       +---------------> rec P(21)
   rcv A(21) <-------------------------- send A(21)
   SS_PTR = A(21) ==>
     pure ACK received
     Remove P(21) from retransmission queue
     Update pkt_in_flight (pipe) = 0
     Since a pure ACK was received,
       SS_THRESH remains unchanged
     set CWND = 2
     Restart RTO timer
   Send P(22)--------------------------->
   Send P(23)--------------------------->
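   Assuming the illustrative DclorSender sketch from Section-4 is
   saved in a hypothetical module dclor.py, the two traces above can
   be reproduced as follows:

      from dclor import DclorSender

      # Example 7.1: every packet in flight is lost; the probe's SACK
      # reveals the loss and retransmission starts at P(1).
      s = DclorSender(range(1, 21))
      print("probe:", s.on_timeout())          # -> 21 (SS_PTR)
      print("send:", s.on_ack(0, [(21, 21)]))  # -> [1, 2]
      print("ss_thresh:", s.ss_thresh)         # -> 10

      # Example 7.2: packets were only stalled; stale ACKs send
      # nothing, and the pure ACK of SS_PTR resumes with new data.
      s = DclorSender(range(1, 21))
      s.on_timeout()
      for i in range(1, 21):
          assert s.on_ack(i) == []             # stale ACKs, CWND stays 0
      print("send:", s.on_ack(21))             # -> [22, 23], SS_THRESH unchanged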
7.3 Comprehensive Scenario: Timeout due to stalling and loss

   Building upon the previous two examples, we now assume that, in
   addition to the data stalling, P(10) was lost in the network. The
   following time-line shows the sequence of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)------------+              (Highest in-order seq = 0)
    .              |
   P(10)----X Lost |
    .              |
   P(20)-----------+
     ^             |
    RTO            |   Packets delayed
     V             |
   old_pkt_in_flight (pipe) = 20
   send P(21)-----------+
   set SS_PTR = 21      |
   CWND = 0             +---------------> rec P(1)
   rcv A(1) <---------------------------- send A(1)
   SS_PTR > A(1) ==>    |
     no change in CWND = 0
     Remove P(1) from   .
     retransmission queue
     No RTT sample taken    {1 < i < 10}
                        +---------------> rec P(i)
   rcv A(i) <---------------------------- send A(i)
   SS_PTR > A(i) ==>    |
     no change in CWND = 0
     Remove P(i) from   |
     retransmission queue
     No RTT sample taken    {10 < j <= 20}
                        +---------------> rec P(j)
   rcv A(9)[11,j] <---------------------- send A(9)[11,j]
   SS_PTR > A(9) and    |
   SS_PTR outside [11,j] ==>
     Tag P(j) as SACKED |
     NO change in CWND = 0
                        +---------------> rec P(21)
   rcv A(9)[11,21] <--------------------- send A(9)[11,21]
   Since the SACK contains SS_PTR = 21 ==>
     Tag P(10) as LOST
     Update pkt_in_flight (pipe) = 11 - 10 - 1 = 0
     (Note: P(1) ... P(9) were already removed
      from the retransmission queue)
     Since a SACK was received,
       set SS_THRESH = 20/2 = 10
     set CWND = 2
     Restart RTO timer
   Recover the lost packet first:
   Send P(10)--------------------------->
   Send P(22)--------------------------->

8. Security Considerations

   No new security threats are introduced by the use of DCLOR.

9. References

   [1] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control",
       RFC 2581.

   [2] S. Floyd, "Congestion Control Principles", RFC 2914.

   [3] M. Handley, J. Padhye, S. Floyd, "TCP Congestion Window
       Validation", RFC 2861.

   [4] E. Blanton, M. Allman, K. Fall, "A Conservative SACK-based
       Loss Recovery Algorithm for TCP",
       draft-allman-tcp-sack-10.txt, work in progress.

   [5] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
       Acknowledgment Options", RFC 2018.

   [6] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, "An Extension to
       the Selective Acknowledgement (SACK) Option for TCP", RFC 2883.

   [7] R. Ludwig, M. Meyer, "The Eifel Detection Algorithm for TCP",
       draft-ietf-tsvwg-tcp-eifel-alg-05.txt, work in progress.

   [8] P. Sarolahti, M. Kojo, "F-RTO: A TCP RTO Recovery Algorithm
       for Avoiding Unnecessary Retransmissions",
       draft-sarolahti-tsvwg-tcp-frto-01.txt, work in progress.

   [9] V. Paxson, M. Allman, "Computing TCP's Retransmission Timer",
       RFC 2988.

10. IPR Statement

   The IETF has been notified of intellectual property rights claimed
   in regard to some or all of the specification contained in this
   document. For more information, consult the on-line list of claimed
   rights at http://www.ietf.org/ipr.

Author's Address

   Yogesh Prem Swami
   Nokia Research Center
   6000 Connection Drive
   Irving, TX 75063
   USA

   Phone: +1 972-374-0669
   Email: yogesh.swami@nokia.com

   Khiem Le
   Nokia Research Center
   6000 Connection Drive
   Irving, TX 75063
   USA

   Phone: +1 972-894-4882
   Email: khiem.le@nokia.com

Appendix - 1

   In this section we describe our test setup and provide test results
   comparing DCLOR with standard TCP, Eifel, and F-RTO.

App-1: Testing Methodology

   Please see Figure-5 for the test setup.
   The TCP sender and TCP receiver are connected through an
   intermediary "spurious timeout and network impairment simulator"
   (STIS) that simulates packet delay and certain network impairments.
   The STIS reads all TCP packets received on any of its interfaces
   and can process packets coming from multiple TCP connections. Each
   TCP packet is subjected to a series of network impairments of the
   kind expected in a typical IP network. The specifics of these
   impairments, along with the models used to simulate them, are
   enumerated below.

   Figure-5

   +---------------+    +--------------------+    +---------------+
   |  TCP Sender   |    | Spurious timeout & |    |               |
   |               +----+ network impairment +----+ TCP Receiver  |
   | Linux-2.5.40  |    |     simulator      |    |               |
   +---------------+    +--------------------+    +---------------+

   1. Bottleneck Router

      The STIS is capable of simulating a bottleneck router on the
      end-to-end path of a TCP connection. For TCP, the bottleneck
      router is the one with the minimum ratio of buffer space to
      difference in link speeds on the end-to-end path (i.e., if b(i)
      is the buffer space of router i, and a(i) and d(i) are the
      incoming and outgoing link speeds of that router for a
      particular TCP connection, then the bottleneck router for that
      connection is the one with min{ b(i)/[d(i)-a(i)] } over all
      routers i on the end-to-end path).

      The bottleneck router is simulated by specifying a buffer space
      and an outgoing link speed (the arrival rate in our simulation
      was fixed by the link speed of the interface on which the packet
      arrived; this speed was a fixed value of 10 Mbps). To simulate
      the outgoing link, the STIS enqueues each TCP packet and
      schedules it based on the queuing delay and link delay (service
      time) of the packets in the queue (see item-2 for details). A
      packet received on any of the interfaces is processed only if
      the total bytes of enqueued data in the router is less than the
      specified buffer space. A packet is dropped if the buffer space
      is full. Please note that the buffer space is allocated for, and
      shared by, all TCP connections going through the STIS.

   2. Link Speed Simulation

      Although all TCP connections share the same buffer space, each
      TCP connection has its own logical queue for each direction. The
      departure time of a new packet is computed by adding the link
      delay (packet-size / outgoing-link-speed) of the new packet to
      the queuing delay of all packets ahead of it. A TCP packet is
      scheduled on the appropriate interface once its departure time
      expires.

   3. Fixed Network Delay

      The wired-network delay seen by an individual packet is the sum
      of the queuing delay and link delay of that packet at each
      router. If we assume that there are infinitely many routers on
      the end-to-end path, and that the variance of the delay is
      bounded (which is the case with finite queue sizes), then by the
      law of large numbers the delay seen by each packet will be a
      constant. Although real networks do not have infinitely many
      routers, to keep our model simple we assume that the delay added
      to each packet on the end-to-end path is a constant. This delay
      is added on top of the queuing delay incurred by the link speed
      simulation.
      (Note that the fixed network delay tries to capture the queuing
      delay on the high-speed links, while the delay incurred by the
      link speed simulation is the delay due to a slow last-hop link.)

      The fixed network delay is simulated by adding a fixed delay to
      the departure time of each packet.

   4. Prolonged packet stalling

      In addition to the delay models described above, the STIS
      simulates prolonged packet stalling in the network. Although the
      end result of packet stalling is a delay spike, we do not simply
      add a long delay burst to a few packets; instead we use the
      Markov model shown in Figure-6 (sketched in code after this
      list). This model is based on the two different kinds of delay
      spikes seen in many cellular networks.

      Every second, a random number r (< 1) is generated for each TCP
      connection (say at time T) and compared with p1 and p2. If r is
      less than p1 or p2, a state transition is made into state-2 or
      state-3 respectively. While a TCP connection is in state-2 or
      state-3, each packet arriving in the time interval [T, T+D]
      (where D = d1 or d2 depending on the state) is treated as if it
      arrived at time T+D. By altering the arrival time of each
      packet, this model simulates packet stalling in the network.

      Figure-6

                              p = 1-p1-p2
                               +------+
                                \    /
                         +-------V-------+
            p=1          | No Delay Spike|          p=1
        +--------------->|   (state-1)   |<---------------+
        |                +--+---------+--+                |
        |                   |         |                   |
        |              p=p1 |         | p=p2              |
        |                   V         V                   |
   +----+-----------------+             +-----------------+----+
   | Moderate delay spike |             |  Large delay spike   |
   |  delay=d1 (state-2)  |             |  delay=d2 (state-3)  |
   +----------------------+             +----------------------+

      Please note that in a real network, when packet stalling
      occurs--for whatever reason--all packets remain queued in the
      network until they can be forwarded again. The effect of packet
      stalling is therefore different from adding a long burst of
      delay to a few of the packets; with this model, the delay
      incurred by individual packets will not all be the same.

      Please also note that the state transitions are based on time.
      The STIS therefore keeps entering state-2 and state-3 whether or
      not there are any TCP packets to be forwarded.

   5. Packet Reordering

      Since the Internet is known to reorder 12% of packets (a study
      by Vern Paxson of UC Berkeley), we try to simulate this behavior
      as well. To simulate packet reordering we emulate the network
      behavior that causes it--that is, we emulate route failure or
      load balancing.

      A route failure is modeled using a two-state Markov model in
      which a transition from one state to the other represents a
      route failure. In each state, the TCP connection incurs a
      different fixed network delay, as described in item-3 above.
      This model is based on the assumption that when a route failure
      takes place, the delay on the new path differs from the delay on
      the old path.
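   The following sketch shows the packet-stalling model of item-4 (the
   three-state Markov chain of Figure-6). It is our illustrative
   reconstruction, not the STIS source code; the parameter values are
   taken from Table-1 below, and the function names are ours:

      import random

      # Illustrative reconstruction of the Figure-6 stalling model.
      # Parameters follow Table-1: p1/p2 are per-second transition
      # probabilities, d1/d2 are the stall durations in seconds.
      P1, D1 = 0.05, 5.0     # moderate delay spike (state-2)
      P2, D2 = 0.005, 8.0    # large delay spike (state-3)

      def stall_windows(duration_s, seed=0):
          """Return (start, end) intervals during which packets stall.

          Once per simulated second, draw r and enter state-2 or
          state-3; every packet arriving inside [T, T+D] is released
          at T+D. (Simplified: overlapping windows are allowed and the
          first matching window wins.)
          """
          rng = random.Random(seed)
          windows = []
          t = 0
          while t < duration_s:
              r = rng.random()
              if r < P1:
                  windows.append((t, t + D1))   # state-2 spike
              elif r < P1 + P2:
                  windows.append((t, t + D2))   # state-3 spike
              t += 1          # transitions are clocked in time, not packets
          return windows

      def release_time(arrival, windows):
          """A packet arriving inside a stall window leaves at its end."""
          for start, end in windows:
              if start <= arrival < end:
                  return end
          return arrival

      windows = stall_windows(600)
      print(len(windows), "stall events in 10 minutes")
      print("packet at t=100.0 released at", release_time(100.0, windows))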
App-2: Test setup, parameters, and Results

   Based on the models described above, the STIS uses the following
   parameters for testing:

   Table-1: STIS parameters

   +---------------------------------------+------------------+
   | Parameter Name                        | Parameter Value  |
   +---------------------------------------+------------------+
   | Path MTU                              | 1500 Bytes       |
   | Bottleneck Router's Capacity          | 74 KBytes        |
   | Outgoing link data rate               | 50 Kbits/sec     |
   | Fixed Network Delay (one way)         | 200 ms           |
   | Pkt Stalling delay in state-2 (d1)    | 5 sec            |
   | Pkt Stalling delay in state-3 (d2)    | 8 sec            |
   | Pkt Stalling prob to state-2 (p1)     | 0.05 per-sec     |
   | Pkt Stalling prob to state-3 (p2)     | 0.005 per-sec    |
   | Pkt reordering probability            | 0.12             |
   | Pkt reordering delay                  | 20 ms            |
   +---------------------------------------+------------------+

   On the receiving host, multiple TCP receiver processes run
   simultaneously and download files of different sizes--each
   connection competing to download a file of a given size. In
   addition, each process waits for a random time interval and then
   iterates, downloading the same file over and over. The following
   table shows the traffic mix used in our experiments.

   Table-2: Traffic Mix

   +---------------+------------------+--------------+
   | File Size     | # simultaneous   | # iterations |
   | (KB)          | connections      |              |
   +---------------+------------------+--------------+
   | 5             | 6                | 2000         |
   | 10            | 5                | 1000         |
   | 100           | 5                | 100          |
   | 1000          | 3                | 10           |
   | 10000         | 1                | 1            |
   +---------------+------------------+--------------+

   Based on the STIS parameters and traffic mix, we generated, for
   each file size, the cumulative probability distribution of download
   time, i.e., the number of downloads completing within a given time.
   (For lack of a suitable plain-text format for these plots, we
   cannot include them in this document; we can provide them on
   request.)

   Tables 3, 4, and 5 show the mean and variance (in seconds) of the
   download time for different file sizes using DCLOR, Eifel, F-RTO,
   and standard TCP. Please note that not every connection was
   subjected to a spurious timeout, and therefore the mean value alone
   is not a good measure of performance. The variance captures the
   impact of the spurious timeouts (the full probability distribution
   is the best representation of how the individual schemes perform
   over different download-time ranges). As can be seen from the
   tables, Eifel has a large variance because it restores the
   congestion window. The variance for standard TCP is lower than for
   Eifel or F-RTO because the first N packets sent by standard TCP
   after a timeout are redundant: even if they are lost to an error
   burst, the defaulting connection does not incur a severe penalty,
   and the variance stays small. For Eifel and F-RTO, the N packets
   sent after a spurious timeout are new data, so the loss of these
   packets has a large impact on the variance.
   Table-3: Download Time Statistics for 5K files

   +---------+--------+-------+--------+-------+
   | (sec)   | DCLOR  | EIFEL |STD_TCP | FRTO  |
   +---------+--------+-------+--------+-------+
   |Mean     | 2.3869 | 2.4162| 2.3962 | 2.4049|
   |Variance | 3.2473 |10.2579| 3.2164 | 4.0466|
   +---------+--------+-------+--------+-------+

   Table-4: Download Time Statistics for 10K files

   +---------+-------+--------+--------+------+
   | (sec)   | DCLOR | EIFEL  |STD_TCP | FRTO |
   +---------+-------+--------+--------+------+
   |Mean     |3.4547 | 4.0560 | 3.7314 |3.7317|
   |Variance |4.7452 |19.7828 | 7.6378 |8.4576|
   +---------+-------+--------+--------+------+

   Table-5: Download Time Statistics for 100K files

   +---------+---------+---------+---------+---------+
   | (sec)   | DCLOR   | EIFEL   | STD_TCP | FRTO    |
   +---------+---------+---------+---------+---------+
   |Mean     | 24.6297 | 26.1017 | 26.7425 | 25.5325 |
   |Variance | 66.0804 | 98.4933 | 98.9363 | 78.1667 |
   +---------+---------+---------+---------+---------+

   In addition to the mean and variance of the download times, we also
   calculated the "spectral efficiency" of the system, computed as:

              redundant_bytes_received
       SE = ----------------------------
               mean(cwnd) * MSS

   The spectral efficiency measures the usefulness of a particular
   scheme in terms of the fraction of extra bytes sent due to spurious
   timeouts for every byte of average network capacity; the smaller
   this number, the better the scheme. The numerator is the amount of
   unnecessary data, while the denominator is the average network
   capacity experienced by the connection. Since some schemes (Eifel,
   for example) require more TCP option space, the MSS was used
   instead of the MTU in the denominator. Table-6 shows these results
   for all four TCP implementations. Please note that F-RTO behaves
   like standard TCP if there is no new data to send or if the
   congestion window is as big as the flow control window; this is one
   of the reasons why F-RTO's performance is worse than Eifel's or
   DCLOR's.

   Table-6: Spectral Efficiency

   +-----------+----------+----------+----------+----------+
   | File Size | DCLOR    | EIFEL    | STD_TCP  | FRTO     |
   |           |(MSS=1460)| (1448)   | (1460)   | (1460)   |
   +-----------+----------+----------+----------+----------+
   | 5K        | 0.004042 | 0.004454 | 0.092714 | 0.071336 |
   | 10k       | 0.005249 | 0.012293 | 0.078977 | 0.052581 |
   | 100k      | 0.017124 | 0.036748 | 0.624361 | 0.079285 |
   +-----------+----------+----------+----------+----------+

   The DCLOR algorithm for all these test cases was implemented in
   Linux kernel 2.5.40 (the source code for DCLOR can be provided on
   request). The same kernel version was used for all the other TCP
   enhancements as well.
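   To make the SE computation concrete, here is a minimal sketch with
   hypothetical measurement values (the raw byte counts behind Table-6
   are not published in this document):

      # Hypothetical example of the SE computation defined above.
      MSS = 1460                # bytes, as in Table-6 for DCLOR
      mean_cwnd = 8.0           # assumed average congestion window (segments)
      redundant_bytes = 47      # assumed unnecessarily retransmitted bytes

      se = redundant_bytes / (mean_cwnd * MSS)
      print(f"SE = {se:.6f}")   # extra bytes per byte of average capacity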