idnits 2.17.1 

draft-hughes-restart-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-24) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([AHO98], [BSSK97],
     [VH97,PK98], [JK92], [JB88], [BPK97], [VH97], [NS97], [Hei97], [APS99],
     [PN98], [Tou97], [FAP97], [HOT97], [Poo97], [FGMFB97]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The "Author's Address" (or "Authors' Addresses") section title is
     misspelled.

  == Line 593 has weird spacing: '...    int  ack_r...'

  == Line 594 has weird spacing: '...    int  bucke...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'AHO98'

  ** Obsolete normative reference: RFC 2581 (ref. 'APS99') (Obsoleted by RFC
     5681)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'BPK97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'BSSK97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'FGMFB97'

  ** Obsolete normative reference: RFC 2414 (ref. 'FAP97') (Obsoleted by RFC
     3390)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Hei97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HOT97'

  ** Obsolete normative reference: RFC 1072 (ref. 'JB88') (Obsoleted by RFC
     1323, RFC 2018, RFC 6247)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Jac88'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'JK92'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NS97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PK98'

  ** Downref: Normative reference to an Informational RFC: RFC 2415 (ref.
     'PN98')

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Poo97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Tou97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VH97'


     Summary: 11 errors (**), 0 flaws (~~), 4 warnings (==), 16 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	INTERNET-DRAFT                  Amy S. Hughes, Joe Touch, John Heidemann
3	draft-hughes-restart-00.txt                                          ISI
4	                                                            Dec. 1, 2001
5	                                                   Expires: Jun. 1, 2002

7	              Issues in TCP Slow-Start Restart After Idle

9	Status of this Memo

11	   This document is an Internet-Draft and is NOT offered in accordance
12	   with Section 10 of RFC2026, and the author does not provide the IETF
13	   with any rights other than to publish as an Internet-Draft.
14	   Internet-Drafts are working documents of the Internet Engineering
15	   Task Force (IETF), its areas, and its working groups.  Note that
16	   other groups may also distribute working documents as Internet-
17	   Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time.  It is inappropriate to use Internet-Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/1id-abstracts.html

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html

30	   The distribution of this document is unlimited.

32	Abstract

34	   This draft discusses variations in the TCP 'slow-start restart' (SSR)
35	   algorithm, and the unintended failure of some variations to properly
36	   restart in some environments. SSR is intended to avoid line-rate
37	   bursts after idle periods, where TCP accumulates permission to send
38	   in the form of ACKs, but does not consume that permission
39	   immediately. SSR's original "restart after send is idle" is commonly
40	   implemented as "restart after receive is idle". The latter
41	   unintentionally fails to restart for bidirectional connections where
42	   the sender's burst is triggered by a reverse-path data packet, such
43	   as in persistent HTTP. Both the former and latter are shown to permit
44	   bursts in other circumstances. Several solutions are discussed, and
45	   their implementations evaluated.

47	   This document updates draft-ietf-tcpimpl-restart-01.txt.  It is a
48	   product of the LSAM, X-Bone, and DynaBone projects at ISI.  Comments
49	   are solicited and should be addressed to the authors.

51	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

53	Introduction

55	   Slow-Start Restart (SSR) describes one TCP behavior to respond to
56	   long sending pauses in an open connection.  When a sender becomes
57	   idle, the normal ACK-clocking mechanism which regulates traffic is no
58	   longer present and the sender may introduce a burst of packets into
59	   the network as large as the current congestion window (CWND).  Such a
60	   burst may be too large for the intermediate routers to handle and may
61	   be too large for the receiver to handle at one time as well.

63	   A send timer was first proposed [JK92] to detect idle sending
64	   periods; the recommended response is to close the congestion window
65	   and perform a new slow-start.  However, a footnote to this first
66	   proposed solution noted that send/receive symmetry on the channel
67	   meant that a receive timer could be used instead to achieve the same
68	   results.  As this second solution takes advantage of a timer that is
69	   already required (to detect packet loss) it was implemented by
70	   Jacobson and Karels.  This solution has been repeated in
71	   implementations which derive from their work.

73	   Bursty connections, such as the persistent connections required in
74	   HTTP/1.1 [FGMFB97] have been found to interact in meaningful ways
75	   with SSR [Hei97].  In fact, it was discovered that SSR never occurs
76	   with HTTP/1.1 [Poo97].  This is because a new request will reset the
77	   receive timer (as suggested in the footnote in [JK92]) and the
78	   sending pause will not be detected [Tou97].

80	   Further, both timer solutions depend on the retransmit timeout (RTO)
81	   and cannot detect send pauses that are shorter than this duration.
82	   In such cases, the sender may transmit a burst as large as the full
83	   congestion window.

85	Burst Detection.

87	   There are several ways of determining whether a connection is at risk
88	   of sending a burst of packets into the channel.  We will discuss each
89	   method below, from the least radical to the most radical.

91	 Receive Timer:

93	   The use of a receive timer is the most common burst detection method.
94	   It is attractive because it is simple and makes use of an existing
95	   timer.  However, a receive timer does not properly detect bursts in
96	   HTTP/1.1 because the timer is cancelled when the request packet is
97	   received.  Further, when the connection is idle for less than a full
98	   RTO, a burst cannot be detected.  Such a burst can happen when the
99	   connection is "nearly idle" or when ACKs are lost or reordered.

101	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

103	 Send Timer:

105	   A send timer is the reciprocal solution to using a receive timer.
106	   While it requires a new timestamp field to be maintained, it clearly
107	   detects send pauses and corrects the problem presented by HTTP/1.1.
108	   However, as with the receive timer, it cannot detect bursts that
109	   could happen before a full RTO.

111	 Packet Counting:

113	   An alternative method examines the unused portion of the congestion
114	   window to determine if the capacity to burst exists.  This method is
115	   simple, it uses existing information to make its decision, and it
116	   solves both the HTTP/1.1. problem as well as the RTO problem.  In
117	   addition, it addresses the problem that needs to be solved (bursts)
118	   instead of a specific circumstance where the problem could happen
119	   (send pauses).  However, where timer detection avoids defining a
120	   burst (it defines idle periods instead), here a burst must be defined
121	   before it can be detected.  One possible definition is the situation
122	   where the available portion of the sending window is some proportion
123	   of the entire congestion window, say 50%.  Another definition places
124	   a numerical limit on the available portion of the congestion window,
125	   say 4 or CWND-1 packets.

127	Burst Response

129	   Once a burst is detected, there are several different ways to take
130	   action.  The different possibilities are listed below, again from
131	   least to most radical.

133	 Full Restart:

135	   Reducing the congestion window to one packet and re-entering slow-
136	   start, the original slow-start restart, is one response.  This was
137	   the solution proposed by Jacobson and Karels.  This is a very
138	   conservative response and it defeats most of the speedup that
139	   HTTP/1.1 provides [HOT97].  Current proposals [FAP97] have suggested
140	   increasing the initial window from 1 packet to 4 packets.  Further,
141	   depending on the method of burst detection, Full Restart can be far
142	   more punitive than it should be.  Coupled with a timer, full restart
143	   is most likely to respond to a completely empty congestion window.
144	   Coupled with Packet Counting, the response could close the window too
145	   far, even smaller than the amount of outstanding data.

147	 Window Limiting:

149	   This is a modified version of Full Restart which solves the problem
150	   created by using Packet Counting to detect bursts.  With this type of

152	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

154	   response, the congestion window is reduced to the amount of
155	   outstanding data plus the slow-start initial window (1, 2, or 4).  It
156	   works exactly like Full Restart in the idle case, but is successful
157	   at controlling bursts in an active connection.  Further, in an active
158	   connection, it effectively implements a leaky bucket of the initial
159	   window size for the accumulation of send opportunity based on the
160	   receipt of ACKs.  This solution is fairly conservative, especially as
161	   it defaults to Full Restart, but more importantly, sending
162	   opportunity is simply lost if not used, and is not available for
163	   paced output.  Also, it forces negative congestion feedback on the
164	   congestion window.

166	 Burst Size Limitation:

168	   When a burst is detected, its effects are limited, the sender may not
169	   send any more than a preset number of packets into the network.  This
170	   is less conservative than the first two responses in that it does not
171	   affect the size of the congestion window, and it is simple to
172	   implement, simply count up the number of packets you can send and
173	   stop when you reach the limit. The burst count is reset according to
174	   some policy, such as when ACKs are received, each time send is
175	   called, or when some timer triggers. The behavior is determined by
176	   how these parameters are combined to reset the count.

178	 Pacing:

180	   When a burst is detected, packets are periodically sent into the
181	   network until the sender starts receiving acks and normal maintenance
182	   can be resumed [VH97,PK98].  This solution is very easy on the
183	   network and scales well in cases of high bw/delay.  However, it
184	   requires a new timer and additional research for parameter tuning.

186	Implemented Solutions

188	   Now we will examine combinations of the different detection and
189	   response methods presented above.  Each of the solutions below have
190	   been implemented in some form.

192	 BSD Implementation (Jacobson and Karels)

194	   The most common implementation uses a receive timer coupled with Full
195	   Restart.  This is the implementation that causes the interaction
196	   problems with HTTP/1.1.  The obvious alternative is to implement a
197	   send timer as originally intended and use Full Restart.  There are
198	   several drawbacks to this solution.  First, a send timer adds
199	   additional state and serves no purpose other than to correct the
200	   bursting behavior after send pauses.  Second, forcing a slow-start in
201	   this situation is problematic for HTTP/1.1.  A slow-start for each

203	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

205	   new user request adds a delay burden to characteristically small HTTP
206	   responses.  Further, the HTTP user request pattern is unpredictable.
207	   It is possible for the user to make a new request before the send
208	   timer expires, triggering a burst that would defeat such a timer.

210	 Maximum Burst Limitation (Floyd)

212	   Floyd has proposed a coupling of Packet Counting with Burst Size
213	   Limitation.  It was developed to avoid bursts caused by gaps in the
214	   ACK stream. Such gaps cause the send window to slide forward in
215	   jumps, rather than smoothly by 1-2 MTUs; subsequent send calls can
216	   thus generate bursts. This solution has been implemented in "ns" and
217	   it prevents the sender from transmitting a series of back-to-back
218	   packets larger than the user configured burst limit (suggested to be
219	   5 packets) [NS97].

221	   Maxburst defines a new function, send_much, with a parameter,
222	   maxburst, which is called every time an external event occurs, such
223	   as ACK is received or a timeout occurs. Each time send_much is
224	   called, at most maxburst packets are emitted. This forces interleaved
225	   processing of ACKs and sending of data. Sends initiated by the
226	   arrival of data in the buffer are not affected; as a result, if there
227	   is less than RTO of delay, and the send buffer is filled by the
228	   application, Maxburst can still send a burst the size of an entire
229	   window.

231	   Maxburst does not explicitly address the sending of bursts after an
232	   idle period; it relies on the existing mechanism to collapse CWND
233	   after an RTO, because send_much does not replace send as the
234	   application interface to transmitting data.

236	   The limit of 5 packets is designed to allow non-sequental ACK losses,
237	   and allow the CWND to grow normally under slow-start. Here is a list
238	   of the possible burst limits, and the need for each:

240	       1 packet per ACK or nothing gets done

242	       2 packets per ACK if ACKs are compressed

244	       3 packets during window growth; 2 for the compressed ACK, 1 for
245	       the growth

247	       4 packets per ACK received if ACKs are lost (not back-to-back),
248	       and ACKs are compressed

250	       5 packets per ACK received if ACKs are lost (not back-to-back),
251	       ACKs are compressed, and the window is growing

253	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

255	 Use it or lose it (UI/LI) (Hughes, Touch, and Heidemann)

257	   One proposed solution combines Packet Counting with Window Limiting,
258	   and the IETF community refers to it as "Use-It-or-Lose-It" (it was
259	   originally called Congestion Window Monitoring, CWM, but we will use
260	   the acronym UI/LI, pronounced 'wee-lee', here).  Whenever (CWND -
261	   outstanding data > 4), we reduce CWND to (outstanding data + 4).  The
262	   choice of 4 packets is discussed in with the implementation details
263	   below.  UI/LI allows the congestion window to grow normally but
264	   shrinks the congestion window as the sender becomes idle.  It also
265	   prevents the sender from transmitting any bursts larger than 4
266	   packets in response to a new request.  Because UI/LI is not dependent
267	   on any timers, the loss of an ACK or a nearly idle connection cannot
268	   cause any bursts.  UI/LI is similar to Maxburst, but avoids the burst
269	   by reducing CWND, rather than by inhibiting the sends directly.  As a
270	   result, we avoid the potential problem of sequential calls to
271	   "tcp_output", which would cause bursts in Maxburst.  UI/LI also
272	   causes TCP to use the feedback of 'not using the CWND fast enough',
273	   which results in a decrease in the CWND.

275	   UI/LI effectively imposes a leaky bucket type limitation on the
276	   congestion window.  The window is allowed to grow and be managed
277	   normally but the sender is not allowed to save up any sending
278	   opportunities.  Any opportunity that is not used is lost.  This
279	   property of UI/LI forces interleaved reception of ACKs and processing
280	   of sends.

282	 Burst-or-lose (Touch, Floyd)

284	   We have considered a modified version of Maxburst, in which all
285	   sends, including those initiated by data arrival from the
286	   application, limit their burst size each time they are called, and
287	   reset their flag at that time. We call this burst-or-lose (BOL). The
288	   result is very similar to UI/LI, but the bucket reflects only the
289	   permission to send created as the result of an ACK receipt, rather
290	   than the overall limit of the congestion window.

292	   In this case, it is unncessary to collapse CWND after an RTO. The
293	   primary reason for the collapse is to avoid bursts; BOL avoids bursts
294	   completely, but recovers as quickly as the ACK clocking can be
295	   restored.

297	   In the case when there is no data to send, multiple ACK arrivals do
298	   not increase the bucket available for sending. This avoids the bursts
299	   after idle periods. In the case when ACKs are lost, the remaining ACK
300	   generates permission to send only a fixed amount of data. The
301	   application cannot defeat this bucket by calling send multiple times.

303	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

305	 Rate Based Pacing (Visweswaraiah and Heidemann)

307	   Rate Based Pacing (RBP) combines the Pacing response with either a
308	   Send Timer or Packet Counting.  It avoids slow-start when resuming
309	   after sending pauses and allows the normal clocking of packets to be
310	   gracefully restarted.  When a burst potential is detected, the
311	   algorithm meters a small burst of packets into the channel [VH97].
312	   RBP is the least conservative solution to the bursting problem
313	   because it continues to make use of the pre-pause congestion window.
314	   If network conditions have changed significantly, maintaining the
315	   previous window could cause the paced connection to be overly
316	   aggressive as compared to other connections.  (Although some work
317	   suggests congestion windows are stable over multi-minute timeframes
318	   [BSSK97].)  More recently pacing has been suggested for use in
319	   wireless networking scenarios [BPK97], and for satellite connections.

321	Solution Comparison

323	   Below we present a comparison of the solutions presented above, plus
324	   the original Send Timer solution.  The Send Timer solution is not in
325	   use, but we implemented it in FreeBSD for comparison purposes.

327	   We contrast each solution on four criteria.  A "yes" in the "HTTP
328	   Burst" column indicates that this solution will solve the bursting
329	   problem created by HTTP/1.1.  A "yes" in the "Other Burst" column
330	   indicates that the solution will prevent bursts in other situations
331	   as well.  A "yes" in the "Adj CWND" column indicates that the
332	   solution involves adjusting the value of CWND.  The final column,
333	   "Code Size", gives the number of lines of code needed to correctly
334	   implement the solution.

336	                 |
337	                 | HTTP Burst   Other Burst   Adj CWND   Code Size
338	      -----------+-------------------------------------------------
339	      Send Timer |    yes           no          yes      ~5 lines
340	      -----------+-------------------------------------------------
341	       Rcv Timer |     no           no          yes       3 lines
342	      -----------+-------------------------------------------------
343	        Maxburst |    yes           yes          no      ~5 lines
344	      -----------+-------------------------------------------------
345	           UI/LI |    yes           yes         yes       3 lines
346	      -----------+-------------------------------------------------
347	             BOL |    yes           yes          no      ~5 lines
348	      -----------+-------------------------------------------------
349	             RBP |    yes           yes          no      ~30 lines

351	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

353	There are other conditions under which to compare these algorithms,
354	including behavior after a single ACK loss, behavior after multiple ACK
355	losses (sequential and non-sequential), and the window regrowth proper-
356	ties. These are outside the scope of this document.

358	                 |
359	                 |            Recovery rate
360	      -----------+-------------------------------------------------
361	      Send Timer |             SS then CA
362	      -----------+-------------------------------------------------
363	       Rcv Timer |             SS then CA
364	      -----------+-------------------------------------------------
365	        Maxburst |          (not applicable)
366	      -----------+-------------------------------------------------
367	           UI/LI | SS and or CA, depending on SSTHRESH
368	      -----------+-------------------------------------------------
369	             BOL |          (not applicable)
370	      -----------+-------------------------------------------------
371	             RBP |                1 RTT

373	The only significant difference between UI/LI and BOL is whether CWND is
374	affected, and thus the recovery period after the loss of an ACK. UI/LI
375	interprets ACK loss as an event warranting traditional TCP window recov-
376	ery, whereas BOL and Maxburst do not. Both UI/LI and BOL avoid the need
377	for idle detection for the purpose of CWND adjustment, and implement a
378	type of leaky bucket. Although UI/LI was intended as a type of leaky-
379	bucket system, BOL is much more directly analogous.

381	Experimental Comparisons

383	Packet traces of the current FreeBSD implementation of SSR (using the
384	receive timer), of a modified version of FreeBSD using a send timer, and
385	of UI/LI with HTTP/1.1 support the above observations.  The graphs below
386	show traces of two HTTP/1.1 requests.  The first request opens the CWND.
387	How the server responds to the packet containing the second request
388	reveals the differences between implementations.

390	The first graph shows the trace with FreeBSD v2.2.6, but the relevant
391	portions of the code have not been altered in more current releases.  In
392	response to the first request, the server performs slow-start normally.
393	After 10 seconds, and after the retransmission timeout (RTO), the second
394	request arrives.  At this point, receipt of the second request resets
395	the receive timer.  The server, under the belief that communication has
396	not gone idle, sends a CWND-sized burst of packets into the network.

398	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

400	                       Idle trace with Receive Timer
401	sequence number (in bytes)
402	          +------+-------+------+-------+------+-------+------+D-A---+
403	          +      +       +      +       +      +       +    DDAA     |
404	   100000 ++   Client_ACK  A                               DDAA     ++
405	          |    Client_SEG  C                            D  A A       |
406	          |    Server_SEG  D                            D A          |
407	    80000 ++                                            D A         ++
408	          |                                        D  A              |
409	          |                                        D                 |
410	    60000 ++                         D AA          D                ++
411	          |                       D A                                |
412	          |                       D A                                |
413	          |                   D ADDAA                                |
414	    40000 ++                  D A                                   ++
415	          |                   D A                                    |
416	          |               D AAD A                                    |
417	    20000 ++              D A                                       ++
418	          |           D A A                                          |
419	          +      +D AAD A+      +       +      +       +      +      |
420	        0 ++A-D-A+-------+------+-------+------+-C-----+------+-----++
421	          0      2       4      6       8     10      12     14
422	                       time since first SYN (in seconds)

424	The second graph shows the trace with FreeBSD v2.2.6 altered to use a
425	send timer to detect idle communication.  As expected, the server
426	responds to the first request with normal slow-start.  Now, the receipt
427	of the second request comes in after the RTO and the idle period is cor-
428	rectly detected.  In response, the server re-enters slow-start for the
429	second response.

431	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

433	                         Idle trace with Send Timer
434	sequence number (in bytes)
435	          +-------------+-------------+------------+-------------+DAAA
436	          +             +             +            +           D AAAA|
437	   100000 ++   Client_ACK  A                                   DADA ++
438	          |    Client_SEG  C                                   DA A  |
439	          |    Server_SEG  D                                D AAD    |
440	    80000 ++                                                DAAA    ++
441	          |                                              D AAD       |
442	          |                                           D AAD          |
443	    60000 ++                      D AA          D AA AA             ++
444	          |                     D AA                                 |
445	          |                     DAAA                                 |
446	          |                  D AAAA                                  |
447	    40000 ++                 D AA                                   ++
448	          |                  DADA                                    |
449	          |               D AAAA                                     |
450	    20000 ++              DADA                                      ++
451	          |            DDAAD                                         |
452	          +         DAAAA A           +            +             +   |
453	        0 ++----AADAA---+-------------+--------CC--+-------------+--++
454	          0             5            10           15            20
455	                       time since first SYN (in seconds)

457	The third graph shows the trace with FreeBSD altered to use UI/LI to
458	limit bursts.  Again, the first request and response are normal.  This
459	shows that UI/LI allows the congestion window to grow normally.  With
460	the second request we see the main difference of UI/LI.  Again, the sec-
461	ond request comes after the RTO, and here we see the main difference
462	with UI/LI.  In response to the second request, the server does not send
463	a burst the size of CWND or enter slow-start.  Instead, it sends several
464	small 4-packet bursts, with the CWND growing normally from this point.

466	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

468	                           Idle trace with UI/LI
469	sequence number (in bytes)
470	          +-------+------+-------+------+-------+------+------DD-A---+
471	          +       +      +       +      +       +      +      D+A    |
472	   100000 ++   Client_ACK  A                               D  A A   ++
473	          |    Client_SEG  C                              DDAA       |
474	          |    Server_SEG  D                              D A        |
475	    80000 ++                                          DD AD         ++
476	          |                                       D  AD A            |
477	          |                                   D  AD AA               |
478	    60000 ++                            D A   D                     ++
479	          |                           DDA A                          |
480	          |                           D A                            |
481	          |                        D AD A                            |
482	    40000 ++                      DDAA                              ++
483	          |                       D A                                |
484	          |                   DDAAA                                  |
485	    20000 ++              D  AD A                                   ++
486	          |               D A                                        |
487	          +       D A D AAA      +      +       +      +       +     |
488	        0 ++A-D-A-D------+-------+------+---C---+------+-------+----++
489	          0       2      4       6      8      10     12      14
490	                       time since first SYN (in seconds)

492	Note, RTO is the usual timer limit, but any value would have the same
493	results, depending on when the second request was presented in relation
494	to the timer.  If the second request arrives after the timer expires,
495	the reponse will behave as presented in these graphs.  If the second
496	request arrives before the timer expires and after all outstanding pack-
497	ets have been acknowledged, systems with a send or receive timer will
498	send a CWND-sized burst. If the second request arrives before the last
499	ACK but after the last packet sent, the server has the potential to send
500	a burst no larger than CWND.  UI/LI will limit any burst to 4 packets
501	regardless of the timing of the second request.

503	Implementation of UI/LI

505	UI/LI requires a simple modification to existing TCP output routines.
506	The changes required replace the current idle detection code.  Below is
507	the suggested pseudocode from Appendix C of [JK92]:

509	          int idle = (snd_max == snd_una);
510	          if (idle && now - lastrcv >= rto)
511	               cwnd = 1;

513	   UI/LI would replace that code as below:

515	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

517	          int idle = (snd_max == snd_una);
518	          int maxwin = 4 + snd_nxt - snd_una;
519	          if (cwnd > maxwin)
520	               cwnd = maxwin;

522	   Packet counting is implemented by line 2.  Lines 3 and 4 implement
523	   Window Limitation.  Note that the comparison in line 1 (defining
524	   "idle" [JK92]) is required for later Silly Window Syndrome avoidance
525	   and could now be moved there.

527	   We have implemented UI/LI in FreeBSD.  The affected code is in the
528	   file /sys/netinet/tcp_output.c.  The lines:

530	          idle = (tp->snd_max == tp->snd_una);
531	          if (idle && tp->t_idle >= tp->t_rxtcur)
532	               tp->snd_cwnd = tp->t_maxseg;

534	   Are replaced with:

536	          idle = (tp->snd_max == tp->snd_una);
537	          maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una;
538	          if (tp->snd_cwnd > maxwin)
539	               tp->snd_cwnd = maxwin;

541	   The choice of limiting the available congestion window to 4 packets
542	   is based on the normal operation of TCP.  An ACK received by the
543	   sender may be in response to the receipt of 2 packets, allowing
544	   another 2 to be sent.  Further, normal window growth may require the
545	   sending of a third packet.  Lastly, in slow-start with delayed ACKs,
546	   the receipt of an ACK can trigger the sending of 4 packets.  Thus, 4
547	   packets is a reasonable burst to send into the network.

549	   Increasing the initial window in slow-start to 4 packets has already
550	   been proposed [FAP97].  The effects of this change have been explored
551	   in simulation in [PN98] and in practice in [AHO98].  Such a modifica-
552	   tion to TCP would cause the same behavior as our solution in the
553	   cases where the pause timer has expired.  It does not address the
554	   pre-timeout bursting situation we are concerned with.

556	Implementation of BOL

558	   BOL can be implemented with a small amount of additional state, and a
559	   small number of lines in TCP's input, output, and timer routines. The
560	   state indicates the number of packets that can be emitted before the
561	   next timer or ACK receipt. The other lines increment the state when a
562	   notable event occurs, and decrement it when packets are sent.

564	   Our implementation uses two state variables: bucket - the number of

566	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

568	   packets that can be immediately sent, and ack_ratio - the number of
569	   packets per ACK (upper bound). ack_ratio is implied by RFC-2581, as
570	   at least one ACK "SHOULD" be sent for every 2 MSSs received, i.e.,
571	   ACK compression [APS99]. This is our default; we parameterize it so
572	   that when others modify TCP to use less frequent ACKs, our portion of
573	   the code will operate correctly. It should be noted that less fre-
574	   quent ACKs will, by definition, increase the burstiness of the source
575	   as a result. We also assume 'initial_win' is defined, currently 2
576	   MSS, but may be configurable as per Experimental RFC-2414 [FAP97].

578	   There are two events that set the value of bucket. When an ACK is
579	   received, we set bucket to ack_ratio * 2 + 1. This is for the same
580	   reasons as Maxburst uses '5' - to allow for TCP to run at full rate
581	   in the presence of isolated ACK losses. On timeout events, the bucket
582	   is set to init_win.

584	   When packets are emitted in the existing tcp_output routine, the
585	   bucket is decremented. The bucket does not become negative; if it
586	   would, that data is not sent, and the routine returns. This 'refusal
587	   to send' behavior is identical to not having sufficient send window;
588	   the application will retry, typically due to some timer event.

590	   The code is as follows:

592	          /* define the state for BOL */
593	               int  ack_ratio,     /* max number of packets per ack */
594	               int  bucket,        /* max burst size */

596	          #DEFINE init_win 2

598	          /* set the bucket on ACK receipt */
599	               bucket = ack_ratio * 2 + 1;

601	          /* set bucket on timeout */
602	               bucket = init_win;

604	          /* decrement the bucket accordingly */
605	               can_send = min(bucket - want_send, 0);
606	               bucket -= can_send;
607	               do_send(can_send);
608	               return can_send;

610	Local connections

612	   We recently discovered cases where SSR adversely affected local con-
613	   nections, i.e., connections within the same subnet. There is already
614	   code in the current BSD-derived TCP implementation that disables ini-
615	   tializing the congestion window (CWND) to a small value, when both

617	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

619	   ends of a connection share a subnet. This code, from FreeBSD
620	    tcp_input.c, is outlined below, where inp = tp->t_inpcb.

622	          /*
623	           * Don't force slow-start on local network.
624	           */
625	          if (!in_localaddr(inp->inp_faddr))
626	               tp->snd_cwnd = mss;

628	   RFC-2414 observes that there are 3 types of window sizes used in BSD
629	   TCP:

631	               1. intital window
632	               2. restart after timeout
633	               3. restart after loss

635	   Recent work indicates that these windows should be treated differ-
636	   ently; e.g., the restart-after-loss window should always be set to 1,
637	   and the initial window can be set to 4 or not used for local connec-
638	   tions [FAP97]. We proposed that it is consistent to extend this local
639	   exception to the restart-after-timeout window, as shown below as a
640	   FreeBSD code segment:

642	          idle = (tp->snd_max == tp->snd_una);
643	          maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una;
644	          if ((tp->snd_cwnd > maxwin) &&
645	             !in_localaddr(tcp->t_inpcb->inp_faddr))
646	               tp->snd_cwnd = maxwin;

648	   This modification is considered reasonable because it corresponds to
649	   the initial window determination. There are other cases where TCP
650	   uses "local" rules, such as MSS determination, and piggybacking.

652	Performance Issues

654	   The purpose of these modifications is to avoid line-rate bursts, pri-
655	   marily as the result of the source being idle, but also due to ACK
656	   losses. A result of many of the proposed solutions is to limit the
657	   source, either by reducing the congestion window, or limiting the
658	   transmission of bursts. The performance effects of congestion window
659	   reduction are well-studied, and are not addressed here. BOL can limit
660	   the source without modifying the congestion window, and there are
661	   concerns that this may affect the performance of TCP.

663	   In particular, some implementations of TCP send ACKs in a bursty
664	   fashion, either by reducing the send/receive context switching, or by
665	   reducing the overall number of ACKs. These modifications are used to
666	   increase the performance of TCP, but can also increase the burstiness

668	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

670	   of the source. BOL, by reducing the burstiness of the source, is con-
671	   sidered to potentially limit the effectiveness of these enhancements.

673	   TCP currently recommends that an ACK "SHOULD" be sent for at least
674	   every two MSS's received. We assume that this recommendation implies
675	   an interleaving factor for the source as well. If this factor is
676	   encoded in a TCB variable (ack_ratio), then BOL can allow for corre-
677	   sponding bursts consistent with reasonable (non-adjacent) ACK losses.

679	   We assume that TCPs should not delay ACK processing to aggregate the
680	   receive-side timeslices, in order to reduce send/receive interleav-
681	   ing, unless this is also encoded in their ack_ratio variable. Our
682	   initial measurements indicate that even the recommended interleaving
683	   does not substantially affect the performance of TCP.

685	   In addition, we recommend that any burst-avoidance modification be
686	   disabled for direct LAN connections. For such connections, the con-
687	   gestion window mechanism is already disabled in several TCP implemen-
688	   tations. The resulting burstiness is handled by the link layer and
689	   receiver buffers.

691	Conclusions

693	   At this time, we propose BOL as a simple, minimal and effective fix
694	   to the 'bug' in current TCP implementations that is exploited by
695	   HTTP/1.1.  Modifications can be made to TCP to solve the slow-start
696	   restart problem that are consistent with the original congestion
697	   avoidance specifications (i.e. a send timer).  However, we feel that
698	   the original intended behavior is not appropriate to some current
699	   applications, specifically HTTP. Thus, we recommend BOL to prevent
700	   bursts into the network.  Not only does this solution solve the cur-
701	   rent problem in a simple way, it will prevent bursting in any other
702	   situation that might arise. The packet bursts which we allow are con-
703	   sistent with congestion window growth algorithms and with Floyd's
704	   conclusion about increasing the initial window size.

706	   BOL, as well as the other solutions listed, need to be re-evaluated
707	   within emerging TCP implementations, e.g., SACK [JB88].  In general,
708	   TCP has no rate pacing and uses congestion control to avoid bursts in
709	   current implementations.  A more explicit mechanism, such as RBP or
710	   similar proposals may be desirable in the future.

712	Security implications

714	   There are security risks associated with these algorithms that can
715	   result in denial of service attacks.  Intermediate routers can delay
716	   data or ACKs in ways that can cause the restarting party to restart
717	   more conservatively. They also result in less probability of the

719	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

721	   restarting party generating a line-rate burst.

723	   In Send-Timer and Maxburst algorithms, as well as in a no-restart
724	   system, an attacker (either an intermediate router or data receiver)
725	   can clump the delivery of ACKs, which can result in line-rate bursts.

727	   The Receive-Timer, Rate-Based Pacing, UI/LI, and BOL algorithms use
728	   data or ACK arrival to moderate outgoing data and avoid bursts. If
729	   the attacker attempts to deliver data or ACKs so as to cause a burst,
730	   these algorithms successfully avoid creating a burst. In the process,
731	   their throughput is reduced.

733	   We believe the latter algorithms should be considered more robust
734	   than the former, because slow-start restart focuses on avoiding
735	   bursts, rather than maximizing throughput. In the recommended algo-
736	   rithms (BOL now, RBP for the future), the system succesfully avoids
737	   generating a burst as the result of an attacker.

739	Acknowledgements

741	   We would like to acknowledge Kacheong Poon, Sally Floyd, members of
742	   the End-to-End-Interest Mailing List, and members of the IETF TCP
743	   Implementation Working Group for their feedback and discussions on
744	   this document. We would also like to thank Ruth Aydt, Steve Tuecke,
745	   and Carl Kesselman for their observation of the local connection
746	   issue.

748	References

750	   [AHO98] Mark Allman, Chris Hayes, and Shawn Ostermann.  An Evaluation
751	       of TCP with Larger Initial Windows.  ACM Computer Communications
752	       Review, 28(3), 41-52 , July 1998.

754	   [APS99] M. Allman, V. Paxson, W. Stevens.  TCP Congestion Control,
755	       April 1999.  RFC-2581.

757	   [BPK97] Hari Balakrishnan, Venkata N. Padmanabhan, and Randy H. Katz.
758	       The Effects of Asymmetry on TCP Performance.  In Proceedings of
759	       the ACM/IEEE Mobicom, Budapest, Hungary, ACM.  September, 1997.

761	   [BSSK97] Hari Balakrishnan, Srinivasan Seshan, Mark Stemm, and Randy
762	       H. Katz.  Analyzing Stability in Wide-Area Network Performance.
763	       In Proceedings of the ACM SIGMETRICS, Seattle WA, USA, ACM.
764	       June, 1997.

766	   [FGMFB97] R. Fielding, Jim Gettys, Jeffrey C. Mogul, H. Frystyk, and
767	       Tim Berners-Lee.  Hypertext Transfer Protocol -- HTTP/1.1,

769	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

771	       January 1997. RFC-2068.

773	   [FAP97] Mark Allman, Sally Floyd, and Craig Partridge.  Increasing
774	       TCP's Initial Window, September 1998. RFC-2414.

776	   [Hei97] John Heidemann. Performance Interactions Between P-HTTP and
777	       TCP Implementations.  ACM Computer Communications Review, 27(2),
778	       65-73, April 1997.

780	   [HOT97] John Heidemann, Katia Obraczka, and Joe Touch.  Modeling the
781	       Performance of HTTP Over Several Transport Protocols.  ACM/IEEE
782	       Transactions on Networking 5(5), 616-630, October, 1997.

784	   [JB88] Van Jacobson and R.T. Braden. TCP extensions for long-delay
785	       paths, October 1988. RFC 1072.

787	   [Jac88] Van Jacobson.  Congestion Avoidance and Control.  Proc. SIG-
788	       COMM '88, in ACM Computer Communication Review, 18(4):314-329,
789	       August 1988. NOTE: This paper lacks the appendix that describes
790	       slow-start restart, which appears only in the unpublished [JK92].

792	   [JK92] Van Jacobson and Mike Karels.  Congestion Avoidance and Con-
793	       trol.  This is a revised version of [Jac88], which adds Karels as
794	       co-author and includes an additional appendix.  The revised ver-
795	       sion has not been published, but is available at
796	       ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

798	   [NS97] ns Network Simulator.  http://www-mash.cs.berkeley.edu/ns/,
799	       1997.

801	   [PK98] Venkata N. Padmanabhan and Randy H. Katz.  TCP Fast Start: A
802	       Technique for Speeding Up Web Transfers.  In Proceedings of the
803	       IEEE GLOBECOM Internet Mini-Conference, Sydney, Australia, Novem-
804	       ber, 1998.

806	   [PN98] K. Poduri and K. Nichols. Simulation Studies of Increased Ini-
807	       tial TCP Window Size, September 1998. RFC-2415.

809	   [Poo97] Kacheong Poon, Sun Microsystems, tcp-implementors mailing
810	       list, August, 1997.

812	   [Tou97] Joe Touch, ISI, tcp-implementors mailing list, August 12,
813	       1997.

815	   [VH97] Vikram Visweswaraiah and John Heidemann.  Improving Restart of
816	       Idle TCP Connections.  Technical Report 97-661, University of
817	       Southern California, November 1997.

819	Hughes et al.       Issues in TCP Slow-Start Restart        Dec. 1, 2001

821	Authors/ Address

823	   Amy S. Hughes, Joe Touch, John Hiedemann
824	   University of Southern California/Information Sciences Institute
825	   4676 Admiralty Way
826	   Marina del Rey, CA 90292-6695
827	   USA
828	   Phone: +1 310-822-1511
829	   Fax:   +1 310-823-6714
830	   URLs:   http://www.isi.edu/~ahughes
831	        http://www.isi.edu/touch
832	        http://www.isi.edu/~johnh
833	   Email: ahughes@isi.edu
834	       touch@isi.edu
835	       johnh@isi.edu