idnits 2.17.1 

draft-ietf-tcpm-rfc2581bis-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 820.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 797.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 804.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 810.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 5 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** There are 4 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 2006) is 6518 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC3390' is mentioned on line 631, but not defined

  == Unused Reference: 'Flo94' is defined on line 676, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)


     Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 14 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                          M. Allman
2	Internet-Draft                                                 V. Paxson
3	Expires: December 2006                                       ICIR / ICSI
4	                                                              E. Blanton
5	                                                       Purdue University
6	                                                               June 2006

8	                         TCP Congestion Control
9			   draft-ietf-tcpm-rfc2581bis-01.txt

11	Status of this Memo

13	    By submitting this Internet-Draft, each author represents that any
14	    applicable patent or other IPR claims of which he or she is aware
15	    have been or will be disclosed, and any of which he or she becomes
16	    aware will be disclosed, in accordance with Section 6 of BCP 79.

18	    Internet-Drafts are working documents of the Internet Engineering
19	    Task Force (IETF), its areas, and its working groups.  Note that
20	    other groups may also distribute working documents as
21	    Internet-Drafts.

23	    Internet-Drafts are draft documents valid for a maximum of six
24	    months and may be updated, replaced, or obsoleted by other documents
25	    at any time.  It is inappropriate to use Internet-Drafts as
26	    reference material or to cite them other than as "work in progress."

28	    The list of current Internet-Drafts can be accessed at
29	    http://www.ietf.org/ietf/1id-abstracts.txt.

31	    The list of Internet-Draft Shadow Directories can be accessed at
32	    http://www.ietf.org/shadow.html.

34	Copyright Notice

36	    Copyright (C) The Internet Society (2006).

38	Abstract

40	    This document defines TCP's four intertwined congestion control
41	    algorithms: slow start, congestion avoidance, fast retransmit, and
42	    fast recovery.  In addition, the document specifies how TCP should
43	    begin transmission after a relatively long idle period and
44	    discusses various acknowledgment generation methods.

46	1. Introduction

48	    This document specifies four TCP [RFC793] congestion control
49	    algorithms: slow start, congestion avoidance, fast retransmit and
50	    fast recovery.  These algorithms were devised in [Jac88] and
51	    [Jac90]. Their use with TCP is standardized in [RFC1122].  Additional
52	    early work in additive-increase, multiplicative-decrease congestion
53	    control is given in [CJ89].

55	    This document is an update of [RFC2001] and [RFC2581].

57	    In addition to specifying the congestion control algorithms, this
58	    document specifies what TCP connections should do after a relatively
59	    long idle period, as well as specifying and clarifying some of the
60	    issues pertaining to TCP ACK generation.

62	    Note that [Ste94] provides examples of these algorithms in action
63	    and [WS95] provides an explanation of the source code for the BSD
64	    implementation of these algorithms.

66	    This document is organized as follows.  Section 2 provides various
67	    definitions which will be used throughout the document.  Section 3
68	    provides a specification of the congestion control
69	    algorithms.  Section 4 outlines concerns related to the congestion
70	    control algorithms and, finally, Section 5 outlines security
71	    considerations.

73	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
74	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
75	    document are to be interpreted as described in [RFC2119].

77	2. Definitions

79	    This section provides the definition of several terms that will be
80	    used throughout the remainder of this document.

82	    SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
83	        both).

85	    SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
86	        largest segment that the sender can transmit.  This value can be
87	        based on the maximum transmission unit of the network, the path
88	        MTU discovery [RFC1191] algorithm, RMSS (see next item), or other
89	        factors.  The size does not include the TCP/IP headers and
90	        options.

92	    RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
93	        largest segment the receiver is willing to accept.  This is the
94	        value specified in the MSS option sent by the receiver during
95	        connection startup or, if the MSS option is not used, 536
96	        bytes [RFC1122].  The size does not include the TCP/IP headers
97	        and options.

99	    FULL-SIZED SEGMENT: A segment that contains the maximum number of
100	        data bytes permitted (i.e., a segment containing SMSS bytes of
101	        data).

103	    RECEIVER WINDOW (rwnd): The most recently advertised receiver window.

105	    CONGESTION WINDOW (cwnd): A TCP state variable that limits the
106	        amount of data a TCP can send.  At any given time, a TCP MUST
107	        NOT send data with a sequence number higher than the sum of the
108	        highest acknowledged sequence number and the minimum of cwnd and
109	        rwnd.

111	    INITIAL WINDOW (IW): The initial window is the size of the sender's
112	        congestion window after the three-way handshake is completed.

114	    LOSS WINDOW (LW): The loss window is the size of the congestion
115	        window after a TCP sender detects loss using its retransmission
116	        timer.

118	    RESTART WINDOW (RW): The restart window is the size of the
119	        congestion window after a TCP restarts transmission after an
120	        idle period (if the slow start algorithm is used; see section
121	        4.1 for more discussion).

123	    FLIGHT SIZE: The amount of data that has been sent but not yet
124	        acknowledged.

126	    DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
127	        "duplicate" in the following algorithms when (a) the receiver of
128	        the ACK has outstanding data, (b) the incoming acknowledgment
129	        carries no data, (c) the SYN and FIN bits are both off, (d) the
130	        acknowledgment number is equal to the greatest acknowledgment
131	        received on the given connection (SND.UNA from [RFC793]) and (e)
132	        the advertised window in the incoming acknowledgment equals the
133	        advertised window in the last incoming acknowledgment.
134	        Alternatively, a TCP that utilizes selective acknowledgments
135	        [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate"
136	        if the ACK contains previously unknown SACK information.
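
The non-SACK part of the duplicate acknowledgment definition can be
read as a five-part predicate.  The following is a minimal sketch
only, not part of the specification.  The AckSegment type, its field
names, and the function name are hypothetical and chosen for
illustration; the SACK-based alternative is not modeled.

```python
from dataclasses import dataclass


@dataclass
class AckSegment:
    # Hypothetical view of an incoming segment (illustrative names).
    ack_no: int       # acknowledgment number carried by the segment
    payload_len: int  # number of data bytes carried by the segment
    syn: bool
    fin: bool
    window: int       # advertised window carried by the segment


def is_duplicate_ack(seg, snd_una, flight_size, last_window):
    """Apply conditions (a) through (e) of the definition."""
    return (
        flight_size > 0                 # (a) sender has outstanding data
        and seg.payload_len == 0        # (b) the ACK carries no data
        and not seg.syn                 # (c) SYN bit is off
        and not seg.fin                 # (c) FIN bit is off
        and seg.ack_no == snd_una       # (d) equals greatest ACK seen
        and seg.window == last_window   # (e) advertised window unchanged
    )
```

Under these assumptions, an ACK that changes the advertised window
is treated as a window update rather than as a duplicate.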

138	3. Congestion Control Algorithms

140	    This section defines the four congestion control algorithms: slow
141	    start, congestion avoidance, fast retransmit and fast recovery,
142	    developed in [Jac88] and [Jac90].  In some situations it may be
143	    beneficial for a TCP sender to be more conservative than the
144	    algorithms allow; however, a TCP MUST NOT be more aggressive than
145	    the following algorithms allow (that is, MUST NOT send data when
146	    the value of cwnd computed by the following algorithms would not
147	    allow the data to be sent).

149	3.1 Slow Start and Congestion Avoidance

151	    The slow start and congestion avoidance algorithms MUST be used by a
152	    TCP sender to control the amount of outstanding data being injected
153	    into the network.  To implement these algorithms, two variables are
154	    added to the TCP per-connection state.  The congestion window (cwnd)
155	    is a sender-side limit on the amount of data the sender can transmit
156	    into the network before receiving an acknowledgment (ACK), while the
157	    receiver's advertised window (rwnd) is a receiver-side limit on the
158	    amount of outstanding data.  The minimum of cwnd and rwnd governs
159	    data transmission.

161	    Another state variable, the slow start threshold (ssthresh), is used
162	    to determine whether the slow start or congestion avoidance
163	    algorithm is used to control data transmission, as discussed below.

165	    Beginning transmission into a network with unknown conditions
166	    requires TCP to slowly probe the network to determine the available
167	    capacity, in order to avoid congesting the network with an
168	    inappropriately large burst of data.  The slow start algorithm is
169	    used for this purpose at the beginning of a transfer, or after
170	    repairing loss detected by the retransmission timer.

172	    IW, the initial value of cwnd, MUST be set using the following
173	    guidelines as an upper bound.

175	    If SMSS > 2190 bytes:
176		IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
177	    If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
178		IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
179	    If SMSS <= 1095 bytes:
180		IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
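
As an illustration, the three cases above can be written as a single
function that returns the upper bound on IW in bytes.  This is a
sketch only; the function name is hypothetical, and the separate
limit on the number of segments is not modeled.

```python
def initial_window_bound(smss: int) -> int:
    """Return the upper bound on IW, in bytes, for a given SMSS."""
    if smss > 2190:
        return 2 * smss   # at most 2 segments
    if smss > 1095:
        return 3 * smss   # at most 3 segments
    return 4 * smss       # at most 4 segments
```

For example, an SMSS of 1460 bytes falls in the middle case and
yields a bound of 4380 bytes, while an SMSS of 536 bytes yields 2144
bytes.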

182	    As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
183	    SYN/ACK MUST NOT increase the size of the congestion window.
184	    Further, if the SYN or SYN/ACK is lost, the initial window used by a
185	    sender after a correctly transmitted SYN MUST be one segment
186	    consisting of at most SMSS bytes.

188	    A detailed rationale and discussion of the IW setting is provided in
189	    [RFC3390].

191	    When larger initial windows are implemented along with Path MTU
192	    Discovery [RFC1191], and the MSS being used is found to be too
193	    large, the congestion window cwnd SHOULD be reduced to prevent
194	    large bursts of smaller segments.  Specifically, cwnd SHOULD be
195	    reduced by the ratio of the old segment size to the new segment
196	    size.

198	    The initial value of ssthresh SHOULD be arbitrarily high (e.g., to
199	    the size of the largest possible advertised window), but ssthresh
200	    MUST be reduced in response to congestion.  Setting ssthresh as high
201	    as possible allows the network conditions, rather than some
202	    arbitrary host limit, to dictate the sending rate.  In cases where
203	    the end systems have a solid understanding of the network path, more
204	    carefully setting the initial ssthresh value may have merit (e.g.,
205	    such that the end host does not create congestion along the path).

207	    The slow start algorithm is used when cwnd < ssthresh, while the
208	    congestion avoidance algorithm is used when cwnd > ssthresh.  When
209	    cwnd and ssthresh are equal the sender may use either slow start or
210	    congestion avoidance.

212	    During slow start, a TCP increments cwnd by at most SMSS bytes for
213	    each ACK received that acknowledges new data.  Slow start ends when
214	    cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
215	    above) or when congestion is observed.  While traditionally TCP
216	    implementations have increased cwnd by precisely SMSS bytes upon
217	    receipt of an ACK covering new data, we RECOMMEND that TCP
218	    implementations increase cwnd per:

220	        cwnd += min (N, SMSS)                      (2)

222	    where N is the number of previously unacknowledged bytes
223	    acknowledged in the incoming ACK.  This adjustment is part of
224	    Appropriate Byte Counting [RFC3465] and provides robustness against
225	    misbehaving receivers which may attempt to induce a sender to
226	    artificially inflate cwnd using a mechanism known as "ACK Division"
227	    [SCWA99].  ACK Division consists of a receiver sending multiple ACKs
228	    for a single TCP data segment, each acknowledging only a portion of
229	    its data.  A TCP that increments cwnd by SMSS for each such ACK will
230	    inappropriately inflate the amount of data injected into the
231	    network.
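
The effect of equation (2) under ACK Division can be seen in a small
sketch.  The function names and the numbers used below are
illustrative assumptions, not part of the specification.  The
example compares the traditional per-ACK increase with the
byte-counting increase.

```python
def slow_start_traditional(cwnd: int, smss: int) -> int:
    """Traditional behavior: add SMSS for every ACK of new data."""
    return cwnd + smss


def slow_start_byte_counting(cwnd: int, newly_acked: int, smss: int) -> int:
    """Equation (2): add min(N, SMSS), N = newly acknowledged bytes."""
    return cwnd + min(newly_acked, smss)


# A misbehaving receiver acknowledges one 1460-byte segment with four
# ACKs that each cover 365 bytes (illustrative values).
SMSS = 1460
traditional = counted = 4380
for _ in range(4):
    traditional = slow_start_traditional(traditional, SMSS)
    counted = slow_start_byte_counting(counted, 365, SMSS)
```

With these assumed values the traditional rule grows cwnd by four
segments for one segment of data, while equation (2) grows it by
exactly the 1460 bytes that were acknowledged.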

233	    During congestion avoidance, cwnd is incremented by roughly 1
234	    full-sized segment per round-trip time (RTT).  Congestion avoidance
235	    continues until congestion is detected.  The basic guidelines for
236	    incrementing cwnd during congestion avoidance are:

238	      * MAY increment cwnd by SMSS bytes

240	      * SHOULD increment cwnd per equation (2)

242	      * MUST NOT increment cwnd by more than SMSS bytes

244	    We note that [RFC3465] allows for cwnd increases of more than SMSS
245	    bytes for incoming acknowledgments during slow start on an
246	    experimental basis; however, such behavior is not allowed as part
247	    of the standard.

249	    The RECOMMENDED way to increase cwnd during congestion avoidance is
250	    to count the number of bytes that have been acknowledged by ACKs for
251	    new data.  (A drawback of this implementation is that it requires
252	    maintaining an additional state variable.)  When the number of bytes
253	    acknowledged reaches cwnd, then cwnd can be incremented by up to
254	    SMSS bytes.  Note that during congestion avoidance, cwnd MUST NOT be
255	    increased by more than SMSS bytes per RTT.  This method both allows
256	    TCPs to increase cwnd by one segment per RTT in the face of delayed
257	    ACKs and provides robustness against ACK Division attacks.
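
A minimal sketch of the byte-counting approach follows.  The class
and attribute names are hypothetical.  Reducing the counter by cwnd
(rather than resetting it to zero) when the threshold is reached is
an assumption taken from the description in [RFC3465]; the text
above does not mandate either choice.

```python
class ByteCountingAvoidance:
    """Congestion avoidance increase driven by acknowledged bytes."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0   # the additional state variable

    def on_new_ack(self, newly_acked: int) -> None:
        """Process an ACK that acknowledges newly_acked new bytes."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # One cwnd of data has been acknowledged (roughly one
            # RTT), so grow the window by one full-sized segment.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
```

Because the counter tracks bytes rather than ACKs, a receiver that
acknowledges every second segment still drives one increase per
window of data, and splitting ACKs does not speed up the increase.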

259	    Another common formula that a TCP MAY use to update cwnd during
260	    congestion avoidance is given in equation 3:

262	        cwnd += SMSS*SMSS/cwnd                     (3)

264	    This adjustment is executed on every incoming ACK that acknowledges
265	    new data.  Equation (3) provides an acceptable approximation to the
266	    underlying principle of increasing cwnd by 1 full-sized segment per
267	    RTT.  (Note that for a connection in which the receiver is
268	    acknowledging every-other packet, (3) is less aggressive than
269	    allowed -- roughly increasing cwnd every second RTT.)

271	    Implementation Note: Since integer arithmetic is usually used in TCP
272	    implementations, the formula given in equation 3 can fail to
273	    increase cwnd when the congestion window is larger than SMSS*SMSS.
274	    If the above formula yields 0, the result SHOULD be rounded up to 1
275	    byte.
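
The rounding issue can be illustrated with integer arithmetic.  The
following sketch assumes cwnd and SMSS are kept in bytes; the
function name is hypothetical.

```python
def avoidance_increase_eq3(cwnd: int, smss: int) -> int:
    """Apply equation (3) with integer division and a 1-byte floor."""
    increment = (smss * smss) // cwnd
    if increment == 0:
        # cwnd is larger than SMSS*SMSS, so the division truncated
        # to zero; round up to 1 byte so that cwnd still grows.
        increment = 1
    return cwnd + increment
```

With an SMSS of 1460 bytes, SMSS*SMSS is 2131600, so any cwnd above
that value would otherwise stop growing.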

277	    Implementation Note: older implementations have an additional
278	    additive constant on the right-hand side of equation (3).  This is
279	    incorrect and can actually lead to diminished performance [RFC2525].

281	    Implementation Note: some implementations maintain cwnd in units of
282	    bytes, while others in units of full-sized segments.  The latter
283	    will find equation (3) difficult to use, and may prefer to use the
284	    counting approach discussed in the previous paragraph.

286	    When a TCP sender detects segment loss using the retransmission
287	    timer and the given segment has not yet been retransmitted, the
288	    value of ssthresh MUST be set to no more than the value given in
289	    equation 4:

291	        ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

293	    where, as discussed above, FlightSize is the amount of outstanding
294	    data in the network.

296	    On the other hand, when a TCP sender detects segment loss using the
297	    retransmission timer and the given segment has already been
298	    retransmitted at least once, the value of ssthresh MUST be set to no
299	    more than the value given in equation 5:

301	        ssthresh = max (ssthresh / 2, 2*SMSS)              (5)

303	    In other words, upon the first retransmission of a segment the value
304	    of ssthresh should be set to half the amount of outstanding data in
305	    the network, whereas on subsequent retransmissions the value of
306	    ssthresh should simply be halved.

308	    Implementation Note: an easy mistake to make is to simply use cwnd,
309	    rather than FlightSize, in equation 4.  In some implementations,
310	    cwnd may incidentally increase well beyond rwnd.

312	    Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
313	    set to no more than the loss window, LW, which equals 1 full-sized
314	    segment (regardless of the value of IW).  Therefore, after
315	    retransmitting the dropped segment the TCP sender uses the slow
316	    start algorithm to increase the window from 1 full-sized segment to
317	    the new value of ssthresh, at which point congestion avoidance again
318	    takes over.
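
The timeout response described above (equations 4 and 5, followed by
the reduction of cwnd to the loss window) can be sketched as
follows.  The function name and the boolean flag are hypothetical.
The values returned are upper bounds, since the text requires "no
more than" these values.

```python
def on_retransmission_timeout(flight_size: int, ssthresh: int,
                              smss: int, already_retransmitted: bool):
    """Return (new ssthresh bound, new cwnd bound) after a timeout."""
    if already_retransmitted:
        # Equation (5): the segment was retransmitted before.
        new_ssthresh = max(ssthresh // 2, 2 * smss)
    else:
        # Equation (4): first retransmission; note the use of
        # FlightSize, not cwnd.
        new_ssthresh = max(flight_size // 2, 2 * smss)
    loss_window = smss   # LW is 1 full-sized segment
    return new_ssthresh, loss_window
```

After this step the sender retransmits the segment and slow start
grows cwnd from the loss window toward the new ssthresh.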

320	    As shown in [FF96,RFC3782], slow start-based loss recovery after a
321	    timeout can cause spurious retransmissions that trigger duplicate
322	    acknowledgments.  The reaction to the arrival of these duplicate
323	    ACKs in TCP implementations varies widely.  This document does not
324	    specify how to treat such acknowledgments, but does note this as an
325	    area that may benefit from additional attention, experimentation and
326	    specification.

328	3.2 Fast Retransmit/Fast Recovery

330	    A TCP receiver SHOULD send an immediate duplicate ACK when an out-
331	    of-order segment arrives.  The purpose of this ACK is to inform the
332	    sender that a segment was received out-of-order and which sequence
333	    number is expected.  From the sender's perspective, duplicate ACKs
334	    can be caused by a number of network problems.  First, they can be
335	    caused by dropped segments.  In this case, all segments after the
336	    dropped segment will trigger duplicate ACKs until the loss is
337	    repaired.  Second, duplicate ACKs can be caused by the re-ordering
338	    of data segments by the network (not a rare event along some network
339	    paths [Pax97]).  Finally, duplicate ACKs can be caused by
340	    replication of ACK or data segments by the network.  In addition, a
341	    TCP receiver SHOULD send an immediate ACK when the incoming segment
342	    fills in all or part of a gap in the sequence space.  This will
343	    generate more timely information for a sender recovering from a loss
344	    through a retransmission timeout, a fast retransmit, or an advanced
345	    loss recovery algorithm, as outlined in section 4.3.

347	    The TCP sender SHOULD use the "fast retransmit" algorithm to detect
348	    and repair loss, based on incoming duplicate ACKs.  The fast
349	    retransmit algorithm uses the arrival of 3 duplicate ACKs (as
350	    defined in section 2, without any intervening ACKs which move
351	    SND.UNA) as an indication that a segment has been lost.  After
352	    receiving 3 duplicate ACKs, TCP performs a retransmission of what
353	    appears to be the missing segment, without waiting for the
354	    retransmission timer to expire.

356	    After the fast retransmit algorithm sends what appears to be the
357	    missing segment, the "fast recovery" algorithm governs the
358	    transmission of new data until a non-duplicate ACK arrives.  The
359	    reason for not performing slow start is that the receipt of the
360	    duplicate ACKs not only indicates that a segment has been lost, but
361	    also that segments are most likely leaving the network (although a
362	    massive segment duplication by the network can invalidate this
363	    conclusion).  In other words, since the receiver can only generate a
364	    duplicate ACK when a segment has arrived, that segment has left the
365	    network and is in the receiver's buffer, so we know it is no longer
366	    consuming network resources.  Furthermore, since the ACK "clock"
367	    [Jac88] is preserved, the TCP sender can continue to transmit new
368	    segments (although transmission must continue using a reduced cwnd,
369	    since loss is an indication of congestion).

371	    The fast retransmit and fast recovery algorithms are implemented
372	    together as follows.

374	    1.  On the first and second duplicate ACKs received at a sender, a
375	        TCP SHOULD send a segment of previously unsent data per
376	        [RFC3042] provided that the receiver's advertised window allows,
377	        the total FlightSize would remain less than or equal to cwnd
378	        plus 2*SMSS, and that new data is available for transmission.
379	        Further, the TCP sender MUST NOT change cwnd to reflect these
380	        two segments [RFC3042].  Note that a sender using SACK [RFC2018]
381	        MUST NOT send new data unless the incoming duplicate
382	        acknowledgment contains new SACK information.

384	    2.  When the third duplicate ACK is received, a TCP MUST set
385	        ssthresh to no more than the value given in equation 4.

387	    3.  The lost segment MUST be retransmitted and cwnd set to
388	        ssthresh plus 3*SMSS. This artificially "inflates" the
389	        congestion window by the number of segments (three) that have
390	        left the network and which the receiver has buffered.

392	    4.  For each additional duplicate ACK received (after the third),
393	        cwnd MUST be incremented by SMSS.  This artificially inflates
394	        the congestion window in order to reflect the additional segment
395	        that has left the network.

397	    5.  Transmit a segment, if allowed by the new value of cwnd and the
398	        receiver's advertised window.

400	    6.  When the next ACK arrives that acknowledges new data, a TCP
401	        MUST set cwnd to ssthresh (the value set in step 2).  This is
402	        termed "deflating" the window.

404	        This ACK should be the acknowledgment elicited by the
405	        retransmission from step 3, one RTT after the retransmission
406	        (though it may arrive sooner in the presence of significant out-
407	        of-order delivery of data segments at the
408	        receiver). Additionally, this ACK should acknowledge all the
409	        intermediate segments sent between the lost segment and the
410	        receipt of the third duplicate ACK, if none of these were lost.
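
The window arithmetic in the numbered steps above can be sketched as
a small state holder.  This is an illustration under simplifying
assumptions, not a definitive implementation: the class, method, and
return-value names are hypothetical, the actual transmission of
segments (including step 5 and the limited transmit conditions of
step 1) is left to the caller, and SACK is not modeled.

```python
from dataclasses import dataclass


@dataclass
class FastRecoverySender:
    smss: int
    cwnd: int
    ssthresh: int
    flight_size: int
    dupacks: int = 0
    in_recovery: bool = False

    def on_duplicate_ack(self) -> str:
        """Update state for one duplicate ACK; return the action."""
        self.dupacks += 1
        if self.dupacks < 3:
            # Step 1: new data may be sent; cwnd is not changed.
            return "limited_transmit"
        if self.dupacks == 3:
            # Step 2: reduce ssthresh per equation (4).
            self.ssthresh = max(self.flight_size // 2, 2 * self.smss)
            # Step 3: retransmit and inflate by three segments.
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
            return "retransmit"
        # Step 4: one more segment has left the network.
        self.cwnd += self.smss
        return "inflate"

    def on_new_data_ack(self) -> None:
        """Step 6: deflate the window when new data is acknowledged."""
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dupacks = 0
```

For example, with an SMSS of 1460 bytes and a FlightSize of 14600
bytes, the third duplicate ACK sets ssthresh to 7300 bytes and cwnd
to 11680 bytes; the next ACK for new data deflates cwnd to 7300
bytes.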

412	    Note: This algorithm is known to generally not recover efficiently
413	    from multiple losses in a single flight of packets [FF96].  Section
414	    4.3 below addresses such cases.

416	4. Additional Considerations

418	4.1 Re-starting Idle Connections

420	    A known problem with the TCP congestion control algorithms described
421	    above is that they allow a potentially inappropriate burst of
422	    traffic to be transmitted after TCP has been idle for a relatively
423	    long period of time.  After an idle period, TCP cannot use the ACK
424	    clock to strobe new segments into the network, as all the ACKs have
425	    drained from the network.  Therefore, as specified above, TCP can
426	    potentially send a cwnd-size line-rate burst into the network after
427	    an idle period.

429	    [Jac88] recommends that a TCP use slow start to restart
430	    transmission after a relatively long idle period.  Slow start
431	    serves to restart the ACK clock, just as it does at the beginning
432	    of a transfer.  This mechanism has been widely deployed in the
433	    following manner.  When TCP has not received a segment for more
434	    than one retransmission timeout, cwnd is reduced to the value of
435	    the restart window (RW) before transmission begins.

437	    For the purposes of this standard, we define RW = min(IW,cwnd).

439	    Using the last time a segment was received to determine whether or
440	    not to decrease cwnd can fail to deflate cwnd in the common case of
441	    persistent HTTP connections [HTH98].  In this case, a Web server
442	    receives a request before transmitting data to the Web client.  The
443	    reception of the request makes the test for an idle connection fail,
444	    and allows the TCP to begin transmission with a possibly
445	    inappropriately large cwnd.

447	    Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
448	    transmission if the TCP has not sent data in an interval exceeding
449	    the retransmission timeout.
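
A minimal sketch of this restart check follows, using the definition
RW = min(IW,cwnd).  The function and parameter names are
hypothetical, and the idle time is assumed to be measured from the
last time the TCP sent data, in the same units as the retransmission
timeout.

```python
def cwnd_before_restart(cwnd: int, iw: int,
                        idle_time: float, rto: float) -> int:
    """Return the cwnd bound to use before sending after a pause."""
    if idle_time > rto:
        # Idle for longer than one RTO: fall back to the restart
        # window so that slow start rebuilds the ACK clock.
        return min(iw, cwnd)
    return cwnd
```

Because the test uses the time since data was last sent, a request
arriving on a persistent HTTP connection does not reset it.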

451	4.2 Generating Acknowledgments

453	    The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
454	    TCP receiver.  When using delayed ACKs, a TCP receiver MUST NOT
455	    excessively delay acknowledgments.  Specifically, an ACK SHOULD be
456	    generated for at least every second full-sized segment, and MUST be
457	    generated within 500 ms of the arrival of the first unacknowledged
458	    packet.

460	    The requirement that an ACK "SHOULD" be generated for at least every
461	    second full-sized segment is listed in [RFC1122] in one place as a
462	    SHOULD and another as a MUST.  Here we unambiguously state it is a
463	    SHOULD.  We also emphasize that this is a SHOULD, meaning that an
464	    implementor should indeed only deviate from this requirement after
465	    careful consideration of the implications.  See the discussion of
466	    "Stretch ACK violation" in [RFC2525] and the references therein for a
467	    discussion of the possible performance problems with generating ACKs
468	    less frequently than every second full-sized segment.

470	    In some cases, the sender and receiver may not agree on what
471	    constitutes a full-sized segment.  An implementation is deemed to
472	    comply with this requirement if it sends at least one acknowledgment
473	    every time it receives 2*RMSS bytes of new data from the sender,
474	    where RMSS is the Maximum Segment Size specified by the receiver to
475	    the sender (or the default value of 536 bytes, per [RFC1122], if the
476	    receiver does not specify an MSS option during connection
477	    establishment).  The sender may be forced to use a segment size less
478	    than RMSS due to the maximum transmission unit (MTU), the path MTU
479	    discovery algorithm or other factors.  For instance, consider the
480	    case when the receiver announces an RMSS of X bytes but the sender
481	    ends up using a segment size of Y bytes (Y < X) due to path MTU
482	    discovery (or the sender's MTU size).  The receiver will generate
483	    stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
484	    sent.  Clearly this will take more than 2 segments of size Y bytes.
485	    Therefore, while a specific algorithm is not defined, it is
486	    desirable for receivers to attempt to prevent this situation, for
487	    example by acknowledging at least every second segment, regardless
488	    of size.  Finally, we repeat that an ACK MUST NOT be delayed for
489	    more than 500 ms waiting on a second full-sized segment to arrive.

491	    Out-of-order data segments SHOULD be acknowledged immediately, in
492	    order to accelerate loss recovery.  To trigger the fast retransmit
493	    algorithm, the receiver SHOULD send an immediate duplicate ACK when
494	    it receives a data segment above a gap in the sequence space.  To
495	    provide feedback to senders recovering from losses, the receiver
496	    SHOULD send an immediate ACK when it receives a data segment that
497	    fills in all or part of a gap in the sequence space.
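
The receiver-side rules of this section can be combined into one
decision function.  This is a sketch under stated assumptions: the
function and parameter names are hypothetical, the receiver counts
segments regardless of size (the example policy mentioned above),
and 500 ms is used as the deadline although an implementation may
choose a shorter delay.

```python
MAX_ACK_DELAY_MS = 500   # upper limit on delaying an ACK


def should_ack_now(segments_pending: int, ms_since_first_pending: int,
                   above_gap: bool = False,
                   fills_gap: bool = False) -> bool:
    """Decide whether the receiver sends an ACK immediately."""
    if above_gap or fills_gap:
        # Out-of-order data, or data that fills all or part of a
        # gap, is acknowledged at once to speed up loss recovery.
        return True
    if segments_pending >= 2:
        # Acknowledge at least every second segment.
        return True
    # Otherwise wait, but never beyond the maximum delay.
    return (segments_pending > 0
            and ms_since_first_pending >= MAX_ACK_DELAY_MS)
```

Counting segments rather than waiting for 2*RMSS bytes avoids
stretch ACKs when the sender uses a segment size smaller than RMSS.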

499	    A TCP receiver MUST NOT generate more than one ACK for every
500	    incoming segment, other than to update the offered window as the
501	    receiving application consumes new data [page 42, RFC793][RFC813].

503	4.3 Loss Recovery Mechanisms

505	    A number of loss recovery algorithms that augment fast retransmit
506	    and fast recovery have been suggested by TCP researchers and
507	    specified in the RFC series.  While some of these algorithms are
508	    based on the TCP selective acknowledgment (SACK) option [RFC2018],
509	    such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
510	    [Hoe96,FF96,RFC3782].  The non-SACK algorithms use "partial
511	    acknowledgments" (ACKs which cover previously unacknowledged data,
512	    but not all the data outstanding when loss was detected) to trigger
513	    retransmissions.  While this document does not standardize any of
514	    the specific algorithms that may improve fast retransmit/fast
515	    recovery, these enhanced algorithms are implicitly allowed, as long
516	    as they follow the general principles of the basic four algorithms
517	    outlined above.

519	    That is, when the first loss in a window of data is detected,
520	    ssthresh MUST be set to no more than the value given by equation
521	    (4).  Second, until all lost segments in the window of data in
522	    question are repaired, the number of segments transmitted in each
523	    RTT MUST be no more than half the number of outstanding segments
524	    when the loss was detected.  Finally, after all loss in the given
525	    window of segments has been successfully retransmitted, cwnd MUST be
526	    set to no more than ssthresh and congestion avoidance MUST be used
527	    to further increase cwnd.  Loss in two successive windows of data,
528	    or the loss of a retransmission, should be taken as two indications
529	    of congestion and, therefore, cwnd (and ssthresh) MUST be lowered
530	    twice in this case.
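
The bounds above can be illustrated with a short sketch.  It assumes
that equation (4) has the form ssthresh = max(FlightSize / 2, 2*SMSS),
which is consistent with the 2*SMSS minimum mentioned in the Security
Considerations, and that FlightSize equals cwnd when the second loss is
detected.  The function names are hypothetical.

```python
def ssthresh_on_loss(flight_size, smss):
    """Upper bound on ssthresh when loss is detected (assumed eq. (4))."""
    return max(flight_size // 2, 2 * smss)


def cwnd_after_recovery(cwnd, ssthresh):
    """After all loss in the window is repaired, cwnd is at most ssthresh."""
    return min(cwnd, ssthresh)


SMSS = 1000

# First loss event with 16 segments outstanding.
ssthresh = ssthresh_on_loss(16 * SMSS, SMSS)
cwnd = cwnd_after_recovery(16 * SMSS, ssthresh)

# Loss in the next window counts as a second indication of congestion,
# so both values are lowered again.
ssthresh_2 = ssthresh_on_loss(cwnd, SMSS)
cwnd_2 = cwnd_after_recovery(cwnd, ssthresh_2)

print(ssthresh, cwnd, ssthresh_2, cwnd_2)  # 8000 8000 4000 4000
```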

532	    We RECOMMEND that TCP implementers employ some form of advanced loss
533	    recovery that can cope with multiple losses in a window of data.
534	    The algorithms detailed in [RFC3782] and [RFC3517] conform to the
535	    general principles outlined above.  We note that while these are not
536	    the only two algorithms that conform to the above general
537	    principles, these two algorithms have been vetted by the community
538	    and are currently on the standards track.

540	5.  Security Considerations

542	    This document requires a TCP to diminish its sending rate in the
543	    presence of retransmission timeouts and the arrival of duplicate
544	    acknowledgments.  An attacker can therefore impair the performance
545	    of a TCP connection by either causing data packets or their
546	    acknowledgments to be lost, or by forging excessive duplicate
547	    acknowledgments.  Causing two congestion control events back-to-back
548	    will often cut ssthresh to its minimum value of 2*SMSS, causing the
549	    connection to immediately enter the slower-performing congestion
550	    avoidance phase.

552	    In response to the ACK division attack outlined in [SCWA99], this
553	    document RECOMMENDS increasing the congestion window based on the
554	    number of bytes newly acknowledged in each arriving ACK rather than
555	    by a particular constant on each arriving ACK (as outlined in
556	    section 3.1).
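
The difference between the two approaches can be seen in a small
sketch.  It assumes the byte-counting increase during slow start is
cwnd += min(N, SMSS), where N is the number of bytes newly acknowledged
by an ACK, and it models an ACK-dividing receiver that acknowledges one
1000-byte segment in ten 100-byte pieces.  The function names and the
numbers are illustrative only.

```python
def grow_per_ack(cwnd, acked_bytes, smss):
    """Increase cwnd by a constant SMSS for every arriving ACK."""
    for _ in acked_bytes:
        cwnd += smss
    return cwnd


def grow_byte_counting(cwnd, acked_bytes, smss):
    """Increase cwnd by the bytes newly acknowledged, capped at SMSS."""
    for n in acked_bytes:
        cwnd += min(n, smss)
    return cwnd


SMSS = 1000
honest_acks = [1000]        # one ACK covering a full segment
divided_acks = [100] * 10   # the same segment acknowledged in ten pieces

per_ack_honest = grow_per_ack(2 * SMSS, honest_acks, SMSS)
per_ack_divided = grow_per_ack(2 * SMSS, divided_acks, SMSS)
bytes_honest = grow_byte_counting(2 * SMSS, honest_acks, SMSS)
bytes_divided = grow_byte_counting(2 * SMSS, divided_acks, SMSS)

print(per_ack_honest, per_ack_divided)  # 3000 12000
print(bytes_honest, bytes_divided)      # 3000 3000
```

With byte counting, dividing the ACK gives the receiver no advantage,
since the total increase is bounded by the data actually acknowledged.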

558	    The Internet to a considerable degree relies on the correct
559	    implementation of these algorithms in order to preserve network
560	    stability and avoid congestion collapse.  An attacker could cause
561	    TCP endpoints to respond more aggressively in the face of congestion
562	    by forging excessive duplicate acknowledgments or excessive
563	    acknowledgments for new data.  Conceivably, such an attack could
564	    drive a portion of the network into congestion collapse.

566	6.  Changes Between RFC 2001 and RFC 2581

568	    This document has been extensively rewritten editorially and it is
569	    not feasible to itemize the list of changes between the two
570	    documents.  The intention of this document is not to change any of
571	    the recommendations given in RFC 2001, but to further clarify cases
572	    that were not discussed in detail in RFC 2001.  Specifically, this
573	    document suggests what TCP connections should do after a relatively
574	    long idle period, as well as specifying and clarifying some of the
575	    issues pertaining to TCP ACK generation.  Finally, the allowable
576	    upper bound for the initial congestion window has also been raised
577	    from one to two segments.

579	7.  Changes Relative to RFC 2581

581	    A specific definition for "duplicate acknowledgment" has been
582	    added, based on the definition used by BSD TCP.  In addition, the
583	    definition explicitly does not take into account the presence (or
584	    absence) of DSACK [RFC2883] information.

586	    The document now notes that what to do with duplicate ACKs after the
587	    retransmission timer has fired is future work and explicitly
588	    unspecified in this document.

590	    The initial window requirements were changed to allow Larger
591	    Initial Windows as standardized in [RFC3390].  Additionally, the
592	    steps to take when an initial window is discovered to be too large
593	    due to Path MTU Discovery [RFC1191] are detailed.

595	    The recommendation that the initial value of ssthresh be
596	    arbitrarily high has been strengthened from MAY to SHOULD.  This
597	    is to provide additional guidance to implementors on the
598	    matter.

600	    During slow start, the usage of Appropriate Byte Counting [RFC3465]
601	    with L=1*SMSS is explicitly recommended.  The method of increasing
602	    cwnd given in [RFC2581] is still explicitly allowed.  Byte counting
603	    during congestion avoidance is also recommended, while the method
604	    from [RFC2581] and other safe methods are still allowed.

606	    The treatment of ssthresh on retransmission timeout was clarified.
607	    Specifically, Equation (3) from [RFC2581] was split into Equations
608	    (4) and (5) in this document.

610	    The description of fast retransmit and fast recovery has been
611	    clarified, and the use of Limited Transmit [RFC3042] is now
612	    recommended.

614	    The restart window has been changed to min(IW,cwnd) from IW.  This
615	    behavior was described as "experimental" in [RFC2581].

617	    It is now recommended that TCP implementors implement an advanced
618	    loss recovery algorithm conforming to the principles outlined in
619	    this document.

621	    The security considerations have been updated to discuss ACK
622	    division and recommend byte counting as a counter to this attack.

624	Acknowledgments

626	    The core algorithms we describe were developed by Van Jacobson
627	    [Jac88, Jac90].  In addition, Limited Transmit [RFC3042] was
628	    developed in conjunction with Hari Balakrishnan and Sally Floyd.
629	    The initial congestion window size specified in this document is a
630	    result of work with Sally Floyd and Craig Partridge
631	    [RFC2414,RFC3390].

633	    W. Richard ("Rich") Stevens wrote the first version of this document
634	    [RFC2001] and co-authored the second version [RFC2581].  The
635	    present version benefits greatly from his clarity and thoughtfulness
636	    description, and we are grateful for Rich's contributions in
637	    elucidating TCP congestion control, as well as in more broadly
638	    helping us understand numerous issues relating to networking.

640	    We wish to emphasize that the shortcomings and mistakes of this
641	    document are solely the responsibility of the current authors.

643	    Some of the text from this document is taken from "TCP/IP
644	    Illustrated, Volume 1: The Protocols" by W. Richard Stevens
645	    (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
646	    Implementation" by Gary R. Wright and W. Richard Stevens
647	    (Addison-Wesley, 1995).  This material is used with the permission
648	    of Addison-Wesley.

650	    Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John
651	    Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge
652	    and Joe Touch contributed a number of helpful suggestions.

654	Normative References

656	    [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
657	        793, September 1981.

659	    [RFC1122] Braden, R., "Requirements for Internet Hosts --
660	        Communication Layers", STD 3, RFC 1122, October 1989.

662	    [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
663	        November 1990.

665	Informative References

667	    [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
668	        Algorithms for Congestion Avoidance in Computer Networks",
669	        Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
670	        pp. 1-14, June 1989.

672	    [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
673	        Tahoe, Reno and SACK TCP", Computer Communication Review, July
674	        1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

676	    [Flo94] Floyd, S., "TCP and Successive Fast Retransmits", Technical
677	        report, October 1994.
678	        ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

680	    [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
681	        Control Scheme for TCP", In ACM SIGCOMM, August 1996.

683	    [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
684	        Slow-Start Restart After Idle", Work in Progress.

686	    [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
687	        Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
688	        ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

690	    [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
691	        end2end-interest mailing list, April 30, 1990.
692	        ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

694	    [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
695	        TCP Congestion Control", Proceedings of SIGCOMM'96, August,
696	        1996, Stanford, CA.  Available from
697	        http://www.psc.edu/networking/papers/papers.html.

699	    [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
700	        Parameters", Technical report.  Available from
701	        http://www.psc.edu/networking/papers/FACKnotes/current.

703	    [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
704	        Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

706	    [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
707	        813, July 1982.

709	    [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
710	        Retransmit, and Fast Recovery Algorithms", RFC 2001, January
711	        1997.

713	    [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
714	        Selective Acknowledgement Options", RFC 2018, October 1996.

716	    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
717	        Requirement Levels", BCP 14, RFC 2119, March 1997.

719	    [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
720	        Initial Window Size", RFC 2414, September 1998.

722	    [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
723	        Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
724	        Implementation Problems", RFC 2525, March 1999.

726	    [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
727	        Control", RFC 2581, April 1999.

729	    [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
730	        Extension to the Selective Acknowledgement (SACK) Option for
731	        TCP", RFC 2883, July 2000.

733	    [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
734	        Timer", RFC 2988, November 2000.

736	    [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
737	        TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
738	        2001.

740	    [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
741	        Counting (ABC)", RFC 3465, February 2003.

743	    [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
744	        Conservative Selective Acknowledgment (SACK)-based Loss Recovery
745	        Algorithm for TCP", RFC 3517, April 2003.

747	    [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
748	        Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
749	        2004.

751	    [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
752	        "TCP Congestion Control With a Misbehaving Receiver", ACM
753	        Computer Communication Review, 29(5), October 1999.

755	    [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
756	        Addison-Wesley, 1994.

758	    [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
759	        Implementation", Addison-Wesley, 1995.

761	Authors' Addresses

763	    Mark Allman
764	    ICIR / ICSI
765	    1947 Center Street
766	    Suite 600
767	    Berkeley, CA 94704-1198
768	    Phone: +1 440 235 1792
769	    EMail: mallman@icir.org
770	    http://www.icir.org/mallman/

772	    Vern Paxson
773	    ICIR / ICSI
774	    1947 Center Street
775	    Suite 600
776	    Berkeley, CA 94704-1198
777	    Phone: +1 510/642-4274 x302
778	    EMail: vern@icir.org
779	    http://www.icir.org/vern/

781	    Ethan Blanton
782	    Purdue University Computer Sciences
783	    1398 Computer Science Building
784	    West Lafayette, IN  47907
785	    EMail: eblanton@cs.purdue.edu
786	    http://www.cs.purdue.edu/homes/eblanton/

788	Intellectual Property Statement

790	    The IETF takes no position regarding the validity or scope of any
791	    Intellectual Property Rights or other rights that might be claimed
792	    to pertain to the implementation or use of the technology described
793	    in this document or the extent to which any license under such
794	    rights might or might not be available; nor does it represent that
795	    it has made any independent effort to identify any such rights.
796	    Information on the procedures with respect to rights in RFC
797	    documents can be found in BCP 78 and BCP 79.

799	    Copies of IPR disclosures made to the IETF Secretariat and any
800	    assurances of licenses to be made available, or the result of an
801	    attempt made to obtain a general license or permission for the use
802	    of such proprietary rights by implementers or users of this
803	    specification can be obtained from the IETF on-line IPR repository
804	    at http://www.ietf.org/ipr.

806	    The IETF invites any interested party to bring to its attention any
807	    copyrights, patents or patent applications, or other proprietary
808	    rights that may cover technology that may be required to implement
809	    this standard.  Please address the information to the IETF at
810	    ietf-ipr@ietf.org.

812	Disclaimer of Validity

814	    This document and the information contained herein are provided on
815	    an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
816	    REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
817	    INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
818	    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
819	    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
820	    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

822	Copyright Statement

824	    Copyright (C) The Internet Society (2006).  This document is subject
825	    to the rights, licenses and restrictions contained in BCP 78, and
826	    except as set forth therein, the authors retain all their rights.

828	Acknowledgment

830	    Funding for the RFC Editor function is currently provided by the
831	    Internet Society.