idnits 2.17.1 

draft-ietf-tcpm-rfc2581bis-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008.  The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust.  If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Draft Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)


     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                          M. Allman
2	Internet-Draft                                                 V. Paxson
3	Obsoletes: 2581                                                     ICSI
4	Intended status: Draft Standard                               E. Blanton
5	Expires: January 27 2010                               Purdue University
6	                                                            July 27 2009

8	                         TCP Congestion Control
9	                   draft-ietf-tcpm-rfc2581bis-07.txt

11	Status of this Memo

13	    This Internet-Draft is submitted to IETF in full conformance with
14	    the provisions of BCP 78 and BCP 79.  This document may contain
15	    material from IETF Documents or IETF Contributions published or made
16	    publicly available before November 10, 2008.  The person(s)
17	    controlling the copyright in some of this material may not have
18	    granted the IETF Trust the right to allow modifications of such
19	    material outside the IETF Standards Process.  Without obtaining an
20	    adequate license from the person(s) controlling the copyright in
21	    such materials, this document may not be modified outside the IETF
22	    Standards Process, and derivative works of it may not be created
23	    outside the IETF Standards Process, except to format it for
24	    publication as an RFC or to translate it into languages other than
25	    English.

27	    Internet-Drafts are working documents of the Internet Engineering
28	    Task Force (IETF), its areas, and its working groups.  Note that
29	    other groups may also distribute working documents as
30	    Internet-Drafts.

32	    Internet-Drafts are draft documents valid for a maximum of six
33	    months and may be updated, replaced, or obsoleted by other documents
34	    at any time.  It is inappropriate to use Internet-Drafts as
35	    reference material or to cite them other than as "work in progress."

37	    The list of current Internet-Drafts can be accessed at
38	    http://www.ietf.org/ietf/1id-abstracts.txt.

40	    The list of Internet-Draft Shadow Directories can be accessed at
41	    http://www.ietf.org/shadow.html.

43	Copyright Statement

45	    Copyright (c) 2009 IETF Trust and the persons identified as the
46	    document authors.  All rights reserved.

48	    This document is subject to BCP 78 and the IETF Trust's Legal
49	    Provisions Relating to IETF Documents in effect on the date of
50	    publication of this document (http://trustee.ietf.org/license-info).
51	    Please review these documents carefully, as they describe your
52	    rights and restrictions with respect to this document.

66	Abstract

68	    This document defines TCP's four intertwined congestion control
69	    algorithms: slow start, congestion avoidance, fast retransmit, and
70	    fast recovery.  In addition, the document specifies how TCP should
71	    begin transmission after a relatively long idle period, as well as
72	    discussing various acknowledgment generation methods.  This document
73	    obsoletes RFC 2581.

75	Table Of Contents

77	    1.        Introduction
78	    2.        Definitions
79	    3.        Congestion Control Algorithms
80	    3.1       Slow Start and Congestion Avoidance
81	    3.2       Fast Retransmit/Fast Recovery
82	    4.        Additional Considerations
83	    4.1       Re-starting Idle Connections
84	    4.2       Generating Acknowledgments
85	    4.3       Loss Recovery Mechanisms
86	    5.        Security Considerations
87	    6.        Changes Between RFC 2001 and RFC 2581
88	    7.        Changes Relative to RFC 2581
89	    8.        IANA Considerations

91	1. Introduction

93	    This document specifies four TCP [RFC793] congestion control
94	    algorithms: slow start, congestion avoidance, fast retransmit and
95	    fast recovery.  These algorithms were devised in [Jac88] and
96	    [Jac90]. Their use with TCP is standardized in [RFC1122].
97	    Additional early work in additive-increase, multiplicative-decrease
98	    congestion control is given in [CJ89].

100	    Note that [Ste94] provides examples of these algorithms in action
101	    and [WS95] provides an explanation of the source code for the BSD
102	    implementation of these algorithms.

104	    In addition to specifying these congestion control algorithms, this
105	    document specifies what TCP connections should do after a relatively
106	    long idle period, as well as specifying and clarifying some of the
107	    issues pertaining to TCP ACK generation.

109	    This document obsoletes [RFC2581], which in turn obsoleted
110	    [RFC2001].

112	    This document is organized as follows.  Section 2 provides various
113	    definitions which will be used throughout the document.  Section 3
114	    provides a specification of the congestion control
115	    algorithms.  Section 4 outlines concerns related to the congestion
116	    control algorithms.  Finally, section 5 outlines security
117	    considerations.

119	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
120	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
121	    document are to be interpreted as described in [RFC2119].

123	2. Definitions

125	    This section provides the definition of several terms that will be
126	    used throughout the remainder of this document.

128	    SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
129	        both).

131	    SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
132	        largest segment that the sender can transmit.  This value can be
133	        based on the maximum transmission unit of the network, the path
134	        MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
135	        or other factors.  The size does not include the TCP/IP headers
136	        and options.

138	    RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
139	        largest segment the receiver is willing to accept.  This is the
140	        value specified in the MSS option sent by the receiver during
141	        connection startup or, if the MSS option is not used, 536
142	        bytes [RFC1122].  The size does not include the TCP/IP headers
143	        and options.

145	    FULL-SIZED SEGMENT: A segment that contains the maximum number of
146	        data bytes permitted (i.e., a segment containing SMSS bytes of
147	        data).

149	    RECEIVER WINDOW (rwnd): The most recently advertised receiver
150	        window.

152	    CONGESTION WINDOW (cwnd): A TCP state variable that limits the
153	        amount of data a TCP can send.  At any given time, a TCP MUST
154	        NOT send data with a sequence number higher than the sum of the
155	        highest acknowledged sequence number and the minimum of cwnd and
156	        rwnd.
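
	    As a non-normative illustration, the limit above can be written as
	    a single comparison.  This is a minimal sketch in Python; the
	    function and parameter names are hypothetical, and sequence number
	    wraparound is deliberately ignored.

```python
def may_send(last_seq: int, highest_acked: int, cwnd: int, rwnd: int) -> bool:
    """Return True if data whose last byte carries sequence number
    last_seq stays within the limit imposed by cwnd and rwnd.

    Assumption: sequence numbers are plain integers that never wrap.
    """
    return last_seq <= highest_acked + min(cwnd, rwnd)
```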

158	    INITIAL WINDOW (IW): The initial window is the size of the sender's
159	        congestion window after the three-way handshake is completed.

161	    LOSS WINDOW (LW): The loss window is the size of the congestion
162	        window after a TCP sender detects loss using its retransmission
163	        timer.

165	    RESTART WINDOW (RW): The restart window is the size of the
166	        congestion window after a TCP restarts transmission after an
167	        idle period (if the slow start algorithm is used; see section
168	        4.1 for more discussion).

170	    FLIGHT SIZE: The amount of data that has been sent but not yet
171	        cumulatively acknowledged.

173	    DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
174	        "duplicate" in the following algorithms when (a) the receiver of
175	        the ACK has outstanding data, (b) the incoming acknowledgment
176	        carries no data, (c) the SYN and FIN bits are both off, (d) the
177	        acknowledgment number is equal to the greatest acknowledgment
178	        received on the given connection (SND.UNA from [RFC793]) and (e)
179	        the advertised window in the incoming acknowledgment equals the
180	        advertised window in the last incoming acknowledgment.

182	        Alternatively, a TCP that utilizes selective acknowledgments
183	        [RFC2018,RFC2883] can leverage the SACK information to determine
184	        when an incoming ACK is a "duplicate" (e.g., if the ACK contains
185	        previously unknown SACK information).
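
	    Conditions (a) through (e) can be expressed as a predicate.  The
	    following is a sketch only: the AckSegment type and all field
	    names are hypothetical, and the SACK-based alternative is not
	    modeled.

```python
from dataclasses import dataclass


@dataclass
class AckSegment:
    """Hypothetical view of an incoming segment (illustrative fields only)."""
    ack: int            # acknowledgment number
    window: int         # advertised window
    payload_len: int    # number of data bytes carried
    syn: bool = False
    fin: bool = False


def is_duplicate_ack(seg: AckSegment, snd_una: int,
                     outstanding_bytes: int, last_window: int) -> bool:
    """Apply conditions (a)-(e) of the DUPLICATE ACKNOWLEDGMENT definition."""
    return (
        outstanding_bytes > 0            # (a) the sender has outstanding data
        and seg.payload_len == 0         # (b) the ACK carries no data
        and not seg.syn and not seg.fin  # (c) SYN and FIN are both off
        and seg.ack == snd_una           # (d) same as greatest ACK received
        and seg.window == last_window    # (e) advertised window unchanged
    )
```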

187	3. Congestion Control Algorithms

189	    This section defines the four congestion control algorithms: slow
190	    start, congestion avoidance, fast retransmit and fast recovery,
191	    developed in [Jac88] and [Jac90].  In some situations it may be
192	    beneficial for a TCP sender to be more conservative than the
193	    algorithms allow; however, a TCP MUST NOT be more aggressive than the
194	    following algorithms allow (that is, MUST NOT send data when the
195	    value of cwnd computed by the following algorithms would not allow
196	    the data to be sent).

198	    Also note that the algorithms specified in this document work in
199	    terms of using loss as the signal of congestion.  Explicit
200	    Congestion Notification (ECN) could also be used as specified in
201	    [RFC3168].

203	3.1 Slow Start and Congestion Avoidance

205	    The slow start and congestion avoidance algorithms MUST be used by a
206	    TCP sender to control the amount of outstanding data being injected
207	    into the network.  To implement these algorithms, two variables are
208	    added to the TCP per-connection state.  The congestion window (cwnd)
209	    is a sender-side limit on the amount of data the sender can transmit
210	    into the network before receiving an acknowledgment (ACK), while the
211	    receiver's advertised window (rwnd) is a receiver-side limit on the
212	    amount of outstanding data.  The minimum of cwnd and rwnd governs
213	    data transmission.

215	    Another state variable, the slow start threshold (ssthresh), is used
216	    to determine whether the slow start or congestion avoidance
217	    algorithm is used to control data transmission, as discussed below.

219	    Beginning transmission into a network with unknown conditions
220	    requires TCP to slowly probe the network to determine the available
221	    capacity, in order to avoid congesting the network with an
222	    inappropriately large burst of data.  The slow start algorithm is
223	    used for this purpose at the beginning of a transfer, or after
224	    repairing loss detected by the retransmission timer.  Slow start
225	    additionally serves to start the "ACK clock" used by the TCP sender
226	    to release data into the network in the slow start, congestion
227	    avoidance, and loss recovery algorithms.

229	    IW, the initial value of cwnd, MUST be set using the following
230	    guidelines as an upper bound.

232	    If SMSS > 2190 bytes:
233	        IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
234	    If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
235	        IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
236	    If SMSS <= 1095 bytes:
237	        IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
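
	    The three cases above translate directly into code.  The sketch
	    below (hypothetical function name) returns the upper bound on IW
	    in bytes for a given SMSS.

```python
def initial_window(smss: int) -> int:
    """Upper bound on the initial congestion window, in bytes."""
    if smss > 2190:
        return 2 * smss   # at most 2 segments
    if smss > 1095:
        return 3 * smss   # at most 3 segments
    return 4 * smss       # at most 4 segments
```

	    For example, an SMSS of 1460 bytes yields an upper bound of
	    3 * 1460 = 4380 bytes.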

239	    As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
240	    SYN/ACK MUST NOT increase the size of the congestion window.
241	    Further, if the SYN or SYN/ACK is lost, the initial window used by a
242	    sender after a correctly transmitted SYN MUST be one segment
243	    consisting of at most SMSS bytes.

245	    A detailed rationale and discussion of the IW setting is provided in
246	    [RFC3390].

248	    When initial congestion windows of more than one segment are
249	    implemented along with Path MTU Discovery [RFC1191], and the MSS
250	    being used is found to be too large, the congestion window cwnd
251	    SHOULD be reduced to prevent large bursts of smaller segments.
252	    Specifically, cwnd SHOULD be reduced by the ratio of the old segment
253	    size to the new segment size.

255	    The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
256	    to the size of the largest possible advertised window), but ssthresh
257	    MUST be reduced in response to congestion.  Setting ssthresh as high
258	    as possible allows the network conditions, rather than some
259	    arbitrary host limit, to dictate the sending rate.  In cases where
260	    the end systems have a solid understanding of the network path, more
261	    carefully setting the initial ssthresh value may have merit (e.g.,
262	    such that the end host does not create congestion along the path).

264	    The slow start algorithm is used when cwnd < ssthresh, while the
265	    congestion avoidance algorithm is used when cwnd > ssthresh.  When
266	    cwnd and ssthresh are equal the sender may use either slow start or
267	    congestion avoidance.

269	    During slow start, a TCP increments cwnd by at most SMSS bytes for
270	    each ACK received that cumulatively acknowledges new data.  Slow
271	    start ends when cwnd exceeds ssthresh (or, optionally, when it
272	    reaches it, as noted above) or when congestion is observed.  While
273	    traditionally TCP implementations have increased cwnd by precisely
274	    SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
275	    that TCP implementations increase cwnd per:

277	        cwnd += min (N, SMSS)                      (2)

279	    where N is the number of previously unacknowledged bytes
280	    acknowledged in the incoming ACK.  This adjustment is part of
281	    Appropriate Byte Counting [RFC3465] and provides robustness against
282	    misbehaving receivers which may attempt to induce a sender to
283	    artificially inflate cwnd using a mechanism known as "ACK Division"
284	    [SCWA99].  ACK Division consists of a receiver sending multiple ACKs
285	    for a single TCP data segment, each acknowledging only a portion of
286	    its data.  A TCP that increments cwnd by SMSS for each such ACK will
287	    inappropriately inflate the amount of data injected into the
288	    network.
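
	    A minimal sketch of equation (2) follows; the function name is
	    hypothetical.  It shows why ACK Division gains the receiver
	    nothing when the increase is bounded by the bytes newly
	    acknowledged.

```python
def slow_start_increase(cwnd: int, bytes_acked: int, smss: int) -> int:
    """Apply equation (2): grow cwnd by min(N, SMSS) for one ACK,
    where N (bytes_acked) is the number of newly acknowledged bytes."""
    return cwnd + min(bytes_acked, smss)
```

	    With SMSS = 1460, three ACKs that cover one segment in pieces of
	    500, 500 and 460 bytes grow cwnd by 1460 bytes in total, rather
	    than by 3 * 1460 bytes.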

290	    During congestion avoidance, cwnd is incremented by roughly 1
291	    full-sized segment per round-trip time (RTT).  Congestion avoidance
292	    continues until congestion is detected.  The basic guidelines for
293	    incrementing cwnd during congestion avoidance are:

295	      * MAY increment cwnd by SMSS bytes

297	      * SHOULD increment cwnd per equation (2) once per RTT

299	      * MUST NOT increment cwnd by more than SMSS bytes

301	    We note that [RFC3465] allows for cwnd increases of more than SMSS
302	    bytes for incoming acknowledgments during slow start on an
303	    experimental basis; however, such behavior is not allowed as part of
304	    the standard.

306	    The RECOMMENDED way to increase cwnd during congestion avoidance is
307	    to count the number of bytes that have been acknowledged by ACKs for
308	    new data.  (A drawback of this implementation is that it requires
309	    maintaining an additional state variable.)  When the number of bytes
310	    acknowledged reaches cwnd, then cwnd can be incremented by up to
311	    SMSS bytes.  Note that during congestion avoidance, cwnd MUST NOT be
312	    increased by more than SMSS bytes per RTT.  This method both allows
313	    TCPs to increase cwnd by one segment per RTT in the face of delayed
314	    ACKs and provides robustness against ACK Division attacks.
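
	    The byte-counting method can be sketched as follows.  The class
	    and attribute names are illustrative; bytes_acked is the
	    additional state variable mentioned above, and the sketch always
	    applies the full SMSS increase that the text permits.

```python
class ByteCountingCA:
    """Congestion avoidance by counting newly acknowledged bytes."""

    def __init__(self, cwnd: int, smss: int):
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0   # extra per-connection state

    def on_new_ack(self, newly_acked: int) -> int:
        """Process an ACK for new data and return the updated cwnd."""
        self.bytes_acked += newly_acked
        if self.bytes_acked >= self.cwnd:
            # Roughly one cwnd of data has been acknowledged (about one RTT).
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
        return self.cwnd
```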

316	    Another common formula that a TCP MAY use to update cwnd during
317	    congestion avoidance is given in equation (3):

319	        cwnd += SMSS*SMSS/cwnd                     (3)

321	    This adjustment is executed on every incoming ACK that acknowledges
322	    new data.  Equation (3) provides an acceptable approximation to the
323	    underlying principle of increasing cwnd by 1 full-sized segment per
324	    RTT.  (Note that for a connection in which the receiver is
325	    acknowledging every other packet, (3) is less aggressive than
326	    allowed -- roughly increasing cwnd every second RTT.)

328	    Implementation Note: Since integer arithmetic is usually used in TCP
329	    implementations, the formula given in equation (3) can fail to
330	    increase cwnd when the congestion window is larger than SMSS*SMSS.
331	    If the above formula yields 0, the result SHOULD be rounded up to 1
332	    byte.
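
	    As an illustration of the integer-arithmetic case, the sketch
	    below (hypothetical function name, cwnd kept in bytes) applies
	    equation (3) with the round-up to 1 byte.

```python
def ca_increase_per_ack(cwnd: int, smss: int) -> int:
    """Apply equation (3) using integer division.

    If SMSS*SMSS/cwnd truncates to 0 (cwnd larger than SMSS*SMSS),
    the increment is rounded up to 1 byte.
    """
    increment = (smss * smss) // cwnd
    return cwnd + max(increment, 1)
```

	    With SMSS = 1460 and cwnd = 14600, the increment is 146 bytes per
	    ACK.  With cwnd = 3000000 the division truncates to 0 and the
	    increment becomes 1 byte.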

334	    Implementation Note: Older implementations have an additional
335	    additive constant on the right-hand side of equation (3).  This is
336	    incorrect and can actually lead to diminished performance [RFC2525].

338	    Implementation Note: Some implementations maintain cwnd in units of
339	    bytes, while others in units of full-sized segments.  The latter
340	    will find equation (3) difficult to use, and may prefer to use the
341	    counting approach discussed in the previous paragraph.

343	    When a TCP sender detects segment loss using the retransmission
344	    timer and the given segment has not yet been resent by way of the
345	    retransmission timer, the value of ssthresh MUST be set to no more
346	    than the value given in equation (4):

348	        ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

350	    where, as discussed above, FlightSize is the amount of outstanding
351	    data in the network.

353	    On the other hand, when a TCP sender detects segment loss using the
354	    retransmission timer and the given segment has already been
355	    retransmitted by way of the retransmission timer at least once, the
356	    value of ssthresh is held constant.

358	    Implementation Note: An easy mistake to make is to simply use cwnd,
359	    rather than FlightSize, which in some implementations may
360	    incidentally increase well beyond rwnd.

362	    Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
363	    set to no more than the loss window, LW, which equals 1 full-sized
364	    segment (regardless of the value of IW).  Therefore, after
365	    retransmitting the dropped segment the TCP sender uses the slow
366	    start algorithm to increase the window from 1 full-sized segment to
367	    the new value of ssthresh, at which point congestion avoidance again
368	    takes over.
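
	    The timeout response described above can be sketched as follows.
	    All names are hypothetical.  The sketch sets ssthresh and cwnd to
	    the largest values the text permits; an implementation may choose
	    smaller ones.

```python
def on_retransmission_timeout(flight_size: int, ssthresh: int, smss: int,
                              already_retransmitted_by_timer: bool):
    """Return (cwnd, ssthresh) after the retransmission timer fires."""
    if not already_retransmitted_by_timer:
        # First timeout for this segment: apply equation (4).
        # Use FlightSize here, not cwnd.
        ssthresh = max(flight_size // 2, 2 * smss)
    # Otherwise ssthresh is held constant.
    cwnd = smss   # loss window (LW): 1 full-sized segment
    return cwnd, ssthresh
```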

370	    As shown in [FF96,RFC3782], slow start-based loss recovery after a
371	    timeout can cause spurious retransmissions that trigger duplicate
372	    acknowledgments.  The reaction to the arrival of these duplicate
373	    ACKs in TCP implementations varies widely.  This document does not
374	    specify how to treat such acknowledgments, but does note this as an
375	    area that may benefit from additional attention, experimentation and
376	    specification.

378	3.2 Fast Retransmit/Fast Recovery

379	    A TCP receiver SHOULD send an immediate duplicate ACK when an out-
380	    of-order segment arrives.  The purpose of this ACK is to inform the
381	    sender that a segment was received out-of-order and which sequence
382	    number is expected.  From the sender's perspective, duplicate ACKs
383	    can be caused by a number of network problems.  First, they can be
384	    caused by dropped segments.  In this case, all segments after the
385	    dropped segment will trigger duplicate ACKs until the loss is
386	    repaired.  Second, duplicate ACKs can be caused by the re-ordering
387	    of data segments by the network (not a rare event along some network
388	    paths [Pax97]).  Finally, duplicate ACKs can be caused by
389	    replication of ACK or data segments by the network.  In addition, a
390	    TCP receiver SHOULD send an immediate ACK when the incoming segment
391	    fills in all or part of a gap in the sequence space.  This will
392	    generate more timely information for a sender recovering from a loss
393	    through a retransmission timeout, a fast retransmit, or an advanced
394	    loss recovery algorithm, as outlined in section 4.3.

396	    The TCP sender SHOULD use the "fast retransmit" algorithm to detect
397	    and repair loss, based on incoming duplicate ACKs.  The fast
398	    retransmit algorithm uses the arrival of 3 duplicate ACKs (as
399	    defined in section 2, without any intervening ACKs which move
400	    SND.UNA) as an indication that a segment has been lost.  After
401	    receiving 3 duplicate ACKs, TCP performs a retransmission of what
402	    appears to be the missing segment, without waiting for the
403	    retransmission timer to expire.

405	    After the fast retransmit algorithm sends what appears to be the
406	    missing segment, the "fast recovery" algorithm governs the
407	    transmission of new data until a non-duplicate ACK arrives.  The
408	    reason for not performing slow start is that the receipt of the
409	    duplicate ACKs not only indicates that a segment has been lost, but
410	    also that segments are most likely leaving the network (although a
411	    massive segment duplication by the network can invalidate this
412	    conclusion).  In other words, since the receiver can only generate a
413	    duplicate ACK when a segment has arrived, that segment has left the
414	    network and is in the receiver's buffer, so we know it is no longer
415	    consuming network resources.  Furthermore, since the ACK "clock"
416	    [Jac88] is preserved, the TCP sender can continue to transmit new
417	    segments (although transmission must continue using a reduced cwnd,
418	    since loss is an indication of congestion).

420	    The fast retransmit and fast recovery algorithms are implemented
421	    together as follows.

423	    1.  On the first and second duplicate ACKs received at a sender, a
424	        TCP SHOULD send a segment of previously unsent data per
425	        [RFC3042] provided that the receiver's advertised window allows,
426	        the total FlightSize would remain less than or equal to cwnd
427	        plus 2*SMSS, and that new data is available for transmission.
428	        Further, the TCP sender MUST NOT change cwnd to reflect these
429	        two segments [RFC3042].  Note that a sender using SACK [RFC2018]
430	        MUST NOT send new data unless the incoming duplicate
431	        acknowledgment contains new SACK information.

433	    2.  When the third duplicate ACK is received, a TCP MUST set
434	        ssthresh to no more than the value given in equation (4).  When
435	        [RFC3042] is in use, additional data sent in limited transmit
436	        MUST NOT be included in this calculation.

438	    3.  The lost segment starting at SND.UNA MUST be retransmitted and
439	        cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
440	        the congestion window by the number of segments (three) that
441	        have left the network and which the receiver has buffered.

443	    4.  For each additional duplicate ACK received (after the third),
444	        cwnd MUST be incremented by SMSS.  This artificially inflates
445	        the congestion window in order to reflect the additional segment
446	        that has left the network.

448	        Note: [SCWA99] discusses a receiver-based attack whereby many
449	        bogus duplicate ACKs are sent to the data sender in order to
450	        artificially inflate cwnd and cause a higher than appropriate
451	        sending rate to be used.  A TCP MAY therefore limit the number
452	        of times cwnd is artificially inflated during loss recovery
453	        to the number of outstanding segments (or, an approximation
454	        thereof).

456	        Note: When an advanced loss recovery mechanism (such as outlined
457	        in section 4.3) is not in use, this increase in FlightSize can
458	        cause equation (4) to slightly inflate cwnd and ssthresh, as some
459	        of the segments between SND.UNA and SND.NXT are assumed to have
460	        left the network but are still reflected in FlightSize.

462	    5.  When previously unsent data is available and the new value of
463	        cwnd and the receiver's advertised window allow, a TCP SHOULD
464	        send 1*SMSS bytes of previously unsent data.

466	    6.  When the next ACK arrives that acknowledges previously
467	        unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
468	        set in step 2).  This is termed "deflating" the window.

470	        This ACK should be the acknowledgment elicited by the
471	        retransmission from step 3, one RTT after the retransmission
472	        (though it may arrive sooner in the presence of significant out-
473	        of-order delivery of data segments at the receiver).
474	        Additionally, this ACK should acknowledge all the intermediate
475	        segments sent between the lost segment and the receipt of the
476	        third duplicate ACK, if none of these were lost.

478	    Note: This algorithm is known to generally not recover efficiently
479	    from multiple losses in a single flight of packets [FF96].  Section
480	    4.3 below addresses such cases.
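
	    Steps 1 through 6 can be sketched as a small state machine.  This
	    is a non-normative illustration with hypothetical names.  Actual
	    transmission (limited transmit, the retransmission, and new data
	    in step 5) is represented only by the returned action string, and
	    the caller is assumed to pass a FlightSize that excludes data sent
	    under limited transmit, as step 2 requires.

```python
class FastRecovery:
    """Window bookkeeping for fast retransmit/fast recovery."""

    def __init__(self, cwnd: int, ssthresh: int, smss: int):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.dupacks = 0
        self.in_recovery = False

    def on_duplicate_ack(self, flight_size: int) -> str:
        self.dupacks += 1
        if self.dupacks < 3:
            # Step 1: cwnd is left unchanged; new data may be sent.
            return "limited_transmit"
        if self.dupacks == 3:
            # Step 2: reduce ssthresh per equation (4).
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 3: retransmit and inflate by the three buffered segments.
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
            return "retransmit"
        # Step 4: one more segment has left the network.
        self.cwnd += self.smss
        return "inflate"

    def on_new_ack(self) -> None:
        # Step 6: deflate the window when new data is acknowledged.
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dupacks = 0
```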

482	4. Additional Considerations

484	4.1 Re-starting Idle Connections

485	    A known problem with the TCP congestion control algorithms described
486	    above is that they allow a potentially inappropriate burst of
487	    traffic to be transmitted after TCP has been idle for a relatively
488	    long period of time.  After an idle period, TCP cannot use the ACK
489	    clock to strobe new segments into the network, as all the ACKs have
490	    drained from the network.  Therefore, as specified above, TCP can
491	    potentially send a cwnd-size line-rate burst into the network after
492	    an idle period.  In addition, changing network conditions may have
493	    rendered TCP's notion of the available end-to-end network capacity
494	    between two endpoints, as estimated by cwnd, inaccurate during the
495	    course of a long idle period.

497	    [Jac88] recommends that a TCP use slow start to restart
498	    transmission after a relatively long idle period.  Slow start
499	    serves to restart the ACK clock, just as it does at the beginning
500	    of a transfer.  This mechanism has been widely deployed in the
501	    following manner.  When TCP has not received a segment for more
502	    than one retransmission timeout, cwnd is reduced to the value of
503	    the restart window (RW) before transmission begins.

505	    For the purposes of this standard, we define RW = min(IW,cwnd).

507	    Using the last time a segment was received to determine whether or
508	    not to decrease cwnd can fail to deflate cwnd in the common case of
509	    persistent HTTP connections [HTH98].  In this case, a Web server
510	    receives a request before transmitting data to the Web client.  The
511	    reception of the request makes the test for an idle connection fail,
512	    and allows the TCP to begin transmission with a possibly
513	    inappropriately large cwnd.

515	    Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
516	    transmission if the TCP has not sent data in an interval exceeding
517	    the retransmission timeout.
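
	    A minimal sketch of this check, using RW = min(IW,cwnd) as defined
	    above, is given below.  The function and parameter names are
	    hypothetical; time_since_last_send and rto are assumed to be in
	    the same time unit.

```python
def cwnd_before_restart(cwnd: int, iw: int,
                        time_since_last_send: float, rto: float) -> int:
    """Return the cwnd to use when transmission resumes."""
    if time_since_last_send > rto:
        return min(iw, cwnd)   # restart window (RW)
    return cwnd
```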

519	4.2 Generating Acknowledgments

521	    The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
522	    TCP receiver.  When using delayed ACKs, a TCP receiver MUST NOT
523	    excessively delay acknowledgments.  Specifically, an ACK SHOULD be
524	    generated for at least every second full-sized segment, and MUST be
525	    generated within 500 ms of the arrival of the first unacknowledged
526	    packet.

528	    The requirement that an ACK "SHOULD" be generated for at least every
529	    second full-sized segment is listed in [RFC1122] in one place as a
530	    SHOULD and another as a MUST.  Here we unambiguously state it is a
531	    SHOULD.  We also emphasize that this is a SHOULD, meaning that an
532	    implementor should indeed only deviate from this requirement after
533	    careful consideration of the implications.  See the discussion of
534	    "Stretch ACK violation" in [RFC2525] and the references therein for
535	    a discussion of the possible performance problems with generating
536	    ACKs less frequently than every second full-sized segment.

538	    In some cases, the sender and receiver may not agree on what
539	    constitutes a full-sized segment.  An implementation is deemed to
540	    comply with this requirement if it sends at least one acknowledgment
541	    every time it receives 2*RMSS bytes of new data from the sender,
542	    where RMSS is the Maximum Segment Size specified by the receiver to
543	    the sender (or the default value of 536 bytes, per [RFC1122], if the
544	    receiver does not specify an MSS option during connection
545	    establishment).  The sender may be forced to use a segment size less
546	    than RMSS due to the maximum transmission unit (MTU), the path MTU
547	    discovery algorithm or other factors.  For instance, consider the
548	    case when the receiver announces an RMSS of X bytes but the sender
549	    ends up using a segment size of Y bytes (Y < X) due to path MTU
550	    discovery (or the sender's MTU size).  The receiver will generate
551	    stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
552	    sent.  Clearly this will take more than 2 segments of size Y bytes.
553	    Therefore, while a specific algorithm is not defined, it is
554	    desirable for receivers to attempt to prevent this situation, for
555	    example by acknowledging at least every second segment, regardless
556	    of size.  Finally, we repeat that an ACK MUST NOT be delayed for
557	    more than 500 ms waiting on a second full-sized segment to arrive.
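
One way a receiver might satisfy both bounds is sketched below. It counts segments regardless of size, which is the example remedy for stretch ACKs mentioned above, and it tracks the 500 ms deadline from the arrival of the first unacknowledged segment. The class and method names are hypothetical, times are assumed to be in seconds, and only in-order data is considered.

```python
from typing import Optional

MAX_ACK_DELAY = 0.5  # seconds; the 500 ms upper bound from this section


class DelayedAckState:
    """Tracks in-order segments that have not yet been acknowledged."""

    def __init__(self) -> None:
        self.unacked_segments = 0
        self.first_unacked_arrival: Optional[float] = None

    def _reset(self) -> None:
        self.unacked_segments = 0
        self.first_unacked_arrival = None

    def on_in_order_segment(self, now: float) -> bool:
        """Record a data segment; return True if an ACK should be sent now."""
        if self.first_unacked_arrival is None:
            self.first_unacked_arrival = now
        self.unacked_segments += 1
        if self.unacked_segments >= 2:
            # ACK at least every second segment, regardless of its size.
            self._reset()
            return True
        return False

    def ack_deadline(self) -> Optional[float]:
        """Latest time at which the pending ACK may be sent, if any."""
        if self.first_unacked_arrival is None:
            return None
        return self.first_unacked_arrival + MAX_ACK_DELAY

    def on_timer(self, now: float) -> bool:
        """Return True if the delayed-ACK deadline has been reached."""
        deadline = self.ack_deadline()
        if deadline is not None and now >= deadline:
            self._reset()
            return True
        return False
```

The 500 ms value is only the upper bound permitted here; an implementation is free to use a shorter delay.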

559	    Out-of-order data segments SHOULD be acknowledged immediately, in
560	    order to accelerate loss recovery.  To trigger the fast retransmit
561	    algorithm, the receiver SHOULD send an immediate duplicate ACK when
562	    it receives a data segment above a gap in the sequence space.  To
563	    provide feedback to senders recovering from losses, the receiver
564	    SHOULD send an immediate ACK when it receives a data segment that
565	    fills in all or part of a gap in the sequence space.
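
The two immediate-ACK cases can be expressed as a small predicate. This is a simplified sketch: the names are hypothetical, sequence-number wraparound is ignored, and the segment is assumed to start at or above the next expected sequence number.

```python
def must_ack_immediately(seg_seq: int, rcv_nxt: int,
                         have_buffered_out_of_order: bool) -> bool:
    """Decide whether an arriving data segment calls for an immediate ACK.

    seg_seq: sequence number of the first byte in the arriving segment.
    rcv_nxt: next in-order sequence number the receiver expects.
    have_buffered_out_of_order: True if data above a gap is already queued.
    """
    if seg_seq > rcv_nxt:
        # Segment lies above a gap: send a duplicate ACK at once so the
        # sender can trigger fast retransmit.
        return True
    if seg_seq == rcv_nxt and have_buffered_out_of_order:
        # Segment fills all or part of the gap: ACK at once to give
        # feedback to a sender that is recovering from loss.
        return True
    # Ordinary in-order data: the delayed ACK rules apply instead.
    return False
```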

567	    A TCP receiver MUST NOT generate more than one ACK for every
568	    incoming segment, other than to update the offered window as the
569	    receiving application consumes new data [page 42, RFC793][RFC813].

571	4.3 Loss Recovery Mechanisms

573	    A number of loss recovery algorithms that augment fast retransmit
574	    and fast recovery have been suggested by TCP researchers and
575	    specified in the RFC series.  While some of these algorithms are
576	    based on the TCP selective acknowledgment (SACK) option [RFC2018],
577	    such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
578	    [Hoe96,FF96,RFC3782].  The non-SACK algorithms use "partial
579	    acknowledgments" (ACKs which cover previously unacknowledged data,
580	    but not all the data outstanding when loss was detected) to trigger
581	    retransmissions.  While this document does not standardize any of
582	    the specific algorithms that may improve fast retransmit/fast
583	    recovery, these enhanced algorithms are implicitly allowed, as long
584	    as they follow the general principles of the basic four algorithms
585	    outlined above.

587	    That is, when the first loss in a window of data is detected,
588	    ssthresh MUST be set to no more than the value given by equation
589	    (4).  Second, until all lost segments in the window of data in
590	    question are repaired, the number of segments transmitted in each
591	    RTT MUST be no more than half the number of outstanding segments
592	    when the loss was detected.  Finally, after all loss in the given
593	    window of segments has been successfully retransmitted, cwnd MUST be
594	    set to no more than ssthresh and congestion avoidance MUST be used
595	    to further increase cwnd.  Loss in two successive windows of data,
596	    or the loss of a retransmission, should be taken as two indications
597	    of congestion and, therefore, cwnd (and ssthresh) MUST be lowered
598	    twice in this case.
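
The bounds above can be illustrated numerically. The sketch assumes that equation (4), given earlier in this document, is ssthresh = max(FlightSize / 2, 2*SMSS); the function names and the byte values used in the example are hypothetical.

```python
def ssthresh_upper_bound(flight_size: int, smss: int) -> int:
    """Upper bound on ssthresh at the first loss in a window.

    Assumes equation (4) is max(FlightSize / 2, 2*SMSS), in bytes.
    """
    return max(flight_size // 2, 2 * smss)


def cwnd_after_recovery(cwnd: int, ssthresh: int) -> int:
    """Once all loss in the window is repaired, cwnd is at most ssthresh."""
    return min(cwnd, ssthresh)


# Loss in two successive windows lowers cwnd and ssthresh twice.
smss = 1000
ssthresh_1 = ssthresh_upper_bound(20000, smss)       # first window
cwnd_1 = cwnd_after_recovery(20000, ssthresh_1)
ssthresh_2 = ssthresh_upper_bound(cwnd_1, smss)      # next window, full cwnd in flight
cwnd_2 = cwnd_after_recovery(cwnd_1, ssthresh_2)
```

In this example the second reduction assumes the whole of cwnd was outstanding when the second loss was detected.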

600	    We RECOMMEND that TCP implementers employ some form of advanced loss
601	    recovery that can cope with multiple losses in a window of data.
602	    The algorithms detailed in [RFC3782] and [RFC3517] conform to the
603	    general principles outlined above.  We note that while these are not
604	    the only two algorithms that conform to the above general principles,
605	    these two algorithms have been vetted by the community and are
606	    currently on the standards track.

608	5.  Security Considerations

610	    This document requires a TCP to diminish its sending rate in the
611	    presence of retransmission timeouts and the arrival of duplicate
612	    acknowledgments.  An attacker can therefore impair the performance
613	    of a TCP connection by either causing data packets or their
614	    acknowledgments to be lost, or by forging excessive duplicate
615	    acknowledgments.

617	    In response to the ACK division attack outlined in [SCWA99] this
618	    document RECOMMENDS increasing the congestion window based on the
619	    number of bytes newly acknowledged in each arriving ACK rather than
620	    by a particular constant on each arriving ACK (as outlined in
621	    section 3.1).
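
The effect of ACK division can be seen by comparing the two growth rules during slow start. The sketch assumes the byte-counting rule of section 3.1 is cwnd += min(N, SMSS), where N is the number of bytes newly acknowledged; the function names and the SMSS value are hypothetical.

```python
from typing import List

SMSS = 1460  # bytes; an illustrative sender maximum segment size


def grow_per_ack(cwnd: int, acked_bytes: List[int]) -> int:
    """Increase cwnd by a constant SMSS for every arriving ACK."""
    for _ in acked_bytes:
        cwnd += SMSS
    return cwnd


def grow_byte_counting(cwnd: int, acked_bytes: List[int]) -> int:
    """Increase cwnd by the bytes newly acknowledged, capped at SMSS per ACK."""
    for n in acked_bytes:
        cwnd += min(n, SMSS)
    return cwnd


honest = [1460]        # one ACK covering a full segment
divided = [365] * 4    # the same 1460 bytes acknowledged in four pieces
```

With a starting cwnd of 2920 bytes, the per-ACK rule grows cwnd to 4380 for the honest receiver but to 8760 for the dividing receiver, while byte counting yields 4380 in both cases.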

623	    The Internet to a considerable degree relies on the correct
624	    implementation of these algorithms in order to preserve network
625	    stability and avoid congestion collapse.  An attacker could cause
626	    TCP endpoints to respond more aggressively in the face of congestion
627	    by forging excessive duplicate acknowledgments or excessive
628	    acknowledgments for new data.  Conceivably, such an attack could
629	    drive a portion of the network into congestion collapse.

631	6.  Changes Between RFC 2001 and RFC 2581

633	    [RFC2001] was extensively rewritten editorially and it is not
634	    feasible to itemize the list of changes between [RFC2001] and
635	    [RFC2581]. The intention of [RFC2581] was to not change any of the
636	    recommendations given in [RFC2001], but to further clarify cases
637	    that were not discussed in detail in [RFC2001]. Specifically,
638	    [RFC2581] suggested what TCP connections should do after a
639	    relatively long idle period, as well as specified and clarified
640	    some of the issues pertaining to TCP ACK generation.  Finally, the
641	    allowable upper bound for the initial congestion window was raised
642	    from one to two segments.

644	7.  Changes Relative to RFC 2581

646	    A specific definition for "duplicate acknowledgment" has been
647	    added, based on the definition used by BSD TCP.

649	    The document now notes that what to do with duplicate ACKs after the
650	    retransmission timer has fired is future work and explicitly
651	    unspecified in this document.

653	    The initial window requirements were changed to allow Larger
654	    Initial Windows as standardized in [RFC3390].  Additionally, the
655	    steps to take when an initial window is discovered to be too large
656	    due to Path MTU Discovery [RFC1191] are detailed.

658	    The recommended initial value for ssthresh has been changed to say
659	    that it SHOULD be arbitrarily high, where it was previously MAY.
660	    This is to provide additional guidance to implementors on the
661	    matter.

663	    During slow start, the usage of Appropriate Byte Counting [RFC3465]
664	    with L=1*SMSS is explicitly recommended.  The method of increasing
665	    cwnd given in [RFC2581] is still explicitly allowed.  Byte counting
666	    during congestion avoidance is also recommended, while the method
667	    from [RFC2581] and other safe methods are still allowed.

669	    The treatment of ssthresh on retransmission timeout was clarified.
670	    In particular, ssthresh must be set to half the FlightSize on the
671	    first retransmission of a given segment and then is held constant on
672	    subsequent retransmissions of the same segment.

674	    The description of fast retransmit and fast recovery has been
675	    clarified, and the use of Limited Transmit [RFC3042] is now
676	    recommended.

678	    TCPs now MAY limit the number of duplicate ACKs that artificially
679	    inflate cwnd during loss recovery to the number of segments
680	    outstanding to avoid the duplicate ACK spoofing attack described in
681	    [SCWA99].

683	    The restart window has been changed to min(IW,cwnd) from IW.  This
684	    behavior was described as "experimental" in [RFC2581].

686	    It is now recommended that TCP implementors implement an advanced
687	    loss recovery algorithm conforming to the principles outlined in
688	    this document.

690	    The security considerations have been updated to discuss ACK
691	    division and recommend byte counting as a counter to this attack.

693	8.  IANA Considerations

695	    This document contains no IANA considerations, but apparently an
696	    Internet *Draft* can no longer be published without this section.

698	Acknowledgments

700	    The core algorithms we describe were developed by Van Jacobson
702	    [Jac88, Jac90].  In addition, Limited Transmit [RFC3042] was
703	    developed in conjunction with Hari Balakrishnan and Sally Floyd.
704	    The initial congestion window size specified in this document is a
705	    result of work with Sally Floyd and Craig Partridge
706	    [RFC2414,RFC3390].

708	    W. Richard ("Rich") Stevens wrote the first version of this document
709	    [RFC2001] and co-authored the second version [RFC2581].  This
710	    present version much benefits from his clarity and thoughtfulness of
711	    description, and we are grateful for Rich's contributions in
712	    elucidating TCP congestion control, as well as in more broadly
713	    helping us understand numerous issues relating to networking.

715	    We wish to emphasize that the shortcomings and mistakes of this
716	    document are solely the responsibility of the current authors.

718	    Some of the text from this document is taken from "TCP/IP
719	    Illustrated, Volume 1: The Protocols" by W. Richard Stevens
720	    (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
721	    Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
722	    Wesley, 1995).  This material is used with the permission of
723	    Addison-Wesley.

725	    Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
726	    Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
727	    Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch
728	    contributed a number of helpful suggestions.

730	Normative References

732	    [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
733	        793, September 1981.

735	    [RFC1122] Braden, R., "Requirements for Internet Hosts --
736	        Communication Layers", STD 3, RFC 1122, October 1989.

738	    [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
739	        November 1990.

741	Informative References

743	    [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
744	        Algorithms for Congestion Avoidance in Computer Networks",
745	        Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
746	        pp. 1-14, June 1989.

748	    [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
749	        Tahoe, Reno and SACK TCP", Computer Communication Review, July
750	        1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

752	    [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
753	        Control Scheme for TCP", In ACM SIGCOMM, August 1996.

755	    [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
756	        Slow-Start Restart After Idle", Work in Progress.

758	    [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
759	        Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
760	        ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

762	    [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
763	        end2end-interest mailing list, April 30, 1990.
764	        ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

766	    [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
767	        TCP Congestion Control", Proceedings of SIGCOMM'96, August,
768	        1996, Stanford, CA.  Available
769	        from http://www.psc.edu/networking/papers/papers.html

771	    [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
772	        Parameters", Technical report.  Available from
773	        http://www.psc.edu/networking/papers/FACKnotes/current.

775	    [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
776	        Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

778	    [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
779	        813, July 1982.

781	    [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
782	        Retransmit, and Fast Recovery Algorithms", RFC 2001, January
783	        1997.

785	    [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
786	        Selective Acknowledgement Options", RFC 2018, October 1996.

788	    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
789	        Requirement Levels", BCP 14, RFC 2119, March 1997.

791	    [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
792	        Initial Window Size", RFC 2414, September 1998.

794	    [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner,
795	        J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
796	        Implementation Problems", RFC 2525, March 1999.

798	    [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
799	        Control", RFC 2581, April 1999.

801	    [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
802	        Extension to the Selective Acknowledgement (SACK) Option for
803	        TCP", RFC 2883, July 2000.

805	    [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
806	        Timer", RFC 2988, November 2000.

808	    [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
809	        TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
810	        2001.

812	    [RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition
813	        of Explicit Congestion Notification (ECN) to IP", RFC 3168,
814	        September 2001.

816	    [RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
817	        Initial Window", RFC 3390, October 2002.

819	    [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
820	        Counting (ABC)", RFC 3465, February 2003.

822	    [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
823	        Conservative Selective Acknowledgment (SACK)-based Loss Recovery
824	        Algorithm for TCP", RFC 3517, April 2003.

826	    [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
827	        Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
828	        2004.

830	    [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
831	        Discovery", RFC 4821, March 2007.

833	    [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
834	        "TCP Congestion Control With a Misbehaving Receiver", ACM
835	        Computer Communication Review, 29(5), October 1999.

837	    [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
838	        Addison-Wesley, 1994.

840	    [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
841	        Implementation", Addison-Wesley, 1995.

843	Authors' Addresses

845	    Mark Allman
846	    International Computer Science Institute (ICSI)
847	    1947 Center Street
848	    Suite 600
849	    Berkeley, CA 94704-1198
850	    Phone: +1 440 235 1792
851	    EMail: mallman@icir.org
852	    http://www.icir.org/mallman/

854	    Vern Paxson
855	    International Computer Science Institute (ICSI)
856	    1947 Center Street
857	    Suite 600
858	    Berkeley, CA 94704-1198
859	    Phone: +1 510/642-4274 x302
860	    EMail: vern@icir.org
861	    http://www.icir.org/vern/

862	    Ethan Blanton
863	    Purdue University Computer Sciences
864	    305 North University Street
865	    West Lafayette, IN  47907
866	    EMail: eblanton@cs.purdue.edu
867	    http://www.cs.purdue.edu/homes/eblanton/

869	Acknowledgment

871	    Funding for the RFC Editor function is currently provided by the
872	    Internet Society.