idnits 2.17.1 

draft-ietf-tcpm-rfc2581bis-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 855.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 831.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 838.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 844.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (April 2008) is 5855 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Flo94' is defined on line 699, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  813
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 3782
     (Obsoleted by RFC 6582)


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 14 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                          M. Allman
2	Internet-Draft                                                 V. Paxson
3	Expires: October 2008                                               ICSI
4	                                                              E. Blanton
5	                                                       Purdue University
6	                                                              April 2008

8	                         TCP Congestion Control
9	                   draft-ietf-tcpm-rfc2581bis-04.txt

11	Status of this Memo

13	    By submitting this Internet-Draft, each author represents that any
14	    applicable patent or other IPR claims of which he or she is aware
15	    have been or will be disclosed, and any of which he or she becomes
16	    aware will be disclosed, in accordance with Section 6 of BCP 79.

18	    Internet-Drafts are working documents of the Internet Engineering
19	    Task Force (IETF), its areas, and its working groups.  Note that
20	    other groups may also distribute working documents as
21	    Internet-Drafts.

23	    Internet-Drafts are draft documents valid for a maximum of six
24	    months and may be updated, replaced, or obsoleted by other documents
25	    at any time.  It is inappropriate to use Internet-Drafts as
26	    reference material or to cite them other than as "work in progress."

28	    The list of current Internet-Drafts can be accessed at
29	    http://www.ietf.org/ietf/1id-abstracts.txt.

31	    The list of Internet-Draft Shadow Directories can be accessed at
32	    http://www.ietf.org/shadow.html.

34	Abstract

36	    This document defines TCP's four intertwined congestion control
37	    algorithms: slow start, congestion avoidance, fast retransmit, and
38	    fast recovery.  In addition, the document specifies how TCP should
39	    begin transmission after a relatively long idle period and
40	    discusses various acknowledgment generation methods.

42	1. Introduction

44	    This document specifies four TCP [RFC793] congestion control
45	    algorithms: slow start, congestion avoidance, fast retransmit and
46	    fast recovery.  These algorithms were devised in [Jac88] and
47	    [Jac90]. Their use with TCP is standardized in [RFC1122].
48	    Additional early work in additive-increase, multiplicative-decrease
49	    congestion control is given in [CJ89].

51	    This document obsoletes [RFC2581], which in turn obsoleted
52	    [RFC2001].

54	    In addition to specifying the congestion control algorithms, this
55	    document specifies what TCP connections should do after a relatively
56	    long idle period, as well as specifying and clarifying some of the
57	    issues pertaining to TCP ACK generation.

59	    Note that [Ste94] provides examples of these algorithms in action
60	    and [WS95] provides an explanation of the source code for the BSD
61	    implementation of these algorithms.

63	    This document is organized as follows.  Section 2 provides various
64	    definitions which will be used throughout the document.  Section 3
65	    provides a specification of the congestion control
66	    algorithms.  Section 4 outlines concerns related to the congestion
67	    control algorithms and, finally, section 5 outlines security
68	    considerations.

70	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
71	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
72	    document are to be interpreted as described in [RFC2119].

74	2. Definitions

76	    This section provides the definition of several terms that will be
77	    used throughout the remainder of this document.

79	    SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
80	        both).

82	    SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
83	        largest segment that the sender can transmit.  This value can be
84	        based on the maximum transmission unit of the network, the path
85	        MTU discovery [RFC1191,RFC4821] algorithm, RMSS (see next item),
86	        or other factors.  The size does not include the TCP/IP headers
87	        and options.

89	    RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
90	        largest segment the receiver is willing to accept.  This is the
91	        value specified in the MSS option sent by the receiver during
92	        connection startup or, if the MSS option is not used, 536
93	        bytes [RFC1122].  The size does not include the TCP/IP headers
94	        and options.

96	    FULL-SIZED SEGMENT: A segment that contains the maximum number of
97	        data bytes permitted (i.e., a segment containing SMSS bytes of
98	        data).

100	    RECEIVER WINDOW (rwnd): The most recently advertised receiver
101	        window.

103	    CONGESTION WINDOW (cwnd): A TCP state variable that limits the
104	        amount of data a TCP can send.  At any given time, a TCP MUST
105	        NOT send data with a sequence number higher than the sum of the
106	        highest acknowledged sequence number and the minimum of cwnd and
107	        rwnd.

109	    INITIAL WINDOW (IW): The initial window is the size of the sender's
110	        congestion window after the three-way handshake is completed.

112	    LOSS WINDOW (LW): The loss window is the size of the congestion
113	        window after a TCP sender detects loss using its retransmission
114	        timer.

116	    RESTART WINDOW (RW): The restart window is the size of the
117	        congestion window after a TCP restarts transmission after an
118	        idle period (if the slow start algorithm is used; see section
119	        4.1 for more discussion).

121	    FLIGHT SIZE: The amount of data that has been sent but not yet
122	        cumulatively acknowledged.

124	    DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
125	        "duplicate" in the following algorithms when (a) the receiver of
126	        the ACK has outstanding data, (b) the incoming acknowledgment
127	        carries no data, (c) the SYN and FIN bits are both off, (d) the
128	        acknowledgment number is equal to the greatest acknowledgment
129	        received on the given connection (SND.UNA from [RFC793]) and (e)
130	        the advertised window in the incoming acknowledgment equals the
131	        advertised window in the last incoming acknowledgment.

133	        Alternatively, a TCP that utilizes selective acknowledgments
134	        [RFC2018,RFC2883] can leverage the SACK information to determine
135	        when an incoming ACK is a "duplicate" (e.g., if the ACK contains
136	        previously unknown SACK information).
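
As an illustration only, conditions (a) through (e) of the non-SACK duplicate acknowledgment definition can be written as a single predicate.  The sketch below is in Python; the Ack record, its field names, and the function name are hypothetical and are not defined by this document.  The SACK-based alternative is not covered.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    """Fields of an incoming segment that matter for the test (hypothetical)."""
    ack_number: int    # acknowledgment number carried by the segment
    payload_len: int   # number of data bytes carried by the segment
    syn: bool          # SYN bit
    fin: bool          # FIN bit
    window: int        # advertised window carried by the segment


def is_duplicate_ack(ack, snd_una, outstanding_bytes, last_window):
    """Return True only when conditions (a)-(e) all hold."""
    return (
        outstanding_bytes > 0              # (a) receiver of the ACK has outstanding data
        and ack.payload_len == 0           # (b) the ACK carries no data
        and not ack.syn and not ack.fin    # (c) SYN and FIN are both off
        and ack.ack_number == snd_una      # (d) equals the greatest ACK received
        and ack.window == last_window      # (e) advertised window is unchanged
    )
```

Condition (e) means that a pure window update is not counted as a duplicate, even though its acknowledgment number repeats.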

138	3. Congestion Control Algorithms

140	    This section defines the four congestion control algorithms: slow
141	    start, congestion avoidance, fast retransmit and fast recovery,
142	    developed in [Jac88] and [Jac90].  In some situations it may be
143	    beneficial for a TCP sender to be more conservative than the
144	    algorithms allow; however, a TCP MUST NOT be more aggressive than the
145	    following algorithms allow (that is, MUST NOT send data when the
146	    value of cwnd computed by the following algorithms would not allow
147	    the data to be sent).

149	    Also note that the algorithms specified in this document work in
150	    terms of using loss as the signal of congestion.  Explicit
151	    Congestion Notification (ECN) could also be used as specified in
152	    [RFC3168].

154	3.1 Slow Start and Congestion Avoidance

156	    The slow start and congestion avoidance algorithms MUST be used by a
157	    TCP sender to control the amount of outstanding data being injected
158	    into the network.  To implement these algorithms, two variables are
159	    added to the TCP per-connection state.  The congestion window (cwnd)
160	    is a sender-side limit on the amount of data the sender can transmit
161	    into the network before receiving an acknowledgment (ACK), while the
162	    receiver's advertised window (rwnd) is a receiver-side limit on the
163	    amount of outstanding data.  The minimum of cwnd and rwnd governs
164	    data transmission.

166	    Another state variable, the slow start threshold (ssthresh), is used
167	    to determine whether the slow start or congestion avoidance
168	    algorithm is used to control data transmission, as discussed below.

170	    Beginning transmission into a network with unknown conditions
171	    requires TCP to slowly probe the network to determine the available
172	    capacity, in order to avoid congesting the network with an
173	    inappropriately large burst of data.  The slow start algorithm is
174	    used for this purpose at the beginning of a transfer, or after
175	    repairing loss detected by the retransmission timer.  Slow start
176	    additionally serves to start the "ACK clock" used by the TCP sender
177	    to release data into the network in the slow start, congestion
178	    avoidance, and loss recovery algorithms.

180	    IW, the initial value of cwnd, MUST be set using the following
181	    guidelines as an upper bound.

183	    If SMSS > 2190 bytes:
184	        IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
185	    If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
186	        IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
187	    If SMSS <= 1095 bytes:
188	        IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
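
The three cases above translate directly into a small function.  This is a minimal sketch in Python; the function name is hypothetical, and the returned value is the upper bound on IW, not a required setting.

```python
def initial_window(smss):
    """Upper bound on IW, in bytes, for a given SMSS (in bytes)."""
    if smss > 2190:
        segments = 2
    elif smss > 1095:
        segments = 3
    else:
        segments = 4
    return segments * smss
```

For example, an SMSS of 1460 bytes falls in the middle case and yields an upper bound of 3 segments (4380 bytes), while an SMSS of 536 bytes yields 4 segments (2144 bytes).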

190	    As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
191	    SYN/ACK MUST NOT increase the size of the congestion window.
192	    Further, if the SYN or SYN/ACK is lost, the initial window used by a
193	    sender after a correctly transmitted SYN MUST be one segment
194	    consisting of at most SMSS bytes.

196	    A detailed rationale and discussion of the IW setting is provided in
197	    [RFC3390].

199	    When initial congestion windows of more than one segment are
200	    implemented along with Path MTU Discovery [RFC1191], and the MSS
201	    being used is found to be too large, the congestion window cwnd
202	    SHOULD be reduced to prevent large bursts of smaller segments.
203	    Specifically, cwnd SHOULD be reduced by the ratio of the old segment
204	    size to the new segment size.
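
One reading of "reduced by the ratio of the old segment size to the new segment size" is that cwnd is scaled so that it covers the same number of segments as before.  The Python sketch below assumes that reading; the function and parameter names are hypothetical.

```python
def cwnd_after_mss_reduction(cwnd, old_mss, new_mss):
    """Scale cwnd (bytes) down when Path MTU Discovery lowers the MSS."""
    if new_mss >= old_mss:
        return cwnd  # MSS did not shrink; nothing to do
    # Dividing by (old_mss / new_mss) keeps the segment count constant.
    return cwnd * new_mss // old_mss
```

With cwnd = 4380 bytes (three 1460-byte segments) and a new MSS of 536 bytes, the result is 1608 bytes, which is still three segments rather than a burst of eight.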

206	    The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
207	    to the size of the largest possible advertised window), but ssthresh
208	    MUST be reduced in response to congestion.  Setting ssthresh as high
209	    as possible allows the network conditions, rather than some
210	    arbitrary host limit, to dictate the sending rate.  In cases where
211	    the end systems have a solid understanding of the network path, more
212	    carefully setting the initial ssthresh value may have merit (e.g.,
213	    such that the end host does not create congestion along the path).

215	    The slow start algorithm is used when cwnd < ssthresh, while the
216	    congestion avoidance algorithm is used when cwnd > ssthresh.  When
217	    cwnd and ssthresh are equal the sender may use either slow start or
218	    congestion avoidance.

220	    During slow start, a TCP increments cwnd by at most SMSS bytes for
221	    each ACK received that cumulatively acknowledges new data.  Slow
222	    start ends when cwnd exceeds ssthresh (or, optionally, when it
223	    reaches it, as noted above) or when congestion is observed.  While
224	    traditionally TCP implementations have increased cwnd by precisely
225	    SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
226	    that TCP implementations increase cwnd, per:

228	        cwnd += min (N, SMSS)                      (2)

230	    where N is the number of previously unacknowledged bytes
231	    acknowledged in the incoming ACK.  This adjustment is part of
232	    Appropriate Byte Counting [RFC3465] and provides robustness against
233	    misbehaving receivers which may attempt to induce a sender to
234	    artificially inflate cwnd using a mechanism known as "ACK Division"
235	    [SCWA99].  ACK Division consists of a receiver sending multiple ACKs
236	    for a single TCP data segment, each acknowledging only a portion of
237	    its data.  A TCP that increments cwnd by SMSS for each such ACK will
238	    inappropriately inflate the amount of data injected into the
239	    network.
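
The effect of equation (2) under ACK Division can be seen in a short sketch.  The Python code below is illustrative only; the function names are hypothetical, and the "naive" variant models the traditional per-ACK increase described above.

```python
def slow_start_on_ack(cwnd, bytes_newly_acked, smss):
    """Equation (2): cwnd += min(N, SMSS) for an ACK covering new data."""
    if bytes_newly_acked <= 0:
        return cwnd  # ACK covers no new data; no increase
    return cwnd + min(bytes_newly_acked, smss)


def naive_slow_start_on_ack(cwnd, smss):
    """Traditional behavior: add a full SMSS for every ACK of new data."""
    return cwnd + smss


def after_divided_acks(cwnd, smss, pieces):
    """Apply both rules to one segment acknowledged in `pieces` equal parts."""
    abc, naive = cwnd, cwnd
    for _ in range(pieces):
        abc = slow_start_on_ack(abc, smss // pieces, smss)
        naive = naive_slow_start_on_ack(naive, smss)
    return abc, naive
```

Starting from cwnd = 2920 bytes with SMSS = 1460 bytes, a receiver that splits the acknowledgment of one segment into four ACKs of 365 bytes each raises cwnd to 4380 bytes under equation (2), but to 8760 bytes under the traditional rule.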

241	    During congestion avoidance, cwnd is incremented by roughly 1
242	    full-sized segment per round-trip time (RTT).  Congestion avoidance
243	    continues until congestion is detected.  The basic guidelines for
244	    incrementing cwnd during congestion avoidance are:

246	      * MAY increment cwnd by SMSS bytes

248	      * SHOULD increment cwnd per equation (2) once per RTT

250	      * MUST NOT increment cwnd by more than SMSS bytes

252	    We note that [RFC3465] allows for cwnd increases of more than SMSS
253	    bytes for incoming acknowledgments during slow start on an
254	    experimental basis; however, such behavior is not allowed as part of
255	    the standard.

257	    The RECOMMENDED way to increase cwnd during congestion avoidance is
258	    to count the number of bytes that have been acknowledged by ACKs for
259	    new data.  (A drawback of this implementation is that it requires
260	    maintaining an additional state variable.)  When the number of bytes
261	    acknowledged reaches cwnd, then cwnd can be incremented by up to
262	    SMSS bytes.  Note that during congestion avoidance, cwnd MUST NOT be
263	    increased by more than SMSS bytes per RTT.  This method both allows
264	    TCPs to increase cwnd by one segment per RTT in the face of delayed
265	    ACKs and provides robustness against ACK Division attacks.
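
A minimal Python sketch of this byte-counting approach follows.  The class and attribute names are hypothetical.  Carrying the remainder forward by subtracting cwnd from the counter is one possible choice (it follows the description in [RFC3465]); this document only requires that the increase not exceed SMSS bytes per RTT.

```python
class ByteCountingCA:
    """Congestion avoidance driven by a count of newly acknowledged bytes."""

    def __init__(self, cwnd, smss):
        self.cwnd = cwnd        # congestion window, in bytes
        self.smss = smss        # sender maximum segment size, in bytes
        self.bytes_acked = 0    # the additional state variable

    def on_ack(self, bytes_newly_acked):
        """Process an ACK that acknowledges `bytes_newly_acked` new bytes."""
        self.bytes_acked += bytes_newly_acked
        if self.bytes_acked >= self.cwnd:
            # Roughly one window of data has been acknowledged (about one RTT).
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
```

With cwnd = 14600 bytes, SMSS = 1460 bytes, and a receiver that acknowledges every second segment, the fifth ACK (each covering 2920 bytes) raises cwnd to 16060 bytes.  The increase is one segment per window regardless of how many ACKs the receiver sends.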

267	    Another common formula that a TCP MAY use to update cwnd during
268	    congestion avoidance is given in equation 3:

270	        cwnd += SMSS*SMSS/cwnd                     (3)

272	    This adjustment is executed on every incoming ACK that acknowledges
273	    new data.  Equation (3) provides an acceptable approximation to the
274	    underlying principle of increasing cwnd by 1 full-sized segment per
275	    RTT.  (Note that for a connection in which the receiver is
276	    acknowledging every other packet, (3) is less aggressive than
277	    allowed -- roughly increasing cwnd every second RTT.)

279	    Implementation Note: Since integer arithmetic is usually used in TCP
280	    implementations, the formula given in equation 3 can fail to
281	    increase cwnd when the congestion window is larger than SMSS*SMSS.
282	    If the above formula yields 0, the result SHOULD be rounded up to 1
283	    byte.
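
The rounding issue can be illustrated with integer division.  The Python sketch below is not normative; the function name is hypothetical.

```python
def ca_increment(cwnd, smss):
    """Per-ACK increase from equation (3), using integer arithmetic."""
    increment = (smss * smss) // cwnd
    # When cwnd > SMSS*SMSS the quotient truncates to 0; round up to 1 byte.
    return max(increment, 1)
```

With SMSS = 1460 bytes, SMSS*SMSS is 2131600.  A cwnd of 14600 bytes gives an increase of 146 bytes per ACK, while a cwnd of 3000000 bytes would give 0 without the final adjustment and gives 1 byte with it.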

285	    Implementation Note: Older implementations have an additional
286	    additive constant on the right-hand side of equation (3).  This is
287	    incorrect and can actually lead to diminished performance [RFC2525].

289	    Implementation Note: Some implementations maintain cwnd in units of
290	    bytes, while others in units of full-sized segments.  The latter
291	    will find equation (3) difficult to use, and may prefer to use the
292	    byte-counting approach described earlier in this section.

294	    When a TCP sender detects segment loss using the retransmission
295	    timer and the given segment has not yet been retransmitted, the
296	    value of ssthresh MUST be set to no more than the value given in
297	    equation 4:

299	        ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

301	    where, as discussed above, FlightSize is the amount of outstanding
302	    data in the network.

304	    On the other hand, when a TCP sender detects segment loss using the
305	    retransmission timer and the given segment has already been
306	    retransmitted by way of the retransmission timer at least once, the
307	    value of ssthresh is held constant.

309	    Implementation Note: An easy mistake to make is to use cwnd rather
310	    than FlightSize in equation (4); in some implementations cwnd may
311	    incidentally increase well beyond rwnd.

313	    Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
314	    set to no more than the loss window, LW, which equals 1 full-sized
315	    segment (regardless of the value of IW).  Therefore, after
316	    retransmitting the dropped segment the TCP sender uses the slow
317	    start algorithm to increase the window from 1 full-sized segment to
318	    the new value of ssthresh, at which point congestion avoidance again
319	    takes over.
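
The response to a retransmission timeout described above can be summarized in a few lines.  This Python sketch is illustrative; the function and parameter names are hypothetical, and it sets ssthresh and cwnd to the largest values the text permits (an implementation may choose smaller ones).

```python
def on_retransmission_timeout(flight_size, smss, ssthresh, already_retransmitted):
    """Return the new (ssthresh, cwnd) pair, in bytes, after the timer fires."""
    if not already_retransmitted:
        # First timeout for this segment: apply equation (4).
        ssthresh = max(flight_size // 2, 2 * smss)
    # Otherwise ssthresh is held constant.
    cwnd = smss  # loss window LW: one full-sized segment, regardless of IW
    return ssthresh, cwnd
```

For FlightSize = 20000 bytes and SMSS = 1460 bytes, the first timeout gives ssthresh = 10000 and cwnd = 1460.  A second timeout for the same segment leaves ssthresh at 10000 and again sets cwnd to 1460.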

321	    As shown in [FF96,RFC3782], slow start-based loss recovery after a
322	    timeout can cause spurious retransmissions that trigger duplicate
323	    acknowledgments.  The reaction to the arrival of these duplicate
324	    ACKs in TCP implementations varies widely.  This document does not
325	    specify how to treat such acknowledgments, but does note this as an
326	    area that may benefit from additional attention, experimentation and
327	    specification.

329	3.2 Fast Retransmit/Fast Recovery

331	    A TCP receiver SHOULD send an immediate duplicate ACK when an out-
332	    of-order segment arrives.  The purpose of this ACK is to inform the
333	    sender that a segment was received out-of-order and which sequence
334	    number is expected.  From the sender's perspective, duplicate ACKs
335	    can be caused by a number of network problems.  First, they can be
336	    caused by dropped segments.  In this case, all segments after the
337	    dropped segment will trigger duplicate ACKs until the loss is
338	    repaired.  Second, duplicate ACKs can be caused by the re-ordering
339	    of data segments by the network (not a rare event along some network
340	    paths [Pax97]).  Finally, duplicate ACKs can be caused by
341	    replication of ACK or data segments by the network.  In addition, a
342	    TCP receiver SHOULD send an immediate ACK when the incoming segment
343	    fills in all or part of a gap in the sequence space.  This will
344	    generate more timely information for a sender recovering from a loss
345	    through a retransmission timeout, a fast retransmit, or an advanced
346	    loss recovery algorithm, as outlined in section 4.3.

348	    The TCP sender SHOULD use the "fast retransmit" algorithm to detect
349	    and repair loss, based on incoming duplicate ACKs.  The fast
350	    retransmit algorithm uses the arrival of 3 duplicate ACKs (as
351	    defined in section 2, without any intervening ACKs which move
352	    SND.UNA) as an indication that a segment has been lost.  After
353	    receiving 3 duplicate ACKs, TCP performs a retransmission of what
354	    appears to be the missing segment, without waiting for the
355	    retransmission timer to expire.

357	    After the fast retransmit algorithm sends what appears to be the
358	    missing segment, the "fast recovery" algorithm governs the
359	    transmission of new data until a non-duplicate ACK arrives.  The
360	    reason for not performing slow start is that the receipt of the
361	    duplicate ACKs not only indicates that a segment has been lost, but
362	    also that segments are most likely leaving the network (although a
363	    massive segment duplication by the network can invalidate this
364	    conclusion).  In other words, since the receiver can only generate a
365	    duplicate ACK when a segment has arrived, that segment has left the
366	    network and is in the receiver's buffer, so we know it is no longer
367	    consuming network resources.  Furthermore, since the ACK "clock"
368	    [Jac88] is preserved, the TCP sender can continue to transmit new
369	    segments (although transmission must continue using a reduced cwnd,
370	    since loss is an indication of congestion).

372	    The fast retransmit and fast recovery algorithms are implemented
373	    together as follows.

375	    1.  On the first and second duplicate ACKs received at a sender, a
376	        TCP SHOULD send a segment of previously unsent data per
377	        [RFC3042] provided that the receiver's advertised window allows,
378	        the total FlightSize would remain less than or equal to cwnd
379	        plus 2*SMSS, and that new data is available for transmission.
380	        Further, the TCP sender MUST NOT change cwnd to reflect these
381	        two segments [RFC3042].  Note that a sender using SACK [RFC2018]
382	        MUST NOT send new data unless the incoming duplicate
383	        acknowledgment contains new SACK information.

385	    2.  When the third duplicate ACK is received, a TCP MUST set
386	        ssthresh to no more than the value given in equation 4.  When
387	        [RFC3042] is in use, additional data sent in limited transmit
388	        MUST NOT be included in this calculation.

390	    3.  The lost segment starting at SND.UNA MUST be retransmitted and
391	        cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
392	        the congestion window by the number of segments (three) that
393	        have left the network and which the receiver has buffered.

395	    4.  For each additional duplicate ACK received (after the third),
396	        cwnd MUST be incremented by SMSS.  This artificially inflates
397	        the congestion window in order to reflect the additional segment
398	        that has left the network.

400	        Note: [SCWA99] discusses a receiver-based attack whereby many
401	        bogus duplicate ACKs are sent to the data sender in order to
402	        artificially inflate cwnd and cause a higher than appropriate
403	        sending rate to be used.  A TCP MAY therefore limit the number
404	        of times cwnd is artificially inflated during loss recovery
405	        to the number of outstanding segments (or, an approximation
406	        thereof).

408	    5.  When previously unsent data is available and the new value of
409	        cwnd and the receiver's advertised window allow, a TCP SHOULD
410	        send 1*SMSS bytes of previously unsent data.

412	    6.  When the next ACK arrives that acknowledges previously
413	        unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
414	        set in step 2).  This is termed "deflating" the window.

416	        This ACK should be the acknowledgment elicited by the
417	        retransmission from step 3, one RTT after the retransmission
418	        (though it may arrive sooner in the presence of significant out-
419	        of-order delivery of data segments at the receiver).
420	        Additionally, this ACK should acknowledge all the intermediate
421	        segments sent between the lost segment and the receipt of the
422	        third duplicate ACK, if none of these were lost.

424	    Note: This algorithm is known to generally not recover efficiently
425	    from multiple losses in a single flight of packets [FF96].  Section
426	    4.3 below addresses such cases.
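
The window arithmetic in steps 1 through 6 can be sketched as a small state machine.  The Python code below is an illustration under simplifying assumptions: the class, method, and return-value names are hypothetical; the caller is assumed to pass a FlightSize that already excludes limited transmit data (step 2); and the actual transmission decisions of steps 1 and 5 are left to the caller.

```python
class FastRecoverySender:
    """Tracks cwnd and ssthresh (bytes) through fast retransmit/fast recovery."""

    def __init__(self, cwnd, ssthresh, smss):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.dup_acks = 0
        self.in_recovery = False

    def on_duplicate_ack(self, flight_size):
        """Handle one duplicate ACK and report what the sender should do."""
        self.dup_acks += 1
        if self.dup_acks < 3:
            return "limited_transmit"   # step 1: cwnd is left unchanged
        if self.dup_acks == 3:
            self.ssthresh = max(flight_size // 2, 2 * self.smss)  # step 2
            self.cwnd = self.ssthresh + 3 * self.smss             # step 3
            self.in_recovery = True
            return "retransmit"
        self.cwnd += self.smss          # step 4: inflate per extra duplicate
        return "inflate"

    def on_new_ack(self):
        """Handle an ACK that acknowledges previously unacknowledged data."""
        if self.in_recovery:
            self.cwnd = self.ssthresh   # step 6: deflate the window
            self.in_recovery = False
        self.dup_acks = 0
```

With cwnd = FlightSize = 14600 bytes and SMSS = 1460 bytes, the third duplicate ACK sets ssthresh to 7300 and cwnd to 11680, a fourth raises cwnd to 13140, and the next ACK for new data deflates cwnd to 7300.  The sketch does not implement the optional limit on inflation mentioned in the note under step 4.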

428	4. Additional Considerations

430	4.1 Re-starting Idle Connections

432	    A known problem with the TCP congestion control algorithms described
433	    above is that they allow a potentially inappropriate burst of
434	    traffic to be transmitted after TCP has been idle for a relatively
435	    long period of time.  After an idle period, TCP cannot use the ACK
436	    clock to strobe new segments into the network, as all the ACKs have
437	    drained from the network.  Therefore, as specified above, TCP can
438	    potentially send a cwnd-size line-rate burst into the network after
439	    an idle period.  In addition, changing network conditions may have
440	    rendered TCP's notion of the available end-to-end network capacity
441	    between two endpoints, as estimated by cwnd, inaccurate during the
442	    course of a long idle period.

444	    [Jac88] recommends that a TCP use slow start to restart
445	    transmission after a relatively long idle period.  Slow start
446	    serves to restart the ACK clock, just as it does at the beginning
447	    of a transfer.  This mechanism has been widely deployed in the
448	    following manner.  When TCP has not received a segment for more
449	    than one retransmission timeout, cwnd is reduced to the value of
450	    the restart window (RW) before transmission begins.

452	    For the purposes of this standard, we define RW = min(IW,cwnd).

454	    Using the last time a segment was received to determine whether or
455	    not to decrease cwnd can fail to deflate cwnd in the common case of
456	    persistent HTTP connections [HTH98].  In this case, a Web server
457	    receives a request before transmitting data to the Web client.  The
458	    reception of the request makes the test for an idle connection fail,
459	    and allows the TCP to begin transmission with a possibly
460	    inappropriately large cwnd.

462	    Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
463	    transmission if the TCP has not sent data in an interval exceeding
464	    the retransmission timeout.
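
A minimal Python sketch of this rule follows.  The function and parameter names are hypothetical; idle time is measured from the last data transmission, as recommended above, and both times are in the same unit.

```python
def cwnd_before_restart(cwnd, iw, time_since_last_send, rto):
    """Return the cwnd (bytes) to use when transmission resumes."""
    if time_since_last_send > rto:
        return min(iw, cwnd)  # restart window: RW = min(IW, cwnd)
    return cwnd               # not idle long enough; keep cwnd
```

For example, with IW = 4380 bytes and an RTO of 1 second, a connection that last sent data 2 seconds ago restarts with cwnd = 4380 if its previous cwnd was 50000 bytes, and keeps cwnd = 3000 if its previous cwnd was already below IW.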

466	4.2 Generating Acknowledgments

468	    The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
469	    TCP receiver.  When using delayed ACKs, a TCP receiver MUST NOT
470	    excessively delay acknowledgments.  Specifically, an ACK SHOULD be
471	    generated for at least every second full-sized segment, and MUST be
472	    generated within 500 ms of the arrival of the first unacknowledged
473	    packet.

475	    The requirement that an ACK "SHOULD" be generated for at least every
476	    second full-sized segment is listed in [RFC1122] in one place as a
477	    SHOULD and another as a MUST.  Here we unambiguously state it is a
478	    SHOULD.  We also emphasize that this is a SHOULD, meaning that an
479	    implementor should indeed only deviate from this requirement after
480	    careful consideration of the implications.  See the discussion of
481	    "Stretch ACK violation" in [RFC2525] and the references therein for
482	    a discussion of the possible performance problems with generating
483	    ACKs less frequently than every second full-sized segment.

485	    In some cases, the sender and receiver may not agree on what
486	    constitutes a full-sized segment.  An implementation is deemed to
487	    comply with this requirement if it sends at least one acknowledgment
488	    every time it receives 2*RMSS bytes of new data from the sender,
489	    where RMSS is the Maximum Segment Size specified by the receiver to
490	    the sender (or the default value of 536 bytes, per [RFC1122], if the
491	    receiver does not specify an MSS option during connection
492	    establishment).  The sender may be forced to use a segment size less
493	    than RMSS due to the maximum transmission unit (MTU), the path MTU
494	    discovery algorithm or other factors.  For instance, consider the
495	    case when the receiver announces an RMSS of X bytes but the sender
496	    ends up using a segment size of Y bytes (Y < X) due to path MTU
497	    discovery (or the sender's MTU size).  The receiver will generate
498	    stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
499	    sent.  Clearly this will take more than 2 segments of size Y bytes.
500	    Therefore, while a specific algorithm is not defined, it is
501	    desirable for receivers to attempt to prevent this situation, for
502	    example by acknowledging at least every second segment, regardless
503	    of size.  Finally, we repeat that an ACK MUST NOT be delayed for
504	    more than 500 ms waiting on a second full-sized segment to arrive.

506	    Out-of-order data segments SHOULD be acknowledged immediately, in
507	    order to accelerate loss recovery.  To trigger the fast retransmit
508	    algorithm, the receiver SHOULD send an immediate duplicate ACK when
509	    it receives a data segment above a gap in the sequence space.  To
510	    provide feedback to senders recovering from losses, the receiver
511	    SHOULD send an immediate ACK when it receives a data segment that
512	    fills in all or part of a gap in the sequence space.

514	    A TCP receiver MUST NOT generate more than one ACK for every
515	    incoming segment, other than to update the offered window as the
516	    receiving application consumes new data [page 42, RFC793][RFC813].
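
The receiver-side rules in this section can be combined into one decision function.  The Python sketch below is illustrative only: the function and parameter names are hypothetical, and the "every second segment, regardless of size" check is the example safeguard mentioned above rather than a required algorithm.

```python
def should_ack_now(unacked_bytes, unacked_segments, ms_since_first_unacked,
                   out_of_order, fills_gap, rmss, max_delay_ms=500):
    """Decide whether a receiver should send an ACK immediately."""
    if out_of_order or fills_gap:
        return True   # immediate ACK to speed up loss recovery
    if unacked_bytes >= 2 * rmss:
        return True   # two full-sized segments' worth of new data
    if unacked_segments >= 2:
        return True   # guard against stretch ACKs when the sender's MSS < RMSS
    # Upper bound on the delay; a real timer would be set to fire early
    # enough that the ACK is generated within the bound.
    return ms_since_first_unacked >= max_delay_ms
```

Consider a receiver that announced an RMSS of 1460 bytes while the sender uses 536-byte segments.  After two such segments only 1072 bytes are unacknowledged, which is below 2*RMSS, but the segment-count check still triggers an ACK.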

518	4.3 Loss Recovery Mechanisms

520	    A number of loss recovery algorithms that augment fast retransmit
521	    and fast recovery have been suggested by TCP researchers and
522	    specified in the RFC series.  While some of these algorithms are
523	    based on the TCP selective acknowledgment (SACK) option [RFC2018],
524	    such as [FF96,MM96a,MM96b,RFC3517], others do not require SACKs
525	    [Hoe96,FF96,RFC3782].  The non-SACK algorithms use "partial
526	    acknowledgments" (ACKs which cover previously unacknowledged data,
527	    but not all the data outstanding when loss was detected) to trigger
528	    retransmissions.  While this document does not standardize any of
529	    the specific algorithms that may improve fast retransmit/fast
530	    recovery, these enhanced algorithms are implicitly allowed, as long
531	    as they follow the general principles of the basic four algorithms
532	    outlined above.

534	    That is, when the first loss in a window of data is detected,
535	    ssthresh MUST be set to no more than the value given by equation
536	    (4).  Second, until all lost segments in the window of data in
537	    question are repaired, the number of segments transmitted in each
538	    RTT MUST be no more than half the number of outstanding segments
539	    when the loss was detected.  Finally, after all loss in the given
540	    window of segments has been successfully retransmitted, cwnd MUST be
541	    set to no more than ssthresh and congestion avoidance MUST be used
542	    to further increase cwnd.  Loss in two successive windows of data,
543	    or the loss of a retransmission, should be taken as two indications
544	    of congestion and, therefore, cwnd (and ssthresh) MUST be lowered
545	    twice in this case.
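
    The constraints in this paragraph can be sketched as follows.  The
    sketch assumes that equation (4) has the form
    ssthresh = max(FlightSize / 2, 2*SMSS), consistent with the 2*SMSS
    minimum mentioned in Section 5; the function names are hypothetical
    and integer byte counts are used throughout.

```python
def ssthresh_on_loss(flight_size, smss):
    """Upper bound for ssthresh when loss is first detected in a window.

    Assumes equation (4): max(FlightSize / 2, 2 * SMSS).
    """
    return max(flight_size // 2, 2 * smss)


def cwnd_after_recovery(cwnd, ssthresh):
    """After all loss in the window is repaired, cwnd must not exceed
    ssthresh; congestion avoidance then governs further growth."""
    return min(cwnd, ssthresh)


def lower_for_successive_losses(flight_size, smss, loss_events):
    """Apply one reduction per congestion indication, e.g. twice for
    loss in two successive windows or for a lost retransmission."""
    ssthresh = flight_size
    for _ in range(loss_events):
        ssthresh = ssthresh_on_loss(flight_size, smss)
        flight_size = ssthresh  # assume the next window is sized by ssthresh
    return ssthresh
```

    With SMSS = 1460 and 16 segments outstanding, a single loss event
    yields a bound of 8 segments, and two successive events yield 4.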

547	    We RECOMMEND that TCP implementers employ some form of advanced loss
548	    recovery that can cope with multiple losses in a window of data.
549	    The algorithms detailed in [RFC3782] and [RFC3517] conform to the
550	    general principles outlined above.  We note that while these are not
551	    the only two algorithms that conform to the above general
552	    principles, these two algorithms have been vetted by the community
553	    and are currently on the standards track.

555	5.  Security Considerations

557	    This document requires a TCP to diminish its sending rate in the
558	    presence of retransmission timeouts and the arrival of duplicate
559	    acknowledgments.  An attacker can therefore impair the performance
560	    of a TCP connection by either causing data packets or their
561	    acknowledgments to be lost, or by forging excessive duplicate
562	    acknowledgments.  Causing two congestion control events back-to-back
563	    will often cut ssthresh to its minimum value of 2*SMSS, causing the
564	    connection to immediately enter the slower-performing congestion
565	    avoidance phase.

567	    In response to the ACK division attack outlined in [SCWA99], this
568	    document RECOMMENDS increasing the congestion window based on the
569	    number of bytes newly acknowledged in each arriving ACK rather than
570	    by a particular constant on each arriving ACK (as outlined in
571	    section 3.1).
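
    To illustrate why byte counting blunts ACK division, the sketch
    below compares the two ways of growing cwnd during slow start.  It
    assumes the byte-counting increase is min(N, SMSS) per ACK, where N
    is the number of newly acknowledged bytes; the function names and
    the example ACK patterns are invented for this illustration.

```python
def grow_per_ack(cwnd, acked_byte_counts, smss):
    """Increase cwnd by a constant SMSS for every arriving ACK."""
    for _ in acked_byte_counts:
        cwnd += smss
    return cwnd


def grow_byte_counting(cwnd, acked_byte_counts, smss):
    """Increase cwnd by min(N, SMSS), N = bytes newly acknowledged."""
    for n in acked_byte_counts:
        cwnd += min(n, smss)
    return cwnd


SMSS = 1460
honest = [SMSS]        # one ACK covering one full-sized segment
divided = [146] * 10   # the same 1460 bytes acknowledged in ten pieces
```

    With per-ACK increases, the divided pattern grows cwnd ten times as
    fast as the honest one; with byte counting, both patterns produce
    the same increase of one SMSS.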

573	    The Internet to a considerable degree relies on the correct
574	    implementation of these algorithms in order to preserve network
575	    stability and avoid congestion collapse.  An attacker could cause
576	    TCP endpoints to respond more aggressively in the face of congestion
577	    by forging excessive duplicate acknowledgments or excessive
578	    acknowledgments for new data.  Conceivably, such an attack could
579	    drive a portion of the network into congestion collapse.

581	6.  Changes Between RFC 2001 and RFC 2581

583	    [RFC2001] has been extensively rewritten editorially and it is not
584	    feasible to itemize the list of changes between [RFC2001] and
585	    [RFC2581]. The intention of [RFC2581] is to not change any of the
586	    recommendations given in [RFC2001], but to further clarify cases
587	    that were not discussed in detail in [RFC2001]. Specifically,
588	    [RFC2581] suggests what TCP connections should do after a relatively
589	    long idle period, as well as specifying and clarifying some of the
590	    issues pertaining to TCP ACK generation.  Finally, the allowable
591	    upper bound for the initial congestion window has also been raised
592	    from one to two segments.

594	7.  Changes Relative to RFC 2581
595	    A specific definition for "duplicate acknowledgment" has been
596	    added, based on the definition used by BSD TCP.

598	    The document now notes that what to do with duplicate ACKs after the
599	    retransmission timer has fired is future work and explicitly
600	    unspecified in this document.

602	    The initial window requirements were changed to allow Larger
603	    Initial Windows as standardized in [RFC3390].  Additionally, the
604	    steps to take when an initial window is discovered to be too large
605	    due to Path MTU Discovery [RFC1191] are detailed.

607	    The recommended initial value for ssthresh has been changed to say
608	    that it SHOULD be arbitrarily high, where it was previously MAY.
609	    This is to provide additional guidance to implementors on the
610	    matter.

612	    During slow start, the usage of Appropriate Byte Counting [RFC3465]
613	    with L=1*SMSS is explicitly recommended.  The method of increasing
614	    cwnd given in [RFC2581] is still explicitly allowed.  Byte counting
615	    during congestion avoidance is also recommended, while the method
616	    from [RFC2581] and other safe methods are still allowed.

618	    The treatment of ssthresh on retransmission timeout was clarified.
619	    In particular, ssthresh must be set to half the FlightSize on the
620	    first retransmission of a given segment and then is held constant on
621	    subsequent retransmissions of the same segment.
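
    One way to picture this clarification is to remember which segment
    the retransmission timer last fired for.  The state layout and
    names below are hypothetical, and the 2*SMSS floor is assumed from
    equation (4); the handling of cwnd on timeout is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtoState:
    """Hypothetical sender state for this illustration."""
    ssthresh: int
    timed_out_seq: Optional[int] = None  # segment the timer last fired for


def on_retransmission_timeout(state, seq, flight_size, smss):
    """Update ssthresh when the retransmission timer fires for `seq`."""
    if state.timed_out_seq != seq:
        # First retransmission of this segment: halve FlightSize,
        # with the assumed floor of 2*SMSS.
        state.ssthresh = max(flight_size // 2, 2 * smss)
        state.timed_out_seq = seq
    # Repeated timeouts for the same segment leave ssthresh unchanged.
    return state.ssthresh
```

    A fuller implementation would also clear timed_out_seq once the
    segment is acknowledged.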

623	    The description of fast retransmit and fast recovery has been
624	    clarified, and the use of Limited Transmit [RFC3042] is now
625	    recommended.

627	    TCPs now MAY limit the number of duplicate ACKs that artificially
628	    inflate cwnd during loss recovery to the number of segments
629	    outstanding to avoid the duplicate ACK spoofing attack described in
630	    [SCWA99].

632	    The restart window has been changed to min(IW,cwnd) from IW.  This
633	    behavior was described as "experimental" in [RFC2581].
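
    A minimal sketch of the revised restart behavior follows.  It takes
    IW as an input rather than computing it, and it assumes that the
    restart window applies once the connection has been idle for longer
    than the retransmission timeout; all names are illustrative.

```python
def restart_window(iw, cwnd):
    """RW = min(IW, cwnd), replacing the earlier RW = IW."""
    return min(iw, cwnd)


def cwnd_after_idle(cwnd, iw, idle_time, rto):
    """Return the cwnd to use when transmission resumes.

    Assumption: the reduction applies only when the idle period
    exceeds the retransmission timeout (same time unit for both).
    """
    if idle_time > rto:
        return restart_window(iw, cwnd)
    return cwnd
```

    Taking the minimum means that a connection whose cwnd was already
    below IW does not gain window simply by going idle.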

635	    It is now recommended that TCP implementors implement an advanced
636	    loss recovery algorithm conforming to the principles outlined in
637	    this document.

639	    The security considerations have been updated to discuss ACK
640	    division and recommend byte counting as a counter to this attack.

642	8.  IANA Considerations

644	    This document contains no IANA considerations, but apparently an
645	    Internet *Draft* can no longer be published without this section.

647	Acknowledgments
648	    The core algorithms we describe were developed by Van Jacobson
649	    [Jac88, Jac90].  In addition, Limited Transmit [RFC3042] was
650	    developed in conjunction with Hari Balakrishnan and Sally Floyd.
651	    The initial congestion window size specified in this document is a
652	    result of work with Sally Floyd and Craig Partridge
653	    [RFC2414,RFC3390].

655	    W. Richard ("Rich") Stevens wrote the first version of this document
656	    [RFC2001] and co-authored the second version [RFC2581].  The
657	    present version benefits greatly from his clarity and thoughtfulness
658	    of description, and we are grateful for Rich's contributions in
659	    elucidating TCP congestion control, as well as in more broadly
660	    helping us understand numerous issues relating to networking.

662	    We wish to emphasize that the shortcomings and mistakes of this
663	    document are solely the responsibility of the current authors.

665	    Some of the text from this document is taken from "TCP/IP
666	    Illustrated, Volume 1: The Protocols" by W. Richard Stevens
667	    (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
668	    Implementation" by Gary R. Wright and W. Richard Stevens
669	    (Addison-Wesley, 1995).  This material is used with the permission
670	    of Addison-Wesley.

672	    Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
673	    Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
674	    Reiner Ludwig, Matt Mathis, Craig Partridge and Joe Touch
675	    contributed a number of helpful suggestions.

677	Normative References

679	    [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
680	        793, September 1981.

682	    [RFC1122] Braden, R., "Requirements for Internet Hosts --
683	        Communication Layers", STD 3, RFC 1122, October 1989.

685	    [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
686	        November 1990.

688	Informative References

690	    [CJ89] Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
691	        Algorithms for Congestion Avoidance in Computer Networks",
692	        Journal of Computer Networks and ISDN Systems, vol. 17, no. 1,
693	        pp. 1-14, June 1989.

695	    [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
696	        Tahoe, Reno and SACK TCP", Computer Communication Review, July
697	        1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

699	    [Flo94] Floyd, S., "TCP and Successive Fast Retransmits", Technical
700	        report, October 1994.

702	        ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

704	    [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
705	        Control Scheme for TCP", In ACM SIGCOMM, August 1996.

707	    [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
708	        Slow-Start Restart After Idle", Work in Progress.

710	    [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
711	        Communication Review, vol. 18, no. 4, pp. 314-329, Aug.  1988.
712	        ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

714	    [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
715	        end2end-interest mailing list, April 30, 1990.
716	        ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

718	    [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
719	        TCP Congestion Control", Proceedings of SIGCOMM'96, August,
720	        1996, Stanford, CA.  Available
721	        from http://www.psc.edu/networking/papers/papers.html

723	    [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
724	        Parameters", Technical report.  Available from
725	        http://www.psc.edu/networking/papers/FACKnotes/current.

727	    [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
728	        Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

730	    [RFC813] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
731	        813, July 1982.

733	    [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
734	        Retransmit, and Fast Recovery Algorithms", RFC 2001, January
735	        1997.

737	    [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
738	        Selective Acknowledgement Options", RFC 2018, October 1996.

740	    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
741	        Requirement Levels", BCP 14, RFC 2119, March 1997.

743	    [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
744	        Initial Window Size", RFC 2414, September 1998.

746	    [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner,
747	        J., Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
748	        Implementation Problems", RFC 2525, March 1999.

750	    [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion
751	        Control", RFC 2581, April 1999.

753	    [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An
754	        Extension to the Selective Acknowledgement (SACK) Option for
755	        TCP", RFC 2883, July 2000.

757	    [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
758	        Timer", RFC 2988, November 2000.

760	    [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing
761	        TCP's Loss Recovery Using Limited Transmit", RFC 3042, January
762	        2001.

764	    [RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of
765	        Explicit Congestion Notification (ECN) to IP", RFC 3168,
766	        September 2001.

768	    [RFC3390] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
769	        Initial Window", RFC 3390, October 2002.

771	    [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
772	        Counting (ABC)", RFC 3465, February 2003.

774	    [RFC3517] Blanton, E., Allman, M., Fall, K. and L. Wang, "A
775	        Conservative Selective Acknowledgment (SACK)-based Loss Recovery
776	        Algorithm for TCP", RFC 3517, April 2003.

778	    [RFC3782] Floyd, S., Henderson, T. and A. Gurtov, "The NewReno
779	        Modification to TCP's Fast Recovery Algorithm", RFC 3782, April
780	        2004.

782	    [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
783	        Discovery", RFC 4821, March 2007.

785	    [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
786	        "TCP Congestion Control With a Misbehaving Receiver", ACM
787	        Computer Communication Review, 29(5), October 1999.

789	    [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
790	        Addison-Wesley, 1994.

792	    [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The
793	        Implementation", Addison-Wesley, 1995.

795	Authors' Addresses

797	    Mark Allman
798	    International Computer Science Institute (ICSI)
799	    1947 Center Street
800	    Suite 600
801	    Berkeley, CA 94704-1198
802	    Phone: +1 440 235 1792
803	    EMail: mallman@icir.org
804	    http://www.icir.org/mallman/

806	    Vern Paxson
807	    International Computer Science Institute (ICSI)
808	    1947 Center Street
809	    Suite 600
810	    Berkeley, CA 94704-1198
811	    Phone: +1 510/642-4274 x302
812	    EMail: vern@icir.org
813	    http://www.icir.org/vern/

815	    Ethan Blanton
816	    Purdue University Computer Sciences
817	    1398 Computer Science Building
818	    West Lafayette, IN  47907
819	    EMail: eblanton@cs.purdue.edu
820	    http://www.cs.purdue.edu/homes/eblanton/

822	Intellectual Property Statement

824	    The IETF takes no position regarding the validity or scope of any
825	    Intellectual Property Rights or other rights that might be claimed
826	    to pertain to the implementation or use of the technology described
827	    in this document or the extent to which any license under such
828	    rights might or might not be available; nor does it represent that
829	    it has made any independent effort to identify any such rights.
830	    Information on the procedures with respect to rights in RFC
831	    documents can be found in BCP 78 and BCP 79.

833	    Copies of IPR disclosures made to the IETF Secretariat and any
834	    assurances of licenses to be made available, or the result of an
835	    attempt made to obtain a general license or permission for the use
836	    of such proprietary rights by implementers or users of this
837	    specification can be obtained from the IETF on-line IPR repository
838	    at http://www.ietf.org/ipr.

840	    The IETF invites any interested party to bring to its attention any
841	    copyrights, patents or patent applications, or other proprietary
842	    rights that may cover technology that may be required to implement
843	    this standard.  Please address the information to the IETF at
844	    ietf-ipr@ietf.org.

846	Disclaimer of Validity

848	    This document and the information contained herein are provided
849	    on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
850	    REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
851	    IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
852	    WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
853	    WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
854	    ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
855	    FOR A PARTICULAR PURPOSE.

857	Copyright Statement

859	    Copyright (C) The IETF Trust (2008).  This document is subject to
860	    the rights, licenses and restrictions contained in BCP 78, and
861	    except as set forth therein, the authors retain all their rights.

863	Acknowledgment

865	    Funding for the RFC Editor function is currently provided by the
866	    Internet Society.