| draft-ietf-tcpm-rfc2581bis-00.txt | draft-ietf-tcpm-rfc2581bis-01.txt | |||
|---|---|---|---|---|
| Network Working Group M. Allman | Network Working Group M. Allman | |||
| Internet-Draft V. Paxson | Internet-Draft V. Paxson | |||
| Expires: July 2006 ICIR / ICSI | Expires: December 2006 ICIR / ICSI | |||
| E. Blanton | E. Blanton | |||
| Purdue University | Purdue University | |||
| January 2006 | June 2006 | |||
| TCP Congestion Control | TCP Congestion Control | |||
| draft-ietf-tcpm-rfc2581bis-00.txt | draft-ietf-tcpm-rfc2581bis-01.txt | |||
| Status of this Memo | Status of this Memo | |||
| By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
| applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
| have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
| aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| skipping to change at page 3, line 22 | skipping to change at page 3, line 22 | |||
| RESTART WINDOW (RW): The restart window is the size of the | RESTART WINDOW (RW): The restart window is the size of the | |||
| congestion window after a TCP restarts transmission after an | congestion window after a TCP restarts transmission after an | |||
| idle period (if the slow start algorithm is used; see section | idle period (if the slow start algorithm is used; see section | |||
| 4.1 for more discussion). | 4.1 for more discussion). | |||
| FLIGHT SIZE: The amount of data that has been sent but not yet | FLIGHT SIZE: The amount of data that has been sent but not yet | |||
| acknowledged. | acknowledged. | |||
| DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a | |||
| "duplicate" in the following algorithms when (a) the | "duplicate" in the following algorithms when (a) the receiver of | |||
| receiver of the ACK has outstanding data, (b) the incoming | the ACK has outstanding data, (b) the incoming acknowledgment | |||
| acknowledgment carries no data, (c) the SYN and FIN bits are | carries no data, (c) the SYN and FIN bits are both off, (d) the | |||
| both off, (d) the acknowledgment number is equal to the greatest | acknowledgment number is equal to the greatest acknowledgment | |||
| acknowledgment received on the given connection (TCP.UNA from | received on the given connection (TCP.UNA from [RFC793]) and (e) | |||
| [RFC793]) and (e) the advertised window in the incoming | the advertised window in the incoming acknowledgment equals the | |||
| acknowledgment equals the advertised window in the last incoming | advertised window in the last incoming acknowledgment. | |||
| acknowledgment. Alternatively, a TCP that utilizes selective | Alternatively, a TCP that utilizes selective acknowledgments | |||
| acknowledgments [RFC2018] can determine an incoming ACK is a | [RFC2018,RFC2883] can determine an incoming ACK is a "duplicate" | |||
| "duplicate" if the ACK contains previously unknown SACK | if the ACK contains previously unknown SACK information. | |||
| information. | ||||
| 3. Congestion Control Algorithms | 3. Congestion Control Algorithms | |||
| This section defines the four congestion control algorithms: slow | This section defines the four congestion control algorithms: slow | |||
| start, congestion avoidance, fast retransmit and fast recovery, | start, congestion avoidance, fast retransmit and fast recovery, | |||
| developed in [Jac88] and [Jac90]. In some situations it may be | developed in [Jac88] and [Jac90]. In some situations it may be | |||
| beneficial for a TCP sender to be more conservative than the | beneficial for a TCP sender to be more conservative than the | |||
| algorithms allow; however, a TCP MUST NOT be more aggressive than the | algorithms allow; however, a TCP MUST NOT be more aggressive than the | |||
| following algorithms allow (that is, MUST NOT send data when the | following algorithms allow (that is, MUST NOT send data when the | |||
| value of cwnd computed by the following algorithms would not allow | value of cwnd computed by the following algorithms would not allow | |||
| skipping to change at page 4, line 24 | skipping to change at page 4, line 23 | |||
| IW, the initial value of cwnd, MUST be set using the following | IW, the initial value of cwnd, MUST be set using the following | |||
| guidelines as an upper bound. | guidelines as an upper bound. | |||
| If SMSS > 2190 bytes: | If SMSS > 2190 bytes: | |||
| IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | IW = 2 * SMSS bytes and MUST NOT be more than 2 segments | |||
| If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes): | |||
| IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | IW = 3 * SMSS bytes and MUST NOT be more than 3 segments | |||
| If SMSS <= 1095 bytes: | If SMSS <= 1095 bytes: | |||
| IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | IW = 4 * SMSS bytes and MUST NOT be more than 4 segments | |||
| As specified in [RFC3390], the SYN/ACK and the acknowledgment of the | ||||
| SYN/ACK MUST NOT increase the size of the congestion window. | ||||
| Further, if the SYN or SYN/ACK is lost, the initial window used by a | ||||
| sender after a correctly transmitted SYN MUST be one segment | ||||
| consisting of at most SMSS bytes. | ||||
| A detailed rationale and discussion of the IW setting is provided in | A detailed rationale and discussion of the IW setting is provided in | |||
| [RFC3390]. | [RFC3390]. | |||
| When larger initial windows are implemented along with Path MTU | When larger initial windows are implemented along with Path MTU | |||
| Discovery [RFC1191], and the MSS being used is found to be too | Discovery [RFC1191], and the MSS being used is found to be too | |||
| large, the congestion window cwnd SHOULD be reduced to prevent | large, the congestion window cwnd SHOULD be reduced to prevent | |||
| large bursts of smaller segments. Specifically, cwnd SHOULD be | large bursts of smaller segments. Specifically, cwnd SHOULD be | |||
| reduced by the ratio of the old segment size to the new segment | reduced by the ratio of the old segment size to the new segment | |||
| size. | size. | |||
| The initial value of ssthresh SHOULD be arbitrarily high (for | The initial value of ssthresh SHOULD be arbitrarily high (e.g., to | |||
| example, some implementations use the size of the advertised | the size of the largest possible advertised window), but ssthresh | |||
| window), but ssthresh MUST be reduced in response to congestion. | MUST be reduced in response to congestion. Setting ssthresh as high | |||
| as possible allows the network conditions, rather than some | ||||
| arbitrary host limit, to dictate the sending rate. In cases where | ||||
| the end systems have a solid understanding of the network path, more | ||||
| carefully setting the initial ssthresh value may have merit (e.g., | ||||
| such that the end host does not create congestion along the path). | ||||
| The slow start algorithm is used when cwnd < ssthresh, while the | The slow start algorithm is used when cwnd < ssthresh, while the | |||
| congestion avoidance algorithm is used when cwnd > ssthresh. When | congestion avoidance algorithm is used when cwnd > ssthresh. When | |||
| cwnd and ssthresh are equal the sender may use either slow start or | cwnd and ssthresh are equal the sender may use either slow start or | |||
| congestion avoidance. | congestion avoidance. | |||
| During slow start, a TCP increments cwnd by at most SMSS bytes for | During slow start, a TCP increments cwnd by at most SMSS bytes for | |||
| each ACK received that acknowledges new data. Slow start ends when | each ACK received that acknowledges new data. Slow start ends when | |||
| cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted | cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted | |||
| above) or when congestion is observed. While traditionally TCP | above) or when congestion is observed. While traditionally TCP | |||
| implementations have increased cwnd by precisely SMSS bytes upon | implementations have increased cwnd by precisely SMSS bytes upon | |||
| skipping to change at page 6, line 44 | skipping to change at page 6, line 53 | |||
| incidentally increase well beyond rwnd. | incidentally increase well beyond rwnd. | |||
| Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be | |||
| set to no more than the loss window, LW, which equals 1 full-sized | set to no more than the loss window, LW, which equals 1 full-sized | |||
| segment (regardless of the value of IW). Therefore, after | segment (regardless of the value of IW). Therefore, after | |||
| retransmitting the dropped segment the TCP sender uses the slow | retransmitting the dropped segment the TCP sender uses the slow | |||
| start algorithm to increase the window from 1 full-sized segment to | start algorithm to increase the window from 1 full-sized segment to | |||
| the new value of ssthresh, at which point congestion avoidance again | the new value of ssthresh, at which point congestion avoidance again | |||
| takes over. | takes over. | |||
| As shown in [FF96,RFC3782], slow start-based loss recovery after a | ||||
| timeout can cause spurious retransmissions that trigger duplicate | ||||
| acknowledgments. The reaction to the arrival of these duplicate | ||||
| ACKs in TCP implementations varies widely. This document does not | ||||
| specify how to treat such acknowledgments, but does note this as an | ||||
| area that may benefit from additional attention, experimentation and | ||||
| specification. | ||||
| 3.2 Fast Retransmit/Fast Recovery | 3.2 Fast Retransmit/Fast Recovery | |||
| A TCP receiver SHOULD send an immediate duplicate ACK when an out- | A TCP receiver SHOULD send an immediate duplicate ACK when an out- | |||
| of-order segment arrives. The purpose of this ACK is to inform the | of-order segment arrives. The purpose of this ACK is to inform the | |||
| sender that a segment was received out-of-order and which sequence | sender that a segment was received out-of-order and which sequence | |||
| number is expected. From the sender's perspective, duplicate ACKs | number is expected. From the sender's perspective, duplicate ACKs | |||
| can be caused by a number of network problems. First, they can be | can be caused by a number of network problems. First, they can be | |||
| caused by dropped segments. In this case, all segments after the | caused by dropped segments. In this case, all segments after the | |||
| dropped segment will trigger duplicate ACKs until the loss is | dropped segment will trigger duplicate ACKs until the loss is | |||
| repaired. Second, duplicate ACKs can be caused by the re-ordering | repaired. Second, duplicate ACKs can be caused by the re-ordering | |||
| skipping to change at page 7, line 10 | skipping to change at page 7, line 29 | |||
| paths [Pax97]). Finally, duplicate ACKs can be caused by | paths [Pax97]). Finally, duplicate ACKs can be caused by | |||
| replication of ACK or data segments by the network. In addition, a | replication of ACK or data segments by the network. In addition, a | |||
| TCP receiver SHOULD send an immediate ACK when the incoming segment | TCP receiver SHOULD send an immediate ACK when the incoming segment | |||
| fills in all or part of a gap in the sequence space. This will | fills in all or part of a gap in the sequence space. This will | |||
| generate more timely information for a sender recovering from a loss | generate more timely information for a sender recovering from a loss | |||
| through a retransmission timeout, a fast retransmit, or an advanced | through a retransmission timeout, a fast retransmit, or an advanced | |||
| loss recovery algorithm, as outlined in section 4.3. | loss recovery algorithm, as outlined in section 4.3. | |||
| The TCP sender SHOULD use the "fast retransmit" algorithm to detect | The TCP sender SHOULD use the "fast retransmit" algorithm to detect | |||
| and repair loss, based on incoming duplicate ACKs. The fast | and repair loss, based on incoming duplicate ACKs. The fast | |||
| retransmit algorithm uses the arrival of 3 duplicate ACKs (4 | retransmit algorithm uses the arrival of 3 duplicate ACKs (as | |||
| identical ACKs without the arrival of any other intervening packets) | defined in section 2, without any intervening ACKs which move | |||
| as an indication that a segment has been lost. After receiving 3 | SND.UNA) as an indication that a segment has been lost. After | |||
| duplicate ACKs, TCP performs a retransmission of what appears to be | receiving 3 duplicate ACKs, TCP performs a retransmission of what | |||
| the missing segment, without waiting for the retransmission timer to | appears to be the missing segment, without waiting for the | |||
| expire. | retransmission timer to expire. | |||
| After the fast retransmit algorithm sends what appears to be the | After the fast retransmit algorithm sends what appears to be the | |||
| missing segment, the "fast recovery" algorithm governs the | missing segment, the "fast recovery" algorithm governs the | |||
| transmission of new data until a non-duplicate ACK arrives. The | transmission of new data until a non-duplicate ACK arrives. The | |||
| reason for not performing slow start is that the receipt of the | reason for not performing slow start is that the receipt of the | |||
| duplicate ACKs not only indicates that a segment has been lost, but | duplicate ACKs not only indicates that a segment has been lost, but | |||
| also that segments are most likely leaving the network (although a | also that segments are most likely leaving the network (although a | |||
| massive segment duplication by the network can invalidate this | massive segment duplication by the network can invalidate this | |||
| conclusion). In other words, since the receiver can only generate a | conclusion). In other words, since the receiver can only generate a | |||
| duplicate ACK when a segment has arrived, that segment has left the | duplicate ACK when a segment has arrived, that segment has left the | |||
| skipping to change at page 11, line 27 | skipping to change at page 11, line 47 | |||
| that were not discussed in detail in 2001. Specifically, this | that were not discussed in detail in 2001. Specifically, this | |||
| document suggests what TCP connections should do after a relatively | document suggests what TCP connections should do after a relatively | |||
| long idle period, as well as specifying and clarifying some of the | long idle period, as well as specifying and clarifying some of the | |||
| issues pertaining to TCP ACK generation. Finally, the allowable | issues pertaining to TCP ACK generation. Finally, the allowable | |||
| upper bound for the initial congestion window has also been raised | upper bound for the initial congestion window has also been raised | |||
| from one to two segments. | from one to two segments. | |||
| 7. Changes Relative to RFC 2581 | 7. Changes Relative to RFC 2581 | |||
| A specific definition for "duplicate acknowledgment" has been | A specific definition for "duplicate acknowledgment" has been | |||
| added, based on the definition used by BSD TCP. | added, based on the definition used by BSD TCP. In addition, the | |||
| definition explicitly does not take into account the presence (or | ||||
| absence) of DSACK [RFC2883] information. | ||||
| The document now notes that what to do with duplicate ACKs after the | ||||
| retransmission timer has fired is future work and explicitly | ||||
| unspecified in this document. | ||||
| The initial window requirements were changed to allow Larger | The initial window requirements were changed to allow Larger | |||
| Initial Windows as standardized in [RFC3390]. Additionally, the | Initial Windows as standardized in [RFC3390]. Additionally, the | |||
| steps to take when an initial window is discovered to be too large | steps to take when an initial window is discovered to be too large | |||
| due to Path MTU Discovery [RFC1191] are detailed. | due to Path MTU Discovery [RFC1191] are detailed. | |||
| The recommended initial value for ssthresh has been changed to say | The recommended initial value for ssthresh has been changed to say | |||
| that it SHOULD be arbitrarily high, where it was previously MAY. | that it SHOULD be arbitrarily high, where it was previously MAY. | |||
| This is to provide additional guidance to implementors on the | This is to provide additional guidance to implementors on the | |||
| matter. | matter. | |||
| skipping to change at page 12, line 35 | skipping to change at page 13, line 6 | |||
| We wish to emphasize that the shortcomings and mistakes of this | We wish to emphasize that the shortcomings and mistakes of this | |||
| document are solely the responsibility of the current authors. | document are solely the responsibility of the current authors. | |||
| Some of the text from this document is taken from "TCP/IP | Some of the text from this document is taken from "TCP/IP | |||
| Illustrated, Volume 1: The Protocols" by W. Richard Stevens | Illustrated, Volume 1: The Protocols" by W. Richard Stevens | |||
| (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The | |||
| Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | Implementation" by Gary R. Wright and W. Richard Stevens (Addison- | |||
| Wesley, 1995). This material is used with the permission of | Wesley, 1995). This material is used with the permission of | |||
| Addison-Wesley. | Addison-Wesley. | |||
| Neal Cardwell, Noritoshi Demizu, Kevin Fall, Sally Floyd, Craig | Steve Arden, Neal Cardwell, Noritoshi Demizu, Kevin Fall, John | |||
| Partridge and Joe Touch contributed a number of helpful suggestions. | Heffner, Sally Floyd, Reiner Ludwig, Matt Mathis, Craig Partridge | |||
| and Joe Touch contributed a number of helpful suggestions. | ||||
| Normative References | Normative References | |||
| [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | [RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC | |||
| 793, September 1981. | 793, September 1981. | |||
| [RFC1122] Braden, R., "Requirements for Internet Hosts -- | [RFC1122] Braden, R., "Requirements for Internet Hosts -- | |||
| Communication Layers", STD 3, RFC 1122, October 1989. | Communication Layers", STD 3, RFC 1122, October 1989. | |||
| [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191, | |||
| skipping to change at page 14, line 5 | skipping to change at page 14, line 31 | |||
| [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | [RFC2414] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's | |||
| Initial Window Size", RFC 2414, September 1998. | Initial Window Size", RFC 2414, September 1998. | |||
| [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., | [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., | |||
| Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP | |||
| Implementation Problems", RFC 2525, March 1999. | Implementation Problems", RFC 2525, March 1999. | |||
| [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion | [RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion | |||
| Control", RFC 2581, April 1999. | Control", RFC 2581, April 1999. | |||
| [RFC2883] Floyd, S., Mahdavi, J., Mathis, M. and M. Podolsky, "An | ||||
| Extension to the Selective Acknowledgement (SACK) Option for | ||||
| TCP", RFC 2883, July 2000. | ||||
| [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | |||
| Timer", RFC 2988, November 2000. | Timer", RFC 2988, November 2000. | |||
| [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing | |||
| TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | TCP's Loss Recovery Using Limited Transmit", RFC 3042, January | |||
| 2001. | 2001. | |||
| [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte | [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte | |||
| Counting (ABC)", RFC 3465, February 2003. | Counting (ABC)", RFC 3465, February 2003. | |||
| skipping to change at page 14, line 40 | skipping to change at page 15, line 16 | |||
| [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The | |||
| Implementation", Addison-Wesley, 1995. | Implementation", Addison-Wesley, 1995. | |||
| Authors' Addresses | Authors' Addresses | |||
| Mark Allman | Mark Allman | |||
| ICIR / ICSI | ICIR / ICSI | |||
| 1947 Center Street | 1947 Center Street | |||
| Suite 600 | Suite 600 | |||
| Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
| Phone: +1 440 243 7361 | Phone: +1 440 235 1792 | |||
| EMail: mallman@icir.org | EMail: mallman@icir.org | |||
| http://www.icir.org/mallman/ | http://www.icir.org/mallman/ | |||
| Vern Paxson | Vern Paxson | |||
| ICIR / ICSI | ICIR / ICSI | |||
| 1947 Center Street | 1947 Center Street | |||
| Suite 600 | Suite 600 | |||
| Berkeley, CA 94704-1198 | Berkeley, CA 94704-1198 | |||
| Phone: +1 510/642-4274 x302 | Phone: +1 510/642-4274 x302 | |||
| EMail: vern@icir.org | EMail: vern@icir.org | |||
| End of changes. 12 change blocks. | ||||
| 27 lines changed or deleted | 57 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
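
The initial-window bounds, the duplicate-ACK definition (conditions (a) through (e)), and the 3-duplicate-ACK trigger quoted in the -01 column above can be sketched in a few lines of Python. This is an illustrative reading of the draft text, not an implementation taken from either revision: the `Ack` record, the function names, and the `DupAckCounter` class are all hypothetical, and the SACK-based alternative definition is deliberately left out.

```python
from dataclasses import dataclass


def initial_window(smss: int) -> int:
    """Upper bound on IW, in bytes, for a sender maximum segment size in bytes."""
    if smss <= 0:
        raise ValueError("SMSS must be positive")
    if smss > 2190:
        return 2 * smss   # at most 2 segments
    if smss > 1095:
        return 3 * smss   # at most 3 segments
    return 4 * smss       # at most 4 segments


@dataclass
class Ack:
    """Hypothetical view of an incoming segment (names are assumptions)."""
    ack_no: int        # acknowledgment number carried by the segment
    payload_len: int   # bytes of data carried by the segment
    syn: bool
    fin: bool
    window: int        # advertised window carried by the segment


def is_duplicate_ack(ack: Ack, snd_una: int, flight_size: int,
                     last_window: int) -> bool:
    """Apply conditions (a)-(e) of the non-SACK duplicate-ACK definition."""
    return (
        flight_size > 0                    # (a) sender has outstanding data
        and ack.payload_len == 0           # (b) ACK carries no data
        and not ack.syn and not ack.fin    # (c) SYN and FIN are both off
        and ack.ack_no == snd_una          # (d) equals greatest ACK received
        and ack.window == last_window      # (e) advertised window unchanged
    )


class DupAckCounter:
    """Counts duplicate ACKs; an ACK that moves SND.UNA resets the count."""
    THRESHOLD = 3

    def __init__(self) -> None:
        self.count = 0

    def on_ack(self, is_dup: bool, advances_una: bool) -> bool:
        """Return True exactly when the third duplicate ACK arrives."""
        if advances_una:
            self.count = 0
            return False
        if is_dup:
            self.count += 1
            return self.count == self.THRESHOLD
        return False
```

For a common Ethernet-derived SMSS of 1460 bytes, `initial_window(1460)` falls in the middle branch and yields 3 segments. Keeping the duplicate-ACK test separate from the counter mirrors the draft's split between the definition in section 2 and the fast retransmit trigger in section 3.2.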