| < draft-ietf-tcpm-dctcp-02.txt | draft-ietf-tcpm-dctcp-03.txt > | |||
|---|---|---|---|---|
| Network Working Group S. Bensley | Network Working Group S. Bensley | |||
| Internet-Draft Microsoft | Internet-Draft Microsoft | |||
| Intended status: Informational L. Eggert | Intended status: Informational L. Eggert | |||
| Expires: January 19, 2017 NetApp | Expires: May 18, 2017 NetApp | |||
| D. Thaler | D. Thaler | |||
| P. Balasubramanian | P. Balasubramanian | |||
| Microsoft | Microsoft | |||
| G. Judd | G. Judd | |||
| Morgan Stanley | Morgan Stanley | |||
| July 18, 2016 | November 14, 2016 | |||
| Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters | Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters | |||
| draft-ietf-tcpm-dctcp-02 | draft-ietf-tcpm-dctcp-03 | |||
| Abstract | Abstract | |||
| This informational memo describes Datacenter TCP (DCTCP), an | This informational memo describes Datacenter TCP (DCTCP), an | |||
| improvement to TCP congestion control for datacenter traffic. DCTCP | improvement to TCP congestion control for datacenter traffic. DCTCP | |||
| uses improved Explicit Congestion Notification (ECN) processing to | uses improved Explicit Congestion Notification (ECN) processing to | |||
| estimate the fraction of bytes that encounter congestion, rather than | estimate the fraction of bytes that encounter congestion, rather than | |||
| simply detecting that some congestion has occurred. DCTCP then | simply detecting that some congestion has occurred. DCTCP then | |||
| scales the TCP congestion window based on this estimate. This method | scales the TCP congestion window based on this estimate. This method | |||
| achieves high burst tolerance, low latency, and high throughput with | achieves high burst tolerance, low latency, and high throughput with | |||
| shallow-buffered switches. This memo also discusses deployment | shallow-buffered switches. This memo also discusses deployment | |||
| issues related to the coexistence of DCTCP and conventional TCP, the | issues related to the coexistence of DCTCP and conventional TCP, the | |||
| lack of a negotiating mechanism between sender and receiver, and | lack of a negotiating mechanism between sender and receiver, and | |||
| presents some possible mitigations. | presents some possible mitigations. DCTCP as described in this draft | |||
| is applicable to deployments in controlled environments like | ||||
| datacenters but it MUST NOT be deployed over the public Internet | ||||
| without additional measures, as detailed in Section 5. | ||||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on January 19, 2017. | This Internet-Draft will expire on May 18, 2017. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| skipping to change at page 2, line 25 | skipping to change at page 2, line 25 | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 4 | 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 3.1. Marking Congestion on the Switches . . . . . . . . . . . 4 | 3.1. Marking Congestion on the L3 Switches and Routers . . . . 4 | |||
| 3.2. Echoing Congestion Information on the Receiver . . . . . 4 | 3.2. Echoing Congestion Information on the Receiver . . . . . 4 | |||
| 3.3. Processing Congestion Indications on the Sender . . . . . 5 | 3.3. Processing Congestion Indications on the Sender . . . . . 6 | |||
| 3.4. Handling of SYN, SYN-ACK, RST Packets . . . . . . . . . . 7 | 3.4. Handling of SYN, SYN-ACK, RST Packets . . . . . . . . . . 8 | |||
| 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 8 | 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 8 | |||
| 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 9 | 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 10 | 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 7. Implementation Status . . . . . . . . . . . . . . . . . . . . 11 | 7. Implementation Status . . . . . . . . . . . . . . . . . . . . 11 | |||
| 8. Security Considerations . . . . . . . . . . . . . . . . . . . 11 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 11 | |||
| 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 | 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11 | 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 12 | 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 11.1. Normative References . . . . . . . . . . . . . . . . . . 12 | 11.1. Normative References . . . . . . . . . . . . . . . . . . 12 | |||
| 11.2. Informative References . . . . . . . . . . . . . . . . . 12 | 11.2. Informative References . . . . . . . . . . . . . . . . . 13 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 13 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 1. Introduction | 1. Introduction | |||
| Large datacenters necessarily need many network switches to | Large datacenters necessarily need many network switches to | |||
| interconnect their many servers. Therefore, a datacenter can greatly | interconnect their many servers. Therefore, a datacenter can greatly | |||
| reduce its capital expenditure by leveraging low-cost switches. | reduce its capital expenditure by leveraging low-cost switches. | |||
| However, such low-cost switches tend to have limited queue capacities | However, such low-cost switches tend to have limited queue capacities | |||
| and are thus more susceptible to packet loss due to congestion. | and are thus more susceptible to packet loss due to congestion. | |||
| Network traffic in a datacenter is often a mix of short and long | Network traffic in a datacenter is often a mix of short and long | |||
| skipping to change at page 3, line 18 | skipping to change at page 3, line 18 | |||
| These factors place some conflicting demands on the queue occupancy | These factors place some conflicting demands on the queue occupancy | |||
| of a switch: | of a switch: | |||
| o The queue must be short enough that it does not impose excessive | o The queue must be short enough that it does not impose excessive | |||
| latency on short flows. | latency on short flows. | |||
| o The queue must be long enough to buffer sufficient data for the | o The queue must be long enough to buffer sufficient data for the | |||
| long flows to saturate the path capacity. | long flows to saturate the path capacity. | |||
| o The queue must be short enough to absorb incast bursts without | o The queue must be long enough to absorb incast bursts without | |||
| excessive packet loss. | excessive packet loss. | |||
| Standard TCP congestion control [RFC5681] relies on packet loss to | Standard TCP congestion control [RFC5681] relies on packet loss to | |||
| detect congestion. This does not meet the demands described above. | detect congestion. This does not meet the demands described above. | |||
| First, short flows will start to experience unacceptable latencies | First, short flows will start to experience unacceptable latencies | |||
| before packet loss occurs. Second, by the time TCP congestion | before packet loss occurs. Second, by the time TCP congestion | |||
| control kicks in on the senders, most of the incast burst has already | control kicks in on the senders, most of the incast burst has already | |||
| been dropped. | been dropped. | |||
| [RFC3168] describes a mechanism for using Explicit Congestion | [RFC3168] describes a mechanism for using Explicit Congestion | |||
| skipping to change at page 3, line 45 | skipping to change at page 3, line 45 | |||
| Datacenter TCP (DCTCP) improves traditional ECN processing by | Datacenter TCP (DCTCP) improves traditional ECN processing by | |||
| estimating the fraction of bytes that encounter congestion, rather | estimating the fraction of bytes that encounter congestion, rather | |||
| than simply detecting that some congestion has occurred. DCTCP then | than simply detecting that some congestion has occurred. DCTCP then | |||
| scales the TCP congestion window based on this estimate. This method | scales the TCP congestion window based on this estimate. This method | |||
| achieves high burst tolerance, low latency, and high throughput with | achieves high burst tolerance, low latency, and high throughput with | |||
| shallow-buffered switches. | shallow-buffered switches. | |||
| It is recommended that DCTCP be only deployed in a datacenter | It is recommended that DCTCP be only deployed in a datacenter | |||
| environment where the endpoints and the switching fabric are under a | environment where the endpoints and the switching fabric are under a | |||
| single administrative domain. This protocol is not meant for | single administrative domain. This protocol is not meant for | |||
| uncontrolled deployment in the global Internet. | uncontrolled deployment in the global Internet. Refer to Section 5 | |||
| for more details. | ||||
| 2. Terminology | 2. Terminology | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. Normative | document are to be interpreted as described in [RFC2119]. Normative | |||
| language is used to describe how necessary the various aspects of the | language is used to describe how necessary the various aspects of the | |||
| Microsoft implementation are for interoperability, but even compliant | Microsoft implementation are for interoperability, but even compliant | |||
| implementations without the measures in sections 4-6 would still only | implementations without the measures in sections 4-6 would still only | |||
| be safe to deploy in controlled environments. | be safe to deploy in controlled environments. | |||
| skipping to change at page 4, line 22 | skipping to change at page 4, line 23 | |||
| o The switches (or other intermediate devices in the network) detect | o The switches (or other intermediate devices in the network) detect | |||
| congestion and set the Congestion Encountered (CE) codepoint in | congestion and set the Congestion Encountered (CE) codepoint in | |||
| the IP header. | the IP header. | |||
| o The receiver echoes the congestion information back to the sender, | o The receiver echoes the congestion information back to the sender, | |||
| using the ECN-Echo (ECE) flag in the TCP header. | using the ECN-Echo (ECE) flag in the TCP header. | |||
| o The sender computes a congestion estimate and reacts, by reducing | o The sender computes a congestion estimate and reacts, by reducing | |||
| the TCP congestion window accordingly (cwnd). | the TCP congestion window accordingly (cwnd). | |||
| 3.1. Marking Congestion on the Switches | 3.1. Marking Congestion on the L3 Switches and Routers | |||
| The switches in a datacenter fabric indicate congestion to the end | The L3 switches and routers in a datacenter fabric indicate | |||
| nodes by setting the CE codepoint in the IP header as specified in | congestion to the end nodes by setting the CE codepoint in the IP | |||
| Section 5 of [RFC3168]. For example, the switches may be configured | header as specified in Section 5 of [RFC3168]. For example, the | |||
| with a congestion threshold. When a packet arrives at a switch and | switches may be configured with a congestion threshold. When a | |||
| its queue length is greater than the congestion threshold, the switch | packet arrives at a switch and its queue length is greater than the | |||
| sets the CE codepoint in the packet. For example, Section 3.4 of | congestion threshold, the switch sets the CE codepoint in the packet. | |||
| [DCTCP10] suggests threshold marking with a threshold K > (RTT * | For example, Section 3.4 of [DCTCP10] suggests threshold marking with | |||
| C)/7, where C is the link rate in packets per second. However, the | a threshold K > (RTT * C)/7, where C is the link rate in packets per | |||
| actual algorithm for marking congestion is an implementation detail | second. However, the actual algorithm for marking congestion is an | |||
| of the switch and will generally not be known to the sender and | implementation detail of the switch and will generally not be known | |||
| receiver. Therefore, sender and receiver should not assume that a | to the sender and receiver. Therefore, sender and receiver should | |||
| particular marking algorithm is implemented by the switching fabric. | not assume that a particular marking algorithm is implemented by the | |||
| switching fabric. | ||||
| 3.2. Echoing Congestion Information on the Receiver | 3.2. Echoing Congestion Information on the Receiver | |||
| According to Section 6.1.3 of [RFC3168], the receiver sets the ECE | According to Section 6.1.3 of [RFC3168], the receiver sets the ECE | |||
| flag if any of the packets being acknowledged had the CE code point | flag if any of the packets being acknowledged had the CE code point | |||
| set. The receiver then continues to set the ECE flag until it | set. The receiver then continues to set the ECE flag until it | |||
| receives a packet with the Congestion Window Reduced (CWR) flag set. | receives a packet with the Congestion Window Reduced (CWR) flag set. | |||
| However, the DCTCP algorithm requires more detailed congestion | However, the DCTCP algorithm requires more detailed congestion | |||
| information. In particular, the sender must be able to determine the | information. In particular, the sender must be able to determine the | |||
| number of bytes sent that encountered congestion. Thus, the scheme | number of bytes sent that encountered congestion. Thus, the scheme | |||
| described in [RFC3168] does not suffice. | described in [RFC3168] does not suffice. | |||
| One possible solution is to ACK every packet and set the ECE flag in | One possible solution is to ACK every packet and set the ECE flag in | |||
| the ACK if and only if the CE code point was set in the packet being | the ACK if and only if the CE code point was set in the packet being | |||
| acknowledged. However, this prevents the use of delayed ACKs, which | acknowledged. However, this prevents the use of delayed ACKs, which | |||
| are an important performance optimization in datacenters. | are an important performance optimization in datacenters. If the | |||
| delayed ACK frequency is m, then an ACK is generated every m packets. | ||||
| The typical value of m is 2 but it could be affected by ACK | ||||
| throttling or packet coalescing techniques designed to improve | ||||
| performance. | ||||
| Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP | Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP | |||
| Congestion Encountered" (DCTCP.CE), which is initialized to false and | Congestion Encountered" (DCTCP.CE), which is initialized to false and | |||
| stored in the Transmission Control Block (TCB). When sending an ACK, | stored in the Transmission Control Block (TCB). When sending an ACK, | |||
| the ECE flag MUST be set if and only if DCTCP.CE is true. When | the ECE flag MUST be set if and only if DCTCP.CE is true. When | |||
| receiving packets, the CE codepoint MUST be processed as follows: | receiving packets, the CE codepoint MUST be processed as follows: | |||
| 1. If the CE codepoint is set and DCTCP.CE is false, send an ACK for | 1. If the CE codepoint is set and DCTCP.CE is false, send an ACK for | |||
| any previously unacknowledged packets and set DCTCP.CE to true. | any previously unacknowledged packets and set DCTCP.CE to true. | |||
| 2. If the CE codepoint is not set and DCTCP.CE is true, send an ACK | 2. If the CE codepoint is not set and DCTCP.CE is true, send an ACK | |||
| for any previously unacknowledged packets and set DCTCP.CE to | for any previously unacknowledged packets and set DCTCP.CE to | |||
| false. | false. | |||
| 3. Otherwise, ignore the CE codepoint. | 3. Otherwise, ignore the CE codepoint. | |||
| The immediate ACK generated SHOULD NOT acknowledge any data in the | ||||
| received packet that changes the DCTCP.CE state. | ||||
| Receiver handling of the "Congestion Window Reduced" (CWR) bit is | Receiver handling of the "Congestion Window Reduced" (CWR) bit is | |||
| also exactly as per [RFC3168] including [RFC3168-ERRATA3639]. That | also per [RFC3168] including [RFC3168-ERRATA3639]. That is, on | |||
| is, on receipt of a segment with both the CE and CWR bits set, CWR is | receipt of a segment with both the CE and CWR bits set, CWR is | |||
| processed first and then ECE is processed. | processed first and then ECE is processed. | |||
| Send immediate | Send immediate | |||
| ACK with ECE=0 | ACK with ECE=0 | |||
| .----. .-------------. .---. | .----. .-------------. .---. | |||
| Send 1 ACK / v v | | \ | Send 1 ACK / v v | | \ | |||
| for every | .------. .------. | Send 1 ACK | for every | .------. .------. | Send 1 ACK | |||
| m packets | | CE=0 | | CE=1 | | for every | m packets | | CE=0 | | CE=1 | | for every | |||
| with ECE=0 | '------' '------' | m packets | with ECE=0 | '------' '------' | m packets | |||
| \ | | ^ ^ / with ECE=1 | \ | | ^ ^ / with ECE=1 | |||
| skipping to change at page 6, line 18 | skipping to change at page 6, line 33 | |||
| particular, an observation window ends when all bytes in flight at | particular, an observation window ends when all bytes in flight at | |||
| the beginning of the window have been acknowledged. | the beginning of the window have been acknowledged. | |||
| In order to update DCTCP.Alpha, the TCP state variables defined in | In order to update DCTCP.Alpha, the TCP state variables defined in | |||
| [RFC0793] are used, and three additional TCP state variables are | [RFC0793] are used, and three additional TCP state variables are | |||
| introduced: | introduced: | |||
| o DCTCP.WindowEnd: The TCP sequence number threshold for beginning a | o DCTCP.WindowEnd: The TCP sequence number threshold for beginning a | |||
| new observation window; initialized to SND.UNA. | new observation window; initialized to SND.UNA. | |||
| o DCTCP.BytesSent: The number of bytes sent during the current | o DCTCP.BytesAcked: The number of sent bytes acknowledged during the | |||
| observation window; initialized to zero. | current observation window; initialized to zero. | |||
| o DCTCP.BytesMarked: The number of bytes sent during the current | o DCTCP.BytesMarked: The number of bytes sent during the current | |||
| observation window that encountered congestion; initialized to | observation window that encountered congestion; initialized to | |||
| zero. | zero. | |||
| The congestion estimator on the sender SHOULD process acceptable ACKs | The congestion estimator on the sender SHOULD process acceptable ACKs | |||
| as follows: | as follows: | |||
| 1. Compute the bytes acknowledged (TCP SACK options [RFC2018] are | 1. Compute the bytes acknowledged (TCP SACK options [RFC2018] are | |||
| ignored for this computation): | ignored for this computation): | |||
| BytesAcked = SEG.ACK - SND.UNA | BytesAcked = SEG.ACK - SND.UNA | |||
| 2. Update the bytes sent: | 2. Update the bytes sent: | |||
| DCTCP.BytesSent += BytesAcked | DCTCP.BytesAcked += BytesAcked | |||
| 3. If the ECE flag is set, update the bytes marked: | 3. If the ECE flag is set, update the bytes marked: | |||
| DCTCP.BytesMarked += BytesAcked | DCTCP.BytesMarked += BytesAcked | |||
| 4. If the acknowledgment number is less than or equal to | 4. If the acknowledgment number is less than or equal to | |||
| DCTCP.WindowEnd, stop processing. Otherwise, the end of the | DCTCP.WindowEnd, stop processing. Otherwise, the end of the | |||
| observation window has been reached, so proceed to update the | observation window has been reached, so proceed to update the | |||
| congestion estimate as follows: | congestion estimate as follows: | |||
| 5. Compute the congestion level for the current observation window: | 5. Compute the congestion level for the current observation window: | |||
| M = DCTCP.BytesMarked / DCTCP.BytesSent | M = DCTCP.BytesMarked / DCTCP.BytesAcked | |||
| 6. Update the congestion estimate: | 6. Update the congestion estimate: | |||
| DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M | DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M | |||
| 7. Determine the end of the next observation window: | 7. Determine the end of the next observation window: | |||
| DCTCP.WindowEnd = SND.NXT | DCTCP.WindowEnd = SND.NXT | |||
| 8. Reset the byte counters: | 8. Reset the byte counters: | |||
| DCTCP.BytesSent = DCTCP.BytesMarked = 0 | DCTCP.BytesAcked = DCTCP.BytesMarked = 0 | |||
| Rather than always halving the congestion window as described in | Rather than always halving the congestion window as described in | |||
| [RFC3168], when the sender receives an indication of congestion | [RFC3168], when the sender receives an indication of congestion | |||
| (ECE), the sender SHOULD update cwnd as follows: | (ECE), the sender SHOULD update cwnd as follows: | |||
| cwnd = cwnd * (1 - DCTCP.Alpha / 2) | cwnd = cwnd * (1 - DCTCP.Alpha / 2) | |||
| Thus, when no bytes sent experienced congestion, DCTCP.Alpha equals | Thus, when no bytes sent experienced congestion, DCTCP.Alpha equals | |||
| zero, and cwnd is left unchanged. When all sent bytes experienced | zero, and cwnd is left unchanged. When all sent bytes experienced | |||
| congestion, DCTCP.Alpha equals one, and cwnd is reduced by half. | congestion, DCTCP.Alpha equals one, and cwnd is reduced by half. | |||
| skipping to change at page 7, line 39 | skipping to change at page 8, line 7 | |||
| potential misconfigurations. | potential misconfigurations. | |||
| A DCTCP sender MUST deal with loss episodes in the same way as | A DCTCP sender MUST deal with loss episodes in the same way as | |||
| conventional TCP. In case of a timeout or fast retransmit or any | conventional TCP. In case of a timeout or fast retransmit or any | |||
| change in delay (for delay based congestion control), the cwnd and | change in delay (for delay based congestion control), the cwnd and | |||
| other state variables like ssthresh must be changed in the same way | other state variables like ssthresh must be changed in the same way | |||
| that a conventional TCP would have changed them. | that a conventional TCP would have changed them. | |||
| 3.4. Handling of SYN, SYN-ACK, RST Packets | 3.4. Handling of SYN, SYN-ACK, RST Packets | |||
| [RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or | The switching fabric can drop TCP packets that do not have the ECT | |||
| SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, | set in the IP header. If SYN and SYN-ACK packets for DCTCP | |||
| but maintains the restriction of no ECT on SYN packets. Both these | connections do not have ECT set, they will be dropped with high | |||
| RFCs prohibit ECT in SYN packets due to security concerns regarding | probability. For DCTCP connections, the sender SHOULD set ECT for | |||
| malicious SYN packets with ECT set. These RFCs, however, are | SYN, SYN-ACK and RST packets. | |||
| intended for general Internet use, and do not directly apply to a | ||||
| controlled datacenter environment. The switching fabric can drop TCP | ||||
| packets that do not have the ECT set in the IP header. If SYN and | ||||
| SYN-ACK packets for DCTCP connections do not have ECT set, they will | ||||
| be dropped with high probability. For DCTCP connections, the sender | ||||
| SHOULD set ECT for SYN, SYN-ACK and RST packets. The security | ||||
| concerns addressed by both these RFCs might not apply in controlled | ||||
| environments like datacenters, and it might not be necessary to cater | ||||
| to both the presence of non-ECN servers. | ||||
| 4. Implementation Issues | 4. Implementation Issues | |||
| As noted in Section 3.3, the implementation will need to choose a | As noted in Section 3.3, the implementation will need to choose a | |||
| suitable estimation gain. [DCTCP10] provides a theoretical basis for | suitable estimation gain. [DCTCP10] provides a theoretical basis for | |||
| selecting the gain. However, it may be more practical to use | selecting the gain. However, it may be more practical to use | |||
| experimentation to select a suitable gain for a particular network | experimentation to select a suitable gain for a particular network | |||
| and workload. The Microsoft implementation of DCTCP in Windows | and workload. The Microsoft implementation of DCTCP in Windows | |||
| Server 2012 uses a fixed estimation gain of 1/16. | Server 2012 uses a fixed estimation gain of 1/16. | |||
| skipping to change at page 8, line 34 | skipping to change at page 8, line 42 | |||
| ensure a DCTCP sender is always paired with a DCTCP receiver. One | ensure a DCTCP sender is always paired with a DCTCP receiver. One | |||
| approach is to enable DCTCP based on the IP address of the remote | approach is to enable DCTCP based on the IP address of the remote | |||
| endpoint. Another approach is to detect connections that transmit | endpoint. Another approach is to detect connections that transmit | |||
| within the bounds of a datacenter. For example, Microsoft Windows | within the bounds of a datacenter. For example, Microsoft Windows | |||
| Server 2012 (and later versions) supports automatic selection of | Server 2012 (and later versions) supports automatic selection of | |||
| DCTCP if the estimated RTT is less than 10 msec and ECN is | DCTCP if the estimated RTT is less than 10 msec and ECN is | |||
| successfully negotiated, under the assumption that if the RTT is low, | successfully negotiated, under the assumption that if the RTT is low, | |||
| then the two endpoints are likely in the same datacenter network. | then the two endpoints are likely in the same datacenter network. | |||
| [RFC3168] forbids the ECN-marking of pure ACK packets, because of the | [RFC3168] forbids the ECN-marking of pure ACK packets, because of the | |||
| inability of TCP to mitigate ACK-path congestion and the extra | inability of TCP to mitigate ACK-path congestion. RFC 3168 also | |||
| advantage to injection attackers that ECN is perceived to offer. For | forbids ECN-marking of retransmissions, window probes and RSTs. | |||
| the latter reason RFC 3168 also forbids ECN-marking of | However, dropping all these control packets - rather than ECN marking | |||
| retransmissions, window probes and RSTs. However, dropping all these | them - has considerable performance disadvantages. It is RECOMMENDED | |||
| control packets - rather than ECN marking them - has considerable | that an implementation provide a configuration knob that will cause | |||
| performance disadvantages. It is RECOMMENDED that an implementation | ECT to be set on such control packets, which can be used in | |||
| provide a configuration knob that will cause ECT to be set on such | environments where such concerns do not apply. | |||
| control packets, which can be used in environments where such concerns | ||||
| do not apply. | ||||
| It would be useful to implement DCTCP as additional actions on top of | It would be useful to implement DCTCP as additional actions on top of | |||
| an existing congestion control algorithm like NewReno. The DCTCP | an existing congestion control algorithm like NewReno. The DCTCP | |||
| implementation MAY also allow configuration of resetting the value of | implementation MAY also allow configuration of resetting the value of | |||
| DCTCP.Alpha as part of processing any loss episodes. | DCTCP.Alpha as part of processing any loss episodes. | |||
| The DCTCP.Alpha calculation as per the formula in Section 3.3 | The DCTCP.Alpha calculation as per the formula in Section 3.3 | |||
| involves fractions. An efficient kernel implementation MAY scale the | involves fractions. An efficient kernel implementation MAY scale the | |||
| DCTCP.Alpha value for efficient computation using shift operations. | DCTCP.Alpha value for efficient computation using shift operations. | |||
| For example, if the implementation chooses g as 1/16, multiplications | For example, if the implementation chooses g as 1/16, multiplications | |||
| skipping to change at page 9, line 18 | skipping to change at page 9, line 25 | |||
| DCTCP.Alpha does not exceed the scaling factor, which would be | DCTCP.Alpha does not exceed the scaling factor, which would be | |||
| equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be | equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be | |||
| clamped after an update. | clamped after an update. | |||
| This results in the following computations replacing steps 5 and 6 in | This results in the following computations replacing steps 5 and 6 in | |||
| Section 3.3, where SCF is the chosen scaling factor (65536 in the | Section 3.3, where SCF is the chosen scaling factor (65536 in the | |||
| example) and SHF is the shift factor (4 in the example): | example) and SHF is the shift factor (4 in the example): | |||
| 1. Compute the congestion level for the current observation window: | 1. Compute the congestion level for the current observation window: | |||
| ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesSent | ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked | |||
| 2. Update the congestion estimate: | 2. Update the congestion estimate: | |||
| if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 | if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 | |||
| DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) | DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) | |||
| if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF | if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF | |||
| 5. Deployment Issues | 5. Deployment Issues | |||
| skipping to change at page 11, line 13 ¶ | skipping to change at page 11, line 19 ¶ | |||
| negotiation, causing further interoperability issues. | negotiation, causing further interoperability issues. | |||
| Much like standard TCP, DCTCP is biased against flows with longer | Much like standard TCP, DCTCP is biased against flows with longer | |||
| RTTs. A method for improving the RTT fairness of DCTCP has been | RTTs. A method for improving the RTT fairness of DCTCP has been | |||
| proposed in [ADCTCP], but requires additional experimental | proposed in [ADCTCP], but requires additional experimental | |||
| evaluation. | evaluation. | |||
| 7. Implementation Status | 7. Implementation Status | |||
| This section documents the implementation status of the specification | This section documents the implementation status of the specification | |||
| in this document, as recommended by [RFC6982]. | in this document, as recommended by [RFC7942]. | |||
| This document describes DCTCP as implemented in Microsoft Windows | This document describes DCTCP as implemented in Microsoft Windows | |||
| Server 2012. Since publication of the first versions of this | Server 2012. Since publication of the first versions of this | |||
| document, the Linux [LINUX] and FreeBSD [FREEBSD] operating systems | document, the Linux [LINUX] and FreeBSD [FREEBSD] operating systems | |||
| have also implemented support for DCTCP in a way that is believed to | have also implemented support for DCTCP in a way that is believed to | |||
| follow this document. | follow this document. | |||
| 8. Security Considerations | 8. Security Considerations | |||
| DCTCP enhances ECN and thus inherits the security considerations | DCTCP enhances ECN and thus inherits the security considerations | |||
| discussed in [RFC3168]. The processing changes introduced by DCTCP | discussed in [RFC3168]. The processing changes introduced by DCTCP | |||
| do not exacerbate these considerations or introduce new ones. In | do not exacerbate these considerations or introduce new ones. In | |||
| particular, with either algorithm, the network infrastructure or the | particular, with either algorithm, the network infrastructure or the | |||
| remote endpoint can falsely report congestion and thus cause the | remote endpoint can falsely report congestion and thus cause the | |||
| sender to reduce cwnd. However, this is no worse than what can be | sender to reduce cwnd. However, this is no worse than what can be | |||
| achieved by simply dropping packets. | achieved by simply dropping packets. | |||
| [RFC3168] requires that a compliant TCP must not set ECT on SYN or | ||||
| SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, | ||||
| but maintains the restriction of no ECT on SYN packets. Both these | ||||
| RFCs prohibit ECT in SYN packets due to security concerns regarding | ||||
| malicious SYN packets with ECT set. These RFCs, however, are | ||||
| intended for general Internet use, and do not directly apply to a | ||||
| controlled datacenter environment. The security concerns addressed | ||||
| by both these RFCs might not apply in controlled environments like | ||||
| datacenters, and it might not be necessary to account for the | ||||
| presence of non-ECN servers. Since most servers run virtualized in | ||||
| datacenters, additional security can be imposed in the physical | ||||
| servers to intercept and drop traffic resembling an attack. | ||||
| 9. IANA Considerations | 9. IANA Considerations | |||
| This document has no actions for IANA. | This document has no actions for IANA. | |||
| 10. Acknowledgements | 10. Acknowledgements | |||
| The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] | The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] | |||
| by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, | by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, | |||
| Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari | Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari | |||
| Sridharan. | Sridharan. | |||
| skipping to change at page 12, line 40 ¶ | skipping to change at page 13, line 13 ¶ | |||
| <http://www.rfc-editor.org/info/rfc5681>. | <http://www.rfc-editor.org/info/rfc5681>. | |||
| [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. | [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. | |||
| Ramakrishnan, "Adding Explicit Congestion Notification | Ramakrishnan, "Adding Explicit Congestion Notification | |||
| (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, | (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, | |||
| DOI 10.17487/RFC5562, June 2009, | DOI 10.17487/RFC5562, June 2009, | |||
| <http://www.rfc-editor.org/info/rfc5562>. | <http://www.rfc-editor.org/info/rfc5562>. | |||
| 11.2. Informative References | 11.2. Informative References | |||
| [RFC6982] Sheffer, Y. and A. Farrel, "Improving Awareness of Running | [RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running | |||
| Code: The Implementation Status Section", RFC 6982, | Code: The Implementation Status Section", BCP 205, | |||
| DOI 10.17487/RFC6982, July 2013, | RFC 7942, DOI 10.17487/RFC7942, July 2016, | |||
| <http://www.rfc-editor.org/info/rfc6982>. | <http://www.rfc-editor.org/info/rfc7942>. | |||
| [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, | [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, | |||
| P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data | P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data | |||
| Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc. | Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc. | |||
| ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010, | ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010, | |||
| <http://dl.acm.org/citation.cfm?doid=1851182.1851192>. | <http://dl.acm.org/citation.cfm?doid=1851182.1851192>. | |||
| [ODCTCP] Kato, M., "Improving Transmission Performance with One- | [ODCTCP] Kato, M., "Improving Transmission Performance with One- | |||
| Sided Datacenter TCP", M.S. Thesis, Keio University, | Sided Datacenter TCP", M.S. Thesis, Keio University, | |||
| 2014, <http://eggert.org/students/kato-thesis.pdf>. | 2014, <http://eggert.org/students/kato-thesis.pdf>. | |||
| End of changes. 26 change blocks. | ||||
| 64 lines changed or deleted | 78 lines changed or added | |||
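The scaled DCTCP.Alpha update shown in the diff (steps 1 and 2, with the -03 wording that divides by DCTCP.BytesAcked) can be sketched in C as follows. This is a minimal illustration, not the code of any particular kernel: the function name `dctcp_update_alpha`, the macro names, the 64-bit intermediate arithmetic, and the guard against a zero DCTCP.BytesAcked are all assumptions added for the example. SCF = 65536 and SHF = 4 are the example values given in the text.

```c
#include <stdint.h>

/* Example values from the draft text: scaling factor and shift factor. */
#define DCTCP_SCF 65536u  /* represents 100% congestion */
#define DCTCP_SHF 4u      /* corresponds to g = 1/16 */

/*
 * Hypothetical helper: returns the new scaled DCTCP.Alpha given the
 * previous scaled value and the byte counters of one observation window.
 */
static uint32_t dctcp_update_alpha(uint32_t alpha,
                                   uint64_t bytes_marked,
                                   uint64_t bytes_acked)
{
    uint64_t scaled_m = 0;
    uint64_t a = alpha;

    /* Step 1: ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked.
     * The zero check is an assumption; the draft does not spell it out. */
    if (bytes_acked > 0)
        scaled_m = ((uint64_t)DCTCP_SCF * bytes_marked) / bytes_acked;

    /* Step 2a: flush a residue too small to decay by shifting. */
    if ((a >> DCTCP_SHF) == 0)
        a = 0;

    /* Step 2b: Alpha += (ScaledM >> SHF) - (Alpha >> SHF).
     * The subtraction is done first so the unsigned value never wraps. */
    a = a - (a >> DCTCP_SHF) + (scaled_m >> DCTCP_SHF);

    /* Step 2c: clamp so Alpha never represents more than 100% congestion. */
    if (a > DCTCP_SCF)
        a = DCTCP_SCF;

    return (uint32_t)a;
}
```

With these example values, a fully marked window moves a zero estimate to SCF/16 = 4096, and an unmarked window reduces a saturated estimate from 65536 to 61440. Without step 2a, a value below 16 would never reach zero, because its right shift by SHF is already zero.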