TCP Maintenance and Minor Extensions                  M. Kuehlewind, Ed.
(tcpm)                                           University of Stuttgart
Internet-Draft                                          R. Scheffenegger
Intended status: Experimental                               NetApp, Inc.
Expires: January 17, 2013                                  July 16, 2012

                   More Accurate ECN Feedback in TCP
                 draft-kuehlewind-tcpm-accurate-ecn-01

Abstract

Explicit Congestion Notification (ECN) is an IP/TCP mechanism where network nodes can mark IP packets instead of dropping them to indicate congestion to the end-points. An ECN-capable receiver feeds this information back to the sender. ECN is specified for TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT). Recently proposed TCP mechanisms like ConEx or DCTCP need more accurate ECN feedback information when more than one marking is received in one RTT. This document specifies a different scheme for the ECN feedback in the TCP header to provide more than one feedback signal per RTT.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 17, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
   1.1. Use Cases
   1.2. Overview ECN and ECN Nonce in IP/TCP
   1.3. Requirements
   1.4. Design choices
   1.5. Requirements Language
2. Negotiation during the TCP handshake
3. More Accurate ECN Feedback
   3.1. Codepoint Coding
   3.2. More Accurate ECN TCP Sender
   3.3. More Accurate ECN TCP Receiver
      3.3.1. Implementation
   3.4. Advanced Compatibility Mode
4. Acknowledgements
5. IANA Considerations
6. Security Considerations
7. References
   7.1. Normative References
   7.2. Informative References
Appendix A. Estimating CE-marked bytes
Appendix B. Use with ECN Nonce
   B.1. Pseudo Code for the Codepoint Coding
Authors' Addresses

1. Introduction

Explicit Congestion Notification (ECN) [RFC3168] is an IP/TCP mechanism where network nodes can mark IP packets instead of dropping them to indicate congestion to the end-points. An ECN-capable receiver feeds this information back to the sender. ECN is specified for TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT). Recently proposed mechanisms like Congestion Exposure (ConEx) or DCTCP [Ali10] need more accurate ECN feedback information when more than one marking is received in one RTT.

This document specifies a different scheme for the ECN feedback in the TCP header to provide more than one feedback signal per RTT. This modification does not obsolete [RFC3168]. To avoid confusion we call the ECN specification of [RFC3168] 'classic ECN' in this document. This document provides an extension that requires additional negotiation in the TCP handshake by using the TCP nonce sum (NS) bit, as specified in [RFC3540], which is currently not used when SYN is set. If the more accurate ECN extension has been negotiated successfully, the meaning of the ECN TCP bits and the ECN NS bit is different from the specification in [RFC3168] and [RFC3540]. This document specifies the additional negotiation as well as the new coding of the TCP ECN/NS bits.

The proposed coding scheme maintains the given bit space, as the ECN feedback information is needed in a timely manner and as such should be reported in every ACK. The reuse avoids additional network load, as the ACK size will not increase. Moreover, the more accurate ECN information will replace the classic ECN feedback if negotiated.
Thus those bits are not needed otherwise. However, the proposed scheme also requires the use of the NS bit in the TCP handshake as well as for the more accurate ECN feedback itself.

The proposed more accurate ECN feedback extension can include the ECN-Nonce integrity mechanism, as some coding space is left open. The use of ECN-Nonce is not part of the specification in this document but is discussed in the appendix.

1.1. Use Cases

The following scenarios briefly show where the accurate feedback is needed or provides additional value:

A Standard (RFC5681) TCP sender that supports ConEx:
   In this case the congestion control algorithm still ignores multiple marks per RTT, while the ConEx mechanism uses the extra information per RTT to re-echo more precise congestion information.

A sender using DCTCP congestion control without ConEx:
   The congestion control algorithm uses the extra information per RTT to perform its decrease depending on the number of congestion marks.

A sender using DCTCP congestion control that also supports ConEx:
   Both the congestion control algorithm and ConEx use the accurate ECN feedback mechanism.

A standard TCP sender (using the RFC5681 congestion control algorithm) without ConEx:
   No accurate feedback is necessary here. The congestion control algorithm still reacts to only one signal per RTT. But it is best to have one generic feedback mechanism, whether it is used or not.

1.2. Overview ECN and ECN Nonce in IP/TCP

ECN requires two bits in the IP header. The ECN capability of a packet is indicated when either one of the two bits is set. An ECN sender can set one or the other bit to indicate an ECN-capable transport (ECT), which results in two signals, ECT(0) and ECT(1). A network node can set both bits simultaneously when it experiences congestion. When both bits are set the packet is regarded as "Congestion Experienced" (CE).
In the TCP header the first two bits in byte 14 are defined for the use of ECN. The TCP mechanism for signaling the reception of a congestion mark uses the ECN-Echo (ECE) flag in the TCP header. To enable the TCP receiver to determine when to stop setting the ECN-Echo flag, the CWR flag is set by the sender upon reception of the feedback signal. This always leads to a full RTT of ACKs with ECE set. Thus any additional CE markings arriving within this RTT cannot be signaled back.

ECN-Nonce [RFC3540] is an optional addition to ECN that is used to protect the TCP sender against accidental or malicious concealment of marked or dropped packets. This addition defines the last bit of byte 13 in the TCP header as the Nonce Sum (NS) bit. With ECN-Nonce a nonce sum is maintained that counts the occurrence of ECT(1) packets.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 1: The (post-ECN Nonce) definition of the TCP header flags

1.3. Requirements

The requirements of the accurate ECN feedback protocol for the use of e.g. ConEx or DCTCP are to have fairly accurate (not necessarily perfect), timely and protected signaling. This leads to the following requirements:

Resilience
The ECN feedback signal is carried within the TCP acknowledgment. TCP ACKs can get lost. Moreover, delayed ACKs are mostly used with TCP. That means in most cases only every second data packet triggers an ACK. In a high congestion situation where most of the packets are marked with CE, an accurate feedback mechanism must still be able to signal sufficient congestion information.
Thus the accurate ECN feedback extension has to take delayed ACKs and ACK loss into account.

Timely
The CE marking is induced by a network node on the transmission path and echoed by the receiver in the TCP acknowledgment. Thus when this information arrives at the sender, it is naturally already about one RTT old. With a sufficient ACK rate a further delay of a small number of ACKs can be tolerated, but with large delays this information will be outdated due to the high dynamics in the network. TCP congestion control, which introduces parts of these dynamics, operates on a time scale of one RTT. Thus the congestion feedback information should be delivered in a timely manner (within one RTT).

Integrity
With ECN Nonce, a misbehaving receiver or network node can be detected with a certain probability. As this accurate ECN feedback is reusing the NS bit, it is encouraged to ensure integrity at least as good as ECN Nonce. If this is not possible, alternative approaches should be provided for how a mechanism using the accurate ECN feedback extension can re-ensure integrity or give strong incentives for the receiver and network node to cooperate honestly.

Accuracy
Classic ECN feeds back one congestion notification per RTT, as this is supposed to be used for TCP congestion control, which reduces the sending rate at most once per RTT. The accurate ECN feedback scheme has to ensure that, if a congestion event occurs, at least one congestion notification is echoed and received per RTT, as classic ECN would do. Of course, the goal of this extension is to reconstruct the number of CE markings more accurately. However, a sender should not assume that it gets the exact number of congestion markings in all situations.

Complexity
Of course, the more accurate ECN feedback can also be used even if only one ECN feedback signal per RTT is needed.
The implementation should be as simple as possible and only a minimum of additional state information should be needed. A proposal for more accurate ECN feedback that fulfills this can then also be the standard ECN feedback mechanism.

1.4. Design choices

The idea of this document is to use the ECE, CWR and NS bits for additional capability negotiation during the SYN and SYN/ACK exchange, and then for the more accurate ECN feedback itself on subsequent packets in the flow (where SYN is not set).

Alternatively, a new TCP option could be introduced to help maintain the accuracy and integrity of the ECN feedback between receiver and sender. Such an option could provide more information. E.g. ECN for RTP/UDP explicitly provides the number of ECT(0), ECT(1), CE, non-ECT marked and lost packets. However, deploying new TCP options has its own challenges. A separate document proposes a new TCP Option for accurate ECN feedback [draft-kuehlewind-tcpm-accurate-ecn-option]. This option could be used in addition to the more accurate ECN feedback scheme described here or in addition to classic ECN, when available and needed.

As seen in Figure 1, there are currently three unused flag bits in the TCP header. The proposed scheme could be extended by one or more bits to add higher resiliency against ACK loss. The relative gain would be proportionally higher resiliency against ACK loss, while the respective drawbacks would remain identical. Thus the approach in this document is to stay within the given number of header bits, as they seem to be already sufficient. This accurate ECN feedback scheme will only be used instead of the classic ECN and never in parallel.

1.5. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

We use the following terminology from [RFC3168] and [RFC3540]:

The ECN field in the IP header:

   CE: the Congestion Experienced codepoint,
   ECT(0): the first ECN-Capable Transport codepoint, and
   ECT(1): the second ECN-Capable Transport codepoint.

The ECN flags in the TCP header:

   CWR: the Congestion Window Reduced flag,
   ECE: the ECN-Echo flag, and
   NS: ECN Nonce Sum.

In this document, we call the ECN feedback scheme as specified in [RFC3168] 'classic ECN' and our new proposal the 'more accurate ECN feedback' scheme. A 'congestion mark' is defined as an IP packet where the CE codepoint is set. A 'congestion event' refers to one or more congestion marks belonging to the same overload situation in the network (usually during one RTT).

2. Negotiation during the TCP handshake

During the TCP handshake at the start of a connection, the originator of the connection (host A) MUST indicate a request for more accurate ECN feedback by setting the TCP flags NS=1, CWR=1 and ECE=1 in the initial SYN.

A responding host (host B) MUST return a SYN/ACK with flags CWR=1 and ECE=0. The responding host MUST NOT set this combination of flags unless the preceding SYN has already requested support for more accurate ECN feedback as above. Normally a server (B) will reply to a client with NS=0, but if the initial SYN from client A is marked CE, the server B SHOULD set the NS flag to 1 to indicate the congestion immediately, instead of delaying the signal to the first acknowledgment when the actual data transmission has already started. So, server B MAY set the alternative TCP header flags in its SYN/ACK: NS=1, CWR=1 and ECE=0.
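As a rough illustration only (not part of this specification), the following Python sketch shows how host A might classify the flags of a SYN/ACK received in response to a SYN that requested more accurate ECN feedback (NS=1, CWR=1, ECE=1). The function name and the returned strings are hypothetical; the responses for ECN Nonce, classic ECN and the broken case are taken from Table 1, and treating every other combination as Not ECN is an assumption of this sketch.

```python
def classify_syn_ack(ns: int, cwr: int, ece: int) -> str:
    """Mode implied by the SYN/ACK flags, given a SYN with NS=1, CWR=1, ECE=1.

    Hypothetical helper: names and return strings are illustrative only.
    """
    if cwr == 1 and ece == 0:
        # NS may be 0 or 1; NS=1 signals that the SYN itself arrived CE-marked.
        return "accurate ECN"
    if cwr == 1 and ece == 1:
        # Flags merely reflected by a broken implementation.
        return "Not ECN (broken)"
    if cwr == 0 and ece == 1:
        return "ECN Nonce" if ns == 1 else "classic ECN"
    # Assumption: anything else is treated as a host without ECN support.
    return "Not ECN"
```

A caller would use the result to decide whether the AcE coding of Section 3.1 or one of the older ECN flavors applies to the rest of the connection.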
The addition of ECN to TCP SYN/ACK packets is discussed and specified as experimental in [RFC5562]. The addition of ECN to the SYN packet is optional. The security implications of using this option are not further discussed here.

This handshake is summarized in Table 1 below, with X indicating that NS can be either 0 or 1 depending on whether congestion had been experienced. The handshakes used for the other flavors of ECN are also shown for comparison. To compress the width of the table, the headings of the first four columns have been severely abbreviated, as follows:

   Ac: *Ac*curate ECN Feedback
   N: ECN-*N*once (RFC3540)
   E: *E*CN (RFC3168)
   I: Not-ECN (*I*mplicit congestion notification).

+----+---+---+---+------------+----------------+------------------+
| Ac | N | E | I |    A->B    |      B->A      | Mode             |
+----+---+---+---+------------+----------------+------------------+
|    |   |   |   | NS CWR ECE |   NS CWR ECE   |                  |
| AB |   |   |   |  1  1   1  |    X  1   0    | accurate ECN     |
| A  | B |   |   |  1  1   1  |    1  0   1    | ECN Nonce        |
| A  |   | B |   |  1  1   1  |    0  0   1    | classic ECN      |
| A  |   |   | B |  1  1   1  |    0  0   0    | Not ECN          |
| A  |   |   | B |  1  1   1  |    X  1   1    | Not ECN (broken) |
+----+---+---+---+------------+----------------+------------------+

Table 1: ECN capability negotiation between Sender (A) and Receiver (B)

Recall that, if the SYN/ACK reflects the same flag settings as the preceding SYN (because there is a broken TCP implementation that behaves this way), RFC3168 specifies that the whole connection MUST revert to Not-ECT.

3. More Accurate ECN Feedback

In this section we refer to the sender as the host sending data and to the receiver as the host that acknowledges this data. Of course, such a scenario describes only one half-connection of a TCP connection.
The proposed scheme, if negotiated, will be used for both half-connections, as both sender and receiver need to be able to echo and understand the accurate ECN feedback scheme.

This section proposes the new coding of the two ECN TCP bits (ECE/CWR) as well as the TCP NS bit to provide more accurate ECN feedback. This coding MUST only be used if the more accurate ECN feedback has been negotiated successfully in the TCP handshake.

Section 3.4 describes an alternative: a compatibility mode for a sender that needs more accurate ECN feedback but has to operate with a legacy [RFC3168] classic ECN receiver.

3.1. Codepoint Coding

The more accurate ECN feedback coding uses the ECE, CWR and NS bits as one field to encode 8 distinct codepoints. This overloaded use of these 3 header flags as one 3-bit more Accurate ECN (AcE) field is shown in Figure 2. The actual definition of the TCP header, including the addition of support for the ECN Nonce, is shown for comparison in Figure 1. This specification does not redefine the names of these three TCP flags; it merely overloads them with another definition once a flow with more accurate ECN feedback is established.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    AcE    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 2: Definition of the AcE field within bytes 13 and 14 of the TCP Header (when SYN=0).

The 8 possible codepoints are shown below. Five of them are used to encode a "congestion indication" (CI) counter. The other three codepoints are undefined but can be used for some kind of integrity check (see Appendix B). The CI counter maintains the number of CE marks observed at the receiver (see Section 3.3.1).
Also note that, whenever the SYN flag of a TCP segment is set (including when the ACK flag is also set), the NS, CWR and ECE flags (i.e. the AcE field of that segment) MUST NOT be interpreted as the 3-bit codepoint, which is only used in non-SYN packets.

+-----+----+-----+-----+------------+
| AcE | NS | CWR | ECE | CI (base5) |
+-----+----+-----+-----+------------+
|  0  | 0  |  0  |  0  |     0      |
|  1  | 0  |  0  |  1  |     1      |
|  2  | 0  |  1  |  0  |     2      |
|  3  | 0  |  1  |  1  |     3      |
|  4  | 1  |  0  |  0  |     4      |
|  5  | 1  |  0  |  1  |     -      |
|  6  | 1  |  1  |  0  |     -      |
|  7  | 1  |  1  |  1  |     -      |
+-----+----+-----+-----+------------+

Table 2: Codepoint assignment for accurate ECN feedback

By default an accurate ECN receiver MUST echo one of the codepoints encoding the CI counter value. Whenever a CE is received and thus the value of the CI has changed, the receiver MUST echo the CI in the next ACK. Moreover, the receiver MUST repeat the codepoint that provides the CI counter directly on the subsequent ACK. Thus every value of CI will be transmitted at least twice. Otherwise the receiver MAY send one of the other, currently undefined, codepoints.

With the available number of codepoints, this requirement may conflict with delayed ACK ratios larger than two. A receiver MUST change the ACK'ing rate such that a sufficient rate of feedback signals can be sent. Details on how the change in the ACK'ing rate can be implemented are given in Section 3.3.

3.2. More Accurate ECN TCP Sender

This section specifies the sender-side action, describing how to extract the number of congestion markings from the given receiver feedback signal.

When the more accurate ECN feedback scheme is supported by the sender, the sender will maintain a congestion indication received (CI.r) counter. This CI.r counter will hold the number of CE marks as signaled by the receiver, and reconstructed by the sender.
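For illustration only, the codepoint assignment of Table 2 can be sketched in Python as follows. The helper names are hypothetical and not part of this specification; the bit order (NS as the most significant bit of the AcE field, ECE as the least significant) follows the table.

```python
from typing import Optional, Tuple

CI_CODEPOINTS = 5  # AcE values 0..4 carry CI modulo 5; 5..7 are undefined here


def pack_ace(ns: int, cwr: int, ece: int) -> int:
    """Combine the three flag bits into the 3-bit AcE value."""
    return (ns << 2) | (cwr << 1) | ece


def unpack_ace(ace: int) -> Tuple[int, int, int]:
    """Split an AcE value back into (NS, CWR, ECE)."""
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def ace_for_ci(ci: int) -> int:
    """AcE codepoint a receiver would echo for a given CI counter value."""
    return ci % CI_CODEPOINTS


def ci_from_ace(ace: int) -> Optional[int]:
    """CI value (modulo 5) carried by an AcE codepoint, or None if it carries none."""
    return ace if ace < CI_CODEPOINTS else None
```

These helpers only apply to segments with SYN=0; on SYN and SYN/ACK segments the same three bits carry the negotiation flags instead.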
On the arrival of every ACK, the sender calculates the difference D between the local CI.r value modulo 5 and the signaled CI value of the codepoint in the ACK. The value of CI.r is increased by D, and D is assumed to be the number of CE marked packets that arrived at the receiver since it sent the previously received ACK.

3.3. More Accurate ECN TCP Receiver

This section describes the receiver-side action to signal the accurate ECN feedback back to the sender. The receiver will need to maintain a congestion indication (CI) counter of how many CE markings have been seen during a connection. Thus for each incoming segment with a CE marking, the receiver will increase CI by 1. With each ACK the receiver will calculate CI modulo 5 and set the respective codepoint in the AcE field (see Table 2). To avoid counter wrap-arounds in a high congestion situation, the receiver SHOULD switch from delayed ACK behavior to sending ACKs immediately after the data packet reception if needed.

3.3.1. Implementation

The receiver counts how many packets carry a congestion notification. This could, in principle, be achieved by directly increasing the CI for every incoming CE marked segment. Since the space for communicating the information back to the sender in ACKs is limited, a "gauge" (CI.g) is increased instead of this counter. When sending an ACK, the CI is increased by either CI.g or at maximum by 4, as a larger increase could cause an overflow in the codepoint counter signaling. Thereafter, CI.g is reduced by the same amount. Then the current CI value (modulo 5) is encoded in the current ACK.

To avoid losing information, it must be ensured that an ACK is sent at least after 5 incoming, outstanding congestion marks (i.e. when CI.g exceeds 5). Architecturally the counters never decrease during a TCP session.
However, any overflow MUST be modulo a multiple of 5 for CI.

For resilience against lost ACKs, an indicator flag (CI.i) SHOULD be used to ensure that, whether another congestion indication arrives or not, a second ACK transmits the previous counter value again. Thus when a codepoint is transmitted the first time, CI.i will be set to one. Then with the next ACK the same codepoint is transmitted again and CI.i is reset to zero. Only when CI.i is zero can the counter CI be increased. In case of heavy congestion (basically all segments are CE marked) CI.g might grow continuously. In this case the ACK rate should be increased by sending an immediate ACK for an incoming data segment.

The following table provides an example showing a half-connection with a TCP sender A and a TCP receiver B. The sender maintains a counter CI.r to reconstruct the number of CE marks seen at the receiver side.

+----+------+---------------+------------+---------------+------+
|    | Data |     TCP A     |     IP     |     TCP B     | Data |
+----+------+---------------+------------+---------------+------+
|    |      | SEQ  ACK  CTL |            | SEQ  ACK  CTL |      |
| -- |      | ------------- | ---------- | ------------- |      |
| 1  |      | 0100      SYN | ---->      |               |      |
|    |      | CWR,ECE,NS    |            |               |      |
| 2  |      |               | <----      | 0300 0101 SYN |      |
|    |      |               |            | ACK,CWR       |      |
| 3  |      | 0101 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=0 CI.g=1 |      |
| 4  | 100  | 0101 0301 ACK | ECT0 ----> |               |      |
|    |      |               |            | CI.c=1 CI.g=0 |      |
| 5  |      |               | <----      | 0301 0201 ACK |      |
|    |      |               |            | ECI=CI.1      |      |
|    |      | CI.r=1        |            |               |      |
| 6  | 100  | 0201 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=1 CI.g=1 |      |
| 7  | 100  | 0301 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=1 CI.g=2 |      |
| 8  |      |               | XX--       | 0301 0401 ACK |      |
|    |      |               |            | ECI=CI.1      |      |
|    |      | CI.r=1        |            |               |      |
| 9  | 100  | 0401 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=1 CI.g=3 |      |
| 10 | 100  | 0501 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=5 CI.g=0 |      |
| 11 |      |               | <----      | 0301 0601 ACK |      |
|    |      |               |            | ECI=CI.0      |      |
|    |      | CI.r=5        |            |               |      |
| 12 | 100  | 0601 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=5 CI.g=1 |      |
| 13 | 100  | 0701 0301 ACK | ECT0 -CE-> |               |      |
|    |      |               |            | CI.c=5 CI.g=2 |      |
| 14 |      |               | <----      | 0301 0801 ACK |      |
|    |      |               |            | ECI=CI.0      |      |
|    |      | CI.r=5        |            |               |      |
+----+------+---------------+------------+---------------+------+

Table 3: Codepoint signal example

3.4. Advanced Compatibility Mode

TBD (more detailed description see draft-ietf-conex-tcp-modifications)

This section describes a possible mechanism to achieve more accurate ECN feedback even when the receiver is not capable of the new more accurate ECN feedback scheme, with the drawback of less reliability.

During initial deployment, a large number of receivers will only support [RFC3168] classic ECN feedback. Such a receiver will set the ECE bit whenever it receives a segment with the CE codepoint set, and clear the ECE bit only when it receives a segment with the CWR bit set. As the CE codepoint has priority over the CWR bit (Note: the wording in this regard is ambiguous in [RFC3168], but the reference implementation of ECN in ns2 is clear), a [RFC3168] compliant receiver will not clear the ECE bit on the reception of a segment where both CE and CWR are set simultaneously. This property allows the use of a compatibility mode to extract more accurate feedback from legacy [RFC3168] receivers by setting the CWR bit permanently.

Assuming a delayed ACK ratio of one (no delayed ACKs), a sender can permanently set the CWR bit in the TCP header to receive more accurate feedback of the CE codepoints as seen at the receiver. This feedback signal is, however, very brittle, and any ACK loss may cause congestion information to be lost.

However, neither delayed ACKs nor ACK loss can be accounted for in a reliable way.
Therefore, a sender would need to use heuristics to determine the current delayed ACK ratio M used by the receiver (e.g. most receivers will use M=2), and also the recent ACK loss ratio. Acknowledgement Congestion Control (AckCC) as defined in [RFC5690] cannot be used, as deployment of this feature is only experimental.

Using a phase locked loop algorithm, the CWR bit can then be set only on those data segments that will trigger a (delayed) ACK. Thereby, no congestion information is lost, as long as the ACK carrying the ECE bit is seen by the sender.

Whenever the sender sees an ACK with ECE set, this indicates that at least one, and at most M, data segments with the CE codepoint set were seen by the receiver. The sender SHOULD react as if M CE indications were reflected back to the sender by the receiver, unless additional heuristics (e.g. dead time correction) can determine a more accurate value of the "true" number of received CE marks.

4. Acknowledgements

We want to thank Bob Briscoe and Michael Welzl for their input and discussion. Special thanks to Bob Briscoe, who first proposed the use of the ECN bits as one field and the handshake negotiation for more accurate ECN.

5. IANA Considerations

This memo includes no request to IANA.

6. Security Considerations

TBD

ACK loss

This scheme sends each codepoint (of the two subsets) at least two times. In the worst case at least one, and often two or more, consecutive ACKs can be dropped without losing congestion information.

Further refinements, such as interleaving ACKs when sending codepoints belonging to the two subsets (e.g. CI, E1), can allow the loss of any two consecutive ACKs without the sender losing congestion information, at the cost of also reducing the ACK ratio.
At low congestion rates, the sending of the current value of the CI counter by default allows higher numbers of consecutive ACKs to be lost without impacting the accuracy of the ECN signal.

ECN Nonce

In the proposed scheme there are three more codepoints available that could be used for an integrity check like ECN Nonce. If ECN Nonce were implemented as proposed in Appendix B, even more information would be provided for ECN Nonce than in the original specification.

A delayed ACK ratio of two can be sustained indefinitely even during heavy congestion, but not during excessive ECT(1) marking, which is under the control of the sender. A higher ACK ratio can be sustained when congestion is low, but a low ACK ratio may be needed for the E1 feedback.

7. References

7.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003.

7.2. Informative References

[Ali10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: Efficient Packet Transport for the Commoditized Data Center", Jan 2010.

[I-D.briscoe-tsvwg-re-ecn-tcp] Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in progress), October 2010.

[RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.

[RFC5690] Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding Acknowledgement Congestion Control to TCP", RFC 5690, February 2010.

[draft-kuehlewind-tcpm-accurate-ecn-option] Kuehlewind, M. and R. Scheffenegger, "Accurate ECN Feedback Option in TCP", draft-kuehlewind-tcpm-accurate-ecn-option-01 (work in progress), Jul 2012.

Appendix A. Estimating CE-marked bytes

TBD (see draft-ietf-conex-tcp-modifications-02 and 'late ACK' scheme of 1 Bit scheme in draft-kuehlewind-tcpm-accurate-ecn-00)

Appendix B. Use with ECN Nonce

In ECN Nonce, by comparing the number of incoming ECT(1) notifications with the actual number of packets that were transmitted with an ECT(1) mark as well as the sum of the sender's two internal counters, the sender can probabilistically detect a receiver that sends false marks or suppresses accurate ECN feedback, or a path that does not properly support ECN.

+-----+----+-----+-----+------------+------------+
| ECI | NS | CWR | ECE | CI (base5) | E1 (base3) |
+-----+----+-----+-----+------------+------------+
|  0  | 0  |  0  |  0  |     0      |     -      |
|  1  | 0  |  0  |  1  |     1      |     -      |
|  2  | 0  |  1  |  0  |     2      |     -      |
|  3  | 0  |  1  |  1  |     3      |     -      |
|  4  | 1  |  0  |  0  |     4      |     -      |
|  5  | 1  |  0  |  1  |     -      |     0      |
|  6  | 1  |  1  |  0  |     -      |     1      |
|  7  | 1  |  1  |  1  |     -      |     2      |
+-----+----+-----+-----+------------+------------+

Table 4: Codepoint assignment for accurate ECN feedback and ECN Nonce

If an ECT(1) mark is received, an ECT(1) counter (E1) is incremented. The receiver has to convey that updated information to the sender with the next possible ACK using the three remaining codepoints as shown in Table 4. Thus on the reception of an ECT(1) marked packet, the receiver should signal the current value of the E1 counter (modulo 3) in the next ACK. If a CE mark was received before sending the next ACK (e.g.
   delayed ACKs), sending the CI update MUST take precedence over the E1
   value.  The receiver should also send every E1 value a second time.
   This repetition does not need to be in the consecutive ACK, as the E1
   value will only be transmitted when no changes in the CI have
   occurred.  Each E1 value will therefore be sent exactly twice.  The
   repetition of every signal provides further resilience against lost
   ACKs.

   As only a limited number of E1 codepoints exist and the receiver
   might not acknowledge every single data packet immediately (delayed
   ACKs), a sender SHOULD NOT mark more than 1/m of the packets with
   ECT(1), where m is the ACK ratio (e.g. 50% when every second data
   packet triggers an ACK).  This constraint avoids a permanent feedback
   of E1 only, and must also be maintained on short timescales.  A
   sender SHOULD send no more than 3 consecutive packets marked with
   ECT(1).

   The same counter / gauge method as described in Section 3.3.1 can be
   used to count and return (using a different mapping) the number of
   incoming packets marked ECT(1) (called E1 in the algorithm).  As only
   a few codepoints are available for conveying the E1 counter value, an
   immediate ACK MUST be triggered whenever the gauge E1.g exceeds a
   threshold of 3.  The sender receives the receiver's counter values
   and compares them with the locally maintained counter.

B.1.  Pseudo Code for the Codepoint Coding

   IP signals: CE

   TCP Fields: AcE

   Counters:
      CI    Congestion Indication - counter [0..(n*5-1)]
      CI.g  Congestion Indication - gauge [0.."inf"]
      CI.i  Congestion Indication - indicator flag [0,1]

   At session initialization, all these counters are initialized to
   zero.
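   As a non-normative illustration of the codepoint assignment in
   Table 4, the following Python sketch maps a counter value to the
   three flag bits (NS, CWR, ECE) and back.  The function names
   encode_ace and decode_ace are hypothetical and chosen only for this
   example; the sketch assumes the caller has already reduced CI modulo
   5 and E1 modulo 3.

```python
# Illustrative only: Table 4 as a mapping between the 3-bit ECI value
# (NS, CWR, ECE) and the counter value it carries.
# encode_ace / decode_ace are hypothetical names, not part of the draft.


def encode_ace(kind, value):
    """Return (NS, CWR, ECE) for a CI value (0..4) or an E1 value (0..2)."""
    if kind == "CI" and 0 <= value <= 4:
        eci = value          # ECI 0..4 carry CI (base 5)
    elif kind == "E1" and 0 <= value <= 2:
        eci = 5 + value      # ECI 5..7 carry E1 (base 3)
    else:
        raise ValueError("value out of range for " + str(kind))
    # NS is the most significant bit, ECE the least significant bit.
    return (eci >> 2) & 1, (eci >> 1) & 1, eci & 1


def decode_ace(ns, cwr, ece):
    """Return ("CI", value) or ("E1", value) for the three flag bits."""
    eci = (ns << 2) | (cwr << 1) | ece
    if eci <= 4:
        return ("CI", eci)
    return ("E1", eci - 5)
```

   For example, decode_ace(1, 1, 0) corresponds to the ECI 6 row of
   Table 4 and yields an E1 value of 1.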
   When a segment (Data, ACK) is received, perform the following steps:

   If (CE)                     # When a CE codepoint is received,
      CI.g++                   # increase CI.g by 1

   If (ECT(1))                 # When an ECT(1) codepoint is received,
      E1.g++                   # increase E1.g by 1

   If (CI.g > 5) or            # When ACK rate is not sufficient to keep
      (E1.g > 3)               # gauges close to zero,
      Send ACK immediately     # increase ACK rate

   When preparing an ACK to be sent:

   If (CI.g > 0) or            # When there is an unsent change in CI
      ( (E1.i != 0) and        # this check is to in effect alternate
        (CI.i != 0) )          # sending CI and E1 codepoints
      If (CI.i == 0) and       # updates to CI allowed
         (CI.g > 0)            # update is meaningful
         CI.i = 1              # set flag to repeat CI value
         CI += min(4,CI.g)     # 4 for 5 codepoints
         CI %= 5               # using modulo the available codepoints
         CI.g -= min(4,CI.g)   # reduce the holding gauge accordingly
      Else
         CI.i--                # just in case CI.i was set to
                               # more than 1 for resiliency
      Send ACK with AcE set to CI
   Else
      If (E1.g > 0) or (E1.i != 0)
         If (E1.i == 0) and (E1.g > 0)
            E1.i = 1
            E1 += min(2, E1.g)
            E1 %= 3
            E1.g -= min(2, E1.g)
         Else
            E1.i--
         Send ACK with AcE set to E1
      Else
         Send ACK with AcE set to CI   # default action

   Sender:

   Counters:
      CI.r    - current value of CEs seen by receiver
      E1.s    - sum of all sent ECT(1) marked packets (up to snd.nxt)
      E1.s(t) - value of E1.s at time (in sequence space) t
      E1.r    - value signaled by receiver about received ECT(1)
                segments
      E1.r(t) - value of E1.r at time (in sequence space) t
      CI.r(t) - value of CI.r at time (in sequence space) t

   # Note: With a codepoint implementation,
   # a reverse table ECI[n] -> CI.r / E1.r is needed.

   # The wire protocol transports the absolute value
   # of the receiver-side counter.
   # Thus the (positive only) delta needs to be calculated,
   # and added to the sender-side counter.
   If ACK AcE in the set of CI values
      D = (AcE.CI + 5 - (CI.r mod 5)) mod 5
      CI.r += D

   If ACK AcE in the set of E1 values
      D = (AcE.E1 + 3 - (E1.r mod 3)) mod 3
      E1.r += D

   # Before CI.r or E1.r reach a (binary) rollover,
   # they need to roll over some multiple of 5
   # and 3 respectively.

   CI.r = CI.r modulo 255      # 5 * 51
   E1.r = E1.r modulo 255      # 3 * 85

   # (an implementation may choose to use another constant,
   # e.g. 3^4*5^4 (50625) for 16-bit integers,
   # or 3^8*5^8 (2562890625) for 32-bit integers)

   # The following test can (probabilistically) reveal
   # if the receiver or path is not properly
   # handling ECN (CE, E1) marks

   If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
      # -> receiver or path does not properly reflect ECN
      # (or too many ACKs got lost, which can also be
      # checked by the sender).

Authors' Addresses

   Mirja Kuehlewind (editor)
   University of Stuttgart
   Pfaffenwaldring 47
   Stuttgart  70569
   Germany

   Email: mirja.kuehlewind@ikr.uni-stuttgart.de


   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,  1120
   Austria

   Phone: +43 1 3676811 3146
   Email: rs@netapp.com
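
   As a non-normative illustration of the sender-side steps in the
   pseudo code of Appendix B.1, the following Python sketch computes
   the positive delta from a received codepoint value and applies the
   consistency test.  The names SenderEcnState, on_ack, and
   feedback_is_plausible are hypothetical and chosen only for this
   example.  The sketch assumes the AcE field has already been decoded
   into a ("CI" or "E1", value) pair according to Table 4, and that
   the counts passed to the consistency test have not yet wrapped
   around.

```python
# Illustrative only: sender-side bookkeeping for the receiver's CI and
# E1 counters. All names here are hypothetical, not part of the draft.


class SenderEcnState:
    """Sender's view of the counters signaled by the receiver."""

    WRAP = 255  # a multiple of both 5 and 3, as in the pseudo code

    def __init__(self):
        self.ci_r = 0  # CE marks reported by the receiver so far
        self.e1_r = 0  # ECT(1) marks reported by the receiver so far

    def on_ack(self, kind, value):
        """Apply one decoded AcE value and return the positive delta."""
        if kind == "CI":
            delta = (value + 5 - (self.ci_r % 5)) % 5
            self.ci_r = (self.ci_r + delta) % self.WRAP
        elif kind == "E1":
            delta = (value + 3 - (self.e1_r % 3)) % 3
            self.e1_r = (self.e1_r + delta) % self.WRAP
        else:
            raise ValueError("unknown codepoint kind: " + str(kind))
        return delta


def feedback_is_plausible(e1_sent, e1_reported, ci_reported):
    """Test E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)."""
    return e1_reported <= e1_sent <= e1_reported + ci_reported
```

   A repeated codepoint value yields a delta of zero, so the
   repetition of each signal by the receiver does not inflate the
   sender-side counters.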