idnits 2.17.1 draft-ietf-tcpm-accurate-ecn-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The draft header indicates that this document updates RFC3168, but the abstract doesn't seem to mention this, which it should. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: A host MAY NOT include an AccECN Option in any of these three cases if it has cached knowledge that the packet would be likely to be blocked on the path to the other host if it included an AccECN Option. 
(Using the creation date from RFC3168, updated by this document, for RFC5378 checks: 2000-11-17) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 28, 2020) is 1274 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'B' is mentioned on line 2395, but not defined ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) == Outdated reference: A later version (-11) exists of draft-ietf-tcpm-2140bis-05 == Outdated reference: A later version (-15) exists of draft-ietf-tcpm-generalized-ecn-05 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-07 -- Obsolete informational reference (is this intentional?): RFC 6824 (Obsoleted by RFC 8684) Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance & Minor Extensions (tcpm) B. Briscoe 3 Internet-Draft Independent 4 Updates: 3168 (if approved) M. Kuehlewind 5 Intended status: Standards Track Ericsson 6 Expires: May 1, 2021 R. 
Scheffenegger 7 NetApp 8 October 28, 2020 10 More Accurate ECN Feedback in TCP 11 draft-ietf-tcpm-accurate-ecn-12 13 Abstract 15 Explicit Congestion Notification (ECN) is a mechanism where network 16 nodes can mark IP packets instead of dropping them to indicate 17 incipient congestion to the end-points. Receivers with an ECN- 18 capable transport protocol feed back this information to the sender. 19 ECN is specified for TCP in such a way that only one feedback signal 20 can be transmitted per Round-Trip Time (RTT). Recent new TCP 21 mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP) 22 or Low Latency Low Loss Scalable Throughput (L4S) need more accurate 23 ECN feedback information whenever more than one marking is received 24 in one RTT. This document specifies a scheme to provide more than 25 one feedback signal per RTT in the TCP header. Given TCP header 26 space is scarce, it allocates a reserved header bit, that was 27 previously used for the ECN-Nonce which has now been declared 28 historic. It also overloads the two existing ECN flags in the TCP 29 header. The resulting extra space is exploited to feed back the IP- 30 ECN field received during the 3-way handshake as well. Supplementary 31 feedback information can optionally be provided in a new TCP option, 32 which is never used on the TCP SYN. 34 Status of This Memo 36 This Internet-Draft is submitted in full conformance with the 37 provisions of BCP 78 and BCP 79. 39 Internet-Drafts are working documents of the Internet Engineering 40 Task Force (IETF). Note that other groups may also distribute 41 working documents as Internet-Drafts. The list of current Internet- 42 Drafts is at https://datatracker.ietf.org/drafts/current/. 44 Internet-Drafts are draft documents valid for a maximum of six months 45 and may be updated, replaced, or obsoleted by other documents at any 46 time. 
It is inappropriate to use Internet-Drafts as reference 47 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on May 1, 2021. 50 Copyright Notice 52 Copyright (c) 2020 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (https://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 68 1.1. Document Roadmap . . . . . . . . . . . . . . . . . . . . 5 69 1.2. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 5 70 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 71 1.4. Recap of Existing ECN feedback in IP/TCP . . . . . . . . 6 72 2. AccECN Protocol Overview and Rationale . . . . . . . . . . . 7 73 2.1. Capability Negotiation . . . . . . . . . . . . . . . . . 8 74 2.2. Feedback Mechanism . . . . . . . . . . . . . . . . . . . 9 75 2.3. Delayed ACKs and Resilience Against ACK Loss . . . . . . 9 76 2.4. Feedback Metrics . . . . . . . . . . . . . . . . . . . . 10 77 2.5. Generic (Dumb) Reflector . . . . . . . . . . . . . . . . 10 78 3. AccECN Protocol Specification . . . . . . . . . . . . . . . . 11 79 3.1. Negotiating to use AccECN . . . . . . . . . . . . . . . . 11 80 3.1.1. Negotiation during the TCP handshake . . . . . . . . 11 81 3.1.2. Backward Compatibility . . . . . . . . . . . . . . . 12 82 3.1.3. Forward Compatibility . . . . . . . . . . . . . . . . 15 83 3.1.4. 
Retransmission of the SYN . . . . . . . . . . . . . . 15 84 3.1.5. Implications of AccECN Mode . . . . . . . . . . . . . 16 85 3.2. AccECN Feedback . . . . . . . . . . . . . . . . . . . . . 17 86 3.2.1. Initialization of Feedback Counters . . . . . . . . . 18 87 3.2.2. The ACE Field . . . . . . . . . . . . . . . . . . . . 19 88 3.2.3. The AccECN Option . . . . . . . . . . . . . . . . . . 26 89 3.3. Requirements for TCP Proxies, Offload Engines and other 90 Middleboxes on AccECN Compliance . . . . . . . . . . . . 35 91 4. Updates to RFC 3168 . . . . . . . . . . . . . . . . . . . . . 36 92 5. Interaction with TCP Variants . . . . . . . . . . . . . . . . 37 93 5.1. Compatibility with SYN Cookies . . . . . . . . . . . . . 37 94 5.2. Compatibility with TCP Experiments and Common TCP Options 38 95 5.3. Compatibility with Feedback Integrity Mechanisms . . . . 39 97 6. Protocol Properties . . . . . . . . . . . . . . . . . . . . . 40 98 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 42 99 8. Security Considerations . . . . . . . . . . . . . . . . . . . 43 100 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 44 101 10. Comments Solicited . . . . . . . . . . . . . . . . . . . . . 44 102 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 44 103 11.1. Normative References . . . . . . . . . . . . . . . . . . 44 104 11.2. Informative References . . . . . . . . . . . . . . . . . 45 105 Appendix A. Example Algorithms . . . . . . . . . . . . . . . . . 48 106 A.1. Example Algorithm to Encode/Decode the AccECN Option . . 48 107 A.2. Example Algorithm for Safety Against Long Sequences of 108 ACK Loss . . . . . . . . . . . . . . . . . . . . . . . . 49 109 A.2.1. Safety Algorithm without the AccECN Option . . . . . 49 110 A.2.2. Safety Algorithm with the AccECN Option . . . . . . . 51 111 A.3. Example Algorithm to Estimate Marked Bytes from Marked 112 Packets . . . . . . . . . . . . . . . . . . . . . . . . . 53 113 A.4. 
Example Algorithm to Beacon AccECN Options . . . . . . . 53
114  A.5. Example Algorithm to Count Not-ECT Bytes . . . . . . . . 54
115  Appendix B. Rationale for Usage of TCP Header Flags . . . . . . 55
116  B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake . . . 55
117  B.2. Four Codepoints in the SYN/ACK . . . . . . . . . . . . . 56
118  B.3. Space for Future Evolution . . . . . . . . . . . . . . . 56
119  Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 58

121  1. Introduction

123  Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
124  network nodes can mark IP packets instead of dropping them to
125  indicate incipient congestion to the end-points. Receivers with an
126  ECN-capable transport protocol feed back this information to the
127  sender. In RFC 3168, ECN was specified for TCP in such a way that
128  only one feedback signal could be transmitted per Round-Trip Time
129  (RTT). Recently, proposed mechanisms like Congestion Exposure (ConEx
130  [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
131  know when more than one marking is received in one RTT which is
132  information that cannot be provided by the feedback scheme as
133  specified in [RFC3168]. This document specifies an update to the ECN
134  feedback scheme of RFC 3168 that provides more accurate information
135  and could be used by these and potentially other future TCP
136  extensions. A fuller treatment of the motivation for this
137  specification is given in the associated requirements document
138  [RFC7560].

140  This document specifies a standards track scheme for ECN feedback in
141  the TCP header to provide more than one feedback signal per RTT. It
142  will be called the more accurate ECN feedback scheme, or AccECN for
143  short. This document updates RFC 3168 with respect to negotiation
144  and use of the feedback scheme for TCP.
All aspects of RFC 3168 145 other than the TCP feedback scheme, in particular the definition of 146 ECN at the IP layer, remain unchanged by this specification. 147 Section 4 gives a more detailed specification of exactly which 148 aspects of RFC 3168 this document updates. 150 AccECN is intended to be a complete replacement for classic TCP/ECN 151 feedback, not a fork in the design of TCP. AccECN feedback 152 complements TCP's loss feedback and it can coexist alongside 153 'classic' [RFC3168] TCP/ECN feedback. So its applicability is 154 intended to include all public and private IP networks (and even any 155 non-IP networks over which TCP is used today), whether or not any 156 nodes on the path support ECN, of whatever flavour. This document 157 uses the term Classic ECN when it needs to distinguish the RFC 3168 158 ECN TCP feedback scheme from the AccECN TCP feedback scheme. 160 AccECN feedback overloads the two existing ECN flags in the TCP 161 header and allocates the currently reserved flag (previously called 162 NS) in the TCP header, to be used as one three-bit counter field 163 indicating the number of congestion experienced marked packets. 164 Given the new definitions of these three bits, both ends have to 165 support the new wire protocol before it can be used. Therefore 166 during the TCP handshake the two ends use these three bits in the TCP 167 header to negotiate the most advanced feedback protocol that they can 168 both support, in a way that is backward compatible with [RFC3168]. 170 AccECN is solely a change to the TCP wire protocol; it covers the 171 negotiation and signaling of more accurate ECN feedback from a TCP 172 Data Receiver to a Data Sender. It is completely independent of how 173 TCP might respond to congestion feedback, which is out of scope, but 174 ultimately the motivation for accurate ECN feedback. 
Like Classic 175 ECN feedback, AccECN can be used by standard Reno congestion control 176 [RFC5681] to respond to the existence of at least one congestion 177 notification within a round trip. Or, unlike Reno, AccECN can be 178 used to respond to the extent of congestion notification over a round 179 trip, as for example DCTCP does in controlled environments [RFC8257]. 180 For congestion response, this specification refers to RFC 3168, or 181 ECN experiments such as those referred to in [RFC8311], namely: a 182 TCP-based Low Latency Low Loss Scalable (L4S) congestion control 183 [I-D.ietf-tsvwg-l4s-arch]; or Alternative Backoff with ECN (ABE) 184 [RFC8511]. 186 It is recommended that the AccECN protocol is implemented alongside 187 SACK [RFC2018] and the experimental ECN++ protocol 188 [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to 189 be used on TCP control packets. Therefore, this specification does 190 not discuss implementing AccECN alongside [RFC5562], which was an 191 earlier experimental protocol with narrower scope than ECN++. 193 1.1. Document Roadmap 195 The following introductory section outlines the goals of AccECN 196 (Section 1.2). Then terminology is defined (Section 1.3) and a recap 197 of existing prerequisite technology is given (Section 1.4). 199 Section 2 gives an informative overview of the AccECN protocol. Then 200 Section 3 gives the normative protocol specification, and Section 4 201 clarifies which aspects of RFC 3168 are updated by this 202 specification. Section 5 assesses the interaction of AccECN with 203 commonly used variants of TCP, whether standardized or not. 204 Section 6 summarizes the features and properties of AccECN. 206 Section 7 summarizes the protocol fields and numbers that IANA will 207 need to assign and Section 8 points to the aspects of the protocol 208 that will be of interest to the security community. 
210 Appendix A gives pseudocode examples for the various algorithms that 211 AccECN uses and Appendix B explains why AccECN uses flags in the main 212 TCP header and quantifies the space left for future use. 214 1.2. Goals 216 [RFC7560] enumerates requirements that a candidate feedback scheme 217 will need to satisfy, under the headings: resilience, timeliness, 218 integrity, accuracy (including ordering and lack of bias), 219 complexity, overhead and compatibility (both backward and forward). 220 It recognizes that a perfect scheme that fully satisfies all the 221 requirements is unlikely and trade-offs between requirements are 222 likely. Section 6 presents the properties of AccECN against these 223 requirements and discusses the trade-offs made. 225 The requirements document recognizes that a protocol as ubiquitous as 226 TCP needs to be able to serve as-yet-unspecified requirements. 227 Therefore an AccECN receiver aims to act as a generic (dumb) 228 reflector of congestion information so that in future new sender 229 behaviours can be deployed unilaterally. 231 1.3. Terminology 233 AccECN: The more accurate ECN feedback scheme will be called AccECN 234 for short. 236 Classic ECN: the ECN protocol specified in [RFC3168]. 238 Classic ECN feedback: the feedback aspect of the ECN protocol 239 specified in [RFC3168], including generation, encoding, 240 transmission and decoding of feedback, but not the Data Sender's 241 subsequent response to that feedback. 243 ACK: A TCP acknowledgement, with or without a data payload (ACK=1). 245 Pure ACK: A TCP acknowledgement without a data payload. 247 Acceptable packet / segment: A packet or segment that passes the 248 acceptability tests in [RFC0793] and [RFC5961]. 250 TCP client: The TCP stack that originates a connection. 252 TCP server: The TCP stack that responds to a connection request. 254 Data Receiver: The endpoint of a TCP half-connection that receives 255 data and sends AccECN feedback. 
257  Data Sender: The endpoint of a TCP half-connection that sends data
258  and receives AccECN feedback.

260  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
261  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
262  document are to be interpreted as described in BCP 14 [RFC2119]
263  [RFC8174] when, and only when, they appear in all capitals, as shown
264  here.

266  1.4. Recap of Existing ECN feedback in IP/TCP

268  ECN [RFC3168] uses two bits in the IP header. Once ECN has been
269  negotiated with the receiver at the transport layer, an ECN sender
270  can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
271  to indicate an ECN-capable transport (ECT). If both ECN bits are
272  zero, the packet is considered to have been sent by a Not-ECN-capable
273  Transport (Not-ECT). When a network node experiences congestion, it
274  will occasionally either drop or mark a packet, with the choice
275  depending on the packet's ECN codepoint. If the codepoint is Not-ECT,
276  only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
277  the node can mark the packet by setting both ECN bits, which is
278  termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
279  Table 1 summarises these codepoints.

281  +------------------+----------------+---------------------------+
282  | IP-ECN codepoint | Codepoint name | Description               |
283  +------------------+----------------+---------------------------+
284  | 0b00             | Not-ECT        | Not ECN-Capable Transport |
285  | 0b01             | ECT(1)         | ECN-Capable Transport (1) |
286  | 0b10             | ECT(0)         | ECN-Capable Transport (0) |
287  | 0b11             | CE             | Congestion Experienced    |
288  +------------------+----------------+---------------------------+

290  Table 1: The ECN Field in the IP Header

292  In the TCP header the first two bits in byte 14 are defined as flags
293  for the use of ECN (CWR and ECE in Figure 1 [RFC3168]).
A TCP client
294  indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
295  ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
296  the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
297  Data Receiver starts to set the Echo Congestion Experienced (ECE)
298  flag continuously in the TCP header of ACKs, which ensures the signal
299  is received reliably even if ACKs are lost. The TCP sender confirms
300  that it has received at least one ECE signal by responding with the
301  congestion window reduced (CWR) flag, which allows the TCP receiver
302  to stop repeating the ECN-Echo flag. This always leads to a full RTT
303  of ACKs with ECE set. Thus any additional CE markings arriving
304  within this RTT cannot be fed back.

306  The last bit in byte 13 of the TCP header was defined as the Nonce
307  Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
308  deployment RFC 3540 has been reclassified as historic [RFC8311] and
309  the respective flag has been marked as "reserved", making this TCP
310  flag available for use by the AccECN experiment instead.

312    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
313  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
314  |               |           | N | C | E | U | A | P | R | S | F |
315  | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
316  |               |           |   | R | E | G | K | H | T | N | N |
317  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

319  Figure 1: The (post-ECN Nonce) definition of the TCP header flags

321  2. AccECN Protocol Overview and Rationale

323  This section provides an informative overview of the AccECN protocol
324  that will be normatively specified in Section 3.

326  Like the original TCP approach, the Data Receiver of each TCP
327  half-connection sends AccECN feedback to the Data Sender on TCP
328  acknowledgements, reusing data packets of the other half-connection
329  whenever possible.
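
For contrast with the AccECN design outlined in this section, the Classic ECN feedback loop recapped in Section 1.4 can be modelled in a few lines. The following Python sketch is illustrative only and is not part of the specification: the class and function names are hypothetical, and it assumes that a CWR flag on an arriving packet is processed before that packet's own IP-ECN marking. Under those assumptions it shows why one CE mark and three CE marks within the same round trip produce identical feedback.

```python
# IP-ECN codepoints from Table 1.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


class ClassicEcnReceiver:
    """Toy model of RFC 3168 receiver feedback (hypothetical names)."""

    def __init__(self):
        self.ece_latched = False

    def on_packet(self, ip_ecn, cwr=False):
        """Handle one arriving packet; return the ECE flag for its ACK."""
        if cwr:
            # Assumption: CWR is handled first, so it stops the echo.
            self.ece_latched = False
        if ip_ecn == CE:
            # A CE mark (re)starts the continuous ECE echo.
            self.ece_latched = True
        return self.ece_latched


def ece_sequence(packets):
    """Return the ECE flags on the ACKs for a list of (ip_ecn, cwr) pairs."""
    receiver = ClassicEcnReceiver()
    return [receiver.on_packet(ip_ecn, cwr) for ip_ecn, cwr in packets]


# Five packets each; the sender's CWR arrives on the fifth packet.
one_mark = [(CE, False)] + [(ECT0, False)] * 3 + [(ECT0, True)]
three_marks = [(CE, False)] * 3 + [(ECT0, False)] + [(ECT0, True)]
```

Calling ece_sequence() on one_mark and on three_marks yields the same list of ECE flags, so a Data Sender relying on this feedback alone cannot tell the two cases apart.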
331 The AccECN protocol has had to be designed in two parts: 333 o an essential part that re-uses ECN TCP header bits to feed back 334 the number of arriving CE marked packets. This provides more 335 accuracy than classic ECN feedback, but limited resilience against 336 ACK loss; 338 o a supplementary part using a new AccECN TCP Option that provides 339 additional feedback on the number of bytes that arrive marked with 340 each of the three ECN codepoints (not just CE marks). This 341 provides greater resilience against ACK loss than the essential 342 feedback, but it is more likely to suffer from middlebox 343 interference. 345 The two part design was necessary, given limitations on the space 346 available for TCP options and given the possibility that certain 347 incorrectly designed middleboxes prevent TCP using any new options. 349 The essential part overloads the previous definition of the three 350 flags in the TCP header that had been assigned for use by ECN. This 351 design choice deliberately replaces the classic ECN feedback 352 protocol, rather than leaving classic ECN feedback intact and adding 353 more accurate feedback separately because: 355 o this efficiently reuses scarce TCP header space, given TCP option 356 space is approaching saturation; 358 o a single upgrade path for the TCP protocol is preferable to a fork 359 in the design; 361 o otherwise classic and accurate ECN feedback could give conflicting 362 feedback on the same segment, which could open up new security 363 concerns and make implementations unnecessarily complex; 365 o middleboxes are more likely to faithfully forward the TCP ECN 366 flags than newly defined areas of the TCP header. 368 AccECN is designed to work even if the supplementary part is removed 369 or zeroed out, as long as the essential part gets through. 371 2.1. 
Capability Negotiation 373 AccECN is a change to the wire protocol of the main TCP header, 374 therefore it can only be used if both endpoints have been upgraded to 375 understand it. The TCP client signals support for AccECN on the 376 initial SYN of a connection and the TCP server signals whether it 377 supports AccECN on the SYN/ACK. The TCP flags on the SYN that the 378 client uses to signal AccECN support have been carefully chosen so 379 that a TCP server will interpret them as a request to support the 380 most recent variant of ECN feedback that it supports. Then the 381 client falls back to the same variant of ECN feedback. 383 An AccECN TCP client does not send the new AccECN Option on the SYN 384 as SYN option space is limited. The TCP server sends the AccECN 385 Option on the SYN/ACK and the client sends it on the first ACK to 386 test whether the network path forwards the option correctly. 388 2.2. Feedback Mechanism 390 A Data Receiver maintains four counters initialized at the start of 391 the half-connection. Three count the number of arriving payload 392 bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts 393 the number of packets arriving marked with a CE codepoint (including 394 control packets without payload if they are CE-marked). 396 The Data Sender maintains four equivalent counters for the half 397 connection, and the AccECN protocol is designed to ensure they will 398 match the values in the Data Receiver's counters, albeit after a 399 little delay. 401 Each ACK carries the three least significant bits (LSBs) of the 402 packet-based CE counter using the ECN bits in the TCP header, now 403 renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24 404 LSBs of each byte counter are carried in the AccECN Option. 406 2.3. Delayed ACKs and Resilience Against ACK Loss 408 With both the ACE and the AccECN Option mechanisms, the Data Receiver 409 continually repeats the current LSBs of each of its respective 410 counters. 
There is no need to acknowledge these continually repeated 411 counters, so the congestion window reduced (CWR) mechanism is no 412 longer used. Even if some ACKs are lost, the Data Sender should be 413 able to infer how much to increment its own counters, even if the 414 protocol field has wrapped. 416 The 3-bit ACE field can wrap fairly frequently. Therefore, even if 417 it appears to have incremented by one (say), the field might have 418 actually cycled completely then incremented by one. The Data 419 Receiver is not allowed to delay sending an ACK to such an extent 420 that the ACE field would cycle. However cycling is still a 421 possibility at the Data Sender because a whole sequence of ACKs 422 carrying intervening values of the field might all be lost or delayed 423 in transit. 425 The fields in the AccECN Option are larger, but they will increment 426 in larger steps because they count bytes not packets. Nonetheless, 427 their size has been chosen such that a whole cycle of the field would 428 never occur between ACKs unless there had been an infeasibly long 429 sequence of ACK losses. Therefore, as long as the AccECN Option is 430 available, it can be treated as a dependable feedback channel. 432 If the AccECN Option is not available, e.g. it is being stripped by a 433 middlebox, the AccECN protocol will only feed back information on CE 434 markings (using the ACE field). Although not ideal, this will be 435 sufficient, because it is envisaged that neither ECT(0) nor ECT(1) 436 will ever indicate more severe congestion than CE, even though future 437 uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the 438 3-bit ACE field is so small, when it is the only field available the 439 Data Sender has to interpret it assuming the most likely wrap, but 440 with a degree of conservatism. 442 Certain specified events trigger the Data Receiver to include an 443 AccECN Option on an ACK. 
The rules are designed to ensure that the 444 order in which different markings arrive at the receiver is 445 communicated to the sender (as long as options are reaching the 446 sender and as long as there is no ACK loss). Implementations are 447 encouraged to send an AccECN Option more frequently, but this is left 448 up to the implementer. 450 2.4. Feedback Metrics 452 The CE packet counter in the ACE field and the CE byte counter in the 453 AccECN Option both provide feedback on received CE-marks. The CE 454 packet counter includes control packets that do not have payload 455 data, while the CE byte counter solely includes marked payload bytes. 456 If both are present, the byte counter in the option will provide the 457 more accurate information needed for modern congestion control and 458 policing schemes, such as L4S, DCTCP or ConEx. If the option is 459 stripped, a simple algorithm to estimate the number of marked bytes 460 from the ACE field is given in Appendix A.3. 462 Feedback in bytes is recommended in order to protect against the 463 receiver using attacks similar to 'ACK-Division' to artificially 464 inflate the congestion window, which is why [RFC5681] now recommends 465 that TCP counts acknowledged bytes not packets. 467 2.5. Generic (Dumb) Reflector 469 The ACE field provides information about CE markings on both data and 470 control packets. According to [RFC3168] the Data Sender is meant to 471 set control packets to Not-ECT. However, mechanisms in certain 472 private networks (e.g. data centres) set control packets to be ECN 473 capable because they are precisely the packets that performance 474 depends on most. 476 For this reason, AccECN is designed to be a generic reflector of 477 whatever ECN markings it sees, whether or not they are compliant with 478 a current standard. Then as standards evolve, Data Senders can 479 upgrade unilaterally without any need for receivers to upgrade too. 
480 It is also useful to be able to rely on generic reflection behaviour 481 when senders need to test for unexpected interference with markings 482 (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of 483 the present document and para 2 of Section 20.2 of [RFC3168]). 485 The initial SYN is the most critical control packet, so AccECN 486 provides feedback on its ECN marking. Although RFC 3168 prohibits an 487 ECN-capable SYN, providing feedback of ECN marking on the SYN 488 supports future scenarios in which SYNs might be ECN-enabled (without 489 prejudging whether they ought to be). For instance, [RFC8311] 490 updates this aspect of RFC 3168 to allow experimentation with ECN- 491 capable TCP control packets. 493 Even if the TCP client (or server) has set the SYN (or SYN/ACK) to 494 not-ECT in compliance with RFC 3168, feedback on the state of the ECN 495 field when it arrives at the receiver could still be useful, because 496 middleboxes have been known to overwrite the ECN IP field as if it is 497 still part of the old Type of Service (ToS) field [Mandalari18]. If 498 a TCP client has set the SYN to Not-ECT, but receives feedback that 499 the ECN field on the SYN arrived with a different codepoint, it can 500 detect such middlebox interference and send Not-ECT for the rest of 501 the connection. Today, if a TCP server receives ECT or CE on a SYN, 502 it cannot know whether it is invalid (or valid) because only the TCP 503 client knows whether it originally marked the SYN as Not-ECT (or 504 ECT). Therefore, prior to AccECN, the server's only safe course of 505 action was to disable ECN for the connection. Instead, the AccECN 506 protocol allows the server to feed back the received ECN field to the 507 client, which then has all the information to decide whether the 508 connection has to fall-back from supporting ECN (or not). 510 3. AccECN Protocol Specification 512 3.1. Negotiating to use AccECN 514 3.1.1. 
Negotiation during the TCP handshake

516  Given the ECN Nonce [RFC3540] has been reclassified as historic
517  [RFC8311], the present specification re-allocates the TCP flag at bit
518  7 of the TCP header, which was previously called NS (Nonce Sum), as
519  the AE (Accurate ECN) flag (see IANA Considerations in Section 7) as
520  shown below.

522    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
523  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
524  |               |           | A | C | E | U | A | P | R | S | F |
525  | Header Length | Reserved  | E | W | C | R | C | S | S | Y | I |
526  |               |           |   | R | E | G | K | H | T | N | N |
527  +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

529  Figure 2: The (post-AccECN) definition of the TCP header flags during
530  the TCP handshake

532  During the TCP handshake at the start of a connection, to request
533  more accurate ECN feedback the TCP client (host A) MUST set the TCP
534  flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

536  If a TCP server (B) that is AccECN-enabled receives a SYN with the
537  above three flags set, it MUST set both its half connections into
538  AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
539  the 4 values shown in the top block of Table 2 to confirm that it
540  supports AccECN. The TCP server MUST NOT set one of these 4
541  combinations of flags on the SYN/ACK unless the preceding SYN
542  requested support for AccECN as above.

544  A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
545  the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
546  that arrived on the SYN. This applies whether or not the server
547  itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
548  Section 2.5 for rationale).
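
The server-side rule above amounts to a lookup from the IP-ECN field seen on the SYN to the three flags on the SYN/ACK. The following Python sketch is one possible way to express it and is not normative: the function and constant names are hypothetical, the flag values are copied from the top block of Table 2, and every case other than an AccECN-requesting SYN is deliberately left out.

```python
# IP-ECN codepoints from Table 1.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

# (AE, CWR, ECE) that an AccECN client sets on the initial SYN.
ACCECN_SYN_FLAGS = (1, 1, 1)

# Top block of Table 2: IP-ECN field that arrived on the SYN
# -> (AE, CWR, ECE) to set on the SYN/ACK.
SYNACK_FLAGS_BY_SYN_ECN = {
    NOT_ECT: (0, 1, 0),
    ECT1: (0, 1, 1),
    ECT0: (1, 0, 0),
    CE: (1, 1, 0),
}


def accecn_synack_flags(syn_flags, syn_ip_ecn):
    """Return (AE, CWR, ECE) for the SYN/ACK of an AccECN server.

    Returns None when the SYN did not request AccECN; the fall-back
    rows of Table 2 are out of scope for this sketch.
    """
    if syn_flags != ACCECN_SYN_FLAGS:
        return None
    # Mask to two bits so that only the IP-ECN field is used as the key.
    return SYNACK_FLAGS_BY_SYN_ECN[syn_ip_ecn & 0b11]
```

For example, accecn_synack_flags((1, 1, 1), CE) gives (1, 1, 0), which tells the client that its SYN arrived CE-marked. Returning None for any other SYN reflects the requirement that the four AccECN combinations are never sent unless the SYN asked for AccECN.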
   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   Once in AccECN mode, a TCP client or server has the rights and
   obligations to participate in the ECN protocol defined in
   Section 3.1.5.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.4.

3.1.2.  Backward Compatibility

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN, as above.  Table 2 tabulates all the
   negotiation possibilities for ECN-related capabilities that involve
   at least one AccECN-capable host.  The entries in the first two
   columns have been abbreviated, as follows:

   AccECN:  More Accurate ECN Feedback (the present specification)

   Nonce:  ECN Nonce feedback [RFC3540]

   ECN:  'Classic' ECN feedback [RFC3168]

   No ECN:  Not-ECN-capable.  Implicit congestion notification using
      packet drop.

   +--------+--------+------------+-----------+------------------------+
   | A      | B      |  SYN A->B  |  SYN/ACK  | Feedback Mode          |
   |        |        |            |    B->A   |                        |
   +--------+--------+------------+-----------+------------------------+
   |        |        | AE CWR ECE |   AE CWR  |                        |
   |        |        |            |    ECE    |                        |
   | AccECN | AccECN | 1   1   1  |  0  1  0  | AccECN (no ECT on SYN) |
   | AccECN | AccECN | 1   1   1  |  0  1  1  | AccECN (ECT1 on SYN)   |
   | AccECN | AccECN | 1   1   1  |  1  0  0  | AccECN (ECT0 on SYN)   |
   | AccECN | AccECN | 1   1   1  |  1  1  0  | AccECN (CE on SYN)     |
   |        |        |            |           |                        |
   | AccECN | Nonce  | 1   1   1  |  1  0  1  | (Reserved)             |
   | AccECN | ECN    | 1   1   1  |  0  0  1  | classic ECN            |
   | AccECN | No ECN | 1   1   1  |  0  0  0  | Not ECN                |
   |        |        |            |           |                        |
   | Nonce  | AccECN | 0   1   1  |  0  0  1  | classic ECN            |
   | ECN    | AccECN | 0   1   1  |  0  0  1  | classic ECN            |
   | No ECN | AccECN | 0   0   0  |  0  0  0  | Not ECN                |
   |        |        |            |           |                        |
   | AccECN | Broken | 1   1   1  |  1  1  1  | Not ECN                |
   +--------+--------+------------+-----------+------------------------+

   Table 2: ECN capability negotiation between Client (A) and Server (B)

   Table 2 is divided into blocks each separated by an empty row.

   1.  The top block shows the case already described in Section 3.1
       where both endpoints support AccECN and how the TCP server (B)
       indicates congestion feedback.

   2.  The second block shows the cases where the TCP client (A)
       supports AccECN but the TCP server (B) supports some earlier
       variant of TCP feedback, indicated in its SYN/ACK.  Therefore, as
       soon as an AccECN-capable TCP client (A) receives the SYN/ACK
       shown it MUST set both its half connections into the feedback
       mode shown in the rightmost column.  If it has set itself into
       classic ECN feedback mode it MUST then comply with [RFC3168].

       The server response called 'Nonce' in the table is now historic.
       For an AccECN implementation, there is no need to recognize or
       support ECN Nonce feedback [RFC3540], which has been reclassified
       as historic [RFC8311].  AccECN is compatible with alternative ECN
       feedback integrity approaches (see Section 5.3).

   3.  The third block shows the cases where the TCP server (B) supports
       AccECN but the TCP client (A) supports some earlier variant of
       TCP feedback, indicated in its SYN.

       When an AccECN-enabled TCP server (B) receives a SYN with
       AE,CWR,ECE = 0,1,1 it MUST do one of the following:

       *  set both its half connections into the classic ECN feedback
          mode and return a SYN/ACK with AE,CWR,ECE = 0,0,1 as shown.
          Then it MUST comply with [RFC3168].

       *  set both its half connections into No ECN mode and return a
          SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN
          disabled.  This latter case is unlikely to be desirable, but
          it is allowed as a possibility, e.g. for minimal TCP
          implementations.

       When an AccECN-enabled TCP server (B) receives a SYN with
       AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the
       Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0
       as shown and continue with ECN disabled.

   4.  The fourth block displays a combination labelled 'Broken'.  Some
       older TCP server implementations incorrectly set the reserved
       flags in the SYN/ACK by reflecting those in the SYN.  Such broken
       TCP servers (B) cannot support ECN, so as soon as an
       AccECN-capable TCP client (A) receives such a broken SYN/ACK it
       MUST fall back to Not ECN mode for both its half connections and
       continue with ECN disabled.

   The following additional rules do not fit the structure of the table,
   but they complement it:

   Simultaneous Open:  An originating AccECN Host (A), having sent a SYN
      with AE=1, CWR=1 and ECE=1, might receive another SYN from host B.
      Host A MUST then enter the same feedback mode as it would have
      entered had it been a responding host and received the same SYN.
      Then host A MUST send the same SYN/ACK as it would have sent had
      it been a responding host.

   In-window SYN during TIME-WAIT:  Many TCP implementations create a
      new TCP connection if they receive an in-window SYN packet during
      TIME-WAIT state.  When a TCP host enters TIME-WAIT or CLOSED
      state, it should ignore any previous state about the negotiation
      of AccECN for that connection and renegotiate the feedback mode
      according to Table 2.

3.1.3.  Forward Compatibility

   If a TCP server that implements AccECN receives a SYN with the three
   TCP header flags (AE, CWR and ECE) set to any combination other than
   000, 011 or 111, it MUST negotiate the use of AccECN as if they had
   been set to 111.  This ensures that future uses of the other
   combinations on a SYN can rely on consistent behaviour from the
   installed base of AccECN servers.

   For the avoidance of doubt, the behaviour described in the present
   specification applies whether or not the three remaining reserved TCP
   header flags are zero.

3.1.4.  Retransmission of the SYN

   If the sender of an AccECN SYN times out before receiving the
   SYN/ACK, the sender SHOULD attempt to negotiate the use of AccECN at
   least one more time by continuing to set all three TCP ECN flags on
   the first retransmitted SYN (using the usual retransmission
   time-outs).  If this first retransmission also fails to be
   acknowledged, the sender SHOULD send subsequent retransmissions of
   the SYN with the three TCP-ECN flags cleared (AE=CWR=ECE=0).  A
   retransmitted SYN MUST use the same ISN as the original SYN.

   Retrying once before fall-back adds delay in the case where a
   middlebox drops an AccECN (or ECN) SYN deliberately.
   However, current measurements imply that a drop is less likely to be
   due to middlebox interference than other intermittent causes of loss,
   e.g. congestion, wireless interference, etc.

   Implementers MAY use other fall-back strategies if they are found to
   be more effective (e.g. attempting to negotiate AccECN on the SYN
   only once or more than twice (most appropriate during high levels of
   congestion)).  However, other fall-back strategies will need to
   follow all the rules in Section 3.1.5, which concern behaviour when
   SYNs or SYN/ACKs negotiating different types of feedback have been
   sent within the same connection.

   Further, it may make sense to also remove any other new or
   experimental fields or options on the SYN in case a middlebox might
   be blocking them, although the required behaviour will depend on the
   specification of the other option(s) and any attempt to co-ordinate
   fall-back between different modules of the stack.

   Whichever fall-back strategy is used, the TCP initiator SHOULD cache
   failed connection attempts.  If it does, it SHOULD NOT give up
   attempting to negotiate AccECN on the SYN of subsequent connection
   attempts until it is clear that the blockage is persistently and
   specifically due to AccECN.  The cache should be arranged to expire
   so that the initiator will infrequently attempt to check whether the
   problem has been resolved.

   The fall-back procedure if the TCP server receives no ACK to
   acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
   Section 3.2.3.2.

3.1.5.  Implications of AccECN Mode

   Section 3.1.1 describes the only ways that a host can enter AccECN
   mode, whether as a client or as a server.
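
   The ways of entering (or not entering) AccECN mode set out in
   Section 3.1.1, Table 2 and Section 3.1.3 can be summarized in a short
   sketch.  It is illustrative only: the function names and the mode
   strings are hypothetical, and where an AccECN server is allowed a
   choice (a SYN with AE,CWR,ECE = 0,1,1) the sketch simply picks
   classic ECN.  The present text lists the 1,0,1 SYN/ACK only as
   "(Reserved)", so the sketch reports it as such rather than guessing a
   mode.

```python
# Illustrative sketch; names and mode strings are hypothetical.
# Flags are always given as a tuple (AE, CWR, ECE).

# SYN/ACK values that confirm AccECN (top block of Table 2).
ACCECN_SYNACK = {(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)}


def server_mode(syn_flags):
    """Feedback mode an AccECN-enabled server enters for a received SYN."""
    syn_flags = tuple(syn_flags)
    if syn_flags == (0, 0, 0):
        return "Not ECN"
    if syn_flags == (0, 1, 1):
        # The server may instead choose "Not ECN" (Section 3.1.2).
        return "classic ECN"
    # 1,1,1 and, for forward compatibility (Section 3.1.3), every other
    # combination are negotiated as AccECN.
    return "AccECN"


def client_mode(synack_flags):
    """Feedback mode an AccECN client enters after sending AE=CWR=ECE=1
    on its SYN and receiving the given flags on the SYN/ACK."""
    synack_flags = tuple(synack_flags)
    if synack_flags in ACCECN_SYNACK:
        return "AccECN"
    if synack_flags == (0, 0, 1):
        return "classic ECN"
    if synack_flags == (1, 0, 1):
        return "Reserved"  # shown as "(Reserved)" in Table 2
    # 0,0,0 (No ECN server) or 1,1,1 (broken server reflecting the SYN).
    return "Not ECN"
```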

   As a Data Sender, a host in AccECN mode has the rights and
   obligations concerning the use of ECN defined below, which build on
   those in [RFC3168] as updated by [RFC8311]:

   o  Using ECT:

      *  It can set an ECT codepoint in the IP header of packets to
         indicate to the network that the transport is capable and
         willing to participate in ECN for this packet.

      *  It does not have to set ECT on any packet (for instance if it
         has reason to believe such a packet would be blocked).

   o  Switching feedback negotiation (e.g. fall-back):

      *  It SHOULD NOT set ECT on any packet if it has received at least
         one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0.  A
         "valid SYN" has the same port numbers and the same ISN as the
         SYN that caused the server to enter AccECN mode.

      *  It MUST NOT send an ECN-setup SYN [RFC3168] within the same
         connection as it has sent a SYN requesting AccECN feedback.

      *  It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same
         connection as it has sent a SYN/ACK agreeing to use AccECN
         feedback.

      The above rules are necessary because, when one peer negotiates
      the feedback mode in two different types of handshake, it is not
      possible for the other peer to know for certain which handshake
      packet(s) the other end eventually receives or in which order it
      receives them.  So the two peers can end up using different
      feedback modes without knowing it.

   o  Congestion response:

      *  It is still obliged to respond appropriately to AccECN feedback
         with congestion indications on packets it had previously sent,
         as defined in Section 6.1 of [RFC3168] and updated by Sections
         2.1 and 4.1 of [RFC8311].

      *  The commitment to respond appropriately to incoming indications
         of congestion remains even if it sends a SYN packet with
         AE=CWR=ECE=0, in a later transmission within the same TCP
         connection.

      *  Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate
         it has received and responded to indications of congestion (for
         the avoidance of doubt, this does not preclude it from setting
         the bits of the ACE counter field, which includes an overloaded
         use of the same bit).

   As a Data Receiver:

   o  a host in AccECN mode MUST feed back the information in the IP-ECN
      field on incoming packets using Accurate ECN feedback, as
      specified in Section 3.2 below.

   o  if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168]
      during the same connection as it receives a SYN requesting AccECN
      feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST
      reset the connection with a RST packet.

   o  If for any reason it is not willing to provide ECN feedback on a
      particular TCP connection, to indicate this unwillingness it
      SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ACK
      packets that it sends.

   o  it MUST NOT use reception of packets with ECT set in the IP-ECN
      field as an implicit signal that the peer is ECN-capable.  Reason:
      ECT at the IP layer does not explicitly confirm the peer has the
      correct ECN feedback logic, and the packets could have been
      mangled at the IP layer.

3.2.  AccECN Feedback

   Each Data Receiver of each half connection maintains four counters,
   r.cep, r.ceb, r.e0b and r.e1b:

   o  The Data Receiver MUST increment the CE packet counter (r.cep),
      for every Acceptable packet that it receives with the CE code
      point in the IP ECN field, including CE marked control packets but
      excluding CE on SYN packets (SYN=1; ACK=0).

   o  The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte
      counters by the number of TCP payload octets in Acceptable packets
      marked respectively with the CE, ECT(0) and ECT(1) codepoint in
      their IP-ECN field, including any payload octets on control
      packets, but not including any payload octets on SYN packets
      (SYN=1; ACK=0).

   Each Data Sender of each half connection maintains four counters,
   s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent
   counters at the Data Receiver.

   A Data Receiver feeds back the CE packet counter using the Accurate
   ECN (ACE) field, as explained in Section 3.2.2.  And it feeds back
   all the byte counters using the AccECN TCP Option, as specified in
   Section 3.2.3.

   Whenever a host feeds back the value of any counter, it MUST report
   the most recent value, no matter whether it is in a pure ACK, an ACK
   with new payload data or a retransmission.  Therefore the feedback
   carried on a retransmitted packet is unlikely to be the same as the
   feedback on the original packet.

3.2.1.  Initialization of Feedback Counters

   When a host first enters AccECN mode, in its role as a Data Receiver
   it initializes its counters to r.cep = 5 and r.ceb = 0.  The initial
   values of the other two byte counters depend on the Data Receiver's
   choice of the order of fields it will use in the AccECN TCP Option
   (see Section 3.2.3).  If field order 0, it will initialize the
   remaining counters to r.e0b = 1 and r.e1b = 0.  If field order 1, it
   will initialize them to r.e0b = 0 and r.e1b = 0x800001.

   Non-zero initial values are used to support a stateless handshake
   (see Section 5.1) and to be distinct from cases where the fields are
   incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4).

   When a host enters AccECN mode, in its role as a Data Sender it
   initializes its counters to s.cep = 5 and s.ceb = 0.
   The initial values of the other two byte counters depend on the
   peer's choice of the order of fields it will use in the AccECN TCP
   Option (see Section 3.2.3).  If field order 0, it will initialize the
   remaining counters to s.e0b = 1 and s.e1b = 0.  If field order 1, it
   will initialize them to s.e0b = 0 and s.e1b = 0x800001.

3.2.2.  The ACE Field

   After AccECN has been negotiated on the SYN and SYN/ACK, both hosts
   overload the three TCP flags (AE, CWR and ECE) in the main TCP header
   as one 3-bit field.  Then the field is given a new name, ACE, as
   shown in Figure 3.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 3: Definition of the ACE field within bytes 13 and 14 of the
          TCP Header (when AccECN has been negotiated and SYN=0).

   The original definition of these three flags in the TCP header,
   including the addition of support for the ECN Nonce, is shown for
   comparison in Figure 1.  This specification does not rename these
   three TCP flags to ACE unconditionally; it merely overloads them with
   another name and definition once an AccECN connection has been
   established.

   With one exception (Section 3.2.2.1), a host with both of its half
   connections in AccECN mode MUST interpret the AE, CWR and ECE flags
   as the 3-bit ACE counter on a segment with the SYN flag cleared
   (SYN=0).  On such a packet, a Data Receiver MUST encode the three
   least significant bits of its r.cep counter into the ACE field that
   it feeds back to the Data Sender.  A host MUST NOT interpret the 3
   flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is
   0 or 1), or if AccECN negotiation is incomplete or has not succeeded.
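
   A minimal sketch of the ACE rules above follows.  It is not
   normative: the function names are hypothetical and the handshake
   encoding exception of Section 3.2.2.1 is not modelled.  The last
   function shows only the first decoding step described in
   Section 3.2.2.2 (the minimum increment, assuming the field wrapped at
   most once); the safety procedures of Section 3.2.2.5.2 are omitted.

```python
# Illustrative sketch; names are hypothetical, not from the draft.
ACE_MASK = 0b111  # the ACE field is 3 bits wide


def ace_field_defined(syn, accecn_negotiated):
    """The three flags are read as ACE only with SYN=0 and only once
    AccECN has been successfully negotiated."""
    return accecn_negotiated and not syn


def encode_ace(r_cep):
    """Data Receiver: the three least significant bits of r.cep."""
    return r_cep & ACE_MASK


def min_cep_increment(s_cep, ace):
    """Data Sender: smallest non-negative increment to s.cep that is
    consistent with the incoming ACE value (at most one wrap assumed)."""
    return (ace - (s_cep & ACE_MASK)) % 8
```

   For instance, with the initial value r.cep = 5 the receiver sends
   ACE = 0b101; once r.cep reaches 8 the field wraps round to 0b000.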

   Both parts of each of these conditions are equally important.  For
   instance, even if AccECN negotiation has been successful, the ACE
   field is not defined on any segments with SYN=1 (e.g. a
   retransmission of an unacknowledged SYN/ACK, or when both ends send
   SYN/ACKs after AccECN support has been successfully negotiated during
   a simultaneous open).

3.2.2.1.  ACE Field on the ACK of the SYN/ACK

   A TCP client (A) in AccECN mode MUST feed back which of the 4
   possible values of the IP-ECN field was on the SYN/ACK by writing it
   into the ACE field of a pure ACK with no SACK blocks using the binary
   encoding in Table 3 (which is the same as that used on the SYN/ACK in
   Table 2).  This shall be called the handshake encoding of the ACE
   field, and it is the only exception to the rule that the ACE field
   carries the 3 least significant bits of the r.cep counter on packets
   with SYN=0.

   Normally, a TCP client acknowledges a SYN/ACK with an ACK that
   satisfies the above conditions anyway (SYN=0, no data, no SACK
   blocks).  If an AccECN TCP client intends to acknowledge the SYN/ACK
   with a packet that does not satisfy these conditions (e.g. it has
   data to include on the ACK), it SHOULD first send a pure ACK that
   does satisfy these conditions (see Section 5.2), so that it can feed
   back which of the four values of the IP-ECN field arrived on the
   SYN/ACK.  A valid exception to this "SHOULD" would be where the
   implementation will only be used in an environment where mangling of
   the ECN field is unlikely.

   +---------------------+---------------------+-----------------------+
   | IP-ECN codepoint on | ACE on pure ACK of  | r.cep of client in    |
   | SYN/ACK             | SYN/ACK             | AccECN mode           |
   +---------------------+---------------------+-----------------------+
   | Not-ECT             | 0b010               | 5                     |
   | ECT(1)              | 0b011               | 5                     |
   | ECT(0)              | 0b100               | 5                     |
   | CE                  | 0b110               | 6                     |
   +---------------------+---------------------+-----------------------+

    Table 3: The encoding of the ACE field in the ACK of the SYN-ACK to
                    reflect the SYN-ACK's IP-ECN field

   When an AccECN server in SYN-RCVD state receives a pure ACK with
   SYN=0 and no SACK blocks, instead of treating the ACE field as a
   counter, it MUST infer the meaning of each possible value of the ACE
   field from Table 4, which also shows the value that an AccECN server
   MUST set s.cep to as a result.

   Given this encoding of the ACE field on the ACK of a SYN/ACK is
   exceptional, an AccECN server using large receive offload (LRO) might
   prefer to disable LRO until such an ACK has transitioned it out of
   SYN-RCVD state.

   +---------------+-----------------------------+---------------------+
   | ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in  |
   | SYN/ACK       | inferred by server          | AccECN mode         |
   +---------------+-----------------------------+---------------------+
   | 0b000         | {Notes 1, 3}                | Disable ECN         |
   | 0b001         | {Notes 2, 3}                | 5                   |
   | 0b010         | Not-ECT                     | 5                   |
   | 0b011         | ECT(1)                      | 5                   |
   | 0b100         | ECT(0)                      | 5                   |
   | 0b101         | Currently Unused {Note 2}   | 5                   |
   | 0b110         | CE                          | 6                   |
   | 0b111         | Currently Unused {Note 2}   | 5                   |
   +---------------+-----------------------------+---------------------+

        Table 4: Meaning of the ACE field on the ACK of the SYN/ACK

   {Note 1}: If the server is in AccECN mode, the value of zero raises
   suspicion of zeroing of the ACE field on the path (see
   Section 3.2.2.3).

   {Note 2}: If the server is in AccECN mode, these values are Currently
   Unused but the AccECN server's behaviour is still defined for forward
   compatibility.  Then the designer of a future protocol can know for
   certain what AccECN servers will do with these codepoints.

   {Note 3}: In the case where a server that implements AccECN is also
   using a stateless handshake (termed a SYN cookie) it will not
   remember whether it entered AccECN mode.  The values 0b000 or 0b001
   will remind it that it did not enter AccECN mode, because AccECN does
   not use them (see Section 5.1 for details).  If a stateless server
   that implements AccECN receives either of these two values in the
   ACK, its action is implementation-dependent and outside the scope of
   this spec.  It will certainly not take the action in the third column
   because, after it receives either of these values, it is not in
   AccECN mode.  I.e., it will not disable ECN (at least not just
   because ACE is 0b000) and it will not set s.cep.

3.2.2.2.  Encoding and Decoding Feedback in the ACE Field

   Whenever the Data Receiver sends an ACK with SYN=0 (with or without
   data), unless the handshake encoding in Section 3.2.2.1 applies, the
   Data Receiver MUST encode the least significant 3 bits of its r.cep
   counter into the ACE field (see Appendix A.2).

   Whenever the Data Sender receives an ACK with SYN=0 (with or without
   data), it first checks whether it has already been superseded by
   another ACK in which case it ignores the ECN feedback.  If the ACK
   has not been superseded, and if the special handshake encoding in
   Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field
   as follows (see Appendix A.2 for examples).

   o  It takes the least significant 3 bits of its local s.cep counter
      and subtracts them from the incoming ACE counter to work out the
      minimum positive increment it could apply to s.cep (assuming the
      ACE field only wrapped at most once).

   o  It then follows the safety procedures in Section 3.2.2.5.2 to
      calculate or estimate how many packets the ACK could have
      acknowledged under the prevailing conditions to determine whether
      the ACE field might have wrapped more than once.

   The encode/decode procedures during the three-way handshake are
   exceptions to the general rules given so far, so they are spelled out
   step by step below for clarity:

   o  If a TCP server in AccECN mode receives a CE mark in the IP-ECN
      field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it
      remains at its initial value of 5).

      Reason: It would be redundant for the server to include CE-marked
      SYNs in its r.cep counter, because it already reliably delivers
      feedback of any CE marking on the SYN/ACK using the encoding in
      Table 2.  This also ensures that, when the server starts using the
      ACE field, it has not unnecessarily consumed more than one initial
      value, given they can be used to negotiate variants of the AccECN
      protocol (see Appendix B.3).

   o  If a TCP client in AccECN mode receives CE feedback in the TCP
      flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its
      initial value of 5), so that it stays in step with r.cep on the
      server.  Nonetheless, the TCP client still triggers the congestion
      control actions necessary to respond to the CE feedback.

   o  If a TCP client in AccECN mode receives a CE mark in the IP-ECN
      field of a SYN/ACK, it MUST increment r.cep, but no more than once
      no matter how many CE-marked SYN/ACKs it receives (i.e.
      incremented from 5 to 6, but no further).

      Reason: Incrementing r.cep ensures the client will eventually
      deliver any CE marking to the server reliably when it starts using
      the ACE field.  Even though the client also feeds back any CE
      marking on the ACK of the SYN/ACK using the encoding in Table 3,
      this ACK is not delivered reliably, so it can be considered as a
      timely notification that is redundant but unreliable.  The client
      does not increment r.cep more than once, because the server can
      only increment s.cep once (see next bullet).  Also, this limits
      the unnecessarily consumed initial values of the ACE field to two.

   o  If a TCP server in AccECN mode and in SYN-RCVD state receives CE
      feedback in the TCP flags of a pure ACK with no SACK blocks, it
      MUST increment s.cep (from 5 to 6).  The TCP server then triggers
      the congestion control actions necessary to respond to the CE
      feedback.

      Reason: The TCP server can only increment s.cep once, because the
      first ACK it receives will cause it to transition out of SYN-RCVD
      state.  The server's congestion response would be no different
      even if it could receive feedback of more than one CE-marked
      SYN/ACK.

      Once the TCP server transitions to ESTABLISHED state, it might
      later receive other pure ACK(s) with the handshake encoding in the
      ACE field.  The conditions for this to occur are quite unusual,
      but not impossible, e.g. a SYN/ACK (or ACK of the SYN/ACK) that is
      delayed for longer than the server's retransmission timeout; or
      packet duplication by the network.  Nonetheless, once in the
      ESTABLISHED state, the server will consider the ACE field to be
      encoded as the normal ACE counter on all packets with SYN=0 (given
      it will be following the above rule in this bullet).  The server
      MAY include a test to avoid this case.

3.2.2.3.  Testing for Zeroing of the ACE Field

   Section 3.2.1 required the Data Receiver to initialize the r.cep
   counter to a non-zero value.  Therefore, in either direction the
   initial value of the ACE counter ought to be non-zero.

   If AccECN has been successfully negotiated, the Data Sender SHOULD
   check the value of the ACE counter in the first packet (with or
   without data) that arrives with SYN=0.  If the value of this ACE
   field is zero (0b000), the Data Sender disables sending ECN-capable
   packets for the remainder of the half-connection by setting the
   IP/ECN field in all subsequent packets to Not-ECT.

   Usually, the server checks the ACK of the SYN/ACK from the client,
   while the client checks the first data segment from the server.
   However, if reordering occurs, "the first packet ... that arrives"
   will not necessarily be the same as the first packet in sequence
   order.  The test has been specified loosely like this to simplify
   implementation, and because it would not have been any more precise
   to have specified the first packet in sequence order, which would not
   necessarily be the first ACE counter that the Data Receiver fed back
   anyway, given it might have been a retransmission.

   The possibility of re-ordering means that there is a small chance
   that the ACE field on the first packet to arrive is genuinely zero
   (without middlebox interference).  This would cause a host to
   unnecessarily disable ECN for a half connection.  Therefore, in
   environments where there is no evidence of the ACE field being
   zeroed, implementations can skip this test.

   Note that the Data Sender MUST NOT test whether the arriving counter
   in the initial ACE field has been initialized to a specific valid
   value - the above check solely tests whether the ACE fields have been
   incorrectly zeroed.
   This allows hosts to use different initial values as an additional
   signalling channel in future.

3.2.2.4.  Testing for Mangling of the IP/ECN Field

   The value of the ACE field on the SYN/ACK indicates the value of the
   IP/ECN field when the SYN arrived at the server.  The client can
   compare this with how it originally set the IP/ECN field on the SYN.
   If this comparison implies an unsafe transition (see below) of the
   IP/ECN field, for the remainder of the connection the client MUST NOT
   send ECN-capable packets, but it MUST continue to feed back any ECN
   markings on arriving packets.

   The value of the ACE field on the last ACK of the 3WHS indicates the
   value of the IP/ECN field when the SYN/ACK arrived at the client.
   The server can compare this with how it originally set the IP/ECN
   field on the SYN/ACK.  If this comparison implies an unsafe
   transition of the IP/ECN field, for the remainder of the connection
   the server MUST NOT send ECN-capable packets, but it MUST continue to
   feed back any ECN markings on arriving packets.

   The ACK of the SYN/ACK is not reliably delivered (nonetheless, the
   count of CE marks is still eventually delivered reliably).  If this
   ACK does not arrive, the server can continue to send ECN-capable
   packets without having tested for mangling of the IP/ECN field on the
   SYN/ACK.

   Invalid transitions of the IP/ECN field are defined in [RFC3168] and
   repeated here for convenience:

   o  the not-ECT codepoint changes;

   o  either ECT codepoint transitions to not-ECT;

   o  the CE codepoint changes.

   RFC 3168 says that a router that changes ECT to not-ECT is invalid
   but safe.  However, from a host's viewpoint, this transition is
   unsafe because it could be the result of two transitions at different
   routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe).
   This scenario could well happen where an ECN-enabled home router
   congests its upstream mobile broadband bottleneck link, then the
   ingress to the mobile network clears the ECN field [Mandalari18].

   The above fall-back behaviours are necessary in case mangling of the
   IP/ECN field is asymmetric, which is currently common over some
   mobile networks [Mandalari18].  Then one end might see no unsafe
   transition and continue sending ECN-capable packets, while the other
   end sees an unsafe transition and stops sending ECN-capable packets.

3.2.2.5.  Safety against Ambiguity of the ACE Field

   If too many CE-marked segments are acknowledged at once, or if a long
   run of ACKs is lost or thinned out, the 3-bit counter in the ACE
   field might have cycled between two ACKs arriving at the Data Sender.
   The following safety procedures minimize this ambiguity.

3.2.2.5.1.  Data Receiver Safety Procedures

   An AccECN Data Receiver:

   o  SHOULD immediately send an ACK whenever a data packet marked CE
      arrives after the previous data packet was not CE.

   o  MUST immediately send an ACK once 'n' CE marks have arrived since
      the previous ACK, where 'n' SHOULD be 2 and MUST be no greater
      than 6.

   These rules for when to send an ACK are designed to be complemented
   by those in Section 3.2.3.3, which concern whether the AccECN TCP
   Option ought to be included on ACKs.

   For the avoidance of doubt, the change-triggered ACK mechanism is
   deliberately worded to solely apply to data packets, and to ignore
   the arrival of a control packet with no payload, because it is
   important that TCP does not acknowledge pure ACKs.  The
   change-triggered ACK approach can lead to some additional ACKs but it
   feeds back the timing and the order in which ECN marks are received
   with minimal additional complexity.
   If CE marks are infrequent, or there are multiple marks in a row, the
   additional load will be low.  Other marking patterns could increase
   the load significantly.

   Even though the first bullet is stated as a "SHOULD", it is important
   for a transition to immediately trigger an ACK if at all possible, so
   that the Data Sender can rely on change-triggered ACKs to detect
   queue growth as soon as possible, e.g. at the start of a flow.  This
   requirement can only be relaxed if certain offload hardware needed
   for high performance cannot support change-triggered ACKs (although
   high performance protocols such as DCTCP already successfully use
   change-triggered ACKs).  One possible compromise would be for the
   receiver to heuristically detect whether the sender is in slow-start,
   then to implement change-triggered ACKs while the sender is in
   slow-start, and offload otherwise.

3.2.2.5.2.  Data Sender Safety Procedures

   If the Data Sender has not received AccECN TCP Options to give it
   more dependable information, and it detects that the ACE field could
   have cycled, it SHOULD deem whether it cycled by taking the safest
   likely case under the prevailing conditions.  It can detect if the
   counter could have cycled by using the jump in the acknowledgement
   number since the last ACK to calculate or estimate how many segments
   could have been acknowledged.  An example algorithm to implement this
   policy is given in Appendix A.2.  An implementer MAY develop an
   alternative algorithm as long as it satisfies these requirements.

   If missing acknowledgement numbers arrive later (reordering) and
   prove that the counter did not cycle, the Data Sender MAY attempt to
   neutralize the effect of any action it took based on a conservative
   assumption that it later found to be incorrect.

   The Data Sender can estimate how many packets (of any marking) an ACK
   acknowledges.
If the ACE counter on an ACK seems to imply that the 1213 minimum number of newly CE-marked packets is greater than the number 1214 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1215 counter, unless it can be sure that it is counting all control 1216 packets correctly. 1218 3.2.3. The AccECN Option 1220 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1221 of each field name stands for 'Echo'.

1223     0                   1                   2                   3
1224     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1225    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1226    |  Kind = TBD1  |  Length = 11  |          EE0B field           |
1227    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1228    | EE0B (cont'd) |                  ECEB field                   |
1229    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1230    |                  EE1B field                   |         Order 0
1231    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1233     0                   1                   2                   3
1234     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1235    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1236    |  Kind = TBD1  |  Length = 11  |          EE1B field           |
1237    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1238    | EE1B (cont'd) |                  ECEB field                   |
1239    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1240    |                  EE0B field                   |         Order 1
1241    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1243                 Figure 4: The AccECN TCP Option

1245 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1246 field to TBD1, which is registered in Section 7 as a new TCP option 1247 Kind called AccECN. 1249 Figure 4 shows two option field orders: order 0 and order 1. They 1250 both consist of three 24-bit fields. Order 0 provides the 24 least 1251 significant bits of the r.e0b, r.ceb and r.e1b counters, 1252 respectively. Order 1 provides the same fields, but in the opposite 1253 order.
Each half-connection can use a different field order, but a 1254 Data Receiver MUST consistently send the same field order within the 1255 same half-connection. 1257 The field order to use for each half-connection is up to the Data 1258 Receiver implementation. It might use the same hard-coded order for 1259 all half-connections, or it might make a different choice for each 1260 half-connection. For instance, the implementation of a Data Receiver 1261 might default to using order 0, unless the ECN field in the IP header 1262 of the packet it received during the 3WHS is ECT(1). A Data Receiver 1263 just starts using its chosen field order and the field immediately 1264 after the length field in the first AccECN TCP Option of a half- 1265 connection will intrinsically indicate which order it is using, 1266 because the initial counter values that it is required to use depend 1267 on its chosen field order (see Section 3.2.1). 1269 A Data Sender can know which field order the Data Receiver is using 1270 for a half-connection from the most significant bit (MSB) of the 1271 counter in the field immediately after the length field in the first 1272 non-empty AccECN TCP Option to arrive. If this MSB = 0, field order 1273 0 is being used, and if MSB = 1, field order 1 is being used. Note 1274 that the Data Sender only tests the most significant bit, not the 1275 value of the whole field, because the counters in the first packet to 1276 arrive might have started to increment (e.g. if the first packet to 1277 arrive is not the first packet sent due to loss or reordering). 1279 Note that there is no field to feed back Not-ECT bytes. Nonetheless 1280 an algorithm for the Data Sender to calculate the number of payload 1281 bytes received as Not-ECT is given in Appendix A.5. 1283 Whenever a Data Receiver sends an AccECN Option, the rules in 1284 Section 3.2.3.3 expect it to usually send a full-length option. 
To 1285 cope with option space limitations, it can omit unchanged fields from 1286 the tail of the option, as long as it preserves the order of the 1287 remaining fields and includes any field that has changed. The length 1288 field MUST indicate which fields are present as follows:

1290    +--------+------------------+------------------+
1291    | Length | Order 0          | Order 1          |
1292    +--------+------------------+------------------+
1293    | 11     | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B |
1294    | 8      | EE0B, ECEB       | EE1B, ECEB       |
1295    | 5      | EE0B             | EE1B             |
1296    | 2      | (empty)          | (empty)          |
1297    +--------+------------------+------------------+

1299 The empty option of Length=2 is provided to allow for a case where an 1300 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1301 but there is very limited space for the option. 1303 All implementations of a Data Sender that read any AccECN Option MUST 1304 be able to read in AccECN Options of any of the above lengths. For 1305 forward compatibility, if the AccECN Option is of any other length, 1306 implementations MUST use those whole 3 octet fields that fit within 1307 the length and ignore the remainder of the option. 1309 The AccECN Option has to be optional to implement, because both 1310 sender and receiver have to be able to cope without the option anyway 1311 - in cases where it does not traverse a network path. It is 1312 RECOMMENDED to implement both sending and receiving of the AccECN 1313 Option. If sending of the AccECN Option is implemented, the fall- 1314 backs described in this document will need to be implemented as well 1315 (unless solely for a controlled environment where path traversal is 1316 not considered a problem). Even if a developer does not implement 1317 sending of the AccECN Option, it is RECOMMENDED that they still 1318 implement logic to receive and understand any AccECN Options sent by 1319 remote peers.
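As an illustration only, the length table and the field-order rule of Section 3.2.3 can be sketched in Python as below. The function names are hypothetical and not part of this specification. The sketch assumes the option is passed in starting at the Kind octet, that the 24-bit fields are in network byte order, and that an option longer than 11 octets still carries at most the three defined fields; the Kind value is not checked because it is still TBD1.

```python
from typing import Dict, List

# Field names in wire order for each field order (taken from Figure 4).
FIELD_NAMES = {
    0: ("EE0B", "ECEB", "EE1B"),
    1: ("EE1B", "ECEB", "EE0B"),
}


def split_fields(option: bytes) -> List[int]:
    """Return the whole 24-bit fields carried in one AccECN Option.

    `option` starts at the Kind octet (hypothetical calling convention).
    """
    length = option[1]
    payload = option[2:length]
    # Use only whole 3-octet fields and ignore any remainder. Capping at
    # three fields is this sketch's reading of the forward-compatibility rule.
    count = min(3, len(payload) // 3)
    return [int.from_bytes(payload[3 * i:3 * i + 3], "big") for i in range(count)]


def detect_order(first_field: int) -> int:
    """Field order = MSB of the first field in the first non-empty option."""
    return (first_field >> 23) & 1


def label_fields(fields: List[int], order: int) -> Dict[str, int]:
    """Attach field names to the values, in the order used on the wire."""
    return dict(zip(FIELD_NAMES[order], fields))
```

For example, a Length=8 option in order 0 yields two labelled fields, and a Length=2 option yields none. Only the most significant bit of the first field is tested, as the text requires, so the result does not depend on how far the counters have already advanced.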
1321 If a Data Receiver intends to send the AccECN Option at any time 1322 during the rest of the connection it is strongly recommended to also 1323 test path traversal of the AccECN Option as specified in 1324 Section 3.2.3.2. 1326 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1328 Whenever the Data Receiver includes any of the counter fields (ECEB, 1329 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1330 significant bits of the current value of the associated counter into 1331 the field (respectively r.ceb, r.e0b, r.e1b). 1333 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1334 first checks whether the ACK has already been superseded by another 1335 ACK, in which case it ignores the ECN feedback. If the ACK has not 1336 been superseded, the Data Sender MUST decode the fields in the AccECN 1337 Option as follows. For each field, it takes the least significant 24 1338 bits of its associated local counter (s.ceb, s.e0b or s.e1b) and 1339 subtracts them from the counter in the associated field of the 1340 incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work out 1341 the minimum positive increment it could apply to s.ceb, s.e0b or 1342 s.e1b (assuming the field in the option only wrapped at most once). 1344 Appendix A.1 gives an example algorithm for the Data Receiver to 1345 encode its byte counters into the AccECN Option, and for the Data 1346 Sender to decode the AccECN Option fields into its byte counters. 1348 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1349 ACK=0) is not included in any of the locally held octet counters nor 1350 in the AccECN Option on the wire. 1352 3.2.3.2. Path Traversal of the AccECN Option 1354 3.2.3.2.1. Testing the AccECN Option during the Handshake 1356 The TCP client MUST NOT include the AccECN TCP Option on the SYN. (A 1357 fall-back strategy for the loss of the SYN (possibly due to middlebox 1358 interference) is specified in Section 3.1.4.)
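As an aside on the encoding and decoding rule of Section 3.2.3.1, the modular arithmetic can be sketched in Python as follows. This is a minimal illustration, not the example algorithm of Appendix A.1; the names DIVOPT, encode_field and decode_increment are chosen for this sketch, and it assumes the ACK has already been found not to be superseded.

```python
DIVOPT = 1 << 24  # each option field carries the 24 least significant bits


def encode_field(counter: int) -> int:
    """Data Receiver: value to place in an option field for a byte counter."""
    return counter % DIVOPT


def decode_increment(field: int, local_counter: int) -> int:
    """Data Sender: smallest non-negative increment to the local counter
    that is consistent with the received 24-bit field (at most one wrap)."""
    return (field - local_counter % DIVOPT) % DIVOPT


# Usage: the sender's s.ceb is just below a 24-bit boundary and the
# receiver's r.ceb has advanced by 4 bytes, wrapping the field.
s_ceb = 3 * DIVOPT + 0xFFFFFE
field = encode_field(s_ceb + 4)
s_ceb += decode_increment(field, s_ceb)
```

Working modulo 2^24 means the subtraction gives the right answer whether or not the field wrapped between two options, provided it wrapped no more than once.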
1360 A TCP server that confirms its support for AccECN (in response to an 1361 AccECN SYN from the client as described in Section 3.1) SHOULD 1362 include an AccECN TCP Option on the SYN/ACK. 1364 A TCP client that has successfully negotiated AccECN SHOULD include 1365 an AccECN Option in the first ACK at the end of the 3WHS. However, 1366 this first ACK is not delivered reliably, so the TCP client SHOULD 1367 also include an AccECN Option on the first data segment it sends (if 1368 it ever sends one). 1370 A host MAY NOT include an AccECN Option in any of these three cases 1371 if it has cached knowledge that the packet would be likely to be 1372 blocked on the path to the other host if it included an AccECN 1373 Option. 1375 3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option 1377 If after the normal TCP timeout the TCP server has not received an 1378 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1379 lost, e.g. due to congestion, or a middlebox might be blocking the 1380 AccECN Option. To expedite connection setup, the TCP server SHOULD 1381 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1382 as on the original SYN/ACK but with no AccECN Option. If this 1383 retransmission times out, to expedite connection setup, the TCP 1384 server SHOULD disable AccECN and ECN for this connection by 1385 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option. 1387 Implementers MAY use other fall-back strategies if they are found to 1388 be more effective (e.g. retrying the AccECN Option for a second time 1389 before fall-back - most appropriate during high levels of 1390 congestion). However, other fall-back strategies will need to follow 1391 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1392 SYN/ACKs negotiating different types of feedback have been sent 1393 within the same connection. 
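The default SYN/ACK fall-back sequence of Section 3.2.3.2.2 can be sketched in Python as a simple function of the attempt number. The type and function names are hypothetical, and the flag values used in the usage lines are arbitrary placeholders, not a statement of what Section 3.1 requires on any particular SYN/ACK.

```python
from typing import NamedTuple


class SynAck(NamedTuple):
    ae: int
    cwr: int
    ece: int
    accecn_option: bool


def synack_for_attempt(attempt: int, original: SynAck) -> SynAck:
    """Return the SYN/ACK to send on the given attempt (0 = first send)."""
    if attempt == 0:
        return original
    if attempt == 1:
        # First retransmission: same AE, CWR and ECE flags, no AccECN Option.
        return original._replace(accecn_option=False)
    # Later retransmissions: disable AccECN and ECN for this connection.
    return SynAck(ae=0, cwr=0, ece=0, accecn_option=False)


# Usage with placeholder flag values.
first = SynAck(ae=0, cwr=1, ece=0, accecn_option=True)
second = synack_for_attempt(1, first)
third = synack_for_attempt(2, first)
```

An implementation that uses a different strategy, as the text permits, would still need to follow the rules in Section 3.1.5.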
1395 If the TCP client detects that the first data segment it sent with 1396 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1397 on the retransmission. Again, implementers MAY use other fall-back 1398 strategies such as attempting to retransmit a second segment with the 1399 AccECN Option before fall-back, and/or caching whether the AccECN 1400 Option is blocked for subsequent connections. 1401 [I-D.ietf-tcpm-2140bis] further discusses caching of TCP parameters 1402 and status information. 1404 If a host falls back to not sending the AccECN Option, it will 1405 continue to process any incoming AccECN Options as normal. 1407 Either host MAY include the AccECN Option in a subsequent segment to 1408 retest whether the AccECN Option can traverse the path. 1410 If the TCP server receives a second SYN with a request for AccECN 1411 support, it should resend the SYN/ACK, again confirming its support 1412 for AccECN, but this time without the AccECN Option. This approach 1413 rules out any interference by middleboxes that may drop packets with 1414 unknown options, even though it is more likely that the SYN/ACK would 1415 have been lost due to congestion. The TCP server MAY try to send 1416 another packet with the AccECN Option at a later point during the 1417 connection but should monitor if that packet got lost as well, in 1418 which case it SHOULD disable the sending of the AccECN Option for 1419 this half-connection. 1421 Similarly, an AccECN end-point MAY separately memorize which data 1422 packets carried an AccECN Option and disable the sending of AccECN 1423 Options if the loss probability of those packets is significantly 1424 higher than that of all other data packets in the same connection. 1426 3.2.3.2.3. Testing for Absence of the AccECN Option 1428 If the TCP client has successfully negotiated AccECN but does not 1429 receive an AccECN Option on the SYN/ACK (e.g. 
because it has been 1430 stripped by a middlebox or not sent by the server), the client 1431 switches into a mode that assumes that the AccECN Option is not 1432 available for this half connection. 1434 Similarly, if the TCP server has successfully negotiated AccECN but 1435 does not receive an AccECN Option on the first segment that 1436 acknowledges sequence space at least covering the ISN, it switches 1437 into a mode that assumes that the AccECN Option is not available for 1438 this half connection. 1440 While a host is in this mode that assumes incoming AccECN Options are 1441 not available, it MUST adopt the conservative interpretation of the 1442 ACE field discussed in Section 3.2.2.5. However, it cannot make any 1443 assumption about support of outgoing AccECN Options on the other half 1444 connection, so it SHOULD continue to send the AccECN Option itself 1445 (unless it has established that sending the AccECN Option is causing 1446 packets to be blocked as in Section 3.2.3.2.2). 1448 If a host is in the mode that assumes incoming AccECN Options are not 1449 available, but it receives an AccECN Option at any later point during 1450 the connection, this clearly indicates that the AccECN Option is not 1451 blocked on the respective path, and the AccECN endpoint MAY switch 1452 out of the mode that assumes the AccECN Option is not available for 1453 this half connection. 1455 3.2.3.2.4. Test for Zeroing of the AccECN Option 1457 For a related test for invalid initialization of the ACE field, see 1458 Section 3.2.2.3. 1460 Section 3.2 required the Data Receiver to initialize the r.e0b 1461 counter to a non-zero value. Therefore, in either direction the 1462 initial value of the EE0B field in the AccECN Option (if one exists) 1463 ought to be non-zero. If AccECN has been negotiated: 1465 o the TCP server MAY check the initial value of the EE0B field in 1466 the first segment that acknowledges sequence space that at least 1467 covers the ISN plus 1.
If the initial value of the EE0B field is 1468 zero, the server will switch into a mode that ignores the AccECN 1469 Option for this half connection. 1471 o the TCP client MAY check the initial value of the EE0B field on 1472 the SYN/ACK. If the initial value of the EE0B field is zero, the 1473 client will switch into a mode that ignores the AccECN Option for 1474 this half connection. 1476 While a host is in the mode that ignores the AccECN Option it MUST 1477 adopt the conservative interpretation of the ACE field discussed in 1478 Section 3.2.2.5. 1480 Note that the Data Sender MUST NOT test whether the arriving byte 1481 counters in the initial AccECN Option have been initialized to 1482 specific valid values - the above checks solely test whether these 1483 fields have been incorrectly zeroed. This allows hosts to use 1484 different initial values as an additional signalling channel in 1485 future. Also note that the initial value of either field might be 1486 greater than its expected initial value, because the counters might 1487 already have been incremented. Nonetheless, the initial values of 1488 the counters have been chosen so that they cannot wrap to zero on 1489 these initial segments. 1491 3.2.3.2.5. Consistency between AccECN Feedback Fields 1493 When the AccECN Option is available it supplements but does not 1494 replace the ACE field. An endpoint using AccECN feedback MUST always 1495 consider the information provided in the ACE field whether or not the 1496 AccECN Option is also available. 1498 If the AccECN option is present, the s.cep counter might increase 1499 while the s.ceb counter does not (e.g. due to a CE-marked control 1500 packet). The sender's response to such a situation is out of scope, 1501 and needs to be dealt with in a specification that uses ECN-capable 1502 control packets. Theoretically, this situation could also occur if a 1503 middlebox mangled the AccECN Option but not the ACE field. 
However, 1504 the Data Sender has to assume that the integrity of the AccECN Option 1505 is sound, based on the above test of the well-known initial values 1506 and optionally other integrity tests (Section 5.3). 1508 If either end-point detects that the s.ceb counter has increased but 1509 the s.cep has not (and by testing ACK coverage it is certain how much 1510 the ACE field has wrapped), this invalid protocol transition has to 1511 be due to some form of feedback mangling. So, the Data Sender MUST 1512 disable sending ECN-capable packets for the remainder of the half- 1513 connection by setting the IP/ECN field in all subsequent packets to 1514 Not-ECT. 1516 3.2.3.3. Usage of the AccECN TCP Option 1518 If the Data Receiver intends to use the AccECN TCP Option to provide 1519 feedback, the following rules determine when a Data Receiver in 1520 AccECN mode sends an ACK with the AccECN TCP Option, and which fields 1521 to include: 1523 Change-Triggered ACKs: If an arriving packet increments a different 1524 byte counter to that incremented by the previous packet, the Data 1525 Receiver SHOULD immediately send an ACK with an AccECN Option, 1526 without waiting for the next delayed ACK (this is in addition to 1527 the safety recommendation in Section 3.2.2.5 against ambiguity of 1528 the ACE field). 1530 Even though this bullet is stated as a "SHOULD", it is important 1531 for a transition to immediately trigger an ACK if at all possible, 1532 as already argued when specifying change-triggered ACKs for the 1533 ACE. 1535 Continual Repetition: Otherwise, if arriving packets continue to 1536 increment the same byte counter, the Data Receiver can include an 1537 AccECN Option on most or all (delayed) ACKs, but it does not have 1538 to. 
1540 * It SHOULD include a counter that has continued to increment on 1541 the next scheduled ACK following a change-triggered ACK; 1543 * While the same counter continues to increment, it SHOULD 1544 include the counter every n ACKs as consistently as possible, 1545 where n can be chosen by the implementer; 1547 * It SHOULD always include an AccECN Option if the r.ceb counter 1548 is incrementing and it MAY include an AccECN Option if r.e0b 1549 or r.e1b is incrementing; 1551 * It SHOULD include each counter at least once for every 2^22 1552 bytes incremented to prevent overflow during continual 1553 repetition. 1555 If the smallest allowed AccECN Option would leave insufficient 1556 space for two SACK blocks on a particular ACK, the Data Receiver 1557 MUST give precedence to the SACK option (total 18 octets), because 1558 loss feedback is more critical. 1560 Necessary Option Length: It MAY exclude counter(s) that have not 1561 changed for the whole connection (but beacons still include all 1562 fields - see below). It SHOULD include counter(s) that have 1563 incremented at some time during the connection. It MUST include 1564 the counter(s) that have incremented since the previous AccECN 1565 Option and it MUST only truncate fields from the right-hand tail 1566 of the option to preserve the order of the remaining fields (see 1567 Section 3.2.3). 1569 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1570 length AccECN TCP Option on at least three ACKs per RTT, or on all 1571 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1572 example algorithm that satisfies this requirement). 1574 The above rules complement those in Section 3.2.2.5, which determine 1575 when to generate an ACK irrespective of whether an AccECN TCP Option 1576 is to be included.
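The "Necessary Option Length" rule above can be illustrated with a short Python sketch that picks the shortest option length covering every counter that has to be reported, truncating only from the right-hand tail. The function name, the `needed` parameter and the `beacon` flag are hypothetical; how a receiver decides which counters are needed and when a beacon is due is left to the rules above and to Appendix A.4.

```python
from typing import Iterable

# Counters in wire order for each field order (from Section 3.2.3).
FIELD_ORDER = {
    0: ("e0b", "ceb", "e1b"),
    1: ("e1b", "ceb", "e0b"),
}


def option_length(order: int, needed: Iterable[str], beacon: bool = False) -> int:
    """Shortest AccECN Option length that carries all `needed` counters."""
    if beacon:
        return 11  # beacons always carry all three fields
    names = FIELD_ORDER[order]
    wanted = set(needed)
    # Index of the right-most field that must be present (-1 if none).
    last = max((i for i, name in enumerate(names) if name in wanted), default=-1)
    return 2 + 3 * (last + 1)
```

For instance, if only r.ceb has changed, either field order needs Length=8, whereas a change to r.e0b needs only Length=5 in order 0 but the full Length=11 in order 1. That asymmetry is one reason a receiver might prefer one field order over the other.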
1578 The following example series of arriving IP/ECN fields illustrates 1579 when a Data Receiver will emit an ACK with an AccECN Option if it is 1580 using a delayed ACK factor of 2 segments and change-triggered ACKs: 1581 01 -> ACK, 01, 01 -> ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 1582 -> ACK. 1584 Even though the first bullet is stated as a "SHOULD", it is important for 1585 a transition to immediately trigger an ACK if at all possible, so 1586 that the Data Sender can rely on change-triggered ACKs to detect 1587 queue growth as soon as possible, e.g. at the start of a flow. This 1588 requirement can only be relaxed if certain offload hardware needed 1589 for high performance cannot support change-triggered ACKs (although 1590 high performance protocols such as DCTCP already successfully use 1591 change-triggered ACKs). One possible experimental compromise would 1592 be for the receiver to heuristically detect whether the sender is in 1593 slow-start, then to implement change-triggered ACKs while the sender 1594 is in slow-start, and offload otherwise. 1596 For the avoidance of doubt, this change-triggered ACK mechanism is 1597 deliberately worded to ignore the arrival of a control packet with no 1598 payload, which therefore does not alter any byte counters, because it 1599 is important that TCP does not acknowledge pure ACKs. The change- 1600 triggered ACK approach can lead to some additional ACKs but it feeds 1601 back the timing and the order in which ECN marks are received with 1602 minimal additional complexity. If CE marks are infrequent, or 1603 there are multiple marks in a row, the additional load will be low. 1604 Other marking patterns could increase the load significantly. 1605 Investigating the additional load is a goal of the proposed 1606 experiment.
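The example series given earlier in this section (01 -> ACK, 01, 01 -> ACK, ...) can be reproduced with a small Python simulation. This is a sketch under stated assumptions: every arriving packet carries payload, the first packet counts as a change because there is no previous packet to compare with, and the function name is hypothetical.

```python
from typing import List


def acks_emitted(ecn_fields: List[str], delack_factor: int = 2) -> List[bool]:
    """For each arriving data packet, report whether an ACK is sent.

    An ACK is sent when the IP/ECN field differs from that of the previous
    packet (change-triggered) or when `delack_factor` packets are pending.
    """
    result = []
    previous = None
    pending = 0
    for field in ecn_fields:
        pending += 1
        if field != previous or pending >= delack_factor:
            result.append(True)
            pending = 0
        else:
            result.append(False)
        previous = field
    return result


# The series from the text: 01, 01, 01, 10, 10, 01, 01, 11, 01.
series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
pattern = acks_emitted(series)
```

With these assumptions, `pattern` marks an ACK on packets 1, 3, 4, 6, 8 and 9, which matches the series in the text.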
1608 Implementation note: sending an AccECN Option each time a different 1609 counter changes and including a full-length AccECN Option on every 1610 delayed ACK will satisfy the requirements described above and might 1611 be the easiest implementation, as long as sufficient space is 1612 available in each ACK (in total and in the option space). 1614 Appendix A.3 gives an example algorithm to estimate the number of 1615 marked bytes from the ACE field alone, if the AccECN Option is not 1616 available. 1618 If a host has determined that segments with the AccECN Option always 1619 seem to be discarded somewhere along the path, it is no longer 1620 obliged to follow the above rules. 1622 3.3. Requirements for TCP Proxies, Offload Engines and other 1623 Middleboxes on AccECN Compliance 1625 A large class of middleboxes split TCP connections. Such a middlebox 1626 would be compliant with the AccECN protocol if the TCP implementation 1627 on each side complied with the present AccECN specification and each 1628 side negotiated AccECN independently of the other side. 1630 Another large class of middleboxes intervenes to some degree at the 1631 transport layer, but attempts to be transparent (invisible) to the 1632 end-to-end connection. A subset of this class of middleboxes 1633 attempts to `normalize' the TCP wire protocol by checking that all 1634 values in header fields comply with a rather narrow interpretation of 1635 the TCP specifications. To comply with the present AccECN 1636 specification, such a middlebox MUST NOT change the ACE field or the 1637 AccECN Option and it SHOULD preserve the timing of each ACK (for 1638 example, if it coalesced ACKs it would not be AccECN-compliant) as 1639 these can be used by the Data Sender to infer further information 1640 about the path congestion level. 
A middlebox claiming to be 1641 transparent at the transport layer MUST forward the AccECN TCP Option 1642 unaltered, whether or not the length value matches one of those 1643 specified in Section 3.2.3, and whether or not the initial values of 1644 the byte-counter fields are correct. This is because blocking 1645 apparently invalid values does not improve security (because AccECN 1646 hosts are required to ignore invalid values anyway), while it 1647 prevents the standardized set of values being extended in future 1648 (because outdated normalizers would block updated hosts from using 1649 the extended AccECN standard). 1651 Hardware to offload certain TCP processing represents another large 1652 class of middleboxes, even though it is often a function of a host's 1653 network interface and rarely in its own 'box'. Leeway has been 1654 allowed in the present AccECN specification in the expectation that 1655 offload hardware could comply and still serve its function. 1656 Nonetheless, such hardware SHOULD also preserve the timing of each 1657 ACK (for example, if it coalesced ACKs it would not be AccECN- 1658 compliant). 1660 The ACE field changes with every received CE marking, so today's 1661 receive offloading could lead to many interrupts in high congestion 1662 situations. Although that would be useful (because congestion 1663 information is received sooner), it could also significantly increase 1664 processor load, particularly in scenarios such as DCTCP or L4S where 1665 the marking rate is generally higher. 1667 In data centres it has been fortunate for offload hardware that 1668 DCTCP-style feedback changes less often when there are long sequences 1669 of CE marks, which is more common with a step marking threshold. In 1670 order to enable DCTCP to improve its responsiveness, DCs will need to 1671 move beyond step marking. Before this can happen, offload hardware 1672 will have to explicitly address the variability of ECN feedback. 
1674 ECN encodes a varying signal in the ACK stream, so it is inevitable 1675 that offload hardware will ultimately need to handle any form of ECN 1676 feedback exceptionally. The purpose of working towards standardized 1677 TCP ECN feedback is to reduce the risk for hardware developers, who 1678 would otherwise have to guess which scheme is likely to become 1679 dominant. 1681 4. Updates to RFC 3168 1683 Normative statements in the following sections of RFC3168 are updated 1684 by the present AccECN specification: 1686 o The whole of "6.1.1 TCP Initialization" of [RFC3168] is updated by 1687 Section 3.1 of the present specification. 1689 o In "6.1.2. The TCP Sender" of [RFC3168], all mentions of a 1690 congestion response to an ECN-Echo (ECE) ACK packet are updated by 1691 Section 3.2 of the present specification to mean an increment to 1692 the sender's count of CE-marked packets, s.cep. And the 1693 requirements to set the CWR flag no longer apply, as specified in 1694 Section 3.1.5 of the present specification. Otherwise, the 1695 remaining requirements in "6.1.2. The TCP Sender" still stand. 1697 It will be noted that RFC 8311 already updates, or potentially 1698 updates, a number of the requirements in "6.1.2. The TCP Sender". 1699 Section 6.1.2 of RFC 3168 extended standard TCP congestion control 1700 [RFC5681] to cover ECN marking as well as packet drop. Whereas, 1701 RFC 8311 enables experimentation with alternative responses to ECN 1702 marking, if specified for instance by an experimental RFC on the 1703 IETF document stream. RFC 8311 also strengthened the statement 1704 that "ECT(0) SHOULD be used" to a "MUST" (see [RFC8311] for the 1705 details). 1707 o The whole of "6.1.3. The TCP Receiver" of [RFC3168] is updated by 1708 Section 3.2 of the present specification, with the exception of 1709 the last paragraph (about congestion response to drop and ECN in 1710 the same round trip), which still stands. 
Incidentally, this last 1711 paragraph is in the wrong section, because it relates to TCP 1712 sender behaviour. 1714 o The following text within "6.1.5. Retransmitted TCP packets": 1716 "the TCP data receiver SHOULD ignore the ECN field on arriving 1717 data packets that are outside of the receiver's current 1718 window." 1720 is updated by more stringent acceptability tests for any packet 1721 (not just data packets) in the present specification. 1722 Specifically, in the normative specification of AccECN (Section 3) 1723 only 'Acceptable' packets contribute to the ECN counters at the 1724 AccECN receiver and Section 1.3 defines an Acceptable packet as 1725 one that passes the acceptability tests in both [RFC0793] and 1726 [RFC5961]. 1728 o Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of [RFC3168] prohibit 1729 use of ECN on TCP control packets and retransmissions. The 1730 present specification does not update that aspect of RFC 3168, but 1731 it does say what feedback an AccECN Data Receiver should provide 1732 if it receives an ECN-capable control packet or retransmission. 1733 This ensures AccECN is forward compatible with any future scheme 1734 that allows ECN on these packets, as provided for in section 4.3 1735 of [RFC8311] and as proposed in [I-D.ietf-tcpm-generalized-ecn]. 1737 5. Interaction with TCP Variants 1739 This section is informative, not normative. 1741 5.1. Compatibility with SYN Cookies 1743 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1744 protect itself from SYN flooding attacks. It places minimal commonly 1745 used connection state in the SYN/ACK, and deliberately does not hold 1746 any state while waiting for the subsequent ACK (e.g. it closes the 1747 thread). Therefore it cannot record the fact that it entered AccECN 1748 mode for both half-connections. Indeed, it cannot even remember 1749 whether it negotiated the use of classic ECN [RFC3168]. 
1751 Nonetheless, such a server can determine that it negotiated AccECN as 1752 follows. If a TCP server using SYN Cookies supports AccECN and if it 1753 receives a pure ACK that acknowledges an ISN that is a valid SYN 1754 cookie, and if the ACK contains an ACE field with the value 0b010 to 1755 0b111 (decimal 2 to 7), it can assume that: 1757 o the TCP client must have requested AccECN support on the SYN 1759 o it (the server) must have confirmed that it supported AccECN 1761 Therefore the server can switch itself into AccECN mode, and continue 1762 as if it had never forgotten that it switched itself into AccECN mode 1763 earlier. 1765 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1766 with the value 0b000 or 0b001, these values indicate that the client 1767 did not request support for AccECN and therefore the server does not 1768 enter AccECN mode for this connection. Further, 0b001 on the ACK 1769 implies that the server sent an ECN-capable SYN/ACK, which was marked 1770 CE in the network, and the non-AccECN client fed this back by setting 1771 ECE on the ACK of the SYN/ACK. 1773 5.2. Compatibility with TCP Experiments and Common TCP Options 1775 AccECN is compatible (at least on paper) with the most commonly used 1776 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1777 also compatible with the recent promising experimental TCP options 1778 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1779 AccECN is friendly to all these protocols, because space for TCP 1780 options is particularly scarce on the SYN, where AccECN consumes zero 1781 additional header space. 1783 When option space is under pressure from other options, 1784 Section 3.2.3.3 provides guidance on how important it is to send an 1785 AccECN Option and whether it needs to be a full-length option. 1787 Implementers of TFO need to take careful note of the recommendation 1788 in Section 3.2.2.1. 
That section recommends that, if the client has 1789 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1790 if it has data to send, it sends a pure ACK immediately before the 1791 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1792 pure ACK, which allows the server to detect ECN mangling. 1794 5.3. Compatibility with Feedback Integrity Mechanisms 1796 Three alternative mechanisms are available to assure the integrity of 1797 ECN and/or loss signals. AccECN is compatible with any of these 1798 approaches: 1800 o The Data Sender can test the integrity of the receiver's ECN (or 1801 loss) feedback by occasionally setting the IP-ECN field to a value 1802 normally only set by the network (and/or deliberately leaving a 1803 sequence number gap). Then it can test whether the Data 1804 Receiver's feedback faithfully reports what it expects (similar to 1805 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1806 [RFC3540], this approach does not waste the ECT(1) codepoint in 1807 the IP header, it does not require standardization and it does not 1808 rely on misbehaving receivers volunteering to reveal feedback 1809 information that allows them to be detected. However, setting the 1810 CE mark by the sender might conceal actual congestion feedback 1811 from the network and should therefore only be done sparingly. 1813 o Networks generate congestion signals when they are becoming 1814 congested, so networks are more likely than Data Senders to be 1815 concerned about the integrity of the receiver's feedback of these 1816 signals. A network can enforce a congestion response to its ECN 1817 markings (or packet losses) using congestion exposure (ConEx) 1818 audit [RFC7713]. Whether the receiver or a downstream network is 1819 suppressing congestion feedback or the sender is unresponsive to 1820 the feedback, or both, ConEx audit can neutralize any advantage 1821 that any of these three parties would otherwise gain. 
1823 ConEx is a change to the Data Sender that is most useful when 1824 combined with AccECN. Without AccECN, the ConEx behaviour of a 1825 Data Sender would have to be more conservative than would be 1826 necessary if it had the accurate feedback of AccECN. 1828 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1829 detect any tampering with AccECN feedback between the Data 1830 Receiver and the Data Sender (whether malicious or accidental). 1831 The AccECN fields are immutable end-to-end, so they are amenable 1832 to TCP-AO protection, which covers TCP options by default. 1833 However, TCP-AO is often too brittle to use on many end-to-end 1834 paths, where middleboxes can make verification fail in their 1835 attempts to improve performance or security, e.g. by 1836 resegmentation or shifting the sequence space. 1838 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1839 of congestion feedback. With minor changes AccECN could be optimized 1840 for the possibility that the ECT(1) codepoint might be used as an ECN 1841 Nonce. However, given RFC 3540 has been reclassified as historic, 1842 the AccECN design has been generalized so that it ought to be able to 1843 support other possible uses of the ECT(1) codepoint, such as a lower 1844 severity or a more instant congestion signal than CE. 1846 6. Protocol Properties 1848 This section is informative not normative. It describes how well the 1849 protocol satisfies the agreed requirements for a more accurate ECN 1850 feedback protocol [RFC7560]. 1852 Accuracy: From each ACK, the Data Sender can infer the number of new 1853 CE marked segments since the previous ACK. This provides better 1854 accuracy on CE feedback than classic ECN. In addition if the 1855 AccECN Option is present (not blocked by the network path) the 1856 number of bytes marked with CE, ECT(1) and ECT(0) are provided. 1858 Overhead: The AccECN scheme is divided into two parts. 
The essential part reuses the 3 flags already assigned to ECN in the TCP header. The supplementary part adds an additional TCP option consuming up to 11 bytes. However, no TCP option space is consumed in the SYN.

Ordering: The order in which marks arrive at the Data Receiver is preserved in AccECN feedback, because the Data Receiver is expected to send an ACK immediately whenever a different mark arrives.

Timeliness: While the same ECN markings are arriving continually at the Data Receiver, it can defer ACKs as TCP does normally, but it will immediately send an ACK as soon as a different ECN marking arrives.

Timeliness vs Overhead: Change-Triggered ACKs are intended to enable latency-sensitive uses of ECN feedback by capturing the timing of transitions but not wasting resources while the state of the signalling system is stable. Within the constraints of the change-triggered ACK rules, the receiver can control how frequently it sends the AccECN TCP Option and therefore to some extent it can control the overhead induced by AccECN.

Resilience: All information is provided based on counters. Therefore if ACKs are lost, the counters on the first ACK following the losses allow the Data Sender to immediately recover the number of ECN markings that it missed. And if data or ACKs are reordered, stale congestion information can be identified and ignored.

Resilience against Bias: Because feedback is based on repetition of counters, random losses do not remove any information, they only delay it. Therefore, even though some ACKs are change-triggered, random losses will not alter the proportions of the different ECN markings in the feedback.

Resilience vs Overhead: If space is limited in some segments (e.g.
1896 because more options are needed on some segments, such as the SACK 1897 option after loss), the Data Receiver can send AccECN Options less 1898 frequently or truncate fields that have not changed, usually down 1899 to as little as 5 bytes. However, it has to send a full-sized 1900 AccECN Option at least three times per RTT, which the Data Sender 1901 can rely on as a regular beacon or checkpoint. 1903 Resilience vs Timeliness and Ordering: Ordering information and the 1904 timing of transitions cannot be communicated in three cases: i) 1905 during ACK loss; ii) if something on the path strips the AccECN 1906 Option; or iii) if the Data Receiver is unable to support Change- 1907 Triggered ACKs. Following ACK reordering, the Data Sender can 1908 reconstruct the order in which feedback was sent, but not until 1909 all the missing feedback has arrived. 1911 Complexity: An AccECN implementation solely involves simple counter 1912 increments, some modulo arithmetic to communicate the least 1913 significant bits and allow for wrap, and some heuristics for 1914 safety against fields cycling due to prolonged periods of ACK 1915 loss. Each host needs to maintain eight additional counters. The 1916 hosts have to apply some additional tests to detect tampering by 1917 middleboxes, but in general the protocol is simple to understand, 1918 simple to implement and requires few cycles per packet to execute. 1920 Integrity: AccECN is compatible with at least three approaches that 1921 can assure the integrity of ECN feedback. If the AccECN Option is 1922 stripped the resolution of the feedback is degraded, but the 1923 integrity of this degraded feedback can still be assured. 1925 Backward Compatibility: If only one endpoint supports the AccECN 1926 scheme, it will fall-back to the most advanced ECN feedback scheme 1927 supported by the other end. 
1929 Backward Compatibility: If the AccECN Option is stripped by a 1930 middlebox, AccECN still provides basic congestion feedback in the 1931 ACE field. Further, AccECN can be used to detect mangling of the 1932 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 1933 marked segments; and blocking of segments carrying the AccECN 1934 Option. It can detect these conditions during TCP's 3WHS so that 1935 it can fall back to operation without ECN and/or operation without 1936 the AccECN Option. 1938 Forward Compatibility: The behaviour of endpoints and middleboxes is 1939 carefully defined for all reserved or currently unused codepoints 1940 in the scheme. Then, the designers of security devices can 1941 understand which currently unused values might appear in future. 1942 So, even if they choose to treat such values as anomalous while 1943 they are not widely used, any blocking will at least be under 1944 policy control not hard-coded. Then, if previously unused values 1945 start to appear on the Internet (or in standards), such policies 1946 could be quickly reversed. 1948 7. IANA Considerations 1950 This document reassigns bit 7 of the TCP header flags to the AccECN 1951 experiment. This bit was previously called the Nonce Sum (NS) flag 1952 [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. 
The flag will now be defined as:

      +-----+-------------------+-----------+
      | Bit | Name              | Reference |
      +-----+-------------------+-----------+
      | 7   | AE (Accurate ECN) | RFC XXXX  |
      +-----+-------------------+-----------+

[TO BE REMOVED: IANA is requested to update the existing entry in the Transmission Control Protocol (TCP) Header Flags registration (https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1) for Bit 7 to "AE (Accurate ECN), previously used as NS (Nonce Sum) by [RFC3540], which is now Historic [RFC8311]" and change the reference to this RFC-to-be instead of RFC8311.]

This document also defines a new TCP option for AccECN, assigned a value of TBD1 (decimal) from the TCP option space. This value is defined as:

      +------+--------+-----------------------+-----------+
      | Kind | Length | Meaning               | Reference |
      +------+--------+-----------------------+-----------+
      | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
      +------+--------+-----------------------+-----------+

[TO BE REMOVED: This registration should take place at the following location: http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

Early implementations using experimental option 254 per [RFC6994] with magic number 0xACCE (16 bits), as allocated in the IANA "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" registry, SHOULD migrate to use this new option kind (TBD1).

[TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs registry should be changed to "AccECN (current and new implementations SHOULD use option kind TBD1)" at the following location: https://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids ]

8.
Security Considerations

If ever the supplementary part of AccECN based on the new AccECN TCP Option is unusable (due for example to middlebox interference), the essential part of AccECN's congestion feedback offers only limited resilience to long runs of ACK loss (see Section 3.2.2.5). These problems are unlikely to be due to malicious intervention (because if an attacker could strip a TCP option or discard a long run of ACKs it could wreak other arbitrary havoc). However, it would be of concern if AccECN's resilience could be indirectly compromised during a flooding attack. AccECN is still considered safe though, because if the option is not present, the AccECN Data Sender is then required to switch to more conservative assumptions about wrap of congestion indication counters (see Section 3.2.2.5 and Appendix A.2).

Section 5.1 describes how a TCP server can negotiate AccECN and use the SYN cookie method for mitigating SYN flooding attacks.

There is concern that ECN markings could be altered or suppressed, particularly because a misbehaving Data Receiver could increase its own throughput at the expense of others. AccECN is compatible with the three schemes known to assure the integrity of ECN feedback (see Section 5.3 for details). If the AccECN Option is stripped by an incorrectly implemented middlebox, the resolution of the feedback will be degraded, but the integrity of this degraded information can still be assured.

There is a potential concern that a receiver could deliberately omit the AccECN Option, pretending that it had been stripped by a middlebox. No known way can yet be contrived to take advantage of this downgrade attack, but it is mentioned here in case someone else can contrive one.
2027 The AccECN protocol is not believed to introduce any new privacy 2028 concerns, because it merely counts and feeds back signals at the 2029 transport layer that had already been visible at the IP layer. 2031 9. Acknowledgements 2033 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2034 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2035 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans and Ilpo 2036 Jaervinen for their input and discussion. The idea of using the 2037 three ECN-related TCP flags as one field for more accurate TCP-ECN 2038 feedback was first introduced in the re-ECN protocol that was the 2039 ancestor of ConEx. 2041 Bob Briscoe was part-funded by the Comcast Innovation Fund, the 2042 European Community under its Seventh Framework Programme through the 2043 Reducing Internet Transport Latency (RITE) project (ICT-317700) and 2044 through the Trilogy 2 project (ICT-317756), and the Research Council 2045 of Norway through the TimeIn project. The views expressed here are 2046 solely those of the authors. 2048 Mirja Kuehlewind was partly supported by the European Commission 2049 under Horizon 2020 grant agreement no. 688421 Measurement and 2050 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 2051 State Secretariat for Education, Research, and Innovation under 2052 contract no. 15.0268. This support does not imply endorsement. 2054 10. Comments Solicited 2056 Comments and questions are encouraged and very welcome. They can be 2057 addressed to the IETF TCP maintenance and minor modifications working 2058 group mailing list , and/or to the authors. 2060 11. References 2062 11.1. Normative References 2064 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2065 RFC 793, DOI 10.17487/RFC0793, September 1981, 2066 . 2068 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2069 Requirement Levels", BCP 14, RFC 2119, 2070 DOI 10.17487/RFC2119, March 1997, 2071 . 
2073 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 2074 of Explicit Congestion Notification (ECN) to IP", 2075 RFC 3168, DOI 10.17487/RFC3168, September 2001, 2076 . 2078 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 2079 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 2080 . 2082 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2083 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2084 May 2017, . 2086 11.2. Informative References 2088 [I-D.ietf-tcpm-2140bis] 2089 Touch, J., Welzl, M., and S. Islam, "TCP Control Block 2090 Interdependence", draft-ietf-tcpm-2140bis-05 (work in 2091 progress), April 2020. 2093 [I-D.ietf-tcpm-generalized-ecn] 2094 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 2095 Congestion Notification (ECN) to TCP Control Packets", 2096 draft-ietf-tcpm-generalized-ecn-05 (work in progress), 2097 November 2019. 2099 [I-D.ietf-tsvwg-l4s-arch] 2100 Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low 2101 Latency, Low Loss, Scalable Throughput (L4S) Internet 2102 Service: Architecture", draft-ietf-tsvwg-l4s-arch-07 (work 2103 in progress), October 2020. 2105 [Mandalari18] 2106 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 2107 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 2108 over Mobile", IEEE Communications Magazine , March 2018. 2110 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 2111 Selective Acknowledgment Options", RFC 2018, 2112 DOI 10.17487/RFC2018, October 1996, 2113 . 2115 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 2116 Congestion Notification (ECN) Signaling with Nonces", 2117 RFC 3540, DOI 10.17487/RFC3540, June 2003, 2118 . 2120 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 2121 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 2122 . 2124 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 
2125 Ramakrishnan, "Adding Explicit Congestion Notification 2126 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 2127 DOI 10.17487/RFC5562, June 2009, 2128 . 2130 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2131 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2132 June 2010, . 2134 [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 2135 Robustness to Blind In-Window Attacks", RFC 5961, 2136 DOI 10.17487/RFC5961, August 2010, 2137 . 2139 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 2140 "TCP Extensions for Multipath Operation with Multiple 2141 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 2142 . 2144 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 2145 RFC 6994, DOI 10.17487/RFC6994, August 2013, 2146 . 2148 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 2149 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 2150 . 2152 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 2153 "Problem Statement and Requirements for Increased Accuracy 2154 in Explicit Congestion Notification (ECN) Feedback", 2155 RFC 7560, DOI 10.17487/RFC7560, August 2015, 2156 . 2158 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 2159 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 2160 DOI 10.17487/RFC7713, December 2015, 2161 . 2163 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 2164 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 2165 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 2166 October 2017, . 2168 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 2169 Notification (ECN) Experimentation", RFC 8311, 2170 DOI 10.17487/RFC8311, January 2018, 2171 . 2173 [RFC8511] Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 2174 "TCP Alternative Backoff with ECN (ABE)", RFC 8511, 2175 DOI 10.17487/RFC8511, December 2018, 2176 . 2178 Appendix A. 
Example Algorithms

This appendix is informative, not normative. It gives example algorithms that would satisfy the normative requirements of the AccECN protocol. However, implementers are free to choose other ways to implement the requirements.

A.1. Example Algorithm to Encode/Decode the AccECN Option

The example algorithms below show how a Data Receiver in AccECN mode could encode its CE byte counter r.ceb into the ECEB field within the AccECN TCP Option, and how a Data Sender in AccECN mode could decode the ECEB field into its byte counter s.ceb. The other counters for bytes marked ECT(0) and ECT(1) in the AccECN Option would be similarly encoded and decoded.

It is assumed that each local byte counter is an unsigned integer wider than 24 bits (probably 32 bits), and that the following constant has been assigned:

      DIVOPT = 2^24

Every time a CE marked data segment arrives, the Data Receiver increments its local value of r.ceb by the size of the TCP Data. Whenever it sends an ACK with the AccECN Option, the value it writes into the ECEB field is

      ECEB = r.ceb % DIVOPT

where '%' is the remainder operator.

On the arrival of an AccECN Option, the Data Sender first makes sure the ACK has not been superseded in order to avoid winding the s.ceb counter backwards. It uses the TCP acknowledgement number and any SACK options to calculate newlyAckedB, the amount of new data that the ACK acknowledges in bytes (newlyAckedB can be zero but not negative). If newlyAckedB is zero, either the ACK has been superseded or CE-marked packet(s) without data could have arrived. To break the tie for the latter case, the Data Sender could use timestamps (if present) to work out newlyAckedT, the amount of new time that the ACK acknowledges. If the Data Sender determines that the ACK has been superseded it ignores the AccECN Option.
Otherwise, the Data Sender calculates the minimum non-negative difference d.ceb between the ECEB field and its local s.ceb counter, using modulo arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
          d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
          s.ceb += d.ceb
      }

For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), then

      s.ceb % DIVOPT = 1
      d.ceb = (1461 + 2^24 - 1) % 2^24
            = 1460
      s.ceb = 33,554,433 + 1460
            = 33,555,893

A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

The example algorithms below show how a Data Receiver in AccECN mode could encode its CE packet counter r.cep into the ACE field, and how the Data Sender in AccECN mode could decode the ACE field into its s.cep counter. The Data Sender's algorithm includes code to heuristically detect a long enough unbroken string of ACK losses that could have concealed a cycle of the congestion counter in the ACE field of the next ACK to arrive.

Two variants of the algorithm are given: i) a more conservative variant for a Data Sender to use if it detects that the AccECN Option is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a less conservative variant that is feasible when complementary information is available from the AccECN Option.

A.2.1. Safety Algorithm without the AccECN Option

It is assumed that each local packet counter is a sufficiently sized unsigned integer (probably 32b) and that the following constant has been assigned:

      DIVACE = 2^3

Every time an Acceptable CE marked packet arrives (Section 3.2.2.2), the Data Receiver increments its local value of r.cep by 1. It repeats the same value of ACE in every subsequent ACK until the next CE marking arrives, where

      ACE = r.cep % DIVACE.
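As an illustration only (not part of the AccECN specification), the encoding and decoding steps of Appendix A.1, together with the ACE encoding just described, could be sketched in Python as follows. The function names and the snake_case argument names are assumptions made for this sketch; newly_acked_b and newly_acked_t stand for newlyAckedB and newlyAckedT, which the caller is assumed to have computed already.

```python
DIVOPT = 2 ** 24  # number space of the 24-bit byte-counter fields (e.g. ECEB)
DIVACE = 2 ** 3   # number space of the 3-bit ACE field


def encode_eceb(r_ceb: int) -> int:
    """Data Receiver: value written into the ECEB field of the option."""
    return r_ceb % DIVOPT


def encode_ace(r_cep: int) -> int:
    """Data Receiver: value written into the ACE field of the TCP header."""
    return r_cep % DIVACE


def decode_eceb(s_ceb: int, eceb: int,
                newly_acked_b: int, newly_acked_t: int = 0) -> int:
    """Data Sender: return the updated s.ceb counter.

    If the ACK acknowledges neither new data nor new time it is treated
    as superseded, and s.ceb is returned unchanged.
    """
    if newly_acked_b > 0 or newly_acked_t > 0:
        # Minimum non-negative difference, allowing for wrap of the field
        d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
        s_ceb += d_ceb
    return s_ceb
```

With the values of the worked example in Appendix A.1, decode_eceb(33554433, 1461, 1460) returns 33555893, and a superseded ACK (both newly-acknowledged amounts zero) leaves the counter untouched.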
If the Data Sender received an earlier value of the counter that had been delayed due to ACK reordering, it might incorrectly calculate that the ACE field had wrapped. Therefore, on the arrival of every ACK, the Data Sender ensures the ACK has not been superseded using the TCP acknowledgement number, any SACK options and timestamps (if available) to calculate newlyAckedB and newlyAckedT, as in Appendix A.1. If the ACK has not been superseded, the Data Sender calculates the minimum difference d.cep between the ACE field and its local s.cep counter, using modulo arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedT > 0))
          d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

Section 3.2.2.5 expects the Data Sender to assume that the ACE field cycled if it is the safest likely case under prevailing conditions. The 3-bit ACE field in an arriving ACK could have cycled and become ambiguous to the Data Sender if a run of consecutive ACKs goes missing that covers a stream of data long enough to contain 8 or more CE marks. We use the word `missing' rather than `lost', because some or all of the missing ACKs might arrive eventually, but out of order. Even if some of the missing ACKs were piggy-backed on data (i.e. not pure ACKs), retransmissions will not repair the lost AccECN information, because AccECN requires retransmissions to carry the latest AccECN counters, not the original ones.

The phrase `under prevailing conditions' allows for implementation-dependent interpretation. A Data Sender might take account of the prevailing size of data segments and the prevailing CE marking rate just before the sequence of missing ACKs.
However, we shall start with the simplest algorithm, which assumes segments are all full-sized and, ultra-conservatively, that ECN marking was 100% on the forward path when all the ACKs on the reverse path started to be dropped. Specifically, if newlyAckedB is the amount of data that an ACK acknowledges since the previous ACK, then the Data Sender could assume that this acknowledges newlyAckedPkt full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the ACE field incremented by

      dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE).

For example, imagine an ACK acknowledges newlyAckedPkt=9 more full-size segments than any previous ACK, and that ACE increments by a minimum of 2 CE marks (d.cep=2). The above formula works out that it would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 2). However, if ACE increases by a minimum of 2 but the ACK acknowledges 10 full-sized segments, then it would be necessary to assume that there could have been 10 CE marks (because 10 - ((10-2) % 8) = 10).

ACKs that acknowledge a large stretch of packets might be common in data centres to achieve a high packet rate or might be due to ACK thinning by a middlebox. In these cases, cycling of the ACE field would often appear to have been possible, so the above algorithm would be over-conservative, leading to a falsely high marking rate and poor performance. Therefore it would be reasonable to only use dSafer.cep rather than d.cep if the moving average of newlyAckedPkt was well below 8.

Implementers could build in more heuristics to estimate prevailing average segment size and prevailing ECN marking.
For instance, newlyAckedPkt in the above formula could be replaced with newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing segment size and p is the prevailing ECN marking probability. However, ultimately, if TCP's ECN feedback becomes inaccurate it still has loss detection to fall back on. Therefore, it would seem safe to implement a simple algorithm, rather than a perfect one.

The simple algorithm for dSafer.cep above requires no monitoring of prevailing conditions and it would still be safe if, for example, segments were on average at least 5% of full-sized as long as ECN marking was 5% or less. Assuming it was used, the Data Sender would increment its packet counter as follows:

      s.cep += dSafer.cep

If missing acknowledgement numbers arrive later (due to reordering), Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the effect of any action it took based on a conservative assumption that it later found to be incorrect". To do this, the Data Sender would have to store the values of all the relevant variables whenever it made assumptions, so that it could re-evaluate them later. Given this could become complex and it is not required, we do not attempt to provide an example of how to do this.

A.2.2. Safety Algorithm with the AccECN Option

When the AccECN Option is available on the ACKs before and after the possible sequence of ACK losses, if the Data Sender only needs CE-marked bytes, it will have sufficient information in the AccECN Option without needing to process the ACE field.
If for some reason it needs CE-marked packets, if dSafer.cep is different from d.cep, it can determine whether d.cep is likely to be a safe enough estimate by checking whether the average marked segment size (s = d.ceb/d.cep) is less than the MSS (where d.ceb is the amount of newly CE-marked bytes - see Appendix A.1). Specifically, it could use the following algorithm:

      SAFETY_FACTOR = 2
      if (dSafer.cep > d.cep) {
          // Same test as (s <= MSS), but avoids a divide by zero
          if (d.ceb <= MSS * d.cep) {
              sSafer = d.ceb/dSafer.cep
              if (sSafer < MSS/SAFETY_FACTOR)
                  dSafer.cep = d.cep  // d.cep is a safe enough estimate
          }
          // No need for an else clause; dSafer.cep is already correct,
          // because d.cep must have been too small
      }

The chart below shows when the above algorithm will consider d.cep can replace dSafer.cep as a safe enough estimate of the number of CE-marked packets:

                       ^
                 sSafer|
                       |
                    MSS+
                       |
                       |      dSafer.cep
                       |          is
      MSS/SAFETY_FACTOR+--------------+ safest
                       |              |
                       | d.cep is safe|
                       |    enough    |
                       +--------------------->
                                     MSS    s

The following examples give the reasoning behind the algorithm, assuming MSS = 1460 bytes:

o if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and sSafer=182.5.
Therefore even though the average size of 8 data segments is unlikely to have been as small as MSS/8, d.cep cannot have been correct, because it would imply an average segment size greater than the MSS.

o if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and sSafer=146.
Therefore d.cep is safe enough, because the average size of 10 data segments is unlikely to have been as small as MSS/10.

o if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and sSafer=680.
Therefore d.cep is safe enough, because the average data segment size is more likely to have been just less than one MSS, rather than below MSS/2.

If pure ACKs were allowed to be ECN-capable, missing ACKs would be far less likely. However, because [RFC3168] currently precludes this, the above algorithm assumes that pure ACKs are not ECN-capable.

A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

If the AccECN Option is not available, the Data Sender can only decode CE-marking from the ACE field in packets. Every time an ACK arrives, to convert this into an estimate of CE-marked bytes, it needs an average of the segment size, s_ave. Then it can add or subtract s_ave from the value of d.ceb as the value of d.cep increments or decrements. Some possible ways to calculate s_ave are outlined below. The precise details will depend on why an estimate of marked bytes is needed.

The implementation could keep a record of the byte numbers of all the boundaries between packets in flight (including control packets), and recalculate s_ave on every ACK. However it would be simpler to merely maintain a counter packets_in_flight for the number of packets in flight (including control packets), which is reset once per RTT. Either way, it would estimate s_ave as:

      s_ave ~= flightsize / packets_in_flight,

where flightsize is the variable that TCP already maintains for the number of bytes in flight. To avoid floating point arithmetic, it could right-bit-shift by lg(packets_in_flight), where lg() means log base 2.

An alternative would be to maintain an exponentially weighted moving average (EWMA) of the segment size:

      s_ave = a * s + (1-a) * s_ave,

where a is the decay constant for the EWMA.
However, then it is necessary to choose a good value for this constant, which ought to depend on the number of packets in flight. Also the decay constant needs to be a power of two to avoid floating point arithmetic.

A.4. Example Algorithm to Beacon AccECN Options

Section 3.2.3.3 requires a Data Receiver to beacon a full-length AccECN Option at least 3 times per RTT. This could be implemented by maintaining a variable to store the number of ACKs (pure and data ACKs) since a full AccECN Option was last sent and another for the approximate number of ACKs sent in the last round trip time:

      if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
          send_full_AccECN_Option()

For optimized integer arithmetic, BEACON_FREQ = 4 could be used, rather than 3, so that the division could be implemented as an integer right bit-shift by lg(BEACON_FREQ).

In certain operating systems, it might be too complex to maintain acks_in_round. In others it might be possible by tagging each data segment in the retransmit buffer with the number of ACKs sent at the point that segment was sent. This would not work well if the Data Receiver was not sending data itself, in which case it might be necessary to beacon based on time instead, as follows:

      if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
          send_full_AccECN_Option()

This time-based approach does not work well when all the ACKs are sent early in each round trip, as is the case during slow-start. In this case few options will be sent (possibly even fewer than 3 per RTT). However, when continuously sending data, data packets as well as ACKs will spread out equally over the RTT and sufficient ACKs with the AccECN option will be sent.

A.5.
Example Algorithm to Count Not-ECT Bytes 2489 A Data Sender in AccECN mode can infer the amount of TCP payload data 2490 arriving at the receiver marked Not-ECT from the difference between 2491 the amount of newly ACKed data and the sum of the bytes with the 2492 other three markings, d.ceb, d.e0b and d.e1b. Note that, because 2493 r.e0b is initialized to 1 and the other two counters are initialized 2494 to 0, the initial sum will be 1, which matches the initial offset of 2495 the TCP sequence number on completion of the 3WHS. 2497 For this approach to be precise, it has to be assumed that spurious 2498 (unnecessary) retransmissions do not lead to double counting. This 2499 assumption is currently correct, given that RFC 3168 requires that 2500 the Data Sender marks retransmitted segments as Not-ECT. However, 2501 the converse is not true; necessary retransmissions will result in 2502 under-counting. 2504 However, such precision is unlikely to be necessary. The only known 2505 use of a count of Not-ECT marked bytes is to test whether equipment 2506 on the path is clearing the ECN field (perhaps due to an out-dated 2507 attempt to clear, or bleach, what used to be the ToS field). To 2508 detect bleaching it will be sufficient to detect whether nearly all 2509 bytes arrive marked as Not-ECT. Therefore there should be no need to 2510 keep track of the details of retransmissions. 2512 Appendix B. Rationale for Usage of TCP Header Flags 2514 B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake 2516 AccECN uses a rather unorthodox approach to negotiate the highest 2517 version TCP ECN feedback scheme that both ends support, as justified 2518 below. It follows from the original TCP ECN capability negotiation 2519 [RFC3168], in which the client set the 2 least significant of the 2520 original reserved flags in the TCP header, and fell back to no ECN 2521 support if the server responded with the 2 flags cleared, which had 2522 previously been the default. 
   ECN originally used header flags rather than a TCP option because it
   was considered more efficient to use a header flag for 1 bit of
   feedback per ACK, and this bit could be overloaded to indicate
   support for ECN during the handshake.  During the development of
   ECN, 1 bit crept up to 2, in order to deliver the feedback reliably
   and to work round some broken hosts that reflected the reserved
   flags during the handshake.

   In order to be backward compatible with RFC 3168, AccECN continues
   this approach, using the 3rd least significant TCP header flag that
   had previously been allocated for the ECN nonce (now historic).
   Then, whatever form of server an AccECN client encounters, the
   connection can fall back to the highest version of feedback protocol
   that both ends support, as explained in Section 3.1.

   If AccECN had used the more orthodox approach of a TCP option, it
   would still have had to set the two ECN flags in the main TCP
   header, in order to be able to fall back to Classic RFC 3168 ECN, or
   to disable ECN support, without another round of negotiation.  Then
   AccECN would also have had to handle all the different ways that
   servers currently respond to settings of the ECN flags in the main
   TCP header, including all the conflicting cases where a server might
   have said it supported one approach in the flags and another
   approach in the new TCP option.  And AccECN would have had to deal
   with all the additional possibilities where a middlebox might have
   mangled the ECN flags, or removed the TCP option.  Thus, usage of
   the 3rd reserved TCP header flag simplified the protocol.

   The third flag was used in a way that could be distinguished from
   the ECN nonce, in case any nonce deployment was encountered.
   Previous usage of this flag for the ECN nonce was integrated into
   the original ECN negotiation.  This further justified the 3rd flag's
   use for AccECN, because a non-ECN usage of this flag would have had
   to use it as a separate single bit, rather than in combination with
   the other 2 ECN flags.

   Indeed, having overloaded the original uses of these three flags for
   its handshake, AccECN overloads all three bits again as a 3-bit
   counter.

B.2.  Four Codepoints in the SYN/ACK

   Of the 8 possible codepoints that the 3 TCP header flags can
   indicate on the SYN/ACK, 4 already indicated earlier (or broken)
   versions of ECN support.  In the early design of AccECN, an AccECN
   server could use only 2 of the 4 remaining codepoints.  They both
   indicated AccECN support, but one fed back that the SYN had arrived
   marked as CE.  Even though ECN support on a SYN is not yet on the
   standards track, the idea is for either end to act as a dumb
   reflector, so that future capabilities can be unilaterally deployed
   without requiring 2-ended deployment (justified in Section 2.5).

   During traversal testing it was discovered that the ECN field in the
   SYN was mangled on a non-negligible proportion of paths.  Therefore
   it was necessary to allow the SYN/ACK to feed back to the client all
   four IP/ECN codepoints that the SYN could arrive with.  Without
   this, the client could not know whether to disable ECN for the
   connection due to mangling of the IP/ECN field (also explained in
   Section 2.5).  This development consumed the remaining 2 codepoints
   on the SYN/ACK that had been reserved for future use by AccECN in
   earlier versions.

B.3.  Space for Future Evolution

   Although usable TCP header space is extremely scarce, the AccECN
   protocol has taken all possible steps to ensure that there is space
   to negotiate possible future variants of the protocol, either if the
   experiment proves that a variant of AccECN is required, or if a
   completely different ECN feedback approach is needed:

   Future AccECN variants:  When the AccECN capability is negotiated
      during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
      'Broken' in the column for the capability of node B are unused by
      any current protocol in the RFC series.  These could be used by
      TCP servers in future to indicate a variant of the AccECN
      protocol.  In recent measurement studies in which the response of
      large numbers of servers to an AccECN SYN has been tested, e.g.
      [Mandalari18], a very small number of SYN/ACKs arrive with the
      pattern tagged as 'Nonce', and a small but more significant
      number arrive with the pattern tagged as 'Broken'.  The 'Nonce'
      pattern could be a sign that a few servers have implemented the
      ECN Nonce [RFC3540], which has now been reclassified as historic
      [RFC8311], or it could be the random result of some unknown
      middlebox behaviour.  The greater prevalence of the 'Broken'
      pattern suggests that some instances still exist of the broken
      code that reflects the reserved flags on the SYN.

      The requirement not to reject unexpected initial values of the
      ACE counter (in the main TCP header) in the last paragraph of
      Section 3.2.2.3 ensures that 3 unused codepoints on the ACK of
      the SYN/ACK, 6 unused values on the first SYN=0 data packet from
      the client and 7 unused values on the first SYN=0 data packet
      from the server could be used to declare future variants of the
      AccECN protocol.  The word 'declare' is used rather than
      'negotiate' because, at this late stage in the 3WHS, it would be
      too late for a negotiation between the endpoints to be completed.
      A similar requirement not to reject unexpected initial values in
      the TCP option (Section 3.2.3.2.4) is for the same purpose.  If
      traversal of the TCP option were reliable, this would have
      enabled a far wider range of future variation of the whole AccECN
      protocol.  Nonetheless, it could be used to reliably negotiate a
      wide range of variation in the semantics of the AccECN Option.

   Future non-AccECN variants:  Five codepoints out of the 8 possible
      in the 3 TCP header flags used by AccECN are unused on the
      initial SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
      Section 3.1.3 ensures that the installed base of AccECN servers
      will all assume these are equivalent to AccECN negotiation with
      111 on the SYN.  These codepoints would not allow fall-back to
      Classic ECN support for a server that did not understand them,
      but this approach ensures they are available in future, perhaps
      for uses other than ECN alongside the AccECN scheme.  All
      possible combinations of SYN/ACK could be used in response except
      either 000 or reflection of the same values sent on the SYN.

      Of course, other ways could be resorted to in order to extend
      AccECN or ECN in future, although their traversal properties are
      likely to be inferior.  They include a new TCP option; using the
      remaining reserved flags in the main TCP header (preferably
      extending the 3-bit combinations used by AccECN to 4-bit
      combinations, rather than burning one bit for just one state); a
      non-zero urgent pointer in combination with the URG flag cleared;
      or some other unexpected combination of fields yet to be
      invented.
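   The treatment of SYN codepoints described under 'Future non-AccECN
   variants' above could be sketched in Python as follows.  This is
   only an illustration: the function name and the string labels are
   hypothetical, and the sketch assumes that 000 means no ECN, that 011
   is the RFC 3168 (Classic ECN) request, and that an AccECN server
   treats every other codepoint like 111, as Section 3.1.3 is said to
   require.

```python
# Hypothetical labels; only the codepoint values come from the text.
NOT_ECN = "not-ecn"
CLASSIC_ECN = "classic-ecn"
ACCECN = "accecn"


def classify_syn(ae: int, cwr: int, ece: int) -> str:
    """Classify the (AE,CWR,ECE) flags on a SYN as an AccECN server might."""
    codepoint = (ae << 2) | (cwr << 1) | ece
    if codepoint == 0b000:
        return NOT_ECN
    if codepoint == 0b011:
        # Assumed: CWR and ECE set is the RFC 3168 ECN request.
        return CLASSIC_ECN
    # 111 requests AccECN; the five currently unused codepoints are
    # handled as if they were 111.
    return ACCECN


# The codepoints not used on the SYN by any of the three cases above.
UNUSED_ON_SYN = [cp for cp in range(8) if cp not in (0b000, 0b011, 0b111)]
unused_as_bits = [format(cp, "03b") for cp in UNUSED_ON_SYN]
```

   Under these assumptions, unused_as_bits reproduces the five
   codepoints listed above, and each of them is classified in the same
   way as 111.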
Authors' Addresses

   Bob Briscoe
   Independent
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/

   Mirja Kuehlewind
   Ericsson
   Germany

   EMail: ietf@kuehlewind.net

   Richard Scheffenegger
   NetApp
   Vienna
   Austria

   EMail: Richard.Scheffenegger@netapp.com