idnits 2.17.1 draft-ietf-tcpm-accurate-ecn-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: A host MAY NOT include an AccECN Option in any of these three cases if it has cached knowledge that the packet would be likely to be blocked on the path to the other host if it included an AccECN Option. -- The document date (March 5, 2020) is 1513 days in the past. Is this intentional? 
Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'B' is mentioned on line 2370, but not defined ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) == Outdated reference: A later version (-11) exists of draft-ietf-tcpm-2140bis-02 == Outdated reference: A later version (-15) exists of draft-ietf-tcpm-generalized-ecn-05 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-05 -- Obsolete informational reference (is this intentional?): RFC 6824 (Obsoleted by RFC 8684) Summary: 1 error (**), 0 flaws (~~), 8 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance & Minor Extensions (tcpm) B. Briscoe 3 Internet-Draft Independent 4 Intended status: Experimental M. Kuehlewind 5 Expires: September 6, 2020 Ericsson 6 R. Scheffenegger 7 NetApp 8 March 5, 2020 10 More Accurate ECN Feedback in TCP 11 draft-ietf-tcpm-accurate-ecn-10 13 Abstract 15 Explicit Congestion Notification (ECN) is a mechanism where network 16 nodes can mark IP packets instead of dropping them to indicate 17 incipient congestion to the end-points. Receivers with an ECN- 18 capable transport protocol feed back this information to the sender. 19 ECN is specified for TCP in such a way that only one feedback signal 20 can be transmitted per Round-Trip Time (RTT). Recent new TCP 21 mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP) 22 or Low Latency Low Loss Scalable Throughput (L4S) need more accurate 23 ECN feedback information whenever more than one marking is received 24 in one RTT. This document specifies an experimental scheme to 25 provide more than one feedback signal per RTT in the TCP header. 
26 Given TCP header space is scarce, it allocates a reserved header bit, 27 that was previously used for the ECN-Nonce which has now been 28 declared historic. It also overloads the two existing ECN flags in 29 the TCP header. The resulting extra space is exploited to feed back 30 the IP-ECN field received during the 3-way handshake as well. 31 Supplementary feedback information can optionally be provided in a 32 new TCP option, which is never used on the TCP SYN. 34 Status of This Memo 36 This Internet-Draft is submitted in full conformance with the 37 provisions of BCP 78 and BCP 79. 39 Internet-Drafts are working documents of the Internet Engineering 40 Task Force (IETF). Note that other groups may also distribute 41 working documents as Internet-Drafts. The list of current Internet- 42 Drafts is at https://datatracker.ietf.org/drafts/current/. 44 Internet-Drafts are draft documents valid for a maximum of six months 45 and may be updated, replaced, or obsoleted by other documents at any 46 time. It is inappropriate to use Internet-Drafts as reference 47 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on September 6, 2020. 50 Copyright Notice 52 Copyright (c) 2020 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (https://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License.
65 Table of Contents
67 1. Introduction
68   1.1. Document Roadmap
69   1.2. Goals
70   1.3. Experiment Goals
71   1.4. Terminology
72   1.5. Recap of Existing ECN feedback in IP/TCP
73 2. AccECN Protocol Overview and Rationale
74   2.1. Capability Negotiation
75   2.2. Feedback Mechanism
76   2.3. Delayed ACKs and Resilience Against ACK Loss
77   2.4. Feedback Metrics
78   2.5. Generic (Dumb) Reflector
79 3. AccECN Protocol Specification
80   3.1. Negotiating to use AccECN
81     3.1.1. Negotiation during the TCP handshake
82     3.1.2. Backward Compatibility
83     3.1.3. Forward Compatibility
84     3.1.4. Retransmission of the SYN
85     3.1.5. Implications of AccECN Mode
86   3.2. AccECN Feedback
87     3.2.1. Initialization of Feedback Counters
88     3.2.2. The ACE Field
89     3.2.3. The AccECN Option
90   3.3. Requirements for TCP Proxies, Offload Engines and other
91        Middleboxes on AccECN Compliance
92 4. Interaction with Other TCP Variants
93   4.1. Compatibility with SYN Cookies
94   4.2. Compatibility with Other TCP Options and Experiments
95   4.3. Compatibility with Feedback Integrity Mechanisms
97 5. Protocol Properties
98 6. IANA Considerations
99 7. Security Considerations
100 8. Acknowledgements
101 9. Comments Solicited
102 10. References
103   10.1. Normative References
104   10.2. Informative References
105 Appendix A. Example Algorithms
106   A.1. Example Algorithm to Encode/Decode the AccECN Option
107   A.2. Example Algorithm for Safety Against Long Sequences of
108        ACK Loss
109     A.2.1. Safety Algorithm without the AccECN Option
110     A.2.2. Safety Algorithm with the AccECN Option
111   A.3. Example Algorithm to Estimate Marked Bytes from Marked
112        Packets
113   A.4. Example Algorithm to Beacon AccECN Options
114   A.5. Example Algorithm to Count Not-ECT Bytes
115 Appendix B. Rationale for Usage of TCP Header Flags
116   B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake
117   B.2. Four Codepoints in the SYN/ACK
118   B.3. Space for Future Evolution
119 Authors' Addresses
121 1. Introduction 123 Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where 124 network nodes can mark IP packets instead of dropping them to 125 indicate incipient congestion to the end-points. Receivers with an 126 ECN-capable transport protocol feed back this information to the 127 sender. ECN is specified for TCP in such a way that only one 128 feedback signal can be transmitted per Round-Trip Time (RTT).
129 Recently proposed mechanisms like Congestion Exposure (ConEx 130 [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to 131 know when more than one marking is received in one RTT, which is 132 information that cannot be provided by the feedback scheme as 133 specified in [RFC3168]. This document specifies an alternative 134 feedback scheme that provides more accurate information and could be 135 used by these new TCP extensions. A fuller treatment of the 136 motivation for this specification is given in the associated 137 requirements document [RFC7560]. 139 This document specifies an experimental scheme for ECN feedback in 140 the TCP header to provide more than one feedback signal per RTT. It 141 will be called the more accurate ECN feedback scheme, or AccECN for 142 short. If AccECN progresses from experimental to the standards 143 track, it is intended to be a complete replacement for classic TCP/ 144 ECN feedback, not a fork in the design of TCP. AccECN feedback 145 complements TCP's loss feedback and it supplements classic TCP/ECN 146 feedback, so its applicability is intended to include all public and 147 private IP networks (and even any non-IP networks over which TCP is 148 used today), whether or not any nodes on the path support ECN of 149 whatever flavour. 151 Until the AccECN experiment succeeds, [RFC3168] will remain as the 152 only standards track specification for adding ECN to TCP. To avoid 153 confusion, in this document we use the term 'classic ECN' for the 154 pre-existing ECN specification [RFC3168]. 156 AccECN feedback overloads the two existing ECN flags and allocates 157 the currently reserved flag (previously called NS) in the TCP header, 158 to be used as one field indicating the number of congestion 159 experienced marked packets. Given the new definitions of these three 160 bits, both ends have to support the new wire protocol before it can 161 be used. 
Therefore during the TCP handshake the two ends use these 162 three bits in the TCP header to negotiate the most advanced feedback 163 protocol that they can both support, in a way that is backward 164 compatible with [RFC3168]. 166 AccECN is solely an (experimental) change to the TCP wire protocol; 167 it only specifies the negotiation and signaling of more accurate ECN 168 feedback from a TCP Data Receiver to a Data Sender. It is completely 169 independent of how TCP might respond to congestion feedback, which is 170 out of scope. For that we refer to [RFC3168] or any RFC that 171 specifies a different response to TCP ECN feedback, for example: 172 [RFC8257]; or ECN experiments such as those referred to in [RFC8311], 173 namely: a TCP-based Low Latency Low Loss Scalable (L4S) congestion 174 control [I-D.ietf-tsvwg-l4s-arch]; ECN-capable TCP control packets 175 [I-D.ietf-tcpm-generalized-ecn], or Alternative Backoff with ECN 176 (ABE) [RFC8511]. 178 It is recommended that the AccECN protocol is implemented alongside 179 SACK [RFC2018] and the experimental ECN++ protocol 180 [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to 181 be used on TCP control packets. Therefore, this specification does 182 not discuss implementing AccECN alongside [RFC5562], which was an 183 earlier experimental protocol with narrower scope than ECN++. 185 1.1. Document Roadmap 187 The following introductory sections outline the goals of AccECN 188 (Section 1.2) and the goal of experiments with ECN (Section 1.3) so 189 that it is clear what success would look like. Then terminology is 190 defined (Section 1.4) and a recap of existing prerequisite technology 191 is given (Section 1.5). 193 Section 2 gives an informative overview of the AccECN protocol. Then 194 Section 3 gives the normative protocol specification. Section 4 195 assesses the interaction of AccECN with commonly used variants of 196 TCP, whether standardized or not. 
Section 5 summarizes the features 197 and properties of AccECN. 199 Section 6 summarizes the protocol fields and numbers that IANA will 200 need to assign and Section 7 points to the aspects of the protocol 201 that will be of interest to the security community. 203 Appendix A gives pseudocode examples for the various algorithms that 204 AccECN uses. 206 1.2. Goals 208 [RFC7560] enumerates requirements that a candidate feedback scheme 209 will need to satisfy, under the headings: resilience, timeliness, 210 integrity, accuracy (including ordering and lack of bias), 211 complexity, overhead and compatibility (both backward and forward). 212 It recognizes that a perfect scheme that fully satisfies all the 213 requirements is unlikely and trade-offs between requirements are 214 likely. Section 5 presents the properties of AccECN against these 215 requirements and discusses the trade-offs made. 217 The requirements document recognizes that a protocol as ubiquitous as 218 TCP needs to be able to serve as-yet-unspecified requirements. 219 Therefore an AccECN receiver aims to act as a generic (dumb) 220 reflector of congestion information so that in future new sender 221 behaviours can be deployed unilaterally. 223 1.3. Experiment Goals 225 TCP is critical to the robust functioning of the Internet, therefore 226 any proposed modifications to TCP need to be thoroughly tested. The 227 present specification describes an experimental protocol that adds 228 more accurate ECN feedback to the TCP protocol. The intention is to 229 specify the protocol sufficiently so that more than one 230 implementation can be built in order to test its function, robustness 231 and interoperability (with itself and with previous versions of ECN 232 and TCP). 234 The experimental protocol will be considered successful if testing 235 confirms that the proposed mechanism can be deployed at large scale. 236 Testing will mostly focus on fall-back strategies in case of 237 middlebox interference. 
Current recommended strategies are specified 238 in Sections 3.1.4, 3.2.2.3, 3.2.2.4 and 3.2.3.2. The effectiveness 239 of these strategies depends on the actual deployment situation of 240 middleboxes. Therefore experimental verification to confirm large- 241 scale path traversal in the Internet is needed before finalizing this 242 specification on the Standards Track. 244 Another experimentation focus is the implementation feasibility of 245 change-triggered ACKs as described in Section 3.2.3.3. While on 246 average this should not lead to a higher ACK rate, it changes the ACK 247 pattern which can particularly have an impact on hardware offload. 248 It is currently specified as a hard requirement, because the sender 249 can exploit the predictability of the receiver's behaviour. However, 250 further experimentation is needed to advise whether it will have to become 251 just preferred behavior. 253 1.4. Terminology 255 AccECN: The more accurate ECN feedback scheme will be called AccECN 256 for short. 258 Classic ECN: the ECN protocol specified in [RFC3168]. 260 Classic ECN feedback: the feedback aspect of the ECN protocol 261 specified in [RFC3168], including generation, encoding, 262 transmission and decoding of feedback, but not the Data Sender's 263 subsequent response to that feedback. 265 ACK: A TCP acknowledgement, with or without a data payload (ACK=1). 267 Pure ACK: A TCP acknowledgement without a data payload. 269 Acceptable packet / segment: A packet or segment that passes the 270 acceptability tests in [RFC0793] and [RFC5961]. 272 TCP client: The TCP stack that originates a connection. 274 TCP server: The TCP stack that responds to a connection request. 276 Data Receiver: The endpoint of a TCP half-connection that receives 277 data and sends AccECN feedback. 279 Data Sender: The endpoint of a TCP half-connection that sends data 280 and receives AccECN feedback. 
282 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 283 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 284 document are to be interpreted as described in BCP 14 [RFC2119] 285 [RFC8174] when, and only when, they appear in all capitals, as shown 286 here. 288 1.5. Recap of Existing ECN feedback in IP/TCP 290 ECN [RFC3168] uses two bits in the IP header. Once ECN has been 291 negotiated with the receiver at the transport layer, an ECN sender 292 can set two possible codepoints (ECT(0) or ECT(1)) in the IP header 293 to indicate an ECN-capable transport (ECT). If both ECN bits are 294 zero, the packet is considered to have been sent by a Not-ECN-capable 295 Transport (Not-ECT). When a network node experiences congestion, it 296 will occasionally either drop or mark a packet, with the choice 297 depending on the packet's ECN codepoint. If the codepoint is Not- 298 ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1), 299 the node can mark the packet by setting both ECN bits, which is 300 termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'. 301 Table 1 summarises these codepoints.
303 +-------------------------+---------------+-------------------------+
304 | IP-ECN codepoint        | Codepoint     | Description             |
305 | (binary)                | name          |                         |
306 +-------------------------+---------------+-------------------------+
307 | 00                      | Not-ECT       | Not ECN-Capable         |
308 |                         |               | Transport               |
309 | 01                      | ECT(1)        | ECN-Capable Transport   |
310 |                         |               | (1)                     |
311 | 10                      | ECT(0)        | ECN-Capable Transport   |
312 |                         |               | (0)                     |
313 | 11                      | CE            | Congestion Experienced  |
314 +-------------------------+---------------+-------------------------+
316 Table 1: The ECN Field in the IP Header
318 In the TCP header the first two bits in byte 14 are defined as flags 319 for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). 
A TCP client 320 indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an 321 ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in 322 the SYN/ACK. On reception of a CE-marked packet at the IP layer, the 323 Data Receiver starts to set the Echo Congestion Experienced (ECE) 324 flag continuously in the TCP header of ACKs, which ensures the signal 325 is received reliably even if ACKs are lost. The TCP sender confirms 326 that it has received at least one ECE signal by responding with the 327 congestion window reduced (CWR) flag, which allows the TCP receiver 328 to stop repeating the ECN-Echo flag. This always leads to a full RTT 329 of ACKs with ECE set. Thus any additional CE markings arriving 330 within this RTT cannot be fed back. 332 The last bit in byte 13 of the TCP header was defined as the Nonce 333 Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread 334 deployment RFC 3540 has been reclassified as historic [RFC8311] and 335 the respective flag has been marked as "reserved", making this TCP 336 flag available for use by the AccECN experiment instead.
338   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
339 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
340 |               |           | N | C | E | U | A | P | R | S | F |
341 | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
342 |               |           |   | R | E | G | K | H | T | N | N |
343 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
345 Figure 1: The (post-ECN Nonce) definition of the TCP header flags
347 2. AccECN Protocol Overview and Rationale 349 This section provides an informative overview of the AccECN protocol 350 that will be normatively specified in Section 3. 352 Like the original TCP approach, the Data Receiver of each TCP half- 353 connection sends AccECN feedback to the Data Sender on TCP 354 acknowledgements, reusing data packets of the other half-connection 355 whenever possible. 
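As a recap of the classic encoding in Section 1.5, the following minimal Python sketch decodes the two-bit IP-ECN field of Table 1 and the three ECN-related flag bits of Figure 1. It assumes the caller has already extracted the IPv4 TOS (or IPv6 Traffic Class) octet and the 16-bit word shown in Figure 1. The function and constant names are illustrative assumptions and are not defined by this specification.

```python
# Illustrative sketch only; all names are hypothetical, not from the draft.

# IP-ECN codepoints from Table 1 (the two least significant bits of the
# IPv4 TOS octet or the IPv6 Traffic Class octet).
IP_ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}

# Masks within the 16-bit word of Figure 1. Bit 0 in the figure is the
# most significant bit, so bit 7 is 0x0100, bit 8 is 0x0080 and bit 9
# is 0x0040.
NS_MASK = 0x0100   # former Nonce Sum flag, now marked as reserved
CWR_MASK = 0x0080
ECE_MASK = 0x0040
SYN_MASK = 0x0002


def ip_ecn_name(tos_octet):
    """Return the Table 1 codepoint name for a TOS / Traffic Class octet."""
    return IP_ECN_NAMES[tos_octet & 0b11]


def ecn_flags(flags_word):
    """Return (NS, CWR, ECE) from the 16-bit word shown in Figure 1."""
    return (
        int(bool(flags_word & NS_MASK)),
        int(bool(flags_word & CWR_MASK)),
        int(bool(flags_word & ECE_MASK)),
    )


# A classic ECN setup SYN: header length of 5 words, SYN=1, ECE=CWR=1.
classic_syn = (5 << 12) | CWR_MASK | ECE_MASK | SYN_MASK
print(ip_ecn_name(0x03), ecn_flags(classic_syn))
```

Working on the 16-bit word rather than on individual octets keeps the masks aligned with the bit numbering used in Figure 1.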
357 The AccECN protocol has had to be designed in two parts: 359 o an essential part that re-uses ECN TCP header bits to feed back 360 the number of arriving CE marked packets. This provides more 361 accuracy than classic ECN feedback, but limited resilience against 362 ACK loss; 364 o a supplementary part using a new AccECN TCP Option that provides 365 additional feedback on the number of bytes that arrive marked with 366 each of the three ECN codepoints (not just CE marks). This 367 provides greater resilience against ACK loss than the essential 368 feedback, but it is more likely to suffer from middlebox 369 interference. 371 The two part design was necessary, given limitations on the space 372 available for TCP options and given the possibility that certain 373 incorrectly designed middleboxes prevent TCP using any new options. 375 The essential part overloads the previous definition of the three 376 flags in the TCP header that had been assigned for use by ECN. This 377 design choice deliberately replaces the classic ECN feedback 378 protocol, rather than leaving classic ECN feedback intact and adding 379 more accurate feedback separately because: 381 o this efficiently reuses scarce TCP header space, given TCP option 382 space is approaching saturation; 384 o a single upgrade path for the TCP protocol is preferable to a fork 385 in the design; 387 o otherwise classic and accurate ECN feedback could give conflicting 388 feedback on the same segment, which could open up new security 389 concerns and make implementations unnecessarily complex; 391 o middleboxes are more likely to faithfully forward the TCP ECN 392 flags than newly defined areas of the TCP header. 394 AccECN is designed to work even if the supplementary part is removed 395 or zeroed out, as long as the essential part gets through. 397 2.1. 
Capability Negotiation 399 AccECN is a change to the wire protocol of the main TCP header, 400 therefore it can only be used if both endpoints have been upgraded to 401 understand it. The TCP client signals support for AccECN on the 402 initial SYN of a connection and the TCP server signals whether it 403 supports AccECN on the SYN/ACK. The TCP flags on the SYN that the 404 client uses to signal AccECN support have been carefully chosen so 405 that a TCP server will interpret them as a request to support the 406 most recent variant of ECN feedback that it supports. Then the 407 client falls back to the same variant of ECN feedback. 409 An AccECN TCP client does not send the new AccECN Option on the SYN 410 as SYN option space is limited. The TCP server sends the AccECN 411 Option on the SYN/ACK and the client sends it on the first ACK to 412 test whether the network path forwards the option correctly. 414 2.2. Feedback Mechanism 416 A Data Receiver maintains four counters initialized at the start of 417 the half-connection. Three count the number of arriving payload 418 bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts 419 the number of packets arriving marked with a CE codepoint (including 420 control packets without payload if they are CE-marked). 422 The Data Sender maintains four equivalent counters for the half 423 connection, and the AccECN protocol is designed to ensure they will 424 match the values in the Data Receiver's counters, albeit after a 425 little delay. 427 Each ACK carries the three least significant bits (LSBs) of the 428 packet-based CE counter using the ECN bits in the TCP header, now 429 renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24 430 LSBs of each byte counter are carried in the AccECN Option. 432 2.3. Delayed ACKs and Resilience Against ACK Loss 434 With both the ACE and the AccECN Option mechanisms, the Data Receiver 435 continually repeats the current LSBs of each of its respective 436 counters. 
There is no need to acknowledge these continually repeated 437 counters, so the congestion window reduced (CWR) mechanism is no 438 longer used. Even if some ACKs are lost, the Data Sender should be 439 able to infer how much to increment its own counters, even if the 440 protocol field has wrapped. 442 The 3-bit ACE field can wrap fairly frequently. Therefore, even if 443 it appears to have incremented by one (say), the field might have 444 actually cycled completely then incremented by one. The Data 445 Receiver is not allowed to delay sending an ACK to such an extent 446 that the ACE field would cycle. However cycling is still a 447 possibility at the Data Sender because a whole sequence of ACKs 448 carrying intervening values of the field might all be lost or delayed 449 in transit. 451 The fields in the AccECN Option are larger, but they will increment 452 in larger steps because they count bytes not packets. Nonetheless, 453 their size has been chosen such that a whole cycle of the field would 454 never occur between ACKs unless there had been an infeasibly long 455 sequence of ACK losses. Therefore, as long as the AccECN Option is 456 available, it can be treated as a dependable feedback channel. 458 If the AccECN Option is not available, e.g. it is being stripped by a 459 middlebox, the AccECN protocol will only feed back information on CE 460 markings (using the ACE field). Although not ideal, this will be 461 sufficient, because it is envisaged that neither ECT(0) nor ECT(1) 462 will ever indicate more severe congestion than CE, even though future 463 uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the 464 3-bit ACE field is so small, when it is the only field available the 465 Data Sender has to interpret it assuming the most likely wrap, but 466 with a degree of conservatism. 468 Certain specified events trigger the Data Receiver to include an 469 AccECN Option on an ACK. 
The rules are designed to ensure that the 470 order in which different markings arrive at the receiver is 471 communicated to the sender (as long as options are reaching the 472 sender and as long as there is no ACK loss). Implementations are 473 encouraged to send an AccECN Option more frequently, but this is left 474 up to the implementer. 476 2.4. Feedback Metrics 478 The CE packet counter in the ACE field and the CE byte counter in the 479 AccECN Option both provide feedback on received CE-marks. The CE 480 packet counter includes control packets that do not have payload 481 data, while the CE byte counter solely includes marked payload bytes. 482 If both are present, the byte counter in the option will provide the 483 more accurate information needed for modern congestion control and 484 policing schemes, such as L4S, DCTCP or ConEx. If the option is 485 stripped, a simple algorithm to estimate the number of marked bytes 486 from the ACE field is given in Appendix A.3. 488 Feedback in bytes is recommended in order to protect against the 489 receiver using attacks similar to 'ACK-Division' to artificially 490 inflate the congestion window, which is why [RFC5681] now recommends 491 that TCP counts acknowledged bytes not packets. 493 2.5. Generic (Dumb) Reflector 495 The ACE field provides information about CE markings on both data and 496 control packets. According to [RFC3168] the Data Sender is meant to 497 set control packets to Not-ECT. However, mechanisms in certain 498 private networks (e.g. data centres) set control packets to be ECN 499 capable because they are precisely the packets that performance 500 depends on most. 502 For this reason, AccECN is designed to be a generic reflector of 503 whatever ECN markings it sees, whether or not they are compliant with 504 a current standard. Then as standards evolve, Data Senders can 505 upgrade unilaterally without any need for receivers to upgrade too. 
506 It is also useful to be able to rely on generic reflection behaviour 507 when senders need to test for unexpected interference with markings 508 (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of 509 the present document, para 2 of Section 20.2 of [RFC3168] and 510 [I-D.kuehlewind-tcpm-ecn-fallback]). 512 The initial SYN is the most critical control packet, so AccECN 513 provides feedback on its ECN marking. Although RFC 3168 prohibits an 514 ECN-capable SYN, providing feedback of ECN marking on the SYN 515 supports future scenarios in which SYNs might be ECN-enabled (without 516 prejudging whether they ought to be). For instance, [RFC8311] 517 updates this aspect of RFC 3168 to allow experimentation with ECN- 518 capable TCP control packets. 520 Even if the TCP client (or server) has set the SYN (or SYN/ACK) to 521 Not-ECT in compliance with RFC 3168, feedback on the state of the ECN 522 field when it arrives at the receiver could still be useful, because 523 middleboxes have been known to overwrite the ECN IP field as if it is 524 still part of the old Type of Service (ToS) field [Mandalari18]. If 525 a TCP client has set the SYN to Not-ECT, but receives feedback that 526 the ECN field on the SYN arrived with a different codepoint, it can 527 detect such middlebox interference and send Not-ECT for the rest of 528 the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]). Today, if a 529 TCP server receives ECT or CE on a SYN, it cannot know whether it is 530 invalid (or valid) because only the TCP client knows whether it 531 originally marked the SYN as Not-ECT (or ECT). Therefore, prior to 532 AccECN, the server's only safe course of action was to disable ECN 533 for the connection. Instead, the AccECN protocol allows the server 534 to feed back the received ECN field to the client, which then has all 535 the information to decide whether the connection has to fall-back 536 from supporting ECN (or not). 538 3. 
AccECN Protocol Specification 540 3.1. Negotiating to use AccECN 542 3.1.1. Negotiation during the TCP handshake 544 Given the ECN Nonce [RFC3540] has been reclassified as historic 545 [RFC8311], the present specification re-allocates the TCP flag at bit 546 7 of the TCP header, which was previously called NS (Nonce Sum), as 547 the AE (Accurate ECN) flag (see IANA Considerations in Section 6) as 548 shown below.
550   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
551 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
552 |               |           | A | C | E | U | A | P | R | S | F |
553 | Header Length | Reserved  | E | W | C | R | C | S | S | Y | I |
554 |               |           |   | R | E | G | K | H | T | N | N |
555 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
557 Figure 2: The (post-AccECN) definition of the TCP header flags during 558 the TCP handshake
560 During the TCP handshake at the start of a connection, to request 561 more accurate ECN feedback the TCP client (host A) MUST set the TCP 562 flags AE=1, CWR=1 and ECE=1 in the initial SYN segment. 564 If a TCP server (B) that is AccECN-enabled receives a SYN with the 565 above three flags set, it MUST set both its half connections into 566 AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of 567 the 4 values shown in the top block of Table 2 to confirm that it 568 supports AccECN. The TCP server MUST NOT set one of these 4 569 combinations of flags on the SYN/ACK unless the preceding SYN 570 requested support for AccECN as above. 572 A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on 573 the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field 574 that arrived on the SYN. This applies whether or not the server 575 itself supports setting the IP-ECN field on a SYN or SYN/ACK (see 576 Section 2.5 for rationale). 
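The server rule above amounts to a small lookup. The following minimal Python sketch uses the (AE, CWR, ECE) values from the top block of Table 2. The function and variable names are illustrative assumptions, and a real stack would operate on header bits rather than on tuples and strings.

```python
# Illustrative sketch only; all names are hypothetical, not from the draft.

ACCECN_SYN = (1, 1, 1)  # (AE, CWR, ECE) on a SYN that requests AccECN

# Top block of Table 2: IP-ECN codepoint that arrived on the SYN ->
# (AE, CWR, ECE) that an AccECN server sets on the SYN/ACK.
SYNACK_FLAGS = {
    "Not-ECT": (0, 1, 0),
    "ECT(1)": (0, 1, 1),
    "ECT(0)": (1, 0, 0),
    "CE": (1, 1, 0),
}

# Reverse mapping, as a client could use it to recover the codepoint
# that the server saw on the SYN.
SYN_ECN_FROM_SYNACK = {flags: ecn for ecn, flags in SYNACK_FLAGS.items()}


def accecn_synack_flags(syn_flags, syn_ip_ecn):
    """Return the SYN/ACK flags that confirm AccECN and feed back the
    IP-ECN field of the SYN, or None if the SYN did not request AccECN
    (in which case these four combinations must not be used)."""
    if tuple(syn_flags) != ACCECN_SYN:
        return None
    return SYNACK_FLAGS[syn_ip_ecn]


flags = accecn_synack_flags((1, 1, 1), "CE")
print(flags, SYN_ECN_FROM_SYNACK[flags])
```

Because the four combinations are distinct, the client can invert the mapping without any further state, which is what makes the handshake usable as a feedback channel for the SYN's IP-ECN field.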
578 Once a TCP client (A) has sent the above SYN to declare that it 579 supports AccECN, and once it has received the above SYN/ACK segment 580 that confirms that the TCP server supports AccECN, the TCP client 581 MUST set both its half connections into AccECN mode. 583 Once in AccECN mode, a TCP client or server has the rights and 584 obligations to participate in the ECN protocol defined in 585 Section 3.1.5. 587 The procedure for the client to follow if a SYN/ACK does not arrive 588 before its retransmission timer expires is given in Section 3.1.4. 590 3.1.2. Backward Compatibility 592 The three flags set to 1 to indicate AccECN support on the SYN have 593 been carefully chosen to enable natural fall-back to prior stages in 594 the evolution of ECN, as above. Table 2 tabulates all the 595 negotiation possibilities for ECN-related capabilities that involve 596 at least one AccECN-capable host. The entries in the first two 597 columns have been abbreviated, as follows: 599 AccECN: More Accurate ECN Feedback (the present specification) 601 Nonce: ECN Nonce feedback [RFC3540] 603 ECN: 'Classic' ECN feedback [RFC3168] 605 No ECN: Not-ECN-capable. Implicit congestion notification using 606 packet drop. 
608 +--------+--------+------------+-----------+------------------------+ 609 | A | B | SYN A->B | SYN/ACK | Feedback Mode | 610 | | | | B->A | | 611 +--------+--------+------------+-----------+------------------------+ 612 | | | AE CWR ECE | AE CWR | | 613 | | | | ECE | | 614 | AccECN | AccECN | 1 1 1 | 0 1 0 | AccECN (no ECT on SYN) | 615 | AccECN | AccECN | 1 1 1 | 0 1 1 | AccECN (ECT1 on SYN) | 616 | AccECN | AccECN | 1 1 1 | 1 0 0 | AccECN (ECT0 on SYN) | 617 | AccECN | AccECN | 1 1 1 | 1 1 0 | AccECN (CE on SYN) | 618 | | | | | | 619 | AccECN | Nonce | 1 1 1 | 1 0 1 | (Reserved) | 620 | AccECN | ECN | 1 1 1 | 0 0 1 | classic ECN | 621 | AccECN | No ECN | 1 1 1 | 0 0 0 | Not ECN | 622 | | | | | | 623 | Nonce | AccECN | 0 1 1 | 0 0 1 | classic ECN | 624 | ECN | AccECN | 0 1 1 | 0 0 1 | classic ECN | 625 | No ECN | AccECN | 0 0 0 | 0 0 0 | Not ECN | 626 | | | | | | 627 | AccECN | Broken | 1 1 1 | 1 1 1 | Not ECN | 628 +--------+--------+------------+-----------+------------------------+ 630 Table 2: ECN capability negotiation between Client (A) and Server (B) 632 Table 2 is divided into blocks each separated by an empty row. 634 1. The top block shows the case already described in Section 3.1 635 where both endpoints support AccECN and how the TCP server (B) 636 indicates congestion feedback. 638 2. The second block shows the cases where the TCP client (A) 639 supports AccECN but the TCP server (B) supports some earlier 640 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 641 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 642 shown it MUST set both its half connections into the feedback 643 mode shown in the rightmost column. If it has set itself into 644 classic ECN feedback mode it MUST then comply with [RFC3168]. 646 The server response called 'Nonce' in the table is now historic. 
647 For an AccECN implementation, there is no need to recognize or 648 support ECN Nonce feedback [RFC3540], which has been reclassified 649 as historic [RFC8311]. AccECN is compatible with alternative ECN 650 feedback integrity approaches (see Section 4.3). 652 3. The third block shows the cases where the TCP server (B) supports 653 AccECN but the TCP client (A) supports some earlier variant of 654 TCP feedback, indicated in its SYN. 656 When an AccECN-enabled TCP server (B) receives a SYN with 657 AE,CWR,ECE = 0,1,1 it MUST do one of the following: 659 * set both its half connections into the classic ECN feedback 660 mode and return a SYN/ACK with AE, CWR, ECE = 0,0,1 as shown. 661 Then it MUST comply with [RFC3168]. 663 * set both its half-connections into No ECN mode and return a 664 SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN 665 disabled. This latter case is unlikely to be desirable, but 666 it is allowed as a possibility, e.g. for minimal TCP 667 implementations. 669 When an AccECN-enabled TCP server (B) receives a SYN with 670 AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the 671 Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0 672 as shown and continue with ECN disabled. 674 4. The fourth block displays a combination labelled `Broken'. Some 675 older TCP server implementations incorrectly set the reserved 676 flags in the SYN/ACK by reflecting those in the SYN. Such broken 677 TCP servers (B) cannot support ECN, so as soon as an AccECN- 678 capable TCP client (A) receives such a broken SYN/ACK it MUST 679 fall back to Not ECN mode for both its half connections and 680 continue with ECN disabled. 682 The following additional rules do not fit the structure of the table, 683 but they complement it: 685 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 686 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 
687 Host A MUST then enter the same feedback mode as it would have 688 entered had it been a responding host and received the same SYN. 689 Then host A MUST send the same SYN/ACK as it would have sent had 690 it been a responding host. 692 In-window SYN during TIME-WAIT: Many TCP implementations create a 693 new TCP connection if they receive an in-window SYN packet during 694 TIME-WAIT state. When a TCP host enters TIME-WAIT or CLOSED 695 state, it should ignore any previous state about the negotiation 696 of AccECN for that connection and renegotiate the feedback mode 697 according to Table 2. 699 3.1.3. Forward Compatibility 701 If a TCP server that implements AccECN receives a SYN with the three 702 TCP header flags (AE, CWR and ECE) set to any combination other than 703 000, 011 or 111, it MUST negotiate the use of AccECN as if they had 704 been set to 111. This ensures that future uses of the other 705 combinations on a SYN can rely on consistent behaviour from the 706 installed base of AccECN servers. 708 For the avoidance of doubt, the behaviour described in the present 709 specification applies whether or not the three remaining reserved TCP 710 header flags are zero. 712 3.1.4. Retransmission of the SYN 714 If the sender of an AccECN SYN times out before receiving the SYN/ 715 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 716 least one more time by continuing to set all three TCP ECN flags on 717 the first retransmitted SYN (using the usual retransmission time- 718 outs). If this first retransmission also fails to be acknowledged, 719 the sender SHOULD send subsequent retransmissions of the SYN with the 720 three TCP-ECN flags cleared (AE=CWR=ECE=0). A retransmitted SYN MUST 721 use the same ISN as the original SYN. 723 Retrying once before fall-back adds delay in the case where a 724 middlebox drops an AccECN (or ECN) SYN deliberately. 
However, 725 current measurements imply that a drop is less likely to be due to 726 middlebox interference than other intermittent causes of loss, e.g. 727 congestion, wireless interference, etc. 729 Implementers MAY use other fall-back strategies if they are found to 730 be more effective (e.g. attempting to negotiate AccECN on the SYN 731 only once or more than twice (most appropriate during high levels of 732 congestion)). However, other fall-back strategies will need to follow 733 all the rules in Section 3.1.5, which concern behaviour when SYNs or 734 SYN/ACKs negotiating different types of feedback have been sent 735 within the same connection. 737 Further it may make sense to also remove any other new or 738 experimental fields or options on the SYN in case a middlebox might 739 be blocking them, although the required behaviour will depend on the 740 specification of the other option(s) and any attempt to co-ordinate 741 fall-back between different modules of the stack. 743 Whichever fall-back strategy is used, the TCP initiator SHOULD cache 744 failed connection attempts. If it does, it SHOULD NOT give up 745 attempting to negotiate AccECN on the SYN of subsequent connection 746 attempts until it is clear that the blockage is persistently and 747 specifically due to AccECN. The cache should be arranged to expire 748 so that the initiator will infrequently attempt to check whether the 749 problem has been resolved. 751 The fall-back procedure if the TCP server receives no ACK to 752 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 753 Section 3.2.3.2. 755 3.1.5. Implications of AccECN Mode 757 Section 3.1.1 describes the only ways that a host can enter AccECN 758 mode, whether as a client or as a server.
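For reference, the default SYN retransmission behaviour of Section 3.1.4, which determines which SYNs a client may have sent before a mode is entered, could be illustrated with the minimal Python sketch below. The function name and the zero-based attempt numbering are assumptions of this sketch; it models only the SHOULD-level default and none of the alternative strategies or caching.

```python
def syn_flags_for_attempt(attempt):
    """Return (AE, CWR, ECE) for a SYN under the default fall-back strategy.

    attempt 0 is the original SYN, attempt 1 the first retransmission,
    and so on (the numbering is an assumption of this sketch).  The ISN
    is the same on every attempt and is not modelled here.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt <= 1:
        # Original SYN and first retransmission still request AccECN.
        return (1, 1, 1)
    # Subsequent retransmissions clear all three TCP-ECN flags.
    return (0, 0, 0)


if __name__ == "__main__":
    for n in range(4):
        print(n, syn_flags_for_attempt(n))
```

Once the flags have been cleared on a retransmitted SYN, the rules on switching feedback negotiation in this section govern what the client may send afterwards.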
760 As a Data Sender, a host in AccECN mode has the rights and 761 obligations concerning the use of ECN defined below, which build on 762 those in [RFC3168] as updated by [RFC8311]: 764 o Using ECT: 766 * It can set an ECT codepoint in the IP header of packets to 767 indicate to the network that the transport is capable and 768 willing to participate in ECN for this packet. 770 * It does not have to set ECT on any packet (for instance if it 771 has reason to believe such a packet would be blocked). 773 * If for any reason it is not willing to provide ECN feedback on 774 a particular TCP connection, to indicate this unwillingness it 775 SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ 776 ACK packets that it sends. 778 o Switching feedback negotiation (e.g. fall-back): 780 * It SHOULD NOT set ECT on any packet if it has received at least 781 one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0. A 782 "valid SYN" has the same port numbers and the same ISN as the 783 SYN that caused the server to enter AccECN mode. 785 * It MUST NOT send an ECN-setup SYN [RFC3168] within the same 786 connection as it has sent a SYN requesting AccECN feedback. 788 * It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same 789 connection as it has sent a SYN/ACK agreeing to use AccECN 790 feedback. 792 The above rules are necessary because, when one peer negotiates 793 the feedback mode in two different types of handshake, it is not 794 possible for the other peer to know for certain which handshake 795 packet(s) the other end eventually receives or in which order it 796 receives them. So the two peers can end up using different 797 feedback modes without knowing it. 799 o Congestion response: 801 * It is still obliged to respond appropriately to AccECN feedback 802 with congestion indications on packets it had previously sent, 803 as defined in Section 6.1 of [RFC3168] and updated by Sections 804 2.1 and 4.1 of [RFC8311].
806 * The commitment to respond appropriately to incoming indications 807 of congestion remains even if it sends a SYN packet with 808 AE=CWR=ECE=0, in a later transmission within the same TCP 809 connection. 811 * Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate 812 it has received and responded to indications of congestion (for 813 the avoidance of doubt, this does not preclude it from setting 814 the bits of the ACE counter field, which includes an overloaded 815 use of the same bit). 817 As a Data Receiver: 819 o a host in AccECN mode MUST feed back the information in the IP-ECN 820 field on incoming packets using Accurate ECN feedback, as 821 specified in Section 3.2 below. 823 o if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168] 824 during the same connection as it receives a SYN requesting AccECN 825 feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST 826 reset the connection with a RST packet. 828 o it MUST NOT use reception of packets with ECT set in the IP-ECN 829 field as an implicit signal that the peer is ECN-capable. Reason: 830 ECT at the IP layer does not explicitly confirm the peer has the 831 correct ECN feedback logic, and the packets could have been 832 mangled at the IP layer. 834 3.2. AccECN Feedback 836 Each Data Receiver of each half connection maintains four counters, 837 r.cep, r.ceb, r.e0b and r.e1b: 839 o The Data Receiver MUST increment the CE packet counter (r.cep), 840 for every Acceptable packet that it receives with the CE code 841 point in the IP ECN field, including CE marked control packets but 842 excluding CE on SYN packets (SYN=1; ACK=0). 844 o The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte 845 counters by the number of TCP payload octets in Acceptable packets 846 marked respectively with the CE, ECT(0) and ECT(1) codepoint in 847 their IP-ECN field, including any payload octets on control 848 packets, but not including any payload octets on SYN packets 849 (SYN=1; ACK=0). 
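The two counting rules above can be illustrated with a minimal Python sketch. The class and method names are assumptions of this sketch, the default initial values are those that Section 3.2.1 gives for field order 0, and the caller is assumed to have already checked that the packet is Acceptable. The special handling of CE-marked SYN/ACKs during the handshake (Section 3.2.2.2) is not modelled.

```python
# IP-ECN codepoints as 2-bit values (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


class ReceiverCounters:
    """Hypothetical holder for r.cep, r.ceb, r.e0b and r.e1b."""

    def __init__(self, cep=5, ceb=0, e0b=1, e1b=0):
        # Defaults: initial values for field order 0 (Section 3.2.1).
        self.cep = cep  # CE packet counter
        self.ceb = ceb  # CE byte counter
        self.e0b = e0b  # ECT(0) byte counter
        self.e1b = e1b  # ECT(1) byte counter

    def on_packet(self, ip_ecn, payload_len, syn=False, ack=True):
        """Update the counters for one Acceptable packet."""
        if syn and not ack:
            # SYN packets (SYN=1; ACK=0) are excluded from all counters.
            return
        if ip_ecn == CE:
            self.cep += 1  # counts control packets as well as data
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any counter.


if __name__ == "__main__":
    r = ReceiverCounters()
    r.on_packet(CE, 1000)
    r.on_packet(ECT0, 500)
    print(r.cep, r.ceb, r.e0b, r.e1b)
```

Keeping the counters as unbounded integers and truncating only when encoding feedback is a design choice of the sketch, not a requirement of the text.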
851 Each Data Sender of each half connection maintains four counters, 852 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 853 counters at the Data Receiver. 855 A Data Receiver feeds back the CE packet counter using the Accurate 856 ECN (ACE) field, as explained in Section 3.2.2. And it feeds back 857 all the byte counters using the AccECN TCP Option, as specified in 858 Section 3.2.3. 860 Whenever a host feeds back the value of any counter, it MUST report 861 the most recent value, no matter whether it is in a pure ACK, an ACK 862 with new payload data or a retransmission. Therefore the feedback 863 carried on a retransmitted packet is unlikely to be the same as the 864 feedback on the original packet. 866 3.2.1. Initialization of Feedback Counters 868 When a host first enters AccECN mode, in its role as a Data Receiver 869 it initializes its counters to r.cep = 5 and r.ceb = 0. The initial 870 values of the other two byte counters depend on the Data Receiver's 871 choice of the order of fields it will use in the AccECN TCP Option 872 (see Section 3.2.3). If field order 0, it will initialize the 873 remaining counters to r.e0b = 1; r.e1b = 0. If field order 1, it 874 will initialize them to r.e0b = 0 and r.e1b = 0x800001. 876 Non-zero initial values are used to support a stateless handshake 877 (see Section 4.1) and to be distinct from cases where the fields are 878 incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4). 880 When a host enters AccECN mode, in its role as a Data Sender it 881 initializes its counters to s.cep = 5 and s.ceb = 0. The initial 882 values of the other two byte counters depend on the peer's choice of 883 the order of fields it will use in the AccECN TCP Option (see 884 Section 3.2.3). If field order 0, it will initialize the remaining 885 counters to s.e0b = 1; s.e1b = 0. If field order 1, it will 886 initialize them to s.e0b = 0 and s.e1b = 0x800001. 888 3.2.2.
The ACE Field 890 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 891 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 892 as one 3-bit field. Then the field is given a new name, ACE, as 893 shown in Figure 3. 895 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 896 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 897 | | | | U | A | P | R | S | F | 898 | Header Length | Reserved | ACE | R | C | S | S | Y | I | 899 | | | | G | K | H | T | N | N | 900 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 902 Figure 3: Definition of the ACE field within bytes 13 and 14 of the 903 TCP Header (when AccECN has been negotiated and SYN=0). 905 The original definition of these three flags in the TCP header, 906 including the addition of support for the ECN Nonce, is shown for 907 comparison in Figure 1. This specification does not rename these 908 three TCP flags to ACE unconditionally; it merely overloads them with 909 another name and definition once an AccECN connection has been 910 established. 912 With one exception (Section 3.2.2.1), a host with both of its half- 913 connections in AccECN mode MUST interpret the AE, CWR and ECE flags 914 as the 3-bit ACE counter on a segment with the SYN flag cleared 915 (SYN=0). On such a packet, a Data Receiver MUST encode the three 916 least significant bits of its r.cep counter into the ACE field that 917 it feeds back to the Data Sender. A host MUST NOT interpret the 3 918 flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is 919 0 or 1), or if AccECN negotiation is incomplete or has not succeeded. 921 Both parts of each of these conditions are equally important. For 922 instance, even if AccECN negotiation has been successful, the ACE 923 field is not defined on any segments with SYN=1 (e.g. 
a 924 retransmission of an unacknowledged SYN/ACK, or when both ends send 925 SYN/ACKs after AccECN support has been successfully negotiated during 926 a simultaneous open). 928 3.2.2.1. ACE Field on the ACK of the SYN/ACK 930 A TCP client (A) in AccECN mode MUST feed back which of the 4 931 possible values of the IP-ECN field was on the SYN/ACK by writing it 932 into the ACE field of a pure ACK with no SACK blocks using the binary 933 encoding in Table 3 (which is the same as that used on the SYN/ACK in 934 Table 2). This shall be called the handshake encoding of the ACE 935 field, and it is the only exception to the rule that the ACE field 936 carries the 3 least significant bits of the r.cep counter on packets 937 with SYN=0. 939 Normally, a TCP client acknowledges a SYN/ACK with an ACK that 940 satisfies the above conditions anyway (SYN=0, no data, no SACK 941 blocks). If an AccECN TCP client intends to acknowledge the SYN/ACK 942 with a packet that does not satisfy these conditions (e.g. it has 943 data to include on the ACK), it SHOULD first send a pure ACK that 944 does satisfy these conditions (see Section 4.2), so that it can feed 945 back which of the four values of the IP-ECN field arrived on the SYN/ 946 ACK. A valid exception to this "SHOULD" would be where the 947 implementation will only be used in an environment where mangling of 948 the ECN field is unlikely. 
950 +---------------------+---------------------+-----------------------+ 951 | IP-ECN codepoint on | ACE on pure ACK of | r.cep of client in | 952 | SYN/ACK | SYN/ACK | AccECN mode | 953 +---------------------+---------------------+-----------------------+ 954 | Not-ECT | 0b010 | 5 | 955 | ECT(1) | 0b011 | 5 | 956 | ECT(0) | 0b100 | 5 | 957 | CE | 0b110 | 6 | 958 +---------------------+---------------------+-----------------------+ 960 Table 3: The encoding of the ACE field in the ACK of the SYN-ACK to 961 reflect the SYN-ACK's IP-ECN field 963 When an AccECN server in SYN-RCVD state receives a pure ACK with 964 SYN=0 and no SACK blocks, instead of treating the ACE field as a 965 counter, it MUST infer the meaning of each possible value of the ACE 966 field from Table 4, which also shows the value that an AccECN server 967 MUST set s.cep to as a result. 969 Given this encoding of the ACE field on the ACK of a SYN/ACK is 970 exceptional, an AccECN server using large receive offload (LRO) might 971 prefer to disable LRO until such an ACK has transitioned it out of 972 SYN-RCVD state. 974 +---------------+-----------------------------+---------------------+ 975 | ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in | 976 | SYN/ACK | inferred by server | AccECN mode | 977 +---------------+-----------------------------+---------------------+ 978 | 0b000 | {Notes 1, 3} | Disable ECN | 979 | 0b001 | {Notes 2, 3} | 5 | 980 | 0b010 | Not-ECT | 5 | 981 | 0b011 | ECT(1) | 5 | 982 | 0b100 | ECT(0) | 5 | 983 | 0b101 | Currently Unused {Note 2} | 5 | 984 | 0b110 | CE | 6 | 985 | 0b111 | Currently Unused {Note 2} | 5 | 986 +---------------+-----------------------------+---------------------+ 988 Table 4: Meaning of the ACE field on the ACK of the SYN/ACK 990 {Note 1}: If the server is in AccECN mode, the value of zero raises 991 suspicion of zeroing of the ACE field on the path (see 992 Section 3.2.2.3). 
994 {Note 2}: If the server is in AccECN mode, these values are Currently 995 Unused but the AccECN server's behaviour is still defined for forward 996 compatibility. Then the designer of a future protocol can know for 997 certain what AccECN servers will do with these codepoints. 999 {Note 3}: In the case where a server that implements AccECN is also 1000 using a stateless handshake (termed a SYN cookie) it will not 1001 remember whether it entered AccECN mode. The values 0b000 or 0b001 1002 will remind it that it did not enter AccECN mode, because AccECN does 1003 not use them (see Section 4.1 for details). If a stateless server 1004 that implements AccECN receives either of these two values in the 1005 ACK, its action is implementation-dependent and outside the scope of 1006 this spec. It will certainly not take the action in the third column 1007 because, after it receives either of these values, it is not in 1008 AccECN mode. I.e., it will not disable ECN (at least not just 1009 because ACE is 0b000) and it will not set s.cep. 1011 3.2.2.2. Encoding and Decoding Feedback in the ACE Field 1013 Whenever the Data Receiver sends an ACK with SYN=0 (with or without 1014 data), unless the handshake encoding in Section 3.2.2.1 applies, the 1015 Data Receiver MUST encode the least significant 3 bits of its r.cep 1016 counter into the ACE field (see Appendix A.2). 1018 Whenever the Data Sender receives an ACK with SYN=0 (with or without 1019 data), it first checks whether it has already been superseded by 1020 another ACK in which case it ignores the ECN feedback. If the ACK 1021 has not been superseded, and if the special handshake encoding in 1022 Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field 1023 as follows (see Appendix A.2 for examples).
1025 o It takes the least significant 3 bits of its local s.cep counter 1026 and subtracts them from the incoming ACE counter to work out the 1027 minimum positive increment it could apply to s.cep (assuming the 1028 ACE field only wrapped at most once). 1030 o It then follows the safety procedures in Section 3.2.2.5.2 to 1031 calculate or estimate how many packets the ACK could have 1032 acknowledged under the prevailing conditions to determine whether 1033 the ACE field might have wrapped more than once. 1035 The encode/decode procedures during the three-way handshake are 1036 exceptions to the general rules given so far, so they are spelled out 1037 step by step below for clarity: 1039 o If a TCP server in AccECN mode receives a CE mark in the IP-ECN 1040 field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it 1041 remains at its initial value of 5). 1043 Reason: It would be redundant for the server to include CE-marked 1044 SYNs in its r.cep counter, because it already reliably delivers 1045 feedback of any CE marking on the SYN/ACK using the encoding in 1046 Table 2. This also ensures that, when the server starts using the 1047 ACE field, it has not unnecessarily consumed more than one initial 1048 value, given they can be used to negotiate variants of the AccECN 1049 protocol (see Appendix B.3). 1051 o If a TCP client in AccECN mode receives CE feedback in the TCP 1052 flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its 1053 initial value of 5), so that it stays in step with r.cep on the 1054 server. Nonetheless, the TCP client still triggers the congestion 1055 control actions necessary to respond to the CE feedback. 1057 o If a TCP client in AccECN mode receives a CE mark in the IP-ECN 1058 field of a SYN/ACK, it MUST increment r.cep, but no more than once 1059 no matter how many CE-marked SYN/ACKs it receives (i.e. 1060 incremented from 5 to 6, but no further). 
1062 Reason: Incrementing r.cep ensures the client will eventually 1063 deliver any CE marking to the server reliably when it starts using 1064 the ACE field. Even though the client also feeds back any CE 1065 marking on the ACK of the SYN/ACK using the encoding in Table 3, 1066 this ACK is not delivered reliably, so it can be considered as a 1067 timely notification that is redundant but unreliable. The client 1068 does not increment r.cep more than once, because the server can 1069 only increment s.cep once (see next bullet). Also, this limits 1070 the unnecessarily consumed initial values of the ACE field to two. 1072 o If a TCP server in AccECN mode and in SYN-RCVD state receives CE 1073 feedback in the TCP flags of a pure ACK with no SACK blocks, it 1074 MUST increment s.cep (from 5 to 6). The TCP server then triggers 1075 the congestion control actions necessary to respond to the CE 1076 feedback. 1078 Reasoning: The TCP server can only increment s.cep once, because 1079 the first ACK it receives will cause it to transition out of SYN- 1080 RCVD state. The server's congestion response would be no 1081 different even if it could receive feedback of more than one CE- 1082 marked SYN/ACK. 1084 Once the TCP server transitions to ESTABLISHED state, it might 1085 later receive other pure ACK(s) with the handshake encoding in the 1086 ACE field. The conditions for this to occur are quite unusual, 1087 but not impossible, e.g. a SYN/ACK (or ACK of the SYN/ACK) that is 1088 delayed for longer than the server's retransmission timeout; or 1089 packet duplication by the network. Nonetheless, once in the 1090 ESTABLISHED state, the server will consider the ACE field to be 1091 encoded as the normal ACE counter on all packets with SYN=0 (given 1092 it will be following the above rule in this bullet). The server 1093 MAY include a test to avoid this case. 1095 3.2.2.3. 
Testing for Zeroing of the ACE Field 1097 Section 3.2.1 required the Data Receiver to initialize the r.cep 1098 counter to a non-zero value. Therefore, in either direction the 1099 initial value of the ACE counter ought to be non-zero. 1101 If AccECN has been successfully negotiated, the Data Sender SHOULD 1102 check the value of the ACE counter in the first packet (with or 1103 without data) that arrives with SYN=0. If the value of this ACE 1104 field is zero (0b000), the Data Sender disables sending ECN-capable 1105 packets for the remainder of the half-connection by setting the IP/ 1106 ECN field in all subsequent packets to Not-ECT. 1108 Usually, the server checks the ACK of the SYN/ACK from the client, 1109 while the client checks the first data segment from the server. 1110 However, if reordering occurs, "the first packet ... that arrives" 1111 will not necessarily be the same as the first packet in sequence 1112 order. The test has been specified loosely like this to simplify 1113 implementation, and because it would not have been any more precise 1114 to have specified the first packet in sequence order, which would not 1115 necessarily be the first ACE counter that the Data Receiver fed back 1116 anyway, given it might have been a retransmission. 1118 The possibility of re-ordering means that there is a small chance 1119 that the ACE field on the first packet to arrive is genuinely zero 1120 (without middlebox interference). This would cause a host to 1121 unnecessarily disable ECN for a half connection. Therefore, in 1122 environments where there is no evidence of the ACE field being 1123 zeroed, implementations can skip this test. 1125 Note that the Data Sender MUST NOT test whether the arriving counter 1126 in the initial ACE field has been initialized to a specific valid 1127 value - the above check solely tests whether the ACE fields have been 1128 incorrectly zeroed.
This allows hosts to use different initial 1129 values as an additional signalling channel in future. 1131 3.2.2.4. Testing for Mangling of the IP/ECN Field 1133 The value of the ACE field on the SYN/ACK indicates the value of the 1134 IP/ECN field when the SYN arrived at the server. The client can 1135 compare this with how it originally set the IP/ECN field on the SYN. 1136 If this comparison implies an unsafe transition (see below) of the 1137 IP/ECN field, for the remainder of the connection the client MUST NOT 1138 send ECN-capable packets, but it MUST continue to feed back any ECN 1139 markings on arriving packets. 1141 The value of the ACE field on the last ACK of the 3WHS indicates the 1142 value of the IP/ECN field when the SYN/ACK arrived at the client. 1143 The server can compare this with how it originally set the IP/ECN 1144 field on the SYN/ACK. If this comparison implies an unsafe 1145 transition of the IP/ECN field, for the remainder of the connection 1146 the server MUST NOT send ECN-capable packets, but it MUST continue to 1147 feed back any ECN markings on arriving packets. 1149 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 1150 count of CE marks is still eventually delivered reliably). If this 1151 ACK does not arrive, the server can continue to send ECN-capable 1152 packets without having tested for mangling of the IP/ECN field on the 1153 SYN/ACK. Experiments with AccECN deployment will assess whether this 1154 limitation has any effect in practice. 1156 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 1157 repeated here for convenience: 1159 o the not-ECT codepoint changes; 1161 o either ECT codepoint transitions to not-ECT; 1163 o the CE codepoint changes. 1165 RFC 3168 says that a router that changes ECT to not-ECT is invalid 1166 but safe.
However, from a host's viewpoint, this transition is 1167 unsafe because it could be the result of two transitions at different 1168 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 1169 This scenario could well happen where an ECN-enabled home router 1170 congests its upstream mobile broadband bottleneck link, then the 1171 ingress to the mobile network clears the ECN field [Mandalari18]. 1173 The above fall-back behaviours are necessary in case mangling of the 1174 IP/ECN field is asymmetric, which is currently common over some 1175 mobile networks [Mandalari18]. Then one end might see no unsafe 1176 transition and continue sending ECN-capable packets, while the other 1177 end sees an unsafe transition and stops sending ECN-capable packets. 1179 3.2.2.5. Safety against Ambiguity of the ACE Field 1181 If too many CE-marked segments are acknowledged at once, or if a long 1182 run of ACKs is lost or thinned out, the 3-bit counter in the ACE 1183 field might have cycled between two ACKs arriving at the Data Sender. 1184 The following safety procedures minimize this ambiguity. 1186 3.2.2.5.1. Data Receiver Safety Procedures 1188 An AccECN Data Receiver: 1190 o SHOULD immediately send an ACK whenever a data packet marked CE 1191 arrives after the previous data packet was not CE. 1193 o MUST immediately send an ACK once 'n' CE marks have arrived since 1194 the previous ACK, where 'n' SHOULD be 2 and MUST be no greater 1195 than 6. 1197 These rules for when to send an ACK are designed to be complemented 1198 by those in Section 3.2.3.3, which concern whether the AccECN TCP 1199 Option ought to be included on ACKs. 1201 For the avoidance of doubt, the change-triggered ACK mechanism is 1202 deliberately worded to solely apply to data packets, and to ignore 1203 the arrival of a control packet with no payload, because it is 1204 important that TCP does not acknowledge pure ACKs. 
The change- 1205 triggered ACK approach can lead to some additional ACKs but it feeds 1206 back the timing and the order in which ECN marks are received with 1207 minimal additional complexity. If CE marks are infrequent, or 1208 there are multiple marks in a row, the additional load will be low. 1209 Other marking patterns could increase the load significantly. 1210 Investigating the additional load is a goal of the proposed 1211 experiment. 1213 Even though the first bullet is stated as a "SHOULD", it is important 1214 for a transition to immediately trigger an ACK if at all possible, so 1215 that the Data Sender can rely on change-triggered ACKs to detect 1216 queue growth as soon as possible, e.g. at the start of a flow. This 1217 requirement can only be relaxed if certain offload hardware needed 1218 for high performance cannot support change-triggered ACKs (although 1219 high performance protocols such as DCTCP already successfully use 1220 change-triggered ACKs). One possible experimental compromise would 1221 be for the receiver to heuristically detect whether the sender is in 1222 slow-start, then to implement change-triggered ACKs while the sender 1223 is in slow-start, and offload otherwise. 1225 3.2.2.5.2. Data Sender Safety Procedures 1227 If the Data Sender has not received AccECN TCP Options to give it 1228 more dependable information, and it detects that the ACE field could 1229 have cycled, it SHOULD deem whether it cycled by taking the safest 1230 likely case under the prevailing conditions. It can detect if the 1231 counter could have cycled by using the jump in the acknowledgement 1232 number since the last ACK to calculate or estimate how many segments 1233 could have been acknowledged. An example algorithm to implement this 1234 policy is given in Appendix A.2. An implementer MAY develop an 1235 alternative algorithm as long as it satisfies these requirements.
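One possible reading of this policy is sketched below in Python. It is not a reproduction of the algorithm in Appendix A.2. The function names are assumptions of this sketch, as is the interpretation of "safest likely case" as the largest number of CE marks that is consistent with both the ACE field and the estimated number of newly acknowledged packets. How newly_acked_pkts is estimated from the jump in the acknowledgement number is left to the caller.

```python
ACE_MOD = 8  # the ACE field is a 3-bit counter


def min_increment(s_cep, ace):
    """Smallest non-negative increment of s.cep consistent with the ACE field.

    This assumes the ACE field has not wrapped since the last ACK.
    """
    return (ace - (s_cep & 0b111)) % ACE_MOD


def safest_increment(s_cep, ace, newly_acked_pkts):
    """Conservative increment of s.cep (hypothetical policy).

    If enough packets were acknowledged for the 3-bit counter to have
    cycled, assume the largest increment that matches the ACE field
    and does not exceed newly_acked_pkts.
    """
    d = min_increment(s_cep, ace)
    if newly_acked_pkts < d + ACE_MOD:
        # Too few packets acknowledged for a wrap; use the ACE field as is.
        return d
    return newly_acked_pkts - ((newly_acked_pkts - d) % ACE_MOD)


if __name__ == "__main__":
    s_cep = 5
    s_cep += safest_increment(s_cep, ace=0b111, newly_acked_pkts=3)
    print(s_cep)
```

With s.cep = 5 and ACE = 0b111 the minimum increment is 2. If the ACK could have covered 12 packets, the sketch assumes one wrap and returns 10.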
1237 If missing acknowledgement numbers arrive later (reordering) and 1238 prove that the counter did not cycle, the Data Sender MAY attempt to 1239 neutralize the effect of any action it took based on a conservative 1240 assumption that it later found to be incorrect. 1242 The Data Sender can estimate how many packets (of any marking) an ACK 1243 acknowledges. If the ACE counter on an ACK seems to imply that the 1244 minimum number of newly CE-marked packets is greater than the number 1245 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1246 counter, unless it can be sure that it is counting all control 1247 packets correctly. 1249 3.2.3. The AccECN Option 1251 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1252 of each field name stands for 'Echo'.

1254     0                   1                   2                   3
1255     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1256    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1257    |  Kind = TBD1  |  Length = 11  |          EE0B field           |
1258    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1259    | EE0B (cont'd) |                  ECEB field                   |
1260    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1261    |                  EE1B field                   |   Order 0
1262    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1264     0                   1                   2                   3
1265     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1266    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1267    |  Kind = TBD1  |  Length = 11  |          EE1B field           |
1268    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1269    | EE1B (cont'd) |                  ECEB field                   |
1270    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1271    |                  EE0B field                   |   Order 1
1272    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1274                    Figure 4: The AccECN TCP Option

1276 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1277 field to TBD1, which is registered in Section 6 as a new TCP option 1278 Kind called AccECN.
An experimental TCP option with Kind=254 MAY be 1279 used for initial experiments, with magic number 0xACCE. 1281 Figure 4 shows two option field orders: order 0 and order 1. They 1282 both consist of three 24-bit fields. Order 0 provides the 24 least 1283 significant bits of the r.e0b, r.ceb and r.e1b counters, 1284 respectively. Order 1 provides the same fields, but in the opposite 1285 order. Each half-connection can use a different field order, but a 1286 Data Receiver MUST consistently send the same field order within the 1287 same half-connection. 1289 The field order to use for each half-connection is up to the Data 1290 Receiver implementation. It might use the same hard-coded order for 1291 all half-connections, or it might make a different choice for each 1292 half-connection. For instance, the implementation of a Data Receiver 1293 might default to using order 0, unless the ECN field in the IP header 1294 of the packet it received during the 3WHS is ECT(1). A Data Receiver 1295 just starts using its chosen field order and the field immediately 1296 after the length field in the first AccECN TCP Option of a half- 1297 connection will intrinsically indicate which order it is using, 1298 because the initial counter values that it is required to use depend 1299 on its chosen field order (see Section 3.2.1). 1301 A Data Sender can know which field order the Data Receiver is using 1302 for a half-connection from the most significant bit (MSB) of the 1303 counter in the field immediately after the length field in the first 1304 non-empty AccECN TCP Option to arrive. If this MSB = 0, field order 1305 0 is being used, and if MSB = 1, field order 1 is being used. Note 1306 that the Data Sender only tests the most significant bit, not the 1307 value of the whole field, because the counters in the first packet to 1308 arrive might have started to increment (e.g. if the first packet to 1309 arrive is not the first packet sent due to loss or reordering).
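The MSB test above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the option is assumed to be passed as raw octets in the registered layout (Kind, Length, then 24-bit fields in network byte order), and the experimental Kind=254 variant with its 16-bit magic number is not handled.

```python
from typing import Optional


def detect_field_order(option: bytes) -> Optional[int]:
    """Return 0 or 1 for the field order, or None for an empty option.

    option[0] is Kind, option[1] is Length, and option[2:5] is the
    24-bit field immediately after the Length field.
    """
    if len(option) < 2:
        raise ValueError("option too short to hold Kind and Length")
    length = option[1]
    if length < 5 or len(option) < 5:
        return None  # no counter field present, so the order is unknown
    # Only the most significant bit of the first field is tested, because
    # the counter may already have incremented past its initial value.
    return option[2] >> 7
```

Because a Data Receiver has to keep the same order for the whole half-connection, the result would only need to be computed once, on the first non-empty option to arrive.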
1311 Note that there is no field to feed back Not-ECT bytes. Nonetheless, 1312 an algorithm for the Data Sender to calculate the number of payload 1313 bytes received as Not-ECT is given in Appendix A.5. 1315 Whenever a Data Receiver sends an AccECN Option, the rules in 1316 Section 3.2.3.3 expect it to usually send a full-length option. To 1317 cope with option space limitations, it can omit unchanged fields from 1318 the tail of the option, as long as it preserves the order of the 1319 remaining fields and includes any field that has changed. The length 1320 field MUST indicate which fields are present as follows:

1322    +--------+------------------+------------------+
1323    | Length | Order 0          | Order 1          |
1324    +--------+------------------+------------------+
1325    |   11   | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B |
1326    |    8   | EE0B, ECEB       | EE1B, ECEB       |
1327    |    5   | EE0B             | EE1B             |
1328    |    2   | (empty)          | (empty)          |
1329    +--------+------------------+------------------+

1331 The empty option of Length=2 is provided to allow for a case where an 1332 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1333 but there is very limited space for the option. For initial 1334 experiments, the Length field MUST be 2 greater to accommodate the 1335 16-bit magic number. 1337 All implementations of a Data Sender that read any AccECN Option MUST 1338 be able to read in AccECN Options of any of the above lengths. For 1339 forward compatibility, if the AccECN Option is of any other length, 1340 implementations MUST use those whole 3-octet fields that fit within 1341 the length and ignore the remainder of the option. 1343 The AccECN Option has to be optional to implement, because both 1344 sender and receiver have to be able to cope without the option anyway 1345 - in cases where it does not traverse a network path. It is 1346 RECOMMENDED to implement both sending and receiving of the AccECN 1347 Option.
If sending of the AccECN Option is implemented, the fall- 1348 backs described in this document will need to be implemented as well 1349 (unless solely for a controlled environment where path traversal is 1350 not considered a problem). Even if a developer does not implement 1351 sending of the AccECN Option, it is RECOMMENDED that they still 1352 implement logic to receive and understand any AccECN Options sent by 1353 remote peers. 1355 If a Data Receiver intends to send the AccECN Option at any time 1356 during the rest of the connection, it is strongly recommended to also 1357 test path traversal of the AccECN Option as specified in 1358 Section 3.2.3.2. 1360 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1362 Whenever the Data Receiver includes any of the counter fields (ECEB, 1363 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1364 significant bits of the current value of the associated counter into 1365 the field (respectively r.ceb, r.e0b, r.e1b). 1367 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1368 first checks whether the ACK has already been superseded by another 1369 ACK, in which case it ignores the ECN feedback. If the ACK has not 1370 been superseded, the Data Sender MUST decode the fields in the AccECN 1371 Option as follows. For each field, it takes the least significant 24 1372 bits of its associated local counter (s.ceb, s.e0b or s.e1b) and 1373 subtracts them from the counter in the associated field of the 1374 incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work out 1375 the minimum positive increment it could apply to s.ceb, s.e0b or 1376 s.e1b (assuming the field in the option only wrapped at most once). 1378 Appendix A.1 gives an example algorithm for the Data Receiver to 1379 encode its byte counters into the AccECN Option, and for the Data 1380 Sender to decode the AccECN Option fields into its byte counters.
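The encoding and decoding rules above amount to modulo-2^24 arithmetic. The following sketch illustrates them under stated assumptions and is not the Appendix A.1 algorithm: the function names are hypothetical, fields are assumed to be in network byte order, and the check for superseded ACKs is assumed to have been done by the caller.

```python
FIELD_MOD = 1 << 24  # each option field carries the 24 least significant bits


def encode_field(r_counter: int) -> bytes:
    """Data Receiver: put the 24 LSBs of a byte counter on the wire."""
    return (r_counter % FIELD_MOD).to_bytes(3, "big")


def decode_field(field: bytes, s_counter: int) -> int:
    """Data Sender: return the updated local counter.

    The increment is the smallest non-negative value that makes the
    24 LSBs of the local counter equal the received field, which
    assumes the field wrapped at most once since the last update.
    """
    wire = int.from_bytes(field, "big")
    increment = (wire - s_counter) % FIELD_MOD
    return s_counter + increment
```

For example, if the receiver's counter has passed 2^24 while the sender's copy is still just below it, the subtraction wraps and the sender recovers the receiver's full value.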
1382 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1383 ACK=0) is not included in any of the locally held octet counters nor 1384 in the AccECN Option on the wire. 1386 3.2.3.2. Path Traversal of the AccECN Option 1388 3.2.3.2.1. Testing the AccECN Option during the Handshake 1390 The TCP client MUST NOT include the AccECN TCP Option on the SYN. (A 1391 fall-back strategy for the loss of the SYN (possibly due to middlebox 1392 interference) is specified in Section 3.1.4.) 1394 A TCP server that confirms its support for AccECN (in response to an 1395 AccECN SYN from the client as described in Section 3.1) SHOULD 1396 include an AccECN TCP Option on the SYN/ACK. 1398 A TCP client that has successfully negotiated AccECN SHOULD include 1399 an AccECN Option in the first ACK at the end of the 3WHS. However, 1400 this first ACK is not delivered reliably, so the TCP client SHOULD 1401 also include an AccECN Option on the first data segment it sends (if 1402 it ever sends one). 1404 A host MAY omit the AccECN Option in any of these three cases 1405 if it has cached knowledge that the packet would be likely to be 1406 blocked on the path to the other host if it included an AccECN 1407 Option. 1409 3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option 1411 If after the normal TCP timeout the TCP server has not received an 1412 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1413 lost, e.g. due to congestion, or a middlebox might be blocking the 1414 AccECN Option. To expedite connection setup, the TCP server SHOULD 1415 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1416 as on the original SYN/ACK but with no AccECN Option. If this 1417 retransmission times out, to expedite connection setup, the TCP 1418 server SHOULD disable AccECN and ECN for this connection by 1419 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option.
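The two-step SYN/ACK fall-back described above can be sketched as a small function of the number of timeouts seen so far. This is an illustration only: the `SynAck` type and `synack_for_attempt` name are hypothetical, and the flag values used in the usage note are arbitrary placeholders rather than values taken from the negotiation tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynAck:
    ae: int
    cwr: int
    ece: int
    with_accecn_option: bool


def synack_for_attempt(original: SynAck, timeouts: int) -> SynAck:
    """Return the SYN/ACK to send after the given number of timeouts."""
    if timeouts == 0:
        return original
    if timeouts == 1:
        # Same AE, CWR and ECE flags, but drop the AccECN Option in case
        # a middlebox is blocking it.
        return SynAck(original.ae, original.cwr, original.ece, False)
    # Second timeout onwards: disable AccECN and ECN for this connection.
    return SynAck(0, 0, 0, False)
```

For instance, with a hypothetical first SYN/ACK of `SynAck(0, 1, 0, True)`, the first retransmission keeps the flags and removes the option, and the second clears all three flags. Other fall-back strategies remain possible, subject to the constraints the surrounding text places on them.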
1421 Implementers MAY use other fall-back strategies if they are found to 1422 be more effective (e.g. retrying the AccECN Option for a second time 1423 before fall-back - most appropriate during high levels of 1424 congestion). However, other fall-back strategies will need to follow 1425 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1426 SYN/ACKs negotiating different types of feedback have been sent 1427 within the same connection. 1429 If the TCP client detects that the first data segment it sent with 1430 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1431 on the retransmission. Again, implementers MAY use other fall-back 1432 strategies such as attempting to retransmit a second segment with the 1433 AccECN Option before fall-back, and/or caching whether the AccECN 1434 Option is blocked for subsequent connections. 1435 [I-D.ietf-tcpm-2140bis] further discusses caching of TCP parameters 1436 and status information. 1438 If a host falls back to not sending the AccECN Option, it will 1439 continue to process any incoming AccECN Options as normal. 1441 Either host MAY include the AccECN Option in a subsequent segment to 1442 retest whether the AccECN Option can traverse the path. 1444 If the TCP server receives a second SYN with a request for AccECN 1445 support, it should resend the SYN/ACK, again confirming its support 1446 for AccECN, but this time without the AccECN Option. This approach 1447 rules out any interference by middleboxes that may drop packets with 1448 unknown options, even though it is more likely that the SYN/ACK would 1449 have been lost due to congestion. The TCP server MAY try to send 1450 another packet with the AccECN Option at a later point during the 1451 connection but should monitor if that packet got lost as well, in 1452 which case it SHOULD disable the sending of the AccECN Option for 1453 this half-connection. 
1455 Similarly, an AccECN end-point MAY separately memorize which data 1456 packets carried an AccECN Option and disable the sending of AccECN 1457 Options if the loss probability of those packets is significantly 1458 higher than that of all other data packets in the same connection. 1460 3.2.3.2.3. Testing for Absence of the AccECN Option 1462 If the TCP client has successfully negotiated AccECN but does not 1463 receive an AccECN Option on the SYN/ACK (e.g. because it has been 1464 stripped by a middlebox or not sent by the server), the client 1465 switches into a mode that assumes that the AccECN Option is not 1466 available for this half connection. 1468 Similarly, if the TCP server has successfully negotiated AccECN but 1469 does not receive an AccECN Option on the first segment that 1470 acknowledges sequence space at least covering the ISN, it switches 1471 into a mode that assumes that the AccECN Option is not available for 1472 this half connection. 1474 While a host is in this mode that assumes incoming AccECN Options are 1475 not available, it MUST adopt the conservative interpretation of the 1476 ACE field discussed in Section 3.2.2.5. However, it cannot make any 1477 assumption about support of outgoing AccECN Options on the other half 1478 connection, so it SHOULD continue to send the AccECN Option itself 1479 (unless it has established that sending the AccECN Option is causing 1480 packets to be blocked as in Section 3.2.3.2.2). 1482 If a host is in the mode that assumes incoming AccECN Options are not 1483 available, but it receives an AccECN Option at any later point during 1484 the connection, this clearly indicates that the AccECN Option is not 1485 blocked on the respective path, and the AccECN endpoint MAY switch 1486 out of the mode that assumes the AccECN Option is not available for 1487 this half connection. 1489 3.2.3.2.4.
Test for Zeroing of the AccECN Option 1491 For a related test for invalid initialization of the ACE field, see 1492 Section 3.2.2.3. 1493 Section 3.2 required the Data Receiver to initialize the r.e0b 1494 counter to a non-zero value. Therefore, in either direction the 1495 initial value of the EE0B field in the AccECN Option (if one exists) 1496 ought to be non-zero. If AccECN has been negotiated: 1498 o the TCP server MAY check the initial value of the EE0B field in 1499 the first segment that acknowledges sequence space that at least 1500 covers the ISN plus 1. If the initial value of the EE0B field is 1501 zero, the server will switch into a mode that ignores the AccECN 1502 Option for this half connection. 1504 o the TCP client MAY check the initial value of the EE0B field on 1505 the SYN/ACK. If the initial value of the EE0B field is zero, the 1506 client will switch into a mode that ignores the AccECN Option for 1507 this half connection. 1509 While a host is in the mode that ignores the AccECN Option, it MUST 1510 adopt the conservative interpretation of the ACE field discussed in 1511 Section 3.2.2.5. 1513 Note that the Data Sender MUST NOT test whether the arriving byte 1514 counters in the initial AccECN Option have been initialized to 1515 specific valid values - the above checks solely test whether these 1516 fields have been incorrectly zeroed. This allows hosts to use 1517 different initial values as an additional signalling channel in 1518 future. Also note that the initial value of either field might be 1519 greater than its expected initial value, because the counters might 1520 already have been incremented. Nonetheless, the initial values of 1521 the counters have been chosen so that they cannot wrap to zero on 1522 these initial segments. 1524 3.2.3.2.5. Consistency between AccECN Feedback Fields 1526 When the AccECN Option is available it supplements but does not 1527 replace the ACE field.
An endpoint using AccECN feedback MUST always 1528 consider the information provided in the ACE field whether or not the 1529 AccECN Option is also available. 1531 If the AccECN option is present, the s.cep counter might increase 1532 while the s.ceb counter does not (e.g. due to a CE-marked control 1533 packet). The sender's response to such a situation is out of scope, 1534 and needs to be dealt with in a specification that uses ECN-capable 1535 control packets. Theoretically, this situation could also occur if a 1536 middlebox mangled the AccECN Option but not the ACE field. However, 1537 the Data Sender has to assume that the integrity of the AccECN Option 1538 is sound, based on the above test of the well-known initial values 1539 and optionally other integrity tests (Section 4.3). 1541 If either end-point detects that the s.ceb counter has increased but 1542 the s.cep has not (and by testing ACK coverage it is certain how much 1543 the ACE field has wrapped), this invalid protocol transition has to 1544 be due to some form of feedback mangling. So, the Data Sender MUST 1545 disable sending ECN-capable packets for the remainder of the half- 1546 connection by setting the IP/ECN field in all subsequent packets to 1547 Not-ECT. 1549 3.2.3.3. Usage of the AccECN TCP Option 1551 If the Data Receiver intends to use the AccECN TCP Option to provide 1552 feedback, the following rules determine when a Data Receiver in 1553 AccECN mode sends an ACK with the AccECN TCP Option, and which fields 1554 to include: 1556 Change-Triggered ACKs: If an arriving packet increments a different 1557 byte counter to that incremented by the previous packet, the Data 1558 Receiver SHOULD immediately send an ACK with an AccECN Option, 1559 without waiting for the next delayed ACK (this is in addition to 1560 the safety recommendation in Section 3.2.2.5 against ambiguity of 1561 the ACE field). 
1563 Even though this bullet is stated as a "SHOULD", it is important 1564 for a transition to immediately trigger an ACK if at all possible, 1565 as already argued when specifying change-triggered ACKs for the 1566 ACE. 1568 Continual Repetition: Otherwise, if arriving packets continue to 1569 increment the same byte counter, the Data Receiver can include an 1570 AccECN Option on most or all (delayed) ACKs, but it does not have 1571 to. 1573 * It SHOULD include a counter that has continued to increment on 1574 the next scheduled ACK following a change-triggered ACK; 1576 * While the same counter continues to increment, it SHOULD 1577 include the counter every n ACKs as consistently as possible, 1578 where n can be chosen by the implementer; 1580 * It SHOULD always include an AccECN Option if the r.ceb counter 1581 is incrementing and it MAY include an AccECN Option if r.e0b 1582 or r.e1b is incrementing; 1584 * It SHOULD include each counter at least once for every 2^22 1585 bytes incremented to prevent overflow during continual 1586 repetition. 1588 If the smallest allowed AccECN Option would leave insufficient 1589 space for two SACK blocks on a particular ACK, the Data Receiver 1590 MUST give precedence to the SACK option (total 18 octets), because 1591 loss feedback is more critical. 1593 Necessary Option Length: It MAY exclude counter(s) that have not 1594 changed for the whole connection (but beacons still include all 1595 fields - see below). It SHOULD include counter(s) that have 1596 incremented at some time during the connection.
It MUST include 1597 the counter(s) that have incremented since the previous AccECN 1598 Option and it MUST only truncate fields from the right-hand tail 1599 of the option to preserve the order of the remaining fields (see 1600 Section 3.2.3). 1602 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1603 length AccECN TCP Option on at least three ACKs per RTT, or on all 1604 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1605 example algorithm that satisfies this requirement). 1607 The above rules complement those in Section 3.2.2.5, which determine 1608 when to generate an ACK irrespective of whether an AccECN TCP Option 1609 is to be included. 1611 The following example series of arriving IP/ECN fields illustrates 1612 when a Data Receiver will emit an ACK with an AccECN Option if it is 1613 using a delayed ACK factor of 2 segments and change-triggered ACKs: 1614 01 -> ACK, 01, 01 -> ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 1615 -> ACK. 1617 Even though the first bullet is stated as a "SHOULD", it is important for 1618 a transition to immediately trigger an ACK if at all possible, so 1619 that the Data Sender can rely on change-triggered ACKs to detect 1620 queue growth as soon as possible, e.g. at the start of a flow. This 1621 requirement can only be relaxed if certain offload hardware needed 1622 for high performance cannot support change-triggered ACKs (although 1623 high performance protocols such as DCTCP already successfully use 1624 change-triggered ACKs). One possible experimental compromise would 1625 be for the receiver to heuristically detect whether the sender is in 1626 slow-start, then to implement change-triggered ACKs while the sender 1627 is in slow-start, and offload otherwise.
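The example series of arriving IP/ECN fields given earlier in this section can be reproduced with a small simulation. The sketch below makes simplifying assumptions that are not part of the specification: every packet carries data, a change of byte counter is modelled as a change of IP/ECN codepoint, the first packet counts as a change, a change-triggered ACK resets the delayed-ACK count, and no delayed-ACK timer is modelled. The function name is hypothetical.

```python
from typing import List, Optional, Sequence


def ack_schedule(ecn_fields: Sequence[str],
                 delayed_ack_factor: int = 2) -> List[bool]:
    """Return, per arriving packet, whether an ACK is emitted."""
    acks: List[bool] = []
    previous: Optional[str] = None
    unacked = 0
    for ecn in ecn_fields:
        unacked += 1
        changed = ecn != previous  # change-triggered ACK condition
        send = changed or unacked >= delayed_ack_factor
        if send:
            unacked = 0
        acks.append(send)
        previous = ecn
    return acks


if __name__ == "__main__":
    series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
    for ecn, ack in zip(series, ack_schedule(series)):
        print(ecn + (" -> ACK" if ack else ""))
```

Under these assumptions the packets that trigger an ACK are the 1st, 3rd, 4th, 6th, 8th and 9th, which matches the series in the text.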
1629 For the avoidance of doubt, this change-triggered ACK mechanism is 1630 deliberately worded to ignore the arrival of a control packet with no 1631 payload, which therefore does not alter any byte counters, because it 1632 is important that TCP does not acknowledge pure ACKs. The change- 1633 triggered ACK approach can lead to some additional ACKs but it feeds 1634 back the timing and the order in which ECN marks are received with 1635 minimal additional complexity. If CE marks are infrequent, or 1636 there are multiple marks in a row, the additional load will be low. 1637 Other marking patterns could increase the load significantly. 1638 Investigating the additional load is a goal of the proposed 1639 experiment. 1641 Implementation note: sending an AccECN Option each time a different 1642 counter changes and including a full-length AccECN Option on every 1643 delayed ACK will satisfy the requirements described above and might 1644 be the easiest implementation, as long as sufficient space is 1645 available in each ACK (in total and in the option space). 1647 Appendix A.3 gives an example algorithm to estimate the number of 1648 marked bytes from the ACE field alone, if the AccECN Option is not 1649 available. 1651 If a host has determined that segments with the AccECN Option always 1652 seem to be discarded somewhere along the path, it is no longer 1653 obliged to follow the above rules. 1655 3.3. Requirements for TCP Proxies, Offload Engines and other 1656 Middleboxes on AccECN Compliance 1658 A large class of middleboxes splits TCP connections. Such a middlebox 1659 would be compliant with the AccECN protocol if the TCP implementation 1660 on each side complied with the present AccECN specification and each 1661 side negotiated AccECN independently of the other side. 1663 Another large class of middleboxes intervenes to some degree at the 1664 transport layer, but attempts to be transparent (invisible) to the 1665 end-to-end connection.
A subset of this class of middleboxes 1666 attempts to `normalize' the TCP wire protocol by checking that all 1667 values in header fields comply with a rather narrow interpretation of 1668 the TCP specifications. To comply with the present AccECN 1669 specification, such a middlebox MUST NOT change the ACE field or the 1670 AccECN Option and it SHOULD preserve the timing of each ACK (for 1671 example, if it coalesced ACKs it would not be AccECN-compliant) as 1672 these can be used by the Data Sender to infer further information 1673 about the path congestion level. A middlebox claiming to be 1674 transparent at the transport layer MUST forward the AccECN TCP Option 1675 unaltered, whether or not the length value matches one of those 1676 specified in Section 3.2.3, and whether or not the initial values of 1677 the byte-counter fields are correct. This is because blocking 1678 apparently invalid values does not improve security (because AccECN 1679 hosts are required to ignore invalid values anyway), while it 1680 prevents the standardized set of values being extended in future 1681 (because outdated normalizers would block updated hosts from using 1682 the extended AccECN standard). 1684 Hardware to offload certain TCP processing represents another large 1685 class of middleboxes, even though it is often a function of a host's 1686 network interface and rarely in its own 'box'. Leeway has been 1687 allowed in the present AccECN specification in the expectation that 1688 offload hardware could comply and still serve its function. 1689 Nonetheless, such hardware SHOULD also preserve the timing of each 1690 ACK (for example, if it coalesced ACKs it would not be AccECN- 1691 compliant). 1693 The ACE field changes with every received CE marking, so today's 1694 receive offloading could lead to many interrupts in high congestion 1695 situations. 
Although that would be useful (because congestion 1696 information is received sooner), it could also significantly increase 1697 processor load, particularly in scenarios such as DCTCP or L4S where 1698 the marking rate is generally higher. 1700 In data centres it has been fortunate for offload hardware that 1701 DCTCP-style feedback changes less often when there are long sequences 1702 of CE marks, which is more common with a step marking threshold. In 1703 order to enable DCTCP to improve its responsiveness, DCs will need to 1704 move beyond step marking. Before this can happen, offload hardware 1705 will have to explicitly address the variability of ECN feedback. 1707 ECN encodes a varying signal in the ACK stream, so it is inevitable 1708 that offload hardware will ultimately need to handle any form of ECN 1709 feedback exceptionally. The purpose of working towards standardized 1710 TCP ECN feedback is to reduce the risk for hardware developers, who 1711 would otherwise have to guess which scheme is likely to become 1712 dominant. 1714 4. Interaction with Other TCP Variants 1716 This section is informative, not normative. 1718 4.1. Compatibility with SYN Cookies 1720 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1721 protect itself from SYN flooding attacks. It places minimal commonly 1722 used connection state in the SYN/ACK, and deliberately does not hold 1723 any state while waiting for the subsequent ACK (e.g. it closes the 1724 thread). Therefore it cannot record the fact that it entered AccECN 1725 mode for both half-connections. Indeed, it cannot even remember 1726 whether it negotiated the use of classic ECN [RFC3168]. 1728 Nonetheless, such a server can determine that it negotiated AccECN as 1729 follows. 
If a TCP server using SYN Cookies supports AccECN and if it 1730 receives a pure ACK that acknowledges an ISN that is a valid SYN 1731 cookie, and if the ACK contains an ACE field with the value 0b010 to 1732 0b111 (decimal 2 to 7), it can assume that: 1734 o the TCP client must have requested AccECN support on the SYN 1736 o it (the server) must have confirmed that it supported AccECN 1738 Therefore the server can switch itself into AccECN mode, and continue 1739 as if it had never forgotten that it switched itself into AccECN mode 1740 earlier. 1742 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1743 with the value 0b000 or 0b001, these values indicate that the client 1744 did not request support for AccECN and therefore the server does not 1745 enter AccECN mode for this connection. Further, 0b001 on the ACK 1746 implies that the server sent an ECN-capable SYN/ACK, which was marked 1747 CE in the network, and the non-AccECN client fed this back by setting 1748 ECE on the ACK of the SYN/ACK. 1750 4.2. Compatibility with Other TCP Options and Experiments 1752 AccECN is compatible (at least on paper) with the most commonly used 1753 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1754 also compatible with the recent promising experimental TCP options 1755 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1756 AccECN is friendly to all these protocols, because space for TCP 1757 options is particularly scarce on the SYN, where AccECN consumes zero 1758 additional header space. 1760 When option space is under pressure from other options, 1761 Section 3.2.3.3 provides guidance on how important it is to send an 1762 AccECN Option and whether it needs to be a full-length option. 1764 Implementers of TFO need to take careful note of the recommendation 1765 in Section 3.2.2.1. 
That section recommends that, if the client has 1766 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1767 if it has data to send, it sends a pure ACK immediately before the 1768 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1769 pure ACK, which allows the server to detect ECN mangling. 1771 4.3. Compatibility with Feedback Integrity Mechanisms 1773 Three alternative mechanisms are available to assure the integrity of 1774 ECN and/or loss signals. AccECN is compatible with any of these 1775 approaches: 1777 o The Data Sender can test the integrity of the receiver's ECN (or 1778 loss) feedback by occasionally setting the IP-ECN field to a value 1779 normally only set by the network (and/or deliberately leaving a 1780 sequence number gap). Then it can test whether the Data 1781 Receiver's feedback faithfully reports what it expects (similar to 1782 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1783 [RFC3540], this approach does not waste the ECT(1) codepoint in 1784 the IP header, it does not require standardization and it does not 1785 rely on misbehaving receivers volunteering to reveal feedback 1786 information that allows them to be detected. However, setting the 1787 CE mark by the sender might conceal actual congestion feedback 1788 from the network and should therefore only be done sparingly. 1790 o Networks generate congestion signals when they are becoming 1791 congested, so networks are more likely than Data Senders to be 1792 concerned about the integrity of the receiver's feedback of these 1793 signals. A network can enforce a congestion response to its ECN 1794 markings (or packet losses) using congestion exposure (ConEx) 1795 audit [RFC7713]. Whether the receiver or a downstream network is 1796 suppressing congestion feedback or the sender is unresponsive to 1797 the feedback, or both, ConEx audit can neutralize any advantage 1798 that any of these three parties would otherwise gain. 
1800 ConEx is a change to the Data Sender that is most useful when 1801 combined with AccECN. Without AccECN, the ConEx behaviour of a 1802 Data Sender would have to be more conservative than would be 1803 necessary if it had the accurate feedback of AccECN. 1805 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1806 detect any tampering with AccECN feedback between the Data 1807 Receiver and the Data Sender (whether malicious or accidental). 1808 The AccECN fields are immutable end-to-end, so they are amenable 1809 to TCP-AO protection, which covers TCP options by default. 1810 However, TCP-AO is often too brittle to use on many end-to-end 1811 paths, where middleboxes can make verification fail in their 1812 attempts to improve performance or security, e.g. by 1813 resegmentation or shifting the sequence space. 1815 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1816 of congestion feedback. With minor changes AccECN could be optimized 1817 for the possibility that the ECT(1) codepoint might be used as an ECN 1818 Nonce. However, given RFC 3540 has been reclassified as historic, 1819 the AccECN design has been generalized so that it ought to be able to 1820 support other possible uses of the ECT(1) codepoint, such as a lower 1821 severity or a more instant congestion signal than CE. 1823 5. Protocol Properties 1825 This section is informative not normative. It describes how well the 1826 protocol satisfies the agreed requirements for a more accurate ECN 1827 feedback protocol [RFC7560]. 1829 Accuracy: From each ACK, the Data Sender can infer the number of new 1830 CE marked segments since the previous ACK. This provides better 1831 accuracy on CE feedback than classic ECN. In addition if the 1832 AccECN Option is present (not blocked by the network path) the 1833 number of bytes marked with CE, ECT(1) and ECT(0) are provided. 1835 Overhead: The AccECN scheme is divided into two parts. 
The 1836 essential part reuses the 3 flags already assigned to ECN in the 1837 TCP header. The supplementary part adds an additional TCP option 1838 consuming up to 11 bytes. However, no TCP option is consumed in 1839 the SYN.

1841 Ordering: The order in which marks arrive at the Data Receiver is 1842 preserved in AccECN feedback, because the Data Receiver is 1843 expected to send an ACK immediately whenever a different mark 1844 arrives.

1846 Timeliness: While the same ECN markings are arriving continually at 1847 the Data Receiver, it can defer ACKs as TCP does normally, but it 1848 will immediately send an ACK as soon as a different ECN marking 1849 arrives.

1851 Timeliness vs Overhead: Change-Triggered ACKs are intended to enable 1852 latency-sensitive uses of ECN feedback by capturing the timing of 1853 transitions but not wasting resources while the state of the 1854 signalling system is stable. Within the constraints of the 1855 change-triggered ACK rules, the receiver can control how 1856 frequently it sends the AccECN TCP Option and therefore to some 1857 extent it can control the overhead induced by AccECN.

1859 Resilience: All information is provided based on counters. 1860 Therefore if ACKs are lost, the counters on the first ACK 1861 following the losses allow the Data Sender to immediately recover 1862 the number of the ECN markings that it missed. And if data or 1863 ACKs are reordered, stale congestion information can be identified 1864 and ignored.

1866 Resilience against Bias: Because feedback is based on repetition of 1867 counters, random losses do not remove any information, they only 1868 delay it. Therefore, even though some ACKs are change-triggered, 1869 random losses will not alter the proportions of the different ECN 1870 markings in the feedback.

1872 Resilience vs Overhead: If space is limited in some segments (e.g.
1873 because more options are needed on some segments, such as the SACK 1874 option after loss), the Data Receiver can send AccECN Options less 1875 frequently or truncate fields that have not changed, usually down 1876 to as little as 5 bytes. However, it has to send a full-sized 1877 AccECN Option at least three times per RTT, which the Data Sender 1878 can rely on as a regular beacon or checkpoint.

1880 Resilience vs Timeliness and Ordering: Ordering information and the 1881 timing of transitions cannot be communicated in three cases: i) 1882 during ACK loss; ii) if something on the path strips the AccECN 1883 Option; or iii) if the Data Receiver is unable to support Change- 1884 Triggered ACKs. Following ACK reordering, the Data Sender can 1885 reconstruct the order in which feedback was sent, but not until 1886 all the missing feedback has arrived.

1888 Complexity: An AccECN implementation solely involves simple counter 1889 increments, some modulo arithmetic to communicate the least 1890 significant bits and allow for wrap, and some heuristics for 1891 safety against fields cycling due to prolonged periods of ACK 1892 loss. Each host needs to maintain eight additional counters. The 1893 hosts have to apply some additional tests to detect tampering by 1894 middleboxes, but in general the protocol is simple to understand, 1895 simple to implement and requires few cycles per packet to execute.

1897 Integrity: AccECN is compatible with at least three approaches that 1898 can assure the integrity of ECN feedback. If the AccECN Option is 1899 stripped, the resolution of the feedback is degraded, but the 1900 integrity of this degraded feedback can still be assured.

1902 Backward Compatibility: If only one endpoint supports the AccECN 1903 scheme, it will fall back to the most advanced ECN feedback scheme 1904 supported by the other end.
1906 Backward Compatibility: If the AccECN Option is stripped by a 1907 middlebox, AccECN still provides basic congestion feedback in the 1908 ACE field. Further, AccECN can be used to detect mangling of the 1909 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 1910 marked segments; and blocking of segments carrying the AccECN 1911 Option. It can detect these conditions during TCP's 3WHS so that 1912 it can fall back to operation without ECN and/or operation without 1913 the AccECN Option. 1915 Forward Compatibility: The behaviour of endpoints and middleboxes is 1916 carefully defined for all reserved or currently unused codepoints 1917 in the scheme. Then, the designers of security devices can 1918 understand which currently unused values might appear in future. 1919 So, even if they choose to treat such values as anomalous while 1920 they are not widely used, any blocking will at least be under 1921 policy control not hard-coded. Then, if previously unused values 1922 start to appear on the Internet (or in standards), such policies 1923 could be quickly reversed. 1925 6. IANA Considerations 1927 This document reassigns bit 7 of the TCP header flags to the AccECN 1928 experiment. This bit was previously called the Nonce Sum (NS) flag 1929 [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. 
1930 The flag will now be defined as:

1932 +-----+-------------------+-----------+
1933 | Bit | Name              | Reference |
1934 +-----+-------------------+-----------+
1935 | 7   | AE (Accurate ECN) | RFC XXXX  |
1936 +-----+-------------------+-----------+

1938 [TO BE REMOVED: IANA is requested to update the existing entry in the 1939 Transmission Control Protocol (TCP) Header Flags registration 1940 (https://www.iana.org/assignments/tcp-header-flags/tcp-header- 1941 flags.xhtml#tcp-header-flags-1) for Bit 7 to "AE (Accurate ECN), 1942 previously used as NS (Nonce Sum) by [RFC3540], which is now Historic 1943 [RFC8311]" and change the reference to this RFC-to-be instead of 1944 RFC8311.]

1946 This document also defines a new TCP option for AccECN, assigned a 1947 value of TBD1 (decimal) from the TCP option space. This value is 1948 defined as:

1950 +------+--------+-----------------------+-----------+
1951 | Kind | Length | Meaning               | Reference |
1952 +------+--------+-----------------------+-----------+
1953 | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
1954 +------+--------+-----------------------+-----------+

1956 [TO BE REMOVED: This registration should take place at the following 1957 location: http://www.iana.org/assignments/tcp-parameters/tcp- 1958 parameters.xhtml#tcp-parameters-1 ]

1960 Early implementation before the IANA allocation MUST follow [RFC6994] 1961 and use experimental option 254 and magic number 0xACCE (16 bits), 1962 then migrate to the new option after the allocation.

1964 7. Security Considerations

1966 If ever the supplementary part of AccECN based on the new AccECN TCP 1967 Option is unusable (due for example to middlebox interference) the 1968 essential part of AccECN's congestion feedback offers only limited 1969 resilience to long runs of ACK loss (see Section 3.2.2.5).
These 1970 problems are unlikely to be due to malicious intervention (because if 1971 an attacker could strip a TCP option or discard a long run of ACKs it 1972 could wreak other arbitrary havoc). However, it would be of concern 1973 if AccECN's resilience could be indirectly compromised during a 1974 flooding attack. AccECN is still considered safe though, because if 1975 the option is not present, the AccECN Data Sender is then required 1976 to switch to more conservative assumptions about wrap of congestion 1977 indication counters (see Section 3.2.2.5 and Appendix A.2).

1979 Section 4.1 describes how a TCP server can negotiate AccECN and use 1980 the SYN cookie method for mitigating SYN flooding attacks.

1982 There is concern that ECN markings could be altered or suppressed, 1983 particularly because a misbehaving Data Receiver could increase its 1984 own throughput at the expense of others. AccECN is compatible with 1985 the three schemes known to assure the integrity of ECN feedback (see 1986 Section 4.3 for details). If the AccECN Option is stripped by an 1987 incorrectly implemented middlebox, the resolution of the feedback 1988 will be degraded, but the integrity of this degraded information can 1989 still be assured.

1991 There is a potential concern that a receiver could deliberately omit 1992 the AccECN Option, pretending that it had been stripped by a 1993 middlebox. No way to take advantage of 1994 this downgrade attack is yet known, but it is mentioned here in case someone else 1995 can contrive one.

1997 The AccECN protocol is not believed to introduce any new privacy 1998 concerns, because it merely counts and feeds back signals at the 1999 transport layer that had already been visible at the IP layer.

2001 8.
Acknowledgements 2003 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2004 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2005 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans and Ilpo 2006 Jaervinen for their input and discussion. The idea of using the 2007 three ECN-related TCP flags as one field for more accurate TCP-ECN 2008 feedback was first introduced in the re-ECN protocol that was the 2009 ancestor of ConEx. 2011 Bob Briscoe was part-funded by the Comcast Innovation Fund, the 2012 European Community under its Seventh Framework Programme through the 2013 Reducing Internet Transport Latency (RITE) project (ICT-317700) and 2014 through the Trilogy 2 project (ICT-317756), and the Research Council 2015 of Norway through the TimeIn project. The views expressed here are 2016 solely those of the authors. 2018 Mirja Kuehlewind was partly supported by the European Commission 2019 under Horizon 2020 grant agreement no. 688421 Measurement and 2020 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 2021 State Secretariat for Education, Research, and Innovation under 2022 contract no. 15.0268. This support does not imply endorsement. 2024 9. Comments Solicited 2026 Comments and questions are encouraged and very welcome. They can be 2027 addressed to the IETF TCP maintenance and minor modifications working 2028 group mailing list , and/or to the authors. 2030 10. References 2032 10.1. Normative References 2034 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2035 RFC 793, DOI 10.17487/RFC0793, September 1981, 2036 . 2038 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2039 Requirement Levels", BCP 14, RFC 2119, 2040 DOI 10.17487/RFC2119, March 1997, 2041 . 2043 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 2044 of Explicit Congestion Notification (ECN) to IP", 2045 RFC 3168, DOI 10.17487/RFC3168, September 2001, 2046 . 
2048 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 2049 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 2050 . 2052 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2053 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2054 May 2017, . 2056 10.2. Informative References 2058 [I-D.ietf-tcpm-2140bis] 2059 Touch, J., Welzl, M., and S. Islam, "TCP Control Block 2060 Interdependence", draft-ietf-tcpm-2140bis-02 (work in 2061 progress), February 2020. 2063 [I-D.ietf-tcpm-generalized-ecn] 2064 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 2065 Congestion Notification (ECN) to TCP Control Packets", 2066 draft-ietf-tcpm-generalized-ecn-05 (work in progress), 2067 November 2019. 2069 [I-D.ietf-tsvwg-l4s-arch] 2070 Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low 2071 Latency, Low Loss, Scalable Throughput (L4S) Internet 2072 Service: Architecture", draft-ietf-tsvwg-l4s-arch-05 (work 2073 in progress), February 2020. 2075 [I-D.kuehlewind-tcpm-ecn-fallback] 2076 Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path 2077 Probing and Fallback", draft-kuehlewind-tcpm-ecn- 2078 fallback-01 (work in progress), September 2013. 2080 [Mandalari18] 2081 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 2082 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 2083 over Mobile", IEEE Communications Magazine , March 2018. 2085 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 2086 Selective Acknowledgment Options", RFC 2018, 2087 DOI 10.17487/RFC2018, October 1996, 2088 . 2090 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 2091 Congestion Notification (ECN) Signaling with Nonces", 2092 RFC 3540, DOI 10.17487/RFC3540, June 2003, 2093 . 2095 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 2096 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 2097 . 2099 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 
2100 Ramakrishnan, "Adding Explicit Congestion Notification 2101 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 2102 DOI 10.17487/RFC5562, June 2009, 2103 . 2105 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2106 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2107 June 2010, . 2109 [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 2110 Robustness to Blind In-Window Attacks", RFC 5961, 2111 DOI 10.17487/RFC5961, August 2010, 2112 . 2114 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 2115 "TCP Extensions for Multipath Operation with Multiple 2116 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 2117 . 2119 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 2120 RFC 6994, DOI 10.17487/RFC6994, August 2013, 2121 . 2123 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 2124 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 2125 . 2127 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 2128 "Problem Statement and Requirements for Increased Accuracy 2129 in Explicit Congestion Notification (ECN) Feedback", 2130 RFC 7560, DOI 10.17487/RFC7560, August 2015, 2131 . 2133 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 2134 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 2135 DOI 10.17487/RFC7713, December 2015, 2136 . 2138 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 2139 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 2140 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 2141 October 2017, . 2143 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 2144 Notification (ECN) Experimentation", RFC 8311, 2145 DOI 10.17487/RFC8311, January 2018, 2146 . 2148 [RFC8511] Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 2149 "TCP Alternative Backoff with ECN (ABE)", RFC 8511, 2150 DOI 10.17487/RFC8511, December 2018, 2151 . 2153 Appendix A. 
Example Algorithms 2155 This appendix is informative, not normative. It gives example 2156 algorithms that would satisfy the normative requirements of the 2157 AccECN protocol. However, implementers are free to choose other ways 2158 to implement the requirements. 2160 A.1. Example Algorithm to Encode/Decode the AccECN Option 2162 The example algorithms below show how a Data Receiver in AccECN mode 2163 could encode its CE byte counter r.ceb into the ECEB field within the 2164 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 2165 the ECEB field into its byte counter s.ceb. The other counters for 2166 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 2167 similarly encoded and decoded. 2169 It is assumed that each local byte counter is an unsigned integer 2170 greater than 24b (probably 32b), and that the following constant has 2171 been assigned: 2173 DIVOPT = 2^24 2175 Every time a CE marked data segment arrives, the Data Receiver 2176 increments its local value of r.ceb by the size of the TCP Data. 2177 Whenever it sends an ACK with the AccECN Option, the value it writes 2178 into the ECEB field is 2180 ECEB = r.ceb % DIVOPT 2182 where '%' is the remainder operator. 2184 On the arrival of an AccECN Option, the Data Sender first makes sure 2185 the ACK has not been superseded in order to avoid winding the s.ceb 2186 counter backwards. It uses the TCP acknowledgement number and any 2187 SACK options to calculate newlyAckedB, the amount of new data that 2188 the ACK acknowledges in bytes (newlyAckedB can be zero but not 2189 negative). If newlyAckedB is zero, either the ACK has been 2190 superseded or CE-marked packet(s) without data could have arrived. 2191 To break the tie for the latter case, the Data Sender could use 2192 timestamps (if present) to work out newlyAckedT, the amount of new 2193 time that the ACK acknowledges. If the Data Sender determines that 2194 the ACK has been superseded it ignores the AccECN Option. 
Otherwise, 2195 the Data Sender calculates the minimum non-negative difference d.ceb 2196 between the ECEB field and its local s.ceb counter, using modulo 2197 arithmetic as follows:

2199 if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
2200     d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
2201     s.ceb += d.ceb
2202 }

2204 For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), 2205 then

2207 s.ceb % DIVOPT = 1
2208          d.ceb = (1461 + 2^24 - 1) % 2^24
2209                = 1460
2210          s.ceb = 33,554,433 + 1460
2211                = 33,555,893

2213 A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

2215 The example algorithms below show how a Data Receiver in AccECN mode 2216 could encode its CE packet counter r.cep into the ACE field, and how 2217 the Data Sender in AccECN mode could decode the ACE field into its 2218 s.cep counter. The Data Sender's algorithm includes code to 2219 heuristically detect a long enough unbroken string of ACK losses that 2220 could have concealed a cycle of the congestion counter in the ACE 2221 field of the next ACK to arrive.

2223 Two variants of the algorithm are given: i) a more conservative 2224 variant for a Data Sender to use if it detects that the AccECN Option 2225 is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a 2226 less conservative variant that is feasible when complementary 2227 information is available from the AccECN Option.

2229 A.2.1. Safety Algorithm without the AccECN Option

2231 It is assumed that each local packet counter is a sufficiently sized 2232 unsigned integer (probably 32b) and that the following constant has 2233 been assigned:

2235 DIVACE = 2^3

2237 Every time an Acceptable CE marked packet arrives (Section 3.2.2.2), 2238 the Data Receiver increments its local value of r.cep by 1. It 2239 repeats the same value of ACE in every subsequent ACK until the next 2240 CE marking arrives, where

2242 ACE = r.cep % DIVACE.
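The following Python sketch is informative, not normative. It illustrates the two encodings described so far: the ECEB byte-counter encoding and decoding of Appendix A.1, and the receiver's ACE encoding just given. The function names and the newly_acked_b and newly_acked_t parameters are assumptions of this sketch. How a sender actually works out whether an ACK has been superseded is outside its scope.

```python
DIVOPT = 2 ** 24  # ECEB is a 24-bit field in the AccECN Option
DIVACE = 2 ** 3   # ACE is a 3-bit field in the TCP header


def encode_eceb(r_ceb: int) -> int:
    """Data Receiver: value to write into the ECEB field."""
    return r_ceb % DIVOPT


def decode_eceb(s_ceb: int, eceb: int,
                newly_acked_b: int, newly_acked_t: int = 0) -> int:
    """Data Sender: return the updated s.ceb counter.

    The counter only moves forward if the ACK acknowledges new data
    or new time; otherwise the ACK is treated as superseded and the
    ECEB value is ignored (hypothetical simplification).
    """
    if newly_acked_b > 0 or newly_acked_t > 0:
        # Minimum non-negative difference, allowing for wrap of ECEB.
        d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
        s_ceb += d_ceb
    return s_ceb


def encode_ace(r_cep: int) -> int:
    """Data Receiver: value to write into the ACE field."""
    return r_cep % DIVACE
```

With the numbers from the worked example in Appendix A.1, decode_eceb(33554433, 1461, 1460) advances the sender's counter by 1460 bytes. A call with both newly_acked_b and newly_acked_t equal to zero leaves the counter unchanged.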
2244 If the Data Sender received an earlier value of the counter that had 2245 been delayed due to ACK reordering, it might incorrectly calculate 2246 that the ACE field had wrapped. Therefore, on the arrival of every 2247 ACK, the Data Sender ensures the ACK has not been superseded using 2248 the TCP acknowledgement number, any SACK options and timestamps (if 2249 available) to calculate newlyAckedB, as in Appendix A.1. If the ACK 2250 has not been superseded, the Data Sender calculates the minimum 2251 difference d.cep between the ACE field and its local s.cep counter, 2252 using modulo arithmetic as follows: 2254 if ((newlyAckedB > 0) || (newlyAckedT > 0)) 2255 d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE 2257 Section 3.2.2.5 expects the Data Sender to assume that the ACE field 2258 cycled if it is the safest likely case under prevailing conditions. 2259 The 3-bit ACE field in an arriving ACK could have cycled and become 2260 ambiguous to the Data Sender if a row of ACKs goes missing that 2261 covers a stream of data long enough to contain 8 or more CE marks. 2262 We use the word `missing' rather than `lost', because some or all the 2263 missing ACKs might arrive eventually, but out of order. Even if some 2264 of the missing ACKs were piggy-backed on data (i.e. not pure ACKs) 2265 retransmissions will not repair the lost AccECN information, because 2266 AccECN requires retransmissions to carry the latest AccECN counters, 2267 not the original ones. 2269 The phrase `under prevailing conditions' allows for implementation- 2270 dependent interpretation. A Data Sender might take account of the 2271 prevailing size of data segments and the prevailing CE marking rate 2272 just before the sequence of missing ACKs. 
However, we shall start 2273 with the simplest algorithm, which assumes segments are all full- 2274 sized and ultra-conservatively it assumes that ECN marking was 100% 2275 on the forward path when ACKs on the reverse path started to all be 2276 dropped. Specifically, if newlyAckedB is the amount of data that an 2277 ACK acknowledges since the previous ACK, then the Data Sender could 2278 assume that this acknowledges newlyAckedPkt full-sized segments, 2279 where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the 2280 ACE field incremented by 2282 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 2284 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 2285 size segments than any previous ACK, and that ACE increments by a 2286 minimum of 2 CE marks (d.cep=2). The above formula works out that it 2287 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 2288 2). However, if ACE increases by a minimum of 2 but acknowledges 10 2289 full-sized segments, then it would be necessary to assume that there 2290 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 2292 ACKs that acknowledge a large stretch of packets might be common in 2293 data centres to achieve a high packet rate or might be due to ACK 2294 thinning by a middlebox. In these cases, cycling of the ACE field 2295 would often appear to have been possible, so the above algorithm 2296 would be over-conservative, leading to a false high marking rate and 2297 poor performance. Therefore it would be reasonable to only use 2298 dSafer.cep rather than d.cep if the moving average of newlyAckedPkt 2299 was well below 8. 2301 Implementers could build in more heuristics to estimate prevailing 2302 average segment size and prevailing ECN marking. 
For instance, 2303 newlyAckedPkt in the above formula could be replaced with 2304 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 2305 segment size and p is the prevailing ECN marking probability. 2306 However, ultimately, if TCP's ECN feedback becomes inaccurate it 2307 still has loss detection to fall back on. Therefore, it would seem 2308 safe to implement a simple algorithm, rather than a perfect one. 2310 The simple algorithm for dSafer.cep above requires no monitoring of 2311 prevailing conditions and it would still be safe if, for example, 2312 segments were on average at least 5% of full-sized as long as ECN 2313 marking was 5% or less. Assuming it was used, the Data Sender would 2314 increment its packet counter as follows: 2316 s.cep += dSafer.cep 2318 If missing acknowledgement numbers arrive later (due to reordering), 2319 Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the 2320 effect of any action it took based on a conservative assumption that 2321 it later found to be incorrect". To do this, the Data Sender would 2322 have to store the values of all the relevant variables whenever it 2323 made assumptions, so that it could re-evaluate them later. Given 2324 this could become complex and it is not required, we do not attempt 2325 to provide an example of how to do this. 2327 A.2.2. Safety Algorithm with the AccECN Option 2329 When the AccECN Option is available on the ACKs before and after the 2330 possible sequence of ACK losses, if the Data Sender only needs CE- 2331 marked bytes, it will have sufficient information in the AccECN 2332 Option without needing to process the ACE field. 
If for some reason 2333 it needs CE-marked packets, if dSafer.cep is different from d.cep, it 2334 can determine whether d.cep is likely to be a safe enough estimate by 2335 checking whether the average marked segment size (s = d.ceb/d.cep) is 2336 less than the MSS (where d.ceb is the amount of newly CE-marked bytes 2337 - see Appendix A.1). Specifically, it could use the following 2338 algorithm:

2340 SAFETY_FACTOR = 2
2341 if (dSafer.cep > d.cep) {
2342     if (d.ceb <= MSS * d.cep) { % Same as (s <= MSS), but no DBZ
2343         sSafer = d.ceb/dSafer.cep
2344         if (sSafer < MSS/SAFETY_FACTOR)
2345             dSafer.cep = d.cep % d.cep is a safe enough estimate
2346     } % else
2347         % No need for else; dSafer.cep is already correct,
2348         % because d.cep must have been too small
2349 }

2351 The chart below shows when the above algorithm will consider d.cep 2352 can replace dSafer.cep as a safe enough estimate of the number of CE- 2353 marked packets:

2355                   ^
2356             sSafer|
2357                   |
2358                MSS+
2359                   |
2360                   |         dSafer.cep
2361                   |                  is
2362  MSS/SAFETY_FACTOR+--------------+ safest
2363                   |              |
2364                   | d.cep is safe|
2365                   |    enough    |
2366                   +-------------------->
2367                                 MSS   s

2369 The following examples give the reasoning behind the algorithm, 2370 assuming MSS=1460 [B]:

2372 o if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and 2373 sSafer=182.5. 2374 Therefore even though the average size of 8 data segments is 2375 unlikely to have been as small as MSS/8, d.cep cannot have been 2376 correct, because it would imply an average segment size greater 2377 than the MSS.

2379 o if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and 2380 sSafer=146. 2381 Therefore d.cep is safe enough, because the average size of 10 2382 data segments is unlikely to have been as small as MSS/10.

2384 o if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and 2385 sSafer=680.
2387 Therefore d.cep is safe enough, because the average data segment 2388 size is more likely to have been just less than one MSS, rather 2389 than below MSS/2. 2391 If pure ACKs were allowed to be ECN-capable, missing ACKs would be 2392 far less likely. However, because [RFC3168] currently precludes 2393 this, the above algorithm assumes that pure ACKs are not ECN-capable. 2395 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets 2397 If the AccECN Option is not available, the Data Sender can only 2398 decode CE-marking from the ACE field in packets. Every time an ACK 2399 arrives, to convert this into an estimate of CE-marked bytes, it 2400 needs an average of the segment size, s_ave. Then it can add or 2401 subtract s_ave from the value of d.ceb as the value of d.cep 2402 increments or decrements. Some possible ways to calculate s_ave are 2403 outlined below. The precise details will depend on why an estimate 2404 of marked bytes is needed. 2406 The implementation could keep a record of the byte numbers of all the 2407 boundaries between packets in flight (including control packets), and 2408 recalculate s_ave on every ACK. However it would be simpler to 2409 merely maintain a counter packets_in_flight for the number of packets 2410 in flight (including control packets), which is reset once per RTT. 2411 Either way, it would estimate s_ave as: 2413 s_ave ~= flightsize / packets_in_flight, 2415 where flightsize is the variable that TCP already maintains for the 2416 number of bytes in flight. To avoid floating point arithmetic, it 2417 could right-bit-shift by lg(packets_in_flight), where lg() means log 2418 base 2. 2420 An alternative would be to maintain an exponentially weighted moving 2421 average (EWMA) of the segment size: 2423 s_ave = a * s + (1-a) * s_ave, 2425 where a is the decay constant for the EWMA. 
However, then it is 2426 necessary to choose a good value for this constant, which ought to 2427 depend on the number of packets in flight. Also the decay constant 2428 needs to be a power of two to avoid floating point arithmetic.

2430 A.4. Example Algorithm to Beacon AccECN Options

2432 Section 3.2.3.3 requires a Data Receiver to beacon a full-length 2433 AccECN Option at least 3 times per RTT. This could be implemented by 2434 maintaining a variable to store the number of ACKs (pure and data 2435 ACKs) since a full AccECN Option was last sent and another for the 2436 approximate number of ACKs sent in the last round trip time:

2438 if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
2439     send_full_AccECN_Option()

2441 For optimized integer arithmetic, BEACON_FREQ = 4 could be used, 2442 rather than 3, so that the division could be implemented as an 2443 integer right bit-shift by lg(BEACON_FREQ).

2445 In certain operating systems, it might be too complex to maintain 2446 acks_in_round. In others it might be possible by tagging each data 2447 segment in the retransmit buffer with the number of ACKs sent at the 2448 point that segment was sent. This would not work well if the Data 2449 Receiver was not sending data itself, in which case it might be 2450 necessary to beacon based on time instead, as follows:

2452 if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
2453     send_full_AccECN_Option()

2455 This time-based approach does not work well when all the ACKs are 2456 sent early in each round trip, as is the case during slow-start. In 2457 this case few options will be sent (possibly even fewer than 3 per RTT). 2458 However, when continuously sending data, data packets as well as ACKs 2459 will spread out equally over the RTT and sufficient ACKs with the 2460 AccECN option will be sent.

2462 A.5.
Example Algorithm to Count Not-ECT Bytes 2464 A Data Sender in AccECN mode can infer the amount of TCP payload data 2465 arriving at the receiver marked Not-ECT from the difference between 2466 the amount of newly ACKed data and the sum of the bytes with the 2467 other three markings, d.ceb, d.e0b and d.e1b. Note that, because 2468 r.e0b is initialized to 1 and the other two counters are initialized 2469 to 0, the initial sum will be 1, which matches the initial offset of 2470 the TCP sequence number on completion of the 3WHS. 2472 For this approach to be precise, it has to be assumed that spurious 2473 (unnecessary) retransmissions do not lead to double counting. This 2474 assumption is currently correct, given that RFC 3168 requires that 2475 the Data Sender marks retransmitted segments as Not-ECT. However, 2476 the converse is not true; necessary retransmissions will result in 2477 under-counting. 2479 However, such precision is unlikely to be necessary. The only known 2480 use of a count of Not-ECT marked bytes is to test whether equipment 2481 on the path is clearing the ECN field (perhaps due to an out-dated 2482 attempt to clear, or bleach, what used to be the ToS field). To 2483 detect bleaching it will be sufficient to detect whether nearly all 2484 bytes arrive marked as Not-ECT. Therefore there should be no need to 2485 keep track of the details of retransmissions. 2487 Appendix B. Rationale for Usage of TCP Header Flags 2489 B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake 2491 AccECN uses a rather unorthodox approach to negotiate the highest 2492 version TCP ECN feedback scheme that both ends support, as justified 2493 below. It follows from the original TCP ECN capability negotiation 2494 [RFC3168], in which the client set the 2 least significant of the 2495 original reserved flags in the TCP header, and fell back to no ECN 2496 support if the server responded with the 2 flags cleared, which had 2497 previously been the default. 
2499 ECN originally used header flags rather than a TCP option because it 2500 was considered more efficient to use a header flag for 1 bit of 2501 feedback per ACK, and this bit could be overloaded to indicate 2502 support for ECN during the handshake. During the development of ECN, 2503 1 bit crept up to 2, in order to deliver the feedback reliably and to 2504 work round some broken hosts that reflected the reserved flags during 2505 the handshake. 2507 In order to be backward compatible with RFC 3168, AccECN continues 2508 this approach, using the 3rd least significant TCP header flag that 2509 had previously been allocated for the ECN nonce (now historic). 2510 Then, whatever form of server an AccECN client encounters, the 2511 connection can fall back to the highest version of feedback protocol 2512 that both ends support, as explained in Section 3.1. 2514 If AccECN had used the more orthodox approach of a TCP option, it 2515 would still have had to set the two ECN flags in the main TCP header, 2516 in order to be able to fall back to Classic RFC 3168 ECN, or to 2517 disable ECN support, without another round of negotiation. Then 2518 AccECN would also have had to handle all the different ways that 2519 servers currently respond to settings of the ECN flags in the main 2520 TCP header, including all the conflicting cases where a server might 2521 have said it supported one approach in the flags and another approach 2522 in the new TCP option. And AccECN would have had to deal with all 2523 the additional possibilities where a middlebox might have mangled the 2524 ECN flags, or removed the TCP option. Thus, usage of the 3rd 2525 reserved TCP header flag simplified the protocol. 2527 The third flag was used in a way that could be distinguished from the 2528 ECN nonce, in case any nonce deployment was encountered. Previous 2529 usage of this flag for the ECN nonce was integrated into the original 2530 ECN negotiation. 
   This further justified the 3rd flag's use for
   AccECN, because a non-ECN usage of this flag would have had to use it
   as a separate single bit, rather than in combination with the other 2
   ECN flags.

   Indeed, having overloaded the original uses of these three flags for
   its handshake, AccECN overloads all three bits again as a 3-bit
   counter.

B.2.  Four Codepoints in the SYN/ACK

   Of the 8 possible codepoints that the 3 TCP header flags can indicate
   on the SYN/ACK, 4 already indicated earlier (or broken) versions of
   ECN support.  In the early design of AccECN, an AccECN server could
   use only 2 of the 4 remaining codepoints.  They both indicated AccECN
   support, but one fed back that the SYN had arrived marked as CE.
   Even though ECN support on a SYN is not yet on the standards track,
   the idea is for either end to act as a dumb reflector, so that future
   capabilities can be unilaterally deployed without requiring 2-ended
   deployment (justified in Section 2.5).

   During traversal testing it was discovered that the ECN field in the
   SYN was mangled on a non-negligible proportion of paths.  Therefore,
   it was necessary to allow the SYN/ACK to feed all four IP/ECN
   codepoints that the SYN could arrive with back to the client.
   Without this, the client could not know whether to disable ECN for
   the connection due to mangling of the IP/ECN field (also explained in
   Section 2.5).  This development consumed the remaining 2 codepoints
   on the SYN/ACK that had been reserved for future use by AccECN in
   earlier versions.

B.3.  Space for Future Evolution

   Despite availability of usable TCP header space being extremely
   scarce, the AccECN protocol has taken all possible steps to ensure
   that there is space to negotiate possible future variants of the
   protocol, either if the experiment proves that a variant of AccECN is
   required, or if a completely different ECN feedback approach is
   needed:

   Future AccECN variants:  When the AccECN capability is negotiated
      during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
      'Broken' in the column for the capability of node B are unused by
      any current protocol in the RFC series.  These could be used by
      TCP servers in future to indicate a variant of the AccECN
      protocol.  In recent measurement studies in which the response of
      large numbers of servers to an AccECN SYN has been tested, e.g.
      [Mandalari18], a very small number of SYN/ACKs arrive with the
      pattern tagged as 'Nonce', and a small but more significant number
      arrive with the pattern tagged as 'Broken'.  The 'Nonce' pattern
      could be a sign that a few servers have implemented the ECN Nonce
      [RFC3540], which has now been reclassified as historic [RFC8311],
      or it could be the random result of some unknown middlebox
      behaviour.  The greater prevalence of the 'Broken' pattern
      suggests that some instances still exist of the broken code that
      reflects the reserved flags on the SYN.

      The requirement not to reject unexpected initial values of the ACE
      counter (in the main TCP header) in the last paragraph of
      Section 3.2.2.3 ensures that 3 unused codepoints on the ACK of the
      SYN/ACK, 6 unused values on the first SYN=0 data packet from the
      client and 7 unused values on the first SYN=0 data packet from the
      server could be used to declare future variants of the AccECN
      protocol.
      The word 'declare' is used rather than 'negotiate'
      because, at this late stage in the 3WHS, it would be too late for
      a negotiation between the endpoints to be completed.  A similar
      requirement not to reject unexpected initial values in the TCP
      option (Section 3.2.3.2.4) is for the same purpose.  If traversal
      of the TCP option were reliable, this would have enabled a far
      wider range of future variation of the whole AccECN protocol.
      Nonetheless, it could be used to reliably negotiate a wide range
      of variation in the semantics of the AccECN Option.

   Future non-AccECN variants:  Five codepoints out of the 8 possible in
      the 3 TCP header flags used by AccECN are unused on the initial
      SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
      Section 3.1.3 ensures that the installed base of AccECN servers
      will all assume these are equivalent to AccECN negotiation with
      111 on the SYN.  These codepoints would not allow fall-back to
      Classic ECN support for a server that did not understand them, but
      this approach ensures they are available in future, perhaps for
      uses other than ECN alongside the AccECN scheme.  All possible
      combinations of SYN/ACK could be used in response except either
      000 or reflection of the same values sent on the SYN.

      Of course, other ways could be resorted to in order to extend
      AccECN or ECN in future, although their traversal properties are
      likely to be inferior.  They include a new TCP option; using the
      remaining reserved flags in the main TCP header (preferably
      extending the 3-bit combinations used by AccECN to 4-bit
      combinations, rather than burning one bit for just one state); a
      non-zero urgent pointer in combination with the URG flag cleared;
      or some other unexpected combination of fields yet to be invented.
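
   As an informal illustration of the SYN codepoints discussed under
   'Future non-AccECN variants', the following sketch shows how a
   server might classify the three flags (AE,CWR,ECE) on an initial
   SYN.  It is a minimal sketch under stated assumptions, not a
   definitive implementation: the names SynMode, classify_syn and
   UNUSED_SYN_CODEPOINTS are hypothetical, the server is assumed to
   implement both AccECN and Classic ECN, and 000 and 011 are assumed to
   be the non-ECN SYN and the Classic ECN [RFC3168] setup SYN
   respectively.

```python
from enum import Enum


class SynMode(Enum):
    """Feedback mode a server would infer from the SYN (hypothetical names)."""
    NOT_ECN = "Not ECN"
    CLASSIC_ECN = "Classic ECN (RFC 3168)"
    ACCECN = "AccECN"


# The five codepoints (AE, CWR, ECE) listed in the text as unused on the SYN.
UNUSED_SYN_CODEPOINTS = [
    (0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0),
]


def classify_syn(ae: int, cwr: int, ece: int) -> SynMode:
    """Classify the (AE, CWR, ECE) flags of an initial SYN.

    Assumption: the server supports both AccECN and Classic ECN.
    """
    flags = (ae, cwr, ece)
    if any(bit not in (0, 1) for bit in flags):
        raise ValueError("each flag must be 0 or 1")
    if flags == (0, 0, 0):
        # No ECN flags set: the client is not requesting ECN.
        return SynMode.NOT_ECN
    if flags == (0, 1, 1):
        # CWR and ECE set, AE clear: Classic ECN setup SYN.
        return SynMode.CLASSIC_ECN
    # 111 requests AccECN.  The five currently unused codepoints are
    # treated as equivalent to 111, which keeps them available for
    # future variants.
    return SynMode.ACCECN
```

   For example, classify_syn(1, 1, 1) and classify_syn(1, 0, 1) both
   yield SynMode.ACCECN in this sketch, which mirrors the requirement
   that unused SYN codepoints are handled as if they were 111.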

Authors' Addresses

   Bob Briscoe
   Independent
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/


   Mirja Kuehlewind
   Ericsson
   Germany

   EMail: ietf@kuehlewind.net


   Richard Scheffenegger
   NetApp
   Vienna
   Austria

   EMail: Richard.Scheffenegger@netapp.com