TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                                 CableLabs
Intended status: Experimental                              M. Kuehlewind
Expires: September 12, 2019                                   ETH Zurich
                                                        R. Scheffenegger
                                                          March 11, 2019


                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-08

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN is specified for TCP in such a way that only one feedback signal
   can be transmitted per Round-Trip Time (RTT). Recent new TCP
   mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP)
   or Low Latency Low Loss Scalable Throughput (L4S) need more accurate
   ECN feedback information whenever more than one marking is received
   in one RTT. This document specifies an experimental scheme to
   provide more than one feedback signal per RTT in the TCP header.
   Given TCP header space is scarce, it allocates a reserved header bit,
   that was previously used for the ECN-Nonce which has now been
   declared historic. It also overloads the two existing ECN flags in
   the TCP header.
   Supplementary feedback information can optionally be
   provided in a new TCP option, which is never used on the TCP SYN.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 12, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Forward Compatibility
       3.1.3.  Retransmission of the SYN
     3.2.  AccECN Feedback
       3.2.1.  Initialization of Feedback Counters at the Data Sender
       3.2.2.  The ACE Field
       3.2.3.  Testing for Zeroing of the ACE Field
       3.2.4.  Testing for Mangling of the IP/ECN Field
       3.2.5.  Safety against Ambiguity of the ACE Field
       3.2.6.  The AccECN Option
       3.2.7.  Path Traversal of the AccECN Option
       3.2.8.  Usage of the AccECN TCP Option
     3.3.  Requirements for TCP Proxies, Offload Engines and other
           Middleboxes on AccECN Compliance
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other TCP Options and Experiments
     4.3.  Compatibility with Feedback Integrity Mechanisms
   5.  Protocol Properties
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Appendix B.  Rationale for Usage of TCP Header Flags
     B.1.  Three TCP Header Flags in the SYN-SYN/ACK Handshake
     B.2.  Four Codepoints in the SYN/ACK
     B.3.  Space for Future Evolution
   Authors' Addresses

1. Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
   know when more than one marking is received in one RTT, which is
   information that cannot be provided by the feedback scheme
   specified in [RFC3168]. This document specifies an alternative
   feedback scheme that provides more accurate information and could be
   used by these new TCP extensions. A fuller treatment of the
   motivation for this specification is given in the associated
   requirements document [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic TCP/
   ECN feedback, not a fork in the design of TCP. AccECN feedback
   complements TCP's loss feedback and it supplements classic TCP/ECN
   feedback, so its applicability is intended to include all public and
   private IP networks (and even any non-IP networks over which TCP is
   used today), whether or not any nodes on the path support ECN of
   whatever flavour.

   Until the AccECN experiment succeeds, [RFC3168] will remain as the
   only standards track specification for adding ECN to TCP. To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN feedback overloads the two existing ECN flags and allocates
   the currently reserved flag (previously called NS) in the TCP header,
   to be used as one field indicating the number of congestion
   experienced marked packets. Given the new definitions of these three
   bits, both ends have to support the new wire protocol before it can
   be used.
   Therefore during the TCP handshake the two ends use these
   three bits in the TCP header to negotiate the most advanced feedback
   protocol that they can both support, in a way that is backward
   compatible with [RFC3168].

   AccECN is solely an (experimental) change to the TCP wire protocol;
   it only specifies the negotiation and signaling of more accurate ECN
   feedback from a TCP Data Receiver to a Data Sender. It is completely
   independent of how TCP might respond to congestion feedback, which is
   out of scope. For that we refer to [RFC3168] or any RFC that
   specifies a different response to TCP ECN feedback, for example:
   [RFC8257]; or the ECN experiments referred to in [RFC8311], namely: a
   TCP-based Low Latency Low Loss Scalable (L4S) congestion control
   [I-D.ietf-tsvwg-l4s-arch]; ECN-capable TCP control packets
   [I-D.ietf-tcpm-generalized-ecn], or Alternative Backoff with ECN
   (ABE) [RFC8511].

   It is recommended that the AccECN protocol is implemented alongside
   the experimental ECN++ protocol [I-D.ietf-tcpm-generalized-ecn].
   Therefore, this specification does not discuss implementing AccECN
   alongside [RFC5562], which was an earlier experimental protocol with
   narrower scope than ECN++.

1.1. Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like. Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification. Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not. Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2. Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3. Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested. The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol. The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if testing
   confirms that the proposed mechanism can be deployed at large scale.
   Testing will mostly focus on fall-back strategies in case of
   middlebox interference.
   Current recommended strategies are specified
   in Sections 3.1.3, 3.2.3, 3.2.4 and 3.2.7. The effectiveness of
   these strategies depends on the actual deployment situation of
   middleboxes. Therefore experimental verification to confirm large-
   scale path traversal in the Internet is needed before finalizing this
   specification on the Standards Track.

   Another experimentation focus is the implementation feasibility of
   change-triggered ACKs as described in Section 3.2.8. While on
   average this should not lead to a higher ACK rate, it changes the ACK
   pattern, which can particularly affect hardware offload.
   It is currently specified as a hard requirement, because the sender
   can exploit the predictability of the receiver's behaviour. However,
   further experimentation is needed to advise whether it will have to
   become just a preferred behaviour.

1.4. Terminology

   AccECN:  The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN:  the ECN protocol specified in [RFC3168].

   Classic ECN feedback:  the feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK:  A TCP acknowledgement, with or without a data payload.

   Pure ACK:  A TCP acknowledgement without a data payload.

   TCP client:  The TCP stack that originates a connection.

   TCP server:  The TCP stack that responds to a connection request.

   Data Receiver:  The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender:  The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

1.5. Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is Not-
   ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
   the node can mark the packet by setting both ECN bits, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +------------------+----------------+---------------------------+
   | IP-ECN codepoint | Codepoint name | Description               |
   | (binary)         |                |                           |
   +------------------+----------------+---------------------------+
   | 00               | Not-ECT        | Not ECN-Capable Transport |
   | 01               | ECT(1)         | ECN-Capable Transport (1) |
   | 10               | ECT(0)         | ECN-Capable Transport (0) |
   | 11               | CE             | Congestion Experienced    |
   +------------------+----------------+---------------------------+

                 Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]).
   A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost. The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag. This always leads to a full RTT
   of ACKs with ECE set. Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
   deployment RFC 3540 has been reclassified as historic [RFC8311] and
   the respective flag has been marked as "reserved", making this TCP
   flag available for use by the AccECN experiment instead.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2. AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP half-
   connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.
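
   As a recap of the flag layout in Figure 1, on which the rest of this
   section builds, the following minimal sketch extracts the three
   ECN-related flags from a raw TCP header. The helper name ecn_flags
   and the sample header bytes are hypothetical, chosen only for
   illustration. Indexes in the code are zero-based, so "byte 13" and
   "byte 14" in Section 1.5 correspond to indexes 12 and 13.

```python
def ecn_flags(tcp_header):
    """Return (NS, CWR, ECE) from a raw TCP header (hypothetical helper)."""
    if len(tcp_header) < 14:
        raise ValueError("TCP header too short")
    ns = tcp_header[12] & 0x01          # last bit of byte 13 (bit 7 in Figure 1)
    cwr = (tcp_header[13] >> 7) & 0x01  # first bit of byte 14 (bit 8)
    ece = (tcp_header[13] >> 6) & 0x01  # second bit of byte 14 (bit 9)
    return ns, cwr, ece


# Illustrative 20-byte header: Header Length = 5 words, flags CWR + ECE + SYN,
# i.e. the SYN of a classic ECN client as recapped in Section 1.5.
sample = bytearray(20)
sample[12] = 0x50  # Header Length = 5, Reserved = 0, NS = 0
sample[13] = 0xC2  # CWR | ECE | SYN
print(ecn_flags(bytes(sample)))
```

   AccECN reuses exactly these three bit positions, which is why the
   essential part of the protocol needs no additional header space.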

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks). This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two-part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN. This
   design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1. Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2. Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 2 later). The LSBs
   of each of the three byte counters are carried in the AccECN Option.

2.3. Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used. Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle. However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses. Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear [RFC8311].
   Because the
   3-bit ACE field is so small, when it is the only field available the
   Data Sender has to interpret it conservatively assuming the worst
   possible wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4. Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx. If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5. Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.
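
   The counting rules of Sections 2.2 to 2.4, including the fact that a
   CE-marked control packet advances the packet counter but not the byte
   counters, can be sketched as follows. This is a minimal illustration
   under stated assumptions, not the normative algorithm: the class,
   method and field names (ReceiverCounters, on_packet, cep, ceb, e0b,
   e1b, min_ace_increase) are hypothetical, and the counters start at
   zero purely as a placeholder, since this overview does not give the
   initial values.

```python
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11  # IP-ECN codepoints (Table 1)


class ReceiverCounters:
    """Hypothetical model of the Data Receiver's four counters."""

    def __init__(self, cep=0, ceb=0, e0b=0, e1b=0):
        # Placeholder initial values (assumption, not taken from the text).
        self.cep = cep  # packets that arrived CE-marked, with or without payload
        self.ceb = ceb  # payload bytes that arrived marked CE
        self.e0b = e0b  # payload bytes that arrived marked ECT(0)
        self.e1b = e1b  # payload bytes that arrived marked ECT(1)

    def on_packet(self, ip_ecn, payload_len):
        """Reflect whatever marking arrived; Not-ECT changes no counter."""
        if ip_ecn == CE:
            self.cep += 1
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len

    def ace(self):
        """Three least significant bits of the CE packet counter."""
        return self.cep & 0b111


def min_ace_increase(prev_ace, new_ace):
    """Smallest increase consistent with two ACE values seen by the sender.

    The true increase may be larger by any multiple of 8 if ACKs were lost.
    """
    return (new_ace - prev_ace) % 8


rx = ReceiverCounters()
rx.on_packet(CE, 0)      # CE-marked control packet without payload
rx.on_packet(CE, 1000)   # CE-marked data packet
rx.on_packet(ECT0, 500)  # unmarked ECT(0) data packet
print(rx.cep, rx.ceb, rx.e0b, rx.e1b, rx.ace())
```

   The modulo-8 arithmetic in min_ace_increase is why a Data Sender that
   only has the ACE field has to assume the worst possible wrap.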

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and para 2 of
   Section 20.2 of [RFC3168]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on its ECN marking. Although RFC 3168 prohibits an
   ECN-capable SYN, providing feedback of ECN marking on the SYN
   supports future scenarios in which SYNs might be ECN-enabled (without
   prejudging whether they ought to be). For instance, [RFC8311]
   updates this aspect of RFC 3168 to allow experimentation with ECN-
   capable TCP control packets.

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   not-ECT in compliance with RFC 3168, feedback on the state of the ECN
   field when it arrives at the receiver could still be useful, because
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field [Mandalari18]. If
   a TCP client has set the SYN to Not-ECT, but receives feedback that
   the ECN field on the SYN arrived with a different codepoint, it can
   detect such middlebox interference and send Not-ECT for the rest of
   the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]). Today, if a
   TCP server receives ECT or CE on a SYN, it cannot know whether it is
   invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT). Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection.
   Instead, the AccECN protocol allows the server
   to feed back the received ECN field to the client, which then has all
   the information to decide whether the connection has to fall-back
   from supporting ECN (or not).

3. AccECN Protocol Specification

3.1. Negotiating to use AccECN

3.1.1. Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] has been reclassified as historic
   [RFC8311], the present specification re-allocates the TCP flag at bit
   7 of the TCP header, which was previously called NS (Nonce Sum), as
   the AE (Accurate ECN) flag (see IANA Considerations in Section 6).

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN. The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN. This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.
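
   The handshake rules above can be sketched as follows, using the four
   SYN/ACK flag combinations from the top block of Table 2. This is an
   illustrative sketch only, not a reference implementation: the names
   server_synack_flags and client_on_synack are hypothetical, flags are
   modelled as (AE, CWR, ECE) tuples, and the fall-back cases in the
   remaining blocks of Table 2 are deliberately left out.

```python
# IP-ECN codepoints (Table 1), as 2-bit integers.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

# (AE, CWR, ECE) set by an AccECN client on the initial SYN.
ACCECN_SYN = (1, 1, 1)

# Top block of Table 2: SYN/ACK flags feeding back the IP-ECN field of the SYN.
SYNACK_FOR_IPECN = {
    NOT_ECT: (0, 1, 0),
    ECT1: (0, 1, 1),
    ECT0: (1, 0, 0),
    CE: (1, 1, 0),
}
IPECN_FOR_SYNACK = {flags: ecn for ecn, flags in SYNACK_FOR_IPECN.items()}


def server_synack_flags(syn_flags, syn_ip_ecn):
    """Flags an AccECN server sets on the SYN/ACK.

    Returns None when the SYN did not request AccECN; those cases are
    covered by the other blocks of Table 2 and are not modelled here.
    """
    if syn_flags != ACCECN_SYN:
        return None
    return SYNACK_FOR_IPECN[syn_ip_ecn]


def client_on_synack(synack_flags):
    """Return (accecn_mode, ip_ecn_seen_on_syn) for an AccECN client."""
    if synack_flags in IPECN_FOR_SYNACK:
        return True, IPECN_FOR_SYNACK[synack_flags]
    return False, None  # fall back per the other blocks of Table 2


# Example: the SYN arrived at the server CE-marked.
synack = server_synack_flags(ACCECN_SYN, CE)
print(synack, client_on_synack(synack))
```

   Keeping the mapping in a single table and deriving its inverse from
   it avoids the two directions of the handshake drifting apart.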
569 The procedure for the client to follow if a SYN/ACK does not arrive 570 before its retransmission timer expires is given in Section 3.1.3. 572 The three flags set to 1 to indicate AccECN support on the SYN have 573 been carefully chosen to enable natural fall-back to prior stages in 574 the evolution of ECN. Table 2 tabulates all the negotiation 575 possibilities for ECN-related capabilities that involve at least one 576 AccECN-capable host. The entries in the first two columns have been 577 abbreviated, as follows:
579 AccECN: More Accurate ECN Feedback (the present specification)
581 Nonce: ECN Nonce feedback [RFC3540]
583 ECN: 'Classic' ECN feedback [RFC3168]
585 No ECN: Not-ECN-capable. Implicit congestion notification using 586 packet drop.
588 +--------+--------+------------+-------------+----------------------+
589 | A      | B      | SYN A->B   | SYN/ACK     | Feedback Mode        |
590 |        |        |            | B->A        |                      |
591 +--------+--------+------------+-------------+----------------------+
592 |        |        | AE CWR ECE | AE CWR ECE  |                      |
593 | AccECN | AccECN | 1   1   1  | 0   1   0   | AccECN (Not-ECT on   |
594 |        |        |            |             | SYN)                 |
595 | AccECN | AccECN | 1   1   1  | 0   1   1   | AccECN (ECT1 on SYN) |
596 | AccECN | AccECN | 1   1   1  | 1   0   0   | AccECN (ECT0 on SYN) |
597 | AccECN | AccECN | 1   1   1  | 1   1   0   | AccECN (CE on SYN)   |
598 |        |        |            |             |                      |
599 | AccECN | Nonce  | 1   1   1  | 1   0   1   | classic ECN          |
600 | AccECN | ECN    | 1   1   1  | 0   0   1   | classic ECN          |
601 | AccECN | No ECN | 1   1   1  | 0   0   0   | Not ECN              |
602 |        |        |            |             |                      |
603 | Nonce  | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
604 | ECN    | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
605 | No ECN | AccECN | 0   0   0  | 0   0   0   | Not ECN              |
606 |        |        |            |             |                      |
607 | AccECN | Broken | 1   1   1  | 1   1   1   | Not ECN              |
608 +--------+--------+------------+-------------+----------------------+
610 Table 2: ECN capability negotiation between Client (A) and Server (B) 612 Table 2 is divided into blocks each separated by an empty row. 614 1.
The top block shows the case already described where both 615 endpoints support AccECN and how the TCP server (B) indicates 616 congestion feedback. 618 2. The second block shows the cases where the TCP client (A) 619 supports AccECN but the TCP server (B) supports some earlier 620 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 621 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 622 shown it MUST set both its half connections into the feedback 623 mode shown in the rightmost column. 625 3. The third block shows the cases where the TCP server (B) supports 626 AccECN but the TCP client (A) supports some earlier variant of 627 TCP feedback, indicated in its SYN. Therefore, as soon as an 628 AccECN-enabled TCP server (B) receives the SYN shown, it MUST set 629 both its half connections into the feedback mode shown in the 630 rightmost column. 632 4. The fourth block displays a combination labelled `Broken' . Some 633 older TCP server implementations incorrectly set the reserved 634 flags in the SYN/ACK by reflecting those in the SYN. Such broken 635 TCP servers (B) cannot support ECN, so as soon as an AccECN- 636 capable TCP client (A) receives such a broken SYN/ACK it MUST 637 fall-back to Not ECN mode for both its half connections. 639 The following exceptional cases need some explanation: 641 ECN Nonce: With AccECN implementation, there is no need for the ECN 642 Nonce feedback mode [RFC3540], which has been reclassified as 643 historic [RFC8311], as AccECN is compatible with an alternative 644 ECN feedback integrity approach that does not use up the ECT(1) 645 codepoint and can be implemented solely at the sender (see 646 Section 4.3). 648 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 649 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 650 Host A MUST then enter the same feedback mode as it would have 651 entered had it been a responding host and received the same SYN. 
652 Then host A MUST send the same SYN/ACK as it would have sent had 653 it been a responding host. 655 3.1.2. Forward Compatibility 657 If a TCP server that implements AccECN receives a SYN with the three 658 TCP header flags (AE, CWR and ECE) set to any combination other than 659 000, 011 or 111, it MUST negotiate the use of AccECN as if they had 660 been set to 111. This ensures that future uses of the other 661 combinations on a SYN can rely on consistent behaviour from the 662 installed base of AccECN servers. 664 For the avoidance of doubt, the negotiation tabulated in Table 2 665 solely concerns the three TCP header flags shown (AE, CWR and ECE). 667 An AccECN host (client or server) MUST ignore the three remaining 668 reserved TCP header flags on all packets. 670 3.1.3. Retransmission of the SYN 672 If the sender of an AccECN SYN times out before receiving the SYN/ 673 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 674 least one more time by continuing to set all three TCP ECN flags on 675 the first retransmitted SYN (using the usual retransmission time- 676 outs). If this first retransmission also fails to be acknowledged, 677 the sender SHOULD send subsequent retransmissions of the SYN without 678 any TCP-ECN flags set. This adds delay, in the case where a 679 middlebox drops an AccECN (or ECN) SYN deliberately. However, 680 current measurements imply that a drop is less likely to be due to 681 middlebox interference than other intermittent causes of loss, e.g. 682 congestion, wireless interference, etc. 684 Implementers MAY use other fall-back strategies if they are found to 685 be more effective (e.g. attempting to negotiate AccECN on the SYN 686 only once or more than twice (most appropriate during high levels of 687 congestion); or falling back to classic ECN feedback rather than non- 688 ECN). 
Further, it may make sense to also remove any other 689 experimental fields or options on the SYN in case a middlebox might 690 be blocking them, although the required behaviour will depend on the 691 specification of the other option(s) and any attempt to co-ordinate 692 fall-back between different modules of the stack. In any case, the 693 TCP initiator SHOULD cache failed connection attempts. If it does, 694 it SHOULD NOT give up attempting to negotiate AccECN on the SYN of 695 subsequent connection attempts until it is clear that the blockage is 696 persistently and specifically due to AccECN. The cache should be 697 arranged to expire so that the initiator will infrequently attempt to 698 check whether the problem has been resolved. 700 The fall-back procedure if the TCP server receives no ACK to 701 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 702 Section 3.2.7. 704 3.2. AccECN Feedback 706 Each Data Receiver of each half connection maintains four counters, 707 r.cep, r.ceb, r.e0b and r.e1b. The CE packet counter (r.cep) counts 708 the number of packets the host receives with the CE codepoint in the 709 IP-ECN field, including CE marks on control packets without data. 710 r.ceb, r.e0b and r.e1b count the number of TCP payload bytes in 711 packets marked respectively with the CE, ECT(0) and ECT(1) codepoint 712 in their IP-ECN field. When a host first enters AccECN mode, it 713 initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = r.e1b = 714 0 (see Appendix A.5). Non-zero initial values are used to support a 715 stateless handshake (see Section 4.1) and to be distinct from cases 716 where the fields are incorrectly zeroed (e.g. by middleboxes - see 717 Section 3.2.7.4). 719 A host feeds back the CE packet counter using the Accurate ECN (ACE) 720 field, as explained in Section 3.2.2. It feeds back all the 721 byte counters using the AccECN TCP Option, as specified in 722 Section 3.2.6.
Whenever a host feeds back the value of any counter, 723 it MUST report the most recent value, no matter whether it is in a 724 pure ACK, an ACK with new payload data or a retransmission. 725 Therefore the feedback carried on a retransmitted packet is unlikely 726 to be the same as the feedback on the original packet. 728 3.2.1. Initialization of Feedback Counters at the Data Sender 730 Each Data Sender of each half connection maintains four counters, 731 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 732 counters at the Data Receiver. When a host enters AccECN mode, it 733 initializes them to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 0. 735 If a TCP client (A) in AccECN mode receives a SYN/ACK with CE 736 feedback, i.e. AE=1, CWR=1, ECE=0, it increments s.cep to 6. 737 Otherwise, for any of the 3 other combinations of the 3 ECN TCP flags 738 (the top 3 rows in Table 2), s.cep remains initialized to 5. 740 3.2.2. The ACE Field 742 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 743 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 744 as one 3-bit field. Then the field is given a new name, ACE, as 745 shown in Figure 2.
747   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
748 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
749 |               |           |           | U | A | P | R | S | F |
750 | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
751 |               |           |           | G | K | H | T | N | N |
752 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
754 Figure 2: Definition of the ACE field within bytes 13 and 14 of the 755 TCP Header (when AccECN has been negotiated and SYN=0). 757 The original definition of these three flags in the TCP header, 758 including the addition of support for the ECN Nonce, is shown for 759 comparison in Figure 1.
This specification does not rename these 760 three TCP flags to ACE unconditionally; it merely overloads them with 761 another name and definition once an AccECN connection has been 762 established. 764 A host MUST interpret the AE, CWR and ECE flags as the 3-bit ACE 765 counter on a segment with the SYN flag cleared (SYN=0) that it sends 766 or receives if both of its half-connections are set into AccECN mode 767 having successfully negotiated AccECN (see Section 3.1). A host MUST 768 NOT interpret the 3 flags as a 3-bit ACE field on any segment with 769 SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete 770 or has not succeeded. 772 Both parts of each of these conditions are equally important. For 773 instance, even if AccECN negotiation has been successful, the ACE 774 field is not defined on any segments with SYN=1 (e.g. a 775 retransmission of an unacknowledged SYN/ACK, or when both ends send 776 SYN/ACKs after AccECN support has been successfully negotiated during 777 a simultaneous open). 779 With only one exception, on any packet with the SYN flag cleared 780 (SYN=0), the Data Receiver MUST encode the three least significant 781 bits of its r.cep counter into the ACE field it feeds back to the 782 Data Sender. 784 There is only one exception to this rule: On the final ACK of the 785 3-way handshake (3WHS), a TCP client (A) in AccECN mode MUST use the 786 ACE field to feed back which of the 4 possible values of the IP-ECN 787 field were on the SYN/ACK (the binary encoding is the same as that 788 used on the SYN/ACK). Table 3 shows the meaning of each possible 789 value of the ACE field on the ACK of the SYN/ACK and the value that 790 an AccECN server MUST set s.cep to as a result. The encoding in 791 Table 3 is solely applicable on a packet in the client-server 792 direction with an acknowledgement number 1 greater than the Initial 793 Sequence Number (ISN) that was used by the server. 
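The receiver-side bookkeeping of Section 3.2 and the general ACE encoding rule above can be sketched in Python as follows. This is a minimal illustration, not the algorithm of any appendix. The class name ReceiverCounters, its methods and the helper ace_is_defined are hypothetical. The sketch covers only the general rule (the three least significant bits of r.cep); the handshake exception for the final ACK of the 3WHS is deliberately not modelled.

```python
# IP-ECN codepoints as 2-bit values (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


class ReceiverCounters:
    """Counters kept by a Data Receiver for one half connection."""

    def __init__(self):
        # Initial values given in Section 3.2.
        self.cep = 5  # CE-marked packets
        self.e0b = 1  # payload bytes marked ECT(0)
        self.ceb = 0  # payload bytes marked CE
        self.e1b = 0  # payload bytes marked ECT(1)

    def on_packet(self, ip_ecn, payload_len):
        """Update the counters for one arriving packet."""
        if ip_ecn == CE:
            # r.cep counts packets, so it also counts control packets
            # that carry no payload.
            self.cep += 1
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any counter.

    def ace(self):
        """Three least significant bits of r.cep, for the ACE field."""
        return self.cep & 0b111


def ace_is_defined(syn_flag, accecn_negotiated):
    """The three flags form the ACE field only when SYN=0 and AccECN
    has been successfully negotiated."""
    return accecn_negotiated and syn_flag == 0
```

With the initial value r.cep = 5, the first ACE value fed back is 0b101, and the field wraps to 0b000 after three CE-marked packets.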
795 +--------------+---------------------------+------------------------+
796 | ACE on ACK   | IP-ECN codepoint on       | Initial s.cep of       |
797 | of SYN/ACK   | SYN/ACK inferred by       | server in AccECN mode  |
798 |              | server                    |                        |
799 +--------------+---------------------------+------------------------+
800 | 0b000        | {Notes 1, 2}              | Disable ECN            |
801 | 0b001        | {Notes 2, 3}              | 5                      |
802 | 0b010        | Not-ECT                   | 5                      |
803 | 0b011        | ECT(1)                    | 5                      |
804 | 0b100        | ECT(0)                    | 5                      |
805 | 0b101        | Currently Unused {Note 3} | 5                      |
806 | 0b110        | CE                        | 6                      |
807 | 0b111        | Currently Unused {Note 3} | 5                      |
808 +--------------+---------------------------+------------------------+
810 Table 3: Meaning of the ACE field on the ACK of the SYN/ACK 812 {Note 1}: If the server is in AccECN mode, the value of zero raises 813 suspicion of zeroing of the ACE field on the path (see 814 Section 3.2.3). 816 {Note 2}: If a server is in AccECN mode, there ought to be no valid 817 case where the ACE field on the last ACK of the 3WHS has a value of 818 0b000 or 0b001. 820 However, in the case where a server that implements AccECN is also 821 using a stateless handshake (termed a SYN cookie) it will not 822 remember whether it entered AccECN mode. Then these two values 823 remind it that it did not enter AccECN mode (see Section 4.1 for 824 details). 826 {Note 3}: If the server is in AccECN mode, these values are Currently 827 Unused but the AccECN server's behaviour is still defined for forward 828 compatibility. 830 3.2.3. Testing for Zeroing of the ACE Field 832 Section 3.2 required the Data Receiver to initialize the r.cep 833 counter to a non-zero value. Therefore, in either direction the 834 initial value of the ACE field ought to be non-zero. 836 If AccECN has been successfully negotiated, the Data Sender SHOULD 837 check the initial value of the ACE field in the first arriving 838 segment with SYN=0.
If the initial value of the ACE field is zero 839 (0b000), the Data Sender MUST disable sending ECN-capable packets for 840 the remainder of the half-connection by setting the IP/ECN field in 841 all subsequent packets to Not-ECT. 843 For example, the server checks the ACK of the SYN/ACK or the first 844 data segment from the client, while the client checks the first data 845 segment from the server. More precisely, the "first segment with 846 SYN=0" is defined as: the segment with SYN=0 that i) acknowledges 847 sequence space at least covering the initial sequence number (ISN) 848 plus 1; and ii) arrives before any other segments with SYN=0 so it is 849 unlikely to be a retransmission. If no such segment arrives (e.g. 850 because it is lost and the ISN is first acknowledged by a subsequent 851 segment), no test for invalid initialization can be conducted, and 852 the half-connection will continue in AccECN mode. 854 Note that the Data Sender MUST NOT test whether the arriving counter 855 in the initial ACE field has been initialized to a specific valid 856 value - the above check solely tests whether the ACE fields have been 857 incorrectly zeroed. This allows hosts to use different initial 858 values as an additional signalling channel in future. 860 3.2.4. Testing for Mangling of the IP/ECN Field 862 The value of the ACE field on the SYN/ACK indicates the value of the 863 IP/ECN field when the SYN arrived at the server. The client can 864 compare this with how it originally set the IP/ECN field on the SYN. 865 If this comparison implies an unsafe transition (see below) of the 866 IP/ECN field, for the remainder of the connection the client MUST NOT 867 send ECN-capable packets, but it MUST continue to feed back any ECN 868 markings on arriving packets. 870 The value of the ACE field on the last ACK of the 3WHS indicates the 871 value of the IP/ECN field when the SYN/ACK arrived at the client. 
872 The server can compare this with how it originally set the IP/ECN 873 field on the SYN/ACK. If this comparison implies an unsafe 874 transition of the IP/ECN field, for the remainder of the connection 875 the server MUST NOT send ECN-capable packets, but it MUST continue to 876 feed back any ECN markings on arriving packets. 878 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 879 count of CE marks is still eventually delivered reliably). If this 880 ACK does not arrive, the server can continue to send ECN-capable 881 packets without having tested for mangling of the IP/ECN field on the 882 SYN/ACK. Experiments with AccECN deployment will assess whether this 883 limitation has any effect in practice. 885 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 886 repeated here for convenience:
888 o the not-ECT codepoint changes;
890 o either ECT codepoint transitions to not-ECT;
892 o the CE codepoint changes.
894 RFC 3168 says that a router that changes ECT to not-ECT is invalid 895 but safe. However, from a host's viewpoint, this transition is 896 unsafe because it could be the result of two transitions at different 897 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 898 This scenario could well happen where an ECN-enabled home router 899 congests its upstream mobile broadband bottleneck link, then the 900 ingress to the mobile network clears the ECN field [Mandalari18]. 902 The above fall-back behaviours are necessary in case mangling of the 903 IP/ECN field is asymmetric, which is currently common over some 904 mobile networks [Mandalari18]. Then one end might see no unsafe 905 transition and continue sending ECN-capable packets, while the other 906 end sees an unsafe transition and stops sending ECN-capable packets. 908 3.2.5.
Safety against Ambiguity of the ACE Field 910 If too many CE-marked segments are acknowledged at once, or if a long 911 run of ACKs is lost, the 3-bit counter in the ACE field might have 912 cycled between two ACKs arriving at the Data Sender. 914 Therefore an AccECN Data Receiver SHOULD immediately send an ACK once 915 'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be 916 2 and MUST be no greater than 6. 918 If the Data Sender has not received AccECN TCP Options to give it 919 more dependable information, and it detects that the ACE field could 920 have cycled under the prevailing conditions, it SHOULD conservatively 921 assume that the counter did cycle. It can detect if the counter 922 could have cycled by using the jump in the acknowledgement number 923 since the last ACK to calculate or estimate how many segments could 924 have been acknowledged. An example algorithm to implement this 925 policy is given in Appendix A.2. An implementer MAY develop an 926 alternative algorithm as long as it satisfies these requirements. 928 If missing acknowledgement numbers arrive later (reordering) and 929 prove that the counter did not cycle, the Data Sender MAY attempt to 930 neutralise the effect of any action it took based on a conservative 931 assumption that it later found to be incorrect. 933 3.2.6. The AccECN Option 935 The AccECN Option is defined as shown below in Figure 3. It consists 936 of three 24-bit fields that provide the 24 least significant bits of 937 the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E' 938 of each field name stands for 'Echo'. 
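The two safety rules above can be illustrated with a short Python sketch. It is not the example algorithm of Appendix A.2; it is a simplified stand-in written for this illustration. It assumes the caller already has an estimate of how many segments the latest ACK newly acknowledged (newly_acked_pkts), and all function names are hypothetical. The conservative estimate is the largest number of CE marks that does not exceed that count and is still consistent with the ACE value modulo 8.

```python
ACE_MODULUS = 8  # the ACE field is a 3-bit counter


def should_ack_immediately(ce_marks_since_last_ack, n=2):
    """Receiver side: ACK at once when 'n' CE marks have arrived since
    the previous ACK. The text recommends n = 2 and forbids n > 6."""
    if n > 6:
        raise ValueError("n must be no greater than 6")
    return ce_marks_since_last_ack >= n


def ace_delta(ace, s_cep):
    """Smallest increase of the CE packet count that explains 'ace'."""
    return (ace - s_cep) % ACE_MODULUS


def conservative_ce_delta(ace, s_cep, newly_acked_pkts):
    """Sender side: assume the ACE counter cycled as often as the
    number of newly acknowledged segments allows."""
    delta = ace_delta(ace, s_cep)
    if newly_acked_pkts <= delta:
        # CE marks on packets without data can make delta exceed the
        # number of acknowledged data segments; no wrap is assumed.
        return delta
    # Largest value <= newly_acked_pkts that is congruent to delta mod 8.
    return newly_acked_pkts - ((newly_acked_pkts - delta) % ACE_MODULUS)
```

For instance, with s.cep = 5 and an arriving ACE of 6, an ACK covering 2 segments yields 1 CE mark, whereas an ACK covering 10 segments is conservatively read as 9 CE marks.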
940  0                   1                   2                   3
941  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
942 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
943 |  Kind = TBD1  |  Length = 11  |          EE0B field           |
944 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
945 | EE0B (cont'd) |                  ECEB field                   |
946 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
947 |                  EE1B field                   |
948 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
950 Figure 3: The AccECN Option 952 When a Data Receiver sends an AccECN Option, it MUST set the Kind 953 field to TBD1, which is registered in Section 6 as a new TCP option 954 Kind called AccECN. An experimental TCP option with Kind=254 MAY be 955 used for initial experiments, with magic number 0xACCE. 957 Appendix A.1 gives an example algorithm for the Data Receiver to 958 encode its byte counters into the AccECN Option, and for the Data 959 Sender to decode the AccECN Option fields into its byte counters. 961 Note that there is no field to feed back Not-ECT bytes. Nonetheless 962 an algorithm for the Data Sender to calculate the number of payload 963 bytes received as Not-ECT is given in Appendix A.5. 965 Whenever a Data Receiver sends an AccECN Option, the rules in 966 Section 3.2.8 expect it to always send a full-length option. To cope 967 with option space limitations, it can omit unchanged fields from the 968 tail of the option, as long as it preserves the order of the 969 remaining fields and includes any field that has changed. The length 970 field MUST indicate which fields are present as follows:
972 Length=11: EE0B, ECEB, EE1B
974 Length=8: EE0B, ECEB
976 Length=5: EE0B
978 Length=2: (empty)
980 The empty option of Length=2 is provided to allow for a case where an 981 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 982 but there is very limited space for the option. For initial 983 experiments, the Length field MUST be 2 greater to accommodate the 984 16-bit magic number.
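The option layout and the length rule above can be sketched in Python for the experimental form of the option (Kind=254 with magic number 0xACCE, so every Length is 2 greater than the values listed). This is an illustration only, not the example algorithm of Appendix A.1. Placing the magic number immediately after the Length octet follows the usual convention for shared experimental TCP options and is an assumption here, as are network byte order for the 24-bit fields and all function and constant names. For a Length that does not match one of the listed values, the decoder keeps only the whole 3-octet fields that fit.

```python
EXPERIMENTAL_KIND = 254      # experimental TCP option Kind named in the text
MAGIC = bytes([0xAC, 0xCE])  # 16-bit magic number 0xACCE
FIELD_NAMES = ("ee0b", "eceb", "ee1b")  # order of fields in Figure 3


def encode_option(e0b, ceb, e1b, num_fields=3):
    """Build an experimental AccECN Option carrying the first
    'num_fields' fields; fields are only dropped from the tail."""
    if not 0 <= num_fields <= 3:
        raise ValueError("num_fields must be between 0 and 3")
    counters = (e0b, ceb, e1b)[:num_fields]
    # Each field carries the 24 least significant bits of its counter.
    body = b"".join((c & 0xFFFFFF).to_bytes(3, "big") for c in counters)
    length = 2 + len(MAGIC) + len(body)
    return bytes([EXPERIMENTAL_KIND, length]) + MAGIC + body


def decode_option(option):
    """Return the fields present in an experimental AccECN Option."""
    if len(option) < 4 or option[0] != EXPERIMENTAL_KIND \
            or option[2:4] != MAGIC:
        raise ValueError("not an experimental AccECN Option")
    body = option[4:option[1]]
    count = min(len(body) // 3, 3)  # whole 3-octet fields only
    return {
        FIELD_NAMES[i]: int.from_bytes(body[3 * i:3 * i + 3], "big")
        for i in range(count)
    }
```

A full-length experimental option is therefore 13 octets long, and an option truncated to EE0B alone is 7 octets long.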
986 All implementations of a Data Sender that read any AccECN Option MUST 987 be able to read in AccECN Options of any of the above lengths. If 988 the AccECN Option is of any other length, implementations MUST use 989 those whole 3 octet fields that fit within the length and ignore the 990 remainder of the option. 992 The AccECN Option has to be optional to implement, because both 993 sender and receiver have to be able to cope without the option anyway 994 - in cases where it does not traverse a network path. It is 995 RECOMMENDED to implement both sending and receiving of the AccECN 996 Option. If sending of the AccECN Option is implemented, the fall- 997 backs described in this document will need to be implemented as well 998 (unless solely for a controlled environment where path traversal is 999 not considered a problem). Even if a developer does not implement 1000 sending of the AccECN Option, it is RECOMMENDED that they still 1001 implement logic to receive and understand any AccECN Options sent by 1002 remote peers. 1004 If a Data Receiver intends to send the AccECN Option at any time 1005 during the rest of the connection it is strongly recommended to also 1006 test path traversal of the AccECN Option as specified in the next 1007 section. 1009 3.2.7. Path Traversal of the AccECN Option 1011 3.2.7.1. Testing the AccECN Option during the Handshake 1013 The TCP client MUST NOT include the AccECN TCP Option on the SYN. A 1014 fall-back strategy for the loss of the SYN (possibly due to middlebox 1015 interference) is specified in Section 3.1.3. 1017 A TCP server that confirms its support for AccECN (in response to an 1018 AccECN SYN from the client as described in Section 3.1) SHOULD 1019 include an AccECN TCP Option in the SYN/ACK. 1021 A TCP client that has successfully negotiated AccECN SHOULD include 1022 an AccECN Option in the first ACK at the end of the 3WHS. 
However, 1023 this first ACK is not delivered reliably, so the TCP client SHOULD 1024 also include an AccECN Option on the first data segment it sends (if 1025 it ever sends one). 1027 A host MAY omit the AccECN Option in any of these three cases 1028 if it has cached knowledge that the packet would be likely to be 1029 blocked on the path to the other host if it included an AccECN 1030 Option. 1032 3.2.7.2. Testing for Loss of Packets Carrying the AccECN Option 1034 If after the normal TCP timeout the TCP server has not received an 1035 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1036 lost, e.g. due to congestion, or a middlebox might be blocking the 1037 AccECN Option. To expedite connection setup, the TCP server SHOULD 1038 retransmit the SYN/ACK repeating the AE, CWR and ECE TCP flags on the 1039 original SYN/ACK but with no AccECN Option. If this retransmission 1040 times out, to expedite connection setup, the TCP server SHOULD 1041 disable AccECN and ECN for this connection by retransmitting the SYN/ 1042 ACK with AE=CWR=ECE=0 and no AccECN Option. Implementers MAY use 1043 other fall-back strategies if they are found to be more effective 1044 (e.g. falling back to classic ECN feedback on the first 1045 retransmission; retrying the AccECN Option for a second time before 1046 fall-back (most appropriate during high levels of congestion); or 1047 falling back to classic ECN feedback rather than non-ECN on the third 1048 retransmission). 1050 If the TCP client detects that the first data segment it sent with 1051 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1052 on the retransmission. Again, implementers MAY use other fall-back 1053 strategies such as attempting to retransmit a second segment with the 1054 AccECN Option before fall-back, and/or caching whether the AccECN 1055 Option is blocked for subsequent connections.
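The default server fall-back for an unacknowledged SYN/ACK, as recommended above, can be summarised in a few lines of Python. This sketch shows only the default sequence; the alternative strategies that implementers are allowed to use are not modelled. The function name synack_for_attempt and the attempt numbering (0 for the original SYN/ACK) are assumptions made for illustration.

```python
def synack_for_attempt(attempt, accecn_flags):
    """Return ((AE, CWR, ECE), include_accecn_option) for a SYN/ACK.

    attempt 0 is the original SYN/ACK, attempt 1 the first
    retransmission, and so on. accecn_flags are the flags used on the
    original SYN/ACK to confirm AccECN.
    """
    if attempt == 0:
        # Original SYN/ACK: confirm AccECN and test the option.
        return accecn_flags, True
    if attempt == 1:
        # First retransmission: same flags, no AccECN Option.
        return accecn_flags, False
    # Later retransmissions: disable AccECN and ECN altogether.
    return (0, 0, 0), False
```

For a server that saw Not-ECT on the SYN, the sequence is (0, 1, 0) with the option, then (0, 1, 0) without it, then (0, 0, 0) without it.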
1057 Either host MAY include the AccECN Option in a subsequent segment to 1058 retest whether the AccECN Option can traverse the path. 1060 If the TCP server receives a second SYN with a request for AccECN 1061 support, it should resend the SYN/ACK, again confirming its support 1062 for AccECN, but this time without the AccECN Option. This approach 1063 rules out any interference by middleboxes that may drop packets with 1064 unknown options, even though it is more likely that the SYN/ACK would 1065 have been lost due to congestion. The TCP server MAY try to send 1066 another packet with the AccECN Option at a later point during the 1067 connection but should monitor if that packet got lost as well, in 1068 which case it SHOULD disable the sending of the AccECN Option for 1069 this half-connection. 1071 Similarly, an AccECN end-point MAY separately memorize which data 1072 packets carried an AccECN Option and disable the sending of AccECN 1073 Options if the loss probability of those packets is significantly 1074 higher than that of all other data packets in the same connection. 1076 3.2.7.3. Testing for Stripping of the AccECN Option 1078 If the TCP client has successfully negotiated AccECN but does not 1079 receive an AccECN Option on the SYN/ACK, it switches into a mode that 1080 assumes that the AccECN Option is not available for this half 1081 connection. 1083 Similarly, if the TCP server has successfully negotiated AccECN but 1084 does not receive an AccECN Option on the first segment that 1085 acknowledges sequence space at least covering the ISN, it switches 1086 into a mode that assumes that the AccECN Option is not available for 1087 this half connection. 1089 While a host is in this mode that assumes incoming AccECN Options are 1090 not available, it MUST adopt the conservative interpretation of the 1091 ACE field discussed in Section 3.2.5. 
However, it cannot make any 1092 assumption about support of outgoing AccECN Options on the other half 1093 connection, so it SHOULD continue to send the AccECN Option itself 1094 (unless it has established that sending the AccECN Option is causing 1095 packets to be blocked as in Section 3.2.7.2). 1097 If a host is in the mode that assumes incoming AccECN Options are not 1098 available, but it receives an AccECN Option at any later point during 1099 the connection, this clearly indicates that the AccECN Option is not 1100 blocked on the respective path, and the AccECN endpoint MAY switch 1101 out of the mode that assumes the AccECN Option is not available for 1102 this half connection. 1104 3.2.7.4. Test for Zeroing of the AccECN Option 1106 For a related test for invalid initialization of the ACE field, see 1107 Section 3.2.3 1109 Section 3.2 required the Data Receiver to initialize the r.e0b 1110 counter to a non-zero value. Therefore, in either direction the 1111 initial value of the EE0B field in the AccECN Option (if one exists) 1112 ought to be non-zero. If AccECN has been negotiated: 1114 o the TCP server MAY check the initial value of the EE0B field in 1115 the first segment that acknowledges sequence space that at least 1116 covers the ISN plus 1. If the initial value of the EE0B field is 1117 zero, the server will switch into a mode that ignores the AccECN 1118 Option for this half connection. 1120 o the TCP client MAY check the initial value of the EE0B field on 1121 the SYN/ACK. If the initial value of the EE0B field is zero, the 1122 client will switch into a mode that ignores the AccECN Option for 1123 this half connection. 1125 While a host is in the mode that ignores the AccECN Option it MUST 1126 adopt the conservative interpretation of the ACE field discussed in 1127 Section 3.2.5. 
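The outcome of the stripping test (Section 3.2.7.3) and the optional zeroing test (Section 3.2.7.4) can be pictured as one small decision, sketched below in Python. The state names and both function names are hypothetical. The sketch assumes the host chooses to run the zeroing check, which the text leaves optional, and it is applied to the first relevant segment only (the SYN/ACK at the client, or the first segment acknowledging the ISN plus 1 at the server).

```python
def incoming_option_state(ee0b):
    """Classify incoming AccECN Options for one half connection.

    ee0b is the EE0B value on the first relevant segment, or None if
    that segment carried no AccECN Option.
    """
    if ee0b is None:
        return "unavailable"  # option missing, perhaps stripped
    if ee0b == 0:
        return "ignored"      # initial value incorrectly zeroed
    return "usable"


def needs_conservative_ace(state):
    """Both fall-back modes require the conservative interpretation
    of the ACE field from Section 3.2.5."""
    return state != "usable"
```

Any non-zero initial EE0B value is treated as usable, because the tests check only for zeroing and not for a specific initial value.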
1129 Note that the Data Sender MUST NOT test whether the arriving byte 1130 counters in the initial AccECN Option have been initialized to 1131 specific valid values - the above checks solely test whether these 1132 fields have been incorrectly zeroed. This allows hosts to use 1133 different initial values as an additional signalling channel in 1134 future. Also note that the initial value of either field might be 1135 greater than its expected initial value, because the counters might 1136 already have been incremented. Nonetheless, the initial values of 1137 the counters have been chosen so that they cannot wrap to zero on 1138 these initial segments. 1140 3.2.7.5. Consistency between AccECN Feedback Fields 1142 When the AccECN Option is available it supplements but does not 1143 replace the ACE field. An endpoint using AccECN feedback MUST always 1144 consider the information provided in the ACE field whether or not the 1145 AccECN Option is also available. 1147 If the AccECN option is present, the s.cep counter might increase 1148 while the s.ceb counter does not (e.g. due to a CE-marked control 1149 packet). The sender's response to such a situation is out of scope, 1150 and needs to be dealt with in a specification that uses ECN-capable 1151 control packets. Theoretically, this situation could also occur if a 1152 middlebox mangled the AccECN Option but not the ACE field. However, 1153 the Data Sender has to assume that the integrity of the AccECN Option 1154 is sound, based on the above test of the well-known initial values 1155 and optionally other integrity tests (Section 4.3). 1157 If either end-point detects that the s.ceb counter has increased but 1158 the s.cep has not (and by testing ACK coverage it is certain how much 1159 the ACE field has wrapped), this invalid protocol transition has to 1160 be due to some form of feedback mangling. 
So, the Data Sender MUST 1161 disable sending ECN-capable packets for the remainder of the half- 1162 connection by setting the IP/ECN field in all subsequent packets to 1163 Not-ECT. 1165 3.2.8. Usage of the AccECN TCP Option 1167 The following rules determine when a Data Receiver in AccECN mode 1168 sends the AccECN TCP Option, and which fields to include: 1170 Change-Triggered ACKs: If an arriving packet increments a different 1171 byte counter to that incremented by the previous packet, the Data 1172 Receiver MUST immediately send an ACK with an AccECN Option, 1173 without waiting for the next delayed ACK (this is in addition to 1174 the safety recommendation in Section 3.2.5 against ambiguity of 1175 the ACE field). 1177 This is stated as a "MUST" so that the data sender can rely on 1178 change-triggered ACKs to detect transitions right from the very 1179 start of a flow, without first having to detect whether the 1180 receiver complies. A concern has been raised that certain offload 1181 hardware needed for high performance might not be able to support 1182 change-triggered ACKs, although high performance protocols such as 1183 DCTCP successfully use change-triggered ACKs. One possible 1184 experimental compromise would be for the receiver to heuristically 1185 detect whether the sender is in slow-start, then to implement 1186 change-triggered ACKs in software while the sender is in slow- 1187 start, and offload to hardware otherwise. If the operator 1188 disables change-triggered ACKs, whether partially like this or 1189 otherwise, the operator will also be responsible for ensuring a 1190 co-ordinated sender algorithm is deployed; 1192 Continual Repetition: Otherwise, if arriving packets continue to 1193 increment the same byte counter, the Data Receiver can include an 1194 AccECN Option on most or all (delayed) ACKs, but it does not have 1195 to. 
If option space is limited on a particular ACK, the Data 1196 Receiver MUST give precedence to SACK information about loss. It 1197 SHOULD include an AccECN Option if the r.ceb counter has 1198 incremented and it MAY include an AccECN Option if r.e0b or 1199 r.e1b has incremented; 1201 Full-Length Options Preferred: It SHOULD always use full-length 1202 AccECN Options. It MAY use shorter AccECN Options if space is 1203 limited, but it MUST include the counter(s) that have incremented 1204 since the previous AccECN Option and it MUST only truncate fields 1205 from the right-hand tail of the option to preserve the order of 1206 the remaining fields (see Section 3.2.6); 1208 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1209 length AccECN TCP Option on at least three ACKs per RTT, or on all 1210 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1211 example algorithm that satisfies this requirement). 1213 The following example series of arriving IP/ECN fields illustrates 1214 when a Data Receiver will emit an ACK if it is using a delayed ACK 1215 factor of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> 1216 ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK. 1218 For the avoidance of doubt, the change-triggered ACK mechanism is 1219 deliberately worded to ignore the arrival of a control packet with no 1220 payload, which therefore does not alter any byte counters, because it 1221 is important that TCP does not acknowledge pure ACKs. The change- 1222 triggered ACK approach can lead to some additional ACKs but it feeds 1223 back the timing and the order in which ECN marks are received with 1224 minimal additional complexity. If CE marks are infrequent, or 1225 there are multiple marks in a row, the additional load will be low. 1226 Other marking patterns could increase the load significantly. 1227 Investigating the additional load is a goal of the proposed 1228 experiment.
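The example series above can be reproduced with a small Python simulation of change-triggered ACKs combined with delayed ACKs. It is a sketch under simplifying assumptions: every arriving packet carries payload and an ECT or CE marking (so each one increments a byte counter), the first packet counts as a change, and timers are ignored. The function name ack_decisions is hypothetical.

```python
def ack_decisions(ecn_fields, delayed_ack_factor=2):
    """Return, for each arriving data packet, whether an ACK is sent.

    ecn_fields is the sequence of IP/ECN field values on the arriving
    packets, e.g. 0b01, 0b10 or 0b11.
    """
    decisions = []
    unacked = 0      # data packets received since the last ACK
    previous = None  # IP/ECN field of the previous data packet
    for ecn in ecn_fields:
        unacked += 1
        # Change-triggered ACK: this packet increments a different
        # byte counter from the previous packet.
        changed = ecn != previous
        send = changed or unacked >= delayed_ack_factor
        if send:
            unacked = 0
        decisions.append(send)
        previous = ecn
    return decisions
```

Feeding in the series 01, 01, 01, 10, 10, 01, 01, 11, 01 gives ACKs after the 1st, 3rd, 4th, 6th, 8th and 9th packets, which matches the example in the text.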
1230 Implementation note: sending an AccECN Option each time a different 1231 counter changes and including a full-length AccECN Option on every 1232 delayed ACK will satisfy the requirements described above and might 1233 be the easiest implementation, as long as sufficient space is 1234 available in each ACK (in total and in the option space). 1236 Appendix A.3 gives an example algorithm to estimate the number of 1237 marked bytes from the ACE field alone, if the AccECN Option is not 1238 available. 1240 If a host has determined that segments with the AccECN Option always 1241 seem to be discarded somewhere along the path, it is no longer 1242 obliged to follow the above rules. 1244 3.3. Requirements for TCP Proxies, Offload Engines and other 1245 Middleboxes on AccECN Compliance 1247 A large class of middleboxes split TCP connections. Such a middlebox 1248 would be compliant with the AccECN protocol if the TCP implementation 1249 on each side complied with the present AccECN specification and each 1250 side negotiated AccECN independently of the other side. 1252 Another large class of middleboxes intervenes to some degree at the 1253 transport layer, but attempts to be transparent (invisible) to the 1254 end-to-end connection. A subset of this class of middleboxes 1255 attempts to `normalise' the TCP wire protocol by checking that all 1256 values in header fields comply with a rather narrow interpretation of 1257 the TCP specifications. To comply with the present AccECN 1258 specification, such a middlebox MUST NOT change the ACE field or the 1259 AccECN Option and it SHOULD preserve the timing of each ACK (for 1260 example, if it coalesced ACKs it would not be AccECN-compliant) as 1261 these can be used by the Data Sender to infer further information 1262 about the path congestion level. 
A middlebox claiming to be 1263 transparent at the transport layer MUST forward the AccECN TCP Option 1264 unaltered, whether or not the length value matches one of those 1265 specified in Section 3.2.6, and whether or not the initial values of 1266 the byte-counter fields are correct. This is because blocking 1267 apparently invalid values does not improve security (because AccECN 1268 hosts are required to ignore invalid values anyway), while it 1269 prevents the standardised set of values being extended in future 1270 (because outdated normalisers would block updated hosts from using 1271 the extended AccECN standard). 1273 Hardware to offload certain TCP processing represents another large 1274 class of middleboxes, even though it is often a function of a host's 1275 network interface and rarely in its own 'box'. Leeway has been 1276 allowed in the present AccECN specification in the expectation that 1277 offload hardware could comply and still serve its function. 1278 Nonetheless, such hardware SHOULD also preserve the timing of each 1279 ACK (for example, if it coalesced ACKs it would not be AccECN- 1280 compliant). 1282 The ACE field changes with every received CE marking, so today's 1283 receive offloading could lead to many interrupts in high congestion 1284 situations. Although that would be useful (because congestion 1285 information is received sooner), it could also significantly increase 1286 processor load, particularly in scenarios such as DCTCP or L4S where 1287 the marking rate is generally higher. 1289 In data centres it has been fortunate for offload hardware that 1290 DCTCP-style feedback changes less often when there are long sequences 1291 of CE marks, which is more common with a step marking threshold. In 1292 order to enable DCTCP to improve its responsiveness, DCs will need to 1293 move beyond step marking. Before this can happen, offload hardware 1294 will have to explicitly address the variability of ECN feedback. 
1296 ECN encodes a varying signal in the ACK stream, so it is inevitable 1297 that offload hardware will ultimately need to handle any form of ECN 1298 feedback exceptionally. The purpose of working towards standardized 1299 TCP ECN feedback is to reduce the risk for hardware developers, who 1300 will have to choose which scheme is likely to become dominant. 1302 4. Interaction with Other TCP Variants 1304 This section is informative, not normative. 1306 4.1. Compatibility with SYN Cookies 1308 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1309 protect itself from SYN flooding attacks. It places minimal commonly 1310 used connection state in the SYN/ACK, and deliberately does not hold 1311 any state while waiting for the subsequent ACK (e.g. it closes the 1312 thread). Therefore it cannot record the fact that it entered AccECN 1313 mode for both half-connections. Indeed, it cannot even remember 1314 whether it negotiated the use of classic ECN [RFC3168]. 1316 Nonetheless, such a server can determine that it negotiated AccECN as 1317 follows. If a TCP server using SYN Cookies supports AccECN and if it 1318 receives a pure ACK that acknowledges an ISN that is a valid SYN 1319 cookie, and if the ACK contains an ACE field with the value 0b010 to 1320 0b111 (decimal 2 to 7), it can assume that: 1322 o the TCP client must have requested AccECN support on the SYN 1324 o it (the server) must have confirmed that it supported AccECN 1326 Therefore the server can switch itself into AccECN mode, and continue 1327 as if it had never forgotten that it switched itself into AccECN mode 1328 earlier. 1330 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1331 with the value 0b000 or 0b001, these values indicate that the client 1332 did not request support for AccECN and therefore the server does not 1333 enter AccECN mode for this connection. 
Further, 0b001 on the ACK 1334 implies that the server sent an ECN-capable SYN/ACK, which was marked 1335 CE in the network, and the non-AccECN client fed this back by setting 1336 ECE on the ACK of the SYN/ACK. 1338 4.2. Compatibility with Other TCP Options and Experiments 1340 AccECN is compatible (at least on paper) with the most commonly used 1341 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1342 also compatible with the recent promising experimental TCP options 1343 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1344 AccECN is friendly to all these protocols, because space for TCP 1345 options is particularly scarce on the SYN, where AccECN consumes zero 1346 additional header space. 1348 When option space is under pressure from other options, Section 3.2.8 1349 provides guidance on how important it is to send an AccECN Option and 1350 whether it needs to be a full-length option. 1352 4.3. Compatibility with Feedback Integrity Mechanisms 1354 Three alternative mechanisms are available to assure the integrity of 1355 ECN and/or loss signals. AccECN is compatible with any of these 1356 approaches: 1358 o The Data Sender can test the integrity of the receiver's ECN (or 1359 loss) feedback by occasionally setting the IP-ECN field to a value 1360 normally only set by the network (and/or deliberately leaving a 1361 sequence number gap). Then it can test whether the Data 1362 Receiver's feedback faithfully reports what it expects (similar to 1363 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1364 [RFC3540], this approach does not waste the ECT(1) codepoint in 1365 the IP header, it does not require standardisation and it does not 1366 rely on misbehaving receivers volunteering to reveal feedback 1367 information that allows them to be detected. However, setting the 1368 CE mark by the sender might conceal actual congestion feedback 1369 from the network and should therefore only be done sparsely. 
1371 o Networks generate congestion signals when they are becoming 1372 congested, so networks are more likely than Data Senders to be 1373 concerned about the integrity of the receiver's feedback of these 1374 signals. A network can enforce a congestion response to its ECN 1375 markings (or packet losses) using congestion exposure (ConEx) 1376 audit [RFC7713]. Whether the receiver or a downstream network is 1377 suppressing congestion feedback or the sender is unresponsive to 1378 the feedback, or both, ConEx audit can neutralise any advantage 1379 that any of these three parties would otherwise gain. 1381 ConEx is a change to the Data Sender that is most useful when 1382 combined with AccECN. Without AccECN, the ConEx behaviour of a 1383 Data Sender would have to be more conservative than would be 1384 necessary if it had the accurate feedback of AccECN. 1386 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1387 detect any tampering with AccECN feedback between the Data 1388 Receiver and the Data Sender (whether malicious or accidental). 1389 The AccECN fields are immutable end-to-end, so they are amenable 1390 to TCP-AO protection, which covers TCP options by default. 1391 However, TCP-AO is often too brittle to use on many end-to-end 1392 paths, where middleboxes can make verification fail in their 1393 attempts to improve performance or security, e.g. by 1394 resegmentation or shifting the sequence space. 1396 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1397 of congestion feedback. With minor changes AccECN could be optimised 1398 for the possibility that the ECT(1) codepoint might be used as an ECN 1399 Nonce. However, given RFC 3540 has been reclassified as historic, 1400 the AccECN design has been generalised so that it ought to be able to 1401 support other possible uses of the ECT(1) codepoint, such as a lower 1402 severity or a more instant congestion signal than CE. 1404 5. 
Protocol Properties

   This section is informative, not normative.  It describes how well
   the protocol satisfies the agreed requirements for a more accurate
   ECN feedback protocol [RFC7560].

   Accuracy:  From each ACK, the Data Sender can infer the number of new
      CE marked segments since the previous ACK.  This provides better
      accuracy on CE feedback than classic ECN.  In addition, if the
      AccECN Option is present (not blocked by the network path), the
      numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

   Overhead:  The AccECN scheme is divided into two parts.  The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header.  The supplementary part adds an additional TCP option
      consuming up to 11 bytes.  However, no TCP option is consumed in
      the SYN.

   Ordering:  The order in which marks arrive at the Data Receiver is
      preserved in AccECN feedback, because the Data Receiver is
      expected to send an ACK immediately whenever a different mark
      arrives.

   Timeliness:  While the same ECN markings are arriving continually at
      the Data Receiver, it can defer ACKs as TCP does normally, but it
      will immediately send an ACK as soon as a different ECN marking
      arrives.

   Timeliness vs Overhead:  Change-Triggered ACKs are intended to enable
      latency-sensitive uses of ECN feedback by capturing the timing of
      transitions but not wasting resources while the state of the
      signalling system is stable.  The receiver can control how
      frequently it sends the AccECN TCP Option and therefore it can
      control the overhead induced by AccECN.

   Resilience:  All information is provided based on counters.
      Therefore, if ACKs are lost, the counters on the first ACK
      following the losses allow the Data Sender to immediately recover
      the number of ECN markings that it missed.

   Resilience against Bias:  Because feedback is based on repetition of
      counters, random losses do not remove any information, they only
      delay it.  Therefore, even though some ACKs are change-triggered,
      random losses will not alter the proportions of the different ECN
      markings in the feedback.

   Resilience vs Overhead:  If space is limited in some segments (e.g.
      because more options are needed on some segments, such as the SACK
      option after loss), the Data Receiver can send AccECN Options less
      frequently or truncate fields that have not changed, usually down
      to as little as 5 bytes.  However, it has to send a full-sized
      AccECN Option at least three times per RTT, which the Data Sender
      can rely on as a regular beacon or checkpoint.

   Resilience vs Timeliness and Ordering:  Ordering information and the
      timing of transitions cannot be communicated in three cases: i)
      during ACK loss; ii) if something on the path strips the AccECN
      Option; or iii) if the Data Receiver is unable to support
      Change-Triggered ACKs.

   Complexity:  An AccECN implementation solely involves simple counter
      increments, some modulo arithmetic to communicate the least
      significant bits and allow for wrap, and some heuristics for
      safety against fields cycling due to prolonged periods of ACK
      loss.  Each host needs to maintain eight additional counters.  The
      hosts have to apply some additional tests to detect tampering by
      middleboxes, but in general the protocol is simple to understand,
      simple to implement and requires few cycles per packet to execute.

   Integrity:  AccECN is compatible with at least three approaches that
      can assure the integrity of ECN feedback.  If the AccECN Option is
      stripped, the resolution of the feedback is degraded, but the
      integrity of this degraded feedback can still be assured.

   Backward Compatibility:  If only one endpoint supports the AccECN
      scheme, it will fall back to the most advanced ECN feedback scheme
      supported by the other end.

   Backward Compatibility:  If the AccECN Option is stripped by a
      middlebox, AccECN still provides basic congestion feedback in the
      ACE field.  Further, AccECN can be used to detect mangling of the
      IP ECN field; mangling of the TCP ECN flags; blocking of
      ECT-marked segments; and blocking of segments carrying the AccECN
      Option.  It can detect these conditions during TCP's 3WHS so that
      it can fall back to operation without ECN and/or operation without
      the AccECN Option.

   Forward Compatibility:  The behaviour of endpoints and middleboxes is
      carefully defined for all reserved or currently unused codepoints
      in the scheme, to ensure that any blocking of anomalous values is
      always at least under reversible policy control.

6.  IANA Considerations

   This document reassigns bit 7 of the TCP header flags to the AccECN
   experiment.  This bit was previously called the Nonce Sum (NS) flag
   [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311].
   The flag will now be defined as:

   +-----+-------------------+-----------+
   | Bit | Name              | Reference |
   +-----+-------------------+-----------+
   | 7   | AE (Accurate ECN) | RFC XXXX  |
   +-----+-------------------+-----------+

   [TO BE REMOVED: IANA is requested to update the existing entry in the
   Transmission Control Protocol (TCP) Header Flags registration
   (https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1)
   for Bit 7 to "AE (Accurate ECN), previously used as NS (Nonce Sum) by
   [RFC3540], which is now Historic [RFC8311]" and change the reference
   to this RFC-to-be instead of RFC8311.]

   This document also defines a new TCP option for AccECN, assigned a
   value of TBD1 (decimal) from the TCP option space.  This value is
   defined as:

   +------+--------+-----------------------+-----------+
   | Kind | Length | Meaning               | Reference |
   +------+--------+-----------------------+-----------+
   | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
   +------+--------+-----------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

   Early implementation before the IANA allocation MUST follow [RFC6994]
   and use experimental option 254 and magic number 0xACCE (16 bits),
   then migrate to the new option after the allocation.

7.  Security Considerations

   If ever the supplementary part of AccECN based on the new AccECN TCP
   Option is unusable (due for example to middlebox interference) the
   essential part of AccECN's congestion feedback offers only limited
   resilience to long runs of ACK loss (see Section 3.2.5).  These
   problems are unlikely to be due to malicious intervention (because if
   an attacker could strip a TCP option or discard a long run of ACKs it
   could wreak other arbitrary havoc).  However, it would be of concern
   if AccECN's resilience could be indirectly compromised during a
   flooding attack.  AccECN is still considered safe though, because if
   the option is not present, the AccECN Data Sender is then required
   to switch to more conservative assumptions about wrap of congestion
   indication counters (see Section 3.2.5 and Appendix A.2).

   Section 4.1 describes how a TCP server can negotiate AccECN and use
   the SYN cookie method for mitigating SYN flooding attacks.
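   The decision that Section 4.1 describes for a stateless server can be
   sketched as follows.  This Python fragment is only an illustration:
   the function names are hypothetical, and it assumes the caller has
   already verified that the pure ACK acknowledges an ISN that is a
   valid SYN cookie and that the server supports AccECN.

```python
def server_mode_from_cookie_ack(ace: int) -> str:
    """Infer the feedback mode from the ACE field of the ACK of a SYN cookie."""
    if not 0 <= ace <= 0b111:
        raise ValueError("ACE is a 3-bit field")
    if ace >= 0b010:
        # 0b010 to 0b111: the client must have requested AccECN on the SYN
        # and the server must have confirmed it on the SYN/ACK.
        return "AccECN"
    # 0b000 or 0b001: the client did not request AccECN.
    return "not AccECN"


def synack_was_ce_marked(ace: int) -> bool:
    """0b001 means a non-AccECN client set ECE to report a CE-marked SYN/ACK."""
    return ace == 0b001


print(server_mode_from_cookie_ack(0b101))  # AccECN
print(server_mode_from_cookie_ack(0b001))  # not AccECN
```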
1552 There is concern that ECN markings could be altered or suppressed, 1553 particularly because a misbehaving Data Receiver could increase its 1554 own throughput at the expense of others. AccECN is compatible with 1555 the three schemes known to assure the integrity of ECN feedback (see 1556 Section 4.3 for details). If the AccECN Option is stripped by an 1557 incorrectly implemented middlebox, the resolution of the feedback 1558 will be degraded, but the integrity of this degraded information can 1559 still be assured. 1561 There is a potential concern that a receiver could deliberately omit 1562 the AccECN Option pretending that it had been stripped by a 1563 middlebox. No known way can yet be contrived to take advantage of 1564 this downgrade attack, but it is mentioned here in case someone else 1565 can contrive one. 1567 The AccECN protocol is not believed to introduce any new privacy 1568 concerns, because it merely counts and feeds back signals at the 1569 transport layer that had already been visible at the IP layer. 1571 8. Acknowledgements 1573 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 1574 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf 1575 and Michael Tuexen for their input and discussion. The idea of using 1576 the three ECN-related TCP flags as one field for more accurate TCP- 1577 ECN feedback was first introduced in the re-ECN protocol that was the 1578 ancestor of ConEx. 1580 Bob Briscoe was part-funded by the European Community under its 1581 Seventh Framework Programme through the Reducing Internet Transport 1582 Latency (RITE) project (ICT-317700) and through the Trilogy 2 project 1583 (ICT-317756). He was also part-funded by the Research Council of 1584 Norway through the TimeIn project. The views expressed here are 1585 solely those of the authors. 1587 Mirja Kuehlewind was partly supported by the European Commission 1588 under Horizon 2020 grant agreement no. 
688421 Measurement and 1589 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 1590 State Secretariat for Education, Research, and Innovation under 1591 contract no. 15.0268. This support does not imply endorsement. 1593 9. Comments Solicited 1595 Comments and questions are encouraged and very welcome. They can be 1596 addressed to the IETF TCP maintenance and minor modifications working 1597 group mailing list , and/or to the authors. 1599 10. References 1601 10.1. Normative References 1603 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1604 Requirement Levels", BCP 14, RFC 2119, 1605 DOI 10.17487/RFC2119, March 1997, 1606 . 1608 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1609 of Explicit Congestion Notification (ECN) to IP", 1610 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1611 . 1613 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1614 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 1615 . 1617 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 1618 RFC 6994, DOI 10.17487/RFC6994, August 2013, 1619 . 1621 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1622 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1623 May 2017, . 1625 10.2. Informative References 1627 [I-D.ietf-tcpm-generalized-ecn] 1628 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 1629 Congestion Notification (ECN) to TCP Control Packets", 1630 draft-ietf-tcpm-generalized-ecn-03 (work in progress), 1631 October 2018. 1633 [I-D.ietf-tsvwg-l4s-arch] 1634 Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, 1635 Low Loss, Scalable Throughput (L4S) Internet Service: 1636 Architecture", draft-ietf-tsvwg-l4s-arch-03 (work in 1637 progress), October 2018. 1639 [I-D.kuehlewind-tcpm-ecn-fallback] 1640 Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path 1641 Probing and Fallback", draft-kuehlewind-tcpm-ecn- 1642 fallback-01 (work in progress), September 2013. 
1644 [Mandalari18] 1645 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 1646 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 1647 over Mobile", IEEE Communications Magazine , March 2018. 1649 (to appear) 1651 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 1652 Congestion Notification (ECN) Signaling with Nonces", 1653 RFC 3540, DOI 10.17487/RFC3540, June 2003, 1654 . 1656 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 1657 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 1658 . 1660 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 1661 Ramakrishnan, "Adding Explicit Congestion Notification 1662 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 1663 DOI 10.17487/RFC5562, June 2009, 1664 . 1666 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 1667 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 1668 June 2010, . 1670 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 1671 "TCP Extensions for Multipath Operation with Multiple 1672 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 1673 . 1675 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 1676 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 1677 . 1679 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 1680 "Problem Statement and Requirements for Increased Accuracy 1681 in Explicit Congestion Notification (ECN) Feedback", 1682 RFC 7560, DOI 10.17487/RFC7560, August 2015, 1683 . 1685 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1686 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 1687 DOI 10.17487/RFC7713, December 2015, 1688 . 1690 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1691 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1692 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1693 October 2017, . 

   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
              Notification (ECN) Experimentation", RFC 8311,
              DOI 10.17487/RFC8311, January 2018.

   [RFC8511]  Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst,
              "TCP Alternative Backoff with ECN (ABE)", RFC 8511,
              DOI 10.17487/RFC8511, December 2018.

Appendix A.  Example Algorithms

   This appendix is informative, not normative.  It gives example
   algorithms that would satisfy the normative requirements of the
   AccECN protocol.  However, implementers are free to choose other ways
   to implement the requirements.

A.1.  Example Algorithm to Encode/Decode the AccECN Option

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE byte counter r.ceb into the ECEB field within the
   AccECN TCP Option, and how a Data Sender in AccECN mode could decode
   the ECEB field into its byte counter s.ceb.  The other counters for
   bytes marked ECT(0) and ECT(1) in the AccECN Option would be
   similarly encoded and decoded.

   It is assumed that each local byte counter is an unsigned integer
   wider than 24 bits (probably 32 bits), and that the following
   constant has been assigned:

      DIVOPT = 2^24

   Every time a CE marked data segment arrives, the Data Receiver
   increments its local value of r.ceb by the size of the TCP Data.
   Whenever it sends an ACK with the AccECN Option, the value it writes
   into the ECEB field is

      ECEB = r.ceb % DIVOPT

   where '%' is the modulo operator.

   On the arrival of an AccECN Option, the Data Sender uses the TCP
   acknowledgement number and any SACK options to calculate newlyAckedB,
   the amount of new data that the ACK acknowledges in bytes.  If
   newlyAckedB is negative it means that a more up to date ACK has
   already been processed, so this ACK has been superseded and the Data
   Sender has to ignore the AccECN Option.
   Then the Data Sender calculates the minimum difference d.ceb between
   the ECEB field and its local s.ceb counter, using modulo arithmetic
   as follows:

      if (newlyAckedB >= 0) {
          d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
          s.ceb += d.ceb
      }

   For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
   then

      s.ceb % DIVOPT = 1
               d.ceb = (1461 + 2^24 - 1) % 2^24
                     = 1460
               s.ceb = 33,554,433 + 1460
                     = 33,555,893

A.2.  Example Algorithm for Safety Against Long Sequences of ACK Loss

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE packet counter r.cep into the ACE field, and how
   the Data Sender in AccECN mode could decode the ACE field into its
   s.cep counter.  The Data Sender's algorithm includes code to
   heuristically detect a long enough unbroken string of ACK losses that
   could have concealed a cycle of the congestion counter in the ACE
   field of the next ACK to arrive.

   Two variants of the algorithm are given: i) a more conservative
   variant for a Data Sender to use if it detects that the AccECN Option
   is not available (see Section 3.2.5 and Section 3.2.7); and ii) a
   less conservative variant that is feasible when complementary
   information is available from the AccECN Option.

A.2.1.  Safety Algorithm without the AccECN Option

   It is assumed that each local packet counter is a sufficiently sized
   unsigned integer (probably 32 bits) and that the following constant
   has been assigned:

      DIVACE = 2^3

   Every time a CE marked packet arrives, the Data Receiver increments
   its local value of r.cep by 1.  It repeats the same value of ACE in
   every subsequent ACK until the next CE marking arrives, where

      ACE = r.cep % DIVACE.
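   The same modulo encoding serves both the 24-bit ECEB field of
   Appendix A.1 and the 3-bit ACE field.  The following Python sketch
   shows the shared arithmetic.  It is an illustration under stated
   assumptions: the function names encode and min_delta are
   hypothetical, and the checks on newlyAckedB that guard the decoding
   step are left to the caller.

```python
DIVOPT = 2 ** 24  # modulus of the 24-bit byte-counter fields (Appendix A.1)
DIVACE = 2 ** 3   # modulus of the 3-bit ACE field


def encode(counter: int, modulus: int) -> int:
    """Value the Data Receiver writes into the field (least significant bits)."""
    return counter % modulus


def min_delta(field: int, local_counter: int, modulus: int) -> int:
    """Minimum increase of the Data Sender's counter implied by the field."""
    return (field + modulus - (local_counter % modulus)) % modulus


# Worked example from Appendix A.1: s.ceb = 33,554,433 and ECEB = 1461.
s_ceb = 33_554_433
d_ceb = min_delta(1461, s_ceb, DIVOPT)
s_ceb += d_ceb
print(d_ceb, s_ceb)  # 1460 33555893

# Illustrative ACE values (not from the draft): r.cep = 13, s.cep = 11.
print(encode(13, DIVACE), min_delta(encode(13, DIVACE), 11, DIVACE))  # 5 2
```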

   If the Data Sender received an earlier value of the counter that had
   been delayed due to ACK reordering, it might incorrectly calculate
   that the ACE field had wrapped.  Therefore, on the arrival of every
   ACK, the Data Sender uses the TCP acknowledgement number and any SACK
   options to calculate newlyAckedB, the amount of new data that the ACK
   acknowledges.  If newlyAckedB is negative it means that a more up to
   date ACK has already been processed, so this ACK has been superseded
   and the Data Sender has to ignore the AccECN Option.  If newlyAckedB
   is zero, to break the tie the Data Sender could use timestamps (if
   present) to work out newlyAckedT, the amount of new time that the ACK
   acknowledges.  Then the Data Sender calculates the minimum difference
   d.cep between the ACE field and its local s.cep counter, using modulo
   arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
          d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

   Section 3.2.5 requires the Data Sender to assume that the ACE field
   did cycle if it could have cycled under prevailing conditions.  The
   3-bit ACE field in an arriving ACK could have cycled and become
   ambiguous to the Data Sender if a row of ACKs goes missing that
   covers a stream of data long enough to contain 8 or more CE marks.
   We use the word `missing' rather than `lost', because some or all the
   missing ACKs might arrive eventually, but out of order.  Even if some
   of the lost ACKs are piggy-backed on data (i.e. not pure ACKs),
   retransmissions will not repair the lost AccECN information, because
   AccECN requires retransmissions to carry the latest AccECN counters,
   not the original ones.

   The phrase `under prevailing conditions' allows the Data Sender to
   take account of the prevailing size of data segments and the
   prevailing CE marking rate just before the sequence of ACK losses.
   However, we shall start with the simplest algorithm, which assumes
   segments are all full-sized and ultra-conservatively it assumes that
   ECN marking was 100% on the forward path when ACKs on the reverse
   path started to all be dropped.  Specifically, if newlyAckedB is the
   amount of data that an ACK acknowledges since the previous ACK, then
   the Data Sender could assume that this acknowledges newlyAckedPkt
   full-sized segments, where newlyAckedPkt = newlyAckedB/MSS.  Then it
   could assume that the ACE field incremented by

      dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE).

   For example, imagine an ACK acknowledges newlyAckedPkt=9 more
   full-size segments than any previous ACK, and that ACE increments by
   a minimum of 2 CE marks (d.cep=2).  The above formula works out that
   it would still be safe to assume 2 CE marks (because
   9 - ((9-2) % 8) = 2).  However, if ACE increases by a minimum of 2
   but acknowledges 10 full-sized segments, then it would be necessary
   to assume that there could have been 10 CE marks (because
   10 - ((10-2) % 8) = 10).

   Implementers could build in more heuristics to estimate prevailing
   average segment size and prevailing ECN marking.  For instance,
   newlyAckedPkt in the above formula could be replaced with
   newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
   segment size and p is the prevailing ECN marking probability.
   However, ultimately, if TCP's ECN feedback becomes inaccurate it
   still has loss detection to fall back on.  Therefore, it would seem
   safe to implement a simple algorithm, rather than a perfect one.

   The simple algorithm for dSafer.cep above requires no monitoring of
   prevailing conditions and it would still be safe if, for example,
   segments were on average at least 5% of full-sized as long as ECN
   marking was 5% or less.
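   The simple formula for dSafer.cep can be checked against the two
   worked examples with a few lines of Python.  This is a sketch, not a
   reference implementation: the function name d_safer_cep is
   hypothetical, it takes newlyAckedPkt directly because the text does
   not say how newlyAckedB/MSS is rounded, and the guard for
   newlyAckedPkt < d.cep is an assumption added here that is not part
   of the formula above.

```python
DIVACE = 2 ** 3  # the 3-bit ACE field wraps every 8 CE marks


def d_safer_cep(newly_acked_pkt: int, d_cep: int) -> int:
    """Largest increase of the CE packet counter that the ACK could conceal.

    Assumes every newly acknowledged full-sized segment might have been
    CE-marked, so the result is the largest value not exceeding
    newly_acked_pkt that is congruent to d_cep modulo DIVACE.
    """
    if newly_acked_pkt < d_cep:
        # Assumed guard for this sketch only: fewer full-sized segments were
        # acknowledged than the minimum increment, so keep the minimum.
        return d_cep
    return newly_acked_pkt - ((newly_acked_pkt - d_cep) % DIVACE)


print(d_safer_cep(9, 2))   # 2
print(d_safer_cep(10, 2))  # 10
```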
   Assuming the simple algorithm was used, the Data Sender would
   increment its packet counter as follows:

      s.cep += dSafer.cep

   If missing acknowledgement numbers arrive later (due to reordering),
   Section 3.2.5 says "the Data Sender MAY attempt to neutralise the
   effect of any action it took based on a conservative assumption that
   it later found to be incorrect".  To do this, the Data Sender would
   have to store the values of all the relevant variables whenever it
   made assumptions, so that it could re-evaluate them later.  Given
   this could become complex and it is not required, we do not attempt
   to provide an example of how to do this.

A.2.2.  Safety Algorithm with the AccECN Option

   When the AccECN Option is available on the ACKs before and after the
   possible sequence of ACK losses, if the Data Sender only needs
   CE-marked bytes, it will have sufficient information in the AccECN
   Option without needing to process the ACE field.  However, if for
   some reason it needs CE-marked packets, if dSafer.cep is different
   from d.cep, it can calculate the average marked segment size that
   each implies to determine whether d.cep is likely to be a safe enough
   estimate.
        Specifically, it could use the following algorithm, where
1875    d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

1877       SAFETY_FACTOR = 2
1878       if (dSafer.cep > d.cep) {
1879           s = d.ceb/d.cep
1880           if (s <= MSS) {
1881               sSafer = d.ceb/dSafer.cep
1882               if (sSafer < MSS/SAFETY_FACTOR)
1883                   dSafer.cep = d.cep   % d.cep is a safe enough estimate
1884           } % else
1885               % No need for else; dSafer.cep is already correct,
1886               % because d.cep must have been too small
1887       }

1889    The chart below shows when the above algorithm will consider d.cep
1890    can replace dSafer.cep as a safe enough estimate of the number of CE-
1891    marked packets:

1893             ^
1894       sSafer|
1895             |
1896          MSS+
1897             |
1898             |         dSafer.cep
1899             |                is
1900        MSS/2+--------------+    safest
1901             |              |
1902             | d.cep is safe|
1903             |    enough    |
1904             +-------------------->
1905                           MSS   s

1907    The following examples give the reasoning behind the algorithm,
1908    assuming MSS=1,460 [B]:

1910    o  if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
1911       sSafer=182.5.
1912       Therefore even though the average size of 8 data segments is
1913       unlikely to have been as small as MSS/8, d.cep cannot have been
1914       correct, because it would imply an average segment size greater
1915       than the MSS.

1917    o  if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
1918       sSafer=146.
1919       Therefore d.cep is safe enough, because the average size of 10
1920       data segments is unlikely to have been as small as MSS/10.

1922    o  if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
1923       sSafer=680.
1924       Therefore d.cep is safe enough, because the average data segment
1925       size is more likely to have been just less than one MSS, rather
1926       than below MSS/2.

1928    If pure ACKs were allowed to be ECN-capable, missing ACKs would be
1929    far less likely. However, because [RFC3168] currently precludes
1930    this, the above algorithm assumes that pure ACKs are not ECN-capable.

1932 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

1934    If the AccECN Option is not available, the Data Sender can only
1935    decode CE-marking from the ACE field in packets. Every time an ACK
1936    arrives, to convert this into an estimate of CE-marked bytes, it
1937    needs an average of the segment size, s_ave. Then it can add or
1938    subtract s_ave from the value of d.ceb as the value of d.cep
1939    increments or decrements.

1941    To calculate s_ave, it could keep a record of the byte numbers of all
1942    the boundaries between packets in flight (including control packets),
1943    and recalculate s_ave on every ACK. However, it would be simpler to
1944    merely maintain a counter packets_in_flight for the number of packets
1945    in flight (including control packets), which it could update once per
1946    RTT. Either way, it would estimate s_ave as:

1948       s_ave ~= flightsize / packets_in_flight,

1950    where flightsize is the variable that TCP already maintains for the
1951    number of bytes in flight. To avoid floating point arithmetic, it
1952    could right-bit-shift by lg(packets_in_flight), where lg() means log
1953    base 2.

1955    An alternative would be to maintain an exponentially weighted moving
1956    average (EWMA) of the segment size:

1958       s_ave = a * s + (1-a) * s_ave,

1960    where a is the decay constant for the EWMA. However, then it is
1961    necessary to choose a good value for this constant, which ought to
1962    depend on the number of packets in flight. Also the decay constant
1963    needs to be a power of two to avoid floating point arithmetic.

1965 A.4. Example Algorithm to Beacon AccECN Options

1967    Section 3.2.8 requires a Data Receiver to beacon a full-length AccECN
1968    Option at least 3 times per RTT.
        This could be implemented by
1969    maintaining a variable to store the number of ACKs (pure and data
1970    ACKs) since a full AccECN Option was last sent and another for the
1971    approximate number of ACKs sent in the last round trip time:

1973       if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
1974           send_full_AccECN_Option()

1976    For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
1977    rather than 3, so that the division could be implemented as an
1978    integer right bit-shift by lg(BEACON_FREQ).

1980    In certain operating systems, it might be too complex to maintain
1981    acks_in_round. In others it might be possible by tagging each data
1982    segment in the retransmit buffer with the number of ACKs sent at the
1983    point that segment was sent. This would not work well if the Data
1984    Receiver was not sending data itself, in which case it might be
1985    necessary to beacon based on time instead, as follows:

1987       if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
1988           send_full_AccECN_Option()

1990    This time-based approach does not work well when all the ACKs are
1991    sent early in each round trip, as is the case during slow-start. In
1992    this case few options will be sent (possibly even fewer than 3 per
1993    RTT). However, when continuously sending data, data packets as well
1994    as ACKs will spread out equally over the RTT and sufficient ACKs with
1995    the AccECN Option will be sent.

1997 A.5. Example Algorithm to Count Not-ECT Bytes

1999    A Data Sender in AccECN mode can infer the amount of TCP payload data
2000    arriving at the receiver marked Not-ECT from the difference between
2001    the amount of newly ACKed data and the sum of the bytes with the
2002    other three markings, d.ceb, d.e0b and d.e1b. Note that, because
2003    r.e0b is initialized to 1 and the other two counters are initialized
2004    to 0, the initial sum will be 1, which matches the initial offset of
2005    the TCP sequence number on completion of the 3WHS.
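A minimal sketch of this inference in Python, under stated assumptions: the function name not_ect_bytes and its argument names are hypothetical, newly_acked_bytes is taken to be the amount of data newly acknowledged over the same interval as the three byte-counter increments, and the clamp at zero is a safeguard added for this sketch rather than something the text requires.

```python
def not_ect_bytes(newly_acked_bytes: int, d_ceb: int, d_e0b: int, d_e1b: int) -> int:
    """Estimate newly ACKed payload bytes that arrived marked Not-ECT.

    d_ceb, d_e0b and d_e1b are the increases in the CE, ECT(0) and
    ECT(1) byte counters over the same interval.
    """
    marked_bytes = d_ceb + d_e0b + d_e1b
    # Assumption: clamp at zero so that a transient inconsistency between
    # the counters and the acknowledged data cannot yield a negative count.
    return max(0, newly_acked_bytes - marked_bytes)
```

For instance, if 3000 bytes are newly acknowledged while the CE and ECT(0) byte counters grew by 1460 and 1000 respectively, not_ect_bytes(3000, 1460, 1000, 0) yields 540.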
2007    For this approach to be precise, it has to be assumed that spurious
2008    (unnecessary) retransmissions do not lead to double counting. This
2009    assumption is currently correct, given that RFC 3168 requires that
2010    the Data Sender marks retransmitted segments as Not-ECT. However,
2011    the converse is not true; necessary retransmissions will result in
2012    under-counting.

2014    However, such precision is unlikely to be necessary. The only known
2015    use of a count of Not-ECT marked bytes is to test whether equipment
2016    on the path is clearing the ECN field (perhaps due to an out-dated
2017    attempt to clear, or bleach, what used to be the ToS field). To
2018    detect bleaching it will be sufficient to detect whether nearly all
2019    bytes arrive marked as Not-ECT. Therefore there should be no need to
2020    keep track of the details of retransmissions.

2022 Appendix B. Rationale for Usage of TCP Header Flags

2024 B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake

2026    AccECN uses a rather unorthodox but justified approach to negotiate
2027    the highest version TCP ECN feedback scheme that both ends support.
2028    It follows from the original TCP ECN capability negotiation
2029    [RFC3168], in which the client set the 2 least significant reserved
2030    flags in the TCP header, and fell back to no ECN support if the
2031    server responded with the 2 flags cleared, which had previously been
2032    the default. It is not recorded why ECN originally used this
2033    approach instead of the more orthodox use of a TCP option.

2035    In order to be backward compatible with RFC 3168, AccECN continues
2036    this approach, using the 3rd least significant TCP header flag that
2037    had previously been allocated for the ECN nonce (now historic).

2039    Then, whatever form of server an AccECN client encounters, the
2040    connection can fall back to the highest version of feedback protocol
2041    that both ends support, as explained in Section 3.1.
2043    If AccECN had used the more orthodox approach of a TCP option, it
2044    would still have had to set the two ECN flags in the main TCP header,
2045    in order to be able to fall back to Classic RFC 3168 ECN, or to
2046    disable ECN support, without another round of negotiation. Then
2047    AccECN would also have had to handle all the different ways that
2048    servers currently respond to settings of the ECN flags in the main
2049    TCP header, including all the conflicting cases where a server might
2050    have said it supported one approach in the flags and another approach
2051    in the new TCP option. And AccECN would have had to deal with all
2052    the additional possibilities where a middlebox might have mangled the
2053    ECN flags, or removed the TCP option. Thus, usage of the 3rd
2054    reserved TCP header flag simplified the protocol.

2056    The third flag was used in a way that could be distinguished from the
2057    ECN nonce, in case any nonce deployment was encountered. Previous
2058    usage of this flag for the ECN nonce was integrated into the original
2059    ECN negotiation. This further justified the 3rd flag's use for
2060    AccECN, because a non-ECN usage of this flag would have had to use it
2061    as a separate single bit, rather than in combination with the other 2
2062    ECN flags.

2064    Indeed, having overloaded the original uses of these three flags for
2065    its handshake, AccECN overloads all three bits again as a 3-bit
2066    counter.

2068 B.2. Four Codepoints in the SYN/ACK

2070    Of the 8 possible codepoints that the 3 TCP header flags can indicate
2071    on the SYN/ACK, 4 already indicated earlier (or broken) versions of
2072    ECN support. In the early design of AccECN, an AccECN server could
2073    use only 2 of the 4 remaining codepoints. They both indicated AccECN
2074    support, but one fed back that the SYN had arrived marked as CE.
2075    Even though ECN support on a SYN is not yet on the standards track,
2076    the idea is for either end to act as a dumb reflector, so that future
2077    capabilities can be unilaterally deployed without requiring 2-ended
2078    deployment (justified in Section 2.5).

2080    During traversal testing it was discovered that the ECN field in the
2081    SYN was mangled on a non-negligible proportion of paths. Therefore
2082    it was necessary to allow the SYN/ACK to feed all four IP/ECN
2083    codepoints that the SYN could arrive with back to the client.
2084    Without this, the client could not know whether to disable ECN for
2085    the connection due to mangling of the IP/ECN field (also explained in
2086    Section 2.5). This development consumed the remaining 2 codepoints
2087    on the SYN/ACK that had been reserved for future use by AccECN in
2088    earlier versions.

2090 B.3. Space for Future Evolution

2092    Despite availability of usable TCP header space being extremely
2093    scarce, the AccECN protocol has taken all possible steps to ensure
2094    that there is space to negotiate possible future variants of the
2095    protocol, either if the experiment proves that a variant of AccECN is
2096    required, or if a completely different ECN feedback approach is
2097    needed:

2099    Future AccECN variants: When the AccECN capability is negotiated
2100       during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
2101       'Broken' in the column for the capability of node B are unused by
2102       any current protocol in the RFC series. These could be used by
2103       TCP servers in future to indicate a variant of the AccECN
2104       protocol. In recent measurement studies in which the response of
2105       large numbers of servers to an AccECN SYN has been tested, e.g.
2106       [Mandalari18], a very small number of SYN/ACKs arrive with the
2107       pattern tagged as 'Nonce', and a small but more significant number
2108       arrive with the pattern tagged as 'Broken'. The 'Nonce' pattern
2109       could be a sign that a few servers have implemented the ECN Nonce
2110       [RFC3540], which has now been reclassified as historic [RFC8311],
2111       or it could be the random result of some unknown middlebox
2112       behaviour. The greater prevalence of the 'Broken' pattern
2113       suggests that some instances still exist of the broken code that
2114       reflects the reserved flags on the SYN.

2116       The requirement not to reject unexpected initial values of the ACE
2117       counter (in the main TCP header) in the last para of Section 3.2.3
2118       ensures that 5 unused codepoints on the final ACK of the 3WHS and
2119       7 unused values on the first data packet from the server could be
2120       used to declare future variants of the AccECN protocol. The word
2121       'declare' is used rather than 'negotiate' because, at this late
2122       stage in the 3WHS, it would be too late for a negotiation between
2123       the endpoints to be completed. A similar requirement not to
2124       reject unexpected initial values in the TCP option
2125       (Section 3.2.7.4) is for the same purpose. If traversal of the
2126       TCP option were reliable, this would have enabled a far wider
2127       range of future variation of the whole AccECN protocol.
2128       Nonetheless, it could be used to reliably negotiate a wide range
2129       of variation in the semantics of the AccECN Option.

2131    Future non-AccECN variants: Five codepoints out of the 8 possible in
2132       the 3 TCP header flags used by AccECN are unused on the initial
2133       SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
2134       Section 3.1.2 ensures that the installed base of AccECN servers
2135       will all assume these are equivalent to AccECN negotiation with
2136       111 on the SYN. These codepoints would not allow fall-back to
2137       Classic ECN support for a server that did not understand them, but
2138       this approach ensures they are available in future, perhaps for
2139       uses other than ECN alongside the AccECN scheme. All possible
2140       combinations of SYN/ACK could be used in response except either
2141       000 or reflection of the same values sent on the SYN.

2143       Of course, other ways could be resorted to in order to extend
2144       AccECN or ECN in future, although their traversal properties are
2145       likely to be inferior. They include a new TCP option; using the
2146       remaining reserved flags in the main TCP header (preferably
2147       extending the 3-bit combinations used by AccECN to 4-bit
2148       combinations, rather than burning one bit for just one state); a
2149       non-zero urgent pointer in combination with the URG flag cleared;
2150       or some other unexpected combination of fields yet to be invented.

2152 Authors' Addresses

2154    Bob Briscoe
2155    CableLabs
2156    UK

2158    EMail: ietf@bobbriscoe.net
2159    URI: http://bobbriscoe.net/

2161    Mirja Kuehlewind
2162    ETH Zurich
2163    Zurich
2164    Switzerland

2166    EMail: mirja.kuehlewind@tik.ee.ethz.ch

2168    Richard Scheffenegger
2169    Vienna
2170    Austria

2172    EMail: rscheff@gmx.at