idnits 2.17.1

draft-ietf-tcpm-accurate-ecn-18.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC3168, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC3449, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

     (Using the creation date from RFC3168, updated by this document, for
     RFC5378 checks: 2000-11-17)

     (Using the creation date from RFC3449, updated by this document, for
     RFC5378 checks: 1999-10-04)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)
  -- The document date (March 22, 2022) is 760 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'B' is mentioned on line 2605, but not defined

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tcpm-generalized-ecn-09

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-17

  -- Obsolete informational reference (is this intentional?): RFC 6824
     (Obsoleted by RFC 8684)

     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
3    Internet-Draft                                               Independent
4    Updates: 3168, 3449 (if approved)                          M. Kuehlewind
5    Intended status: Standards Track                                Ericsson
6    Expires: September 23, 2022                             R. Scheffenegger
7                                                                      NetApp
8                                                              March 22, 2022

10                       More Accurate ECN Feedback in TCP
11                        draft-ietf-tcpm-accurate-ecn-18

13   Abstract

15      Explicit Congestion Notification (ECN) is a mechanism where network
16      nodes can mark IP packets instead of dropping them to indicate
17      incipient congestion to the end-points. Receivers with an ECN-
18      capable transport protocol feed back this information to the sender.
19      ECN was originally specified for TCP in such a way that only one
20      feedback signal can be transmitted per Round-Trip Time (RTT). Recent
21      new TCP mechanisms like Congestion Exposure (ConEx), Data Center TCP
22      (DCTCP) or Low Latency Low Loss Scalable Throughput (L4S) need more
23      accurate ECN feedback information whenever more than one marking is
24      received in one RTT. This document updates the original ECN
25      specification to specify a scheme to provide more than one feedback
26      signal per RTT in the TCP header. Given TCP header space is scarce,
27      it allocates a reserved header bit previously assigned to the ECN-
28      Nonce. It also overloads the two existing ECN flags in the TCP
29      header. The resulting extra space is exploited to feed back the IP-
30      ECN field received during the 3-way handshake as well. Supplementary
31      feedback information can optionally be provided in a new TCP option,
32      which is never used on the TCP SYN. The document also specifies the
33      treatment of this updated TCP wire protocol by middleboxes, updating
34      BCP 69 with respect to ACK filtering.

36   Status of This Memo

38      This Internet-Draft is submitted in full conformance with the
39      provisions of BCP 78 and BCP 79.

41      Internet-Drafts are working documents of the Internet Engineering
42      Task Force (IETF). Note that other groups may also distribute
43      working documents as Internet-Drafts. The list of current Internet-
44      Drafts is at https://datatracker.ietf.org/drafts/current/.

46      Internet-Drafts are draft documents valid for a maximum of six months
47      and may be updated, replaced, or obsoleted by other documents at any
48      time. It is inappropriate to use Internet-Drafts as reference
49      material or to cite them other than as "work in progress."

51      This Internet-Draft will expire on September 23, 2022.

53   Copyright Notice

55      Copyright (c) 2022 IETF Trust and the persons identified as the
56      document authors. All rights reserved.

58      This document is subject to BCP 78 and the IETF Trust's Legal
59      Provisions Relating to IETF Documents
60      (https://trustee.ietf.org/license-info) in effect on the date of
61      publication of this document. Please review these documents
62      carefully, as they describe your rights and restrictions with respect
63      to this document.
        Code Components extracted from this document must
64      include Simplified BSD License text as described in Section 4.e of
65      the Trust Legal Provisions and are provided without warranty as
66      described in the Simplified BSD License.

68   Table of Contents

70      1. Introduction
71        1.1. Document Roadmap
72        1.2. Goals
73        1.3. Terminology
74        1.4. Recap of Existing ECN feedback in IP/TCP
75      2. AccECN Protocol Overview and Rationale
76        2.1. Capability Negotiation
77        2.2. Feedback Mechanism
78        2.3. Delayed ACKs and Resilience Against ACK Loss
79        2.4. Feedback Metrics
80        2.5. Generic (Dumb) Reflector
81      3. AccECN Protocol Specification
82        3.1. Negotiating to use AccECN
83          3.1.1. Negotiation during the TCP handshake
84          3.1.2. Backward Compatibility
85          3.1.3. Forward Compatibility
86          3.1.4. Retransmission of the SYN
87          3.1.5. Implications of AccECN Mode
88        3.2. AccECN Feedback
89          3.2.1. Initialization of Feedback Counters
90          3.2.2. The ACE Field
91            3.2.2.1. ACE Field on the ACK of the SYN/ACK
92            3.2.2.2. Encoding and Decoding Feedback in the ACE Field
93            3.2.2.3. Testing for Mangling of the IP/ECN Field
94            3.2.2.4. Testing for Zeroing of the ACE Field
95            3.2.2.5. Safety against Ambiguity of the ACE Field
97          3.2.3. The AccECN Option
98            3.2.3.1. Encoding and Decoding Feedback in the AccECN
99                     Option Fields
100           3.2.3.2. Path Traversal of the AccECN Option
101           3.2.3.3. Usage of the AccECN TCP Option
102       3.3. AccECN Compliance Requirements for TCP Proxies, Offload
103            Engines and other Middleboxes
104         3.3.1. Requirements for TCP Proxies
105         3.3.2. Requirements for Transparent Middleboxes and TCP
106                Normalizers
107         3.3.3. Requirements for TCP ACK Filtering
108         3.3.4. Requirements for TCP Segmentation Offload
109     4. Updates to RFC 3168
110     5. Interaction with TCP Variants
111       5.1. Compatibility with SYN Cookies
112       5.2. Compatibility with TCP Experiments and Common TCP Options
113       5.3. Compatibility with Feedback Integrity Mechanisms
114     6. Protocol Properties
115     7. IANA Considerations
116     8. Security Considerations
117     9. Acknowledgements
118     10. Comments Solicited
119     11. References
120       11.1. Normative References
121       11.2. Informative References
122     Appendix A. Example Algorithms
123       A.1. Example Algorithm to Encode/Decode the AccECN Option
124       A.2. Example Algorithm for Safety Against Long Sequences of
125            ACK Loss
126         A.2.1. Safety Algorithm without the AccECN Option
127         A.2.2. Safety Algorithm with the AccECN Option
128       A.3. Example Algorithm to Estimate Marked Bytes from Marked
129            Packets
130       A.4. Example Algorithm to Count Not-ECT Bytes
131     Appendix B. Rationale for Usage of TCP Header Flags
132       B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake
133       B.2. Four Codepoints in the SYN/ACK
134       B.3. Space for Future Evolution
135     Authors' Addresses

137  1. Introduction

139     Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
140     network nodes can mark IP packets instead of dropping them to
141     indicate incipient congestion to the end-points. Receivers with an
142     ECN-capable transport protocol feed back this information to the
143     sender. In RFC 3168, ECN was specified for TCP in such a way that
144     only one feedback signal could be transmitted per Round-Trip Time
145     (RTT). Recently, proposed mechanisms like Congestion Exposure (ConEx
146     [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
147     know when more than one marking is received in one RTT which is
148     information that cannot be provided by the feedback scheme as
149     specified in [RFC3168]. This document specifies an update to the ECN
150     feedback scheme of RFC 3168 that provides more accurate information
151     and could be used by these and potentially other future TCP
152     extensions. A fuller treatment of the motivation for this
153     specification is given in the associated requirements document
154     [RFC7560].

156     This document specifies a standards track scheme for ECN feedback in
157     the TCP header to provide more than one feedback signal per RTT. It
158     will be called the more accurate ECN feedback scheme, or AccECN for
159     short.
        This document updates RFC 3168 with respect to negotiation
160     and use of the feedback scheme for TCP. All aspects of RFC 3168
161     other than the TCP feedback scheme, in particular the definition of
162     ECN at the IP layer, remain unchanged by this specification.
163     Section 4 gives a more detailed specification of exactly which
164     aspects of RFC 3168 this document updates.

166     AccECN is intended to be a complete replacement for classic TCP/ECN
167     feedback, not a fork in the design of TCP. AccECN feedback
168     complements TCP's loss feedback and it can coexist alongside
169     'classic' [RFC3168] TCP/ECN feedback. So its applicability is
170     intended to include all public and private IP networks (and even any
171     non-IP networks over which TCP is used today), whether or not any
172     nodes on the path support ECN, of whatever flavour. This document
173     uses the term Classic ECN when it needs to distinguish the RFC 3168
174     ECN TCP feedback scheme from the AccECN TCP feedback scheme.

176     AccECN feedback overloads the two existing ECN flags in the TCP
177     header and allocates the currently reserved flag (previously called
178     NS) in the TCP header, to be used as one three-bit counter field
179     indicating the number of congestion experienced marked packets.
180     Given the new definitions of these three bits, both ends have to
181     support the new wire protocol before it can be used. Therefore
182     during the TCP handshake the two ends use these three bits in the TCP
183     header to negotiate the most advanced feedback protocol that they can
184     both support, in a way that is backward compatible with [RFC3168].

186     AccECN is solely a change to the TCP wire protocol; it covers the
187     negotiation and signaling of more accurate ECN feedback from a TCP
188     Data Receiver to a Data Sender. It is completely independent of how
189     TCP might respond to congestion feedback, which is out of scope, but
190     ultimately the motivation for accurate ECN feedback. Like Classic
191     ECN feedback, AccECN can be used by standard Reno congestion control
192     [RFC5681] to respond to the existence of at least one congestion
193     notification within a round trip. Or, unlike Reno, AccECN can be
194     used to respond to the extent of congestion notification over a round
195     trip, as for example DCTCP does in controlled environments [RFC8257].
196     For congestion response, this specification refers to RFC 3168, or
197     ECN experiments such as those referred to in [RFC8311], namely: a
198     TCP-based Low Latency Low Loss Scalable (L4S) congestion control
199     [I-D.ietf-tsvwg-l4s-arch]; or Alternative Backoff with ECN (ABE)
200     [RFC8511].

202     It is RECOMMENDED that the AccECN protocol is implemented alongside
203     SACK [RFC2018] and the experimental ECN++ protocol
204     [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to
205     be used on TCP control packets. Therefore, this specification does
206     not discuss implementing AccECN alongside [RFC5562], which was an
207     earlier experimental protocol with narrower scope than ECN++.

209  1.1. Document Roadmap

211     The following introductory section outlines the goals of AccECN
212     (Section 1.2). Then terminology is defined (Section 1.3) and a recap
213     of existing prerequisite technology is given (Section 1.4).

215     Section 2 gives an informative overview of the AccECN protocol. Then
216     Section 3 gives the normative protocol specification, and Section 4
217     clarifies which aspects of RFC 3168 are updated by this
218     specification. Section 5 assesses the interaction of AccECN with
219     commonly used variants of TCP, whether standardized or not.
220     Section 6 summarizes the features and properties of AccECN.

222     Section 7 summarizes the protocol fields and numbers that IANA will
223     need to assign and Section 8 points to the aspects of the protocol
224     that will be of interest to the security community.
226     Appendix A gives pseudocode examples for the various algorithms that
227     AccECN uses and Appendix B explains why AccECN uses flags in the main
228     TCP header and quantifies the space left for future use.

230  1.2. Goals

232     [RFC7560] enumerates requirements that a candidate feedback scheme
233     will need to satisfy, under the headings: resilience, timeliness,
234     integrity, accuracy (including ordering and lack of bias),
235     complexity, overhead and compatibility (both backward and forward).
236     It recognizes that a perfect scheme that fully satisfies all the
237     requirements is unlikely and trade-offs between requirements are
238     likely. Section 6 presents the properties of AccECN against these
239     requirements and discusses the trade-offs made.

241     The requirements document recognizes that a protocol as ubiquitous as
242     TCP needs to be able to serve as-yet-unspecified requirements.
243     Therefore an AccECN receiver aims to act as a generic (dumb)
244     reflector of congestion information so that in future new sender
245     behaviours can be deployed unilaterally.

247  1.3. Terminology

249     AccECN: The more accurate ECN feedback scheme will be called AccECN
250        for short.

252     Classic ECN: the ECN protocol specified in [RFC3168].

254     Classic ECN feedback: the feedback aspect of the ECN protocol
255        specified in [RFC3168], including generation, encoding,
256        transmission and decoding of feedback, but not the Data Sender's
257        subsequent response to that feedback.

259     ACK: A TCP acknowledgement, with or without a data payload (ACK=1).

261     Pure ACK: A TCP acknowledgement without a data payload.

263     Acceptable packet / segment: A packet or segment that passes the
264        acceptability tests in [RFC0793] and [RFC5961].

266     TCP client: The TCP stack that originates a connection.

268     TCP server: The TCP stack that responds to a connection request.

270     Data Receiver: The endpoint of a TCP half-connection that receives
271        data and sends AccECN feedback.
273     Data Sender: The endpoint of a TCP half-connection that sends data
274        and receives AccECN feedback.

276     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
277     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
278     document are to be interpreted as described in BCP 14 [RFC2119]
279     [RFC8174] when, and only when, they appear in all capitals, as shown
280     here.

282  1.4. Recap of Existing ECN feedback in IP/TCP

284     ECN [RFC3168] uses two bits in the IP header. Once ECN has been
285     negotiated with the receiver at the transport layer, an ECN sender
286     can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
287     to indicate an ECN-capable transport (ECT). If both ECN bits are
288     zero, the packet is considered to have been sent by a Not-ECN-capable
289     Transport (Not-ECT). When a network node experiences congestion, it
290     will occasionally either drop or mark a packet, with the choice
291     depending on the packet's ECN codepoint. If the codepoint is Not-
292     ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
293     the node can mark the packet by setting both ECN bits, which is
294     termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
295     Table 1 summarises these codepoints.

297     +------------------+----------------+---------------------------+
298     | IP-ECN codepoint | Codepoint name | Description               |
299     +------------------+----------------+---------------------------+
300     | 0b00             | Not-ECT        | Not ECN-Capable Transport |
301     | 0b01             | ECT(1)         | ECN-Capable Transport (1) |
302     | 0b10             | ECT(0)         | ECN-Capable Transport (0) |
303     | 0b11             | CE             | Congestion Experienced    |
304     +------------------+----------------+---------------------------+

306                  Table 1: The ECN Field in the IP Header

308     In the TCP header the first two bits in byte 14 are defined as flags
309     for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
310     indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
311     ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
312     the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
313     Data Receiver starts to set the Echo Congestion Experienced (ECE)
314     flag continuously in the TCP header of ACKs, which ensures the signal
315     is received reliably even if ACKs are lost. The TCP sender confirms
316     that it has received at least one ECE signal by responding with the
317     congestion window reduced (CWR) flag, which allows the TCP receiver
318     to stop repeating the ECN-Echo flag. This always leads to a full RTT
319     of ACKs with ECE set. Thus any additional CE markings arriving
320     within this RTT cannot be fed back.

322     The last bit in byte 13 of the TCP header was defined as the Nonce
323     Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
324     deployment RFC 3540 has been reclassified as historic [RFC8311] and
325     the respective flag has been marked as "reserved", making this TCP
326     flag available for use by the AccECN experiment instead.

328       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
329     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
330     |               |           | N | C | E | U | A | P | R | S | F |
331     | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
332     |               |           |   | R | E | G | K | H | T | N | N |
333     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

335      Figure 1: The (post-ECN Nonce) definition of the TCP header flags

337  2. AccECN Protocol Overview and Rationale

339     This section provides an informative overview of the AccECN protocol
340     that will be normatively specified in Section 3.

342     Like the original TCP approach, the Data Receiver of each TCP half-
343     connection sends AccECN feedback to the Data Sender on TCP
344     acknowledgements, reusing data packets of the other half-connection
345     whenever possible.
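
   For comparison, the Classic ECN receiver behaviour recapped in
   Section 1.4 can be sketched in a few lines of Python. This is an
   illustrative model only, not part of the draft: the class and
   function names are hypothetical, and it assumes one ACK per arriving
   segment (no delayed ACKs). It shows why a Data Sender using Classic
   ECN feedback cannot tell one CE mark from several within a round
   trip.

```python
from typing import Iterable, List, Tuple


class ClassicEcnReceiver:
    """Illustrative model of the RFC 3168 receiver's ECE latch (hypothetical name)."""

    def __init__(self) -> None:
        self.ece_latched = False

    def on_segment(self, ip_ecn: str, cwr: bool = False) -> bool:
        """Process one arriving segment; return the ECE flag for the resulting ACK."""
        if cwr:
            # The sender has confirmed that it saw ECE, so stop repeating it.
            self.ece_latched = False
        if ip_ecn == "CE":
            # A CE mark starts (or restarts) continuous ECE feedback.
            self.ece_latched = True
        return self.ece_latched


def ece_feedback(segments: Iterable[Tuple[str, bool]]) -> List[bool]:
    """Return the ECE flag sent in response to each (ip_ecn, cwr) segment."""
    receiver = ClassicEcnReceiver()
    return [receiver.on_segment(ip_ecn, cwr) for ip_ecn, cwr in segments]


# One CE mark versus three CE marks within the same round trip (no CWR yet).
one_mark = [("ECT(0)", False), ("CE", False), ("ECT(0)", False), ("ECT(0)", False)]
three_marks = [("ECT(0)", False), ("CE", False), ("CE", False), ("CE", False)]

# Both runs yield identical feedback, so the extra marks are invisible
# to the Data Sender.
assert ece_feedback(one_mark) == ece_feedback(three_marks)
```

   The two runs differ only in how many segments arrive CE-marked, yet
   the ECE flags returned are the same. That information gap is what
   the per-packet and per-byte counters described in the rest of this
   section are meant to close.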
347     The AccECN protocol has had to be designed in two parts:

349     o  an essential part that re-uses ECN TCP header bits for the Data
350        Receiver to feed back the number of packets arriving with CE in
351        the IP-ECN field. This provides more accuracy than classic ECN
352        feedback, but limited resilience against ACK loss;

354     o  a supplementary part using a new AccECN TCP Option that provides
355        additional feedback on the number of bytes that arrive marked with
356        each of the three ECN codepoints in the IP-ECN field (not just CE
357        marks). This provides greater resilience against ACK loss than
358        the essential feedback, but it is more likely to suffer from
359        middlebox interference.

361     The two part design was necessary, given limitations on the space
362     available for TCP options and given the possibility that certain
363     incorrectly designed middleboxes prevent TCP using any new options.

365     The essential part overloads the previous definition of the three
366     flags in the TCP header that had been assigned for use by ECN. This
367     design choice deliberately replaces the classic ECN feedback
368     protocol, rather than leaving classic ECN feedback intact and adding
369     more accurate feedback separately because:

371     o  this efficiently reuses scarce TCP header space, given TCP option
372        space is approaching saturation;

374     o  a single upgrade path for the TCP protocol is preferable to a fork
375        in the design;

377     o  otherwise classic and accurate ECN feedback could give conflicting
378        feedback on the same segment, which could open up new security
379        concerns and make implementations unnecessarily complex;

381     o  middleboxes are more likely to faithfully forward the TCP ECN
382        flags than newly defined areas of the TCP header.

384     AccECN is designed to work even if the supplementary part is removed
385     or zeroed out, as long as the essential part gets through.

387  2.1. Capability Negotiation

389     AccECN is a change to the wire protocol of the main TCP header,
390     therefore it can only be used if both endpoints have been upgraded to
391     understand it. The TCP client signals support for AccECN on the
392     initial SYN of a connection and the TCP server signals whether it
393     supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
394     client uses to signal AccECN support have been carefully chosen so
395     that a TCP server will interpret them as a request to support the
396     most recent variant of ECN feedback that it supports. Then the
397     client falls back to the same variant of ECN feedback.

399     An AccECN TCP client does not send the new AccECN Option on the SYN
400     as SYN option space is limited. The TCP server sends the AccECN
401     Option on the SYN/ACK and the client sends it on the first ACK to
402     test whether the network path forwards the option correctly.

404  2.2. Feedback Mechanism

406     A Data Receiver maintains four counters initialized at the start of
407     the half-connection. Three count the number of arriving payload
408     bytes respectively marked CE, ECT(1) and ECT(0) in the IP-ECN field.
409     The fourth counts the number of packets arriving marked with a CE
410     codepoint (including control packets without payload if they are CE-
411     marked).

413     The Data Sender maintains four equivalent counters for the half
414     connection, and the AccECN protocol is designed to ensure they will
415     match the values in the Data Receiver's counters, albeit after a
416     little delay.

418     Each ACK carries the three least significant bits (LSBs) of the
419     packet-based CE counter using the ECN bits in the TCP header, now
420     renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24
421     LSBs of each byte counter are carried in the AccECN Option.

423  2.3. Delayed ACKs and Resilience Against ACK Loss

425     With both the ACE and the AccECN Option mechanisms, the Data Receiver
426     continually repeats the current LSBs of each of its respective
427     counters. There is no need to acknowledge these continually repeated
428     counters, so the congestion window reduced (CWR) mechanism is no
429     longer used. Even if some ACKs are lost, the Data Sender ought to be
430     able to infer how much to increment its own counters, even if the
431     protocol field has wrapped.

433     The 3-bit ACE field can wrap fairly frequently. Therefore, even if
434     it appears to have incremented by one (say), the field might have
435     actually cycled completely then incremented by one. The Data
436     Receiver is not allowed to delay sending an ACK to such an extent
437     that the ACE field would cycle. However cycling is still a
438     possibility at the Data Sender because a whole sequence of ACKs
439     carrying intervening values of the field might all be lost or delayed
440     in transit.

442     The fields in the AccECN Option are larger, but they will increment
443     in larger steps because they count bytes not packets. Nonetheless,
444     their size has been chosen such that a whole cycle of the field would
445     never occur between ACKs unless there had been an infeasibly long
446     sequence of ACK losses. Therefore, as long as the AccECN Option is
447     available, it can be treated as a dependable feedback channel.

449     If the AccECN Option is not available, e.g. it is being stripped by a
450     middlebox, the AccECN protocol will only feed back information on CE
451     markings (using the ACE field). Although not ideal, this will be
452     sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
453     will ever indicate more severe congestion than CE, even though future
454     uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the
455     3-bit ACE field is so small, when it is the only field available, the
456     Data Sender has to interpret it assuming the most likely wrap, but
457     with a degree of conservatism.

459     Certain specified events trigger the Data Receiver to include an
460     AccECN Option on an ACK. The rules are designed to ensure that the
461     order in which different markings arrive at the receiver is
462     communicated to the sender (as long as options are reaching the
463     sender and as long as there is no ACK loss). Implementations are
464     encouraged to send an AccECN Option more frequently, but this is left
465     up to the implementer.

467  2.4. Feedback Metrics

469     The CE packet counter in the ACE field and the CE byte counter in the
470     AccECN Option both provide feedback on received CE-marks. The CE
471     packet counter includes control packets that do not have payload
472     data, while the CE byte counter solely includes marked payload bytes.
473     If both are present, the byte counter in the option will provide the
474     more accurate information needed for modern congestion control and
475     policing schemes, such as L4S, DCTCP or ConEx. If the option is
476     stripped, a simple algorithm to estimate the number of marked bytes
477     from the ACE field is given in Appendix A.3.

479     Feedback in bytes is provided in order to protect against the
480     receiver using attacks similar to 'ACK-Division' to artificially
481     inflate the congestion window, which is why [RFC5681] now recommends
482     that TCP counts acknowledged bytes not packets.

484  2.5. Generic (Dumb) Reflector

486     The ACE field provides feedback about CE markings in the IP-ECN field
487     of both data and control packets. According to [RFC3168] the Data
488     Sender is meant to set the IP-ECN field of control packets to Not-
489     ECT. However, mechanisms in certain private networks (e.g. data
490     centres) set control packets to be ECN capable because they are
491     precisely the packets that performance depends on most.
493     For this reason, AccECN is designed to be a generic reflector of
494     whatever ECN markings it sees, whether or not they are compliant with
495     a current standard. Then as standards evolve, Data Senders can
496     upgrade unilaterally without any need for receivers to upgrade too.
497     It is also useful to be able to rely on generic reflection behaviour
498     when senders need to test for unexpected interference with markings
499     (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of
500     the present document and para 2 of Section 20.2 of [RFC3168]).

502     The initial SYN is the most critical control packet, so AccECN
503     provides feedback on its IP-ECN field. Although RFC 3168 prohibits
504     an ECN-capable SYN, providing feedback of ECN marking on the SYN
505     supports future scenarios in which SYNs might be ECN-enabled (without
506     prejudging whether they ought to be). For instance, [RFC8311]
507     updates this aspect of RFC 3168 to allow experimentation with ECN-
508     capable TCP control packets.

510     Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
511     Not-ECT in compliance with RFC 3168, feedback on the state of the IP-
512     ECN field when it arrives at the receiver could still be useful,
513     because middleboxes have been known to overwrite the IP-ECN field as
514     if it is still part of the old Type of Service (ToS) field
515     [Mandalari18]. For example, if a TCP client has set the SYN to Not-
516     ECT, but receives feedback that the IP-ECN field on the SYN arrived
517     with a different codepoint, it can detect such middlebox
518     interference. Previously, neither end knew what IP-ECN field the
519     other had sent. So, if a TCP server received ECT or CE on a SYN, it
520     could not know whether it was invalid (or valid) because only the TCP
521     client knew whether it originally marked the SYN as Not-ECT (or ECT).
522     Therefore, prior to AccECN, the server's only safe course of action
523     in this example was to disable ECN for the connection. Instead, the
524     AccECN protocol allows the server to feed back the received ECN field
525     to the client, which then has all the information to decide whether
526     the connection has to fall back from supporting ECN (or not).

528  3. AccECN Protocol Specification

530  3.1. Negotiating to use AccECN

532  3.1.1. Negotiation during the TCP handshake

534     Given the ECN Nonce [RFC3540] has been reclassified as historic
535     [RFC8311], the present specification re-allocates the TCP flag at bit
536     7 of the TCP header, which was previously called NS (Nonce Sum), as
537     the AE (Accurate ECN) flag (see IANA Considerations in Section 7) as
538     shown below.

540       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
541     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
542     |               |           | A | C | E | U | A | P | R | S | F |
543     | Header Length | Reserved  | E | W | C | R | C | S | S | Y | I |
544     |               |           |   | R | E | G | K | H | T | N | N |
545     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

547     Figure 2: The (post-AccECN) definition of the TCP header flags during
548                              the TCP handshake

550     During the TCP handshake at the start of a connection, to request
551     more accurate ECN feedback the TCP client (host A) MUST set the TCP
552     flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

554     If a TCP server (B) that is AccECN-enabled receives a SYN with the
555     above three flags set, it MUST set both its half connections into
556     AccECN mode. Then it MUST set the AE, CWR and ECE TCP flags on the
557     SYN/ACK to the combination in the top block of Table 2 that feeds
558     back the IP-ECN field that arrived on the SYN. This applies whether
559     or not the server itself supports setting the IP-ECN field on a SYN
560     or SYN/ACK (see Section 2.5 for rationale).

562     When the TCP server returns any of the 4 combinations in the top
563     block of Table 2, it confirms that it supports AccECN. The TCP
The TCP server MUST NOT set any of these 4 combinations of flags on
the SYN/ACK unless the preceding SYN requested support for AccECN as
above.

Once a TCP client (A) has sent the above SYN to declare that it
supports AccECN, and once it has received the above SYN/ACK segment
that confirms that the TCP server supports AccECN, the TCP client
MUST set both its half connections into AccECN mode.

Once in AccECN mode, a TCP client or server has the rights and
obligations to participate in the ECN protocol defined in
Section 3.1.5.

The procedure for the client to follow if a SYN/ACK does not arrive
before its retransmission timer expires is given in Section 3.1.4.

3.1.2. Backward Compatibility

The three flags set to 1 to indicate AccECN support on the SYN have
been carefully chosen to enable natural fall-back to prior stages in
the evolution of ECN, as above. Table 2 tabulates all the
negotiation possibilities for ECN-related capabilities that involve
at least one AccECN-capable host. The entries in the first two
columns have been abbreviated, as follows:

AccECN: More Accurate ECN Feedback (the present specification)

Nonce: ECN Nonce feedback [RFC3540]

ECN: 'Classic' ECN feedback [RFC3168]

No ECN: Not-ECN-capable. Implicit congestion notification using
   packet drop.
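The handshake rules of Section 3.1.1, which Table 2 tabulates, can be
illustrated with a minimal sketch. It is not a normative algorithm:
the function names, the string labels for feedback modes and the
allow_classic_ecn parameter are assumptions introduced only for this
example, and TCP flags are modelled as (AE, CWR, ECE) tuples. The
treatment of unlisted SYN flag combinations follows Section 3.1.3.

```python
# Illustrative sketch only; all names here are hypothetical.
# Flags are modelled as (AE, CWR, ECE) tuples.

# Top block of Table 2: SYN/ACK flags that reflect the IP-ECN field
# that arrived on the SYN.
SYNACK_FOR_IP_ECN = {
    "Not-ECT": (0, 1, 0),
    "ECT(1)": (0, 1, 1),
    "ECT(0)": (1, 0, 0),
    "CE": (1, 1, 0),
}


def server_reply(syn_flags, syn_ip_ecn, allow_classic_ecn=True):
    """Return (feedback mode, SYN/ACK flags) for an AccECN-enabled server."""
    if syn_flags == (0, 0, 0):
        return "Not ECN", (0, 0, 0)
    if syn_flags == (0, 1, 1):
        # The server may either fall back to classic ECN or disable ECN.
        if allow_classic_ecn:
            return "classic ECN", (0, 0, 1)
        return "Not ECN", (0, 0, 0)
    # 111, and any other combination (forward compatibility, Section 3.1.3),
    # is negotiated as AccECN.
    return "AccECN", SYNACK_FOR_IP_ECN[syn_ip_ecn]


def client_mode(synack_flags):
    """Feedback mode for an AccECN client that sent AE=CWR=ECE=1 on its SYN."""
    if synack_flags in SYNACK_FOR_IP_ECN.values():
        return "AccECN"
    if synack_flags == (0, 0, 1):
        return "classic ECN"
    if synack_flags == (1, 0, 1):
        return "Reserved"  # listed as (Reserved) in Table 2
    return "Not ECN"  # 000, or the 'Broken' reflection 111
```

For instance, under these assumptions a SYN with all three flags set
that arrives CE-marked yields server_reply((1, 1, 1), "CE") ==
("AccECN", (1, 1, 0)), and a client receiving a reflected 111 on the
SYN/ACK ends up in Not ECN mode.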
+--------+--------+------------+------------+-----------------------+
| A      | B      |  SYN A->B  |  SYN/ACK   | Feedback Mode         |
|        |        |            |    B->A    |                       |
+--------+--------+------------+------------+-----------------------+
|        |        | AE CWR ECE | AE CWR ECE |                       |
| AccECN | AccECN | 1   1   1  | 0   1   0  | AccECN (no ECT on SYN)|
| AccECN | AccECN | 1   1   1  | 0   1   1  | AccECN (ECT1 on SYN)  |
| AccECN | AccECN | 1   1   1  | 1   0   0  | AccECN (ECT0 on SYN)  |
| AccECN | AccECN | 1   1   1  | 1   1   0  | AccECN (CE on SYN)    |
|        |        |            |            |                       |
| AccECN | Nonce  | 1   1   1  | 1   0   1  | (Reserved)            |
| AccECN | ECN    | 1   1   1  | 0   0   1  | classic ECN           |
| AccECN | No ECN | 1   1   1  | 0   0   0  | Not ECN               |
|        |        |            |            |                       |
| Nonce  | AccECN | 0   1   1  | 0   0   1  | classic ECN           |
| ECN    | AccECN | 0   1   1  | 0   0   1  | classic ECN           |
| No ECN | AccECN | 0   0   0  | 0   0   0  | Not ECN               |
|        |        |            |            |                       |
| AccECN | Broken | 1   1   1  | 1   1   1  | Not ECN               |
+--------+--------+------------+------------+-----------------------+

Table 2: ECN capability negotiation between Client (A) and Server (B)

Table 2 is divided into blocks each separated by an empty row.

1.  The top block shows the case already described in Section 3.1
    where both endpoints support AccECN and how the TCP server (B)
    indicates congestion feedback.

2.  The second block shows the cases where the TCP client (A)
    supports AccECN but the TCP server (B) supports some earlier
    variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
    soon as an AccECN-capable TCP client (A) receives the SYN/ACK
    shown it MUST set both its half connections into the feedback
    mode shown in the rightmost column. If it has set itself into
    classic ECN feedback mode it MUST then comply with [RFC3168].

    The server response called 'Nonce' in the table is now historic.
    For an AccECN implementation, there is no need to recognize or
    support ECN Nonce feedback [RFC3540], which has been reclassified
    as historic [RFC8311].
    AccECN is compatible with alternative ECN feedback integrity
    approaches (see Section 5.3).

3.  The third block shows the cases where the TCP server (B) supports
    AccECN but the TCP client (A) supports some earlier variant of
    TCP feedback, indicated in its SYN.

    When an AccECN-enabled TCP server (B) receives a SYN with
    AE,CWR,ECE = 0,1,1 it MUST do one of the following:

    *  set both its half connections into the classic ECN feedback
       mode and return a SYN/ACK with AE,CWR,ECE = 0,0,1 as shown.
       Then it MUST comply with [RFC3168].

    *  set both its half connections into No ECN mode and return a
       SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN
       disabled. This latter case is unlikely to be desirable, but
       it is allowed as a possibility, e.g. for minimal TCP
       implementations.

    When an AccECN-enabled TCP server (B) receives a SYN with
    AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the
    Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0
    as shown and continue with ECN disabled.

4.  The fourth block displays a combination labelled 'Broken'. Some
    older TCP server implementations incorrectly set the reserved
    flags in the SYN/ACK by reflecting those in the SYN. Such broken
    TCP servers (B) cannot support ECN, so as soon as an
    AccECN-capable TCP client (A) receives such a broken SYN/ACK it
    MUST fall back to Not ECN mode for both its half connections and
    continue with ECN disabled.

The following additional rules do not fit the structure of the table,
but they complement it:

Simultaneous Open: An originating AccECN Host (A), having sent a SYN
   with AE=1, CWR=1 and ECE=1, might receive another SYN from host B.
   Host A MUST then enter the same feedback mode as it would have
   entered had it been a responding host and received the same SYN.
   Then host A MUST send the same SYN/ACK as it would have sent had
   it been a responding host.
In-window SYN during TIME-WAIT: Many TCP implementations create a
   new TCP connection if they receive an in-window SYN packet during
   TIME-WAIT state. When a TCP host enters TIME-WAIT or CLOSED
   state, it ought to ignore any previous state about the negotiation
   of AccECN for that connection and renegotiate the feedback mode
   according to Table 2.

3.1.3. Forward Compatibility

If a TCP server that implements AccECN receives a SYN with the three
TCP header flags (AE, CWR and ECE) set to any combination other than
000, 011 or 111, it MUST negotiate the use of AccECN as if they had
been set to 111. This ensures that future uses of the other
combinations on a SYN can rely on consistent behaviour from the
installed base of AccECN servers.

For the avoidance of doubt, the behaviour described in the present
specification applies whether or not the three remaining reserved TCP
header flags are zero.

3.1.4. Retransmission of the SYN

If the sender of an AccECN SYN times out before receiving the
SYN/ACK, the sender SHOULD attempt to negotiate the use of AccECN at
least one more time by continuing to set all three TCP ECN flags on
the first retransmitted SYN (using the usual retransmission
time-outs). If this first retransmission also fails to be
acknowledged, the sender SHOULD send subsequent retransmissions of
the SYN with the three TCP-ECN flags cleared (AE=CWR=ECE=0). A
retransmitted SYN MUST use the same ISN as the original SYN.

Retrying once before fall-back adds delay in the case where a
middlebox drops an AccECN (or ECN) SYN deliberately. However,
current measurements imply that a drop is less likely to be due to
middlebox interference than other intermittent causes of loss,
e.g. congestion, wireless interference, etc.

Implementers MAY use other fall-back strategies if they are found to
be more effective, e.g. attempting to negotiate AccECN on the SYN
only once or more than twice (most appropriate during high levels of
congestion). However, other fall-back strategies will need to follow
all the rules in Section 3.1.5, which concern behaviour when SYNs or
SYN/ACKs negotiating different types of feedback have been sent
within the same connection.

Further, it might make sense to also remove any other new or
experimental fields or options on the SYN in case a middlebox might
be blocking them, although the required behaviour will depend on the
specification of the other option(s) and any attempt to co-ordinate
fall-back between different modules of the stack.

Whichever fall-back strategy is used, the TCP initiator SHOULD cache
failed connection attempts. If it does, it SHOULD NOT give up
attempting to negotiate AccECN on the SYN of subsequent connection
attempts until it is clear that the blockage is persistently and
specifically due to AccECN. The cache needs to be arranged to expire
so that the initiator will infrequently attempt to check whether the
problem has been resolved.

The fall-back procedure if the TCP server receives no ACK to
acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
Section 3.2.3.2.

3.1.5. Implications of AccECN Mode

Section 3.1.1 describes the only ways that a host can enter AccECN
mode, whether as a client or as a server.

As a Data Sender, a host in AccECN mode has the rights and
obligations concerning the use of ECN defined below, which build on
those in [RFC3168] as updated by [RFC8311]:

o  Using ECT:

   *  It can set an ECT codepoint in the IP header of packets to
      indicate to the network that the transport is capable and
      willing to participate in ECN for this packet.

   *  It does not have to set ECT on any packet (for instance if it
      has reason to believe such a packet would be blocked).
o  Switching feedback negotiation (e.g. fall-back):

   *  It SHOULD NOT set ECT on any packet if it has received at least
      one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0. A
      "valid SYN" has the same port numbers and the same ISN as the
      SYN that caused the server to enter AccECN mode.

   *  It MUST NOT send an ECN-setup SYN [RFC3168] within the same
      connection as it has sent a SYN requesting AccECN feedback.

   *  It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same
      connection as it has sent a SYN/ACK agreeing to use AccECN
      feedback.

   The above rules are necessary because, if one peer were to
   negotiate the feedback mode in two different types of handshake,
   it would not be possible for the other peer to know for certain
   which handshake packet(s) the other end had eventually received or
   in which order it received them. So, in the absence of these
   rules, the two peers could end up using different feedback modes
   without knowing it.

o  Congestion response:

   *  It is still obliged to respond appropriately to AccECN feedback
      that indicates there were ECN marks on packets it had
      previously sent, as defined in Section 6.1 of [RFC3168] and
      updated by Sections 2.1 and 4.1 of [RFC8311].

      In general, it is obliged to respond to congestion feedback
      even when it is solely sending non-ECN-capable packets (for
      rationale, some examples and some exceptions see
      Section 3.2.2.3, Section 3.2.2.4).

   *  The commitment to respond appropriately to incoming indications
      of congestion remains even if it sends a SYN packet with
      AE=CWR=ECE=0, in a later transmission within the same TCP
      connection.
   *  Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate
      it has received and responded to indications of congestion (for
      the avoidance of doubt, this does not preclude it from setting
      the bits of the ACE counter field, which includes an overloaded
      use of the same bit).

As a Data Receiver:

o  a host in AccECN mode MUST feed back the information in the IP-ECN
   field of incoming packets using Accurate ECN feedback, as
   specified in Section 3.2 below.

o  if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168]
   during the same connection as it receives a SYN requesting AccECN
   feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST
   reset the connection with a RST packet.

o  if for any reason it is not willing to provide ECN feedback on a
   particular TCP connection, to indicate this unwillingness it
   SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ACK
   packets that it sends.

o  it MUST NOT use reception of packets with ECT set in the IP-ECN
   field as an implicit signal that the peer is ECN-capable. Reason:
   ECT at the IP layer does not explicitly confirm the peer has the
   correct ECN feedback logic, as the packets could have been mangled
   at the IP layer.

3.2. AccECN Feedback

Each Data Receiver of each half connection maintains four counters,
r.cep, r.ceb, r.e0b and r.e1b:

o  The Data Receiver MUST increment the CE packet counter (r.cep)
   for every Acceptable packet that it receives with the CE
   codepoint in the IP-ECN field, including CE-marked control packets
   but excluding CE on SYN packets (SYN=1; ACK=0).
o  A Data Receiver that supports sending of the AccECN TCP Option
   MUST increment the r.ceb, r.e0b or r.e1b byte counters by the
   number of TCP payload octets in Acceptable packets marked
   respectively with the CE, ECT(0) and ECT(1) codepoint in their
   IP-ECN field, including any payload octets on control packets, but
   not including any payload octets on SYN packets (SYN=1; ACK=0).

Each Data Sender of each half connection maintains four counters,
s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent
counters at the Data Receiver.

A Data Receiver feeds back the CE packet counter using the Accurate
ECN (ACE) field, as explained in Section 3.2.2. And it optionally
feeds back all the byte counters using the AccECN TCP Option, as
specified in Section 3.2.3.

Whenever a host feeds back the value of any counter, it MUST report
the most recent value, no matter whether it is in a pure ACK, an ACK
with new payload data or a retransmission. Therefore the feedback
carried on a retransmitted packet is unlikely to be the same as the
feedback on the original packet.

3.2.1. Initialization of Feedback Counters

When a host first enters AccECN mode, in its role as a Data Receiver
it initializes its counters to r.cep = 5, r.e0b = r.e1b = 1 and
r.ceb = 0.

Non-zero initial values are used to support a stateless handshake
(see Section 5.1) and to be distinct from cases where the fields are
incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4).

When a host enters AccECN mode, in its role as a Data Sender it
initializes its counters to s.cep = 5, s.e0b = s.e1b = 1 and
s.ceb = 0.

3.2.2. The ACE Field

After AccECN has been negotiated on the SYN and SYN/ACK, both hosts
overload the three TCP flags (AE, CWR and ECE) in the main TCP header
as one 3-bit field. Then the field is given a new name, ACE, as
shown in Figure 3.
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 3: Definition of the ACE field within bytes 13 and 14 of the
TCP Header (when AccECN has been negotiated and SYN=0).

The original definition of these three flags in the TCP header,
including the addition of support for the ECN Nonce, is shown for
comparison in Figure 1. This specification does not rename these
three TCP flags to ACE unconditionally; it merely overloads them with
another name and definition once an AccECN connection has been
established.

With one exception (Section 3.2.2.1), a host with both of its half
connections in AccECN mode MUST interpret the AE, CWR and ECE flags
as the 3-bit ACE counter on a segment with the SYN flag cleared
(SYN=0). On such a packet, a Data Receiver MUST encode the three
least significant bits of its r.cep counter into the ACE field that
it feeds back to the Data Sender. A host MUST NOT interpret the 3
flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is
0 or 1), or if AccECN negotiation is incomplete or has not succeeded.

Both parts of each of these conditions are equally important. For
instance, even if AccECN negotiation has been successful, the ACE
field is not defined on any segments with SYN=1 (e.g. a
retransmission of an unacknowledged SYN/ACK, or when both ends send
SYN/ACKs after AccECN support has been successfully negotiated during
a simultaneous open).

3.2.2.1. ACE Field on the ACK of the SYN/ACK

A TCP client (A) in AccECN mode MUST feed back which of the 4
possible values of the IP-ECN field was on the SYN/ACK by writing it
into the ACE field of a pure ACK with no SACK blocks using the binary
encoding in Table 3 (which is the same as that used on the SYN/ACK in
Table 2). This shall be called the handshake encoding of the ACE
field, and it is the only exception to the rule that the ACE field
carries the 3 least significant bits of the r.cep counter on packets
with SYN=0.

Normally, a TCP client acknowledges a SYN/ACK with an ACK that
satisfies the above conditions anyway (SYN=0, no data, no SACK
blocks). If an AccECN TCP client intends to acknowledge the SYN/ACK
with a packet that does not satisfy these conditions (e.g. it has
data to include on the ACK), it SHOULD first send a pure ACK that
does satisfy these conditions (see Section 5.2), so that it can feed
back which of the four values of the IP-ECN field arrived on the
SYN/ACK. A valid exception to this "SHOULD" would be where the
implementation will only be used in an environment where mangling of
the ECN field is unlikely.
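As an illustration of the handshake encoding just described, the
following sketch maps the IP-ECN codepoint seen on the SYN/ACK to the
ACE value the client writes on its pure ACK, together with the
client's resulting r.cep (initial value 5, incremented once for CE).
The values are those of Table 3; the function and constant names are
hypothetical and exist only for this example.

```python
# Illustrative sketch only; names are hypothetical.

R_CEP_INITIAL = 5  # initial value of r.cep (Section 3.2.1)

# Handshake encoding of the ACE field (same bit patterns as the
# AE, CWR, ECE flags used on the SYN/ACK in Table 2).
HANDSHAKE_ACE = {
    "Not-ECT": 0b010,
    "ECT(1)": 0b011,
    "ECT(0)": 0b100,
    "CE": 0b110,
}


def client_ack_of_synack(ip_ecn_on_synack):
    """Return (ACE value for the pure ACK, client's r.cep afterwards)."""
    ace = HANDSHAKE_ACE[ip_ecn_on_synack]
    # A CE mark on the SYN/ACK is counted in r.cep exactly once.
    r_cep = R_CEP_INITIAL + (1 if ip_ecn_on_synack == "CE" else 0)
    return ace, r_cep
```

Under these assumptions, a CE-marked SYN/ACK gives
client_ack_of_synack("CE") == (0b110, 6), while the other three
codepoints leave r.cep at 5.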
+---------------------+---------------------+-----------------------+
| IP-ECN codepoint on | ACE on pure ACK of  | r.cep of client in    |
| SYN/ACK             | SYN/ACK             | AccECN mode           |
+---------------------+---------------------+-----------------------+
| Not-ECT             | 0b010               | 5                     |
| ECT(1)              | 0b011               | 5                     |
| ECT(0)              | 0b100               | 5                     |
| CE                  | 0b110               | 6                     |
+---------------------+---------------------+-----------------------+

Table 3: The encoding of the ACE field in the ACK of the SYN/ACK to
reflect the SYN/ACK's IP-ECN field

When an AccECN server in SYN-RCVD state receives a pure ACK with
SYN=0 and no SACK blocks, instead of treating the ACE field as a
counter, it MUST infer the meaning of each possible value of the ACE
field from Table 4, which also shows the value that an AccECN server
MUST set s.cep to as a result.

Given this encoding of the ACE field on the ACK of a SYN/ACK is
exceptional, an AccECN server using large receive offload (LRO) might
prefer to disable LRO until such an ACK has transitioned it out of
SYN-RCVD state.

+---------------+-----------------------------+---------------------+
| ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in  |
| SYN/ACK       | inferred by server          | AccECN mode         |
+---------------+-----------------------------+---------------------+
| 0b000         | {Notes 1, 3}                | Disable ECN         |
| 0b001         | {Notes 2, 3}                | 5                   |
| 0b010         | Not-ECT                     | 5                   |
| 0b011         | ECT(1)                      | 5                   |
| 0b100         | ECT(0)                      | 5                   |
| 0b101         | Currently Unused {Note 2}   | 5                   |
| 0b110         | CE                          | 6                   |
| 0b111         | Currently Unused {Note 2}   | 5                   |
+---------------+-----------------------------+---------------------+

Table 4: Meaning of the ACE field on the ACK of the SYN/ACK

{Note 1}: If the server is in AccECN mode, the value of zero raises
suspicion of zeroing of the ACE field on the path (see
Section 3.2.2.4).
{Note 2}: If the server is in AccECN mode, these values are Currently
Unused but the AccECN server's behaviour is still defined for forward
compatibility. Then the designer of a future protocol can know for
certain what AccECN servers will do with these codepoints.

{Note 3}: In the case where a server that implements AccECN is also
using a stateless handshake (termed a SYN cookie) it will not
remember whether it entered AccECN mode. The values 0b000 or 0b001
will remind it that it did not enter AccECN mode, because AccECN does
not use them (see Section 5.1 for details). If a stateless server
that implements AccECN receives either of these two values in the
ACK, its action is implementation-dependent and outside the scope of
this specification. It will certainly not take the action in the
third column because, after it receives either of these values, it is
not in AccECN mode. I.e., it will not disable ECN (at least not just
because ACE is 0b000) and it will not set s.cep.

3.2.2.2. Encoding and Decoding Feedback in the ACE Field

Whenever the Data Receiver sends an ACK with SYN=0 (with or without
data), unless the handshake encoding in Section 3.2.2.1 applies, the
Data Receiver MUST encode the least significant 3 bits of its r.cep
counter into the ACE field (see Appendix A.2).

Whenever the Data Sender receives an ACK with SYN=0 (with or without
data), it first checks whether it has already been superseded by
another ACK, in which case it ignores the ECN feedback. If the ACK
has not been superseded, and if the special handshake encoding in
Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field
as follows (see Appendix A.2 for examples).
o  It takes the least significant 3 bits of its local s.cep counter
   and subtracts them from the incoming ACE counter to work out the
   minimum positive increment it could apply to s.cep (assuming the
   ACE field only wrapped at most once).

o  It then follows the safety procedures in Section 3.2.2.5.2 to
   calculate or estimate how many packets the ACK could have
   acknowledged under the prevailing conditions to determine whether
   the ACE field might have wrapped more than once.

The encode/decode procedures during the three-way handshake are
exceptions to the general rules given so far, so they are spelled out
step by step below for clarity:

o  If a TCP server in AccECN mode receives a CE mark in the IP-ECN
   field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it
   remains at its initial value of 5).

   Reason: It would be redundant for the server to include CE-marked
   SYNs in its r.cep counter, because it already reliably delivers
   feedback of any CE marking using the encoding in Table 2 in the
   SYN/ACK. This also ensures that, when the server starts using the
   ACE field, it has not unnecessarily consumed more than one initial
   value, given they can be used to negotiate variants of the AccECN
   protocol (see Appendix B.3).

o  If a TCP client in AccECN mode receives CE feedback in the TCP
   flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its
   initial value of 5), so that it stays in step with r.cep on the
   server. Nonetheless, the TCP client still triggers the congestion
   control actions necessary to respond to the CE feedback.

o  If a TCP client in AccECN mode receives a CE mark in the IP-ECN
   field of a SYN/ACK, it MUST increment r.cep, but no more than once
   no matter how many CE-marked SYN/ACKs it receives
   (i.e. incremented from 5 to 6, but no further).
   Reason: Incrementing r.cep ensures the client will eventually
   deliver any CE marking to the server reliably when it starts using
   the ACE field. Even though the client also feeds back any CE
   marking on the ACK of the SYN/ACK using the encoding in Table 3,
   this ACK is not delivered reliably, so it can be considered as a
   timely notification that is redundant but unreliable. The client
   does not increment r.cep more than once, because the server can
   only increment s.cep once (see next bullet). Also, this limits
   the unnecessarily consumed initial values of the ACE field to two.

o  If a TCP server in AccECN mode and in SYN-RCVD state receives CE
   feedback in the TCP flags of a pure ACK with no SACK blocks, it
   MUST increment s.cep (from 5 to 6). The TCP server then triggers
   the congestion control actions necessary to respond to the CE
   feedback.

   Reasoning: The TCP server can only increment s.cep once, because
   the first ACK it receives will cause it to transition out of
   SYN-RCVD state. The server's congestion response would be no
   different even if it could receive feedback of more than one
   CE-marked SYN/ACK.

   Once the TCP server transitions to ESTABLISHED state, it might
   later receive other pure ACK(s) with the handshake encoding in the
   ACE field. A server MAY implement a test for such a case, but it
   is not required. Therefore, once in the ESTABLISHED state, it
   will be sufficient for the server to consider the ACE field to be
   encoded as the normal ACE counter on all packets with SYN=0.

   Reasoning: Such ACKs will be quite unusual, e.g. a SYN/ACK (or ACK
   of the SYN/ACK) that is delayed for longer than the server's
   retransmission timeout; or packet duplication by the network. And
   the impact of any error in the feedback on such ACKs will only be
   temporary.

3.2.2.3. Testing for Mangling of the IP/ECN Field

The value of the ACE field on the SYN/ACK indicates the value of the
IP/ECN field when the SYN arrived at the server. The client can
compare this with how it originally set the IP/ECN field on the SYN.
If this comparison implies an invalid transition (defined below) of
the IP/ECN field, for the remainder of the half-connection the client
is advised to send non-ECN-capable packets, but it still ought to
respond to any feedback of CE markings (explained below). However,
the client MUST remain in the AccECN feedback mode and it MUST
continue to feed back any ECN markings on arriving packets (in its
role as Data Receiver).

The value of the ACE field on the last ACK of the 3WHS indicates the
value of the IP/ECN field when the SYN/ACK arrived at the client.
The server can compare this with how it originally set the IP/ECN
field on the SYN/ACK. If this comparison implies an invalid
transition of the IP/ECN field, for the remainder of the
half-connection the server is advised to send non-ECN-capable
packets, but it still ought to respond to any feedback of CE markings
(explained below). However, the server MUST remain in the AccECN
feedback mode and it MUST continue to feed back any ECN markings on
arriving packets (in its role as Data Receiver).

If a Data Sender in AccECN mode starts sending non-ECN-capable
packets because it has detected mangling, it is still advised to
respond to CE feedback. Reason: any CE-marking arriving at the Data
Receiver could be due to something early in the path mangling the
non-ECN-capable IP/ECN field into an ECN-capable codepoint and then,
later in the path, a network bottleneck might be applying CE-markings
to indicate genuine congestion.
This argument applies whether the handshake packet originally sent by
the client or server was non-ECN-capable or ECN-capable because, in
either case, an unsafe transition could imply that future
non-ECN-capable packets might get mangled.

The above advice on switching to sending non-ECN-capable packets but
still responding to CE-markings unless they become continuous is not
stated normatively (in capitals), because the best strategy might
depend on experience of the most likely types of mangling, which can
only be known at the time of deployment.

The ACK of the SYN/ACK is not reliably delivered (nonetheless, the
count of CE marks is still eventually delivered reliably). If this
ACK does not arrive, the server is advised to continue to send
ECN-capable packets without having tested for mangling of the IP/ECN
field on the SYN/ACK.

Invalid transitions of the IP/ECN field are defined in Section 18 of
[RFC3168] and repeated here for convenience:

o  the not-ECT codepoint changes;

o  either ECT codepoint transitions to not-ECT;

o  the CE codepoint changes.

RFC 3168 says that a router that changes ECT to not-ECT is invalid
but safe. However, from a host's viewpoint, this transition is
unsafe because it could be the result of two transitions at different
routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe).
This scenario could well happen where an ECN-enabled home router
congests its upstream mobile broadband bottleneck link, then the
ingress to the mobile network clears the ECN field [Mandalari18].

Once a Data Sender has entered AccECN mode it is advised to check
whether it is receiving continuous CE marking.
Specifying exactly how to do this is beyond the scope of the present
specification, but the sender might check whether the feedback for
every packet it sends for the first three or four rounds indicates
CE-marking. If continuous CE-marking is detected, for the remainder
of the half-connection, the Data Sender ought to send non-ECN-capable
packets and it is advised not to respond to any feedback of CE
markings. The Data Sender might occasionally test whether it can
resume sending ECN-capable packets. As always, once a host has
entered AccECN mode, it MUST remain in the same feedback mode and it
MUST continue to feed back any ECN markings on arriving packets.

All the fall-back behaviours in this section are necessary in case
mangling of the IP/ECN field is asymmetric, which is currently common
over some mobile networks [Mandalari18]. Then one end might see no
unsafe transition and continue sending ECN-capable packets, while the
other end sees an unsafe transition and stops sending ECN-capable
packets.

3.2.2.4. Testing for Zeroing of the ACE Field

Section 3.2.1 required the Data Receiver to initialize the r.cep
counter to a non-zero value. Therefore, in either direction the
initial value of the ACE counter ought to be non-zero.

If AccECN has been successfully negotiated, the Data Sender SHOULD
check the value of the ACE counter in the first packet (with or
without data) that arrives with SYN=0. If the value of this ACE
field is zero (0b000), for the remainder of the half-connection the
Data Sender ought to send non-ECN-capable packets and it is advised
not to respond to any feedback of CE markings. Reason: the symptoms
imply either potential mangling of the ECN fields in both the IP and
TCP headers, or a broken remote TCP implementation.
This advice is 1181 not stated normatively (in capitals), because the best strategy might 1182 depend on experience of the most likely types of mangling, which can 1183 only be known at the time of deployment. 1185 If reordering occurs, "the first packet ... that arrives" will not 1186 necessarily be the same as the first packet in sequence order. The 1187 test has been specified loosely like this to simplify implementation, 1188 and because it would not have been any more precise to have specified 1189 the first packet in sequence order, which would not necessarily be 1190 the first ACE counter that the Data Receiver fed back anyway, given 1191 it might have been a retransmission. Usually, the server checks the 1192 ACK of the SYN/ACK from the client, while the client checks the first 1193 data segment from the server. 1195 The possibility of re-ordering means that there is a small chance 1196 that the ACE field on the first packet to arrive is genuinely zero 1197 (without middlebox interference). This would cause a host to 1198 unnecessarily disable ECN for a half connection. Therefore, in 1199 environments where there is no evidence of the ACE field being 1200 zeroed, implementations can skip this test. 1202 Note that the Data Sender MUST NOT test whether the arriving counter 1203 in the initial ACE field has been initialized to a specific valid 1204 value - the above check solely tests whether the ACE fields have been 1205 incorrectly zeroed. This allows hosts to use different initial 1206 values as an additional signalling channel in future. 1208 3.2.2.5. Safety against Ambiguity of the ACE Field 1210 If too many CE-marked segments are acknowledged at once, or if a long 1211 run of ACKs is lost or thinned out, the 3-bit counter in the ACE 1212 field might have cycled between two ACKs arriving at the Data Sender. 1213 The following safety procedures minimize this ambiguity. 1215 3.2.2.5.1. 
Data Receiver Safety Procedures 1217 The following rules define when a Data Receiver in AccECN mode emits 1218 an ACK: 1220 Change-Triggered ACKs: An AccECN Data Receiver SHOULD emit an ACK 1221 whenever a data packet marked CE arrives after the previous packet 1222 was not CE. 1224 Even though this rule is stated as a "SHOULD", it is important for 1225 a transition to trigger an ACK if at all possible. The only valid 1226 exception to this rule is given below these bullets. 1228 For the avoidance of doubt, this rule is deliberately worded to 1229 apply solely when _data_ packets arrive, but the comparison with 1230 the previous packet includes any packet, not just data packets. 1232 Increment-Triggered ACKs: An AccECN Data Receiver MUST emit an ACK 1233 if 'n' CE marks have arrived since the previous ACK. If there is 1234 newly delivered data to acknowledge, 'n' SHOULD be 2. If there is 1235 no newly delivered data to acknowledge, 'n' SHOULD be 3 and MUST 1236 be no less than 3. In either case, 'n' MUST be no greater than 7. 1238 The above rules for when to send an ACK are designed to be 1239 complemented by those in Section 3.2.3.3, which concern whether the 1240 AccECN TCP Option ought to be included on ACKs. 1242 If the arrivals of a number of data packets are all processed as one 1243 event, e.g. using large receive offload (LRO) or generic receive 1244 offload (GRO), both the above rules SHOULD be interpreted as 1245 requiring multiple ACKs to be emitted back-to-back (for each 1246 transition and for each repetition by 'n' CE marks). If this is 1247 problematic for high performance, either rule can be interpreted as 1248 requiring just a single ACK at the end of the whole receive event.
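The two ACK-emission rules above can be illustrated with a minimal sketch. It is not part of the specification: the class name, method names and state variables are hypothetical, it processes one packet per call (no LRO/GRO batching), and it uses the recommended values n = 2 with newly delivered data and n = 3 without.

```python
from dataclasses import dataclass


@dataclass
class AckTrigger:
    """Hypothetical per-connection state for the two ACK-emission rules."""
    prev_was_ce: bool = False    # marking of the previous packet (data or not)
    ce_since_ack: int = 0        # CE marks that arrived since the last ACK
    unacked_data: bool = False   # newly delivered data not yet acknowledged

    def on_ack_sent(self) -> None:
        """Reset the counters whenever any ACK is emitted."""
        self.ce_since_ack = 0
        self.unacked_data = False

    def on_packet(self, is_ce: bool, has_data: bool) -> bool:
        """Return True if the rules call for an ACK on this arrival."""
        # Change-triggered: only data packets trigger, but the comparison
        # is against the previous packet of any kind.
        change = has_data and is_ce and not self.prev_was_ce
        self.prev_was_ce = is_ce
        if is_ce:
            self.ce_since_ack += 1
        if has_data:
            self.unacked_data = True
        # Increment-triggered: assumed n = 2 with new data, n = 3 without.
        n = 2 if self.unacked_data else 3
        if change or self.ce_since_ack >= n:
            self.on_ack_sent()
            return True
        return False
```

A caller that also sends ACKs for other reasons (e.g. its regular delayed-ACK timer) would call on_ack_sent() itself, so that 'n' is always counted from the most recent ACK.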
1250 Even if a number of data packets do not arrive as one event, the 1251 'Change-Triggered ACKs' rule could sometimes cause the ACK rate to be 1252 problematic for high performance (although high performance protocols 1253 such as DCTCP already successfully use change-triggered ACKs). The 1254 rationale for change-triggered ACKs is so that the Data Sender can 1255 rely on them to detect queue growth as soon as possible, particularly 1256 at the start of a flow. The approach can lead to some additional 1257 ACKs but it feeds back the timing and the order in which ECN marks 1258 are received with minimal additional complexity. If CE marks are 1259 infrequent, as is the case for most AQMs at the time of writing, or 1260 there are multiple marks in a row, the additional load will be low. 1261 However, marking patterns with numerous non-contiguous CE marks could 1262 increase the load significantly. One possible compromise would be 1263 for the receiver to heuristically detect whether the sender is in 1264 slow-start, then to implement change-triggered ACKs while the sender 1265 is in slow-start, and offload otherwise. 1267 With ECN-capable pure ACKs [I-D.ietf-tcpm-generalized-ecn], the 1268 'Increment-Triggered ACKs' rule could cause ECN-marked pure ACKs to 1269 trigger further ACKs. Although TCP normally only ACKs newly 1270 delivered data, in this case the ACKs of ACKs would feed back new 1271 congestion state. The minimum of 3 for 'n' in this case ensures 1272 that, even if there is pathological congestion in both directions, 1273 any resulting ping-pong of ACKs will be rapidly damped. 1275 These ACKs of ACKs could be misidentified as duplicate ACKs in 1276 certain circumstances described below. 
Therefore, a host in AccECN 1277 mode that is sending ECN-capable pure ACKs SHOULD add one of the 1278 following additional checks when it tests whether an incoming pure 1279 ACK is a duplicate: 1281 o If SACK has been negotiated for the connection, but there is no 1282 SACK option on the incoming pure ACK, it is not a duplicate; 1284 o If timestamps are in use, and the incoming pure ACK echoes a 1285 timestamp older than the oldest unacknowledged data, it is not a 1286 duplicate. 1288 In the unlikely event that neither SACK nor timestamps are in use, or 1289 if the implementation has opted not to include either of the above 1290 two checks, it SHOULD NOT send ECN-capable pure ACKs. If it does, it 1291 could lead to false detection of duplicate ACKs, causing spurious 1292 retransmission(s) with a resulting unnecessary reduction in 1293 congestion window; but only in certain circumstances. Specifically, 1294 if TCP peer A has been sending data, then receiving, then within one 1295 round trip it starts sending again, and the ECN-capable pure ACKs it 1296 sent in the previous round encounter heavy enough congestion to 1297 trigger peer B to invoke the above 'n'-CE-mark rule. Also note that 1298 falsely considering these ACKs as duplicates would incorrectly imply 1299 that data left the network. 1301 3.2.2.5.2. Data Sender Safety Procedures 1303 If the Data Sender has not received AccECN TCP Options to give it 1304 more dependable information, and it detects that the ACE field could 1305 have cycled, it SHOULD deem whether it cycled by taking the safest 1306 likely case under the prevailing conditions. It can detect if the 1307 counter could have cycled by using the jump in the acknowledgement 1308 number since the last ACK to calculate or estimate how many segments 1309 could have been acknowledged. An example algorithm to implement this 1310 policy is given in Appendix A.2.
An implementer MAY develop an 1311 alternative algorithm as long as it satisfies these requirements. 1313 If missing acknowledgement numbers arrive later (reordering) and 1314 prove that the counter did not cycle, the Data Sender MAY attempt to 1315 neutralize the effect of any action it took based on a conservative 1316 assumption that it later found to be incorrect. 1318 The Data Sender can estimate how many packets (of any marking) an ACK 1319 acknowledges. If the ACE counter on an ACK seems to imply that the 1320 minimum number of newly CE-marked packets is greater than the number 1321 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1322 counter, unless it can be sure that it is counting all control 1323 packets correctly. 1325 3.2.3. The AccECN Option 1327 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1328 of each field name stands for 'Echo'. 1330 0 1 2 3 1331 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1332 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1333 | Kind = TBD0 | Length = 11 | EE0B field | 1334 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1335 | EE0B (cont'd) | ECEB field | 1336 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1337 | EE1B field | Order 0 1338 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1340 0 1 2 3 1341 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1342 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1343 | Kind = TBD1 | Length = 11 | EE1B field | 1344 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1345 | EE1B (cont'd) | ECEB field | 1346 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1347 | EE0B field | Order 1 1348 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1350 Figure 4: The AccECN TCP Option 1352 Figure 4 shows two option field orders: order 0 and order 1. They 1353 both consist of three 24-bit fields.
Order 0 provides the 24 least 1354 significant bits of the r.e0b, r.ceb and r.e1b counters, 1355 respectively. Order 1 provides the same fields, but in the opposite 1356 order. On each packet, the Data Receiver can use whichever order is 1357 more efficient. 1359 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1360 field to TBD0 if using Order 0, or to TBD1 if using Order 1. These 1361 two new TCP Option Kinds are registered in Section 7 and called 1362 respectively AccECN0 and AccECN1. 1364 Note that there is no field to feed back Not-ECT bytes. Nonetheless 1365 an algorithm for the Data Sender to calculate the number of payload 1366 bytes received as Not-ECT is given in Appendix A.4. 1368 Whenever a Data Receiver sends an AccECN Option, the rules in 1369 Section 3.2.3.3 allow it to omit unchanged fields from the tail of 1370 the option, to help cope with option space limitations, as long as it 1371 preserves the order of the remaining fields and includes any field 1372 that has changed. The length field MUST indicate which fields are 1373 present as follows: 1375 +--------+------------------+------------------+ 1376 | Length | Type 0 | Type 1 | 1377 +--------+------------------+------------------+ 1378 | 11 | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B | 1379 | 8 | EE0B, ECEB | EE1B, ECEB | 1380 | 5 | EE0B | EE1B | 1381 | 2 | (empty) | (empty) | 1382 +--------+------------------+------------------+ 1384 Fields included in AccECN TCP Options of each length and type 1386 The empty option of Length=2 is provided to allow for a case where an 1387 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1388 but there is very limited space for the option. 1390 All implementations of a Data Sender that read any AccECN Option MUST 1391 be able to read in AccECN Options of any of the above lengths. 
For 1392 forward compatibility, if the AccECN Option is of any other length, 1393 implementations MUST use those whole 3-octet fields that fit within 1394 the length and ignore the remainder of the option, treating it as 1395 padding. 1397 The AccECN Option has to be optional to implement, because both 1398 sender and receiver have to be able to cope without the option anyway 1399 - in cases where it does not traverse a network path. It is 1400 RECOMMENDED to implement both sending and receiving of the AccECN 1401 Option. Support for the AccECN Option is particularly valuable over 1402 paths that introduce a high degree of ACK filtering, where the 3-bit 1403 ACE counter alone might sometimes be insufficient, when it is 1404 ambiguous whether it has wrapped. If sending of the AccECN Option is 1405 implemented, the fall-backs described in this document will need to 1406 be implemented as well (unless solely for a controlled environment 1407 where path traversal is not considered a problem). Even if a 1408 developer does not implement sending of the AccECN Option, it is 1409 RECOMMENDED that they still implement logic to receive and understand 1410 any AccECN Options sent by remote peers. 1412 If a Data Receiver intends to send the AccECN Option at any time 1413 during the rest of the connection it is strongly RECOMMENDED to also 1414 test path traversal of the AccECN Option as specified in 1415 Section 3.2.3.2. 1417 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1419 Whenever the Data Receiver includes any of the counter fields (ECEB, 1420 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1421 significant bits of the current value of the associated counter into 1422 the field (respectively r.ceb, r.e0b, r.e1b). 1424 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1425 first checks whether the ACK has already been superseded by another 1426 ACK, in which case it ignores the ECN feedback.
If the ACK has not 1427 been superseded, the Data Sender normally decodes the fields in the 1428 AccECN Option as follows. For each field, it takes the least 1429 significant 24 bits of its associated local counter (s.ceb, s.e0b or 1430 s.e1b) and subtracts them from the counter in the associated field of 1431 the incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work 1432 out the minimum positive increment it could apply to s.ceb, s.e0b or 1433 s.e1b (assuming the field in the option only wrapped at most once). 1435 Appendix A.1 gives an example algorithm for the Data Receiver to 1436 encode its byte counters into the AccECN Option, and for the Data 1437 Sender to decode the AccECN Option fields into its byte counters. 1439 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1440 ACK=0) is not included in any of the byte counters held locally for 1441 each ECN marking nor in the AccECN Option on the wire. 1443 3.2.3.2. Path Traversal of the AccECN Option 1445 3.2.3.2.1. Testing the AccECN Option during the Handshake 1447 The TCP client MUST NOT include the AccECN TCP Option on the SYN. If 1448 there is somehow an AccECN Option on a SYN, it MUST be ignored when 1449 forwarded or received. (A fall-back strategy for the loss of the 1450 SYN, possibly due to middlebox interference, is specified in 1451 Section 3.1.4.) 1453 A TCP server that confirms its support for AccECN (in response to an 1454 AccECN SYN from the client as described in Section 3.1) SHOULD 1455 include an AccECN TCP Option on the SYN/ACK. 1457 A TCP client that has successfully negotiated AccECN SHOULD include 1458 an AccECN Option in the first ACK at the end of the 3WHS. However, 1459 this first ACK is not delivered reliably, so the TCP client SHOULD 1460 also include an AccECN Option on the first data segment it sends (if 1461 it ever sends one). 
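As an illustration of the encoding and decoding rule in Section 3.2.3.1, the following is a minimal sketch and not part of the specification. The function names are hypothetical, and it assumes, as the text does, that a field has wrapped at most once since the local counter was last updated. Appendix A.1 remains the reference example.

```python
FIELD_MOD = 1 << 24  # each option field carries the 24 least significant bits


def encode_field(receiver_counter: int) -> int:
    """Value the Data Receiver writes into ECEB, EE0B or EE1B."""
    return receiver_counter % FIELD_MOD


def decode_field(sender_counter: int, field: int) -> int:
    """Updated sender counter (s.ceb, s.e0b or s.e1b) after one option field.

    The smallest non-negative increment consistent with the field is applied,
    i.e. (field - low 24 bits of the local counter) modulo 2^24.
    """
    delta = (field - sender_counter) % FIELD_MOD
    return sender_counter + delta
```

For example, if the sender holds 0x1FFFFF0 and the receiver's counter has advanced to 0x2000010, the field on the wire is 0x000010 and decode_field() returns 0x2000010, even though the 24-bit field wrapped in between.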
1463 A host MAY omit the AccECN Option in any of the above three cases due 1464 to insufficient option space or if it has cached knowledge that the 1465 packet would be likely to be blocked on the path to the other host if 1466 it included an AccECN Option. 1468 3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option 1470 If after the normal TCP timeout the TCP server has not received an 1471 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1472 lost, e.g. due to congestion, or a middlebox might be blocking the 1473 AccECN Option. To expedite connection setup, the TCP server SHOULD 1474 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1475 as on the original SYN/ACK but with no AccECN Option. If this 1476 retransmission times out, to expedite connection setup, the TCP 1477 server SHOULD disable AccECN and ECN for this connection by 1478 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option. 1480 Implementers MAY use other fall-back strategies if they are found to 1481 be more effective (e.g. retrying the AccECN Option for a second time 1482 before fall-back - most appropriate during high levels of 1483 congestion). However, other fall-back strategies will need to follow 1484 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1485 SYN/ACKs negotiating different types of feedback have been sent 1486 within the same connection. 1488 If the TCP client detects that the first data segment it sent with 1489 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1490 on the retransmission. Again, implementers MAY use other fall-back 1491 strategies such as attempting to retransmit a second segment with the 1492 AccECN Option before fall-back, and/or caching whether the AccECN 1493 Option is blocked for subsequent connections. [RFC9040] further 1494 discusses caching of TCP parameters and status information. 
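The default SYN/ACK fall-back sequence described at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: the SynAck type and the function name are hypothetical, and the flag values on the original SYN/ACK are whatever the negotiation in Section 3.1 produced.

```python
from typing import NamedTuple


class SynAck(NamedTuple):
    """Hypothetical summary of the ECN-related parts of a SYN/ACK."""
    ae: int
    cwr: int
    ece: int
    accecn_option: bool


def synack_for_attempt(original: SynAck, timeouts: int) -> SynAck:
    """SYN/ACK to send after the given number of retransmission timeouts."""
    if timeouts == 0:
        return original                                # first transmission
    if timeouts == 1:
        return original._replace(accecn_option=False)  # same flags, no option
    return SynAck(ae=0, cwr=0, ece=0, accecn_option=False)  # no AccECN, no ECN
```

Other strategies are permitted by the text above (e.g. retrying the option once more before falling back), as long as they follow the rules in Section 3.1.5.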
1496 If a host falls back to not sending the AccECN Option, it will 1497 continue to process any incoming AccECN Options as normal. 1499 Either host MAY include the AccECN Option in a subsequent segment to 1500 retest whether the AccECN Option can traverse the path. 1502 If the TCP server receives a second SYN with a request for AccECN 1503 support, it is advised to resend the SYN/ACK, again confirming its 1504 support for AccECN, but this time without the AccECN Option. This 1505 approach rules out any interference by middleboxes that might drop 1506 packets with unknown options, even though it is more likely that the 1507 SYN/ACK would have been lost due to congestion. The TCP server MAY 1508 try to send another packet with the AccECN Option at a later point 1509 during the connection but it ought to monitor if that packet got lost 1510 as well, in which case it SHOULD disable the sending of the AccECN 1511 Option for this half-connection. 1513 Similarly, an AccECN end-point MAY separately memorize which data 1514 packets carried an AccECN Option and disable the sending of AccECN 1515 Options if the loss probability of those packets is significantly 1516 higher than that of all other data packets in the same connection. 1518 3.2.3.2.3. Testing for Absence of the AccECN Option 1520 If the TCP client has successfully negotiated AccECN but does not 1521 receive an AccECN Option on the SYN/ACK (e.g. because it has been 1522 stripped by a middlebox or not sent by the server), the client 1523 switches into a mode that assumes that the AccECN Option is not 1524 available for this half connection. 1526 Similarly, if the TCP server has successfully negotiated AccECN but 1527 does not receive an AccECN Option on the first segment that 1528 acknowledges sequence space at least covering the ISN, it switches 1529 into a mode that assumes that the AccECN Option is not available for 1530 this half connection.
1532 While a host is in this mode that assumes incoming AccECN Options are 1533 not available, it MUST adopt the conservative interpretation of the 1534 ACE field discussed in Section 3.2.2.5. However, it cannot make any 1535 assumption about support of outgoing AccECN Options on the other half 1536 connection, so it SHOULD continue to send the AccECN Option itself 1537 (unless it has established that sending the AccECN Option is causing 1538 packets to be blocked as in Section 3.2.3.2.2). 1540 If a host is in the mode that assumes incoming AccECN Options are not 1541 available, but it receives an AccECN Option at any later point during 1542 the connection, this clearly indicates that the AccECN Option is not 1543 blocked on the respective path, and the AccECN endpoint MAY switch 1544 out of the mode that assumes the AccECN Option is not available for 1545 this half connection. 1547 3.2.3.2.4. Test for Zeroing of the AccECN Option 1549 For a related test for invalid initialization of the ACE field, see 1550 Section 3.2.2.4. 1552 Section 3.2.1 required the Data Receiver to initialize the r.e0b and 1553 r.e1b counters to a non-zero value. Therefore, in either direction 1554 the initial value of the EE0B field or EE1B field in the AccECN 1555 Option (if one exists) ought to be non-zero. If AccECN has been 1556 negotiated: 1558 o the TCP server MAY check that the initial value of the EE0B field 1559 or the EE1B field is non-zero in the first segment that 1560 acknowledges sequence space that at least covers the ISN plus 1. 1561 If it runs a test and either initial value is zero, the server 1562 will switch into a mode that ignores the AccECN Option for this 1563 half connection. 1565 o the TCP client MAY check that the initial value of the EE0B field or 1566 the EE1B field is non-zero on the SYN/ACK. If it runs a test and 1567 either initial value is zero, the client will switch into a mode 1568 that ignores the AccECN Option for this half connection.
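The optional zeroing test in the two bullets above amounts to a very small check, sketched here purely as an illustration. The function name is hypothetical, and a field that was truncated from the option is represented as None. The sketch deliberately tests only for zero and never for a specific initial value.

```python
from typing import Optional


def option_looks_zeroed(ee0b: Optional[int], ee1b: Optional[int]) -> bool:
    """True if any EE0B/EE1B field present on the tested segment is zero.

    Intended only for the first relevant segment: the SYN/ACK at the client,
    or the first segment covering the ISN plus 1 at the server.
    """
    present = [f for f in (ee0b, ee1b) if f is not None]
    return any(f == 0 for f in present)
```

A host for which this returns True would switch into the mode that ignores the AccECN Option for the half connection and would interpret the ACE field conservatively.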
1570 While a host is in the mode that ignores the AccECN Option it MUST 1571 adopt the conservative interpretation of the ACE field discussed in 1572 Section 3.2.2.5. 1574 Note that the Data Sender MUST NOT test whether the arriving byte 1575 counters in the initial AccECN Option have been initialized to 1576 specific valid values - the above checks solely test whether these 1577 fields have been incorrectly zeroed. This allows hosts to use 1578 different initial values as an additional signalling channel in 1579 future. Also note that the initial value of either field might be 1580 greater than its expected initial value, because the counters might 1581 already have been incremented. Nonetheless, the initial values of 1582 the counters have been chosen so that they cannot wrap to zero on 1583 these initial segments. 1585 3.2.3.2.5. Consistency between AccECN Feedback Fields 1587 When the AccECN Option is available it ought to provide more 1588 unambiguous feedback. However, it supplements but does not replace 1589 the ACE field. An endpoint using AccECN feedback MUST always 1590 reconcile the information provided in the ACE field with that in any 1591 AccECN Option, so that the state of the ACE-related packet counter 1592 can be relied on if future feedback does not carry the AccECN Option. 1594 If the AccECN option is present, the s.cep counter might increase 1595 more than expected from the increase of the s.ceb counter (e.g. due 1596 to a CE-marked control packet). The sender's response to such a 1597 situation is out of scope, and needs to be dealt with in a 1598 specification that uses ECN-capable control packets. Theoretically, 1599 this situation could also occur if a middlebox mangled the AccECN 1600 Option but not the ACE field. However, the Data Sender has to assume 1601 that the integrity of the AccECN Option is sound, based on the above 1602 test of the well-known initial values and optionally other integrity 1603 tests (Section 5.3). 
1605 If either end-point detects that the s.ceb counter has increased but 1606 the s.cep has not (and by testing ACK coverage it is certain how much 1607 the ACE field has wrapped), and if there is no explanation other than 1608 an invalid protocol transition due to some form of feedback mangling, 1609 the Data Sender MUST disable sending ECN-capable packets for the 1610 remainder of the half-connection by setting the IP/ECN field in all 1611 subsequent packets to Not-ECT. 1613 3.2.3.3. Usage of the AccECN TCP Option 1615 If a Data Receiver in AccECN mode intends to use the AccECN TCP 1616 Option to provide feedback, the rules below determine when it 1617 includes an AccECN TCP Option, and which fields to include, given 1618 other options might be competing for limited option space: 1620 Importance of Congestion Control: AccECN is for congestion control, 1621 which SHOULD generally be considered important relative to other 1622 TCP options. 1624 If SACK has been negotiated, and the smallest recommended AccECN 1625 Option would leave insufficient space for two SACK blocks on a 1626 particular ACK, the Data Receiver MUST give precedence to the SACK 1627 option (total 18 octets), because loss feedback is more critical. 1629 Recommended Simple Scheme: The Data Receiver SHOULD include an 1630 AccECN TCP Option on every scheduled ACK if any byte counter has 1631 incremented since the last ACK. Whenever possible, it SHOULD 1632 include a field for every byte counter that has changed at some 1633 time during the connection (see examples later). 1635 A scheduled ACK means an ACK that the Data Receiver would send by 1636 its regular delayed ACK rules. Recall that Section 1.3 defines an 1637 'ACK' as either with data payload or without. But the above rule 1638 is worded so that, in the common case when most of the data is 1639 from a server to a client, the server only includes an AccECN TCP 1640 Option while it is acknowledging data from the client. 
1642 When available TCP option space is limited on particular packets, the 1643 recommended scheme will need to include compromises. To guide the 1644 implementer the rules below are ranked in order of importance, but 1645 the final decision has to be implementation-dependent, because 1646 tradeoffs will alter as new TCP options are defined and new use-cases 1647 arise. 1649 Necessary Option Length: The Data Receiver MUST only include an 1650 AccECN TCP Option on a packet if it includes all the counter(s) 1651 that have incremented since the previous AccECN Option. It MUST 1652 only truncate unchanged fields from the right-hand tail of the 1653 option to preserve the order of the remaining fields (see 1654 Section 3.2.3); 1656 Change-Triggered AccECN TCP Options: If an arriving packet 1657 increments a different byte counter to that incremented by the 1658 previous packet, the Data Receiver SHOULD feed it back in an 1659 AccECN Option on the next scheduled ACK. 1661 For the avoidance of doubt, this rule does not concern the arrival 1662 of control packets with no payload, because they cannot alter any 1663 byte counters. 1665 Continual Repetition: Otherwise, if arriving packets continue to 1666 increment the same byte counter: 1668 * the Data Receiver SHOULD include a counter that has continued 1669 to increment on the next scheduled ACK following a change- 1670 triggered AccECN TCP Option; 1672 * while the same counter continues to increment, it SHOULD 1673 include the counter every n ACKs as consistently as possible, 1674 where n can be chosen by the implementer; 1676 * it SHOULD always include an AccECN Option if the r.ceb counter 1677 is incrementing and it MAY include an AccECN Option if r.e0b 1678 or r.e1b is incrementing; 1680 * it SHOULD include each counter at least once for every 2^22 1681 bytes incremented to prevent overflow during continual 1682 repetition.
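The 'Necessary Option Length' rule, combined with the two field orders and the length table in Section 3.2.3, can be sketched as a small selection function. This is an illustrative assumption about how an implementation might choose, not a required algorithm: the names are hypothetical, and 'needed' is whatever set of counters the caller has decided must be carried.

```python
from typing import Set, Tuple

# Field order for each option kind, as in Figure 4 (order 0 and order 1).
ORDERS = {0: ("e0b", "ceb", "e1b"), 1: ("e1b", "ceb", "e0b")}


def choose_option(needed: Set[str]) -> Tuple[int, int]:
    """Return (order, length) of the shortest option carrying every needed counter.

    Fields can only be truncated from the tail, so each order has to keep
    everything up to and including its last needed field.
    """
    best = None
    for order, fields in ORDERS.items():
        count = max(
            (i + 1 for i, name in enumerate(fields) if name in needed),
            default=0,
        )
        length = 2 + 3 * count  # 2 octets of Kind and Length, 3 per field
        if best is None or length < best[1]:
            best = (order, length)
    return best
```

For instance, if only r.e1b and r.ceb need to be fed back, order 1 fits them in 8 octets, whereas order 0 would need all 11. Ties are broken in favour of order 0, which is an arbitrary choice in this sketch.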
1684 The above rules complement those in Section 3.2.2.5, which determine 1685 when to generate an ACK irrespective of whether an AccECN TCP Option 1686 is to be included. 1688 The recommended scheme is intended as a simple way to ensure that all 1689 the relevant byte counters will be carried on any ACK that reaches 1690 the Data Sender, no matter how many pure ACKs are filtered or 1691 coalesced along the network path, and without consuming the space 1692 available for payload data with counter field(s) that have never 1693 changed. 1695 As an example of the recommended scheme, if ECT(0) is the only 1696 codepoint that has ever arrived in the IP-ECN field, the Data 1697 Receiver will feed back an AccECN0 TCP Option with only the EE0B 1698 field on every packet. However, as soon as even one CE-marked packet 1699 arrives, on every packet that acknowledges new data it will start to 1700 include an option with two fields, EE0B and ECEB. As a second 1701 example, if the first packet to arrive happens to be CE-marked, the 1702 Data Receiver will have to arbitrarily choose whether to precede the 1703 ECEB field with an EE0B field or an EE1B field. If it chooses, say, 1704 EE0B but it turns out never to receive ECT(0), it can start sending 1705 EE1B and ECEB instead - it does not have to include the EE0B field if 1706 the r.e0b counter has never changed during the connection. 1708 With the recommended scheme, if the data sending direction switches 1709 during a connection, there can be cases where the AccECN TCP Option 1710 that is meant to feed back the counter values at the end of a volley 1711 in one direction never reaches the other peer, due to packet loss. 1712 ACE feedback ought to be sufficient to fill this gap, given accurate 1713 feedback becomes moot after data transmission has paused. 1715 Appendix A.3 gives an example algorithm to estimate the number of 1716 marked bytes from the ACE field alone, if the AccECN Option is not 1717 available.
1719 If a host has determined that segments with the AccECN Option always 1720 seem to be discarded somewhere along the path, it is no longer 1721 obliged to follow any of the rules in this section. 1723 3.3. AccECN Compliance Requirements for TCP Proxies, Offload Engines 1724 and other Middleboxes 1726 3.3.1. Requirements for TCP Proxies 1728 A large class of middleboxes split TCP connections. Such a middlebox 1729 would be compliant with the AccECN protocol if the TCP implementation 1730 on each side complied with the present AccECN specification and each 1731 side negotiated AccECN independently of the other side. 1733 3.3.2. Requirements for Transparent Middleboxes and TCP Normalizers 1735 Another large class of middleboxes intervenes to some degree at the 1736 transport layer, but attempts to be transparent (invisible) to the 1737 end-to-end connection. A subset of this class of middleboxes 1738 attempts to `normalize' the TCP wire protocol by checking that all 1739 values in header fields comply with a rather narrow interpretation of 1740 the TCP specifications that is also not always up to date. 1742 A middlebox that is not normalizing the TCP protocol and does not 1743 itself act as a back-to-back pair of TCP endpoints (i.e. a middlebox 1744 that intends to be transparent or invisible at the transport layer) 1745 ought to forward the AccECN TCP Option unaltered, whether or not the 1746 length value matches one of those specified in Section 3.2.3, and 1747 whether or not the initial values of the byte-counter fields match 1748 those in Section 3.2.1. This is because blocking apparently invalid 1749 values prevents the standardized set of values being extended in 1750 future (given outdated normalizers would block updated hosts from 1751 using the extended AccECN standard). 
1753 A TCP normalizer is likely to block or alter an AccECN TCP Option if 1754 the length value or the initial values of its byte-counter fields do 1755 not match one of those specified in Section 3.2.3 or Section 3.2.1. 1756 However, to comply with the present AccECN specification, a middlebox 1757 MUST NOT change the ACE field; or those fields of the AccECN Option 1758 that are currently specified in Section 3.2.3; or any AccECN field 1759 covered by integrity protection (e.g. [RFC5925]). 1761 3.3.3. Requirements for TCP ACK Filtering 1763 A node that implements ACK filtering (aka. thinning or coalescing) 1764 SHOULD determine if an ACK is part of a connection using AccECN and 1765 SHOULD then preserve the correct operation of AccECN feedback. The 1766 following notes might help with each part of this requirement: 1768 o To determine whether a pure TCP ACK is part of an AccECN 1769 connection without resorting to connection tracking and per-flow 1770 state, a useful heuristic would be to check for a non-zero ECN 1771 field at the IP layer (because the ECN++ experiment only allows 1772 TCP pure ACKs to be ECN-capable if AccECN has been negotiated 1773 [I-D.ietf-tcpm-generalized-ecn]). This heuristic is simple and 1774 stateless. However, it might omit some AccECN ACKs, because it is 1775 only recommended but not obligatory to use ECN++ with AccECN - 1776 only deployment experience will tell. Also, TCP ACKs might be 1777 ECN-capable owing to some scheme other than AccECN, e.g. [RFC5690] 1778 or some future standards action. Again, only deployment 1779 experience will tell. 1781 o The main concern with preserving correct AccECN operation involves 1782 leaving enough ACKs for the Data Sender to work out whether the 1783 3-bit ACE field has wrapped. ACE field wrap might be of less 1784 concern if packets also carry the AccECN TCP Option. 
1786 Note that the present specification of AccECN in TCP does not presume 1787 to rely on any of the above ACK filtering behaviour in the network 1788 (hence the use of 'SHOULD' rather than 'MUST' above), because it has 1789 to be robust against pre-existing network nodes that do not 1790 distinguish AccECN ACKs, and robust against ACK loss during overload 1791 more generally. 1793 Section 5.2.1 of BCP 69 [RFC3449] gives best current practice on pure 1794 TCP ACK filtering. It gives no advice on ACKs carrying ECN feedback, 1795 other than that filtering ought to preserve the correct operation of 1796 ECN feedback, because at the time it said that "SACK and ECN remain 1797 areas of ongoing research". This section updates that best current 1798 practice for a TCP connection that supports AccECN feedback. 1800 3.3.4. Requirements for TCP Segmentation Offload 1802 Hardware to offload certain TCP processing represents another large 1803 class of middleboxes (even though it is often a function of a host's 1804 network interface and rarely in its own 'box'). 1806 The ACE field changes with every received CE marking, so today's 1807 receive offloading could lead to many interrupts in high congestion 1808 situations. Although that would be useful (because congestion 1809 information is received sooner), it could also significantly increase 1810 processor load, particularly in scenarios such as DCTCP or L4S where 1811 the marking rate is generally higher. 1813 Current offload hardware ejects a segment from the coalescing process 1814 whenever the TCP ECN flags change. Thus Classic ECN causes offload 1815 to be inefficient. In data centres it has been fortunate for this 1816 offload hardware that DCTCP-style feedback changes less often when 1817 there are long sequences of CE marks, which is more common with a 1818 step marking threshold (but less likely the more short flows are in 1819 the mix). 
The ACE counter approach has been designed so that 1820 coalescing can continue over arbitrary patterns of marking and only 1821 needs to stop when the counter wraps. Nonetheless, until the 1822 particular offload hardware in use implements this more efficient 1823 approach, it is likely to be more efficient for AccECN connections to 1824 implement this counter-style logic using software segmentation 1825 offload. 1827 ECN encodes a varying signal in the ACK stream, so it is inevitable 1828 that offload hardware will ultimately need to handle any form of ECN 1829 feedback exceptionally. The ACE field has been designed as a counter 1830 so that it is straightforward for offload hardware to pass on the 1831 highest counter, and to push a segment from its cache before the 1832 counter wraps. The purpose of working towards standardized TCP ECN 1833 feedback is to reduce the risk for hardware developers, who would 1834 otherwise have to guess which scheme is likely to become dominant. 1836 The above process has been designed to enable a continuing 1837 incremental deployment path - to more highly dynamic congestion 1838 control. Once offload hardware supports AccECN, it will be able to 1839 coalesce efficiently for any sequence of marks, instead of relying 1840 for efficiency on the long marking sequences from step marking. In 1841 the next stage, marking can evolve from a step to a ramp function. 1842 That in turn will allow host congestion control algorithms to respond 1843 faster to dynamics, while being backwards compatible with existing 1844 host algorithms. 1846 4. Updates to RFC 3168 1848 Normative statements in the following sections of RFC3168 are updated 1849 by the present AccECN specification: 1851 o The whole of "6.1.1 TCP Initialization" of [RFC3168] is updated by 1852 Section 3.1 of the present specification. 1854 o In "6.1.2. 
The TCP Sender" of [RFC3168], all mentions of a 1855 congestion response to an ECN-Echo (ECE) ACK packet are updated by 1856 Section 3.2 of the present specification to mean an increment to 1857 the sender's count of CE-marked packets, s.cep. And the 1858 requirements to set the CWR flag no longer apply, as specified in 1859 Section 3.1.5 of the present specification. Otherwise, the 1860 remaining requirements in "6.1.2. The TCP Sender" still stand. 1862 It will be noted that RFC 8311 already updates, or potentially 1863 updates, a number of the requirements in "6.1.2. The TCP Sender". 1864 Section 6.1.2 of RFC 3168 extended standard TCP congestion control 1865 [RFC5681] to cover ECN marking as well as packet drop. Whereas, 1866 RFC 8311 enables experimentation with alternative responses to ECN 1867 marking, if specified for instance by an experimental RFC on the 1868 IETF document stream. RFC 8311 also strengthened the statement 1869 that "ECT(0) SHOULD be used" to a "MUST" (see [RFC8311] for the 1870 details). 1872 o The whole of "6.1.3. The TCP Receiver" of [RFC3168] is updated by 1873 Section 3.2 of the present specification, with the exception of 1874 the last paragraph (about congestion response to drop and ECN in 1875 the same round trip), which still stands. Incidentally, this last 1876 paragraph is in the wrong section, because it relates to TCP 1877 sender behaviour. 1879 o The following text within "6.1.5. Retransmitted TCP packets": 1881 "the TCP data receiver SHOULD ignore the ECN field on arriving 1882 data packets that are outside of the receiver's current 1883 window." 1885 is updated by more stringent acceptability tests for any packet 1886 (not just data packets) in the present specification. 
1887 Specifically, in the normative specification of AccECN (Section 3) 1888 only 'Acceptable' packets contribute to the ECN counters at the 1889 AccECN receiver and Section 1.3 defines an Acceptable packet as 1890 one that passes the acceptability tests in both [RFC0793] and 1891 [RFC5961]. 1893 o Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of [RFC3168] prohibit 1894 use of ECN on TCP control packets and retransmissions. The 1895 present specification does not update that aspect of RFC 3168, but 1896 it does say what feedback an AccECN Data Receiver ought to provide 1897 if it receives an ECN-capable control packet or retransmission. 1898 This ensures AccECN is forward compatible with any future scheme 1899 that allows ECN on these packets, as provided for in section 4.3 1900 of [RFC8311] and as proposed in [I-D.ietf-tcpm-generalized-ecn]. 1902 5. Interaction with TCP Variants 1904 This section is informative, not normative. 1906 5.1. Compatibility with SYN Cookies 1908 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1909 protect itself from SYN flooding attacks. It places minimal commonly 1910 used connection state in the SYN/ACK, and deliberately does not hold 1911 any state while waiting for the subsequent ACK (e.g. it closes the 1912 thread). Therefore it cannot record the fact that it entered AccECN 1913 mode for both half-connections. Indeed, it cannot even remember 1914 whether it negotiated the use of classic ECN [RFC3168]. 1916 Nonetheless, such a server can determine that it negotiated AccECN as 1917 follows. 
If a TCP server using SYN Cookies supports AccECN and if it 1918 receives a pure ACK that acknowledges an ISN that is a valid SYN 1919 cookie, and if the ACK contains an ACE field with the value 0b010 to 1920 0b111 (decimal 2 to 7), it can assume that: 1922 o the TCP client has to have requested AccECN support on the SYN 1924 o it (the server) has to have confirmed that it supported AccECN 1926 Therefore the server can switch itself into AccECN mode, and continue 1927 as if it had never forgotten that it switched itself into AccECN mode 1928 earlier. 1930 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1931 with the value 0b000 or 0b001, these values indicate that the client 1932 did not request support for AccECN and therefore the server does not 1933 enter AccECN mode for this connection. Further, 0b001 on the ACK 1934 implies that the server sent an ECN-capable SYN/ACK, which was marked 1935 CE in the network, and the non-AccECN client fed this back by setting 1936 ECE on the ACK of the SYN/ACK. 1938 5.2. Compatibility with TCP Experiments and Common TCP Options 1940 AccECN is compatible (at least on paper) with the most commonly used 1941 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1942 also compatible with the recent promising experimental TCP options 1943 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1944 AccECN is friendly to all these protocols, because space for TCP 1945 options is particularly scarce on the SYN, where AccECN consumes zero 1946 additional header space. 1948 When option space is under pressure from other options, 1949 Section 3.2.3.3 provides guidance on how important it is to send an 1950 AccECN Option relative to other options, and which fields are more 1951 important to include. 1953 Implementers of TFO need to take careful note of the recommendation 1954 in Section 3.2.2.1. 
That section recommends that, if the client has 1955 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1956 if it has data to send, it sends a pure ACK immediately before the 1957 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1958 pure ACK, which allows the server to detect ECN mangling. Note that, 1959 as specified in Section 3.2, any data on the SYN (SYN=1, ACK=0) is 1960 not included in any of the byte counters held locally for each ECN 1961 marking, nor in the AccECN Option on the wire. 1963 5.3. Compatibility with Feedback Integrity Mechanisms 1965 Three alternative mechanisms are available to assure the integrity of 1966 ECN and/or loss signals. AccECN is compatible with any of these 1967 approaches: 1969 o The Data Sender can test the integrity of the receiver's ECN (or 1970 loss) feedback by occasionally setting the IP-ECN field to a value 1971 normally only set by the network (and/or deliberately leaving a 1972 sequence number gap). Then it can test whether the Data 1973 Receiver's feedback faithfully reports what it expects (similar to 1974 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1975 [RFC3540], this approach does not waste the ECT(1) codepoint in 1976 the IP header, it does not require standardization and it does not 1977 rely on misbehaving receivers volunteering to reveal feedback 1978 information that allows them to be detected. However, setting the 1979 CE mark by the sender might conceal actual congestion feedback 1980 from the network and therefore ought to only be done sparingly. 1982 o Networks generate congestion signals when they are becoming 1983 congested, so networks are more likely than Data Senders to be 1984 concerned about the integrity of the receiver's feedback of these 1985 signals. A network can enforce a congestion response to its ECN 1986 markings (or packet losses) using congestion exposure (ConEx) 1987 audit [RFC7713]. 
Whether the receiver or a downstream network is 1988 suppressing congestion feedback or the sender is unresponsive to 1989 the feedback, or both, ConEx audit can neutralize any advantage 1990 that any of these three parties would otherwise gain. 1992 ConEx is an experimental change to the Data Sender that would be 1993 most useful when combined with AccECN. Without AccECN, the ConEx 1994 behaviour of a Data Sender would have to be more conservative than 1995 would be necessary if it had the accurate feedback of AccECN. 1997 o The standards track TCP authentication option (TCP-AO [RFC5925]) 1998 can be used to detect any tampering with AccECN feedback between 1999 the Data Receiver and the Data Sender (whether malicious or 2000 accidental). The AccECN fields are immutable end-to-end, so they 2001 are amenable to TCP-AO protection, which covers TCP options by 2002 default. However, TCP-AO is often too brittle to use on many end- 2003 to-end paths, where middleboxes can make verification fail in 2004 their attempts to improve performance or security, e.g. by 2005 resegmentation or shifting the sequence space. 2007 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 2008 of congestion feedback. With minor changes AccECN could be optimized 2009 for the possibility that the ECT(1) codepoint might be used as an ECN 2010 Nonce. However, given RFC 3540 has been reclassified as historic, 2011 the AccECN design has been generalized so that it ought to be able to 2012 support other possible uses of the ECT(1) codepoint, such as a lower 2013 severity or a more instant congestion signal than CE. 2015 6. Protocol Properties 2017 This section is informative not normative. It describes how well the 2018 protocol satisfies the agreed requirements for a more accurate ECN 2019 feedback protocol [RFC7560]. 2021 Accuracy: From each ACK, the Data Sender can infer the number of new 2022 CE marked segments since the previous ACK. 
This provides better accuracy on CE feedback than classic ECN. In addition, if the AccECN Option is present (not blocked by the network path), the numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

Overhead: The AccECN scheme is divided into two parts. The essential part reuses the 3 flags already assigned to ECN in the TCP header. The supplementary part adds an additional TCP option consuming up to 11 bytes. However, no TCP option is consumed in the SYN.

Ordering: The order in which marks arrive at the Data Receiver is preserved in AccECN feedback, because the Data Receiver is expected to send an ACK immediately whenever a different mark arrives.

Timeliness: While the same ECN markings are arriving continually at the Data Receiver, it can defer ACKs as TCP does normally, but it will immediately send an ACK as soon as a different ECN marking arrives.

Timeliness vs Overhead: Change-Triggered ACKs are intended to enable latency-sensitive uses of ECN feedback by capturing the timing of transitions but not wasting resources while the state of the signalling system is stable. Within the constraints of the change-triggered ACK rules, the receiver can control how frequently it sends the AccECN TCP Option and therefore to some extent it can control the overhead induced by AccECN.

Resilience: All information is provided based on counters. Therefore if ACKs are lost, the counters on the first ACK following the losses allow the Data Sender to immediately recover the number of ECN markings that it missed. And if data or ACKs are reordered, stale congestion information can be identified and ignored.

Resilience against Bias: Because feedback is based on repetition of counters, random losses do not remove any information; they only delay it.
Therefore, even though some ACKs are change-triggered, 2061 random losses will not alter the proportions of the different ECN 2062 markings in the feedback. 2064 Resilience vs Overhead: If space is limited in some segments 2065 (e.g. because more options are needed on some segments, such as 2066 the SACK option after loss), the Data Receiver can send AccECN 2067 Options less frequently or truncate fields that have not changed, 2068 usually down to as little as 5 bytes. However, it has to send a 2069 full-sized AccECN Option at least three times per RTT, which the 2070 Data Sender can rely on as a regular beacon or checkpoint. 2072 Resilience vs Timeliness and Ordering: Ordering information and the 2073 timing of transitions cannot be communicated in three cases: i) 2074 during ACK loss; ii) if something on the path strips the AccECN 2075 Option; or iii) if the Data Receiver is unable to support Change- 2076 Triggered ACKs. Following ACK reordering, the Data Sender can 2077 reconstruct the order in which feedback was sent, but not until 2078 all the missing feedback has arrived. 2080 Complexity: An AccECN implementation solely involves simple counter 2081 increments, some modulo arithmetic to communicate the least 2082 significant bits and allow for wrap, and some heuristics for 2083 safety against fields cycling due to prolonged periods of ACK 2084 loss. Each host needs to maintain eight additional counters. The 2085 hosts have to apply some additional tests to detect tampering by 2086 middleboxes, but in general the protocol is simple to understand, 2087 simple to implement and requires few cycles per packet to execute. 2089 Integrity: AccECN is compatible with at least three approaches that 2090 can assure the integrity of ECN feedback. If the AccECN Option is 2091 stripped the resolution of the feedback is degraded, but the 2092 integrity of this degraded feedback can still be assured. 
Backward Compatibility: If only one endpoint supports the AccECN scheme, it will fall back to the most advanced ECN feedback scheme supported by the other end.

Backward Compatibility: If the AccECN Option is stripped by a middlebox, AccECN still provides basic congestion feedback in the ACE field. Further, AccECN can be used to detect mangling of the IP ECN field; mangling of the TCP ECN flags; blocking of ECT-marked segments; and blocking of segments carrying the AccECN Option. It can detect these conditions during TCP's 3WHS so that it can fall back to operation without ECN and/or operation without the AccECN Option.

Forward Compatibility: The behaviour of endpoints and middleboxes is carefully defined for all reserved or currently unused codepoints in the scheme. Then, the designers of security devices can understand which currently unused values might appear in future. So, even if they choose to treat such values as anomalous while they are not widely used, any blocking will at least be under policy control not hard-coded. Then, if previously unused values start to appear on the Internet (or in standards), such policies could be quickly reversed.

7. IANA Considerations

This document reassigns bit 7 of the TCP header flags to the AccECN protocol. This bit was previously called the Nonce Sum (NS) flag [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. The flag will now be defined as:

   +-----+-------------------+-----------+
   | Bit | Name              | Reference |
   +-----+-------------------+-----------+
   | 7   | AE (Accurate ECN) | RFC XXXX  |
   +-----+-------------------+-----------+

   TCP header flag reassignment

[TO BE REMOVED: IANA is requested to update the existing entry in the Transmission Control Protocol (TCP) Header Flags registration (https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1) for Bit 7 to "AE (Accurate ECN), previously used as NS (Nonce Sum) by [RFC3540], which is now Historic [RFC8311]" and change the reference to this RFC-to-be instead of RFC8311.]

This document also defines two new TCP options for AccECN, assigned values of TBD0 and TBD1 (decimal) from the TCP option space. These values are defined as:

   +------+--------+--------------------------------+-----------+
   | Kind | Length | Meaning                        | Reference |
   +------+--------+--------------------------------+-----------+
   | TBD0 | N      | Accurate ECN Order 0 (AccECN0) | RFC XXXX  |
   | TBD1 | N      | Accurate ECN Order 1 (AccECN1) | RFC XXXX  |
   +------+--------+--------------------------------+-----------+

   New TCP Option assignments

[TO BE REMOVED: This registration should take place at the following location: http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

Early implementations using experimental option 254 per [RFC6994] with the single magic number 0xACCE (16 bits), as allocated in the IANA "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" registry, SHOULD migrate to use these new option kinds (TBD0 & TBD1).
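As an illustration of what recognising such an early implementation might involve, the sketch below walks a raw TCP options block looking for a kind-254 option whose first two bytes after the length are the ExID 0xACCE, following the shared experimental option layout of [RFC6994]. The function name is hypothetical, and the bytes after the ExID are returned uninterpreted, because the layout of the AccECN fields is specified elsewhere in this document (Section 3.2.3), not here.

```python
EXP_KIND = 254          # experimental option kind named in the text
ACCECN_EXID = 0xACCE    # 16-bit ExID ("magic number") named in the text


def find_experimental_accecn(options: bytes):
    """Return the bytes following the ExID of the first kind-254 option
    carrying ExID 0xACCE, or None if there is none (hypothetical helper)."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                       # End of Option List
            break
        if kind == 1:                       # No-Operation, single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                           # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                           # malformed length, stop parsing
        if (kind == EXP_KIND and length >= 4
                and int.from_bytes(options[i + 2:i + 4], "big") == ACCECN_EXID):
            return options[i + 4:i + length]
        i += length
    return None
```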
2162 [TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs 2163 registry should be changed to "AccECN (current and new 2164 implementations SHOULD use option kinds TBD0 and TBD1)" at the 2165 following location: https://www.iana.org/assignments/tcp-parameters/ 2166 tcp-parameters.xhtml#tcp-exids ] 2168 8. Security Considerations 2170 If ever the supplementary part of AccECN based on the new AccECN TCP 2171 Option is unusable (due for example to middlebox interference) the 2172 essential part of AccECN's congestion feedback offers only limited 2173 resilience to long runs of ACK loss (see Section 3.2.2.5). These 2174 problems are unlikely to be due to malicious intervention (because if 2175 an attacker could strip a TCP option or discard a long run of ACKs it 2176 could wreak other arbitrary havoc). However, it would be of concern 2177 if AccECN's resilience could be indirectly compromised during a 2178 flooding attack. AccECN is still considered safe though, because if 2179 the option is not present, the AccECN Data Sender is then required to 2180 switch to more conservative assumptions about wrap of congestion 2181 indication counters (see Section 3.2.2.5 and Appendix A.2). 2183 Section 5.1 describes how a TCP server can negotiate AccECN and use 2184 the SYN cookie method for mitigating SYN flooding attacks. 2186 There is concern that ECN feedback could be altered or suppressed, 2187 particularly because a misbehaving Data Receiver could increase its 2188 own throughput at the expense of others. AccECN is compatible with 2189 the three schemes known to assure the integrity of ECN feedback (see 2190 Section 5.3 for details). If the AccECN Option is stripped by an 2191 incorrectly implemented middlebox, the resolution of the feedback 2192 will be degraded, but the integrity of this degraded information can 2193 still be assured. 
Assuring that Data Senders respond appropriately 2194 to ECN feedback is possible, but the scope of the present document is 2195 confined to the feedback protocol, and excludes the response to this 2196 feedback. 2198 In Section 3.2.3 a Data Sender is allowed to ignore an unrecognized 2199 TCP AccECN Option length and read as many whole 3-octet fields from 2200 it as possible up to a maximum of 3, treating the remainder as 2201 padding. This opens up a potential covert channel of up to 29B (40 - 2202 (2+3*3))B. However, it is really an overt channel (not hidden) and 2203 it is no different to the use of unknown TCP options with unknown 2204 option lengths in general. Therefore, where this is of concern, it 2205 can already be adequately mitigated by regular TCP normalizer 2206 technology (see Section 3.3.2). 2208 The AccECN protocol is not believed to introduce any new privacy 2209 concerns, because it merely counts and feeds back signals at the 2210 transport layer that had already been visible at the IP layer. A 2211 covert channel can be used to compromise privacy. However, as 2212 explained above, undefined TCP options in general open up such 2213 channels and common techniques are available to close them off. 2215 There is a potential concern that a Data Receiver could deliberately 2216 omit the AccECN Option pretending that it had been stripped by a 2217 middlebox. No known way can yet be contrived for a receiver to take 2218 advantage of this behaviour, which seems to always degrade its own 2219 performance. However, the concern is mentioned here for 2220 completeness. 2222 9. Acknowledgements 2224 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2225 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2226 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans, Ilpo 2227 Jaervinen, Neal Cardwell, Yoshifumi Nishida, Martin Duke and Jonathan 2228 Morton for their input and discussion. 
The idea of using the three 2229 ECN-related TCP flags as one field for more accurate TCP-ECN feedback 2230 was first introduced in the re-ECN protocol that was the ancestor of 2231 ConEx. 2233 Bob Briscoe was part-funded by the Comcast Innovation Fund, the 2234 European Community under its Seventh Framework Programme through the 2235 Reducing Internet Transport Latency (RITE) project (ICT-317700) and 2236 through the Trilogy 2 project (ICT-317756), and the Research Council 2237 of Norway through the TimeIn project. The views expressed here are 2238 solely those of the authors. 2240 Mirja Kuehlewind was partly supported by the European Commission 2241 under Horizon 2020 grant agreement no. 688421 Measurement and 2242 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 2243 State Secretariat for Education, Research, and Innovation under 2244 contract no. 15.0268. This support does not imply endorsement. 2246 10. Comments Solicited 2248 Comments and questions are encouraged and very welcome. They can be 2249 addressed to the IETF TCP maintenance and minor modifications working 2250 group mailing list , and/or to the authors. 2252 11. References 2254 11.1. Normative References 2256 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2257 RFC 793, DOI 10.17487/RFC0793, September 1981, 2258 . 2260 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2261 Requirement Levels", BCP 14, RFC 2119, 2262 DOI 10.17487/RFC2119, March 1997, 2263 . 2265 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 2266 of Explicit Congestion Notification (ECN) to IP", 2267 RFC 3168, DOI 10.17487/RFC3168, September 2001, 2268 . 2270 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 2271 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 2272 . 2274 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2275 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2276 May 2017, . 2278 11.2. 
Informative References 2280 [I-D.ietf-tcpm-generalized-ecn] 2281 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 2282 Congestion Notification (ECN) to TCP Control Packets", 2283 draft-ietf-tcpm-generalized-ecn-09 (work in progress), 2284 January 2022. 2286 [I-D.ietf-tsvwg-l4s-arch] 2287 Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, 2288 "Low Latency, Low Loss, Scalable Throughput (L4S) Internet 2289 Service: Architecture", draft-ietf-tsvwg-l4s-arch-17 (work 2290 in progress), March 2022. 2292 [Mandalari18] 2293 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 2294 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 2295 over Mobile", IEEE Communications Magazine , March 2018. 2297 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 2298 Selective Acknowledgment Options", RFC 2018, 2299 DOI 10.17487/RFC2018, October 1996, 2300 . 2302 [RFC3449] Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. 2303 Sooriyabandara, "TCP Performance Implications of Network 2304 Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, 2305 December 2002, . 2307 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 2308 Congestion Notification (ECN) Signaling with Nonces", 2309 RFC 3540, DOI 10.17487/RFC3540, June 2003, 2310 . 2312 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 2313 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 2314 . 2316 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 2317 Ramakrishnan, "Adding Explicit Congestion Notification 2318 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 2319 DOI 10.17487/RFC5562, June 2009, 2320 . 2322 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding 2323 Acknowledgement Congestion Control to TCP", RFC 5690, 2324 DOI 10.17487/RFC5690, February 2010, 2325 . 2327 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2328 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2329 June 2010, . 
2331 [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 2332 Robustness to Blind In-Window Attacks", RFC 5961, 2333 DOI 10.17487/RFC5961, August 2010, 2334 . 2336 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 2337 "TCP Extensions for Multipath Operation with Multiple 2338 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 2339 . 2341 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 2342 RFC 6994, DOI 10.17487/RFC6994, August 2013, 2343 . 2345 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 2346 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 2347 . 2349 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 2350 "Problem Statement and Requirements for Increased Accuracy 2351 in Explicit Congestion Notification (ECN) Feedback", 2352 RFC 7560, DOI 10.17487/RFC7560, August 2015, 2353 . 2355 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 2356 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 2357 DOI 10.17487/RFC7713, December 2015, 2358 . 2360 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 2361 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 2362 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 2363 October 2017, . 2365 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 2366 Notification (ECN) Experimentation", RFC 8311, 2367 DOI 10.17487/RFC8311, January 2018, 2368 . 2370 [RFC8511] Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 2371 "TCP Alternative Backoff with ECN (ABE)", RFC 8511, 2372 DOI 10.17487/RFC8511, December 2018, 2373 . 2375 [RFC9040] Touch, J., Welzl, M., and S. Islam, "TCP Control Block 2376 Interdependence", RFC 9040, DOI 10.17487/RFC9040, July 2377 2021, . 2379 Appendix A. Example Algorithms 2381 This appendix is informative, not normative. It gives example 2382 algorithms that would satisfy the normative requirements of the 2383 AccECN protocol. 
However, implementers are free to choose other ways to implement the requirements.

A.1. Example Algorithm to Encode/Decode the AccECN Option

The example algorithms below show how a Data Receiver in AccECN mode could encode its CE byte counter r.ceb into the ECEB field within the AccECN TCP Option, and how a Data Sender in AccECN mode could decode the ECEB field into its byte counter s.ceb. The other counters for bytes marked ECT(0) and ECT(1) in the AccECN Option would be similarly encoded and decoded.

It is assumed that each local byte counter is an unsigned integer greater than 24b (probably 32b), and that the following constant has been assigned:

   DIVOPT = 2^24

Every time a CE marked data segment arrives, the Data Receiver increments its local value of r.ceb by the size of the TCP Data. Whenever it sends an ACK with the AccECN Option, the value it writes into the ECEB field is

   ECEB = r.ceb % DIVOPT

where '%' is the remainder operator.

On the arrival of an AccECN Option, the Data Sender first makes sure the ACK has not been superseded in order to avoid winding the s.ceb counter backwards. It uses the TCP acknowledgement number and any SACK options to calculate newlyAckedB, the amount of new data that the ACK acknowledges in bytes (newlyAckedB can be zero but not negative). If newlyAckedB is zero, either the ACK has been superseded or CE-marked packet(s) without data could have arrived. To break the tie for the latter case, the Data Sender could use timestamps (if present) to work out newlyAckedT, the amount of new time that the ACK acknowledges. If the Data Sender determines that the ACK has been superseded it ignores the AccECN Option. Otherwise, the Data Sender calculates the minimum non-negative difference d.ceb between the ECEB field and its local s.ceb counter, using modulo arithmetic as follows:

   if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
       d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
       s.ceb += d.ceb
   }

For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), then

   s.ceb % DIVOPT = 1
            d.ceb = (1461 + 2^24 - 1) % 2^24
                  = 1460
            s.ceb = 33,554,433 + 1460
                  = 33,555,893

In practice an implementation might use heuristics to guess the feedback in missing ACKs, then when it subsequently receives feedback it might find that it needs to correct its earlier heuristics as part of the decoding process. The above decoding process does not include any such heuristics.

A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

The example algorithms below show how a Data Receiver in AccECN mode could encode its CE packet counter r.cep into the ACE field, and how the Data Sender in AccECN mode could decode the ACE field into its s.cep counter. The Data Sender's algorithm includes code to heuristically detect a long enough unbroken string of ACK losses that could have concealed a cycle of the congestion counter in the ACE field of the next ACK to arrive.

Two variants of the algorithm are given: i) a more conservative variant for a Data Sender to use if it detects that the AccECN Option is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a less conservative variant that is feasible when complementary information is available from the AccECN Option.

A.2.1. Safety Algorithm without the AccECN Option

It is assumed that each local packet counter is a sufficiently sized unsigned integer (probably 32b) and that the following constant has been assigned:

   DIVACE = 2^3

Every time an Acceptable CE marked packet arrives (Section 3.2.2.2), the Data Receiver increments its local value of r.cep by 1. It repeats the same value of ACE in every subsequent ACK until the next CE marking arrives, where

   ACE = r.cep % DIVACE.

If the Data Sender received an earlier value of the counter that had been delayed due to ACK reordering, it might incorrectly calculate that the ACE field had wrapped. Therefore, on the arrival of every ACK, the Data Sender ensures the ACK has not been superseded using the TCP acknowledgement number, any SACK options and timestamps (if available) to calculate newlyAckedB, as in Appendix A.1. If the ACK has not been superseded, the Data Sender calculates the minimum difference d.cep between the ACE field and its local s.cep counter, using modulo arithmetic as follows:

   if ((newlyAckedB > 0) || (newlyAckedT > 0))
       d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

Section 3.2.2.5 expects the Data Sender to assume that the ACE field cycled if it is the safest likely case under prevailing conditions. The 3-bit ACE field in an arriving ACK could have cycled and become ambiguous to the Data Sender if a sequence of ACKs goes missing that covers a stream of data long enough to contain 8 or more CE marks. We use the word 'missing' rather than 'lost', because some or all the missing ACKs might arrive eventually, but out of order. Even if some of the missing ACKs were piggy-backed on data (i.e. not pure ACKs) retransmissions will not repair the lost AccECN information, because AccECN requires retransmissions to carry the latest AccECN counters, not the original ones.
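The modulo arithmetic used for both d.ceb (Appendix A.1) and d.cep (above) is the same operation with a different modulus. The following is a minimal sketch of it, assuming Python integers stand in for the unsigned counters. The function names are illustrative only; the variable names mirror those in the text.

```python
DIVOPT = 2 ** 24   # modulus of the 24-bit byte-counter fields in the AccECN Option
DIVACE = 2 ** 3    # modulus of the 3-bit ACE field


def min_delta(field: int, counter: int, modulus: int) -> int:
    """Minimum non-negative increase that takes the local counter to a
    value whose least significant bits equal the received field."""
    return (field + modulus - (counter % modulus)) % modulus


def encode_ace(r_cep: int) -> int:
    """Data Receiver side: ACE carries the low 3 bits of r.cep."""
    return r_cep % DIVACE


def decode_ace(ace: int, s_cep: int, newly_acked_b: int, newly_acked_t: int = 0) -> int:
    """Data Sender side: return d.cep, or 0 if the ACK looks superseded.
    s.cep is not updated here, because the safety check comes first."""
    if newly_acked_b > 0 or newly_acked_t > 0:
        return min_delta(ace, s_cep, DIVACE)
    return 0


# Worked example from Appendix A.1: s.ceb = 33,554,433 and ECEB = 1461.
d_ceb = min_delta(1461, 33_554_433, DIVOPT)   # 1460
s_ceb = 33_554_433 + d_ceb                    # 33,555,893
```

For instance, if s.cep is 13 and an ACK of new data arrives with ACE = 1, decode_ace returns 4, which matches a receiver counter that moved from 13 to 17.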

2500     The phrase `under prevailing conditions' allows for implementation-
2501     dependent interpretation. A Data Sender might take account of the
2502     prevailing size of data segments and the prevailing CE marking rate
2503     just before the sequence of missing ACKs. However, we shall start
2504     with the simplest algorithm, which assumes segments are all full-
2505     sized and ultra-conservatively it assumes that ECN marking was 100%
2506     on the forward path when ACKs on the reverse path started to all be
2507     dropped. Specifically, if newlyAckedB is the amount of data that an
2508     ACK acknowledges since the previous ACK, then the Data Sender could
2509     assume that this acknowledges newlyAckedPkt full-sized segments,
2510     where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the
2511     ACE field incremented by

2513        dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE),

2515     For example, imagine an ACK acknowledges newlyAckedPkt=9 more full-
2516     size segments than any previous ACK, and that ACE increments by a
2517     minimum of 2 CE marks (d.cep=2). The above formula works out that it
2518     would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) =
2519     2). However, if ACE increases by a minimum of 2 but acknowledges 10
2520     full-sized segments, then it would be necessary to assume that there
2521     could have been 10 CE marks (because 10 - ((10-2) % 8) = 10).

2523     Note that checks would need to be added to the above pseudocode for
2524     (d.cep > newlyAckedPkt), which could occur if newlyAckedPkt had been
2525     wrongly estimated using an inappropriate packet size.

2527     ACKs that acknowledge a large stretch of packets might be common in
2528     data centres to achieve a high packet rate or might be due to ACK
2529     thinning by a middlebox. In these cases, cycling of the ACE field
2530     would often appear to have been possible, so the above algorithm
2531     would be over-conservative, leading to a false high marking rate and
2532     poor performance. Therefore it would be reasonable to only use
2533     dSafer.cep rather than d.cep if the moving average of newlyAckedPkt
2534     was well below 8.

2536     Implementers could build in more heuristics to estimate prevailing
2537     average segment size and prevailing ECN marking. For instance,
2538     newlyAckedPkt in the above formula could be replaced with
2539     newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
2540     segment size and p is the prevailing ECN marking probability.
2541     However, ultimately, if TCP's ECN feedback becomes inaccurate it
2542     still has loss detection to fall back on. Therefore, it would seem
2543     safe to implement a simple algorithm, rather than a perfect one.

2545     The simple algorithm for dSafer.cep above requires no monitoring of
2546     prevailing conditions and it would still be safe if, for example,
2547     segments were on average at least 5% of full-sized as long as ECN
2548     marking was 5% or less. Assuming it was used, the Data Sender would
2549     increment its packet counter as follows:

2551        s.cep += dSafer.cep

2553     If missing acknowledgement numbers arrive later (due to reordering),
2554     Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the
2555     effect of any action it took based on a conservative assumption that
2556     it later found to be incorrect". To do this, the Data Sender would
2557     have to store the values of all the relevant variables whenever it
2558     made assumptions, so that it could re-evaluate them later. Given
2559     this could become complex and it is not required, we do not attempt
2560     to provide an example of how to do this.

2562  A.2.2. Safety Algorithm with the AccECN Option

2564     When the AccECN Option is available on the ACKs before and after the
2565     possible sequence of ACK losses, if the Data Sender only needs CE-
2566     marked bytes, it will have sufficient information in the AccECN
2567     Option without needing to process the ACE field. If for some reason
2568     it needs CE-marked packets, if dSafer.cep is different from d.cep, it
2569     can determine whether d.cep is likely to be a safe enough estimate by
2570     checking whether the average marked segment size (s = d.ceb/d.cep) is
2571     less than the MSS (where d.ceb is the amount of newly CE-marked bytes
2572     - see Appendix A.1). Specifically, it could use the following
2573     algorithm:

2575        SAFETY_FACTOR = 2
2576        if (dSafer.cep > d.cep) {
2577           if (d.ceb <= MSS * d.cep) { % Same as (s <= MSS), but no DBZ
2578              sSafer = d.ceb/dSafer.cep
2579              if (sSafer < MSS/SAFETY_FACTOR)
2580                 dSafer.cep = d.cep    % d.cep is a safe enough estimate
2581           } % else
2582              % No need for else; dSafer.cep is already correct,
2583              % because d.cep must have been too small
2584        }

2586     The chart below shows when the above algorithm will consider d.cep
2587     can replace dSafer.cep as a safe enough estimate of the number of CE-
2588     marked packets:

2590                        ^
2591                  sSafer|
2592                        |
2593                     MSS+
2594                        |
2595                        |         dSafer.cep
2596                        |                  is
2597       MSS/SAFETY_FACTOR+--------------+ safest
2598                        |              |
2599                        | d.cep is safe|
2600                        |    enough    |
2601                        +-------------------->
2602                                      MSS   s

2604     The following examples give the reasoning behind the algorithm,
2605     assuming MSS=1460 [B]:

2607     o  if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and
2608        sSafer=182.5.
2609        Therefore even though the average size of 8 data segments is
2610        unlikely to have been as small as MSS/8, d.cep cannot have been
2611        correct, because it would imply an average segment size greater
2612        than the MSS.

2614     o  if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and
2615        sSafer=146.
2616        Therefore d.cep is safe enough, because the average size of 10
2617        data segments is unlikely to have been as small as MSS/10.

2619     o  if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and
2620        sSafer=680.
2621        Therefore d.cep is safe enough, because the average data segment
2622        size is more likely to have been just less than one MSS, rather
2623        than below MSS/2.

2625     If pure ACKs were allowed to be ECN-capable, missing ACKs would be
2626     far less likely. However, because [RFC3168] currently precludes
2627     this, the above algorithm assumes that pure ACKs are not ECN-capable.

2629  A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

2631     If the AccECN Option is not available, the Data Sender can only
2632     decode CE-marking from the ACE field in packets. Every time an ACK
2633     arrives, to convert this into an estimate of CE-marked bytes, it
2634     needs an average of the segment size, s_ave. Then it can add or
2635     subtract s_ave from the value of d.ceb as the value of d.cep
2636     increments or decrements. Some possible ways to calculate s_ave are
2637     outlined below. The precise details will depend on why an estimate
2638     of marked bytes is needed.

2640     The implementation could keep a record of the byte numbers of all the
2641     boundaries between packets in flight (including control packets), and
2642     recalculate s_ave on every ACK. However it would be simpler to
2643     merely maintain a counter packets_in_flight for the number of packets
2644     in flight (including control packets), which is reset once per RTT.
2645     Either way, it would estimate s_ave as:

2647        s_ave ~= flightsize / packets_in_flight,

2649     where flightsize is the variable that TCP already maintains for the
2650     number of bytes in flight. To avoid floating point arithmetic, it
2651     could right-bit-shift by lg(packets_in_flight), where lg() means log
2652     base 2.

2654     An alternative would be to maintain an exponentially weighted moving
2655     average (EWMA) of the segment size:

2657        s_ave = a * s + (1-a) * s_ave,

2659     where a is the decay constant for the EWMA. However, then it is
2660     necessary to choose a good value for this constant, which ought to
2661     depend on the number of packets in flight. Also the decay constant
2662     needs to be a power of two to avoid floating point arithmetic.

2664  A.4. Example Algorithm to Count Not-ECT Bytes

2666     A Data Sender in AccECN mode can infer the amount of TCP payload data
2667     arriving at the receiver marked Not-ECT from the difference between
2668     the amount of newly ACKed data and the sum of the bytes with the
2669     other three markings, d.ceb, d.e0b and d.e1b.

2671     For this approach to be precise, it has to be assumed that spurious
2672     (unnecessary) retransmissions do not lead to double counting. This
2673     assumption is currently correct, given that RFC 3168 requires that
2674     the Data Sender marks retransmitted segments as Not-ECT. However,
2675     the converse is not true; necessary retransmissions will result in
2676     under-counting.

2678     However, such precision is unlikely to be necessary. The only known
2679     use of a count of Not-ECT marked bytes is to test whether equipment
2680     on the path is clearing the ECN field (perhaps due to an out-dated
2681     attempt to clear, or bleach, what used to be the ToS field). To
2682     detect bleaching it will be sufficient to detect whether nearly all
2683     bytes arrive marked as Not-ECT. Therefore there ought to be no need
2684     to keep track of the details of retransmissions.

2686  Appendix B. Rationale for Usage of TCP Header Flags

2688  B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake

2690     AccECN uses a rather unorthodox approach to negotiate the highest
2691     version TCP ECN feedback scheme that both ends support, as justified
2692     below. It follows from the original TCP ECN capability negotiation
2693     [RFC3168], in which the client set the 2 least significant of the
2694     original reserved flags in the TCP header, and fell back to no ECN
2695     support if the server responded with the 2 flags cleared, which had
2696     previously been the default.

2698     ECN originally used header flags rather than a TCP option because it
2699     was considered more efficient to use a header flag for 1 bit of
2700     feedback per ACK, and this bit could be overloaded to indicate
2701     support for ECN during the handshake. During the development of ECN,
2702     1 bit crept up to 2, in order to deliver the feedback reliably and to
2703     work round some broken hosts that reflected the reserved flags during
2704     the handshake.

2706     In order to be backward compatible with RFC 3168, AccECN continues
2707     this approach, using the 3rd least significant TCP header flag that
2708     had previously been allocated for the ECN nonce (now historic).

2710     Then, whatever form of server an AccECN client encounters, the
2711     connection can fall back to the highest version of feedback protocol
2712     that both ends support, as explained in Section 3.1.

2714     If AccECN had used the more orthodox approach of a TCP option, it
2715     would still have had to set the two ECN flags in the main TCP header,
2716     in order to be able to fall back to Classic RFC 3168 ECN, or to
2717     disable ECN support, without another round of negotiation. Then
2718     AccECN would also have had to handle all the different ways that
2719     servers currently respond to settings of the ECN flags in the main
2720     TCP header, including all the conflicting cases where a server might
2721     have said it supported one approach in the flags and another approach
2722     in the new TCP option. And AccECN would have had to deal with all
2723     the additional possibilities where a middlebox might have mangled the
2724     ECN flags, or removed the TCP option. Thus, usage of the 3rd
2725     reserved TCP header flag simplified the protocol.

2727     The third flag was used in a way that could be distinguished from the
2728     ECN nonce, in case any nonce deployment was encountered. Previous
2729     usage of this flag for the ECN nonce was integrated into the original
2730     ECN negotiation. This further justified the 3rd flag's use for
2731     AccECN, because a non-ECN usage of this flag would have had to use it
2732     as a separate single bit, rather than in combination with the other 2
2733     ECN flags.

2735     Indeed, having overloaded the original uses of these three flags for
2736     its handshake, AccECN overloads all three bits again as a 3-bit
2737     counter.

2739  B.2. Four Codepoints in the SYN/ACK

2741     Of the 8 possible codepoints that the 3 TCP header flags can indicate
2742     on the SYN/ACK, 4 already indicated earlier (or broken) versions of
2743     ECN support. In the early design of AccECN, an AccECN server could
2744     use only 2 of the 4 remaining codepoints. They both indicated AccECN
2745     support, but one fed back that the SYN had arrived marked as CE.
2746     Even though ECN support on a SYN is not yet on the standards track,
2747     the idea is for either end to act as a dumb reflector, so that future
2748     capabilities can be unilaterally deployed without requiring 2-ended
2749     deployment (justified in Section 2.5).

2751     During traversal testing it was discovered that the ECN field in the
2752     SYN was mangled on a non-negligible proportion of paths. Therefore
2753     it was necessary to allow the SYN/ACK to feed all four IP/ECN
2754     codepoints that the SYN could arrive with back to the client.
2755     Without this, the client could not know whether to disable ECN for
2756     the connection due to mangling of the IP/ECN field (also explained in
2757     Section 2.5). This development consumed the remaining 2 codepoints
2758     on the SYN/ACK that had been reserved for future use by AccECN in
2759     earlier versions.

2761  B.3. Space for Future Evolution

2763     Despite availability of usable TCP header space being extremely
2764     scarce, the AccECN protocol has taken all possible steps to ensure
2765     that there is space to negotiate possible future variants of the
2766     protocol, either if a variant of AccECN is required, or if a
2767     completely different ECN feedback approach is needed:

2769     Future AccECN variants: When the AccECN capability is negotiated
2770        during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
2771        'Broken' in the column for the capability of node B are unused by
2772        any current protocol in the RFC series. These could be used by
2773        TCP servers in future to indicate a variant of the AccECN
2774        protocol. In recent measurement studies in which the response of
2775        large numbers of servers to an AccECN SYN has been tested,
2776        e.g. [Mandalari18], a very small number of SYN/ACKs arrive with
2777        the pattern tagged as 'Nonce', and a small but more significant
2778        number arrive with the pattern tagged as 'Broken'. The 'Nonce'
2779        pattern could be a sign that a few servers have implemented the
2780        ECN Nonce [RFC3540], which has now been reclassified as historic
2781        [RFC8311], or it could be the random result of some unknown
2782        middlebox behaviour. The greater prevalence of the 'Broken'
2783        pattern suggests that some instances still exist of the broken
2784        code that reflects the reserved flags on the SYN.

2786        The requirement not to reject unexpected initial values of the ACE
2787        counter (in the main TCP header) in the last para of
2788        Section 3.2.2.4 ensures that 3 unused codepoints on the ACK of the
2789        SYN/ACK, 6 unused values on the first SYN=0 data packet from the
2790        client and 7 unused values on the first SYN=0 data packet from the
2791        server could be used to declare future variants of the AccECN
2792        protocol. The word 'declare' is used rather than 'negotiate'
2793        because, at this late stage in the 3WHS, it would be too late for
2794        a negotiation between the endpoints to be completed. A similar
2795        requirement not to reject unexpected initial values in the TCP
2796        option (Section 3.2.3.2.4) is for the same purpose. If traversal
2797        of the TCP option were reliable, this would have enabled a far
2798        wider range of future variation of the whole AccECN protocol.
2799        Nonetheless, it could be used to reliably negotiate a wide range
2800        of variation in the semantics of the AccECN Option.

2802     Future non-AccECN variants: Five codepoints out of the 8 possible in
2803        the 3 TCP header flags used by AccECN are unused on the initial
2804        SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
2805        Section 3.1.3 ensures that the installed base of AccECN servers
2806        will all assume these are equivalent to AccECN negotiation with
2807        111 on the SYN. These codepoints would not allow fall-back to
2808        Classic ECN support for a server that did not understand them, but
2809        this approach ensures they are available in future, perhaps for
2810        uses other than ECN alongside the AccECN scheme. All possible
2811        combinations of SYN/ACK could be used in response except either
2812        000 or reflection of the same values sent on the SYN.

2814        Of course, other ways could be resorted to in order to extend
2815        AccECN or ECN in future, although their traversal properties are
2816        likely to be inferior. They include a new TCP option; using the
2817        remaining reserved flags in the main TCP header (preferably
2818        extending the 3-bit combinations used by AccECN to 4-bit
2819        combinations, rather than burning one bit for just one state); a
2820        non-zero urgent pointer in combination with the URG flag cleared;
2821        or some other unexpected combination of fields yet to be invented.

2823  Authors' Addresses

2825     Bob Briscoe
2826     Independent
2827     UK

2829     EMail: ietf@bobbriscoe.net
2830     URI: http://bobbriscoe.net/

2832     Mirja Kuehlewind
2833     Ericsson
2834     Germany

2836     EMail: ietf@kuehlewind.net

2838     Richard Scheffenegger
2839     NetApp
2840     Vienna
2841     Austria

2843     EMail: Richard.Scheffenegger@netapp.com