TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                               Independent
Updates: 3168, 3449 (if approved)                          M. Kuehlewind
Intended status: Standards Track                                Ericsson
Expires: January 13, 2022                               R. Scheffenegger
                                                                  NetApp
                                                           July 12, 2021

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-15

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN was originally specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT). Recent
   new TCP mechanisms like Congestion Exposure (ConEx), Data Center TCP
   (DCTCP) or Low Latency Low Loss Scalable Throughput (L4S) need more
   accurate ECN feedback information whenever more than one marking is
   received in one RTT.
   This document specifies a scheme to provide more than one feedback
   signal per RTT in the TCP header. Given TCP header space is scarce,
   it allocates a reserved header bit previously assigned to the
   ECN-Nonce. It also overloads the two existing ECN flags in the TCP
   header. The resulting extra space is exploited to feed back the
   IP-ECN field received during the 3-way handshake as well.
   Supplementary feedback information can optionally be provided in a
   new TCP option, which is never used on the TCP SYN. The document
   also specifies the treatment of this updated TCP wire protocol by
   middleboxes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Document Roadmap
     1.2. Goals
     1.3. Terminology
     1.4. Recap of Existing ECN feedback in IP/TCP
   2. AccECN Protocol Overview and Rationale
     2.1. Capability Negotiation
     2.2. Feedback Mechanism
     2.3. Delayed ACKs and Resilience Against ACK Loss
     2.4. Feedback Metrics
     2.5. Generic (Dumb) Reflector
   3. AccECN Protocol Specification
     3.1. Negotiating to use AccECN
       3.1.1. Negotiation during the TCP handshake
       3.1.2. Backward Compatibility
       3.1.3. Forward Compatibility
       3.1.4. Retransmission of the SYN
       3.1.5. Implications of AccECN Mode
     3.2. AccECN Feedback
       3.2.1. Initialization of Feedback Counters
       3.2.2. The ACE Field
       3.2.3. The AccECN Option
     3.3. AccECN Compliance Requirements for TCP Proxies, Offload
          Engines and other Middleboxes
       3.3.1. Requirements for TCP Proxies
       3.3.2. Requirements for Transparent Middleboxes and TCP
              Normalizers
       3.3.3. Requirements for TCP ACK Filtering
       3.3.4. Requirements for TCP Segmentation Offload
   4. Updates to RFC 3168
   5. Interaction with TCP Variants
     5.1. Compatibility with SYN Cookies
     5.2. Compatibility with TCP Experiments and Common TCP Options
     5.3. Compatibility with Feedback Integrity Mechanisms
   6. Protocol Properties
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. Comments Solicited
   11. References
     11.1. Normative References
     11.2. Informative References
   Appendix A. Example Algorithms
     A.1. Example Algorithm to Encode/Decode the AccECN Option
     A.2. Example Algorithm for Safety Against Long Sequences of
          ACK Loss
       A.2.1. Safety Algorithm without the AccECN Option
       A.2.2. Safety Algorithm with the AccECN Option
     A.3. Example Algorithm to Estimate Marked Bytes from Marked
          Packets
     A.4. Example Algorithm to Count Not-ECT Bytes
   Appendix B. Rationale for Usage of TCP Header Flags
     B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake
     B.2. Four Codepoints in the SYN/ACK
     B.3. Space for Future Evolution
   Authors' Addresses

1. Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. In RFC 3168, ECN was specified for TCP in such a way that
   only one feedback signal could be transmitted per Round-Trip Time
   (RTT). Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
   know when more than one marking is received in one RTT, which is
   information that cannot be provided by the feedback scheme as
   specified in [RFC3168]. This document specifies an update to the ECN
   feedback scheme of RFC 3168 that provides more accurate information
   and could be used by these and potentially other future TCP
   extensions. A fuller treatment of the motivation for this
   specification is given in the associated requirements document
   [RFC7560].

   This document specifies a standards track scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. This document updates RFC 3168 with respect to negotiation
   and use of the feedback scheme for TCP. All aspects of RFC 3168
   other than the TCP feedback scheme, in particular the definition of
   ECN at the IP layer, remain unchanged by this specification.
   Section 4 gives a more detailed specification of exactly which
   aspects of RFC 3168 this document updates.

   AccECN is intended to be a complete replacement for classic TCP/ECN
   feedback, not a fork in the design of TCP.
   AccECN feedback complements TCP's loss feedback and it can coexist
   alongside 'classic' [RFC3168] TCP/ECN feedback. So its applicability
   is intended to include all public and private IP networks (and even
   any non-IP networks over which TCP is used today), whether or not any
   nodes on the path support ECN, of whatever flavour. This document
   uses the term Classic ECN when it needs to distinguish the RFC 3168
   ECN TCP feedback scheme from the AccECN TCP feedback scheme.

   AccECN feedback overloads the two existing ECN flags in the TCP
   header and allocates the currently reserved flag (previously called
   NS) in the TCP header, to be used as one three-bit counter field
   indicating the number of congestion experienced marked packets.
   Given the new definitions of these three bits, both ends have to
   support the new wire protocol before it can be used. Therefore
   during the TCP handshake the two ends use these three bits in the TCP
   header to negotiate the most advanced feedback protocol that they can
   both support, in a way that is backward compatible with [RFC3168].

   AccECN is solely a change to the TCP wire protocol; it covers the
   negotiation and signaling of more accurate ECN feedback from a TCP
   Data Receiver to a Data Sender. It is completely independent of how
   TCP might respond to congestion feedback, which is out of scope, but
   ultimately the motivation for accurate ECN feedback. Like Classic
   ECN feedback, AccECN can be used by standard Reno congestion control
   [RFC5681] to respond to the existence of at least one congestion
   notification within a round trip. Or, unlike Reno, AccECN can be
   used to respond to the extent of congestion notification over a round
   trip, as for example DCTCP does in controlled environments [RFC8257].
   For congestion response, this specification refers to RFC 3168, or
   ECN experiments such as those referred to in [RFC8311], namely: a
   TCP-based Low Latency Low Loss Scalable (L4S) congestion control
   [I-D.ietf-tsvwg-l4s-arch]; or Alternative Backoff with ECN (ABE)
   [RFC8511].

   It is recommended that the AccECN protocol is implemented alongside
   SACK [RFC2018] and the experimental ECN++ protocol
   [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to
   be used on TCP control packets. Therefore, this specification does
   not discuss implementing AccECN alongside [RFC5562], which was an
   earlier experimental protocol with narrower scope than ECN++.

1.1. Document Roadmap

   The following introductory section outlines the goals of AccECN
   (Section 1.2). Then terminology is defined (Section 1.3) and a recap
   of existing prerequisite technology is given (Section 1.4).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification, and Section 4
   clarifies which aspects of RFC 3168 are updated by this
   specification. Section 5 assesses the interaction of AccECN with
   commonly used variants of TCP, whether standardized or not.
   Section 6 summarizes the features and properties of AccECN.

   Section 7 summarizes the protocol fields and numbers that IANA will
   need to assign and Section 8 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses and Appendix B explains why AccECN uses flags in the main
   TCP header and quantifies the space left for future use.

1.2. Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognizes that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 6 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognizes that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3. Terminology

   AccECN: The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN: the ECN protocol specified in [RFC3168].

   Classic ECN feedback: the feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK: A TCP acknowledgement, with or without a data payload (ACK=1).

   Pure ACK: A TCP acknowledgement without a data payload.

   Acceptable packet / segment: A packet or segment that passes the
      acceptability tests in [RFC0793] and [RFC5961].

   TCP client: The TCP stack that originates a connection.

   TCP server: The TCP stack that responds to a connection request.

   Data Receiver: The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender: The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

1.4. Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is Not-
   ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
   the node can mark the packet by setting both ECN bits, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +------------------+----------------+---------------------------+
   | IP-ECN codepoint | Codepoint name | Description               |
   +------------------+----------------+---------------------------+
   | 0b00             | Not-ECT        | Not ECN-Capable Transport |
   | 0b01             | ECT(1)         | ECN-Capable Transport (1) |
   | 0b10             | ECT(0)         | ECN-Capable Transport (0) |
   | 0b11             | CE             | Congestion Experienced    |
   +------------------+----------------+---------------------------+

                Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK.
   On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost. The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag. This always leads to a full RTT
   of ACKs with ECE set. Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
   deployment RFC 3540 has been reclassified as historic [RFC8311] and
   the respective flag has been marked as "reserved", making this TCP
   flag available for use by the AccECN experiment instead.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2. AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP half-
   connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits for the Data
      Receiver to feed back the number of packets arriving with CE in
      the IP-ECN field.
      This provides more accuracy than classic ECN
      feedback, but limited resilience against ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints in the IP-ECN field (not just CE
      marks). This provides greater resilience against ACK loss than
      the essential feedback, but it is more likely to suffer from
      middlebox interference.

   The two-part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN. This
   design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1. Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it.
   The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2. Feedback Mechanism

   A Data Receiver maintains four counters initialized at the start of
   the half-connection. Three count the number of arriving payload
   bytes respectively marked CE, ECT(1) and ECT(0) in the IP-ECN field.
   The fourth counts the number of packets arriving marked with a CE
   codepoint (including control packets without payload if they are CE-
   marked).

   The Data Sender maintains four equivalent counters for the half-
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24
   LSBs of each byte counter are carried in the AccECN Option.

2.3. Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used.
   Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is not allowed to delay sending an ACK to such an extent
   that the ACE field would cycle. However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses. Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the
   3-bit ACE field is so small, when it is the only field available, the
   Data Sender has to interpret it assuming the most likely wrap, but
   with a degree of conservatism.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK.
   The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as options are reaching the
   sender and as long as there is no ACK loss). Implementations are
   encouraged to send an AccECN Option more frequently, but this is left
   up to the implementer.

2.4. Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as L4S, DCTCP or ConEx. If the option is
   stripped, a simple algorithm to estimate the number of marked bytes
   from the ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5. Generic (Dumb) Reflector

   The ACE field provides feedback about CE markings in the IP-ECN field
   of both data and control packets. According to [RFC3168] the Data
   Sender is meant to set the IP-ECN field of control packets to Not-
   ECT. However, mechanisms in certain private networks (e.g. data
   centres) set control packets to be ECN capable because they are
   precisely the packets that performance depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of
   the present document and para 2 of Section 20.2 of [RFC3168]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on its IP-ECN field. Although RFC 3168 prohibits
   an ECN-capable SYN, providing feedback of ECN marking on the SYN
   supports future scenarios in which SYNs might be ECN-enabled (without
   prejudging whether they ought to be). For instance, [RFC8311]
   updates this aspect of RFC 3168 to allow experimentation with ECN-
   capable TCP control packets.

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   Not-ECT in compliance with RFC 3168, feedback on the state of the IP-
   ECN field when it arrives at the receiver could still be useful,
   because middleboxes have been known to overwrite the IP-ECN field as
   if it is still part of the old Type of Service (ToS) field
   [Mandalari18]. If a TCP client has set the SYN to Not-ECT, but
   receives feedback that the IP-ECN field on the SYN arrived with a
   different codepoint, it can detect such middlebox interference and
   send Not-ECT for the rest of the connection. Previously, if a TCP
   server received ECT or CE on a SYN, it could not know whether it was
   invalid (or valid) because only the TCP client knew whether it
   originally marked the SYN as Not-ECT (or ECT). Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection. Instead, the AccECN protocol allows the server
   to feed back the received ECN field to the client, which then has all
   the information to decide whether the connection has to fall back
   from supporting ECN (or not).

3. AccECN Protocol Specification

3.1. Negotiating to use AccECN

3.1.1. Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] has been reclassified as historic
   [RFC8311], the present specification re-allocates the TCP flag at bit
   7 of the TCP header, which was previously called NS (Nonce Sum), as
   the AE (Accurate ECN) flag (see IANA Considerations in Section 7) as
   shown below.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | A | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | E | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 2: The (post-AccECN) definition of the TCP header flags during
                           the TCP handshake

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN. The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN. This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).
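
   The handshake encoding described above can be illustrated with a
   minimal, non-normative sketch. It assumes the four (AE, CWR, ECE)
   combinations listed in the top block of Table 2. The names IpEcn,
   server_synack_flags and client_decode_synack are hypothetical and
   chosen only for this illustration; they are not defined by this
   specification.

```python
from enum import IntEnum
from typing import Optional, Tuple


class IpEcn(IntEnum):
    """IP-ECN codepoints as listed in Table 1."""
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


# Hypothetical representation of the three TCP flags: (AE, CWR, ECE).
Flags = Tuple[int, int, int]

# A SYN requests AccECN with AE=1, CWR=1 and ECE=1.
ACCECN_SYN_FLAGS: Flags = (1, 1, 1)

# Top block of Table 2: SYN/ACK flags that feed back the IP-ECN field
# that arrived on the SYN.
SYNACK_FEEDBACK = {
    IpEcn.NOT_ECT: (0, 1, 0),
    IpEcn.ECT1: (0, 1, 1),
    IpEcn.ECT0: (1, 0, 0),
    IpEcn.CE: (1, 1, 0),
}

# Reverse lookup used by the client.
SYNACK_DECODE = {flags: ecn for ecn, flags in SYNACK_FEEDBACK.items()}


def server_synack_flags(syn_flags: Flags, syn_ip_ecn: IpEcn) -> Optional[Flags]:
    """Return the SYN/ACK flags for an AccECN-enabled server.

    Returns None when the SYN did not request AccECN, because the four
    AccECN combinations must not be used in that case. Fall-back to
    earlier ECN variants is outside the scope of this sketch.
    """
    if syn_flags != ACCECN_SYN_FLAGS:
        return None
    return SYNACK_FEEDBACK[syn_ip_ecn]


def client_decode_synack(synack_flags: Flags) -> Optional[IpEcn]:
    """Return the IP-ECN codepoint fed back on the SYN/ACK.

    Returns None when the flags are not one of the four combinations
    that confirm AccECN support.
    """
    return SYNACK_DECODE.get(synack_flags)
```

   A client that sent its SYN as Not-ECT could compare the decoded
   codepoint with IpEcn.NOT_ECT to detect interference with the IP-ECN
   field on the path, in the manner described in Section 2.5.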
558 Once a TCP client (A) has sent the above SYN to declare that it 559 supports AccECN, and once it has received the above SYN/ACK segment 560 that confirms that the TCP server supports AccECN, the TCP client 561 MUST set both its half connections into AccECN mode. 563 Once in AccECN mode, a TCP client or server has the rights and 564 obligations to participate in the ECN protocol defined in 565 Section 3.1.5. 567 The procedure for the client to follow if a SYN/ACK does not arrive 568 before its retransmission timer expires is given in Section 3.1.4. 570 3.1.2. Backward Compatibility 572 The three flags set to 1 to indicate AccECN support on the SYN have 573 been carefully chosen to enable natural fall-back to prior stages in 574 the evolution of ECN, as above. Table 2 tabulates all the 575 negotiation possibilities for ECN-related capabilities that involve 576 at least one AccECN-capable host. The entries in the first two 577 columns have been abbreviated, as follows: 579 AccECN: More Accurate ECN Feedback (the present specification) 581 Nonce: ECN Nonce feedback [RFC3540] 583 ECN: 'Classic' ECN feedback [RFC3168] 585 No ECN: Not-ECN-capable. Implicit congestion notification using 586 packet drop. 
588 +--------+--------+------------+------------+-----------------------+ 589 | A | B | SYN A->B | SYN/ACK | Feedback Mode | 590 | | | | B->A | | 591 +--------+--------+------------+------------+-----------------------+ 592 | | | AE CWR ECE | AE CWR ECE | | 593 | AccECN | AccECN | 1 1 1 | 0 1 0 | AccECN(no ECT on SYN) | 594 | AccECN | AccECN | 1 1 1 | 0 1 1 | AccECN (ECT1 on SYN) | 595 | AccECN | AccECN | 1 1 1 | 1 0 0 | AccECN (ECT0 on SYN) | 596 | AccECN | AccECN | 1 1 1 | 1 1 0 | AccECN (CE on SYN) | 597 | | | | | | 598 | AccECN | Nonce | 1 1 1 | 1 0 1 | (Reserved) | 599 | AccECN | ECN | 1 1 1 | 0 0 1 | classic ECN | 600 | AccECN | No ECN | 1 1 1 | 0 0 0 | Not ECN | 601 | | | | | | 602 | Nonce | AccECN | 0 1 1 | 0 0 1 | classic ECN | 603 | ECN | AccECN | 0 1 1 | 0 0 1 | classic ECN | 604 | No ECN | AccECN | 0 0 0 | 0 0 0 | Not ECN | 605 | | | | | | 606 | AccECN | Broken | 1 1 1 | 1 1 1 | Not ECN | 607 +--------+--------+------------+------------+-----------------------+ 609 Table 2: ECN capability negotiation between Client (A) and Server (B) 611 Table 2 is divided into blocks each separated by an empty row. 613 1. The top block shows the case already described in Section 3.1 614 where both endpoints support AccECN and how the TCP server (B) 615 indicates congestion feedback. 617 2. The second block shows the cases where the TCP client (A) 618 supports AccECN but the TCP server (B) supports some earlier 619 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 620 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 621 shown it MUST set both its half connections into the feedback 622 mode shown in the rightmost column. If it has set itself into 623 classic ECN feedback mode it MUST then comply with [RFC3168]. 625 The server response called 'Nonce' in the table is now historic. 626 For an AccECN implementation, there is no need to recognize or 627 support ECN Nonce feedback [RFC3540], which has been reclassified 628 as historic [RFC8311]. 
AccECN is compatible with alternative ECN 629 feedback integrity approaches (see Section 5.3). 631 3. The third block shows the cases where the TCP server (B) supports 632 AccECN but the TCP client (A) supports some earlier variant of 633 TCP feedback, indicated in its SYN. 635 When an AccECN-enabled TCP server (B) receives a SYN with 636 AE,CWR,ECE = 0,1,1 it MUST do one of the following: 638 * set both its half connections into the classic ECN feedback 639 mode and return a SYN/ACK with AE, CWR, ECE = 0,0,1 as shown. 640 Then it MUST comply with [RFC3168]. 642 * set both its half-connections into No ECN mode and return a 643 SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN 644 disabled. This latter case is unlikely to be desirable, but 645 it is allowed as a possibility, e.g. for minimal TCP 646 implementations. 648 When an AccECN-enabled TCP server (B) receives a SYN with 649 AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the 650 Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0 651 as shown and continue with ECN disabled. 653 4. The fourth block displays a combination labelled `Broken'. Some 654 older TCP server implementations incorrectly set the reserved 655 flags in the SYN/ACK by reflecting those in the SYN. Such broken 656 TCP servers (B) cannot support ECN, so as soon as an AccECN- 657 capable TCP client (A) receives such a broken SYN/ACK it MUST 658 fall back to Not ECN mode for both its half connections and 659 continue with ECN disabled. 661 The following additional rules do not fit the structure of the table, 662 but they complement it: 664 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 665 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 666 Host A MUST then enter the same feedback mode as it would have 667 entered had it been a responding host and received the same SYN. 668 Then host A MUST send the same SYN/ACK as it would have sent had 669 it been a responding host. 
671 In-window SYN during TIME-WAIT: Many TCP implementations create a 672 new TCP connection if they receive an in-window SYN packet during 673 TIME-WAIT state. When a TCP host enters TIME-WAIT or CLOSED 674 state, it should ignore any previous state about the negotiation 675 of AccECN for that connection and renegotiate the feedback mode 676 according to Table 2. 678 3.1.3. Forward Compatibility 680 If a TCP server that implements AccECN receives a SYN with the three 681 TCP header flags (AE, CWR and ECE) set to any combination other than 682 000, 011 or 111, it MUST negotiate the use of AccECN as if they had 683 been set to 111. This ensures that future uses of the other 684 combinations on a SYN can rely on consistent behaviour from the 685 installed base of AccECN servers. 687 For the avoidance of doubt, the behaviour described in the present 688 specification applies whether or not the three remaining reserved TCP 689 header flags are zero. 691 3.1.4. Retransmission of the SYN 693 If the sender of an AccECN SYN times out before receiving the SYN/ 694 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 695 least one more time by continuing to set all three TCP ECN flags on 696 the first retransmitted SYN (using the usual retransmission time- 697 outs). If this first retransmission also fails to be acknowledged, 698 the sender SHOULD send subsequent retransmissions of the SYN with the 699 three TCP-ECN flags cleared (AE=CWR=ECE=0). A retransmitted SYN MUST 700 use the same ISN as the original SYN. 702 Retrying once before fall-back adds delay in the case where a 703 middlebox drops an AccECN (or ECN) SYN deliberately. However, 704 current measurements imply that a drop is less likely to be due to 705 middlebox interference than other intermittent causes of loss, 706 e.g. congestion, wireless interference, etc. 708 Implementers MAY use other fall-back strategies if they are found to 709 be more effective (e.g. 
attempting to negotiate AccECN on the SYN 710 only once or more than twice (most appropriate during high levels of 711 congestion)). However, other fall-back strategies will need to follow 712 all the rules in Section 3.1.5, which concern behaviour when SYNs or 713 SYN/ACKs negotiating different types of feedback have been sent 714 within the same connection. 716 Further, it may make sense to also remove any other new or 717 experimental fields or options on the SYN in case a middlebox might 718 be blocking them, although the required behaviour will depend on the 719 specification of the other option(s) and any attempt to co-ordinate 720 fall-back between different modules of the stack. 722 Whichever fall-back strategy is used, the TCP initiator SHOULD cache 723 failed connection attempts. If it does, it SHOULD NOT give up 724 attempting to negotiate AccECN on the SYN of subsequent connection 725 attempts until it is clear that the blockage is persistently and 726 specifically due to AccECN. The cache should be arranged to expire 727 so that the initiator will infrequently attempt to check whether the 728 problem has been resolved. 730 The fall-back procedure if the TCP server receives no ACK to 731 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 732 Section 3.2.3.2. 734 3.1.5. Implications of AccECN Mode 736 Section 3.1.1 describes the only ways that a host can enter AccECN 737 mode, whether as a client or as a server. 739 As a Data Sender, a host in AccECN mode has the rights and 740 obligations concerning the use of ECN defined below, which build on 741 those in [RFC3168] as updated by [RFC8311]: 743 o Using ECT: 745 * It can set an ECT codepoint in the IP header of packets to 746 indicate to the network that the transport is capable and 747 willing to participate in ECN for this packet. 749 * It does not have to set ECT on any packet (for instance if it 750 has reason to believe such a packet would be blocked).
752 o Switching feedback negotiation (e.g. fall-back): 754 * It SHOULD NOT set ECT on any packet if it has received at least 755 one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0. A 756 "valid SYN" has the same port numbers and the same ISN as the 757 SYN that caused the server to enter AccECN mode. 759 * It MUST NOT send an ECN-setup SYN [RFC3168] within the same 760 connection as it has sent a SYN requesting AccECN feedback. 762 * It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same 763 connection as it has sent a SYN/ACK agreeing to use AccECN 764 feedback. 766 The above rules are necessary because, if one peer were to 767 negotiate the feedback mode in two different types of handshake, 768 it would not be possible for the other peer to know for certain 769 which handshake packet(s) the other end had eventually received or 770 in which order it received them. So, in the absence of these 771 rules, the two peers could end up using different feedback modes 772 without knowing it. 774 o Congestion response: 776 * It is still obliged to respond appropriately to AccECN feedback 777 that indicates there were ECN marks on packets it had 778 previously sent, as defined in Section 6.1 of [RFC3168] and 779 updated by Sections 2.1 and 4.1 of [RFC8311]. 781 * The commitment to respond appropriately to incoming indications 782 of congestion remains even if it sends a SYN packet with 783 AE=CWR=ECE=0, in a later transmission within the same TCP 784 connection. 786 * Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate 787 it has received and responded to indications of congestion (for 788 the avoidance of doubt, this does not preclude it from setting 789 the bits of the ACE counter field, which includes an overloaded 790 use of the same bit). 792 As a Data Receiver: 794 o a host in AccECN mode MUST feed back the information in the IP-ECN 795 field of incoming packets using Accurate ECN feedback, as 796 specified in Section 3.2 below. 
798 o if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168] 799 during the same connection as it receives a SYN requesting AccECN 800 feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST 801 reset the connection with a RST packet. 803 o If for any reason it is not willing to provide ECN feedback on a 804 particular TCP connection, to indicate this unwillingness it 805 SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ACK 806 packets that it sends. 808 o it MUST NOT use reception of packets with ECT set in the IP-ECN 809 field as an implicit signal that the peer is ECN-capable. Reason: 810 ECT at the IP layer does not explicitly confirm the peer has the 811 correct ECN feedback logic, as the packets could have been mangled 812 at the IP layer. 814 3.2. AccECN Feedback 816 Each Data Receiver of each half connection maintains four counters, 817 r.cep, r.ceb, r.e0b and r.e1b: 819 o The Data Receiver MUST increment the CE packet counter (r.cep), 820 for every Acceptable packet that it receives with the CE code 821 point in the IP ECN field, including CE marked control packets but 822 excluding CE on SYN packets (SYN=1; ACK=0). 824 o The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte 825 counters by the number of TCP payload octets in Acceptable packets 826 marked respectively with the CE, ECT(0) and ECT(1) codepoint in 827 their IP-ECN field, including any payload octets on control 828 packets, but not including any payload octets on SYN packets 829 (SYN=1; ACK=0). 831 Each Data Sender of each half connection maintains four counters, 832 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 833 counters at the Data Receiver. 835 A Data Receiver feeds back the CE packet counter using the Accurate 836 ECN (ACE) field, as explained in Section 3.2.2. And it feeds back 837 all the byte counters using the AccECN TCP Option, as specified in 838 Section 3.2.3. 
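The counter rules above can be sketched as follows. This is a non-normative illustration: the class and method names are hypothetical, the initial values are those given in Section 3.2.1, and the counters are held as unbounded integers for simplicity. The special cases that apply during the three-way handshake (Section 3.2.2.2), such as counting at most one CE-marked SYN/ACK, are deliberately not modelled.

```python
# Illustrative sketch only. Names are hypothetical.
from dataclasses import dataclass

# IP-ECN codepoints (2-bit field in the IP header, as in RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ReceiverCounters:
    """The four counters kept by a Data Receiver for one half connection."""
    cep: int = 5  # r.cep: CE packet counter
    ceb: int = 0  # r.ceb: CE byte counter
    e0b: int = 1  # r.e0b: ECT(0) byte counter
    e1b: int = 0  # r.e1b: ECT(1) byte counter

    def on_acceptable_packet(self, ip_ecn, payload_len, syn=False, ack=True):
        """Update the counters for one Acceptable packet."""
        if syn and not ack:
            # SYN packets (SYN=1; ACK=0) are excluded from all counters.
            return
        if ip_ecn == CE:
            self.cep += 1               # counts packets, including control packets
            self.ceb += payload_len     # counts TCP payload octets
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any counter.
```

A CE-marked pure ACK therefore increments r.cep but leaves r.ceb unchanged, because it carries no payload octets.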
840 Whenever a host feeds back the value of any counter, it MUST report 841 the most recent value, no matter whether it is in a pure ACK, an ACK 842 with new payload data or a retransmission. Therefore the feedback 843 carried on a retransmitted packet is unlikely to be the same as the 844 feedback on the original packet. 846 3.2.1. Initialization of Feedback Counters 848 When a host first enters AccECN mode, in its role as a Data Receiver 849 it initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = 850 r.e1b = 0. 851 Non-zero initial values are used to support a stateless handshake 852 (see Section 5.1) and to be distinct from cases where the fields are 853 incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4). 855 When a host enters AccECN mode, in its role as a Data Sender it 856 initializes its counters to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 857 0. 859 3.2.2. The ACE Field 861 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 862 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 863 as one 3-bit field. Then the field is given a new name, ACE, as 864 shown in Figure 3. 866 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 867 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 868 | | | | U | A | P | R | S | F | 869 | Header Length | Reserved | ACE | R | C | S | S | Y | I | 870 | | | | G | K | H | T | N | N | 871 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 873 Figure 3: Definition of the ACE field within bytes 13 and 14 of the 874 TCP Header (when AccECN has been negotiated and SYN=0). 876 The original definition of these three flags in the TCP header, 877 including the addition of support for the ECN Nonce, is shown for 878 comparison in Figure 1. This specification does not rename these 879 three TCP flags to ACE unconditionally; it merely overloads them with 880 another name and definition once an AccECN connection has been 881 established.
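The layout in Figure 3 can be illustrated with a short, non-normative sketch that reads the ACE field out of a raw TCP header. It assumes the header is available as a byte string and that AccECN has already been negotiated for the connection; the function name is hypothetical. Bytes 13 and 14 of the header (offsets 12 and 13 when counting from zero) are read as one 16-bit word in network byte order, in which ACE occupies bits 7 to 9 and SYN is bit 14, counting from the most significant bit as in the figure.

```python
# Illustrative sketch only. The function name is hypothetical.
import struct

SYN_MASK = 0x0002   # bit 14 of the 16-bit word in Figure 3
ACE_SHIFT = 6       # ACE occupies bits 7..9, i.e. 6 bits above the LSB
ACE_MASK = 0x7


def ace_from_tcp_header(tcp_header):
    """Return the 3-bit ACE value, or None if the segment has SYN=1.

    The caller is assumed to have checked that AccECN was negotiated,
    because the three flags only form the ACE field in that case.
    """
    (word,) = struct.unpack_from("!H", tcp_header, 12)
    if word & SYN_MASK:
        return None  # AE, CWR and ECE keep their handshake meaning
    return (word >> ACE_SHIFT) & ACE_MASK
```

Because AE is the most significant of the three bits, a segment with AE=1, CWR=0 and ECE=1 yields an ACE value of 0b101.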
883 With one exception (Section 3.2.2.1), a host with both of its half- 884 connections in AccECN mode MUST interpret the AE, CWR and ECE flags 885 as the 3-bit ACE counter on a segment with the SYN flag cleared 886 (SYN=0). On such a packet, a Data Receiver MUST encode the three 887 least significant bits of its r.cep counter into the ACE field that 888 it feeds back to the Data Sender. A host MUST NOT interpret the 3 889 flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is 890 0 or 1), or if AccECN negotiation is incomplete or has not succeeded. 892 Both parts of each of these conditions are equally important. For 893 instance, even if AccECN negotiation has been successful, the ACE 894 field is not defined on any segments with SYN=1 (e.g. a 895 retransmission of an unacknowledged SYN/ACK, or when both ends send 896 SYN/ACKs after AccECN support has been successfully negotiated during 897 a simultaneous open). 899 3.2.2.1. ACE Field on the ACK of the SYN/ACK 901 A TCP client (A) in AccECN mode MUST feed back which of the 4 902 possible values of the IP-ECN field was on the SYN/ACK by writing it 903 into the ACE field of a pure ACK with no SACK blocks using the binary 904 encoding in Table 3 (which is the same as that used on the SYN/ACK in 905 Table 2). This shall be called the handshake encoding of the ACE 906 field, and it is the only exception to the rule that the ACE field 907 carries the 3 least significant bits of the r.cep counter on packets 908 with SYN=0. 910 Normally, a TCP client acknowledges a SYN/ACK with an ACK that 911 satisfies the above conditions anyway (SYN=0, no data, no SACK 912 blocks). If an AccECN TCP client intends to acknowledge the SYN/ACK 913 with a packet that does not satisfy these conditions (e.g. 
it has 914 data to include on the ACK), it SHOULD first send a pure ACK that 915 does satisfy these conditions (see Section 5.2), so that it can feed 916 back which of the four values of the IP-ECN field arrived on the SYN/ 917 ACK. A valid exception to this "SHOULD" would be where the 918 implementation will only be used in an environment where mangling of 919 the ECN field is unlikely. 921 +---------------------+---------------------+-----------------------+ 922 | IP-ECN codepoint on | ACE on pure ACK of | r.cep of client in | 923 | SYN/ACK | SYN/ACK | AccECN mode | 924 +---------------------+---------------------+-----------------------+ 925 | Not-ECT | 0b010 | 5 | 926 | ECT(1) | 0b011 | 5 | 927 | ECT(0) | 0b100 | 5 | 928 | CE | 0b110 | 6 | 929 +---------------------+---------------------+-----------------------+ 931 Table 3: The encoding of the ACE field in the ACK of the SYN-ACK to 932 reflect the SYN-ACK's IP-ECN field 934 When an AccECN server in SYN-RCVD state receives a pure ACK with 935 SYN=0 and no SACK blocks, instead of treating the ACE field as a 936 counter, it MUST infer the meaning of each possible value of the ACE 937 field from Table 4, which also shows the value that an AccECN server 938 MUST set s.cep to as a result. 940 Given this encoding of the ACE field on the ACK of a SYN/ACK is 941 exceptional, an AccECN server using large receive offload (LRO) might 942 prefer to disable LRO until such an ACK has transitioned it out of 943 SYN-RCVD state. 
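The handshake encoding in Table 3 can be sketched from the client's side as a simple lookup. This is a non-normative illustration with hypothetical names; it only reproduces the values in Table 3 and does not model when the pure ACK is sent.

```python
# Illustrative sketch only. Names are hypothetical; values follow Table 3.

# IP-ECN codepoints (2-bit field in the IP header, as in RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

# ACE value written into the pure ACK of the SYN/ACK.
HANDSHAKE_ACE = {
    NOT_ECT: 0b010,
    ECT1:    0b011,
    ECT0:    0b100,
    CE:      0b110,
}


def client_ack_of_synack(synack_ip_ecn):
    """Return (ace, r_cep) for a client that has just entered AccECN mode.

    ace is the handshake encoding for the pure ACK of the SYN/ACK and
    r_cep is the client's CE packet counter after the SYN/ACK arrived.
    """
    r_cep = 6 if synack_ip_ecn == CE else 5
    return HANDSHAKE_ACE[synack_ip_ecn], r_cep
```

Note that only the CE row happens to coincide with the three least significant bits of r.cep (6 = 0b110); the other rows do not, which is why this encoding is an exception to the general rule for segments with SYN=0.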
945 +---------------+-----------------------------+---------------------+ 946 | ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in | 947 | SYN/ACK | inferred by server | AccECN mode | 948 +---------------+-----------------------------+---------------------+ 949 | 0b000 | {Notes 1, 3} | Disable ECN | 950 | 0b001 | {Notes 2, 3} | 5 | 951 | 0b010 | Not-ECT | 5 | 952 | 0b011 | ECT(1) | 5 | 953 | 0b100 | ECT(0) | 5 | 954 | 0b101 | Currently Unused {Note 2} | 5 | 955 | 0b110 | CE | 6 | 956 | 0b111 | Currently Unused {Note 2} | 5 | 957 +---------------+-----------------------------+---------------------+ 959 Table 4: Meaning of the ACE field on the ACK of the SYN/ACK 961 {Note 1}: If the server is in AccECN mode, the value of zero raises 962 suspicion of zeroing of the ACE field on the path (see 963 Section 3.2.2.3). 965 {Note 2}: If the server is in AccECN mode, these values are Currently 966 Unused but the AccECN server's behaviour is still defined for forward 967 compatibility. Then the designer of a future protocol can know for 968 certain what AccECN servers will do with these codepoints. 970 {Note 3}: In the case where a server that implements AccECN is also 971 using a stateless handshake (termed a SYN cookie) it will not 972 remember whether it entered AccECN mode. The values 0b000 or 0b001 973 will remind it that it did not enter AccECN mode, because AccECN does 974 not use them (see Section 5.1 for details). If a stateless server 975 that implements AccECN receives either of these two values in the 976 ACK, its action is implementation-dependent and outside the scope of 977 this spec. It will certainly not take the action in the third column 978 because, after it receives either of these values, it is not in 979 AccECN mode. I.e., it will not disable ECN (at least not just 980 because ACE is 0b000) and it will not set s.cep. 982 3.2.2.2.
Encoding and Decoding Feedback in the ACE Field 984 Whenever the Data Receiver sends an ACK with SYN=0 (with or without 985 data), unless the handshake encoding in Section 3.2.2.1 applies, the 986 Data Receiver MUST encode the least significant 3 bits of its r.cep 987 counter into the ACE field (see Appendix A.2). 989 Whenever the Data Sender receives an ACK with SYN=0 (with or without 990 data), it first checks whether it has already been superseded by 991 another ACK in which case it ignores the ECN feedback. If the ACK 992 has not been superseded, and if the special handshake encoding in 993 Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field 994 as follows (see Appendix A.2 for examples). 996 o It takes the least significant 3 bits of its local s.cep counter 997 and subtracts them from the incoming ACE counter to work out the 998 minimum positive increment it could apply to s.cep (assuming the 999 ACE field only wrapped at most once). 1001 o It then follows the safety procedures in Section 3.2.2.5.2 to 1002 calculate or estimate how many packets the ACK could have 1003 acknowledged under the prevailing conditions to determine whether 1004 the ACE field might have wrapped more than once. 1006 The encode/decode procedures during the three-way handshake are 1007 exceptions to the general rules given so far, so they are spelled out 1008 step by step below for clarity: 1010 o If a TCP server in AccECN mode receives a CE mark in the IP-ECN 1011 field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it 1012 remains at its initial value of 5). 1014 Reason: It would be redundant for the server to include CE-marked 1015 SYNs in its r.cep counter, because it already reliably delivers 1016 feedback of any CE marking on the SYN/ACK using the encoding in 1017 Table 2. 
This also ensures that, when the server starts using the 1018 ACE field, it has not unnecessarily consumed more than one initial 1019 value, given they can be used to negotiate variants of the AccECN 1020 protocol (see Appendix B.3). 1022 o If a TCP client in AccECN mode receives CE feedback in the TCP 1023 flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its 1024 initial value of 5), so that it stays in step with r.cep on the 1025 server. Nonetheless, the TCP client still triggers the congestion 1026 control actions necessary to respond to the CE feedback. 1028 o If a TCP client in AccECN mode receives a CE mark in the IP-ECN 1029 field of a SYN/ACK, it MUST increment r.cep, but no more than once 1030 no matter how many CE-marked SYN/ACKs it receives 1031 (i.e. incremented from 5 to 6, but no further). 1033 Reason: Incrementing r.cep ensures the client will eventually 1034 deliver any CE marking to the server reliably when it starts using 1035 the ACE field. Even though the client also feeds back any CE 1036 marking on the ACK of the SYN/ACK using the encoding in Table 3, 1037 this ACK is not delivered reliably, so it can be considered as a 1038 timely notification that is redundant but unreliable. The client 1039 does not increment r.cep more than once, because the server can 1040 only increment s.cep once (see next bullet). Also, this limits 1041 the unnecessarily consumed initial values of the ACE field to two. 1043 o If a TCP server in AccECN mode and in SYN-RCVD state receives CE 1044 feedback in the TCP flags of a pure ACK with no SACK blocks, it 1045 MUST increment s.cep (from 5 to 6). The TCP server then triggers 1046 the congestion control actions necessary to respond to the CE 1047 feedback. 1049 Reasoning: The TCP server can only increment s.cep once, because 1050 the first ACK it receives will cause it to transition out of SYN- 1051 RCVD state. 
The server's congestion response would be no 1052 different even if it could receive feedback of more than one CE- 1053 marked SYN/ACK. 1055 Once the TCP server transitions to ESTABLISHED state, it might 1056 later receive other pure ACK(s) with the handshake encoding in the 1057 ACE field. A server MAY implement a test for such a case, but it 1058 is not required. Therefore, once in the ESTABLISHED state, it 1059 will be sufficient for the server to consider the ACE field to be 1060 encoded as the normal ACE counter on all packets with SYN=0. 1062 Reasoning: Such ACKs will be quite unusual, e.g. a SYN/ACK (or ACK 1063 of the SYN/ACK) that is delayed for longer than the server's 1064 retransmission timeout; or packet duplication by the network. And 1065 the impact of any error in the feedback on such ACKs will only be 1066 temporary. 1068 3.2.2.3. Testing for Zeroing of the ACE Field 1070 Section 3.2.1 required the Data Receiver to initialize the r.cep 1071 counter to a non-zero value. Therefore, in either direction the 1072 initial value of the ACE counter ought to be non-zero. 1074 If AccECN has been successfully negotiated, the Data Sender SHOULD 1075 check the value of the ACE counter in the first packet (with or 1076 without data) that arrives with SYN=0. If the value of this ACE 1077 field is zero (0b000), the Data Sender disables sending ECN-capable 1078 packets for the remainder of the half-connection by setting the IP/ 1079 ECN field in all subsequent packets to Not-ECT. 1081 Usually, the server checks the ACK of the SYN/ACK from the client, 1082 while the client checks the first data segment from the server. 1083 However, if reordering occurs, "the first packet ... that arrives" 1084 will not necessarily be the same as the first packet in sequence 1085 order.
The test has been specified loosely like this to simplify 1086 implementation, and because it would not have been any more precise 1087 to have specified the first packet in sequence order, which would not 1088 necessarily be the first ACE counter that the Data Receiver fed back 1089 anyway, given it might have been a retransmission. 1091 The possibility of re-ordering means that there is a small chance 1092 that the ACE field on the first packet to arrive is genuinely zero 1093 (without middlebox interference). This would cause a host to 1094 unnecessarily disable ECN for a half connection. Therefore, in 1095 environments where there is no evidence of the ACE field being 1096 zeroed, implementations can skip this test. 1098 Note that the Data Sender MUST NOT test whether the arriving counter 1099 in the initial ACE field has been initialized to a specific valid 1100 value - the above check solely tests whether the ACE fields have been 1101 incorrectly zeroed. This allows hosts to use different initial 1102 values as an additional signalling channel in future. 1104 3.2.2.4. Testing for Mangling of the IP/ECN Field 1106 The value of the ACE field on the SYN/ACK indicates the value of the 1107 IP/ECN field when the SYN arrived at the server. The client can 1108 compare this with how it originally set the IP/ECN field on the SYN. 1109 If this comparison implies an unsafe transition (see below) of the 1110 IP/ECN field, for the remainder of the connection the client MUST NOT 1111 send ECN-capable packets, but it MUST continue to feed back any ECN 1112 markings on arriving packets. 1114 The value of the ACE field on the last ACK of the 3WHS indicates the 1115 value of the IP/ECN field when the SYN/ACK arrived at the client. 1116 The server can compare this with how it originally set the IP/ECN 1117 field on the SYN/ACK. 
If this comparison implies an unsafe 1118 transition of the IP/ECN field, for the remainder of the connection 1119 the server MUST NOT send ECN-capable packets, but it MUST continue to 1120 feed back any ECN markings on arriving packets. 1122 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 1123 count of CE marks is still eventually delivered reliably). If this 1124 ACK does not arrive, the server can continue to send ECN-capable 1125 packets without having tested for mangling of the IP/ECN field on the 1126 SYN/ACK. 1128 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 1129 repeated here for convenience: 1131 o the not-ECT codepoint changes; 1133 o either ECT codepoint transitions to not-ECT; 1135 o the CE codepoint changes. 1137 RFC 3168 says that a router that changes ECT to not-ECT is invalid 1138 but safe. However, from a host's viewpoint, this transition is 1139 unsafe because it could be the result of two transitions at different 1140 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 1141 This scenario could well happen where an ECN-enabled home router 1142 congests its upstream mobile broadband bottleneck link, then the 1143 ingress to the mobile network clears the ECN field [Mandalari18]. 1145 Once a Data Sender has entered AccECN mode it SHOULD check whether 1146 all feedback received for the first three or four rounds indicated 1147 that every packet it sent was CE-marked. If so, for the remainder of 1148 the connection, the Data Sender SHOULD NOT send ECN-capable packets, 1149 but it MUST continue to feed back any ECN markings on arriving 1150 packets. 1152 The above fall-back behaviours are necessary in case mangling of the 1153 IP/ECN field is asymmetric, which is currently common over some 1154 mobile networks [Mandalari18]. 
Then one end might see no unsafe 1155 transition and continue sending ECN-capable packets, while the other 1156 end sees an unsafe transition and stops sending ECN-capable packets. 1158 3.2.2.5. Safety against Ambiguity of the ACE Field 1160 If too many CE-marked segments are acknowledged at once, or if a long 1161 run of ACKs is lost or thinned out, the 3-bit counter in the ACE 1162 field might have cycled between two ACKs arriving at the Data Sender. 1163 The following safety procedures minimize this ambiguity. 1165 3.2.2.5.1. Data Receiver Safety Procedures 1167 The following rules define when a Data Receiver in AccECN mode emits 1168 an ACK: 1170 Change-Triggered ACKs: An AccECN Data Receiver SHOULD emit an ACK 1171 whenever a data packet marked CE arrives after the previous packet 1172 was not CE. 1174 Even though this rule is stated as a "SHOULD", it is important for 1175 a transition to trigger an ACK if at all possible. The only valid 1176 exception to this rule is given below these bullets. 1178 For the avoidance of doubt, this rule is deliberately worded to 1179 apply solely when _data_ packets arrive, but the comparison with 1180 the previous packet includes any packet, not just data packets. 1182 Increment-Triggered ACKs: An AccECN Data Receiver MUST emit an ACK 1183 if 'n' CE marks have arrived since the previous ACK. If there is 1184 new data to acknowledge, 'n' SHOULD be 2. If there is no new data 1185 to acknowledge, 'n' SHOULD be 3 and MUST be no less than 3. In 1186 either case, 'n' MUST be no greater than 6. 1188 The above rules for when to send an ACK are designed to be 1189 complemented by those in Section 3.2.3.3, which concern whether the 1190 AccECN TCP Option ought to be included on ACKs. 1192 If the arrivals of a number of data packets are all processed as one 1193 event, e.g.
using large receive offload (LRO) or generic receive 1194 offload (GRO), both the above rules SHOULD be interpreted as 1195 requiring multiple ACKs to be emitted back-to-back (for each 1196 transition and for each repetition by 'n' CE marks). If this is 1197 problematic for high performance, either rule can be interpreted as 1198 requiring just a single ACK at the end of the whole receive event. 1200 Even if a number of data packets do not arrive as one event, the 1201 'Change-Triggered ACKs' rule could sometimes cause the ACK rate to be 1202 problematic for high performance (although high performance protocols 1203 such as DCTCP already successfully use change-triggered ACKs). The 1204 rationale for change-triggered ACKs is so that the Data Sender can 1205 rely on them to detect queue growth as soon as possible, particularly 1206 at the start of a flow. The approach can lead to some additional 1207 ACKs but it feeds back the timing and the order in which ECN marks 1208 are received with minimal additional complexity. If CE marks are 1209 infrequent, as is the case for most AQMs at the time of writing, or 1210 there are multiple marks in a row, the additional load will be low. 1211 However, marking patterns with numerous non-contiguous CE marks could 1212 increase the load significantly. One possible compromise would be 1213 for the receiver to heuristically detect whether the sender is in 1214 slow-start, then to implement change-triggered ACKs while the sender 1215 is in slow-start, and offload otherwise. 1217 With ECN-capable pure ACKs [I-D.ietf-tcpm-generalized-ecn], the 1218 'Increment-Triggered ACKs' rule could cause ECN-marked pure ACKs to 1219 trigger further ACKs. Although TCP normally only ACKs new data, in 1220 this case the ACKs of ACKs would feed back new congestion state. 
The 1221 minimum of 3 for 'n' in this case ensures that, even if there is 1222 pathological congestion in both directions, any resulting ping-pong 1223 of ACKs will be rapidly damped. 1225 These ACKs of ACKs could be misidentified as duplicate ACKs in 1226 certain circumstances described below. Therefore, a host in AccECN 1227 mode that is sending ECN-capable pure ACKs SHOULD add one of the 1228 following additional checks when it tests whether an incoming pure 1229 ACK is a duplicate: 1231 o If SACK has been negotiated for the connection, but there is no 1232 SACK option on the incoming pure ACK, it is not a duplicate; 1234 o If timestamps are in use, and the incoming pure ACK echoes a 1235 timestamp older than the oldest unacknowledged data, it is not a 1236 duplicate. 1238 In the unlikely event that neither SACK nor timestamps are in use, or 1239 if the implementation has opted not to include either of the above 1240 two checks, it SHOULD NOT send ECN-capable pure ACKs. If it does, it 1241 could lead to false detection of duplicate ACKs, causing spurious 1242 retransmission(s) with a resulting unnecessary reduction in 1243 congestion window; but only in certain circumstances. Specifically, 1244 this can happen if TCP peer A has been sending data, then receiving, then within one 1245 round trip it starts sending again, and the ECN-capable pure ACKs it 1246 sent in the previous round encounter heavy enough congestion to 1247 trigger peer B to invoke the above 'n'-CE-mark rule. Also note that 1248 falsely considering these ACKs as duplicates would incorrectly imply 1249 that data left the network. 1251 3.2.2.5.2. Data Sender Safety Procedures 1253 If the Data Sender has not received AccECN TCP Options to give it 1254 more dependable information, and it detects that the ACE field could 1255 have cycled, it SHOULD deem whether it cycled by taking the safest 1256 likely case under the prevailing conditions.
It can detect if the 1257 counter could have cycled by using the jump in the acknowledgement 1258 number since the last ACK to calculate or estimate how many segments 1259 could have been acknowledged. An example algorithm to implement this 1260 policy is given in Appendix A.2. An implementer MAY develop an 1261 alternative algorithm as long as it satisfies these requirements. 1263 If missing acknowledgement numbers arrive later (reordering) and 1264 prove that the counter did not cycle, the Data Sender MAY attempt to 1265 neutralize the effect of any action it took based on a conservative 1266 assumption that it later found to be incorrect. 1268 The Data Sender can estimate how many packets (of any marking) an ACK 1269 acknowledges. If the ACE counter on an ACK seems to imply that the 1270 minimum number of newly CE-marked packets is greater than the number 1271 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1272 counter, unless it can be sure that it is counting all control 1273 packets correctly. 1275 3.2.3. The AccECN Option 1277 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1278 of each field name stands for 'Echo'.
1280     0                   1                   2                   3
1281     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1282    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1283    |  Kind = TBD0  |  Length = 11  |          EE0B field           |
1284    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1285    | EE0B (cont'd) |                  ECEB field                   |
1286    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1287    |                  EE1B field                   |                  Order 0
1288    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1290     0                   1                   2                   3
1291     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1292    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1293    |  Kind = TBD1  |  Length = 11  |          EE1B field           |
1294    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1295    | EE1B (cont'd) |                  ECEB field                   |
1296    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1297    |                  EE0B field                   |                  Order 1
1298    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1300                    Figure 4: The AccECN TCP Option
1302 Figure 4 shows two option field orders: order 0 and order 1. They 1303 both consist of three 24-bit fields. Order 0 provides the 24 least 1304 significant bits of the r.e0b, r.ceb and r.e1b counters, 1305 respectively. Order 1 provides the same fields, but in the opposite 1306 order. On each packet, the Data Receiver can use whichever order is 1307 more efficient. 1309 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1310 field to TBD0 if using Order 0, or to TBD1 if using Order 1. These 1311 two new TCP Option Kinds are registered in Section 7 and called 1312 respectively AccECN0 and AccECN1. 1314 Note that there is no field to feed back Not-ECT bytes. Nonetheless 1315 an algorithm for the Data Sender to calculate the number of payload 1316 bytes received as Not-ECT is given in Appendix A.4.
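As a non-normative illustration of the layout in Figure 4, the following Python sketch builds a full-length (11-octet) AccECN Option in either field order. It is a minimal sketch under stated assumptions, not a reference implementation: the function name and its parameters are hypothetical, the Kind values must be supplied by the caller because this draft still lists them as TBD0 and TBD1, and network byte order is assumed for the 24-bit fields, as for other multi-octet TCP fields.

```python
MASK24 = 0xFFFFFF  # each option field carries the 24 least significant bits


def encode_accecn_option(order, r_e0b, r_ceb, r_e1b, kinds):
    """Build a full-length AccECN Option as bytes (illustrative only).

    order  -- 0 or 1, selecting the field order shown in Figure 4
    kinds  -- mapping {0: kind_for_order_0, 1: kind_for_order_1}; the
              draft leaves both as TBD0/TBD1, so these are placeholders
              chosen by the caller (hypothetical parameter)
    """
    if order not in (0, 1):
        raise ValueError("order must be 0 or 1")
    # Order 0 carries EE0B, ECEB, EE1B; Order 1 carries EE1B, ECEB, EE0B.
    if order == 0:
        counters = (r_e0b, r_ceb, r_e1b)
    else:
        counters = (r_e1b, r_ceb, r_e0b)
    # Assumption: fields are sent most significant octet first.
    body = b"".join((c & MASK24).to_bytes(3, "big") for c in counters)
    # Kind octet, Length octet (2 + 9 = 11), then the three fields.
    return bytes([kinds[order], 2 + len(body)]) + body
```

A caller would pass its current r.e0b, r.ceb and r.e1b counter values; only the low 24 bits of each counter reach the wire, so larger counters are masked, not rejected.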
1318 Whenever a Data Receiver sends an AccECN Option, the rules in 1319 Section 3.2.3.3 allow it to omit unchanged fields from the tail of 1320 the option, to help cope with option space limitations, as long as it 1321 preserves the order of the remaining fields and includes any field 1322 that has changed. The length field MUST indicate which fields are 1323 present as follows:
1325    +--------+------------------+------------------+
1326    | Length | Type 0           | Type 1           |
1327    +--------+------------------+------------------+
1328    |     11 | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B |
1329    |      8 | EE0B, ECEB       | EE1B, ECEB       |
1330    |      5 | EE0B             | EE1B             |
1331    |      2 | (empty)          | (empty)          |
1332    +--------+------------------+------------------+
1334 The empty option of Length=2 is provided to allow for a case where an 1335 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1336 but there is very limited space for the option. 1338 All implementations of a Data Sender that read any AccECN Option MUST 1339 be able to read in AccECN Options of any of the above lengths. For 1340 forward compatibility, if the AccECN Option is of any other length, 1341 implementations MUST use those whole 3-octet fields that fit within 1342 the length and ignore the remainder of the option, treating it as 1343 padding. 1345 The AccECN Option has to be optional to implement, because both 1346 sender and receiver have to be able to cope without the option anyway 1347 - in cases where it does not traverse a network path. It is 1348 RECOMMENDED to implement both sending and receiving of the AccECN 1349 Option. If sending of the AccECN Option is implemented, the fall- 1350 backs described in this document will need to be implemented as well 1351 (unless solely for a controlled environment where path traversal is 1352 not considered a problem).
Even if a developer does not implement 1353 sending of the AccECN Option, it is RECOMMENDED that they still 1354 implement logic to receive and understand any AccECN Options sent by 1355 remote peers. 1357 If a Data Receiver intends to send the AccECN Option at any time 1358 during the rest of the connection, it is strongly recommended to also 1359 test path traversal of the AccECN Option as specified in 1360 Section 3.2.3.2. 1362 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1364 Whenever the Data Receiver includes any of the counter fields (ECEB, 1365 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1366 significant bits of the current value of the associated counter into 1367 the field (respectively r.ceb, r.e0b, r.e1b). 1369 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1370 first checks whether the ACK has already been superseded by another 1371 ACK, in which case it ignores the ECN feedback. If the ACK has not 1372 been superseded, the Data Sender normally decodes the fields in the 1373 AccECN Option as follows. For each field, it takes the least 1374 significant 24 bits of its associated local counter (s.ceb, s.e0b or 1375 s.e1b) and subtracts them from the counter in the associated field of 1376 the incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work 1377 out the minimum positive increment it could apply to s.ceb, s.e0b or 1378 s.e1b (assuming the field in the option only wrapped at most once). 1380 Appendix A.1 gives an example algorithm for the Data Receiver to 1381 encode its byte counters into the AccECN Option, and for the Data 1382 Sender to decode the AccECN Option fields into its byte counters. 1384 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1385 ACK=0) is not included in any of the byte counters held locally for 1386 each ECN marking nor in the AccECN Option on the wire. 1388 3.2.3.2. Path Traversal of the AccECN Option 1390 3.2.3.2.1.
Testing the AccECN Option during the Handshake 1392 The TCP client MUST NOT include the AccECN TCP Option on the SYN. If 1393 there is somehow an AccECN Option on a SYN, it MUST be ignored when 1394 forwarded or received. (A fall-back strategy for the loss of the 1395 SYN, possibly due to middlebox interference, is specified in 1396 Section 3.1.4.) 1398 A TCP server that confirms its support for AccECN (in response to an 1399 AccECN SYN from the client as described in Section 3.1) SHOULD 1400 include an AccECN TCP Option on the SYN/ACK. 1402 A TCP client that has successfully negotiated AccECN SHOULD include 1403 an AccECN Option in the first ACK at the end of the 3WHS. However, 1404 this first ACK is not delivered reliably, so the TCP client SHOULD 1405 also include an AccECN Option on the first data segment it sends (if 1406 it ever sends one). 1408 A host MAY omit the AccECN Option in any of the above three cases due 1409 to insufficient option space or if it has cached knowledge that the 1410 packet would be likely to be blocked on the path to the other host if 1411 it included an AccECN Option. 1413 3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option 1415 If after the normal TCP timeout the TCP server has not received an 1416 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1417 lost, e.g. due to congestion, or a middlebox might be blocking the 1418 AccECN Option. To expedite connection setup, the TCP server SHOULD 1419 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1420 as on the original SYN/ACK but with no AccECN Option. If this 1421 retransmission times out, to expedite connection setup, the TCP 1422 server SHOULD disable AccECN and ECN for this connection by 1423 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option. 1425 Implementers MAY use other fall-back strategies if they are found to 1426 be more effective (e.g. 
retrying the AccECN Option for a second time 1427 before fall-back - most appropriate during high levels of 1428 congestion). However, other fall-back strategies will need to follow 1429 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1430 SYN/ACKs negotiating different types of feedback have been sent 1431 within the same connection. 1433 If the TCP client detects that the first data segment it sent with 1434 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1435 on the retransmission. Again, implementers MAY use other fall-back 1436 strategies such as attempting to retransmit a second segment with the 1437 AccECN Option before fall-back, and/or caching whether the AccECN 1438 Option is blocked for subsequent connections. 1439 [I-D.ietf-tcpm-2140bis] further discusses caching of TCP parameters 1440 and status information. 1442 If a host falls back to not sending the AccECN Option, it will 1443 continue to process any incoming AccECN Options as normal. 1445 Either host MAY include the AccECN Option in a subsequent segment to 1446 retest whether the AccECN Option can traverse the path. 1448 If the TCP server receives a second SYN with a request for AccECN 1449 support, it should resend the SYN/ACK, again confirming its support 1450 for AccECN, but this time without the AccECN Option. This approach 1451 rules out any interference by middleboxes that may drop packets with 1452 unknown options, even though it is more likely that the SYN/ACK would 1453 have been lost due to congestion. The TCP server MAY try to send 1454 another packet with the AccECN Option at a later point during the 1455 connection but should monitor if that packet got lost as well, in 1456 which case it SHOULD disable the sending of the AccECN Option for 1457 this half-connection. 
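The default server-side fall-back described at the start of this section (retransmit the SYN/ACK first without the AccECN Option, then with AccECN and ECN disabled) can be sketched in Python as follows. This is an illustrative sketch only: the function name, the dictionary representation of the TCP flags and the attempt numbering are hypothetical, and keeping AE=CWR=ECE=0 for any retransmission after the second one is an assumption, since the text above does not describe later attempts.

```python
def synack_plan(attempt, original_flags):
    """Return (tcp_flags, include_accecn_option) for one SYN/ACK transmission.

    attempt         -- 0 for the original SYN/ACK, 1 for the first
                       retransmission, 2 for the second, and so on
    original_flags  -- dict holding the AE, CWR and ECE bits that the
                       server set on its original SYN/ACK
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt == 0:
        # Original SYN/ACK: AccECN flags plus the AccECN Option.
        return dict(original_flags), True
    if attempt == 1:
        # First timeout: repeat the same flags, but drop the option in
        # case a middlebox is blocking it.
        return dict(original_flags), False
    # Second timeout (and, by assumption, any later one): disable both
    # AccECN and ECN for this connection.
    return {"AE": 0, "CWR": 0, "ECE": 0}, False
```

An implementation that prefers one of the alternative strategies mentioned above, such as retrying the option once more during heavy congestion, would change only the middle branch, subject to the rules in Section 3.1.5.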
1459 Similarly, an AccECN end-point MAY separately memorize which data 1460 packets carried an AccECN Option and disable the sending of AccECN 1461 Options if the loss probability of those packets is significantly 1462 higher than that of all other data packets in the same connection. 1464 3.2.3.2.3. Testing for Absence of the AccECN Option 1466 If the TCP client has successfully negotiated AccECN but does not 1467 receive an AccECN Option on the SYN/ACK (e.g. because it has been 1468 stripped by a middlebox or not sent by the server), the client 1469 switches into a mode that assumes that the AccECN Option is not 1470 available for this half connection. 1472 Similarly, if the TCP server has successfully negotiated AccECN but 1473 does not receive an AccECN Option on the first segment that 1474 acknowledges sequence space at least covering the ISN, it switches 1475 into a mode that assumes that the AccECN Option is not available for 1476 this half connection. 1478 While a host is in this mode that assumes incoming AccECN Options are 1479 not available, it MUST adopt the conservative interpretation of the 1480 ACE field discussed in Section 3.2.2.5. However, it cannot make any 1481 assumption about support of outgoing AccECN Options on the other half 1482 connection, so it SHOULD continue to send the AccECN Option itself 1483 (unless it has established that sending the AccECN Option is causing 1484 packets to be blocked as in Section 3.2.3.2.2). 1486 If a host is in the mode that assumes incoming AccECN Options are not 1487 available, but it receives an AccECN Option at any later point during 1488 the connection, this clearly indicates that the AccECN Option is not 1489 blocked on the respective path, and the AccECN endpoint MAY switch 1490 out of the mode that assumes the AccECN Option is not available for 1491 this half connection. 1493 3.2.3.2.4.
Test for Zeroing of the AccECN Option 1495 For a related test for invalid initialization of the ACE field, see 1496 Section 3.2.2.3. 1498 Section 3.2 required the Data Receiver to initialize the r.e0b 1499 counter to a non-zero value. Therefore, in either direction the 1500 initial value of the EE0B field in the AccECN Option (if one exists) 1501 ought to be non-zero. If AccECN has been negotiated: 1503 o the TCP server MAY check the initial value of the EE0B field in 1504 the first segment that acknowledges sequence space that at least 1505 covers the ISN plus 1. If the initial value of the EE0B field is 1506 zero, the server will switch into a mode that ignores the AccECN 1507 Option for this half connection. 1509 o the TCP client MAY check the initial value of the EE0B field on 1510 the SYN/ACK. If the initial value of the EE0B field is zero, the 1511 client will switch into a mode that ignores the AccECN Option for 1512 this half connection. 1514 While a host is in the mode that ignores the AccECN Option it MUST 1515 adopt the conservative interpretation of the ACE field discussed in 1516 Section 3.2.2.5. 1518 Note that the Data Sender MUST NOT test whether the arriving byte 1519 counters in the initial AccECN Option have been initialized to 1520 specific valid values - the above checks solely test whether these 1521 fields have been incorrectly zeroed. This allows hosts to use 1522 different initial values as an additional signalling channel in 1523 future. Also note that the initial value of either field might be 1524 greater than its expected initial value, because the counters might 1525 already have been incremented. Nonetheless, the initial values of 1526 the counters have been chosen so that they cannot wrap to zero on 1527 these initial segments. 1529 3.2.3.2.5. Consistency between AccECN Feedback Fields 1531 When the AccECN Option is available it supplements but does not 1532 replace the ACE field.
An endpoint using AccECN feedback MUST always 1533 consider the information provided in the ACE field whether or not the 1534 AccECN Option is also available. 1536 If the AccECN option is present, the s.cep counter might increase 1537 while the s.ceb counter does not (e.g. due to a CE-marked control 1538 packet). The sender's response to such a situation is out of scope, 1539 and needs to be dealt with in a specification that uses ECN-capable 1540 control packets. Theoretically, this situation could also occur if a 1541 middlebox mangled the AccECN Option but not the ACE field. However, 1542 the Data Sender has to assume that the integrity of the AccECN Option 1543 is sound, based on the above test of the well-known initial values 1544 and optionally other integrity tests (Section 5.3). 1546 If either end-point detects that the s.ceb counter has increased but 1547 the s.cep has not (and by testing ACK coverage it is certain how much 1548 the ACE field has wrapped), this invalid protocol transition has to 1549 be due to some form of feedback mangling. So, the Data Sender MUST 1550 disable sending ECN-capable packets for the remainder of the half- 1551 connection by setting the IP/ECN field in all subsequent packets to 1552 Not-ECT. 1554 3.2.3.3. Usage of the AccECN TCP Option 1556 If a Data Receiver in AccECN mode intends to use the AccECN TCP 1557 Option to provide feedback, the rules below determine when it 1558 includes an AccECN TCP Option, and which fields to include, given 1559 other options might be competing for limited option space: 1561 Importance of Congestion Control: AccECN is for congestion control, 1562 which SHOULD generally be considered important relative to other 1563 TCP options. 1565 If the smallest recommended AccECN Option would leave insufficient 1566 space for two SACK blocks on a particular ACK, the Data Receiver 1567 MUST give precedence to the SACK option (total 18 octets), because 1568 loss feedback is more critical. 
1570 Recommended Simple Scheme: The Data Receiver SHOULD include an 1571 AccECN TCP Option on every scheduled ACK that acknowledges new 1572 data. Whenever possible, it SHOULD include a field for every byte 1573 counter that has changed at some time during the connection (see 1574 examples later). 1576 A scheduled ACK means an ACK that the Data Receiver would send by 1577 its regular delayed ACK rules. Recall that Section 1.3 defines an 1578 'ACK' as either with data payload or without. But the above rule 1579 is worded so that, in the common case when most of the data is 1580 from a server to a client, the server only includes an AccECN TCP 1581 Option while it is acknowledging data from the client. 1583 When available TCP option space is limited on particular packets, the 1584 recommended scheme will need to include compromises. To guide the 1585 implementer the rules below are ranked in order of importance, but 1586 the final decision has to be implementation-dependent, because 1587 tradeoffs will alter as new TCP options are defined and new use-cases 1588 arise. 1590 Necessary Option Length: The Data Receiver MUST only include an 1591 AccECN TCP Option on a packet if it includes all the counter(s) 1592 that have incremented since the previous AccECN Option. It MUST 1593 only truncate unchanged fields from the right-hand tail of the 1594 option to preserve the order of the remaining fields (see 1595 Section 3.2.3); 1597 Change-Triggered AccECN TCP Options: If an arriving packet 1598 increments a different byte counter to that incremented by the 1599 previous packet, the Data Receiver SHOULD feed it back in an 1600 AccECN Option on the next scheduled ACK. 1602 For the avoidance of doubt, this rule does not concern the arrival 1603 of control packets with no payload, because they cannot alter any 1604 byte counters. 
1606 Continual Repetition: Otherwise, if arriving packets continue to 1607 increment the same byte counter: 1609 * the Data Receiver SHOULD include a counter that has continued 1610 to increment on the next scheduled ACK following a change- 1611 triggered AccECN TCP Option; 1613 * while the same counter continues to increment, it SHOULD 1614 include the counter every n ACKs as consistently as possible, 1615 where n can be chosen by the implementer; 1617 * it SHOULD always include an AccECN Option if the r.ceb counter 1618 is incrementing and it MAY include an AccECN Option if r.e0b 1619 or r.e1b is incrementing; 1621 * it SHOULD include each counter at least once for every 2^22 1622 bytes incremented to prevent overflow during continual 1623 repetition. 1625 The above rules complement those in Section 3.2.2.5, which determine 1626 when to generate an ACK irrespective of whether an AccECN TCP Option 1627 is to be included. 1629 The recommended scheme is intended as a simple way to ensure that all 1630 the relevant byte counters will be carried on any ACK that reaches 1631 the Data Sender, no matter how many pure ACKs are filtered or 1632 coalesced along the network path, and without consuming the space 1633 available for payload data with counter field(s) that have never 1634 changed. 1636 As an example of the recommended scheme, if ECT(0) is the only 1637 codepoint that has ever arrived in the IP-ECN field, the Data 1638 Receiver will feed back an AccECN0 TCP Option with only the EE0B 1639 field on every packet. However, as soon as even one CE-marked packet 1640 arrives, on every packet that acknowledges new data it will start to 1641 include an option with two fields, EE0B and ECEB. As a second 1642 example, if the first packet to arrive happens to be CE-marked, the 1643 Data Receiver will have to arbitrarily choose whether to precede the 1644 ECEB field with an EE0B field or an EE1B field.
If it chooses, say, 1645 EE0B but it turns out never to receive ECT(0), it can start sending 1646 EE1B and ECEB instead - it does not have to include the EE0B field if 1647 the r.e0b counter has never changed during the connection. 1649 With the recommended scheme, if the data sending direction switches 1650 during a connection, there can be cases where the AccECN TCP Option 1651 that is meant to feed back the counter values at the end of a volley 1652 in one direction never reaches the other peer, due to packet loss. 1653 ACE feedback ought to be sufficient to fill this gap, given accurate 1654 feedback becomes moot after data transmission has paused. 1656 Appendix A.3 gives an example algorithm to estimate the number of 1657 marked bytes from the ACE field alone, if the AccECN Option is not 1658 available. 1660 If a host has determined that segments with the AccECN Option always 1661 seem to be discarded somewhere along the path, it is no longer 1662 obliged to follow any of the rules in this section. 1664 3.3. AccECN Compliance Requirements for TCP Proxies, Offload Engines 1665 and other Middleboxes 1667 3.3.1. Requirements for TCP Proxies 1669 A large class of middleboxes split TCP connections. Such a middlebox 1670 would be compliant with the AccECN protocol if the TCP implementation 1671 on each side complied with the present AccECN specification and each 1672 side negotiated AccECN independently of the other side. 1674 3.3.2. Requirements for Transparent Middleboxes and TCP Normalizers 1676 Another large class of middleboxes intervenes to some degree at the 1677 transport layer, but attempts to be transparent (invisible) to the 1678 end-to-end connection. A subset of this class of middleboxes 1679 attempts to `normalize' the TCP wire protocol by checking that all 1680 values in header fields comply with a rather narrow interpretation of 1681 the TCP specifications that is also not always up to date.
1683 A middlebox that is not normalizing the TCP protocol and does not 1684 itself act as a back-to-back pair of TCP endpoints (i.e. a middlebox 1685 that intends to be transparent or invisible at the transport layer) 1686 ought to forward the AccECN TCP Option unaltered, whether or not the 1687 length value matches one of those specified in Section 3.2.3, and 1688 whether or not the initial values of the byte-counter fields match 1689 those in Section 3.2.1. This is because blocking apparently invalid 1690 values prevents the standardized set of values being extended in 1691 future (given outdated normalizers would block updated hosts from 1692 using the extended AccECN standard). 1694 A TCP normalizer is likely to block or alter an AccECN TCP Option if 1695 the length value or the initial values of its byte-counter fields do 1696 not match one of those specified in Section 3.2.3 or Section 3.2.1. 1697 However, to comply with the present AccECN specification, a middlebox 1698 MUST NOT change the ACE field; or those fields of the AccECN Option 1699 that are currently specified in Section 3.2.3; or any AccECN field 1700 covered by integrity protection (e.g. [RFC5925]). 1702 3.3.3. Requirements for TCP ACK Filtering 1704 A node that implements ACK filtering (aka. thinning or coalescing) 1705 and itself also implements ECN marking will not need to filter ACKs 1706 from connections that use AccECN feedback. Therefore, such a node 1707 SHOULD detect connections that are using AccECN feedback and it 1708 SHOULD refrain from filtering the ACKs of such connections (if it 1709 coalesced ACKs it would not be AccECN-compliant, but the requirement 1710 is stated as a "SHOULD" in order to allow leeway for pre-existing ACK 1711 filtering functions to be brought into line). 1713 A node that implements ACK filtering and does not itself implement 1714 ECN marking does not need to treat AccECN connections any differently 1715 from other TCP connections. 
Nonetheless, it is RECOMMENDED that such 1716 nodes implement ECN marking and comply with the requirements of the 1717 previous paragraph. This should be a better way than ACK filtering 1718 to improve the performance of AccECN TCP connections. 1720 The rationale for these requirements is that AccECN feedback provides 1721 sufficient information to a Data Receiver for it to be able to 1722 monitor ECN marking of the ACKs it has sent, so that it can thin the 1723 ACK stream itself. This could eventually mean that ACK filtering in 1724 the network gives no performance advantage. Then TCP will be able to 1725 maintain its own control over ACK coalescing. This will also allow 1726 the TCP Data Sender to use the timing of ACK arrivals to more 1727 reliably infer further information about the path congestion level. 1729 Note that the specification of AccECN in TCP does not presume to rely 1730 on any of the above ACK filtering behaviour in the network, because 1731 it has to be robust against pre-existing network nodes that still 1732 filter AccECN ACKs, and robust against ACK loss during overload. 1734 Section 5.2.1 of [RFC3449] gives best current practice on ACK 1735 filtering (aka. thinning or coalescing). It gives no advice on ACKs 1736 carrying ECN feedback (other than that filtering ought to preserve 1737 the correct operation of ECN feedback), because at the time it said 1738 that "ECN remain areas of ongoing research". This section updates 1739 that advice for a TCP connection that supports AccECN feedback. 1741 3.3.4. Requirements for TCP Segmentation Offload 1743 Hardware to offload certain TCP processing represents another large 1744 class of middleboxes (even though it is often a function of a host's 1745 network interface and rarely in its own 'box'). 1747 The ACE field changes with every received CE marking, so today's 1748 receive offloading could lead to many interrupts in high congestion 1749 situations.
Although that would be useful (because congestion 1750 information is received sooner), it could also significantly increase 1751 processor load, particularly in scenarios such as DCTCP or L4S where 1752 the marking rate is generally higher. 1754 Current offload hardware ejects a segment from the coalescing process 1755 whenever the TCP ECN flags change. Thus Classic ECN causes offload 1756 to be inefficient. In data centres it has been fortunate for this 1757 offload hardware that DCTCP-style feedback changes less often when 1758 there are long sequences of CE marks, which is more common with a 1759 step marking threshold (but less likely the more short flows are in 1760 the mix). The ACE counter approach has been designed so that 1761 coalescing can continue over arbitrary patterns of marking and only 1762 needs to stop when the counter wraps. Nonetheless, until the 1763 particular offload hardware in use implements this more efficient 1764 approach, it is likely to be more efficient for AccECN connections to 1765 implement this counter-style logic using software segmentation 1766 offload. 1768 ECN encodes a varying signal in the ACK stream, so it is inevitable 1769 that offload hardware will ultimately need to handle any form of ECN 1770 feedback exceptionally. The ACE field has been designed as a counter 1771 so that it is straightforward for offload hardware to pass on the 1772 highest counter, and to push a segment from its cache before the 1773 counter wraps. The purpose of working towards standardized TCP ECN 1774 feedback is to reduce the risk for hardware developers, who would 1775 otherwise have to guess which scheme is likely to become dominant. 1777 The above process has been designed to enable a continuing 1778 incremental deployment path - to more highly dynamic congestion 1779 control. 
Once offload hardware supports AccECN, it will be able to 1780 coalesce efficiently for any sequence of marks, instead of relying 1781 for efficiency on the long marking sequences from step marking. In 1782 the next stage, marking can evolve from a step to a ramp function. 1783 That in turn will allow host congestion control algorithms to respond 1784 faster to dynamics, while being backwards compatible with existing 1785 host algorithms. 1787 4. Updates to RFC 3168 1789 Normative statements in the following sections of RFC3168 are updated 1790 by the present AccECN specification: 1792 o The whole of "6.1.1 TCP Initialization" of [RFC3168] is updated by 1793 Section 3.1 of the present specification. 1795 o In "6.1.2. The TCP Sender" of [RFC3168], all mentions of a 1796 congestion response to an ECN-Echo (ECE) ACK packet are updated by 1797 Section 3.2 of the present specification to mean an increment to 1798 the sender's count of CE-marked packets, s.cep. And the 1799 requirements to set the CWR flag no longer apply, as specified in 1800 Section 3.1.5 of the present specification. Otherwise, the 1801 remaining requirements in "6.1.2. The TCP Sender" still stand. 1803 It will be noted that RFC 8311 already updates, or potentially 1804 updates, a number of the requirements in "6.1.2. The TCP Sender". 1805 Section 6.1.2 of RFC 3168 extended standard TCP congestion control 1806 [RFC5681] to cover ECN marking as well as packet drop. Whereas, 1807 RFC 8311 enables experimentation with alternative responses to ECN 1808 marking, if specified for instance by an experimental RFC on the 1809 IETF document stream. RFC 8311 also strengthened the statement 1810 that "ECT(0) SHOULD be used" to a "MUST" (see [RFC8311] for the 1811 details). 1813 o The whole of "6.1.3. 
The TCP Receiver" of [RFC3168] is updated by 1814 Section 3.2 of the present specification, with the exception of 1815 the last paragraph (about congestion response to drop and ECN in 1816 the same round trip), which still stands. Incidentally, this last 1817 paragraph is in the wrong section, because it relates to TCP 1818 sender behaviour. 1820 o The following text within "6.1.5. Retransmitted TCP packets": 1822 "the TCP data receiver SHOULD ignore the ECN field on arriving 1823 data packets that are outside of the receiver's current 1824 window." 1826 is updated by more stringent acceptability tests for any packet 1827 (not just data packets) in the present specification. 1828 Specifically, in the normative specification of AccECN (Section 3) 1829 only 'Acceptable' packets contribute to the ECN counters at the 1830 AccECN receiver and Section 1.3 defines an Acceptable packet as 1831 one that passes the acceptability tests in both [RFC0793] and 1832 [RFC5961]. 1834 o Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of [RFC3168] prohibit 1835 use of ECN on TCP control packets and retransmissions. The 1836 present specification does not update that aspect of RFC 3168, but 1837 it does say what feedback an AccECN Data Receiver should provide 1838 if it receives an ECN-capable control packet or retransmission. 1839 This ensures AccECN is forward compatible with any future scheme 1840 that allows ECN on these packets, as provided for in section 4.3 1841 of [RFC8311] and as proposed in [I-D.ietf-tcpm-generalized-ecn]. 1843 5. Interaction with TCP Variants 1845 This section is informative, not normative. 1847 5.1. Compatibility with SYN Cookies 1849 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1850 protect itself from SYN flooding attacks. It places minimal commonly 1851 used connection state in the SYN/ACK, and deliberately does not hold 1852 any state while waiting for the subsequent ACK (e.g. it closes the 1853 thread). 
Therefore it cannot record the fact that it entered AccECN 1854 mode for both half-connections. Indeed, it cannot even remember 1855 whether it negotiated the use of classic ECN [RFC3168]. 1857 Nonetheless, such a server can determine that it negotiated AccECN as 1858 follows. If a TCP server using SYN Cookies supports AccECN and if it 1859 receives a pure ACK that acknowledges an ISN that is a valid SYN 1860 cookie, and if the ACK contains an ACE field with the value 0b010 to 1861 0b111 (decimal 2 to 7), it can assume that: 1863 o the TCP client must have requested AccECN support on the SYN 1865 o it (the server) must have confirmed that it supported AccECN 1867 Therefore the server can switch itself into AccECN mode, and continue 1868 as if it had never forgotten that it switched itself into AccECN mode 1869 earlier. 1871 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1872 with the value 0b000 or 0b001, these values indicate that the client 1873 did not request support for AccECN and therefore the server does not 1874 enter AccECN mode for this connection. Further, 0b001 on the ACK 1875 implies that the server sent an ECN-capable SYN/ACK, which was marked 1876 CE in the network, and the non-AccECN client fed this back by setting 1877 ECE on the ACK of the SYN/ACK. 1879 5.2. Compatibility with TCP Experiments and Common TCP Options 1881 AccECN is compatible (at least on paper) with the most commonly used 1882 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1883 also compatible with the recent promising experimental TCP options 1884 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1885 AccECN is friendly to all these protocols, because space for TCP 1886 options is particularly scarce on the SYN, where AccECN consumes zero 1887 additional header space. 
1889 When option space is under pressure from other options, 1890 Section 3.2.3.3 provides guidance on how important it is to send an 1891 AccECN Option relative to other options, and which fields are more 1892 important to include. 1894 Implementers of TFO need to take careful note of the recommendation 1895 in Section 3.2.2.1. That section recommends that, if the client has 1896 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1897 if it has data to send, it sends a pure ACK immediately before the 1898 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1899 pure ACK, which allows the server to detect ECN mangling. Note that, 1900 as specified in Section 3.2, any data on the SYN (SYN=1, ACK=0) is 1901 not included in any of the byte counters held locally for each ECN 1902 marking, nor in the AccECN Option on the wire. 1904 5.3. Compatibility with Feedback Integrity Mechanisms 1906 Three alternative mechanisms are available to assure the integrity of 1907 ECN and/or loss signals. AccECN is compatible with any of these 1908 approaches: 1910 o The Data Sender can test the integrity of the receiver's ECN (or 1911 loss) feedback by occasionally setting the IP-ECN field to a value 1912 normally only set by the network (and/or deliberately leaving a 1913 sequence number gap). Then it can test whether the Data 1914 Receiver's feedback faithfully reports what it expects (similar to 1915 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1916 [RFC3540], this approach does not waste the ECT(1) codepoint in 1917 the IP header, it does not require standardization and it does not 1918 rely on misbehaving receivers volunteering to reveal feedback 1919 information that allows them to be detected. However, setting the 1920 CE mark by the sender might conceal actual congestion feedback 1921 from the network and should therefore only be done sparingly. 
1923 o Networks generate congestion signals when they are becoming 1924 congested, so networks are more likely than Data Senders to be 1925 concerned about the integrity of the receiver's feedback of these 1926 signals. A network can enforce a congestion response to its ECN 1927 markings (or packet losses) using congestion exposure (ConEx) 1928 audit [RFC7713]. Whether the receiver or a downstream network is 1929 suppressing congestion feedback or the sender is unresponsive to 1930 the feedback, or both, ConEx audit can neutralize any advantage 1931 that any of these three parties would otherwise gain. 1933 ConEx is an experimental change to the Data Sender that would be 1934 most useful when combined with AccECN. Without AccECN, the ConEx 1935 behaviour of a Data Sender would have to be more conservative than 1936 would be necessary if it had the accurate feedback of AccECN. 1938 o The standards track TCP authentication option (TCP-AO [RFC5925]) 1939 can be used to detect any tampering with AccECN feedback between 1940 the Data Receiver and the Data Sender (whether malicious or 1941 accidental). The AccECN fields are immutable end-to-end, so they 1942 are amenable to TCP-AO protection, which covers TCP options by 1943 default. However, TCP-AO is often too brittle to use on many end- 1944 to-end paths, where middleboxes can make verification fail in 1945 their attempts to improve performance or security, e.g. by 1946 resegmentation or shifting the sequence space. 1948 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1949 of congestion feedback. With minor changes AccECN could be optimized 1950 for the possibility that the ECT(1) codepoint might be used as an ECN 1951 Nonce. However, given RFC 3540 has been reclassified as historic, 1952 the AccECN design has been generalized so that it ought to be able to 1953 support other possible uses of the ECT(1) codepoint, such as a lower 1954 severity or a more instant congestion signal than CE. 1956 6. 
Protocol Properties 1958 This section is informative, not normative. It describes how well the 1959 protocol satisfies the agreed requirements for a more accurate ECN 1960 feedback protocol [RFC7560]. 1962 Accuracy: From each ACK, the Data Sender can infer the number of new 1963 CE marked segments since the previous ACK. This provides better 1964 accuracy on CE feedback than classic ECN. In addition, if the 1965 AccECN Option is present (not blocked by the network path), the 1966 numbers of bytes marked with CE, ECT(1) and ECT(0) are provided. 1968 Overhead: The AccECN scheme is divided into two parts. The 1969 essential part reuses the 3 flags already assigned to ECN in the 1970 TCP header. The supplementary part adds an additional TCP option 1971 consuming up to 11 bytes. However, no TCP option is consumed in 1972 the SYN. 1974 Ordering: The order in which marks arrive at the Data Receiver is 1975 preserved in AccECN feedback, because the Data Receiver is 1976 expected to send an ACK immediately whenever a different mark 1977 arrives. 1979 Timeliness: While the same ECN markings are arriving continually at 1980 the Data Receiver, it can defer ACKs as TCP does normally, but it 1981 will immediately send an ACK as soon as a different ECN marking 1982 arrives. 1984 Timeliness vs Overhead: Change-Triggered ACKs are intended to enable 1985 latency-sensitive uses of ECN feedback by capturing the timing of 1986 transitions but not wasting resources while the state of the 1987 signalling system is stable. Within the constraints of the 1988 change-triggered ACK rules, the receiver can control how 1989 frequently it sends the AccECN TCP Option and therefore to some 1990 extent it can control the overhead induced by AccECN. 1992 Resilience: All information is provided based on counters. 1993 Therefore if ACKs are lost, the counters on the first ACK 1994 following the losses allow the Data Sender to immediately recover 1995 the number of the ECN markings that it missed.
And if data or 1996 ACKs are reordered, stale congestion information can be identified 1997 and ignored. 1999 Resilience against Bias: Because feedback is based on repetition of 2000 counters, random losses do not remove any information, they only 2001 delay it. Therefore, even though some ACKs are change-triggered, 2002 random losses will not alter the proportions of the different ECN 2003 markings in the feedback. 2005 Resilience vs Overhead: If space is limited in some segments 2006 (e.g. because more options are needed on some segments, such as 2007 the SACK option after loss), the Data Receiver can send AccECN 2008 Options less frequently or truncate fields that have not changed, 2009 usually down to as little as 5 bytes. However, it has to send a 2010 full-sized AccECN Option at least three times per RTT, which the 2011 Data Sender can rely on as a regular beacon or checkpoint. 2013 Resilience vs Timeliness and Ordering: Ordering information and the 2014 timing of transitions cannot be communicated in three cases: i) 2015 during ACK loss; ii) if something on the path strips the AccECN 2016 Option; or iii) if the Data Receiver is unable to support Change- 2017 Triggered ACKs. Following ACK reordering, the Data Sender can 2018 reconstruct the order in which feedback was sent, but not until 2019 all the missing feedback has arrived. 2021 Complexity: An AccECN implementation solely involves simple counter 2022 increments, some modulo arithmetic to communicate the least 2023 significant bits and allow for wrap, and some heuristics for 2024 safety against fields cycling due to prolonged periods of ACK 2025 loss. Each host needs to maintain eight additional counters. The 2026 hosts have to apply some additional tests to detect tampering by 2027 middleboxes, but in general the protocol is simple to understand, 2028 simple to implement and requires few cycles per packet to execute. 
2030 Integrity: AccECN is compatible with at least three approaches that 2031 can assure the integrity of ECN feedback. If the AccECN Option is 2032 stripped the resolution of the feedback is degraded, but the 2033 integrity of this degraded feedback can still be assured. 2035 Backward Compatibility: If only one endpoint supports the AccECN 2036 scheme, it will fall-back to the most advanced ECN feedback scheme 2037 supported by the other end. 2039 Backward Compatibility: If the AccECN Option is stripped by a 2040 middlebox, AccECN still provides basic congestion feedback in the 2041 ACE field. Further, AccECN can be used to detect mangling of the 2042 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 2043 marked segments; and blocking of segments carrying the AccECN 2044 Option. It can detect these conditions during TCP's 3WHS so that 2045 it can fall back to operation without ECN and/or operation without 2046 the AccECN Option. 2048 Forward Compatibility: The behaviour of endpoints and middleboxes is 2049 carefully defined for all reserved or currently unused codepoints 2050 in the scheme. Then, the designers of security devices can 2051 understand which currently unused values might appear in future. 2052 So, even if they choose to treat such values as anomalous while 2053 they are not widely used, any blocking will at least be under 2054 policy control not hard-coded. Then, if previously unused values 2055 start to appear on the Internet (or in standards), such policies 2056 could be quickly reversed. 2058 7. IANA Considerations 2060 This document reassigns bit 7 of the TCP header flags to the AccECN 2061 protocol. This bit was previously called the Nonce Sum (NS) flag 2062 [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. 
2063 The flag will now be defined as:

   +-----+-------------------+-----------+
   | Bit | Name              | Reference |
   +-----+-------------------+-----------+
   | 7   | AE (Accurate ECN) | RFC XXXX  |
   +-----+-------------------+-----------+

2071 [TO BE REMOVED: IANA is requested to update the existing entry in the 2072 Transmission Control Protocol (TCP) Header Flags registration 2073 (https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1) 2074 for Bit 7 to "AE (Accurate ECN), 2075 previously used as NS (Nonce Sum) by [RFC3540], which is now Historic 2076 [RFC8311]" and change the reference to this RFC-to-be instead of 2077 RFC8311.]

2079 This document also defines two new TCP options for AccECN, assigned 2080 values of TBD0 and TBD1 (decimal) from the TCP option space. These 2081 values are defined as:

   +------+--------+--------------------------------+-----------+
   | Kind | Length | Meaning                        | Reference |
   +------+--------+--------------------------------+-----------+
   | TBD0 | N      | Accurate ECN Order 0 (AccECN0) | RFC XXXX  |
   | TBD1 | N      | Accurate ECN Order 1 (AccECN1) | RFC XXXX  |
   +------+--------+--------------------------------+-----------+

2090 [TO BE REMOVED: This registration should take place at the following 2091 location: http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 2092 ]

2094 Early implementations using experimental option 254 per [RFC6994] 2095 with the single magic number 0xACCE (16 bits), as allocated in the 2096 IANA "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" 2097 registry, SHOULD migrate to use these new option kinds (TBD0 & TBD1).
2099 [TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs 2100 registry should be changed to "AccECN (current and new 2101 implementations SHOULD use option kinds TBD0 and TBD1)" at the 2102 following location: https://www.iana.org/assignments/tcp-parameters/ 2103 tcp-parameters.xhtml#tcp-exids ] 2105 8. Security Considerations 2107 If ever the supplementary part of AccECN based on the new AccECN TCP 2108 Option is unusable (due for example to middlebox interference) the 2109 essential part of AccECN's congestion feedback offers only limited 2110 resilience to long runs of ACK loss (see Section 3.2.2.5). These 2111 problems are unlikely to be due to malicious intervention (because if 2112 an attacker could strip a TCP option or discard a long run of ACKs it 2113 could wreak other arbitrary havoc). However, it would be of concern 2114 if AccECN's resilience could be indirectly compromised during a 2115 flooding attack. AccECN is still considered safe though, because if 2116 the option is not present, the AccECN Data Sender is then required to 2117 switch to more conservative assumptions about wrap of congestion 2118 indication counters (see Section 3.2.2.5 and Appendix A.2). {ToDo: 2119 is this still true?} 2121 Section 5.1 describes how a TCP server can negotiate AccECN and use 2122 the SYN cookie method for mitigating SYN flooding attacks. 2124 There is concern that ECN feedback could be altered or suppressed, 2125 particularly because a misbehaving Data Receiver could increase its 2126 own throughput at the expense of others. AccECN is compatible with 2127 the three schemes known to assure the integrity of ECN feedback (see 2128 Section 5.3 for details). If the AccECN Option is stripped by an 2129 incorrectly implemented middlebox, the resolution of the feedback 2130 will be degraded, but the integrity of this degraded information can 2131 still be assured. 
Assuring that Data Senders respond appropriately 2132 to ECN feedback is possible, but the scope of the present document is 2133 confined to the feedback protocol, and excludes the response to this 2134 feedback. 2136 In Section 3.2.3 a Data Sender is allowed to ignore an unrecognized 2137 TCP AccECN Option length and read as many whole 3-octet fields from 2138 it as possible up to a maximum of 3, treating the remainder as 2139 padding. This opens up a potential covert channel of up to 29B (40 - 2140 (2+3*3))B. However, it is really an overt channel (not hidden) and 2141 it is no different to the use of unknown TCP options with unknown 2142 option lengths in general. Therefore, where this is of concern, it 2143 can already be adequately mitigated by regular TCP normalizer 2144 technology (see Section 3.3.2). 2146 The AccECN protocol is not believed to introduce any new privacy 2147 concerns, because it merely counts and feeds back signals at the 2148 transport layer that had already been visible at the IP layer. A 2149 covert channel can be used to compromise privacy. However, as 2150 explained above, undefined TCP options in general open up such 2151 channels and common techniques are available to close them off. 2153 There is a potential concern that a Data Receiver could deliberately 2154 omit the AccECN Option pretending that it had been stripped by a 2155 middlebox. No known way can yet be contrived for a receiver to take 2156 advantage of this behaviour, which seems to always degrade its own 2157 performance. However, the concern is mentioned here for 2158 completeness. 2160 9. Acknowledgements 2162 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2163 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2164 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans, Ilpo 2165 Jaervinen, Neal Cardwell, Yoshifumi Nishida, Martin Duke and Jonathan 2166 Morton for their input and discussion. 
The idea of using the three 2167 ECN-related TCP flags as one field for more accurate TCP-ECN feedback 2168 was first introduced in the re-ECN protocol that was the ancestor of 2169 ConEx. 2171 Bob Briscoe was part-funded by the Comcast Innovation Fund, the 2172 European Community under its Seventh Framework Programme through the 2173 Reducing Internet Transport Latency (RITE) project (ICT-317700) and 2174 through the Trilogy 2 project (ICT-317756), and the Research Council 2175 of Norway through the TimeIn project. The views expressed here are 2176 solely those of the authors. 2178 Mirja Kuehlewind was partly supported by the European Commission 2179 under Horizon 2020 grant agreement no. 688421 Measurement and 2180 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 2181 State Secretariat for Education, Research, and Innovation under 2182 contract no. 15.0268. This support does not imply endorsement. 2184 10. Comments Solicited 2186 Comments and questions are encouraged and very welcome. They can be 2187 addressed to the IETF TCP Maintenance and Minor Extensions (tcpm) working 2188 group mailing list, and/or to the authors. 2190 11. References 2192 11.1. Normative References 2194 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2195 RFC 793, DOI 10.17487/RFC0793, September 1981. 2198 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2199 Requirement Levels", BCP 14, RFC 2119, 2200 DOI 10.17487/RFC2119, March 1997. 2203 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 2204 of Explicit Congestion Notification (ECN) to IP", 2205 RFC 3168, DOI 10.17487/RFC3168, September 2001. 2208 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 2209 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009. 2212 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2213 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2214 May 2017. 2216 11.2. 
Informative References 2218 [I-D.ietf-tcpm-2140bis] 2219 Touch, J., Welzl, M., and S. Islam, "TCP Control Block 2220 Interdependence", draft-ietf-tcpm-2140bis-11 (work in 2221 progress), April 2021. 2223 [I-D.ietf-tcpm-generalized-ecn] 2224 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 2225 Congestion Notification (ECN) to TCP Control Packets", 2226 draft-ietf-tcpm-generalized-ecn-07 (work in progress), 2227 February 2021. 2229 [I-D.ietf-tsvwg-l4s-arch] 2230 Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, 2231 "Low Latency, Low Loss, Scalable Throughput (L4S) Internet 2232 Service: Architecture", draft-ietf-tsvwg-l4s-arch-08 (work 2233 in progress), November 2020. 2235 [Mandalari18] 2236 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 2237 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 2238 over Mobile", IEEE Communications Magazine, March 2018. 2240 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 2241 Selective Acknowledgment Options", RFC 2018, 2242 DOI 10.17487/RFC2018, October 1996. 2245 [RFC3449] Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. 2246 Sooriyabandara, "TCP Performance Implications of Network 2247 Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, 2248 December 2002. 2250 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 2251 Congestion Notification (ECN) Signaling with Nonces", 2252 RFC 3540, DOI 10.17487/RFC3540, June 2003. 2255 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 2256 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007. 2259 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 2260 Ramakrishnan, "Adding Explicit Congestion Notification 2261 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 2262 DOI 10.17487/RFC5562, June 2009. 2265 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2266 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2267 June 2010.
2269 [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 2270 Robustness to Blind In-Window Attacks", RFC 5961, 2271 DOI 10.17487/RFC5961, August 2010. 2274 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 2275 "TCP Extensions for Multipath Operation with Multiple 2276 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013. 2279 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 2280 RFC 6994, DOI 10.17487/RFC6994, August 2013. 2283 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 2284 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014. 2287 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 2288 "Problem Statement and Requirements for Increased Accuracy 2289 in Explicit Congestion Notification (ECN) Feedback", 2290 RFC 7560, DOI 10.17487/RFC7560, August 2015. 2293 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 2294 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 2295 DOI 10.17487/RFC7713, December 2015. 2298 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 2299 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 2300 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 2301 October 2017. 2303 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 2304 Notification (ECN) Experimentation", RFC 8311, 2305 DOI 10.17487/RFC8311, January 2018. 2308 [RFC8511] Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 2309 "TCP Alternative Backoff with ECN (ABE)", RFC 8511, 2310 DOI 10.17487/RFC8511, December 2018. 2313 Appendix A. Example Algorithms 2315 This appendix is informative, not normative. It gives example 2316 algorithms that would satisfy the normative requirements of the 2317 AccECN protocol. However, implementers are free to choose other ways 2318 to implement the requirements. 2320 A.1. 
Example Algorithm to Encode/Decode the AccECN Option 2322 The example algorithms below show how a Data Receiver in AccECN mode 2323 could encode its CE byte counter r.ceb into the ECEB field within the 2324 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 2325 the ECEB field into its byte counter s.ceb. The other counters for 2326 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 2327 similarly encoded and decoded. 2329 It is assumed that each local byte counter is an unsigned integer 2330 greater than 24b (probably 32b), and that the following constant has 2331 been assigned: 2333 DIVOPT = 2^24 2335 Every time a CE marked data segment arrives, the Data Receiver 2336 increments its local value of r.ceb by the size of the TCP Data. 2337 Whenever it sends an ACK with the AccECN Option, the value it writes 2338 into the ECEB field is 2340 ECEB = r.ceb % DIVOPT 2342 where '%' is the remainder operator. 2344 On the arrival of an AccECN Option, the Data Sender first makes sure 2345 the ACK has not been superseded in order to avoid winding the s.ceb 2346 counter backwards. It uses the TCP acknowledgement number and any 2347 SACK options to calculate newlyAckedB, the amount of new data that 2348 the ACK acknowledges in bytes (newlyAckedB can be zero but not 2349 negative). If newlyAckedB is zero, either the ACK has been 2350 superseded or CE-marked packet(s) without data could have arrived. 2351 To break the tie for the latter case, the Data Sender could use 2352 timestamps (if present) to work out newlyAckedT, the amount of new 2353 time that the ACK acknowledges. If the Data Sender determines that 2354 the ACK has been superseded it ignores the AccECN Option. 
Otherwise, 2355 the Data Sender calculates the minimum non-negative difference d.ceb 2356 between the ECEB field and its local s.ceb counter, using modulo 2357 arithmetic as follows:

   if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
       d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
       s.ceb += d.ceb
   }

2364 For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), 2365 then

   s.ceb % DIVOPT = 1
            d.ceb = (1461 + 2^24 - 1) % 2^24
                  = 1460
            s.ceb = 33,554,433 + 1460
                  = 33,555,893

2373 In practice an implementation might use heuristics to guess the 2374 feedback in missing ACKs, then when it subsequently receives feedback 2375 it might find that it needs to correct its earlier heuristics as part 2376 of the decoding process. The above decoding process does not include 2377 any such heuristics.

2379 A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

2381 The example algorithms below show how a Data Receiver in AccECN mode 2382 could encode its CE packet counter r.cep into the ACE field, and how 2383 the Data Sender in AccECN mode could decode the ACE field into its 2384 s.cep counter. The Data Sender's algorithm includes code to 2385 heuristically detect a long enough unbroken string of ACK losses that 2386 could have concealed a cycle of the congestion counter in the ACE 2387 field of the next ACK to arrive.

2389 Two variants of the algorithm are given: i) a more conservative 2390 variant for a Data Sender to use if it detects that the AccECN Option 2391 is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a 2392 less conservative variant that is feasible when complementary 2393 information is available from the AccECN Option.

2395 A.2.1. 
Safety Algorithm without the AccECN Option 2397 It is assumed that each local packet counter is a sufficiently sized 2398 unsigned integer (probably 32b) and that the following constant has 2399 been assigned: 2401 DIVACE = 2^3 2403 Every time an Acceptable CE marked packet arrives (Section 3.2.2.2), 2404 the Data Receiver increments its local value of r.cep by 1. It 2405 repeats the same value of ACE in every subsequent ACK until the next 2406 CE marking arrives, where 2407 ACE = r.cep % DIVACE. 2409 If the Data Sender received an earlier value of the counter that had 2410 been delayed due to ACK reordering, it might incorrectly calculate 2411 that the ACE field had wrapped. Therefore, on the arrival of every 2412 ACK, the Data Sender ensures the ACK has not been superseded using 2413 the TCP acknowledgement number, any SACK options and timestamps (if 2414 available) to calculate newlyAckedB, as in Appendix A.1. If the ACK 2415 has not been superseded, the Data Sender calculates the minimum 2416 difference d.cep between the ACE field and its local s.cep counter, 2417 using modulo arithmetic as follows: 2419 if ((newlyAckedB > 0) || (newlyAckedT > 0)) 2420 d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE 2422 Section 3.2.2.5 expects the Data Sender to assume that the ACE field 2423 cycled if it is the safest likely case under prevailing conditions. 2424 The 3-bit ACE field in an arriving ACK could have cycled and become 2425 ambiguous to the Data Sender if a sequence of ACKs goes missing that 2426 covers a stream of data long enough to contain 8 or more CE marks. 2427 We use the word `missing' rather than `lost', because some or all the 2428 missing ACKs might arrive eventually, but out of order. Even if some 2429 of the missing ACKs were piggy-backed on data (i.e. not pure ACKs) 2430 retransmissions will not repair the lost AccECN information, because 2431 AccECN requires retransmissions to carry the latest AccECN counters, 2432 not the original ones. 
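As a rough illustration of the modulo arithmetic above, the following Python sketch encodes a counter into a wrapping field and recovers the minimum increment at the sender. It is not part of the draft's example algorithms: the function names encode_field and min_increment are hypothetical, while the constants and formulas are the ones given in this appendix. The last lines show why a run of missing ACKs covering 8 CE marks is invisible in the 3-bit ACE field.

```python
DIVACE = 2 ** 3    # ACE field: 3 bits
DIVOPT = 2 ** 24   # AccECN Option byte fields: 24 bits (Appendix A.1)


def encode_field(counter: int, modulus: int) -> int:
    """Receiver side: least significant bits of the local counter."""
    return counter % modulus


def min_increment(field: int, local_counter: int, modulus: int) -> int:
    """Sender side: minimum non-negative difference between the
    received field and the sender's local counter."""
    return (field + modulus - (local_counter % modulus)) % modulus


# Sender's s.cep is 5; the receiver has since seen 3 more CE marks.
d_cep = min_increment(encode_field(5 + 3, DIVACE), 5, DIVACE)

# If the ACKs covering a further 8 marks go missing, the 3-bit field
# wraps and the next ACK looks identical to the sender.
d_cep_wrapped = min_increment(encode_field(5 + 3 + 8, DIVACE), 5, DIVACE)

# Worked example from Appendix A.1 (s.ceb = 33,554,433, ECEB = 1461).
d_ceb = min_increment(1461, 33_554_433, DIVOPT)
```

Both d_cep and d_cep_wrapped evaluate to the same value, which is the ambiguity that the safety heuristics in the rest of this section are meant to cover.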
2434 The phrase `under prevailing conditions' allows for implementation- 2435 dependent interpretation. A Data Sender might take account of the 2436 prevailing size of data segments and the prevailing CE marking rate 2437 just before the sequence of missing ACKs. However, we shall start 2438 with the simplest algorithm, which assumes segments are all full- 2439 sized and ultra-conservatively it assumes that ECN marking was 100% 2440 on the forward path when ACKs on the reverse path started to all be 2441 dropped. Specifically, if newlyAckedB is the amount of data that an 2442 ACK acknowledges since the previous ACK, then the Data Sender could 2443 assume that this acknowledges newlyAckedPkt full-sized segments, 2444 where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the 2445 ACE field incremented by 2447 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 2449 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 2450 size segments than any previous ACK, and that ACE increments by a 2451 minimum of 2 CE marks (d.cep=2). The above formula works out that it 2452 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 2453 2). However, if ACE increases by a minimum of 2 but acknowledges 10 2454 full-sized segments, then it would be necessary to assume that there 2455 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 2457 ACKs that acknowledge a large stretch of packets might be common in 2458 data centres to achieve a high packet rate or might be due to ACK 2459 thinning by a middlebox. In these cases, cycling of the ACE field 2460 would often appear to have been possible, so the above algorithm 2461 would be over-conservative, leading to a false high marking rate and 2462 poor performance. Therefore it would be reasonable to only use 2463 dSafer.cep rather than d.cep if the moving average of newlyAckedPkt 2464 was well below 8. 
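The dSafer.cep formula above can be sketched in Python as follows. This is an illustrative sketch only, not a definitive implementation: the function name safer_cep_increment is hypothetical, and the guard for the case where fewer segments are acknowledged than d.cep is an assumption added here (the text does not cover that case, and without the guard Python's remainder operator would yield a negative result).

```python
DIVACE = 2 ** 3  # ACE field: 3 bits


def safer_cep_increment(newly_acked_pkt: int, d_cep: int) -> int:
    """Largest increment <= newly_acked_pkt that is congruent to d_cep
    modulo DIVACE, i.e. dSafer.cep in the text."""
    if newly_acked_pkt < d_cep:
        # Assumption (not in the text): no whole cycle of the field
        # fits within the acknowledged segments, so keep d_cep.
        return d_cep
    return newly_acked_pkt - ((newly_acked_pkt - d_cep) % DIVACE)


# The two cases worked through in the text:
nine_segments = safer_cep_increment(9, 2)    # no room for a wrap
ten_segments = safer_cep_increment(10, 2)    # one wrap is possible
```

A caller following the advice in the preceding paragraph might apply this function only while its moving average of newlyAckedPkt stays well below 8, and use d.cep directly otherwise.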
2466 Implementers could build in more heuristics to estimate prevailing 2467 average segment size and prevailing ECN marking. For instance, 2468 newlyAckedPkt in the above formula could be replaced with 2469 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 2470 segment size and p is the prevailing ECN marking probability. 2471 However, ultimately, if TCP's ECN feedback becomes inaccurate it 2472 still has loss detection to fall back on. Therefore, it would seem 2473 safe to implement a simple algorithm, rather than a perfect one. 2475 The simple algorithm for dSafer.cep above requires no monitoring of 2476 prevailing conditions and it would still be safe if, for example, 2477 segments were on average at least 5% of full-sized as long as ECN 2478 marking was 5% or less. Assuming it was used, the Data Sender would 2479 increment its packet counter as follows: 2481 s.cep += dSafer.cep 2483 If missing acknowledgement numbers arrive later (due to reordering), 2484 Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the 2485 effect of any action it took based on a conservative assumption that 2486 it later found to be incorrect". To do this, the Data Sender would 2487 have to store the values of all the relevant variables whenever it 2488 made assumptions, so that it could re-evaluate them later. Given 2489 this could become complex and it is not required, we do not attempt 2490 to provide an example of how to do this. 2492 A.2.2. Safety Algorithm with the AccECN Option 2494 When the AccECN Option is available on the ACKs before and after the 2495 possible sequence of ACK losses, if the Data Sender only needs CE- 2496 marked bytes, it will have sufficient information in the AccECN 2497 Option without needing to process the ACE field. 
   If for some reason it needs CE-marked packets and dSafer.cep is
   different from d.cep, it can determine whether d.cep is likely to be
   a safe enough estimate by checking whether the average marked
   segment size (s = d.ceb/d.cep) is less than the MSS (where d.ceb is
   the amount of newly CE-marked bytes; see Appendix A.1).
   Specifically, it could use the following algorithm:

      SAFETY_FACTOR = 2
      if (dSafer.cep > d.cep) {
          % Same test as (s <= MSS), but avoids divide-by-zero (DBZ)
          if (d.ceb <= MSS * d.cep) {
              sSafer = d.ceb/dSafer.cep
              if (sSafer < MSS/SAFETY_FACTOR)
                  dSafer.cep = d.cep  % d.cep is a safe enough estimate
          }
          % No else needed; otherwise dSafer.cep is already correct,
          % because d.cep must have been too small
      }

   The chart below shows when the above algorithm will consider d.cep
   can replace dSafer.cep as a safe enough estimate of the number of
   CE-marked packets:

                     ^
               sSafer|
                     |
                  MSS+
                     |
                     |         dSafer.cep
                     |                  is
    MSS/SAFETY_FACTOR+--------------+ safest
                     |              |
                     | d.cep is safe|
                     |    enough    |
                     +-------------------->
                                   MSS   s

   The following examples give the reasoning behind the algorithm,
   assuming MSS=1460 bytes:

   o  if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and
      sSafer=182.5.
      Therefore even though the average size of 8 data segments is
      unlikely to have been as small as MSS/8, d.cep cannot have been
      correct, because it would imply an average segment size greater
      than the MSS.

   o  if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and
      sSafer=146.
      Therefore d.cep is safe enough, because the average size of 10
      data segments is unlikely to have been as small as MSS/10.

   o  if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and
      sSafer=680.
      Therefore d.cep is safe enough, because the average data segment
      size is more likely to have been just less than one MSS, rather
      than below MSS/2.

   If pure ACKs were allowed to be ECN-capable, missing ACKs would be
   far less likely. However, because [RFC3168] currently precludes
   this, the above algorithm assumes that pure ACKs are not
   ECN-capable.

A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

   If the AccECN Option is not available, the Data Sender can only
   decode CE-marking from the ACE field in packets. Every time an ACK
   arrives, to convert this into an estimate of CE-marked bytes, it
   needs an average of the segment size, s_ave. Then it can add or
   subtract s_ave from the value of d.ceb as the value of d.cep
   increments or decrements. Some possible ways to calculate s_ave are
   outlined below. The precise details will depend on why an estimate
   of marked bytes is needed.

   The implementation could keep a record of the byte numbers of all
   the boundaries between packets in flight (including control
   packets), and recalculate s_ave on every ACK. However it would be
   simpler to merely maintain a counter packets_in_flight for the
   number of packets in flight (including control packets), which is
   reset once per RTT. Either way, it would estimate s_ave as:

      s_ave ~= flightsize / packets_in_flight,

   where flightsize is the variable that TCP already maintains for the
   number of bytes in flight. To avoid floating point arithmetic, it
   could right-bit-shift by lg(packets_in_flight), where lg() means log
   base 2.

   An alternative would be to maintain an exponentially weighted moving
   average (EWMA) of the segment size:

      s_ave = a * s + (1-a) * s_ave,

   where a is the decay constant for the EWMA.
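
   As a non-normative sketch, the two ways of estimating s_ave
   described in this section might look as follows in Python. The
   function names, the use of integer division in place of the
   bit-shift, and the default shift of 3 (that is, a = 1/8) are
   assumptions made purely for illustration.

```python
def s_ave_from_flight(flightsize: int, packets_in_flight: int) -> int:
    """Estimate s_ave as bytes in flight divided by packets in flight."""
    # Guard against an empty flight (assumption of this sketch).
    return flightsize // max(packets_in_flight, 1)


def s_ave_ewma(s_ave: int, s: int, shift: int = 3) -> int:
    """One EWMA step with decay constant a = 2**-shift.

    s_ave = a*s + (1-a)*s_ave is rewritten as s_ave + a*(s - s_ave),
    so the multiplication by a becomes a right shift.
    """
    return s_ave + ((s - s_ave) >> shift)


print(s_ave_from_flight(14600, 10))  # 1460
print(s_ave_ewma(1000, 1460))        # 1057
```

   In the second function the decay constant is a = 2^-shift, so each
   update needs only a subtraction, a shift and an addition.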
   However, then it is necessary to choose a good value for this
   constant, which ought to depend on the number of packets in flight.
   Also the decay constant needs to be a power of two to avoid floating
   point arithmetic.

A.4. Example Algorithm to Count Not-ECT Bytes

   A Data Sender in AccECN mode can infer the amount of TCP payload
   data arriving at the receiver marked Not-ECT from the difference
   between the amount of newly ACKed data and the sum of the bytes with
   the other three markings, d.ceb, d.e0b and d.e1b. Note that, because
   r.e0b is initialized to 1 and the other two counters are initialized
   to 0, the initial sum will be 1, which matches the initial offset of
   the TCP sequence number on completion of the 3WHS.

   For this approach to be precise, it has to be assumed that spurious
   (unnecessary) retransmissions do not lead to double counting. This
   assumption is currently correct, given that RFC 3168 requires that
   the Data Sender marks retransmitted segments as Not-ECT. However,
   the converse is not true; necessary retransmissions will result in
   under-counting.

   However, such precision is unlikely to be necessary. The only known
   use of a count of Not-ECT marked bytes is to test whether equipment
   on the path is clearing the ECN field (perhaps due to an out-dated
   attempt to clear, or bleach, what used to be the ToS field). To
   detect bleaching it will be sufficient to detect whether nearly all
   bytes arrive marked as Not-ECT. Therefore there should be no need to
   keep track of the details of retransmissions.

Appendix B. Rationale for Usage of TCP Header Flags

B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake

   AccECN uses a rather unorthodox approach to negotiate the highest
   version TCP ECN feedback scheme that both ends support, as justified
   below.
   It follows from the original TCP ECN capability negotiation
   [RFC3168], in which the client set the 2 least significant of the
   original reserved flags in the TCP header, and fell back to no ECN
   support if the server responded with the 2 flags cleared, which had
   previously been the default.

   ECN originally used header flags rather than a TCP option because it
   was considered more efficient to use a header flag for 1 bit of
   feedback per ACK, and this bit could be overloaded to indicate
   support for ECN during the handshake. During the development of ECN,
   1 bit crept up to 2, in order to deliver the feedback reliably and
   to work round some broken hosts that reflected the reserved flags
   during the handshake.

   In order to be backward compatible with RFC 3168, AccECN continues
   this approach, using the 3rd least significant TCP header flag that
   had previously been allocated for the ECN nonce (now historic).

   Then, whatever form of server an AccECN client encounters, the
   connection can fall back to the highest version of feedback protocol
   that both ends support, as explained in Section 3.1.

   If AccECN had used the more orthodox approach of a TCP option, it
   would still have had to set the two ECN flags in the main TCP
   header, in order to be able to fall back to Classic RFC 3168 ECN, or
   to disable ECN support, without another round of negotiation. Then
   AccECN would also have had to handle all the different ways that
   servers currently respond to settings of the ECN flags in the main
   TCP header, including all the conflicting cases where a server might
   have said it supported one approach in the flags and another
   approach in the new TCP option. And AccECN would have had to deal
   with all the additional possibilities where a middlebox might have
   mangled the ECN flags, or removed the TCP option.
   Thus, usage of the 3rd reserved TCP header flag simplified the
   protocol.

   The third flag was used in a way that could be distinguished from
   the ECN nonce, in case any nonce deployment was encountered.
   Previous usage of this flag for the ECN nonce was integrated into
   the original ECN negotiation. This further justified the 3rd flag's
   use for AccECN, because a non-ECN usage of this flag would have had
   to use it as a separate single bit, rather than in combination with
   the other 2 ECN flags.

   Indeed, having overloaded the original uses of these three flags for
   its handshake, AccECN overloads all three bits again as a 3-bit
   counter.

B.2. Four Codepoints in the SYN/ACK

   Of the 8 possible codepoints that the 3 TCP header flags can
   indicate on the SYN/ACK, 4 already indicated earlier (or broken)
   versions of ECN support. In the early design of AccECN, an AccECN
   server could use only 2 of the 4 remaining codepoints. They both
   indicated AccECN support, but one fed back that the SYN had arrived
   marked as CE. Even though ECN support on a SYN is not yet on the
   standards track, the idea is for either end to act as a dumb
   reflector, so that future capabilities can be unilaterally deployed
   without requiring 2-ended deployment (justified in Section 2.5).

   During traversal testing it was discovered that the ECN field in the
   SYN was mangled on a non-negligible proportion of paths. Therefore
   it was necessary to allow the SYN/ACK to feed all four IP/ECN
   codepoints that the SYN could arrive with back to the client.
   Without this, the client could not know whether to disable ECN for
   the connection due to mangling of the IP/ECN field (also explained
   in Section 2.5). This development consumed the remaining 2
   codepoints on the SYN/ACK that had been reserved for future use by
   AccECN in earlier versions.

B.3. Space for Future Evolution

   Despite usable TCP header space being extremely scarce, the AccECN
   protocol has taken all possible steps to ensure that there is space
   to negotiate possible future variants of the protocol, either if a
   variant of AccECN is required, or if a completely different ECN
   feedback approach is needed:

   Future AccECN variants: When the AccECN capability is negotiated
      during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
      'Broken' in the column for the capability of node B are unused by
      any current protocol in the RFC series. These could be used by
      TCP servers in future to indicate a variant of the AccECN
      protocol. In recent measurement studies in which the response of
      large numbers of servers to an AccECN SYN has been tested,
      e.g. [Mandalari18], a very small number of SYN/ACKs arrive with
      the pattern tagged as 'Nonce', and a small but more significant
      number arrive with the pattern tagged as 'Broken'. The 'Nonce'
      pattern could be a sign that a few servers have implemented the
      ECN Nonce [RFC3540], which has now been reclassified as historic
      [RFC8311], or it could be the random result of some unknown
      middlebox behaviour. The greater prevalence of the 'Broken'
      pattern suggests that some instances still exist of the broken
      code that reflects the reserved flags on the SYN.

      The requirement not to reject unexpected initial values of the
      ACE counter (in the main TCP header) in the last paragraph of
      Section 3.2.2.3 ensures that 3 unused codepoints on the ACK of
      the SYN/ACK, 6 unused values on the first SYN=0 data packet from
      the client and 7 unused values on the first SYN=0 data packet
      from the server could be used to declare future variants of the
      AccECN protocol.
      The word 'declare' is used rather than 'negotiate' because, at
      this late stage in the 3WHS, it would be too late for a
      negotiation between the endpoints to be completed. A similar
      requirement not to reject unexpected initial values in the TCP
      option (Section 3.2.3.2.4) is for the same purpose. If traversal
      of the TCP option were reliable, this would have enabled a far
      wider range of future variation of the whole AccECN protocol.
      Nonetheless, it could be used to reliably negotiate a wide range
      of variation in the semantics of the AccECN Option.

   Future non-AccECN variants: Five codepoints out of the 8 possible in
      the 3 TCP header flags used by AccECN are unused on the initial
      SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
      Section 3.1.3 ensures that the installed base of AccECN servers
      will all assume these are equivalent to AccECN negotiation with
      111 on the SYN. These codepoints would not allow fall-back to
      Classic ECN support for a server that did not understand them,
      but this approach ensures they are available in future, perhaps
      for uses other than ECN alongside the AccECN scheme. All possible
      combinations of SYN/ACK could be used in response except either
      000 or reflection of the same values sent on the SYN.

      Of course, other ways could be resorted to in order to extend
      AccECN or ECN in future, although their traversal properties are
      likely to be inferior. They include a new TCP option; using the
      remaining reserved flags in the main TCP header (preferably
      extending the 3-bit combinations used by AccECN to 4-bit
      combinations, rather than burning one bit for just one state); a
      non-zero urgent pointer in combination with the URG flag cleared;
      or some other unexpected combination of fields yet to be
      invented.

Authors' Addresses

   Bob Briscoe
   Independent
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/

   Mirja Kuehlewind
   Ericsson
   Germany

   EMail: ietf@kuehlewind.net

   Richard Scheffenegger
   NetApp
   Vienna
   Austria

   EMail: Richard.Scheffenegger@netapp.com