TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                               Independent
Updates: 3168, 3449 (if approved)                          M. Kuehlewind
Intended status: Standards Track                                Ericsson
Expires: May 6, 2021                                    R. Scheffenegger
                                                                  NetApp
                                                        November 2, 2020

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-13

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN is specified for TCP in such a way that only one feedback signal
   can be transmitted per Round-Trip Time (RTT). Recent new TCP
   mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP)
   or Low Latency Low Loss Scalable Throughput (L4S) need more accurate
   ECN feedback information whenever more than one marking is received
   in one RTT. This document specifies a scheme to provide more than
   one feedback signal per RTT in the TCP header. Given TCP header
   space is scarce, it allocates a reserved header bit, that was
   previously used for the ECN-Nonce which has now been declared
   historic. It also overloads the two existing ECN flags in the TCP
   header. The resulting extra space is exploited to feed back the IP-
   ECN field received during the 3-way handshake as well. Supplementary
   feedback information can optionally be provided in a new TCP option,
   which is never used on the TCP SYN.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 6, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Terminology
     1.4.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Backward Compatibility
       3.1.3.  Forward Compatibility
       3.1.4.  Retransmission of the SYN
       3.1.5.  Implications of AccECN Mode
     3.2.  AccECN Feedback
       3.2.1.  Initialization of Feedback Counters
       3.2.2.  The ACE Field
       3.2.3.  The AccECN Option
     3.3.  AccECN Compliance Requirements for TCP Proxies, Offload
           Engines and other Middleboxes
       3.3.1.  Requirements for TCP Proxies
       3.3.2.  Requirements for TCP Normalizers
       3.3.3.  Requirements for TCP ACK Filtering
       3.3.4.  Requirements for TCP Segmentation Offload
   4.  Updates to RFC 3168
   5.  Interaction with TCP Variants
     5.1.  Compatibility with SYN Cookies
     5.2.  Compatibility with TCP Experiments and Common TCP Options
     5.3.  Compatibility with Feedback Integrity Mechanisms
   6.  Protocol Properties
   7.  IANA Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. Comments Solicited
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Appendix B.  Rationale for Usage of TCP Header Flags
     B.1.  Three TCP Header Flags in the SYN-SYN/ACK Handshake
     B.2.  Four Codepoints in the SYN/ACK
     B.3.  Space for Future Evolution
   Authors' Addresses

1. Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. In RFC 3168, ECN was specified for TCP in such a way that
   only one feedback signal could be transmitted per Round-Trip Time
   (RTT). Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
   know when more than one marking is received in one RTT which is
   information that cannot be provided by the feedback scheme as
   specified in [RFC3168]. This document specifies an update to the ECN
   feedback scheme of RFC 3168 that provides more accurate information
   and could be used by these and potentially other future TCP
   extensions. A fuller treatment of the motivation for this
   specification is given in the associated requirements document
   [RFC7560].

   This document specifies a standards track scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. This document updates RFC 3168 with respect to negotiation
   and use of the feedback scheme for TCP. All aspects of RFC 3168
   other than the TCP feedback scheme, in particular the definition of
   ECN at the IP layer, remain unchanged by this specification.
   Section 4 gives a more detailed specification of exactly which
   aspects of RFC 3168 this document updates.

   AccECN is intended to be a complete replacement for classic TCP/ECN
   feedback, not a fork in the design of TCP. AccECN feedback
   complements TCP's loss feedback and it can coexist alongside
   'classic' [RFC3168] TCP/ECN feedback. So its applicability is
   intended to include all public and private IP networks (and even any
   non-IP networks over which TCP is used today), whether or not any
   nodes on the path support ECN, of whatever flavour. This document
   uses the term Classic ECN when it needs to distinguish the RFC 3168
   ECN TCP feedback scheme from the AccECN TCP feedback scheme.

   AccECN feedback overloads the two existing ECN flags in the TCP
   header and allocates the currently reserved flag (previously called
   NS) in the TCP header, to be used as one three-bit counter field
   indicating the number of congestion experienced marked packets.
   Given the new definitions of these three bits, both ends have to
   support the new wire protocol before it can be used. Therefore
   during the TCP handshake the two ends use these three bits in the TCP
   header to negotiate the most advanced feedback protocol that they can
   both support, in a way that is backward compatible with [RFC3168].

   AccECN is solely a change to the TCP wire protocol; it covers the
   negotiation and signaling of more accurate ECN feedback from a TCP
   Data Receiver to a Data Sender. It is completely independent of how
   TCP might respond to congestion feedback, which is out of scope, but
   ultimately the motivation for accurate ECN feedback. Like Classic
   ECN feedback, AccECN can be used by standard Reno congestion control
   [RFC5681] to respond to the existence of at least one congestion
   notification within a round trip. Or, unlike Reno, AccECN can be
   used to respond to the extent of congestion notification over a round
   trip, as for example DCTCP does in controlled environments [RFC8257].
   For congestion response, this specification refers to RFC 3168, or
   ECN experiments such as those referred to in [RFC8311], namely: a
   TCP-based Low Latency Low Loss Scalable (L4S) congestion control
   [I-D.ietf-tsvwg-l4s-arch]; or Alternative Backoff with ECN (ABE)
   [RFC8511].

   It is recommended that the AccECN protocol is implemented alongside
   SACK [RFC2018] and the experimental ECN++ protocol
   [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to
   be used on TCP control packets. Therefore, this specification does
   not discuss implementing AccECN alongside [RFC5562], which was an
   earlier experimental protocol with narrower scope than ECN++.

1.1. Document Roadmap

   The following introductory section outlines the goals of AccECN
   (Section 1.2). Then terminology is defined (Section 1.3) and a recap
   of existing prerequisite technology is given (Section 1.4).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification, and Section 4
   clarifies which aspects of RFC 3168 are updated by this
   specification. Section 5 assesses the interaction of AccECN with
   commonly used variants of TCP, whether standardized or not.
   Section 6 summarizes the features and properties of AccECN.

   Section 7 summarizes the protocol fields and numbers that IANA will
   need to assign and Section 8 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses and Appendix B explains why AccECN uses flags in the main
   TCP header and quantifies the space left for future use.

1.2. Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognizes that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 6 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognizes that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3. Terminology

   AccECN: The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN: the ECN protocol specified in [RFC3168].

   Classic ECN feedback: the feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK: A TCP acknowledgement, with or without a data payload (ACK=1).

   Pure ACK: A TCP acknowledgement without a data payload.

   Acceptable packet / segment: A packet or segment that passes the
      acceptability tests in [RFC0793] and [RFC5961].

   TCP client: The TCP stack that originates a connection.

   TCP server: The TCP stack that responds to a connection request.

   Data Receiver: The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender: The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

1.4. Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is Not-
   ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
   the node can mark the packet by setting both ECN bits, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +------------------+----------------+---------------------------+
   | IP-ECN codepoint | Codepoint name | Description               |
   +------------------+----------------+---------------------------+
   | 0b00             | Not-ECT        | Not ECN-Capable Transport |
   | 0b01             | ECT(1)         | ECN-Capable Transport (1) |
   | 0b10             | ECT(0)         | ECN-Capable Transport (0) |
   | 0b11             | CE             | Congestion Experienced    |
   +------------------+----------------+---------------------------+

                 Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost. The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag. This always leads to a full RTT
   of ACKs with ECE set. Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
   deployment RFC 3540 has been reclassified as historic [RFC8311] and
   the respective flag has been marked as "reserved", making this TCP
   flag available for use by the AccECN experiment instead.
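
   As an informal illustration (not part of this specification), the
   codepoints of Table 1 and the Classic ECN echo behaviour recapped
   above can be sketched in Python. The names IpEcn, ip_ecn_from_tos
   and ClassicEcnReceiver are hypothetical and exist only for this
   sketch. It assumes the ECN field occupies the two least significant
   bits of the former ToS (Traffic Class) octet, as defined in
   [RFC3168], and it models only the ECE/CWR latch, not a full TCP
   receiver.

```python
from enum import IntEnum


class IpEcn(IntEnum):
    """The four IP-ECN codepoints of Table 1."""
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


def ip_ecn_from_tos(tos: int) -> IpEcn:
    """Return the ECN codepoint held in the two low-order bits of the octet."""
    return IpEcn(tos & 0b11)


class ClassicEcnReceiver:
    """Hypothetical model of the Classic ECN echo latch at a Data Receiver."""

    def __init__(self) -> None:
        self.echoing = False  # True while ECE is being repeated on ACKs

    def on_segment(self, ip_ecn: IpEcn, cwr: bool = False) -> bool:
        """Process one arriving segment and return the ECE flag for its ACK."""
        if cwr:
            # The sender has confirmed it saw ECE, so stop repeating it.
            self.echoing = False
        if ip_ecn == IpEcn.CE:
            # Any CE mark (even on the CWR segment) starts the echo again.
            self.echoing = True
        return self.echoing
```

   In this sketch, three CE-marked segments arriving before the CWR flag
   produce the same run of ECE=1 ACKs as a single CE mark would, which
   is the loss of information that motivates AccECN.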

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2. AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP half-
   connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks). This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN.
   This design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1. Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2. Feedback Mechanism

   A Data Receiver maintains four counters initialized at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively.
   The fourth counts the number of packets arriving marked with a CE
   codepoint (including control packets without payload if they are
   CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24
   LSBs of each byte counter are carried in the AccECN Option.

2.3. Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used. Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is not allowed to delay sending an ACK to such an extent
   that the ACE field would cycle. However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses.
   Therefore, as long as the AccECN Option is available, it can be
   treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the
   3-bit ACE field is so small, when it is the only field available the
   Data Sender has to interpret it assuming the most likely wrap, but
   with a degree of conservatism.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as options are reaching the
   sender and as long as there is no ACK loss). Implementations are
   encouraged to send an AccECN Option more frequently, but this is left
   up to the implementer.

2.4. Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as L4S, DCTCP or ConEx. If the option is
   stripped, a simple algorithm to estimate the number of marked bytes
   from the ACE field is given in Appendix A.3.
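
   The counters of Section 2.2 and the wrap ambiguity of Section 2.3
   can be illustrated with a short Python sketch. It is a simplified
   model, not the normative algorithm. The class and field names
   (ReceiverCounters, cep, ceb, e0b, e1b) and the helper minimal_delta
   are hypothetical. The zero initial values are placeholders (the real
   initial values are given in Section 3.2.1). minimal_delta assumes the
   fewest possible wraps, whereas the safety algorithms in Appendix A.2
   add the conservatism mentioned above.

```python
from dataclasses import dataclass

# IP-ECN codepoints from Table 1.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

ACE_MOD = 1 << 3    # the ACE field carries 3 LSBs of the CE packet counter
OPT_MOD = 1 << 24   # each AccECN Option field carries 24 LSBs of a byte counter


@dataclass
class ReceiverCounters:
    """Hypothetical model of the four Data Receiver counters."""
    cep: int = 0  # CE-marked packets (placeholder initial value)
    ceb: int = 0  # CE-marked payload bytes
    e0b: int = 0  # ECT(0)-marked payload bytes
    e1b: int = 0  # ECT(1)-marked payload bytes

    def on_packet(self, ip_ecn: int, payload_len: int) -> None:
        """Update the counters for one arriving packet."""
        if ip_ecn == CE:
            self.cep += 1             # counts control packets too
            self.ceb += payload_len   # counts payload bytes only
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len

    def ace_field(self) -> int:
        """The 3 LSBs of the CE packet counter, repeated on every ACK."""
        return self.cep % ACE_MOD

    def option_fields(self) -> dict:
        """The 24 LSBs of each byte counter, keyed by counter name."""
        return {name: getattr(self, name) % OPT_MOD
                for name in ("ceb", "e0b", "e1b")}


def minimal_delta(prev_field: int, new_field: int, modulus: int) -> int:
    """Smallest non-negative increase consistent with two field values."""
    return (new_field - prev_field) % modulus
```

   For example, if ten CE-marked packets of 1000 bytes arrive while all
   intervening ACKs are lost, the ACE field moves from 0 to 2, so a
   sender assuming no wrap would infer 2 marked packets instead of 10,
   whereas the 24-bit CE byte field still yields the full 10000 bytes.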

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5. Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of
   the present document and para 2 of Section 20.2 of [RFC3168]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on its ECN marking. Although RFC 3168 prohibits an
   ECN-capable SYN, providing feedback of ECN marking on the SYN
   supports future scenarios in which SYNs might be ECN-enabled (without
   prejudging whether they ought to be). For instance, [RFC8311]
   updates this aspect of RFC 3168 to allow experimentation with ECN-
   capable TCP control packets.
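
   As a hedged illustration of this SYN feedback, the Python sketch
   below encodes the IP-ECN field seen on the SYN into the AE, CWR and
   ECE flags of the SYN/ACK, using the flag combinations tabulated in
   Table 2 (Section 3.1.2). All function and variable names are
   hypothetical. The sketch assumes an AccECN client that sent
   AE=CWR=ECE=1 on its SYN. It labels the 1,0,1 response "reserved"
   without deciding how a client handles it, since that is specified
   elsewhere in this document.

```python
# Top block of Table 2: (AE, CWR, ECE) on the SYN/ACK mapped to the
# IP-ECN codepoint that arrived on the SYN.
SYNACK_FLAGS_TO_SYN_ECN = {
    (0, 1, 0): "Not-ECT",
    (0, 1, 1): "ECT(1)",
    (1, 0, 0): "ECT(0)",
    (1, 1, 0): "CE",
}
SYN_ECN_TO_SYNACK_FLAGS = {
    ecn: flags for flags, ecn in SYNACK_FLAGS_TO_SYN_ECN.items()
}


def server_synack_flags(ip_ecn_on_syn: str) -> tuple:
    """Flags an AccECN server sets on the SYN/ACK to reflect the SYN's ECN."""
    return SYN_ECN_TO_SYNACK_FLAGS[ip_ecn_on_syn]


def client_feedback_mode(synack_flags: tuple) -> str:
    """Feedback mode an AccECN client derives from the SYN/ACK (Table 2)."""
    if synack_flags in SYNACK_FLAGS_TO_SYN_ECN:
        return "AccECN"
    if synack_flags == (0, 0, 1):
        return "classic ECN"
    if synack_flags in ((0, 0, 0), (1, 1, 1)):
        return "Not ECN"      # no ECN support, or a 'Broken' reflecting server
    return "reserved"         # (1, 0, 1): handling is outside this sketch


def syn_ecn_was_altered(sent_ip_ecn: str, synack_flags: tuple) -> bool:
    """True if the server reports a codepoint other than the one sent."""
    seen = SYNACK_FLAGS_TO_SYN_ECN.get(synack_flags)
    return seen is not None and seen != sent_ip_ecn
```

   A client that sent its SYN as Not-ECT and receives AE,CWR,ECE = 1,1,0
   would learn from syn_ecn_was_altered that the field was changed in
   transit.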

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   Not-ECT in compliance with RFC 3168, feedback on the state of the ECN
   field when it arrives at the receiver could still be useful, because
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field [Mandalari18]. If
   a TCP client has set the SYN to Not-ECT, but receives feedback that
   the ECN field on the SYN arrived with a different codepoint, it can
   detect such middlebox interference and send Not-ECT for the rest of
   the connection. Today, if a TCP server receives ECT or CE on a SYN,
   it cannot know whether it is invalid (or valid) because only the TCP
   client knows whether it originally marked the SYN as Not-ECT (or
   ECT). Therefore, prior to AccECN, the server's only safe course of
   action was to disable ECN for the connection. Instead, the AccECN
   protocol allows the server to feed back the received ECN field to the
   client, which then has all the information to decide whether the
   connection has to fall-back from supporting ECN (or not).

3. AccECN Protocol Specification

3.1. Negotiating to use AccECN

3.1.1. Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] has been reclassified as historic
   [RFC8311], the present specification re-allocates the TCP flag at bit
   7 of the TCP header, which was previously called NS (Nonce Sum), as
   the AE (Accurate ECN) flag (see IANA Considerations in Section 7) as
   shown below.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | A | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | E | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 2: The (post-AccECN) definition of the TCP header flags during
                            the TCP handshake

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN. The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN. This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   Once in AccECN mode, a TCP client or server has the rights and
   obligations to participate in the ECN protocol defined in
   Section 3.1.5.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.4.

3.1.2. Backward Compatibility

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN, as above. Table 2 tabulates all the
   negotiation possibilities for ECN-related capabilities that involve
   at least one AccECN-capable host. The entries in the first two
   columns have been abbreviated, as follows:

   AccECN: More Accurate ECN Feedback (the present specification)

   Nonce: ECN Nonce feedback [RFC3540]

   ECN: 'Classic' ECN feedback [RFC3168]

   No ECN: Not-ECN-capable. Implicit congestion notification using
      packet drop.

   +--------+--------+------------+-----------+------------------------+
   | A      | B      |  SYN A->B  |  SYN/ACK  | Feedback Mode          |
   |        |        |            |    B->A   |                        |
   +--------+--------+------------+-----------+------------------------+
   |        |        | AE CWR ECE |  AE CWR   |                        |
   |        |        |            |    ECE    |                        |
   | AccECN | AccECN | 1   1   1  |  0  1  0  | AccECN (no ECT on SYN) |
   | AccECN | AccECN | 1   1   1  |  0  1  1  | AccECN (ECT1 on SYN)   |
   | AccECN | AccECN | 1   1   1  |  1  0  0  | AccECN (ECT0 on SYN)   |
   | AccECN | AccECN | 1   1   1  |  1  1  0  | AccECN (CE on SYN)     |
   |        |        |            |           |                        |
   | AccECN | Nonce  | 1   1   1  |  1  0  1  | (Reserved)             |
   | AccECN | ECN    | 1   1   1  |  0  0  1  | classic ECN            |
   | AccECN | No ECN | 1   1   1  |  0  0  0  | Not ECN                |
   |        |        |            |           |                        |
   | Nonce  | AccECN | 0   1   1  |  0  0  1  | classic ECN            |
   | ECN    | AccECN | 0   1   1  |  0  0  1  | classic ECN            |
   | No ECN | AccECN | 0   0   0  |  0  0  0  | Not ECN                |
   |        |        |            |           |                        |
   | AccECN | Broken | 1   1   1  |  1  1  1  | Not ECN                |
   +--------+--------+------------+-----------+------------------------+

   Table 2: ECN capability negotiation between Client (A) and Server (B)

   Table 2 is divided into blocks each separated by an empty row.

   1.  The top block shows the case already described in Section 3.1
       where both endpoints support AccECN and how the TCP server (B)
       indicates congestion feedback.

   2.  The second block shows the cases where the TCP client (A)
       supports AccECN but the TCP server (B) supports some earlier
       variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
       soon as an AccECN-capable TCP client (A) receives the SYN/ACK
       shown it MUST set both its half connections into the feedback
       mode shown in the rightmost column. If it has set itself into
       classic ECN feedback mode it MUST then comply with [RFC3168].

       The server response called 'Nonce' in the table is now historic.
       For an AccECN implementation, there is no need to recognize or
       support ECN Nonce feedback [RFC3540], which has been reclassified
       as historic [RFC8311]. AccECN is compatible with alternative ECN
       feedback integrity approaches (see Section 5.3).

   3.  The third block shows the cases where the TCP server (B) supports
       AccECN but the TCP client (A) supports some earlier variant of
       TCP feedback, indicated in its SYN.

       When an AccECN-enabled TCP server (B) receives a SYN with
       AE,CWR,ECE = 0,1,1 it MUST do one of the following:

       *  set both its half connections into the classic ECN feedback
          mode and return a SYN/ACK with AE, CWR, ECE = 0,0,1 as shown.
          Then it MUST comply with [RFC3168].

       *  set both its half-connections into No ECN mode and return a
          SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN
          disabled. This latter case is unlikely to be desirable, but
          it is allowed as a possibility, e.g. for minimal TCP
          implementations.

       When an AccECN-enabled TCP server (B) receives a SYN with
       AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the
       Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0
       as shown and continue with ECN disabled.

   4.  The fourth block displays a combination labelled `Broken'. Some
       older TCP server implementations incorrectly set the reserved
       flags in the SYN/ACK by reflecting those in the SYN.
Such broken 653 TCP servers (B) cannot support ECN, so as soon as an AccECN- 654 capable TCP client (A) receives such a broken SYN/ACK it MUST 655 fall back to Not ECN mode for both its half connections and 656 continue with ECN disabled. 658 The following additional rules do not fit the structure of the table, 659 but they complement it: 661 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 662 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 663 Host A MUST then enter the same feedback mode as it would have 664 entered had it been a responding host and received the same SYN. 665 Then host A MUST send the same SYN/ACK as it would have sent had 666 it been a responding host. 668 In-window SYN during TIME-WAIT: Many TCP implementations create a 669 new TCP connection if they receive an in-window SYN packet during 670 TIME-WAIT state. When a TCP host enters TIME-WAIT or CLOSED 671 state, it should ignore any previous state about the negotiation 672 of AccECN for that connection and renegotiate the feedback mode 673 according to Table 2. 675 3.1.3. Forward Compatibility 677 If a TCP server that implements AccECN receives a SYN with the three 678 TCP header flags (AE, CWR and ECE) set to any combination other than 679 000, 011 or 111, it MUST negotiate the use of AccECN as if they had 680 been set to 111. This ensures that future uses of the other 681 combinations on a SYN can rely on consistent behaviour from the 682 installed base of AccECN servers. 684 For the avoidance of doubt, the behaviour described in the present 685 specification applies whether or not the three remaining reserved TCP 686 header flags are zero. 688 3.1.4. 
Retransmission of the SYN 690 If the sender of an AccECN SYN times out before receiving the SYN/ 691 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 692 least one more time by continuing to set all three TCP ECN flags on 693 the first retransmitted SYN (using the usual retransmission time- 694 outs). If this first retransmission also fails to be acknowledged, 695 the sender SHOULD send subsequent retransmissions of the SYN with the 696 three TCP-ECN flags cleared (AE=CWR=ECE=0). A retransmitted SYN MUST 697 use the same ISN as the original SYN. 699 Retrying once before fall-back adds delay in the case where a 700 middlebox drops an AccECN (or ECN) SYN deliberately. However, 701 current measurements imply that a drop is less likely to be due to 702 middlebox interference than other intermittent causes of loss, e.g. 703 congestion, wireless interference, etc. 705 Implementers MAY use other fall-back strategies if they are found to 706 be more effective (e.g. attempting to negotiate AccECN on the SYN 707 only once or more than twice (most appropriate during high levels of 708 congestion)). However, other fall-back strategies will need to follow 709 all the rules in Section 3.1.5, which concern behaviour when SYNs or 710 SYN/ACKs negotiating different types of feedback have been sent 711 within the same connection. 713 Further, it may make sense to also remove any other new or 714 experimental fields or options on the SYN in case a middlebox might 715 be blocking them, although the required behaviour will depend on the 716 specification of the other option(s) and any attempt to co-ordinate 717 fall-back between different modules of the stack. 719 Whichever fall-back strategy is used, the TCP initiator SHOULD cache 720 failed connection attempts. If it does, it SHOULD NOT give up 721 attempting to negotiate AccECN on the SYN of subsequent connection 722 attempts until it is clear that the blockage is persistently and 723 specifically due to AccECN.
The cache should be arranged to expire 724 so that the initiator will infrequently attempt to check whether the 725 problem has been resolved. 727 The fall-back procedure if the TCP server receives no ACK to 728 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 729 Section 3.2.3.2. 731 3.1.5. Implications of AccECN Mode 733 Section 3.1.1 describes the only ways that a host can enter AccECN 734 mode, whether as a client or as a server. 736 As a Data Sender, a host in AccECN mode has the rights and 737 obligations concerning the use of ECN defined below, which build on 738 those in [RFC3168] as updated by [RFC8311]: 740 o Using ECT: 742 * It can set an ECT codepoint in the IP header of packets to 743 indicate to the network that the transport is capable and 744 willing to participate in ECN for this packet. 746 * It does not have to set ECT on any packet (for instance if it 747 has reason to believe such a packet would be blocked). 749 o Switching feedback negotiation (e.g. fall-back): 751 * It SHOULD NOT set ECT on any packet if it has received at least 752 one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0. A 753 "valid SYN" has the same port numbers and the same ISN as the 754 SYN that caused the server to enter AccECN mode. 756 * It MUST NOT send an ECN-setup SYN [RFC3168] within the same 757 connection as it has sent a SYN requesting AccECN feedback. 759 * It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same 760 connection as it has sent a SYN/ACK agreeing to use AccECN 761 feedback. 763 The above rules are necessary because, when one peer negotiates 764 the feedback mode in two different types of handshake, it is not 765 possible for the other peer to know for certain which handshake 766 packet(s) the other end eventually receives or in which order it 767 receives them. So the two peers can end up using different 768 feedback modes without knowing it.
770 o Congestion response: 772 * It is still obliged to respond appropriately to AccECN feedback 773 with congestion indications on packets it had previously sent, 774 as defined in Section 6.1 of [RFC3168] and updated by Sections 775 2.1 and 4.1 of [RFC8311]. 777 * The commitment to respond appropriately to incoming indications 778 of congestion remains even if it sends a SYN packet with 779 AE=CWR=ECE=0, in a later transmission within the same TCP 780 connection. 782 * Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate 783 it has received and responded to indications of congestion (for 784 the avoidance of doubt, this does not preclude it from setting 785 the bits of the ACE counter field, which includes an overloaded 786 use of the same bit). 788 As a Data Receiver: 790 o a host in AccECN mode MUST feed back the information in the IP-ECN 791 field on incoming packets using Accurate ECN feedback, as 792 specified in Section 3.2 below. 794 o if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168] 795 during the same connection as it receives a SYN requesting AccECN 796 feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST 797 reset the connection with a RST packet. 799 o If for any reason it is not willing to provide ECN feedback on a 800 particular TCP connection, to indicate this unwillingness it 801 SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ACK 802 packets that it sends. 804 o it MUST NOT use reception of packets with ECT set in the IP-ECN 805 field as an implicit signal that the peer is ECN-capable. Reason: 806 ECT at the IP layer does not explicitly confirm the peer has the 807 correct ECN feedback logic, and the packets could have been 808 mangled at the IP layer. 810 3.2. 
AccECN Feedback 812 Each Data Receiver of each half connection maintains four counters, 813 r.cep, r.ceb, r.e0b and r.e1b: 815 o The Data Receiver MUST increment the CE packet counter (r.cep), 816 for every Acceptable packet that it receives with the CE code 817 point in the IP ECN field, including CE marked control packets but 818 excluding CE on SYN packets (SYN=1; ACK=0). 820 o The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte 821 counters by the number of TCP payload octets in Acceptable packets 822 marked respectively with the CE, ECT(0) and ECT(1) codepoint in 823 their IP-ECN field, including any payload octets on control 824 packets, but not including any payload octets on SYN packets 825 (SYN=1; ACK=0). 827 Each Data Sender of each half connection maintains four counters, 828 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 829 counters at the Data Receiver. 831 A Data Receiver feeds back the CE packet counter using the Accurate 832 ECN (ACE) field, as explained in Section 3.2.2. And it feeds back 833 all the byte counters using the AccECN TCP Option, as specified in 834 Section 3.2.3. 836 Whenever a host feeds back the value of any counter, it MUST report 837 the most recent value, no matter whether it is in a pure ACK, an ACK 838 with new payload data or a retransmission. Therefore the feedback 839 carried on a retransmitted packet is unlikely to be the same as the 840 feedback on the original packet. 842 3.2.1. Initialization of Feedback Counters 844 When a host first enters AccECN mode, in its role as a Data Receiver 845 it initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = 846 r.e1b = 0. 848 Non-zero initial values are used to support a stateless handshake 849 (see Section 5.1) and to be distinct from cases where the fields are 850 incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4).
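The Data Receiver counter rules in Section 3.2 and the initial values just given can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the specification: the class and method names are hypothetical, the counters are left as unbounded integers (wrapping only happens when they are encoded into the ACE field or the AccECN Option), and the special handling of CE-marked SYN/ACKs during the handshake (Section 3.2.2.2) is deliberately left out.

```python
from dataclasses import dataclass

# IP-ECN codepoints as two-bit values (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ReceiverCounters:
    """Hypothetical holder for the four Data Receiver counters."""
    cep: int = 5  # r.cep: CE packet counter, initial value 5
    ceb: int = 0  # r.ceb: CE byte counter, initial value 0
    e0b: int = 1  # r.e0b: ECT(0) byte counter, initial value 1
    e1b: int = 0  # r.e1b: ECT(1) byte counter, initial value 0

    def on_acceptable_packet(self, ip_ecn, payload_len, syn, ack):
        """Update the counters for one Acceptable packet."""
        if syn and not ack:
            return  # SYN packets (SYN=1; ACK=0) are excluded entirely
        if ip_ecn == CE:
            self.cep += 1  # CE-marked control packets count too
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets change no counter.
```

A Data Sender would hold the mirror-image counters (s.cep, s.ceb, s.e0b, s.e1b) with the same initial values.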
852 When a host enters AccECN mode, in its role as a Data Sender it 853 initializes its counters to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 854 0. 856 3.2.2. The ACE Field 858 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 859 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 860 as one 3-bit field. Then the field is given a new name, ACE, as 861 shown in Figure 3. 863 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 864 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 865 | | | | U | A | P | R | S | F | 866 | Header Length | Reserved | ACE | R | C | S | S | Y | I | 867 | | | | G | K | H | T | N | N | 868 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 870 Figure 3: Definition of the ACE field within bytes 13 and 14 of the 871 TCP Header (when AccECN has been negotiated and SYN=0). 873 The original definition of these three flags in the TCP header, 874 including the addition of support for the ECN Nonce, is shown for 875 comparison in Figure 1. This specification does not rename these 876 three TCP flags to ACE unconditionally; it merely overloads them with 877 another name and definition once an AccECN connection has been 878 established. 880 With one exception (Section 3.2.2.1), a host with both of its half- 881 connections in AccECN mode MUST interpret the AE, CWR and ECE flags 882 as the 3-bit ACE counter on a segment with the SYN flag cleared 883 (SYN=0). On such a packet, a Data Receiver MUST encode the three 884 least significant bits of its r.cep counter into the ACE field that 885 it feeds back to the Data Sender. A host MUST NOT interpret the 3 886 flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is 887 0 or 1), or if AccECN negotiation is incomplete or has not succeeded. 889 Both parts of each of these conditions are equally important. For 890 instance, even if AccECN negotiation has been successful, the ACE 891 field is not defined on any segments with SYN=1 (e.g.
a 892 retransmission of an unacknowledged SYN/ACK, or when both ends send 893 SYN/ACKs after AccECN support has been successfully negotiated during 894 a simultaneous open). 896 3.2.2.1. ACE Field on the ACK of the SYN/ACK 898 A TCP client (A) in AccECN mode MUST feed back which of the 4 899 possible values of the IP-ECN field was on the SYN/ACK by writing it 900 into the ACE field of a pure ACK with no SACK blocks using the binary 901 encoding in Table 3 (which is the same as that used on the SYN/ACK in 902 Table 2). This shall be called the handshake encoding of the ACE 903 field, and it is the only exception to the rule that the ACE field 904 carries the 3 least significant bits of the r.cep counter on packets 905 with SYN=0. 907 Normally, a TCP client acknowledges a SYN/ACK with an ACK that 908 satisfies the above conditions anyway (SYN=0, no data, no SACK 909 blocks). If an AccECN TCP client intends to acknowledge the SYN/ACK 910 with a packet that does not satisfy these conditions (e.g. it has 911 data to include on the ACK), it SHOULD first send a pure ACK that 912 does satisfy these conditions (see Section 5.2), so that it can feed 913 back which of the four values of the IP-ECN field arrived on the SYN/ 914 ACK. A valid exception to this "SHOULD" would be where the 915 implementation will only be used in an environment where mangling of 916 the ECN field is unlikely. 
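The handshake encoding can be pictured as a small lookup from the IP-ECN codepoint seen on the arriving segment to the three flag bits, with AE as the most significant bit, then CWR, then ECE. The Python sketch below is illustrative only: the dictionary and function names are hypothetical, and the values are those given for the SYN/ACK in Table 2 and repeated in Table 3.

```python
# Handshake encoding of the ACE field (values as in Tables 2 and 3).
HANDSHAKE_ACE = {
    "Not-ECT": 0b010,
    "ECT(1)": 0b011,
    "ECT(0)": 0b100,
    "CE": 0b110,
}


def ace_to_flags(ace):
    """Split a 3-bit ACE value into the (AE, CWR, ECE) flag bits."""
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def handshake_flags(ip_ecn):
    """Flags that feed back the IP-ECN codepoint seen on the SYN or SYN/ACK."""
    return ace_to_flags(HANDSHAKE_ACE[ip_ecn])


def client_rcep_after_synack(ip_ecn):
    """r.cep of the client once in AccECN mode: 6 if the SYN/ACK was CE, else 5."""
    return 6 if ip_ecn == "CE" else 5
```

For example, handshake_flags("ECT(0)") yields AE=1, CWR=0, ECE=0, which is the flag pattern Table 2 lists for "ECT0 on SYN".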
918 +---------------------+---------------------+-----------------------+ 919 | IP-ECN codepoint on | ACE on pure ACK of | r.cep of client in | 920 | SYN/ACK | SYN/ACK | AccECN mode | 921 +---------------------+---------------------+-----------------------+ 922 | Not-ECT | 0b010 | 5 | 923 | ECT(1) | 0b011 | 5 | 924 | ECT(0) | 0b100 | 5 | 925 | CE | 0b110 | 6 | 926 +---------------------+---------------------+-----------------------+ 928 Table 3: The encoding of the ACE field in the ACK of the SYN-ACK to 929 reflect the SYN-ACK's IP-ECN field 931 When an AccECN server in SYN-RCVD state receives a pure ACK with 932 SYN=0 and no SACK blocks, instead of treating the ACE field as a 933 counter, it MUST infer the meaning of each possible value of the ACE 934 field from Table 4, which also shows the value that an AccECN server 935 MUST set s.cep to as a result. 937 Given this encoding of the ACE field on the ACK of a SYN/ACK is 938 exceptional, an AccECN server using large receive offload (LRO) might 939 prefer to disable LRO until such an ACK has transitioned it out of 940 SYN-RCVD state. 942 +---------------+-----------------------------+---------------------+ 943 | ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in | 944 | SYN/ACK | inferred by server | AccECN mode | 945 +---------------+-----------------------------+---------------------+ 946 | 0b000 | {Notes 1, 3} | Disable ECN | 947 | 0b001 | {Notes 2, 3} | 5 | 948 | 0b010 | Not-ECT | 5 | 949 | 0b011 | ECT(1) | 5 | 950 | 0b100 | ECT(0) | 5 | 951 | 0b101 | Currently Unused {Note 2} | 5 | 952 | 0b110 | CE | 6 | 953 | 0b111 | Currently Unused {Note 2} | 5 | 954 +---------------+-----------------------------+---------------------+ 956 Table 4: Meaning of the ACE field on the ACK of the SYN/ACK 958 {Note 1}: If the server is in AccECN mode, the value of zero raises 959 suspicion of zeroing of the ACE field on the path (see 960 Section 3.2.2.3). 
962 {Note 2}: If the server is in AccECN mode, these values are Currently 963 Unused but the AccECN server's behaviour is still defined for forward 964 compatibility. Then the designer of a future protocol can know for 965 certain what AccECN servers will do with these codepoints. 967 {Note 3}: In the case where a server that implements AccECN is also 968 using a stateless handshake (termed a SYN cookie) it will not 969 remember whether it entered AccECN mode. The values 0b000 or 0b001 970 will remind it that it did not enter AccECN mode, because AccECN does 971 not use them (see Section 5.1 for details). If a stateless server 972 that implements AccECN receives either of these two values in the 973 ACK, its action is implementation-dependent and outside the scope of 974 this specification. It will certainly not take the action in the third column 975 because, after it receives either of these values, it is not in 976 AccECN mode. I.e., it will not disable ECN (at least not just 977 because ACE is 0b000) and it will not set s.cep. 979 3.2.2.2. Encoding and Decoding Feedback in the ACE Field 981 Whenever the Data Receiver sends an ACK with SYN=0 (with or without 982 data), unless the handshake encoding in Section 3.2.2.1 applies, the 983 Data Receiver MUST encode the least significant 3 bits of its r.cep 984 counter into the ACE field (see Appendix A.2). 986 Whenever the Data Sender receives an ACK with SYN=0 (with or without 987 data), it first checks whether it has already been superseded by 988 another ACK in which case it ignores the ECN feedback. If the ACK 989 has not been superseded, and if the special handshake encoding in 990 Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field 991 as follows (see Appendix A.2 for examples).
993 o It takes the least significant 3 bits of its local s.cep counter 994 and subtracts them from the incoming ACE counter to work out the 995 minimum positive increment it could apply to s.cep (assuming the 996 ACE field only wrapped at most once). 998 o It then follows the safety procedures in Section 3.2.2.5.2 to 999 calculate or estimate how many packets the ACK could have 1000 acknowledged under the prevailing conditions to determine whether 1001 the ACE field might have wrapped more than once. 1003 The encode/decode procedures during the three-way handshake are 1004 exceptions to the general rules given so far, so they are spelled out 1005 step by step below for clarity: 1007 o If a TCP server in AccECN mode receives a CE mark in the IP-ECN 1008 field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it 1009 remains at its initial value of 5). 1011 Reason: It would be redundant for the server to include CE-marked 1012 SYNs in its r.cep counter, because it already reliably delivers 1013 feedback of any CE marking on the SYN/ACK using the encoding in 1014 Table 2. This also ensures that, when the server starts using the 1015 ACE field, it has not unnecessarily consumed more than one initial 1016 value, given they can be used to negotiate variants of the AccECN 1017 protocol (see Appendix B.3). 1019 o If a TCP client in AccECN mode receives CE feedback in the TCP 1020 flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its 1021 initial value of 5), so that it stays in step with r.cep on the 1022 server. Nonetheless, the TCP client still triggers the congestion 1023 control actions necessary to respond to the CE feedback. 1025 o If a TCP client in AccECN mode receives a CE mark in the IP-ECN 1026 field of a SYN/ACK, it MUST increment r.cep, but no more than once 1027 no matter how many CE-marked SYN/ACKs it receives (i.e. 1028 incremented from 5 to 6, but no further). 
1030 Reason: Incrementing r.cep ensures the client will eventually 1031 deliver any CE marking to the server reliably when it starts using 1032 the ACE field. Even though the client also feeds back any CE 1033 marking on the ACK of the SYN/ACK using the encoding in Table 3, 1034 this ACK is not delivered reliably, so it can be considered as a 1035 timely notification that is redundant but unreliable. The client 1036 does not increment r.cep more than once, because the server can 1037 only increment s.cep once (see next bullet). Also, this limits 1038 the unnecessarily consumed initial values of the ACE field to two. 1040 o If a TCP server in AccECN mode and in SYN-RCVD state receives CE 1041 feedback in the TCP flags of a pure ACK with no SACK blocks, it 1042 MUST increment s.cep (from 5 to 6). The TCP server then triggers 1043 the congestion control actions necessary to respond to the CE 1044 feedback. 1046 Reasoning: The TCP server can only increment s.cep once, because 1047 the first ACK it receives will cause it to transition out of SYN- 1048 RCVD state. The server's congestion response would be no 1049 different even if it could receive feedback of more than one CE- 1050 marked SYN/ACK. 1052 Once the TCP server transitions to ESTABLISHED state, it might 1053 later receive other pure ACK(s) with the handshake encoding in the 1054 ACE field. The conditions for this to occur are quite unusual, 1055 but not impossible, e.g. a SYN/ACK (or ACK of the SYN/ACK) that is 1056 delayed for longer than the server's retransmission timeout; or 1057 packet duplication by the network. Nonetheless, once in the 1058 ESTABLISHED state, the server will consider the ACE field to be 1059 encoded as the normal ACE counter on all packets with SYN=0 (given 1060 it will be following the above rule in this bullet). The server 1061 MAY include a test to avoid this case. 1063 3.2.2.3. 
Testing for Zeroing of the ACE Field 1065 Section 3.2.1 required the Data Receiver to initialize the r.cep 1066 counter to a non-zero value. Therefore, in either direction the 1067 initial value of the ACE counter ought to be non-zero. 1069 If AccECN has been successfully negotiated, the Data Sender SHOULD 1070 check the value of the ACE counter in the first packet (with or 1071 without data) that arrives with SYN=0. If the value of this ACE 1072 field is zero (0b000), the Data Sender disables sending ECN-capable 1073 packets for the remainder of the half-connection by setting the IP/ 1074 ECN field in all subsequent packets to Not-ECT. 1076 Usually, the server checks the ACK of the SYN/ACK from the client, 1077 while the client checks the first data segment from the server. 1078 However, if reordering occurs, "the first packet ... that arrives" 1079 will not necessarily be the same as the first packet in sequence 1080 order. The test has been specified loosely like this to simplify 1081 implementation, and because it would not have been any more precise 1082 to have specified the first packet in sequence order, which would not 1083 necessarily be the first ACE counter that the Data Receiver fed back 1084 anyway, given it might have been a retransmission. 1086 The possibility of re-ordering means that there is a small chance 1087 that the ACE field on the first packet to arrive is genuinely zero 1088 (without middlebox interference). This would cause a host to 1089 unnecessarily disable ECN for a half connection. Therefore, in 1090 environments where there is no evidence of the ACE field being 1091 zeroed, implementations can skip this test. 1093 Note that the Data Sender MUST NOT test whether the arriving counter 1094 in the initial ACE field has been initialized to a specific valid 1095 value - the above check solely tests whether the ACE fields have been 1096 incorrectly zeroed.
This allows hosts to use different initial 1097 values as an additional signalling channel in future. 1099 3.2.2.4. Testing for Mangling of the IP/ECN Field 1101 The value of the ACE field on the SYN/ACK indicates the value of the 1102 IP/ECN field when the SYN arrived at the server. The client can 1103 compare this with how it originally set the IP/ECN field on the SYN. 1104 If this comparison implies an unsafe transition (see below) of the 1105 IP/ECN field, for the remainder of the connection the client MUST NOT 1106 send ECN-capable packets, but it MUST continue to feed back any ECN 1107 markings on arriving packets. 1109 The value of the ACE field on the last ACK of the 3WHS indicates the 1110 value of the IP/ECN field when the SYN/ACK arrived at the client. 1111 The server can compare this with how it originally set the IP/ECN 1112 field on the SYN/ACK. If this comparison implies an unsafe 1113 transition of the IP/ECN field, for the remainder of the connection 1114 the server MUST NOT send ECN-capable packets, but it MUST continue to 1115 feed back any ECN markings on arriving packets. 1117 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 1118 count of CE marks is still eventually delivered reliably). If this 1119 ACK does not arrive, the server can continue to send ECN-capable 1120 packets without having tested for mangling of the IP/ECN field on the 1121 SYN/ACK. 1123 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 1124 repeated here for convenience: 1126 o the not-ECT codepoint changes; 1128 o either ECT codepoint transitions to not-ECT; 1130 o the CE codepoint changes. 1132 RFC 3168 says that a router that changes ECT to not-ECT is invalid 1133 but safe. However, from a host's viewpoint, this transition is 1134 unsafe because it could be the result of two transitions at different 1135 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 
1136 This scenario could well happen where an ECN-enabled home router 1137 congests its upstream mobile broadband bottleneck link, then the 1138 ingress to the mobile network clears the ECN field [Mandalari18]. 1140 Once a Data Sender has entered AccECN mode it SHOULD check whether 1141 all feedback received for the first three or four rounds indicated 1142 that every packet it sent was CE-marked. If so, for the remainder of 1143 the connection, the Data Sender SHOULD NOT send ECN-capable packets, 1144 but it MUST continue to feed back any ECN markings on arriving 1145 packets. 1147 The above fall-back behaviours are necessary in case mangling of the 1148 IP/ECN field is asymmetric, which is currently common over some 1149 mobile networks [Mandalari18]. Then one end might see no unsafe 1150 transition and continue sending ECN-capable packets, while the other 1151 end sees an unsafe transition and stops sending ECN-capable packets. 1153 3.2.2.5. Safety against Ambiguity of the ACE Field 1155 If too many CE-marked segments are acknowledged at once, or if a long 1156 run of ACKs is lost or thinned out, the 3-bit counter in the ACE 1157 field might have cycled between two ACKs arriving at the Data Sender. 1158 The following safety procedures minimize this ambiguity. 1160 3.2.2.5.1. Data Receiver Safety Procedures 1162 An AccECN Data Receiver: 1164 o SHOULD immediately send an ACK whenever a data packet marked CE 1165 arrives after the previous data packet was not CE. 1167 o MUST immediately send an ACK once 'n' CE marks have arrived since 1168 the previous ACK, where 'n' SHOULD be 2 and MUST be no greater 1169 than 6. 1171 These rules for when to send an ACK are designed to be complemented 1172 by those in Section 3.2.3.3, which concern whether the AccECN TCP 1173 Option ought to be included on ACKs.
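The two Data Receiver rules above can be sketched as a small state machine. This Python sketch is illustrative only and rests on assumptions that are not in the specification text: the class and method names are hypothetical, the helper is invoked only for data packets (segments carrying payload), so CE marks on pure ACKs are not counted towards 'n', and the caller is expected to call ack_sent() whenever it sends any ACK, immediate or delayed.

```python
class AckTrigger:
    """Hypothetical helper: decides when an AccECN Data Receiver ACKs at once."""

    def __init__(self, n=2):
        # 'n' SHOULD be 2 and MUST be no greater than 6.
        if not 1 <= n <= 6:
            raise ValueError("n must be between 1 and 6")
        self.n = n
        self.ce_since_ack = 0      # CE marks seen since the previous ACK
        self.prev_data_ce = False  # was the previous data packet CE-marked?

    def on_data_packet(self, is_ce):
        """Return True if this data packet calls for an immediate ACK."""
        # Change-triggered ACK (the SHOULD rule): CE arrives after non-CE.
        immediate = bool(is_ce) and not self.prev_data_ce
        self.prev_data_ce = bool(is_ce)
        if is_ce:
            self.ce_since_ack += 1
        # Counter-protection ACK (the MUST rule): 'n' CE marks since last ACK.
        if self.ce_since_ack >= self.n:
            immediate = True
        return immediate

    def ack_sent(self):
        """Record that an ACK has just been sent."""
        self.ce_since_ack = 0
```

Keeping 'n' at or below 6 means that at most 6 new CE marks accumulate between consecutive ACKs from this receiver, which stays below the 8 values of the 3-bit ACE field provided no ACKs are lost or thinned on the way back.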
1175 For the avoidance of doubt, the change-triggered ACK mechanism is 1176 deliberately worded to solely apply to data packets, and to ignore 1177 the arrival of a control packet with no payload, because it is 1178 important that TCP does not acknowledge pure ACKs. The change- 1179 triggered ACK approach can lead to some additional ACKs but it feeds 1180 back the timing and the order in which ECN marks are received with 1181 minimal additional complexity. If CE marks are infrequent, or 1182 there are multiple marks in a row, the additional load will be low. 1183 Other marking patterns could increase the load significantly. 1185 Even though the first bullet is stated as a "SHOULD", it is important 1186 for a transition to immediately trigger an ACK if at all possible, so 1187 that the Data Sender can rely on change-triggered ACKs to detect 1188 queue growth as soon as possible, e.g. at the start of a flow. This 1189 requirement can only be relaxed if certain offload hardware needed 1190 for high performance cannot support change-triggered ACKs (although 1191 high performance protocols such as DCTCP already successfully use 1192 change-triggered ACKs). One possible compromise would be for the 1193 receiver to heuristically detect whether the sender is in slow-start, 1194 then to implement change-triggered ACKs while the sender is in slow- 1195 start, and offload otherwise. 1197 3.2.2.5.2. Data Sender Safety Procedures 1199 If the Data Sender has not received AccECN TCP Options to give it 1200 more dependable information, and it detects that the ACE field could 1201 have cycled, it SHOULD deem whether it cycled by taking the safest 1202 likely case under the prevailing conditions. It can detect if the 1203 counter could have cycled by using the jump in the acknowledgement 1204 number since the last ACK to calculate or estimate how many segments 1205 could have been acknowledged. An example algorithm to implement this 1206 policy is given in Appendix A.2.
An implementer MAY develop an 1207 alternative algorithm as long as it satisfies these requirements. 1209 If missing acknowledgement numbers arrive later (reordering) and 1210 prove that the counter did not cycle, the Data Sender MAY attempt to 1211 neutralize the effect of any action it took based on a conservative 1212 assumption that it later found to be incorrect. 1214 The Data Sender can estimate how many packets (of any marking) an ACK 1215 acknowledges. If the ACE counter on an ACK seems to imply that the 1216 minimum number of newly CE-marked packets is greater than the number 1217 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1218 counter, unless it can be sure that it is counting all control 1219 packets correctly. 1221 3.2.3. The AccECN Option 1223 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1224 of each field name stands for 'Echo'. 1226 0 1 2 3 1227 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1228 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1229 | Kind = TBD0 | Length = 11 | EE0B field | 1230 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1231 | EE0B (cont'd) | ECEB field | 1232 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1233 | EE1B field | Order 0 1234 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1236 0 1 2 3 1237 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1238 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1239 | Kind = TBD1 | Length = 11 | EE1B field | 1240 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1241 | EE1B (cont'd) | ECEB field | 1242 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1243 | EE0B field | Order 1 1244 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1246 Figure 4: The AccECN TCP Option 1248 Figure 4 shows two option field orders: order 0 and order 1. They 1249 both consist of three 24-bit fields.
Order 0 provides the 24 least 1250 significant bits of the r.e0b, r.ceb and r.e1b counters, 1251 respectively. Order 1 provides the same fields, but in the opposite 1252 order. On each packet, the Data Receiver can use whichever order is 1253 more efficient. 1255 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1256 field to TBD0 if using Order 0, or to TBD1 if using Order 1. These 1257 two new TCP Option Kinds are registered in Section 7 and called 1258 respectively AccECN0 and AccECN1. 1260 Note that there is no field to feed back Not-ECT bytes. Nonetheless 1261 an algorithm for the Data Sender to calculate the number of payload 1262 bytes received as Not-ECT is given in Appendix A.5. 1264 Whenever a Data Receiver sends an AccECN Option, the rules in 1265 Section 3.2.3.3 expect it to usually send a full-length option. To 1266 cope with option space limitations, it can omit unchanged fields from 1267 the tail of the option, as long as it preserves the order of the 1268 remaining fields and includes any field that has changed. The length 1269 field MUST indicate which fields are present as follows:
1271   +--------+------------------+------------------+
1272   | Length | Order 0          | Order 1          |
1273   +--------+------------------+------------------+
1274   |   11   | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B |
1275   |    8   | EE0B, ECEB       | EE1B, ECEB       |
1276   |    5   | EE0B             | EE1B             |
1277   |    2   | (empty)          | (empty)          |
1278   +--------+------------------+------------------+
1280 The empty option of Length=2 is provided to allow for a case where an 1281 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1282 but there is very limited space for the option. 1284 All implementations of a Data Sender that read any AccECN Option MUST 1285 be able to read in AccECN Options of any of the above lengths.
For 1286 forward compatibility, if the AccECN Option is of any other length, 1287 implementations MUST use those whole 3-octet fields that fit within 1288 the length and ignore the remainder of the option. 1290 The AccECN Option has to be optional to implement, because both 1291 sender and receiver have to be able to cope without the option anyway 1292 - in cases where it does not traverse a network path. It is 1293 RECOMMENDED to implement both sending and receiving of the AccECN 1294 Option. If sending of the AccECN Option is implemented, the fall- 1295 backs described in this document will need to be implemented as well 1296 (unless solely for a controlled environment where path traversal is 1297 not considered a problem). Even if a developer does not implement 1298 sending of the AccECN Option, it is RECOMMENDED that they still 1299 implement logic to receive and understand any AccECN Options sent by 1300 remote peers. 1302 If a Data Receiver intends to send the AccECN Option at any time 1303 during the rest of the connection, it is strongly recommended to also 1304 test path traversal of the AccECN Option as specified in 1305 Section 3.2.3.2. 1307 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1309 Whenever the Data Receiver includes any of the counter fields (ECEB, 1310 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1311 significant bits of the current value of the associated counter into 1312 the field (respectively r.ceb, r.e0b, r.e1b). 1314 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1315 first checks whether the ACK has already been superseded by another 1316 ACK, in which case it ignores the ECN feedback. If the ACK has not 1317 been superseded, the Data Sender MUST decode the fields in the AccECN 1318 Option as follows.
For each field, it takes the least significant 24 1319 bits of its associated local counter (s.ceb, s.e0b or s.e1b) and 1320 subtracts them from the counter in the associated field of the 1321 incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work out 1322 the minimum positive increment it could apply to s.ceb, s.e0b or 1323 s.e1b (assuming the field in the option only wrapped at most once). 1325 Appendix A.1 gives an example algorithm for the Data Receiver to 1326 encode its byte counters into the AccECN Option, and for the Data 1327 Sender to decode the AccECN Option fields into its byte counters. 1329 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1330 ACK=0) is not included in any of the locally held octet counters nor 1331 in the AccECN Option on the wire. 1333 3.2.3.2. Path Traversal of the AccECN Option 1335 3.2.3.2.1. Testing the AccECN Option during the Handshake 1337 The TCP client MUST NOT include the AccECN TCP Option on the SYN. (A 1338 fall-back strategy for the loss of the SYN (possibly due to middlebox 1339 interference) is specified in Section 3.1.4.) 1341 A TCP server that confirms its support for AccECN (in response to an 1342 AccECN SYN from the client as described in Section 3.1) SHOULD 1343 include an AccECN TCP Option on the SYN/ACK. 1345 A TCP client that has successfully negotiated AccECN SHOULD include 1346 an AccECN Option in the first ACK at the end of the 3WHS. However, 1347 this first ACK is not delivered reliably, so the TCP client SHOULD 1348 also include an AccECN Option on the first data segment it sends (if 1349 it ever sends one). 1351 A host MAY omit the AccECN Option in any of these three cases 1352 if it has cached knowledge that the packet would be likely to be 1353 blocked on the path to the other host if it included an AccECN 1354 Option. 1356 3.2.3.2.2.
Testing for Loss of Packets Carrying the AccECN Option 1358 If after the normal TCP timeout the TCP server has not received an 1359 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1360 lost, e.g. due to congestion, or a middlebox might be blocking the 1361 AccECN Option. To expedite connection setup, the TCP server SHOULD 1362 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1363 as on the original SYN/ACK but with no AccECN Option. If this 1364 retransmission times out, to expedite connection setup, the TCP 1365 server SHOULD disable AccECN and ECN for this connection by 1366 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option. 1368 Implementers MAY use other fall-back strategies if they are found to 1369 be more effective (e.g. retrying the AccECN Option for a second time 1370 before fall-back - most appropriate during high levels of 1371 congestion). However, other fall-back strategies will need to follow 1372 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1373 SYN/ACKs negotiating different types of feedback have been sent 1374 within the same connection. 1376 If the TCP client detects that the first data segment it sent with 1377 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1378 on the retransmission. Again, implementers MAY use other fall-back 1379 strategies such as attempting to retransmit a second segment with the 1380 AccECN Option before fall-back, and/or caching whether the AccECN 1381 Option is blocked for subsequent connections. 1382 [I-D.ietf-tcpm-2140bis] further discusses caching of TCP parameters 1383 and status information. 1385 If a host falls back to not sending the AccECN Option, it will 1386 continue to process any incoming AccECN Options as normal. 1388 Either host MAY include the AccECN Option in a subsequent segment to 1389 retest whether the AccECN Option can traverse the path. 
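The server-side fall-back for a SYN/ACK that times out can be sketched as a small decision function. This is only an illustration of the recommended (SHOULD) sequence described above, not part of the specification; the names SynAckPlan and plan_synack are hypothetical, and an implementation remains free to use another strategy within the rules of Section 3.1.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynAckPlan:
    """What to put on one (re)transmission of the SYN/ACK (hypothetical type)."""
    accecn_flags: bool   # True: repeat the AE, CWR and ECE flags of the original SYN/ACK
    accecn_option: bool  # True: attach an AccECN Option


def plan_synack(timeouts: int) -> SynAckPlan:
    """Choose the SYN/ACK contents given how many SYN/ACK timeouts have occurred.

    timeouts == 0 is the original SYN/ACK; each timeout adds one.
    """
    if timeouts < 0:
        raise ValueError("timeouts must be non-negative")
    if timeouts == 0:
        # First attempt: confirm AccECN and test the path with the option.
        return SynAckPlan(accecn_flags=True, accecn_option=True)
    if timeouts == 1:
        # First retransmission: same AE, CWR and ECE flags, but no AccECN Option.
        return SynAckPlan(accecn_flags=True, accecn_option=False)
    # Later retransmissions: AE=CWR=ECE=0 and no AccECN Option,
    # i.e. AccECN and ECN are disabled for this connection.
    return SynAckPlan(accecn_flags=False, accecn_option=False)
```

Keeping the decision in one pure function makes it easy to swap in an alternative policy, such as retrying the option once more before falling back.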
1391 If the TCP server receives a second SYN with a request for AccECN 1392 support, it should resend the SYN/ACK, again confirming its support 1393 for AccECN, but this time without the AccECN Option. This approach 1394 rules out any interference by middleboxes that may drop packets with 1395 unknown options, even though it is more likely that the SYN/ACK would 1396 have been lost due to congestion. The TCP server MAY try to send 1397 another packet with the AccECN Option at a later point during the 1398 connection but should monitor if that packet got lost as well, in 1399 which case it SHOULD disable the sending of the AccECN Option for 1400 this half-connection. 1402 Similarly, an AccECN end-point MAY separately memorize which data 1403 packets carried an AccECN Option and disable the sending of AccECN 1404 Options if the loss probability of those packets is significantly 1405 higher than that of all other data packets in the same connection. 1407 3.2.3.2.3. Testing for Absence of the AccECN Option 1409 If the TCP client has successfully negotiated AccECN but does not 1410 receive an AccECN Option on the SYN/ACK (e.g. because it has been 1411 stripped by a middlebox or not sent by the server), the client 1412 switches into a mode that assumes that the AccECN Option is not 1413 available for this half connection. 1415 Similarly, if the TCP server has successfully negotiated AccECN but 1416 does not receive an AccECN Option on the first segment that 1417 acknowledges sequence space at least covering the ISN, it switches 1418 into a mode that assumes that the AccECN Option is not available for 1419 this half connection. 1421 While a host is in this mode that assumes incoming AccECN Options are 1422 not available, it MUST adopt the conservative interpretation of the 1423 ACE field discussed in Section 3.2.2.5.
However, it cannot make any 1424 assumption about support of outgoing AccECN Options on the other half 1425 connection, so it SHOULD continue to send the AccECN Option itself 1426 (unless it has established that sending the AccECN Option is causing 1427 packets to be blocked as in Section 3.2.3.2.2). 1429 If a host is in the mode that assumes incoming AccECN Options are not 1430 available, but it receives an AccECN Option at any later point during 1431 the connection, this clearly indicates that the AccECN Option is not 1432 blocked on the respective path, and the AccECN endpoint MAY switch 1433 out of the mode that assumes the AccECN Option is not available for 1434 this half connection. 1436 3.2.3.2.4. Test for Zeroing of the AccECN Option 1438 For a related test for invalid initialization of the ACE field, see 1439 Section 3.2.2.3. 1441 Section 3.2 required the Data Receiver to initialize the r.e0b 1442 counter to a non-zero value. Therefore, in either direction the 1443 initial value of the EE0B field in the AccECN Option (if one exists) 1444 ought to be non-zero. If AccECN has been negotiated: 1446 o the TCP server MAY check the initial value of the EE0B field in 1447 the first segment that acknowledges sequence space that at least 1448 covers the ISN plus 1. If the initial value of the EE0B field is 1449 zero, the server will switch into a mode that ignores the AccECN 1450 Option for this half connection. 1452 o the TCP client MAY check the initial value of the EE0B field on 1453 the SYN/ACK. If the initial value of the EE0B field is zero, the 1454 client will switch into a mode that ignores the AccECN Option for 1455 this half connection. 1457 While a host is in the mode that ignores the AccECN Option it MUST 1458 adopt the conservative interpretation of the ACE field discussed in 1459 Section 3.2.2.5.
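As a minimal sketch, the zeroing test above reduces to a single comparison on the first EE0B field a host sees. The function name and the use of None for "no EE0B field present" are assumptions made for this illustration; they do not come from the specification.

```python
from typing import Optional


def option_looks_zeroed(initial_ee0b: Optional[int]) -> bool:
    """Return True if the initial EE0B field indicates a zeroed AccECN Option.

    initial_ee0b is the 24-bit EE0B value from the SYN/ACK (client side) or
    from the first segment acknowledging at least ISN plus 1 (server side).
    None means no EE0B field was present, so this test draws no conclusion.
    """
    if initial_ee0b is None:
        return False
    # The test only asks whether the field is zero. It deliberately does not
    # compare against any particular expected initial value.
    return initial_ee0b == 0
```

A host for which this returns True would ignore the AccECN Option for that half connection and fall back to the conservative interpretation of the ACE field.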
1461 Note that the Data Sender MUST NOT test whether the arriving byte 1462 counters in the initial AccECN Option have been initialized to 1463 specific valid values - the above checks solely test whether these 1464 fields have been incorrectly zeroed. This allows hosts to use 1465 different initial values as an additional signalling channel in 1466 future. Also note that the initial value of either field might be 1467 greater than its expected initial value, because the counters might 1468 already have been incremented. Nonetheless, the initial values of 1469 the counters have been chosen so that they cannot wrap to zero on 1470 these initial segments. 1472 3.2.3.2.5. Consistency between AccECN Feedback Fields 1474 When the AccECN Option is available it supplements but does not 1475 replace the ACE field. An endpoint using AccECN feedback MUST always 1476 consider the information provided in the ACE field whether or not the 1477 AccECN Option is also available. 1479 If the AccECN option is present, the s.cep counter might increase 1480 while the s.ceb counter does not (e.g. due to a CE-marked control 1481 packet). The sender's response to such a situation is out of scope, 1482 and needs to be dealt with in a specification that uses ECN-capable 1483 control packets. Theoretically, this situation could also occur if a 1484 middlebox mangled the AccECN Option but not the ACE field. However, 1485 the Data Sender has to assume that the integrity of the AccECN Option 1486 is sound, based on the above test of the well-known initial values 1487 and optionally other integrity tests (Section 5.3). 1489 If either end-point detects that the s.ceb counter has increased but 1490 the s.cep has not (and by testing ACK coverage it is certain how much 1491 the ACE field has wrapped), this invalid protocol transition has to 1492 be due to some form of feedback mangling. 
So, the Data Sender MUST 1493 disable sending ECN-capable packets for the remainder of the half- 1494 connection by setting the IP/ECN field in all subsequent packets to 1495 Not-ECT. 1497 3.2.3.3. Usage of the AccECN TCP Option 1499 If the Data Receiver intends to use the AccECN TCP Option to provide 1500 feedback, the following rules determine when a Data Receiver in 1501 AccECN mode sends an ACK with the AccECN TCP Option, and which fields 1502 to include: 1504 Change-Triggered ACKs: If an arriving packet increments a different 1505 byte counter to that incremented by the previous packet, the Data 1506 Receiver SHOULD immediately send an ACK with an AccECN Option, 1507 without waiting for the next delayed ACK (this is in addition to 1508 the safety recommendation in Section 3.2.2.5 against ambiguity of 1509 the ACE field). 1511 Even though this bullet is stated as a "SHOULD", it is important 1512 for a transition to immediately trigger an ACK if at all possible, 1513 as already argued when specifying change-triggered ACKs for the 1514 ACE. 1516 Continual Repetition: Otherwise, if arriving packets continue to 1517 increment the same byte counter, the Data Receiver can include an 1518 AccECN Option on most or all (delayed) ACKs, but it does not have 1519 to. 1521 * It SHOULD include a counter that has continued to increment on 1522 the next scheduled ACK following a change-triggered ACK; 1524 * While the same counter continues to increment, it SHOULD 1525 include the counter every n ACKs as consistently as possible, 1526 where n can be chosen by the implementer; 1528 * It SHOULD always include an AccECN Option if the r.ceb counter 1529 is incrementing and it MAY include an AccECN Option if r.e0b 1530 or r.e1b is incrementing; 1532 * It SHOULD include each counter at least once for every 2^22 1533 bytes incremented to prevent overflow during continual 1534 repetition.
1536 If the smallest allowed AccECN Option would leave insufficient 1537 space for two SACK blocks on a particular ACK, the Data Receiver 1538 MUST give precedence to the SACK option (total 18 octets), because 1539 loss feedback is more critical. 1541 Necessary Option Length: It MAY exclude counter(s) that have not 1542 changed for the whole connection (but beacons still include all 1543 fields - see below). It SHOULD include counter(s) that have 1544 incremented at some time during the connection. It MUST include 1545 the counter(s) that have incremented since the previous AccECN 1546 Option and it MUST only truncate fields from the right-hand tail 1547 of the option to preserve the order of the remaining fields (see 1548 Section 3.2.3). 1550 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1551 length AccECN TCP Option on at least three ACKs per RTT, or on all 1552 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1553 example algorithm that satisfies this requirement). 1555 The above rules complement those in Section 3.2.2.5, which determine 1556 when to generate an ACK irrespective of whether an AccECN TCP Option 1557 is to be included. 1559 The following example series of arriving IP/ECN fields illustrates 1560 when a Data Receiver will emit an ACK with an AccECN Option if it is 1561 using a delayed ACK factor of 2 segments and change-triggered ACKs: 1562 01 -> ACK, 01, 01 -> ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 1563 -> ACK. 1565 Even though the first bullet is stated as a "SHOULD", it is important for 1566 a transition to immediately trigger an ACK if at all possible, so 1567 that the Data Sender can rely on change-triggered ACKs to detect 1568 queue growth as soon as possible, e.g. at the start of a flow.
This 1569 requirement can only be relaxed if certain offload hardware needed 1570 for high performance cannot support change-triggered ACKs (although 1571 high performance protocols such as DCTCP already successfully use 1572 change-triggered ACKs). One possible experimental compromise would 1573 be for the receiver to heuristically detect whether the sender is in 1574 slow-start, then to implement change-triggered ACKs while the sender 1575 is in slow-start, and offload otherwise. 1577 For the avoidance of doubt, this change-triggered ACK mechanism is 1578 deliberately worded to ignore the arrival of a control packet with no 1579 payload, which therefore does not alter any byte counters, because it 1580 is important that TCP does not acknowledge pure ACKs. The change- 1581 triggered ACK approach can lead to some additional ACKs but it feeds 1582 back the timing and the order in which ECN marks are received with 1583 minimal additional complexity. If only CE marks are infrequent, or 1584 there are multiple marks in a row, the additional load will be low. 1585 Other marking patterns could increase the load significantly. 1586 Investigating the additional load is a goal of the proposed 1587 experiment. 1589 Implementation note: sending an AccECN Option each time a different 1590 counter changes and including a full-length AccECN Option on every 1591 delayed ACK will satisfy the requirements described above and might 1592 be the easiest implementation, as long as sufficient space is 1593 available in each ACK (in total and in the option space). 1595 Appendix A.3 gives an example algorithm to estimate the number of 1596 marked bytes from the ACE field alone, if the AccECN Option is not 1597 available. 1599 If a host has determined that segments with the AccECN Option always 1600 seem to be discarded somewhere along the path, it is no longer 1601 obliged to follow the above rules. 1603 3.3.
AccECN Compliance Requirements for TCP Proxies, Offload Engines 1604 and other Middleboxes 1606 3.3.1. Requirements for TCP Proxies 1608 A large class of middleboxes split TCP connections. Such a middlebox 1609 would be compliant with the AccECN protocol if the TCP implementation 1610 on each side complied with the present AccECN specification and each 1611 side negotiated AccECN independently of the other side. 1613 3.3.2. Requirements for TCP Normalizers 1615 Another large class of middleboxes intervenes to some degree at the 1616 transport layer, but attempts to be transparent (invisible) to the 1617 end-to-end connection. A subset of this class of middleboxes 1618 attempts to `normalize' the TCP wire protocol by checking that all 1619 values in header fields comply with a rather narrow and often 1620 outdated interpretation of the TCP specifications. To comply with 1621 the present AccECN specification, such a middlebox MUST NOT change 1622 the ACE field or the AccECN Option. 1624 A middlebox claiming to be transparent at the transport layer MUST 1625 forward the AccECN TCP Option unaltered, whether or not the length 1626 value matches one of those specified in Section 3.2.3, and whether or 1627 not the initial values of the byte-counter fields are correct. This 1628 is because blocking apparently invalid values does not improve 1629 security (because AccECN hosts are required to ignore invalid values 1630 anyway), while it prevents the standardized set of values being 1631 extended in future (because outdated normalizers would block updated 1632 hosts from using the extended AccECN standard). 1634 3.3.3. Requirements for TCP ACK Filtering 1636 A node that implements ACK filtering (aka. thinning or coalescing) 1637 and itself also implements ECN marking will not need to filter ACKs 1638 from connections that use AccECN feedback. 
Therefore, such a node 1639 SHOULD detect connections that have negotiated the use of AccECN 1640 feedback during the handshake (see Table 2) and it SHOULD preserve 1641 the timing of each ACK (if it coalesced ACKs it would not be AccECN- 1642 compliant, but the requirement is stated as a "SHOULD" in order to 1643 allow leeway for pre-existing ACK filtering functions to be brought 1644 into line). 1646 A node that implements ACK filtering and does not itself implement 1647 ECN marking does not need to treat AccECN connections any differently 1648 from other TCP connections. Nonetheless, it is RECOMMENDED that such 1649 nodes implement ECN marking and comply with the requirements of the 1650 previous paragraph. This should be a better way than ACK filtering 1651 to improve the performance of AccECN TCP connections. 1653 The rationale for these requirements is that AccECN feedback provides 1654 sufficient information to a data receiver for it to be able to 1655 monitor ECN marking of the ACKs it has sent, so that it can thin the 1656 ACK stream itself. This will eventually mean that ACK filtering in 1657 the network gives no performance advantage. Then TCP will be able to 1658 maintain its own control over ACK coalescing. This will also allow 1659 the TCP Data Sender to use the timing of ACK arrivals to more 1660 reliably infer further information about the path congestion level. 1662 Note that the specification of AccECN in TCP does not presume to rely 1663 on the above ACK filtering behaviour in the network, because it has 1664 to be robust against pre-existing network nodes that still filter 1665 AccECN ACKs, and robust against ACK loss during overload. 1667 Section 5.2.1 of [RFC3449] gives best current practice on ACK 1668 filtering (aka. thinning or coalescing). It gives no advice on ACKs 1669 carrying ECN feedback, because at the time it said that "ECN remain 1670 areas of ongoing research".
This section updates that advice for a 1671 TCP connection that supports AccECN feedback. 1673 3.3.4. Requirements for TCP Segmentation Offload 1675 Hardware to offload certain TCP processing represents another large 1676 class of middleboxes (even though it is often a function of a host's 1677 network interface and rarely in its own 'box'). 1679 The ACE field changes with every received CE marking, so today's 1680 receive offloading could lead to many interrupts in high congestion 1681 situations. Although that would be useful (because congestion 1682 information is received sooner), it could also significantly increase 1683 processor load, particularly in scenarios such as DCTCP or L4S where 1684 the marking rate is generally higher. 1686 Current offload hardware ejects a segment from the coalescing process 1687 whenever the TCP ECN flags change. Thus Classic ECN causes offload 1688 to be inefficient. In data centres it has been fortunate for this 1689 offload hardware that DCTCP-style feedback changes less often when 1690 there are long sequences of CE marks, which is more common with a 1691 step marking threshold (but less likely the more short flows are in 1692 the mix). The ACE counter approach has been designed so that 1693 coalescing can continue over arbitrary patterns of marking and only 1694 needs to stop when the counter wraps. Nonetheless, until the 1695 particular offload hardware in use implements this more efficient 1696 approach, it is likely to be more efficient for AccECN connections to 1697 implement this counter-style logic using software segmentation 1698 offload. 1700 ECN encodes a varying signal in the ACK stream, so it is inevitable 1701 that offload hardware will ultimately need to handle any form of ECN 1702 feedback exceptionally. The ACE field has been designed as a counter 1703 so that it is straightforward for offload hardware to pass on the 1704 highest counter, and to push a segment from its cache before the 1705 counter wraps. 
The purpose of working towards standardized TCP ECN 1706 feedback is to reduce the risk for hardware developers, who would 1707 otherwise have to guess which scheme is likely to become dominant. 1709 The above process has been designed to enable a continuing 1710 incremental deployment path - to more highly dynamic congestion 1711 control. Once DCTCP offload hardware supports AccECN, it will be 1712 able to coalesce efficiently for any sequence of marks, instead of 1713 relying for efficiency on the long marking sequences from step 1714 marking. In the next stage, DCTCP marking can evolve from a step to 1715 a ramp function. That in turn will allow host congestion control 1716 algorithms to respond faster to dynamics, while being backwards 1717 compatible with existing host algorithms. 1719 4. Updates to RFC 3168 1721 Normative statements in the following sections of RFC3168 are updated 1722 by the present AccECN specification: 1724 o The whole of "6.1.1 TCP Initialization" of [RFC3168] is updated by 1725 Section 3.1 of the present specification. 1727 o In "6.1.2. The TCP Sender" of [RFC3168], all mentions of a 1728 congestion response to an ECN-Echo (ECE) ACK packet are updated by 1729 Section 3.2 of the present specification to mean an increment to 1730 the sender's count of CE-marked packets, s.cep. And the 1731 requirements to set the CWR flag no longer apply, as specified in 1732 Section 3.1.5 of the present specification. Otherwise, the 1733 remaining requirements in "6.1.2. The TCP Sender" still stand. 1735 It will be noted that RFC 8311 already updates, or potentially 1736 updates, a number of the requirements in "6.1.2. The TCP Sender". 1737 Section 6.1.2 of RFC 3168 extended standard TCP congestion control 1738 [RFC5681] to cover ECN marking as well as packet drop. Whereas, 1739 RFC 8311 enables experimentation with alternative responses to ECN 1740 marking, if specified for instance by an experimental RFC on the 1741 IETF document stream. 
RFC 8311 also strengthened the statement 1742 that "ECT(0) SHOULD be used" to a "MUST" (see [RFC8311] for the 1743 details). 1745 o The whole of "6.1.3. The TCP Receiver" of [RFC3168] is updated by 1746 Section 3.2 of the present specification, with the exception of 1747 the last paragraph (about congestion response to drop and ECN in 1748 the same round trip), which still stands. Incidentally, this last 1749 paragraph is in the wrong section, because it relates to TCP 1750 sender behaviour. 1752 o The following text within "6.1.5. Retransmitted TCP packets": 1754 "the TCP data receiver SHOULD ignore the ECN field on arriving 1755 data packets that are outside of the receiver's current 1756 window." 1758 is updated by more stringent acceptability tests for any packet 1759 (not just data packets) in the present specification. 1760 Specifically, in the normative specification of AccECN (Section 3) 1761 only 'Acceptable' packets contribute to the ECN counters at the 1762 AccECN receiver and Section 1.3 defines an Acceptable packet as 1763 one that passes the acceptability tests in both [RFC0793] and 1764 [RFC5961]. 1766 o Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of [RFC3168] prohibit 1767 use of ECN on TCP control packets and retransmissions. The 1768 present specification does not update that aspect of RFC 3168, but 1769 it does say what feedback an AccECN Data Receiver should provide 1770 if it receives an ECN-capable control packet or retransmission. 1771 This ensures AccECN is forward compatible with any future scheme 1772 that allows ECN on these packets, as provided for in section 4.3 1773 of [RFC8311] and as proposed in [I-D.ietf-tcpm-generalized-ecn]. 1775 5. Interaction with TCP Variants 1777 This section is informative, not normative. 1779 5.1. Compatibility with SYN Cookies 1781 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1782 protect itself from SYN flooding attacks. 
It places minimal commonly 1783 used connection state in the SYN/ACK, and deliberately does not hold 1784 any state while waiting for the subsequent ACK (e.g. it closes the 1785 thread). Therefore it cannot record the fact that it entered AccECN 1786 mode for both half-connections. Indeed, it cannot even remember 1787 whether it negotiated the use of classic ECN [RFC3168]. 1789 Nonetheless, such a server can determine that it negotiated AccECN as 1790 follows. If a TCP server using SYN Cookies supports AccECN and if it 1791 receives a pure ACK that acknowledges an ISN that is a valid SYN 1792 cookie, and if the ACK contains an ACE field with the value 0b010 to 1793 0b111 (decimal 2 to 7), it can assume that: 1795 o the TCP client must have requested AccECN support on the SYN 1797 o it (the server) must have confirmed that it supported AccECN 1799 Therefore the server can switch itself into AccECN mode, and continue 1800 as if it had never forgotten that it switched itself into AccECN mode 1801 earlier. 1803 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1804 with the value 0b000 or 0b001, these values indicate that the client 1805 did not request support for AccECN and therefore the server does not 1806 enter AccECN mode for this connection. Further, 0b001 on the ACK 1807 implies that the server sent an ECN-capable SYN/ACK, which was marked 1808 CE in the network, and the non-AccECN client fed this back by setting 1809 ECE on the ACK of the SYN/ACK. 1811 5.2. Compatibility with TCP Experiments and Common TCP Options 1813 AccECN is compatible (at least on paper) with the most commonly used 1814 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1815 also compatible with the recent promising experimental TCP options 1816 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 
1817 AccECN is friendly to all these protocols, because space for TCP 1818 options is particularly scarce on the SYN, where AccECN consumes zero 1819 additional header space. 1821 When option space is under pressure from other options, 1822 Section 3.2.3.3 provides guidance on how important it is to send an 1823 AccECN Option and whether it needs to be a full-length option. 1825 Implementers of TFO need to take careful note of the recommendation 1826 in Section 3.2.2.1. That section recommends that, if the client has 1827 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1828 if it has data to send, it sends a pure ACK immediately before the 1829 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1830 pure ACK, which allows the server to detect ECN mangling. 1832 5.3. Compatibility with Feedback Integrity Mechanisms 1834 Three alternative mechanisms are available to assure the integrity of 1835 ECN and/or loss signals. AccECN is compatible with any of these 1836 approaches: 1838 o The Data Sender can test the integrity of the receiver's ECN (or 1839 loss) feedback by occasionally setting the IP-ECN field to a value 1840 normally only set by the network (and/or deliberately leaving a 1841 sequence number gap). Then it can test whether the Data 1842 Receiver's feedback faithfully reports what it expects (similar to 1843 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1844 [RFC3540], this approach does not waste the ECT(1) codepoint in 1845 the IP header, it does not require standardization and it does not 1846 rely on misbehaving receivers volunteering to reveal feedback 1847 information that allows them to be detected. However, setting the 1848 CE mark by the sender might conceal actual congestion feedback 1849 from the network and should therefore only be done sparingly. 
1851 o Networks generate congestion signals when they are becoming 1852 congested, so networks are more likely than Data Senders to be 1853 concerned about the integrity of the receiver's feedback of these 1854 signals. A network can enforce a congestion response to its ECN 1855 markings (or packet losses) using congestion exposure (ConEx) 1856 audit [RFC7713]. Whether the receiver or a downstream network is 1857 suppressing congestion feedback or the sender is unresponsive to 1858 the feedback, or both, ConEx audit can neutralize any advantage 1859 that any of these three parties would otherwise gain. 1861 ConEx is a change to the Data Sender that is most useful when 1862 combined with AccECN. Without AccECN, the ConEx behaviour of a 1863 Data Sender would have to be more conservative than would be 1864 necessary if it had the accurate feedback of AccECN. 1866 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1867 detect any tampering with AccECN feedback between the Data 1868 Receiver and the Data Sender (whether malicious or accidental). 1869 The AccECN fields are immutable end-to-end, so they are amenable 1870 to TCP-AO protection, which covers TCP options by default. 1871 However, TCP-AO is often too brittle to use on many end-to-end 1872 paths, where middleboxes can make verification fail in their 1873 attempts to improve performance or security, e.g. by 1874 resegmentation or shifting the sequence space. 1876 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1877 of congestion feedback. With minor changes AccECN could be optimized 1878 for the possibility that the ECT(1) codepoint might be used as an ECN 1879 Nonce. However, given RFC 3540 has been reclassified as historic, 1880 the AccECN design has been generalized so that it ought to be able to 1881 support other possible uses of the ECT(1) codepoint, such as a lower 1882 severity or a more instant congestion signal than CE. 1884 6. 
Protocol Properties

1886 This section is informative, not normative. It describes how well the 1887 protocol satisfies the agreed requirements for a more accurate ECN 1888 feedback protocol [RFC7560].

1890 Accuracy: From each ACK, the Data Sender can infer the number of new 1891 CE marked segments since the previous ACK. This provides better 1892 accuracy on CE feedback than classic ECN. In addition, if the 1893 AccECN Option is present (not blocked by the network path), the 1894 numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

1896 Overhead: The AccECN scheme is divided into two parts. The 1897 essential part reuses the 3 flags already assigned to ECN in the 1898 TCP header. The supplementary part adds an additional TCP option 1899 consuming up to 11 bytes. However, no TCP option is consumed in 1900 the SYN.

1902 Ordering: The order in which marks arrive at the Data Receiver is 1903 preserved in AccECN feedback, because the Data Receiver is 1904 expected to send an ACK immediately whenever a different mark 1905 arrives.

1907 Timeliness: While the same ECN markings are arriving continually at 1908 the Data Receiver, it can defer ACKs as TCP does normally, but it 1909 will immediately send an ACK as soon as a different ECN marking 1910 arrives.

1912 Timeliness vs Overhead: Change-Triggered ACKs are intended to enable 1913 latency-sensitive uses of ECN feedback by capturing the timing of 1914 transitions but not wasting resources while the state of the 1915 signalling system is stable. Within the constraints of the 1916 change-triggered ACK rules, the receiver can control how 1917 frequently it sends the AccECN TCP Option and therefore to some 1918 extent it can control the overhead induced by AccECN.

1920 Resilience: All information is provided based on counters. 1921 Therefore if ACKs are lost, the counters on the first ACK 1922 following the losses allow the Data Sender to immediately recover 1923 the number of the ECN markings that it missed.
And if data or 1924 ACKs are reordered, stale congestion information can be identified 1925 and ignored. 1927 Resilience against Bias: Because feedback is based on repetition of 1928 counters, random losses do not remove any information, they only 1929 delay it. Therefore, even though some ACKs are change-triggered, 1930 random losses will not alter the proportions of the different ECN 1931 markings in the feedback. 1933 Resilience vs Overhead: If space is limited in some segments (e.g. 1934 because more options are needed on some segments, such as the SACK 1935 option after loss), the Data Receiver can send AccECN Options less 1936 frequently or truncate fields that have not changed, usually down 1937 to as little as 5 bytes. However, it has to send a full-sized 1938 AccECN Option at least three times per RTT, which the Data Sender 1939 can rely on as a regular beacon or checkpoint. 1941 Resilience vs Timeliness and Ordering: Ordering information and the 1942 timing of transitions cannot be communicated in three cases: i) 1943 during ACK loss; ii) if something on the path strips the AccECN 1944 Option; or iii) if the Data Receiver is unable to support Change- 1945 Triggered ACKs. Following ACK reordering, the Data Sender can 1946 reconstruct the order in which feedback was sent, but not until 1947 all the missing feedback has arrived. 1949 Complexity: An AccECN implementation solely involves simple counter 1950 increments, some modulo arithmetic to communicate the least 1951 significant bits and allow for wrap, and some heuristics for 1952 safety against fields cycling due to prolonged periods of ACK 1953 loss. Each host needs to maintain eight additional counters. The 1954 hosts have to apply some additional tests to detect tampering by 1955 middleboxes, but in general the protocol is simple to understand, 1956 simple to implement and requires few cycles per packet to execute. 
1958 Integrity: AccECN is compatible with at least three approaches that 1959 can assure the integrity of ECN feedback. If the AccECN Option is 1960 stripped the resolution of the feedback is degraded, but the 1961 integrity of this degraded feedback can still be assured. 1963 Backward Compatibility: If only one endpoint supports the AccECN 1964 scheme, it will fall-back to the most advanced ECN feedback scheme 1965 supported by the other end. 1967 Backward Compatibility: If the AccECN Option is stripped by a 1968 middlebox, AccECN still provides basic congestion feedback in the 1969 ACE field. Further, AccECN can be used to detect mangling of the 1970 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 1971 marked segments; and blocking of segments carrying the AccECN 1972 Option. It can detect these conditions during TCP's 3WHS so that 1973 it can fall back to operation without ECN and/or operation without 1974 the AccECN Option. 1976 Forward Compatibility: The behaviour of endpoints and middleboxes is 1977 carefully defined for all reserved or currently unused codepoints 1978 in the scheme. Then, the designers of security devices can 1979 understand which currently unused values might appear in future. 1980 So, even if they choose to treat such values as anomalous while 1981 they are not widely used, any blocking will at least be under 1982 policy control not hard-coded. Then, if previously unused values 1983 start to appear on the Internet (or in standards), such policies 1984 could be quickly reversed. 1986 7. IANA Considerations 1988 This document reassigns bit 7 of the TCP header flags to the AccECN 1989 experiment. This bit was previously called the Nonce Sum (NS) flag 1990 [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. 
1991 The flag will now be defined as: 1993 +-----+-------------------+-----------+ 1994 | Bit | Name | Reference | 1995 +-----+-------------------+-----------+ 1996 | 7 | AE (Accurate ECN) | RFC XXXX | 1997 +-----+-------------------+-----------+ 1999 [TO BE REMOVED: IANA is requested to update the existing entry in the 2000 Transmission Control Protocol (TCP) Header Flags registration 2001 (https://www.iana.org/assignments/tcp-header-flags/tcp-header- 2002 flags.xhtml#tcp-header-flags-1) for Bit 7 to "AE (Accurate ECN), 2003 previously used as NS (Nonce Sum) by [RFC3540], which is now Historic 2004 [RFC8311]" and change the reference to this RFC-to-be instead of 2005 RFC8311.] 2007 This document also defines two new TCP options for AccECN, assigned 2008 values of TBD0 and TBD1 (decimal) from the TCP option space. These 2009 values are defined as: 2011 +------+--------+--------------------------------+-----------+ 2012 | Kind | Length | Meaning | Reference | 2013 +------+--------+--------------------------------+-----------+ 2014 | TBD0 | N | Accurate ECN Order 0 (AccECN0) | RFC XXXX | 2015 | TBD1 | N | Accurate ECN Order 1 (AccECN1) | RFC XXXX | 2016 +------+--------+--------------------------------+-----------+ 2018 [TO BE REMOVED: This registration should take place at the following 2019 location: http://www.iana.org/assignments/tcp-parameters/tcp- 2020 parameters.xhtml#tcp-parameters-1 ] 2022 Early implementations using experimental option 254 per [RFC6994] 2023 with the single magic number 0xACCE (16 bits), as allocated in the 2024 IANA "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" 2025 registry, SHOULD migrate to use these new option kinds (TBD0 & TBD1). 
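As a rough illustration of the experimental encoding mentioned above, the following Python sketch checks whether a single TCP option is an experimental AccECN option. It assumes the option layout of [RFC6994] (Kind, Length, then a 16-bit ExID in network byte order) and assumes the caller passes exactly one option. The function name is hypothetical and the payload after the ExID is not interpreted. The option kinds TBD0 and TBD1 are not yet assigned, so they are deliberately not handled.

```python
EXP_KIND = 254        # experimental TCP option kind named in the text
ACCECN_EXID = 0xACCE  # 16-bit magic number (ExID) named in the text


def is_experimental_accecn(option: bytes) -> bool:
    """Return True if `option` looks like an experimental AccECN option.

    Assumes `option` holds exactly one TCP option laid out as
    Kind (1 byte), Length (1 byte), ExID (2 bytes), then payload.
    """
    if len(option) < 4:
        return False  # too short to carry Kind, Length and a 16-bit ExID
    kind, length = option[0], option[1]
    if kind != EXP_KIND or length != len(option):
        return False  # wrong kind, or Length disagrees with the bytes given
    exid = int.from_bytes(option[2:4], "big")  # network byte order
    return exid == ACCECN_EXID
```

An implementation that still accepts this experimental form could use such a check alongside the assigned option kinds during a migration period.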
2027 [TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs 2028 registry should be changed to "AccECN (current and new 2029 implementations SHOULD use option kinds TBD0 and TBD1)" at the 2030 following location: https://www.iana.org/assignments/tcp-parameters/ 2031 tcp-parameters.xhtml#tcp-exids ] 2033 8. Security Considerations 2035 If ever the supplementary part of AccECN based on the new AccECN TCP 2036 Option is unusable (due for example to middlebox interference) the 2037 essential part of AccECN's congestion feedback offers only limited 2038 resilience to long runs of ACK loss (see Section 3.2.2.5). These 2039 problems are unlikely to be due to malicious intervention (because if 2040 an attacker could strip a TCP option or discard a long run of ACKs it 2041 could wreak other arbitrary havoc). However, it would be of concern 2042 if AccECN's resilience could be indirectly compromised during a 2043 flooding attack. AccECN is still considered safe though, because if 2044 the option is not presented, the AccECN Data Sender is then required 2045 to switch to more conservative assumptions about wrap of congestion 2046 indication counters (see Section 3.2.2.5 and Appendix A.2). 2048 Section 5.1 describes how a TCP server can negotiate AccECN and use 2049 the SYN cookie method for mitigating SYN flooding attacks. 2051 There is concern that ECN markings could be altered or suppressed, 2052 particularly because a misbehaving Data Receiver could increase its 2053 own throughput at the expense of others. AccECN is compatible with 2054 the three schemes known to assure the integrity of ECN feedback (see 2055 Section 5.3 for details). If the AccECN Option is stripped by an 2056 incorrectly implemented middlebox, the resolution of the feedback 2057 will be degraded, but the integrity of this degraded information can 2058 still be assured. 
2060 There is a potential concern that a receiver could deliberately omit 2061 the AccECN Option pretending that it had been stripped by a 2062 middlebox. No known way can yet be contrived to take advantage of 2063 this downgrade attack, but it is mentioned here in case someone else 2064 can contrive one. 2066 The AccECN protocol is not believed to introduce any new privacy 2067 concerns, because it merely counts and feeds back signals at the 2068 transport layer that had already been visible at the IP layer. 2070 9. Acknowledgements 2072 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2073 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2074 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans, Ilpo 2075 Jaervinen and Neal Cardwell for their input and discussion. The idea 2076 of using the three ECN-related TCP flags as one field for more 2077 accurate TCP-ECN feedback was first introduced in the re-ECN protocol 2078 that was the ancestor of ConEx. 2080 Bob Briscoe was part-funded by the Comcast Innovation Fund, the 2081 European Community under its Seventh Framework Programme through the 2082 Reducing Internet Transport Latency (RITE) project (ICT-317700) and 2083 through the Trilogy 2 project (ICT-317756), and the Research Council 2084 of Norway through the TimeIn project. The views expressed here are 2085 solely those of the authors. 2087 Mirja Kuehlewind was partly supported by the European Commission 2088 under Horizon 2020 grant agreement no. 688421 Measurement and 2089 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 2090 State Secretariat for Education, Research, and Innovation under 2091 contract no. 15.0268. This support does not imply endorsement. 2093 10. Comments Solicited 2095 Comments and questions are encouraged and very welcome. They can be 2096 addressed to the IETF TCP maintenance and minor modifications working 2097 group mailing list , and/or to the authors. 2099 11. 
References 2101 11.1. Normative References 2103 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 2104 RFC 793, DOI 10.17487/RFC0793, September 1981, 2105 . 2107 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2108 Requirement Levels", BCP 14, RFC 2119, 2109 DOI 10.17487/RFC2119, March 1997, 2110 . 2112 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 2113 of Explicit Congestion Notification (ECN) to IP", 2114 RFC 3168, DOI 10.17487/RFC3168, September 2001, 2115 . 2117 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 2118 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 2119 . 2121 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2122 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2123 May 2017, . 2125 11.2. Informative References 2127 [I-D.ietf-tcpm-2140bis] 2128 Touch, J., Welzl, M., and S. Islam, "TCP Control Block 2129 Interdependence", draft-ietf-tcpm-2140bis-05 (work in 2130 progress), April 2020. 2132 [I-D.ietf-tcpm-generalized-ecn] 2133 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 2134 Congestion Notification (ECN) to TCP Control Packets", 2135 draft-ietf-tcpm-generalized-ecn-06 (work in progress), 2136 October 2020. 2138 [I-D.ietf-tsvwg-l4s-arch] 2139 Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low 2140 Latency, Low Loss, Scalable Throughput (L4S) Internet 2141 Service: Architecture", draft-ietf-tsvwg-l4s-arch-07 (work 2142 in progress), October 2020. 2144 [Mandalari18] 2145 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 2146 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 2147 over Mobile", IEEE Communications Magazine , March 2018. 2149 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 2150 Selective Acknowledgment Options", RFC 2018, 2151 DOI 10.17487/RFC2018, October 1996, 2152 . 2154 [RFC3449] Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. 
2155 Sooriyabandara, "TCP Performance Implications of Network 2156 Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, 2157 December 2002, . 2159 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 2160 Congestion Notification (ECN) Signaling with Nonces", 2161 RFC 3540, DOI 10.17487/RFC3540, June 2003, 2162 . 2164 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 2165 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 2166 . 2168 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 2169 Ramakrishnan, "Adding Explicit Congestion Notification 2170 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 2171 DOI 10.17487/RFC5562, June 2009, 2172 . 2174 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2175 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2176 June 2010, . 2178 [RFC5961] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 2179 Robustness to Blind In-Window Attacks", RFC 5961, 2180 DOI 10.17487/RFC5961, August 2010, 2181 . 2183 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 2184 "TCP Extensions for Multipath Operation with Multiple 2185 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 2186 . 2188 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 2189 RFC 6994, DOI 10.17487/RFC6994, August 2013, 2190 . 2192 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 2193 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 2194 . 2196 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 2197 "Problem Statement and Requirements for Increased Accuracy 2198 in Explicit Congestion Notification (ECN) Feedback", 2199 RFC 7560, DOI 10.17487/RFC7560, August 2015, 2200 . 2202 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 2203 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 2204 DOI 10.17487/RFC7713, December 2015, 2205 . 2207 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 2208 and G. 
Judd, "Data Center TCP (DCTCP): TCP Congestion 2209 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 2210 October 2017, . 2212 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 2213 Notification (ECN) Experimentation", RFC 8311, 2214 DOI 10.17487/RFC8311, January 2018, 2215 . 2217 [RFC8511] Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 2218 "TCP Alternative Backoff with ECN (ABE)", RFC 8511, 2219 DOI 10.17487/RFC8511, December 2018, 2220 . 2222 Appendix A. Example Algorithms 2224 This appendix is informative, not normative. It gives example 2225 algorithms that would satisfy the normative requirements of the 2226 AccECN protocol. However, implementers are free to choose other ways 2227 to implement the requirements. 2229 A.1. Example Algorithm to Encode/Decode the AccECN Option 2231 The example algorithms below show how a Data Receiver in AccECN mode 2232 could encode its CE byte counter r.ceb into the ECEB field within the 2233 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 2234 the ECEB field into its byte counter s.ceb. The other counters for 2235 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 2236 similarly encoded and decoded. 2238 It is assumed that each local byte counter is an unsigned integer 2239 greater than 24b (probably 32b), and that the following constant has 2240 been assigned: 2242 DIVOPT = 2^24 2244 Every time a CE marked data segment arrives, the Data Receiver 2245 increments its local value of r.ceb by the size of the TCP Data. 2246 Whenever it sends an ACK with the AccECN Option, the value it writes 2247 into the ECEB field is 2249 ECEB = r.ceb % DIVOPT 2251 where '%' is the remainder operator. 2253 On the arrival of an AccECN Option, the Data Sender first makes sure 2254 the ACK has not been superseded in order to avoid winding the s.ceb 2255 counter backwards. 
It uses the TCP acknowledgement number and any 2256 SACK options to calculate newlyAckedB, the amount of new data that 2257 the ACK acknowledges in bytes (newlyAckedB can be zero but not 2258 negative). If newlyAckedB is zero, either the ACK has been 2259 superseded or CE-marked packet(s) without data could have arrived. 2260 To break the tie for the latter case, the Data Sender could use 2261 timestamps (if present) to work out newlyAckedT, the amount of new 2262 time that the ACK acknowledges. If the Data Sender determines that 2263 the ACK has been superseded it ignores the AccECN Option. Otherwise, 2264 the Data Sender calculates the minimum non-negative difference d.ceb 2265 between the ECEB field and its local s.ceb counter, using modulo 2266 arithmetic as follows:

2268    if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
2269        d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
2270        s.ceb += d.ceb
2271    }

2273 For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), 2274 then

2276    s.ceb % DIVOPT = 1
2277    d.ceb = (1461 + 2^24 - 1) % 2^24
2278          = 1460
2279    s.ceb = 33,554,433 + 1460
2280          = 33,555,893

2282 A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

2284 The example algorithms below show how a Data Receiver in AccECN mode 2285 could encode its CE packet counter r.cep into the ACE field, and how 2286 the Data Sender in AccECN mode could decode the ACE field into its 2287 s.cep counter. The Data Sender's algorithm includes code to 2288 heuristically detect a long enough unbroken string of ACK losses that 2289 could have concealed a cycle of the congestion counter in the ACE 2290 field of the next ACK to arrive.
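Returning briefly to Appendix A.1, its ECEB encoding and decoding steps can be sketched in Python as follows. This is only an illustrative sketch: the function and parameter names are hypothetical, Python integers stand in for the 32-bit counters, and the caller is assumed to have already derived newlyAckedB and newlyAckedT from the acknowledgement number, SACK options and timestamps.

```python
DIVOPT = 2 ** 24  # modulus of the 24-bit ECEB field (Appendix A.1)


def encode_eceb(r_ceb: int) -> int:
    """Data Receiver: value to write into the ECEB field."""
    return r_ceb % DIVOPT


def decode_eceb(s_ceb: int, eceb: int,
                newly_acked_b: int, newly_acked_t: int = 0) -> int:
    """Data Sender: return the updated s.ceb after an AccECN Option arrives.

    If the ACK acknowledges neither new data nor new time it is treated
    as superseded, and s.ceb is returned unchanged.
    """
    if newly_acked_b > 0 or newly_acked_t > 0:
        # Minimum non-negative difference under modulo arithmetic.
        d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
        s_ceb += d_ceb
    return s_ceb
```

With the figures from the worked example in Appendix A.1, decode_eceb(33554433, 1461, 1460) yields 33555893.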
2292 Two variants of the algorithm are given: i) a more conservative 2293 variant for a Data Sender to use if it detects that the AccECN Option 2294 is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a 2295 less conservative variant that is feasible when complementary 2296 information is available from the AccECN Option.

2298 A.2.1. Safety Algorithm without the AccECN Option

2300 It is assumed that each local packet counter is a sufficiently sized 2301 unsigned integer (probably 32b) and that the following constant has 2302 been assigned:

2304    DIVACE = 2^3

2306 Every time an Acceptable CE marked packet arrives (Section 3.2.2.2), 2307 the Data Receiver increments its local value of r.cep by 1. It 2308 repeats the same value of ACE in every subsequent ACK until the next 2309 CE marking arrives, where

2311    ACE = r.cep % DIVACE.

2313 If the Data Sender received an earlier value of the counter that had 2314 been delayed due to ACK reordering, it might incorrectly calculate 2315 that the ACE field had wrapped. Therefore, on the arrival of every 2316 ACK, the Data Sender ensures the ACK has not been superseded using 2317 the TCP acknowledgement number, any SACK options and timestamps (if 2318 available) to calculate newlyAckedB, as in Appendix A.1. If the ACK 2319 has not been superseded, the Data Sender calculates the minimum 2320 difference d.cep between the ACE field and its local s.cep counter, 2321 using modulo arithmetic as follows:

2323    if ((newlyAckedB > 0) || (newlyAckedT > 0))
2324        d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

2326 Section 3.2.2.5 expects the Data Sender to assume that the ACE field 2327 cycled if it is the safest likely case under prevailing conditions. 2328 The 3-bit ACE field in an arriving ACK could have cycled and become 2329 ambiguous to the Data Sender if a row of ACKs goes missing that 2330 covers a stream of data long enough to contain 8 or more CE marks.
2331 We use the word `missing' rather than `lost', because some or all the 2332 missing ACKs might arrive eventually, but out of order. Even if some 2333 of the missing ACKs were piggy-backed on data (i.e. not pure ACKs) 2334 retransmissions will not repair the lost AccECN information, because 2335 AccECN requires retransmissions to carry the latest AccECN counters, 2336 not the original ones. 2338 The phrase `under prevailing conditions' allows for implementation- 2339 dependent interpretation. A Data Sender might take account of the 2340 prevailing size of data segments and the prevailing CE marking rate 2341 just before the sequence of missing ACKs. However, we shall start 2342 with the simplest algorithm, which assumes segments are all full- 2343 sized and ultra-conservatively it assumes that ECN marking was 100% 2344 on the forward path when ACKs on the reverse path started to all be 2345 dropped. Specifically, if newlyAckedB is the amount of data that an 2346 ACK acknowledges since the previous ACK, then the Data Sender could 2347 assume that this acknowledges newlyAckedPkt full-sized segments, 2348 where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the 2349 ACE field incremented by 2351 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 2353 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 2354 size segments than any previous ACK, and that ACE increments by a 2355 minimum of 2 CE marks (d.cep=2). The above formula works out that it 2356 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 2357 2). However, if ACE increases by a minimum of 2 but acknowledges 10 2358 full-sized segments, then it would be necessary to assume that there 2359 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 2361 ACKs that acknowledge a large stretch of packets might be common in 2362 data centres to achieve a high packet rate or might be due to ACK 2363 thinning by a middlebox. 
In these cases, cycling of the ACE field 2364 would often appear to have been possible, so the above algorithm 2365 would be over-conservative, leading to a false high marking rate and 2366 poor performance. Therefore it would be reasonable to only use 2367 dSafer.cep rather than d.cep if the moving average of newlyAckedPkt 2368 was well below 8. 2370 Implementers could build in more heuristics to estimate prevailing 2371 average segment size and prevailing ECN marking. For instance, 2372 newlyAckedPkt in the above formula could be replaced with 2373 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 2374 segment size and p is the prevailing ECN marking probability. 2375 However, ultimately, if TCP's ECN feedback becomes inaccurate it 2376 still has loss detection to fall back on. Therefore, it would seem 2377 safe to implement a simple algorithm, rather than a perfect one. 2379 The simple algorithm for dSafer.cep above requires no monitoring of 2380 prevailing conditions and it would still be safe if, for example, 2381 segments were on average at least 5% of full-sized as long as ECN 2382 marking was 5% or less. Assuming it was used, the Data Sender would 2383 increment its packet counter as follows: 2385 s.cep += dSafer.cep 2387 If missing acknowledgement numbers arrive later (due to reordering), 2388 Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the 2389 effect of any action it took based on a conservative assumption that 2390 it later found to be incorrect". To do this, the Data Sender would 2391 have to store the values of all the relevant variables whenever it 2392 made assumptions, so that it could re-evaluate them later. Given 2393 this could become complex and it is not required, we do not attempt 2394 to provide an example of how to do this. 2396 A.2.2. 
Safety Algorithm with the AccECN Option

2398 When the AccECN Option is available on the ACKs before and after the 2399 possible sequence of ACK losses, if the Data Sender only needs CE- 2400 marked bytes, it will have sufficient information in the AccECN 2401 Option without needing to process the ACE field. If for some reason 2402 it needs CE-marked packets, if dSafer.cep is different from d.cep, it 2403 can determine whether d.cep is likely to be a safe enough estimate by 2404 checking whether the average marked segment size (s = d.ceb/d.cep) is 2405 less than the MSS (where d.ceb is the amount of newly CE-marked bytes 2406 - see Appendix A.1). Specifically, it could use the following 2407 algorithm:

2409    SAFETY_FACTOR = 2
2410    if (dSafer.cep > d.cep) {
2411        if (d.ceb <= MSS * d.cep) {  % Same as (s <= MSS), but no DBZ
2412            sSafer = d.ceb/dSafer.cep
2413            if (sSafer < MSS/SAFETY_FACTOR)
2414                dSafer.cep = d.cep   % d.cep is a safe enough estimate
2415        } % else
2416            % No need for else; dSafer.cep is already correct,
2417            % because d.cep must have been too small
2418    }

2420 The chart below shows when the above algorithm will consider d.cep 2421 can replace dSafer.cep as a safe enough estimate of the number of CE- 2422 marked packets:

2424                      ^
2425                sSafer|
2426                      |
2427                   MSS+
2428                      |
2429                      |         dSafer.cep
2430                      |                is
2431     MSS/SAFETY_FACTOR+--------------+ safest
2432                      |              |
2433                      | d.cep is safe|
2434                      |    enough    |
2435                      +-------------------->
2436                                    MSS    s

2438 The following examples give the reasoning behind the algorithm, 2439 assuming MSS=1460 [B]:

2441 o if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and 2442 sSafer=182.5. 2443 Therefore even though the average size of 8 data segments is 2444 unlikely to have been as small as MSS/8, d.cep cannot have been 2445 correct, because it would imply an average segment size greater 2446 than the MSS.

2448 o if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and 2449 sSafer=146.
2450 Therefore d.cep is safe enough, because the average size of 10 2451 data segments is unlikely to have been as small as MSS/10. 2453 o if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and 2454 sSafer=680. 2456 Therefore d.cep is safe enough, because the average data segment 2457 size is more likely to have been just less than one MSS, rather 2458 than below MSS/2. 2460 If pure ACKs were allowed to be ECN-capable, missing ACKs would be 2461 far less likely. However, because [RFC3168] currently precludes 2462 this, the above algorithm assumes that pure ACKs are not ECN-capable. 2464 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets 2466 If the AccECN Option is not available, the Data Sender can only 2467 decode CE-marking from the ACE field in packets. Every time an ACK 2468 arrives, to convert this into an estimate of CE-marked bytes, it 2469 needs an average of the segment size, s_ave. Then it can add or 2470 subtract s_ave from the value of d.ceb as the value of d.cep 2471 increments or decrements. Some possible ways to calculate s_ave are 2472 outlined below. The precise details will depend on why an estimate 2473 of marked bytes is needed. 2475 The implementation could keep a record of the byte numbers of all the 2476 boundaries between packets in flight (including control packets), and 2477 recalculate s_ave on every ACK. However it would be simpler to 2478 merely maintain a counter packets_in_flight for the number of packets 2479 in flight (including control packets), which is reset once per RTT. 2480 Either way, it would estimate s_ave as: 2482 s_ave ~= flightsize / packets_in_flight, 2484 where flightsize is the variable that TCP already maintains for the 2485 number of bytes in flight. To avoid floating point arithmetic, it 2486 could right-bit-shift by lg(packets_in_flight), where lg() means log 2487 base 2. 
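The estimate of s_ave above can be sketched in Python as follows, once with integer division and once with the right-bit-shift shortcut. This is a sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, lg() is taken as the floor of log base 2, and an empty flight is assumed to yield an estimate of zero.

```python
def s_ave_exact(flightsize: int, packets_in_flight: int) -> int:
    """Average segment size in bytes, using integer division."""
    if packets_in_flight <= 0:
        return 0  # assumption: nothing in flight, so no estimate
    return flightsize // packets_in_flight


def s_ave_shift(flightsize: int, packets_in_flight: int) -> int:
    """Approximate average segment size using a right bit-shift.

    Shifts by floor(lg(packets_in_flight)), so the result is exact only
    when packets_in_flight is a power of two.
    """
    if packets_in_flight <= 0:
        return 0  # assumption: nothing in flight, so no estimate
    lg = packets_in_flight.bit_length() - 1  # floor(log2(n)) for n >= 1
    return flightsize >> lg
```

Because the shift rounds lg() down, s_ave_shift overestimates the average segment size whenever packets_in_flight is not a power of two. Whether that bias is acceptable depends on why the estimate of marked bytes is needed.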
2489 An alternative would be to maintain an exponentially weighted moving 2490 average (EWMA) of the segment size:

2492    s_ave = a * s + (1-a) * s_ave,

2494 where a is the decay constant for the EWMA. However, then it is 2495 necessary to choose a good value for this constant, which ought to 2496 depend on the number of packets in flight. Also the decay constant 2497 needs to be a power of two to avoid floating point arithmetic.

2499 A.4. Example Algorithm to Beacon AccECN Options

2501 Section 3.2.3.3 requires a Data Receiver to beacon a full-length 2502 AccECN Option at least 3 times per RTT. This could be implemented by 2503 maintaining a variable to store the number of ACKs (pure and data 2504 ACKs) since a full AccECN Option was last sent and another for the 2505 approximate number of ACKs sent in the last round trip time:

2507    if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
2508        send_full_AccECN_Option()

2510 For optimized integer arithmetic, BEACON_FREQ = 4 could be used, 2511 rather than 3, so that the division could be implemented as an 2512 integer right bit-shift by lg(BEACON_FREQ).

2514 In certain operating systems, it might be too complex to maintain 2515 acks_in_round. In others it might be possible by tagging each data 2516 segment in the retransmit buffer with the number of ACKs sent at the 2517 point that segment was sent. This would not work well if the Data 2518 Receiver was not sending data itself, in which case it might be 2519 necessary to beacon based on time instead, as follows:

2521    if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
2522        send_full_AccECN_Option()

2524 This time-based approach does not work well when all the ACKs are 2525 sent early in each round trip, as is the case during slow-start. In 2526 this case few options will be sent (possibly even fewer than 3 per RTT).
2527 However, when continuously sending data, data packets as well as ACKs 2528 will spread out equally over the RTT and sufficient ACKs with the 2529 AccECN option will be sent. 2531 A.5. Example Algorithm to Count Not-ECT Bytes 2533 A Data Sender in AccECN mode can infer the amount of TCP payload data 2534 arriving at the receiver marked Not-ECT from the difference between 2535 the amount of newly ACKed data and the sum of the bytes with the 2536 other three markings, d.ceb, d.e0b and d.e1b. Note that, because 2537 r.e0b is initialized to 1 and the other two counters are initialized 2538 to 0, the initial sum will be 1, which matches the initial offset of 2539 the TCP sequence number on completion of the 3WHS. 2541 For this approach to be precise, it has to be assumed that spurious 2542 (unnecessary) retransmissions do not lead to double counting. This 2543 assumption is currently correct, given that RFC 3168 requires that 2544 the Data Sender marks retransmitted segments as Not-ECT. However, 2545 the converse is not true; necessary retransmissions will result in 2546 under-counting. 2548 However, such precision is unlikely to be necessary. The only known 2549 use of a count of Not-ECT marked bytes is to test whether equipment 2550 on the path is clearing the ECN field (perhaps due to an out-dated 2551 attempt to clear, or bleach, what used to be the ToS field). To 2552 detect bleaching it will be sufficient to detect whether nearly all 2553 bytes arrive marked as Not-ECT. Therefore there should be no need to 2554 keep track of the details of retransmissions. 2556 Appendix B. Rationale for Usage of TCP Header Flags 2558 B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake 2560 AccECN uses a rather unorthodox approach to negotiate the highest 2561 version TCP ECN feedback scheme that both ends support, as justified 2562 below. 
It follows from the original TCP ECN capability negotiation 2563 [RFC3168], in which the client set the 2 least significant of the 2564 original reserved flags in the TCP header, and fell back to no ECN 2565 support if the server responded with the 2 flags cleared, which had 2566 previously been the default. 2568 ECN originally used header flags rather than a TCP option because it 2569 was considered more efficient to use a header flag for 1 bit of 2570 feedback per ACK, and this bit could be overloaded to indicate 2571 support for ECN during the handshake. During the development of ECN, 2572 1 bit crept up to 2, in order to deliver the feedback reliably and to 2573 work round some broken hosts that reflected the reserved flags during 2574 the handshake. 2576 In order to be backward compatible with RFC 3168, AccECN continues 2577 this approach, using the 3rd least significant TCP header flag that 2578 had previously been allocated for the ECN nonce (now historic). 2579 Then, whatever form of server an AccECN client encounters, the 2580 connection can fall back to the highest version of feedback protocol 2581 that both ends support, as explained in Section 3.1. 2583 If AccECN had used the more orthodox approach of a TCP option, it 2584 would still have had to set the two ECN flags in the main TCP header, 2585 in order to be able to fall back to Classic RFC 3168 ECN, or to 2586 disable ECN support, without another round of negotiation. Then 2587 AccECN would also have had to handle all the different ways that 2588 servers currently respond to settings of the ECN flags in the main 2589 TCP header, including all the conflicting cases where a server might 2590 have said it supported one approach in the flags and another approach 2591 in the new TCP option. And AccECN would have had to deal with all 2592 the additional possibilities where a middlebox might have mangled the 2593 ECN flags, or removed the TCP option. 
Thus, usage of the 3rd 2594 reserved TCP header flag simplified the protocol. 2596 The third flag was used in a way that could be distinguished from the 2597 ECN nonce, in case any nonce deployment was encountered. Previous 2598 usage of this flag for the ECN nonce was integrated into the original 2599 ECN negotiation. This further justified the 3rd flag's use for 2600 AccECN, because a non-ECN usage of this flag would have had to use it 2601 as a separate single bit, rather than in combination with the other 2 2602 ECN flags. 2604 Indeed, having overloaded the original uses of these three flags for 2605 its handshake, AccECN overloads all three bits again as a 3-bit 2606 counter. 2608 B.2. Four Codepoints in the SYN/ACK 2610 Of the 8 possible codepoints that the 3 TCP header flags can indicate 2611 on the SYN/ACK, 4 already indicated earlier (or broken) versions of 2612 ECN support. In the early design of AccECN, an AccECN server could 2613 use only 2 of the 4 remaining codepoints. They both indicated AccECN 2614 support, but one fed back that the SYN had arrived marked as CE. 2615 Even though ECN support on a SYN is not yet on the standards track, 2616 the idea is for either end to act as a dumb reflector, so that future 2617 capabilities can be unilaterally deployed without requiring 2-ended 2618 deployment (justified in Section 2.5). 2620 During traversal testing it was discovered that the ECN field in the 2621 SYN was mangled on a non-negligible proportion of paths. Therefore 2622 it was necessary to allow the SYN/ACK to feed all four IP/ECN 2623 codepoints that the SYN could arrive with back to the client. 2624 Without this, the client could not know whether to disable ECN for 2625 the connection due to mangling of the IP/ECN field (also explained in 2626 Section 2.5). This development consumed the remaining 2 codepoints 2627 on the SYN/ACK that had been reserved for future use by AccECN in 2628 earlier versions. 2630 B.3. 
Space for Future Evolution 2632 Although usable TCP header space is extremely 2633 scarce, the AccECN protocol has taken all possible steps to ensure 2634 that there is space to negotiate possible future variants of the 2635 protocol, either if the experiment proves that a variant of AccECN is 2636 required, or if a completely different ECN feedback approach is 2637 needed: 2639 Future AccECN variants: When the AccECN capability is negotiated 2640 during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and 2641 'Broken' in the column for the capability of node B are unused by 2642 any current protocol in the RFC series. These could be used by 2643 TCP servers in future to indicate a variant of the AccECN 2644 protocol. In recent measurement studies in which the response of 2645 large numbers of servers to an AccECN SYN has been tested, e.g. 2646 [Mandalari18], a very small number of SYN/ACKs arrive with the 2647 pattern tagged as 'Nonce', and a small but more significant number 2648 arrive with the pattern tagged as 'Broken'. The 'Nonce' pattern 2649 could be a sign that a few servers have implemented the ECN Nonce 2650 [RFC3540], which has now been reclassified as historic [RFC8311], 2651 or it could be the random result of some unknown middlebox 2652 behaviour. The greater prevalence of the 'Broken' pattern 2653 suggests that some instances still exist of the broken code that 2654 reflects the reserved flags on the SYN. 2656 The requirement not to reject unexpected initial values of the ACE 2657 counter (in the main TCP header) in the last paragraph of 2658 Section 3.2.2.3 ensures that 3 unused codepoints on the ACK of the 2659 SYN/ACK, 6 unused values on the first SYN=0 data packet from the 2660 client and 7 unused values on the first SYN=0 data packet from the 2661 server could be used to declare future variants of the AccECN 2662 protocol.
The word 'declare' is used rather than 'negotiate' 2663 because, at this late stage in the 3WHS, it would be too late for 2664 a negotiation between the endpoints to be completed. A similar 2665 requirement not to reject unexpected initial values in the TCP 2666 option (Section 3.2.3.2.4) is for the same purpose. If traversal 2667 of the TCP option were reliable, this would have enabled a far 2668 wider range of future variation of the whole AccECN protocol. 2669 Nonetheless, it could be used to reliably negotiate a wide range 2670 of variation in the semantics of the AccECN Option. 2672 Future non-AccECN variants: Five codepoints out of the 8 possible in 2673 the 3 TCP header flags used by AccECN are unused on the initial 2674 SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110. 2675 Section 3.1.3 ensures that the installed base of AccECN servers 2676 will all assume these are equivalent to AccECN negotiation with 2677 111 on the SYN. These codepoints would not allow fall-back to 2678 Classic ECN support for a server that did not understand them, but 2679 this approach ensures they are available in future, perhaps for 2680 uses other than ECN alongside the AccECN scheme. All possible 2681 combinations of SYN/ACK could be used in response except either 2682 000 or reflection of the same values sent on the SYN. 2684 Of course, other ways could be resorted to in order to extend 2685 AccECN or ECN in future, although their traversal properties are 2686 likely to be inferior. They include a new TCP option; using the 2687 remaining reserved flags in the main TCP header (preferably 2688 extending the 3-bit combinations used by AccECN to 4-bit 2689 combinations, rather than burning one bit for just one state); a 2690 non-zero urgent pointer in combination with the URG flag cleared; 2691 or some other unexpected combination of fields yet to be invented. 
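The count of five unused SYN codepoints given under 'Future non-AccECN variants' can be checked with a short, non-normative Python sketch. The table SYN_IN_USE and the function unused_syn_codepoints are hypothetical names introduced only for this illustration. The table is an assumption assembled from the text of this appendix: 000 for a SYN with no ECN support, 011 (CWR and ECE set) for Classic ECN negotiation [RFC3168], and 111 for AccECN negotiation.

```python
# Flag order follows the text: (AE, CWR, ECE), most significant bit first.
# Assumed from this appendix: the three codepoints already used on a SYN.
SYN_IN_USE = {
    "000": "no ECN support",
    "011": "Classic ECN negotiation [RFC3168]",
    "111": "AccECN negotiation",
}


def unused_syn_codepoints() -> list:
    """Return the 3-bit SYN codepoints that have no assigned meaning."""
    all_codepoints = [format(value, "03b") for value in range(8)]
    return [cp for cp in all_codepoints if cp not in SYN_IN_USE]


unused = unused_syn_codepoints()
```

The result lists the codepoints in ascending binary order, which is the same order in which they are listed in the text above.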
2693 Authors' Addresses 2695 Bob Briscoe 2696 Independent 2697 UK 2699 EMail: ietf@bobbriscoe.net 2700 URI: http://bobbriscoe.net/ 2702 Mirja Kuehlewind 2703 Ericsson 2704 Germany 2706 EMail: ietf@kuehlewind.net 2708 Richard Scheffenegger 2709 NetApp 2710 Vienna 2711 Austria 2713 EMail: Richard.Scheffenegger@netapp.com