idnits 2.17.1

draft-ietf-tcpm-accurate-ecn-14.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC3168, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC3449, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The exact meaning of the all-uppercase expression 'MAY NOT' is not
     defined in RFC 2119. If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  == The expression 'MAY NOT', while looking like RFC 2119 requirements text,
     is not defined in RFC 2119, and should not be used. Consider using 'MUST
     NOT' instead (if that is what you mean).

     Found 'MAY NOT' in this paragraph:

     A host MAY NOT include an AccECN Option in any of these three cases if
     it has cached knowledge that the packet would be likely to be blocked on
     the path to the other host if it included an AccECN Option.

  (Using the creation date from RFC3168, updated by this document, for
  RFC5378 checks: 2000-11-17)

  (Using the creation date from RFC3449, updated by this document, for
  RFC5378 checks: 1999-10-04)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 22, 2021) is 1160 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'B' is mentioned on line 2450, but not defined

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-11) exists of
     draft-ietf-tcpm-2140bis-07

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tcpm-generalized-ecn-06

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-08

  -- Obsolete informational reference (is this intentional?): RFC 6824
     (Obsoleted by RFC 8684)

     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2  TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
3  Internet-Draft                                               Independent
4  Updates: 3168, 3449 (if approved)                          M. Kuehlewind
5  Intended status: Standards Track                                Ericsson
6  Expires: August 26, 2021                                R. Scheffenegger
7                                                                    NetApp
8                                                         February 22, 2021

10                      More Accurate ECN Feedback in TCP
11                       draft-ietf-tcpm-accurate-ecn-14

13  Abstract

15     Explicit Congestion Notification (ECN) is a mechanism where network
16     nodes can mark IP packets instead of dropping them to indicate
17     incipient congestion to the end-points. Receivers with an ECN-
18     capable transport protocol feed back this information to the sender.
19     ECN is specified for TCP in such a way that only one feedback signal
20     can be transmitted per Round-Trip Time (RTT). Recent new TCP
21     mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP)
22     or Low Latency Low Loss Scalable Throughput (L4S) need more accurate
23     ECN feedback information whenever more than one marking is received
24     in one RTT. This document specifies a scheme to provide more than
25     one feedback signal per RTT in the TCP header. Given TCP header
26     space is scarce, it allocates a reserved header bit, that was
27     previously used for the ECN-Nonce which has now been declared
28     historic. It also overloads the two existing ECN flags in the TCP
29     header. The resulting extra space is exploited to feed back the IP-
30     ECN field received during the 3-way handshake as well. Supplementary
31     feedback information can optionally be provided in a new TCP option,
32     which is never used on the TCP SYN.

34  Status of This Memo

36     This Internet-Draft is submitted in full conformance with the
37     provisions of BCP 78 and BCP 79.

39     Internet-Drafts are working documents of the Internet Engineering
40     Task Force (IETF). Note that other groups may also distribute
41     working documents as Internet-Drafts. The list of current Internet-
42     Drafts is at https://datatracker.ietf.org/drafts/current/.

44     Internet-Drafts are draft documents valid for a maximum of six months
45     and may be updated, replaced, or obsoleted by other documents at any
46     time. It is inappropriate to use Internet-Drafts as reference
47     material or to cite them other than as "work in progress."

48     This Internet-Draft will expire on August 26, 2021.

50  Copyright Notice

52     Copyright (c) 2021 IETF Trust and the persons identified as the
53     document authors. All rights reserved.

55     This document is subject to BCP 78 and the IETF Trust's Legal
56     Provisions Relating to IETF Documents
57     (https://trustee.ietf.org/license-info) in effect on the date of
58     publication of this document. Please review these documents
59     carefully, as they describe your rights and restrictions with respect
60     to this document. Code Components extracted from this document must
61     include Simplified BSD License text as described in Section 4.e of
62     the Trust Legal Provisions and are provided without warranty as
63     described in the Simplified BSD License.

65  Table of Contents

67     1.  Introduction
68       1.1.  Document Roadmap
69       1.2.  Goals
70       1.3.  Terminology
71       1.4.  Recap of Existing ECN feedback in IP/TCP
72     2.  AccECN Protocol Overview and Rationale
73       2.1.  Capability Negotiation
74       2.2.  Feedback Mechanism
75       2.3.  Delayed ACKs and Resilience Against ACK Loss
76       2.4.  Feedback Metrics
77       2.5.  Generic (Dumb) Reflector
78     3.  AccECN Protocol Specification
79       3.1.  Negotiating to use AccECN
80         3.1.1.  Negotiation during the TCP handshake
81         3.1.2.  Backward Compatibility
82         3.1.3.  Forward Compatibility
83         3.1.4.  Retransmission of the SYN
84         3.1.5.  Implications of AccECN Mode
85       3.2.  AccECN Feedback
86         3.2.1.  Initialization of Feedback Counters
87         3.2.2.  The ACE Field
88         3.2.3.  The AccECN Option
89       3.3.  AccECN Compliance Requirements for TCP Proxies, Offload
90             Engines and other Middleboxes
91         3.3.1.  Requirements for TCP Proxies
92         3.3.2.  Requirements for TCP Normalizers
93         3.3.3.  Requirements for TCP ACK Filtering
94         3.3.4.  Requirements for TCP Segmentation Offload
95     4.  Updates to RFC 3168
96     5.  Interaction with TCP Variants
97       5.1.  Compatibility with SYN Cookies
98       5.2.  Compatibility with TCP Experiments and Common TCP Options
99       5.3.  Compatibility with Feedback Integrity Mechanisms
100    6.  Protocol Properties
101    7.  IANA Considerations
102    8.  Security Considerations
103    9.  Acknowledgements
104    10. Comments Solicited
105    11. References
106      11.1.  Normative References
107      11.2.  Informative References
108    Appendix A.  Example Algorithms
109      A.1.  Example Algorithm to Encode/Decode the AccECN Option
110      A.2.  Example Algorithm for Safety Against Long Sequences of
111            ACK Loss
112        A.2.1.  Safety Algorithm without the AccECN Option
113        A.2.2.  Safety Algorithm with the AccECN Option
114      A.3.  Example Algorithm to Estimate Marked Bytes from Marked
115            Packets
116      A.4.  Example Algorithm to Beacon AccECN Options
117      A.5.  Example Algorithm to Count Not-ECT Bytes
118    Appendix B.  Rationale for Usage of TCP Header Flags
119      B.1.  Three TCP Header Flags in the SYN-SYN/ACK Handshake
120      B.2.  Four Codepoints in the SYN/ACK
121      B.3.  Space for Future Evolution
122    Authors' Addresses

124  1.  Introduction

126     Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
127     network nodes can mark IP packets instead of dropping them to
128     indicate incipient congestion to the end-points. Receivers with an
129     ECN-capable transport protocol feed back this information to the
130     sender. In RFC 3168, ECN was specified for TCP in such a way that
131     only one feedback signal could be transmitted per Round-Trip Time
132     (RTT). Recently, proposed mechanisms like Congestion Exposure (ConEx
133     [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need to
134     know when more than one marking is received in one RTT which is
135     information that cannot be provided by the feedback scheme as
136     specified in [RFC3168]. This document specifies an update to the ECN
137     feedback scheme of RFC 3168 that provides more accurate information
138     and could be used by these and potentially other future TCP
139     extensions. A fuller treatment of the motivation for this
140     specification is given in the associated requirements document
141     [RFC7560].

143     This document specifies a standards track scheme for ECN feedback in
144     the TCP header to provide more than one feedback signal per RTT. It
145     will be called the more accurate ECN feedback scheme, or AccECN for
146     short. This document updates RFC 3168 with respect to negotiation
147     and use of the feedback scheme for TCP. All aspects of RFC 3168
148     other than the TCP feedback scheme, in particular the definition of
149     ECN at the IP layer, remain unchanged by this specification.
150     Section 4 gives a more detailed specification of exactly which
151     aspects of RFC 3168 this document updates.

153     AccECN is intended to be a complete replacement for classic TCP/ECN
154     feedback, not a fork in the design of TCP. AccECN feedback
155     complements TCP's loss feedback and it can coexist alongside
156     'classic' [RFC3168] TCP/ECN feedback. So its applicability is
157     intended to include all public and private IP networks (and even any
158     non-IP networks over which TCP is used today), whether or not any
159     nodes on the path support ECN, of whatever flavour. This document
160     uses the term Classic ECN when it needs to distinguish the RFC 3168
161     ECN TCP feedback scheme from the AccECN TCP feedback scheme.

163     AccECN feedback overloads the two existing ECN flags in the TCP
164     header and allocates the currently reserved flag (previously called
165     NS) in the TCP header, to be used as one three-bit counter field
166     indicating the number of congestion experienced marked packets.
167     Given the new definitions of these three bits, both ends have to
168     support the new wire protocol before it can be used. Therefore
169     during the TCP handshake the two ends use these three bits in the TCP
170     header to negotiate the most advanced feedback protocol that they can
171     both support, in a way that is backward compatible with [RFC3168].

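As an informal illustration (not part of the specification), the three header bits referred to above could be read from a raw TCP header roughly as sketched below in Python. The function names are hypothetical. The bit positions follow the flags layout of Figure 1, and combining the three bits into one number with AE as the most significant bit is simply the order in which they sit on the wire.

```python
import struct


def ecn_flag_bits(tcp_header: bytes) -> tuple:
    """Return (AE, CWR, ECE) from a raw TCP header (hypothetical helper)."""
    # Bytes 12-13 (counting from 0) hold the 4-bit header length,
    # 3 reserved bits, and then the flags starting with AE (formerly NS).
    (word,) = struct.unpack_from("!H", tcp_header, 12)
    ae = (word >> 8) & 1
    cwr = (word >> 7) & 1
    ece = (word >> 6) & 1
    return ae, cwr, ece


def three_bit_field(tcp_header: bytes) -> int:
    """Treat AE, CWR and ECE as a single 3-bit number, AE first."""
    ae, cwr, ece = ecn_flag_bits(tcp_header)
    return (ae << 2) | (cwr << 1) | ece


# A 20-byte header with header length 5 and AE, CWR, ECE and SYN all set.
syn = bytearray(20)
syn[12] = 0x51  # header length 5 (0x50) plus the AE bit (0x01)
syn[13] = 0xC2  # CWR (0x80) + ECE (0x40) + SYN (0x02)
print(ecn_flag_bits(bytes(syn)), three_bit_field(bytes(syn)))
```

Reading the bits as one field rather than three independent flags mirrors the text above: once both ends have negotiated the new protocol, the bits no longer carry their individual classic meanings.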
173     AccECN is solely a change to the TCP wire protocol; it covers the
174     negotiation and signaling of more accurate ECN feedback from a TCP
175     Data Receiver to a Data Sender. It is completely independent of how
176     TCP might respond to congestion feedback, which is out of scope, but
177     ultimately the motivation for accurate ECN feedback. Like Classic
178     ECN feedback, AccECN can be used by standard Reno congestion control
179     [RFC5681] to respond to the existence of at least one congestion
180     notification within a round trip. Or, unlike Reno, AccECN can be
181     used to respond to the extent of congestion notification over a round
182     trip, as for example DCTCP does in controlled environments [RFC8257].
183     For congestion response, this specification refers to RFC 3168, or
184     ECN experiments such as those referred to in [RFC8311], namely: a
185     TCP-based Low Latency Low Loss Scalable (L4S) congestion control
186     [I-D.ietf-tsvwg-l4s-arch]; or Alternative Backoff with ECN (ABE)
187     [RFC8511].

189     It is recommended that the AccECN protocol is implemented alongside
190     SACK [RFC2018] and the experimental ECN++ protocol
192     [I-D.ietf-tcpm-generalized-ecn], which allows the ECN capability to
193     be used on TCP control packets. Therefore, this specification does
194     not discuss implementing AccECN alongside [RFC5562], which was an
195     earlier experimental protocol with narrower scope than ECN++.

197  1.1.  Document Roadmap

199     The following introductory section outlines the goals of AccECN
200     (Section 1.2). Then terminology is defined (Section 1.3) and a recap
201     of existing prerequisite technology is given (Section 1.4).

203     Section 2 gives an informative overview of the AccECN protocol. Then
204     Section 3 gives the normative protocol specification, and Section 4
205     clarifies which aspects of RFC 3168 are updated by this
206     specification. Section 5 assesses the interaction of AccECN with
207     commonly used variants of TCP, whether standardized or not.
208     Section 6 summarizes the features and properties of AccECN.

210     Section 7 summarizes the protocol fields and numbers that IANA will
211     need to assign and Section 8 points to the aspects of the protocol
212     that will be of interest to the security community.

214     Appendix A gives pseudocode examples for the various algorithms that
215     AccECN uses and Appendix B explains why AccECN uses flags in the main
216     TCP header and quantifies the space left for future use.

218  1.2.  Goals

220     [RFC7560] enumerates requirements that a candidate feedback scheme
221     will need to satisfy, under the headings: resilience, timeliness,
222     integrity, accuracy (including ordering and lack of bias),
223     complexity, overhead and compatibility (both backward and forward).
224     It recognizes that a perfect scheme that fully satisfies all the
225     requirements is unlikely and trade-offs between requirements are
226     likely. Section 6 presents the properties of AccECN against these
227     requirements and discusses the trade-offs made.

229     The requirements document recognizes that a protocol as ubiquitous as
230     TCP needs to be able to serve as-yet-unspecified requirements.
231     Therefore an AccECN receiver aims to act as a generic (dumb)
232     reflector of congestion information so that in future new sender
233     behaviours can be deployed unilaterally.

235  1.3.  Terminology

237     AccECN:  The more accurate ECN feedback scheme will be called AccECN
238        for short.

240     Classic ECN:  the ECN protocol specified in [RFC3168].

242     Classic ECN feedback:  the feedback aspect of the ECN protocol
243        specified in [RFC3168], including generation, encoding,
244        transmission and decoding of feedback, but not the Data Sender's
245        subsequent response to that feedback.

247     ACK:  A TCP acknowledgement, with or without a data payload (ACK=1).

249     Pure ACK:  A TCP acknowledgement without a data payload.

251     Acceptable packet / segment:  A packet or segment that passes the
252        acceptability tests in [RFC0793] and [RFC5961].

254     TCP client:  The TCP stack that originates a connection.

256     TCP server:  The TCP stack that responds to a connection request.

258     Data Receiver:  The endpoint of a TCP half-connection that receives
259        data and sends AccECN feedback.

261     Data Sender:  The endpoint of a TCP half-connection that sends data
262        and receives AccECN feedback.

264     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
265     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
266     document are to be interpreted as described in BCP 14 [RFC2119]
267     [RFC8174] when, and only when, they appear in all capitals, as shown
268     here.

270  1.4.  Recap of Existing ECN feedback in IP/TCP

272     ECN [RFC3168] uses two bits in the IP header. Once ECN has been
273     negotiated with the receiver at the transport layer, an ECN sender
274     can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
275     to indicate an ECN-capable transport (ECT). If both ECN bits are
276     zero, the packet is considered to have been sent by a Not-ECN-capable
277     Transport (Not-ECT). When a network node experiences congestion, it
278     will occasionally either drop or mark a packet, with the choice
279     depending on the packet's ECN codepoint. If the codepoint is Not-
280     ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
281     the node can mark the packet by setting both ECN bits, which is
282     termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
283     Table 1 summarises these codepoints.

285     +------------------+----------------+---------------------------+
286     | IP-ECN codepoint | Codepoint name | Description               |
287     +------------------+----------------+---------------------------+
288     | 0b00             | Not-ECT        | Not ECN-Capable Transport |
289     | 0b01             | ECT(1)         | ECN-Capable Transport (1) |
290     | 0b10             | ECT(0)         | ECN-Capable Transport (0) |
291     | 0b11             | CE             | Congestion Experienced    |
292     +------------------+----------------+---------------------------+

294                  Table 1: The ECN Field in the IP Header

296     In the TCP header the first two bits in byte 14 are defined as flags
297     for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
298     indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
299     ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
300     the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
301     Data Receiver starts to set the Echo Congestion Experienced (ECE)
302     flag continuously in the TCP header of ACKs, which ensures the signal
303     is received reliably even if ACKs are lost. The TCP sender confirms
304     that it has received at least one ECE signal by responding with the
305     congestion window reduced (CWR) flag, which allows the TCP receiver
306     to stop repeating the ECN-Echo flag. This always leads to a full RTT
307     of ACKs with ECE set. Thus any additional CE markings arriving
308     within this RTT cannot be fed back.

310     The last bit in byte 13 of the TCP header was defined as the Nonce
311     Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
312     deployment RFC 3540 has been reclassified as historic [RFC8311] and
313     the respective flag has been marked as "reserved", making this TCP
314     flag available for use by the AccECN experiment instead.

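The classic echo behaviour recapped above can be illustrated with a small Python sketch. It is a deliberately simplified model, assuming one ACK per data packet and ignoring everything except the ECE latch; the class and method names are hypothetical. It shows why several CE marks within the same round trip are indistinguishable from a single one at the sender.

```python
from enum import IntEnum


class IPECN(IntEnum):
    """The four IP-ECN codepoints of Table 1."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


class ClassicEcnReceiver:
    """Simplified RFC 3168 style echo: latch ECE until CWR arrives."""

    def __init__(self) -> None:
        self.echo = False

    def on_data(self, ip_ecn: IPECN, cwr: bool = False) -> bool:
        """Process one arriving data packet; return ECE for its ACK."""
        if cwr:
            self.echo = False       # sender has confirmed the signal
        if ip_ecn == IPECN.CE:
            self.echo = True        # (re)start echoing
        return self.echo


rx = ClassicEcnReceiver()
arrivals = [IPECN.ECT_0, IPECN.CE, IPECN.CE, IPECN.ECT_0, IPECN.CE]
ece_on_acks = [rx.on_data(mark) for mark in arrivals]
print(ece_on_acks)
```

Three CE marks arrive, but every ACK after the first mark carries the same ECE=1, so the sender learns only that at least one mark occurred.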
316       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
317     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
318     |               |           | N | C | E | U | A | P | R | S | F |
319     | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
320     |               |           |   | R | E | G | K | H | T | N | N |
321     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

323      Figure 1: The (post-ECN Nonce) definition of the TCP header flags

325  2.  AccECN Protocol Overview and Rationale

327     This section provides an informative overview of the AccECN protocol
328     that will be normatively specified in Section 3.

330     Like the original TCP approach, the Data Receiver of each TCP half-
331     connection sends AccECN feedback to the Data Sender on TCP
332     acknowledgements, reusing data packets of the other half-connection
333     whenever possible.

335     The AccECN protocol has had to be designed in two parts:

337     o  an essential part that re-uses ECN TCP header bits to feed back
338        the number of arriving CE marked packets. This provides more
339        accuracy than classic ECN feedback, but limited resilience against
340        ACK loss;

342     o  a supplementary part using a new AccECN TCP Option that provides
343        additional feedback on the number of bytes that arrive marked with
344        each of the three ECN codepoints (not just CE marks). This
345        provides greater resilience against ACK loss than the essential
346        feedback, but it is more likely to suffer from middlebox
347        interference.

349     The two part design was necessary, given limitations on the space
350     available for TCP options and given the possibility that certain
351     incorrectly designed middleboxes prevent TCP using any new options.

353     The essential part overloads the previous definition of the three
354     flags in the TCP header that had been assigned for use by ECN. This
355     design choice deliberately replaces the classic ECN feedback
356     protocol, rather than leaving classic ECN feedback intact and adding
357     more accurate feedback separately because:

359     o  this efficiently reuses scarce TCP header space, given TCP option
360        space is approaching saturation;

362     o  a single upgrade path for the TCP protocol is preferable to a fork
363        in the design;

365     o  otherwise classic and accurate ECN feedback could give conflicting
366        feedback on the same segment, which could open up new security
367        concerns and make implementations unnecessarily complex;

369     o  middleboxes are more likely to faithfully forward the TCP ECN
370        flags than newly defined areas of the TCP header.

372     AccECN is designed to work even if the supplementary part is removed
373     or zeroed out, as long as the essential part gets through.

375  2.1.  Capability Negotiation

377     AccECN is a change to the wire protocol of the main TCP header,
378     therefore it can only be used if both endpoints have been upgraded to
379     understand it. The TCP client signals support for AccECN on the
380     initial SYN of a connection and the TCP server signals whether it
381     supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
382     client uses to signal AccECN support have been carefully chosen so
383     that a TCP server will interpret them as a request to support the
384     most recent variant of ECN feedback that it supports. Then the
385     client falls back to the same variant of ECN feedback.

387     An AccECN TCP client does not send the new AccECN Option on the SYN
388     as SYN option space is limited. The TCP server sends the AccECN
389     Option on the SYN/ACK and the client sends it on the first ACK to
390     test whether the network path forwards the option correctly.

392  2.2.  Feedback Mechanism

394     A Data Receiver maintains four counters initialized at the start of
395     the half-connection. Three count the number of arriving payload
396     bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
397     the number of packets arriving marked with a CE codepoint (including
398     control packets without payload if they are CE-marked).

400     The Data Sender maintains four equivalent counters for the half
401     connection, and the AccECN protocol is designed to ensure they will
402     match the values in the Data Receiver's counters, albeit after a
403     little delay.

405     Each ACK carries the three least significant bits (LSBs) of the
406     packet-based CE counter using the ECN bits in the TCP header, now
407     renamed the Accurate ECN (ACE) field (see Figure 3 later). The 24
408     LSBs of each byte counter are carried in the AccECN Option.

410  2.3.  Delayed ACKs and Resilience Against ACK Loss

412     With both the ACE and the AccECN Option mechanisms, the Data Receiver
413     continually repeats the current LSBs of each of its respective
414     counters. There is no need to acknowledge these continually repeated
415     counters, so the congestion window reduced (CWR) mechanism is no
416     longer used. Even if some ACKs are lost, the Data Sender should be
417     able to infer how much to increment its own counters, even if the
418     protocol field has wrapped.

420     The 3-bit ACE field can wrap fairly frequently. Therefore, even if
421     it appears to have incremented by one (say), the field might have
422     actually cycled completely then incremented by one. The Data
423     Receiver is not allowed to delay sending an ACK to such an extent
424     that the ACE field would cycle. However cycling is still a
425     possibility at the Data Sender because a whole sequence of ACKs
426     carrying intervening values of the field might all be lost or delayed
427     in transit.

429     The fields in the AccECN Option are larger, but they will increment
430     in larger steps because they count bytes not packets. Nonetheless,
431     their size has been chosen such that a whole cycle of the field would
432     never occur between ACKs unless there had been an infeasibly long
433     sequence of ACK losses. Therefore, as long as the AccECN Option is
434     available, it can be treated as a dependable feedback channel.

436     If the AccECN Option is not available, e.g. it is being stripped by a
437     middlebox, the AccECN protocol will only feed back information on CE
438     markings (using the ACE field). Although not ideal, this will be
439     sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
440     will ever indicate more severe congestion than CE, even though future
441     uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the
442     3-bit ACE field is so small, when it is the only field available the
443     Data Sender has to interpret it assuming the most likely wrap, but
444     with a degree of conservatism.

446     Certain specified events trigger the Data Receiver to include an
447     AccECN Option on an ACK. The rules are designed to ensure that the
448     order in which different markings arrive at the receiver is
449     communicated to the sender (as long as options are reaching the
450     sender and as long as there is no ACK loss). Implementations are
451     encouraged to send an AccECN Option more frequently, but this is left
452     up to the implementer.

454  2.4.  Feedback Metrics

456     The CE packet counter in the ACE field and the CE byte counter in the
457     AccECN Option both provide feedback on received CE-marks. The CE
458     packet counter includes control packets that do not have payload
459     data, while the CE byte counter solely includes marked payload bytes.
460     If both are present, the byte counter in the option will provide the
461     more accurate information needed for modern congestion control and
462     policing schemes, such as L4S, DCTCP or ConEx. If the option is
463     stripped, a simple algorithm to estimate the number of marked bytes
464     from the ACE field is given in Appendix A.3.

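The counters and their truncated on-the-wire forms described in Sections 2.2 to 2.4 can be sketched as follows in Python. This is an illustration only: the counter names (cep, ceb, e0b, e1b) are labels chosen for this sketch, the counters start from zero purely for simplicity (the actual initial values are the subject of Section 3.2.1), and `ace_delta` is the naive "fewest wraps" reading, not the safety algorithms of Appendix A.2.

```python
from dataclasses import dataclass

ACE_BITS = 3       # width of the ACE field
OPTION_BITS = 24   # width of each byte counter in the AccECN Option


@dataclass
class ReceiverCounters:
    cep: int = 0   # CE-marked packets, including payload-free control packets
    ceb: int = 0   # CE-marked payload bytes
    e0b: int = 0   # ECT(0)-marked payload bytes
    e1b: int = 0   # ECT(1)-marked payload bytes

    def on_packet(self, ip_ecn: int, payload_len: int) -> None:
        """Update the counters for one arriving packet."""
        if ip_ecn == 0b11:      # CE
            self.cep += 1
            self.ceb += payload_len
        elif ip_ecn == 0b10:    # ECT(0)
            self.e0b += payload_len
        elif ip_ecn == 0b01:    # ECT(1)
            self.e1b += payload_len

    def ace(self) -> int:
        """The 3 LSBs of the CE packet counter, as repeated on every ACK."""
        return self.cep % (1 << ACE_BITS)

    def option_fields(self) -> dict:
        """The 24 LSBs of each byte counter (field order not modelled)."""
        mask = (1 << OPTION_BITS) - 1
        return {"ceb": self.ceb & mask,
                "e0b": self.e0b & mask,
                "e1b": self.e1b & mask}


def ace_delta(prev_ace: int, new_ace: int) -> int:
    """Smallest non-negative increment consistent with two ACE values."""
    return (new_ace - prev_ace) % (1 << ACE_BITS)


rx = ReceiverCounters()
for _ in range(10):             # ten CE-marked 1000-byte segments
    rx.on_packet(0b11, 1000)
print(rx.ace(), ace_delta(0, rx.ace()), rx.option_fields()["ceb"])
```

If every intervening ACK were lost, the ten marks above would look like two when judged from the ACE field alone, whereas the 24-bit byte counter still reports 10000 bytes unambiguously. That is the wrap ambiguity Section 2.3 describes and the reason the option is treated as the more dependable channel when it is available.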
466     Feedback in bytes is recommended in order to protect against the
467     receiver using attacks similar to 'ACK-Division' to artificially
468     inflate the congestion window, which is why [RFC5681] now recommends
469     that TCP counts acknowledged bytes not packets.

471  2.5.  Generic (Dumb) Reflector

473     The ACE field provides information about CE markings on both data and
474     control packets. According to [RFC3168] the Data Sender is meant to
475     set control packets to Not-ECT. However, mechanisms in certain
476     private networks (e.g. data centres) set control packets to be ECN
477     capable because they are precisely the packets that performance
478     depends on most.

480     For this reason, AccECN is designed to be a generic reflector of
481     whatever ECN markings it sees, whether or not they are compliant with
482     a current standard. Then as standards evolve, Data Senders can
483     upgrade unilaterally without any need for receivers to upgrade too.
484     It is also useful to be able to rely on generic reflection behaviour
485     when senders need to test for unexpected interference with markings
486     (for instance Section 3.2.2.3, Section 3.2.2.4 and Section 3.2.3.2 of
487     the present document and para 2 of Section 20.2 of [RFC3168]).

489     The initial SYN is the most critical control packet, so AccECN
490     provides feedback on its ECN marking. Although RFC 3168 prohibits an
491     ECN-capable SYN, providing feedback of ECN marking on the SYN
492     supports future scenarios in which SYNs might be ECN-enabled (without
493     prejudging whether they ought to be). For instance, [RFC8311]
494     updates this aspect of RFC 3168 to allow experimentation with ECN-
495     capable TCP control packets.

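The division of labour behind the generic reflector can be sketched in a few lines of Python. The function names are hypothetical, and the sender-side rule is an assumption made for illustration (any change other than an ECT codepoint becoming CE is treated as suspicious); it is not the normative test referred to in Section 3.2.2.3.

```python
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def reflect(ip_ecn_seen: int) -> int:
    """Receiver side: report the arriving codepoint without judging it."""
    return ip_ecn_seen & 0b11


def looks_mangled(ip_ecn_sent: int, ip_ecn_reflected: int) -> bool:
    """Sender side: compare what was sent with what was reflected.

    Assumption for this sketch: an ECT packet arriving as CE is an
    ordinary congestion mark; any other difference suggests interference.
    """
    if ip_ecn_reflected == ip_ecn_sent:
        return False
    if ip_ecn_sent in (ECT_0, ECT_1) and ip_ecn_reflected == CE:
        return False
    return True


# A Not-ECT SYN that arrives as ECT(0) points to interference on the path.
print(looks_mangled(NOT_ECT, reflect(ECT_0)))
```

Keeping all the judgement on the sending side is what lets sender behaviour evolve without touching receivers: the reflector never needs to know which markings are currently considered valid.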
497     Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
498     not-ECT in compliance with RFC 3168, feedback on the state of the ECN
499     field when it arrives at the receiver could still be useful, because
500     middleboxes have been known to overwrite the ECN IP field as if it is
501     still part of the old Type of Service (ToS) field [Mandalari18]. If
502     a TCP client has set the SYN to Not-ECT, but receives feedback that
503     the ECN field on the SYN arrived with a different codepoint, it can
504     detect such middlebox interference and send Not-ECT for the rest of
505     the connection. Today, if a TCP server receives ECT or CE on a SYN,
506     it cannot know whether it is invalid (or valid) because only the TCP
507     client knows whether it originally marked the SYN as Not-ECT (or
508     ECT). Therefore, prior to AccECN, the server's only safe course of
509     action was to disable ECN for the connection. Instead, the AccECN
510     protocol allows the server to feed back the received ECN field to the
511     client, which then has all the information to decide whether the
512     connection has to fall-back from supporting ECN (or not).

514  3.  AccECN Protocol Specification

516  3.1.  Negotiating to use AccECN

518  3.1.1.  Negotiation during the TCP handshake

520     Given the ECN Nonce [RFC3540] has been reclassified as historic
521     [RFC8311], the present specification re-allocates the TCP flag at bit
522     7 of the TCP header, which was previously called NS (Nonce Sum), as
523     the AE (Accurate ECN) flag (see IANA Considerations in Section 7) as
524     shown below.

526 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 527 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 528 | | | A | C | E | U | A | P | R | S | F | 529 | Header Length | Reserved | E | W | C | R | C | S | S | Y | I | 530 | | | | R | E | G | K | H | T | N | N | 531 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 533 Figure 2: The (post-AccECN) definition of the TCP header flags during 534 the TCP handshake 536 During the TCP handshake at the start of a connection, to request 537 more accurate ECN feedback the TCP client (host A) MUST set the TCP 538 flags AE=1, CWR=1 and ECE=1 in the initial SYN segment. 540 If a TCP server (B) that is AccECN-enabled receives a SYN with the 541 above three flags set, it MUST set both its half connections into 542 AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of 543 the 4 values shown in the top block of Table 2 to confirm that it 544 supports AccECN. The TCP server MUST NOT set one of these 4 545 combination of flags on the SYN/ACK unless the preceding SYN 546 requested support for AccECN as above. 548 A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on 549 the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field 550 that arrived on the SYN. This applies whether or not the server 551 itself supports setting the IP-ECN field on a SYN or SYN/ACK (see 552 Section 2.5 for rationale). 554 Once a TCP client (A) has sent the above SYN to declare that it 555 supports AccECN, and once it has received the above SYN/ACK segment 556 that confirms that the TCP server supports AccECN, the TCP client 557 MUST set both its half connections into AccECN mode. 559 Once in AccECN mode, a TCP client or server has the rights and 560 obligations to participate in the ECN protocol defined in 561 Section 3.1.5. 563 The procedure for the client to follow if a SYN/ACK does not arrive 564 before its retransmission timer expires is given in Section 3.1.4. 566 3.1.2. 
Backward Compatibility 568 The three flags set to 1 to indicate AccECN support on the SYN have 569 been carefully chosen to enable natural fall-back to prior stages in 570 the evolution of ECN, as above. Table 2 tabulates all the 571 negotiation possibilities for ECN-related capabilities that involve 572 at least one AccECN-capable host. The entries in the first two 573 columns have been abbreviated, as follows:
575 AccECN: More Accurate ECN Feedback (the present specification)
577 Nonce: ECN Nonce feedback [RFC3540]
579 ECN: 'Classic' ECN feedback [RFC3168]
581 No ECN: Not-ECN-capable. Implicit congestion notification using 582 packet drop.
584 +--------+--------+------------+------------+-----------------------+
585 | A      | B      |  SYN A->B  |  SYN/ACK   |     Feedback Mode     |
586 |        |        |            |    B->A    |                       |
587 +--------+--------+------------+------------+-----------------------+
588 |        |        | AE CWR ECE | AE CWR ECE |                       |
589 | AccECN | AccECN | 1   1   1  | 0   1   0  | AccECN(no ECT on SYN) |
590 | AccECN | AccECN | 1   1   1  | 0   1   1  | AccECN (ECT1 on SYN)  |
591 | AccECN | AccECN | 1   1   1  | 1   0   0  | AccECN (ECT0 on SYN)  |
592 | AccECN | AccECN | 1   1   1  | 1   1   0  | AccECN (CE on SYN)    |
593 |        |        |            |            |                       |
594 | AccECN | Nonce  | 1   1   1  | 1   0   1  | (Reserved)            |
595 | AccECN | ECN    | 1   1   1  | 0   0   1  | classic ECN           |
596 | AccECN | No ECN | 1   1   1  | 0   0   0  | Not ECN               |
597 |        |        |            |            |                       |
598 | Nonce  | AccECN | 0   1   1  | 0   0   1  | classic ECN           |
599 | ECN    | AccECN | 0   1   1  | 0   0   1  | classic ECN           |
600 | No ECN | AccECN | 0   0   0  | 0   0   0  | Not ECN               |
601 |        |        |            |            |                       |
602 | AccECN | Broken | 1   1   1  | 1   1   1  | Not ECN               |
603 +--------+--------+------------+------------+-----------------------+
605 Table 2: ECN capability negotiation between Client (A) and Server (B)
607 Table 2 is divided into blocks each separated by an empty row. 609 1. The top block shows the case already described in Section 3.1 610 where both endpoints support AccECN and how the TCP server (B) 611 indicates congestion feedback. 613 2.
The second block shows the cases where the TCP client (A) 614 supports AccECN but the TCP server (B) supports some earlier 615 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 616 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 617 shown it MUST set both its half connections into the feedback 618 mode shown in the rightmost column. If it has set itself into 619 classic ECN feedback mode it MUST then comply with [RFC3168]. 621 The server response called 'Nonce' in the table is now historic. 622 For an AccECN implementation, there is no need to recognize or 623 support ECN Nonce feedback [RFC3540], which has been reclassified 624 as historic [RFC8311]. AccECN is compatible with alternative ECN 625 feedback integrity approaches (see Section 5.3). 627 3. The third block shows the cases where the TCP server (B) supports 628 AccECN but the TCP client (A) supports some earlier variant of 629 TCP feedback, indicated in its SYN. 631 When an AccECN-enabled TCP server (B) receives a SYN with 632 AE,CWR,ECE = 0,1,1 it MUST do one of the following: 634 * set both its half connections into the classic ECN feedback 635 mode and return a SYN/ACK with AE,CWR,ECE = 0,0,1 as shown. 636 Then it MUST comply with [RFC3168]. 638 * set both its half connections into Not ECN mode and return a 639 SYN/ACK with AE,CWR,ECE = 0,0,0, then continue with ECN 640 disabled. This latter case is unlikely to be desirable, but 641 it is allowed as a possibility, e.g. for minimal TCP 642 implementations. 644 When an AccECN-enabled TCP server (B) receives a SYN with 645 AE,CWR,ECE = 0,0,0 it MUST set both its half connections into the 646 Not ECN feedback mode, return a SYN/ACK with AE,CWR,ECE = 0,0,0 647 as shown and continue with ECN disabled. 649 4. The fourth block displays a combination labelled 'Broken'. Some 650 older TCP server implementations incorrectly set the reserved 651 flags in the SYN/ACK by reflecting those in the SYN.
Such broken 652 TCP servers (B) cannot support ECN, so as soon as an AccECN- 653 capable TCP client (A) receives such a broken SYN/ACK it MUST 654 fall back to Not ECN mode for both its half connections and 655 continue with ECN disabled. 657 The following additional rules do not fit the structure of the table, 658 but they complement it: 660 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 661 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 662 Host A MUST then enter the same feedback mode as it would have 663 entered had it been a responding host and received the same SYN. 664 Then host A MUST send the same SYN/ACK as it would have sent had 665 it been a responding host. 667 In-window SYN during TIME-WAIT: Many TCP implementations create a 668 new TCP connection if they receive an in-window SYN packet during 669 TIME-WAIT state. When a TCP host enters TIME-WAIT or CLOSED 670 state, it should ignore any previous state about the negotiation 671 of AccECN for that connection and renegotiate the feedback mode 672 according to Table 2. 674 3.1.3. Forward Compatibility 676 If a TCP server that implements AccECN receives a SYN with the three 677 TCP header flags (AE, CWR and ECE) set to any combination other than 678 000, 011 or 111, it MUST negotiate the use of AccECN as if they had 679 been set to 111. This ensures that future uses of the other 680 combinations on a SYN can rely on consistent behaviour from the 681 installed base of AccECN servers. 683 For the avoidance of doubt, the behaviour described in the present 684 specification applies whether or not the three remaining reserved TCP 685 header flags are zero. 687 3.1.4. 
Retransmission of the SYN 689 If the sender of an AccECN SYN times out before receiving the SYN/ 690 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 691 least one more time by continuing to set all three TCP ECN flags on 692 the first retransmitted SYN (using the usual retransmission time- 693 outs). If this first retransmission also fails to be acknowledged, 694 the sender SHOULD send subsequent retransmissions of the SYN with the 695 three TCP-ECN flags cleared (AE=CWR=ECE=0). A retransmitted SYN MUST 696 use the same ISN as the original SYN. 698 Retrying once before fall-back adds delay in the case where a 699 middlebox drops an AccECN (or ECN) SYN deliberately. However, 700 current measurements imply that a drop is less likely to be due to 701 middlebox interference than other intermittent causes of loss, e.g. 702 congestion, wireless interference, etc. 704 Implementers MAY use other fall-back strategies if they are found to 705 be more effective (e.g. attempting to negotiate AccECN on the SYN 706 only once or more than twice (most appropriate during high levels of 707 congestion)). However, other fall-back strategies will need to follow 708 all the rules in Section 3.1.5, which concern behaviour when SYNs or 709 SYN/ACKs negotiating different types of feedback have been sent 710 within the same connection. 712 Further, it may make sense to also remove any other new or 713 experimental fields or options on the SYN in case a middlebox might 714 be blocking them, although the required behaviour will depend on the 715 specification of the other option(s) and any attempt to co-ordinate 716 fall-back between different modules of the stack. 718 Whichever fall-back strategy is used, the TCP initiator SHOULD cache 719 failed connection attempts. If it does, it SHOULD NOT give up 720 attempting to negotiate AccECN on the SYN of subsequent connection 721 attempts until it is clear that the blockage is persistently and 722 specifically due to AccECN.
The cache should be arranged to expire 723 so that the initiator will infrequently attempt to check whether the 724 problem has been resolved. 726 The fall-back procedure if the TCP server receives no ACK to 727 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 728 Section 3.2.3.2. 730 3.1.5. Implications of AccECN Mode 732 Section 3.1.1 describes the only ways that a host can enter AccECN 733 mode, whether as a client or as a server. 735 As a Data Sender, a host in AccECN mode has the rights and 736 obligations concerning the use of ECN defined below, which build on 737 those in [RFC3168] as updated by [RFC8311]: 739 o Using ECT: 741 * It can set an ECT codepoint in the IP header of packets to 742 indicate to the network that the transport is capable and 743 willing to participate in ECN for this packet. 745 * It does not have to set ECT on any packet (for instance if it 746 has reason to believe such a packet would be blocked). 748 o Switching feedback negotiation (e.g. fall-back): 750 * It SHOULD NOT set ECT on any packet if it has received at least 751 one valid SYN or Acceptable SYN/ACK with AE=CWR=ECE=0. A 752 "valid SYN" has the same port numbers and the same ISN as the 753 SYN that caused the server to enter AccECN mode. 755 * It MUST NOT send an ECN-setup SYN [RFC3168] within the same 756 connection as it has sent a SYN requesting AccECN feedback. 758 * It MUST NOT send an ECN-setup SYN/ACK [RFC3168] within the same 759 connection as it has sent a SYN/ACK agreeing to use AccECN 760 feedback. 762 The above rules are necessary because, if one peer were to 763 negotiate the feedback mode in two different types of handshake, 764 it would not be possible for the other peer to know for certain 765 which handshake packet(s) the other end had eventually received or 766 in which order it received them. So, without these rules, the two 767 peers could end up using different feedback modes without knowing 768 it.
770 o Congestion response: 772 * It is still obliged to respond appropriately to AccECN feedback 773 that indicates there were ECN marks on packets it had 774 previously sent, as defined in Section 6.1 of [RFC3168] and 775 updated by Sections 2.1 and 4.1 of [RFC8311]. 777 * The commitment to respond appropriately to incoming indications 778 of congestion remains even if it sends a SYN packet with 779 AE=CWR=ECE=0, in a later transmission within the same TCP 780 connection. 782 * Unlike an RFC 3168 data sender, it MUST NOT set CWR to indicate 783 it has received and responded to indications of congestion (for 784 the avoidance of doubt, this does not preclude it from setting 785 the bits of the ACE counter field, which includes an overloaded 786 use of the same bit). 788 As a Data Receiver: 790 o a host in AccECN mode MUST feed back the information in the IP-ECN 791 field of incoming packets using Accurate ECN feedback, as 792 specified in Section 3.2 below. 794 o if it receives an ECN-setup SYN or ECN-setup SYN/ACK [RFC3168] 795 during the same connection as it receives a SYN requesting AccECN 796 feedback or a SYN/ACK agreeing to use AccECN feedback, it MUST 797 reset the connection with a RST packet. 799 o If for any reason it is not willing to provide ECN feedback on a 800 particular TCP connection, to indicate this unwillingness it 801 SHOULD clear the AE, CWR and ECE flags in all SYN and/or SYN/ACK 802 packets that it sends. 804 o it MUST NOT use reception of packets with ECT set in the IP-ECN 805 field as an implicit signal that the peer is ECN-capable. Reason: 806 ECT at the IP layer does not explicitly confirm the peer has the 807 correct ECN feedback logic, as the packets could have been mangled 808 at the IP layer. 810 3.2. 
AccECN Feedback 812 Each Data Receiver of each half connection maintains four counters, 813 r.cep, r.ceb, r.e0b and r.e1b: 815 o The Data Receiver MUST increment the CE packet counter (r.cep) 816 for every Acceptable packet that it receives with the CE codepoint 817 in the IP-ECN field, including CE-marked control packets but 818 excluding CE on SYN packets (SYN=1; ACK=0). 820 o The Data Receiver MUST increment the r.ceb, r.e0b or r.e1b byte 821 counters by the number of TCP payload octets in Acceptable packets 822 marked respectively with the CE, ECT(0) and ECT(1) codepoint in 823 their IP-ECN field, including any payload octets on control 824 packets, but not including any payload octets on SYN packets 825 (SYN=1; ACK=0). 827 Each Data Sender of each half connection maintains four counters, 828 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 829 counters at the Data Receiver. 831 A Data Receiver feeds back the CE packet counter using the Accurate 832 ECN (ACE) field, as explained in Section 3.2.2. It feeds back 833 all the byte counters using the AccECN TCP Option, as specified in 834 Section 3.2.3. 836 Whenever a host feeds back the value of any counter, it MUST report 837 the most recent value, no matter whether it is in a pure ACK, an ACK 838 with new payload data or a retransmission. Therefore the feedback 839 carried on a retransmitted packet is unlikely to be the same as the 840 feedback on the original packet. 842 3.2.1. Initialization of Feedback Counters 844 When a host first enters AccECN mode, in its role as a Data Receiver 845 it initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = 846 r.e1b = 0. 848 Non-zero initial values are used to support a stateless handshake 849 (see Section 5.1) and to be distinct from cases where the fields are 850 incorrectly zeroed (e.g. by middleboxes - see Section 3.2.3.2.4).
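The counter rules in Section 3.2 and the Data Receiver's initial values above can be illustrated with the following minimal, non-normative Python sketch. The class name ReceiverCounters, the method on_packet and its parameters are hypothetical and do not appear in this specification. The sketch assumes the caller has already determined that a packet is Acceptable, and it deliberately omits the special at-most-once treatment of CE-marked SYN/ACKs given in Section 3.2.2.2.

```python
from dataclasses import dataclass

# IP-ECN codepoints as two-bit values (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ReceiverCounters:
    """Hypothetical holder for the four Data Receiver counters."""
    cep: int = 5  # r.cep: CE packet counter, initial value 5
    ceb: int = 0  # r.ceb: CE byte counter, initial value 0
    e0b: int = 1  # r.e0b: ECT(0) byte counter, initial value 1
    e1b: int = 0  # r.e1b: ECT(1) byte counter, initial value 0

    def on_packet(self, ip_ecn: int, payload_len: int,
                  syn: bool, ack: bool) -> None:
        """Update the counters for one Acceptable packet (assumed checked)."""
        if syn and not ack:
            # SYN packets (SYN=1; ACK=0) are excluded from every counter.
            return
        if ip_ecn == CE:
            self.cep += 1            # counts packets, including control packets
            self.ceb += payload_len  # counts TCP payload octets only
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets change no counter; there is no Not-ECT counter.
```

A CE-marked pure ACK therefore advances r.cep but leaves r.ceb unchanged, because it carries no payload octets.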
852 When a host enters AccECN mode, in its role as a Data Sender it 853 initializes its counters to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 854 0. 856 3.2.2. The ACE Field 858 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 859 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 860 as one 3-bit field. Then the field is given a new name, ACE, as 861 shown in Figure 3.
863      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
864    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
865    |               |           |           | U | A | P | R | S | F |
866    | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
867    |               |           |           | G | K | H | T | N | N |
868    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
870 Figure 3: Definition of the ACE field within bytes 13 and 14 of the 871 TCP Header (when AccECN has been negotiated and SYN=0).
873 The original definition of these three flags in the TCP header, 874 including the addition of support for the ECN Nonce, is shown for 875 comparison in Figure 1. This specification does not rename these 876 three TCP flags to ACE unconditionally; it merely overloads them with 877 another name and definition once an AccECN connection has been 878 established. 880 With one exception (Section 3.2.2.1), a host with both of its half- 881 connections in AccECN mode MUST interpret the AE, CWR and ECE flags 882 as the 3-bit ACE counter on a segment with the SYN flag cleared 883 (SYN=0). On such a packet, a Data Receiver MUST encode the three 884 least significant bits of its r.cep counter into the ACE field that 885 it feeds back to the Data Sender. A host MUST NOT interpret the 3 886 flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is 887 0 or 1), or if AccECN negotiation is incomplete or has not succeeded. 889 Both parts of each of these conditions are equally important. For 890 instance, even if AccECN negotiation has been successful, the ACE 891 field is not defined on any segments with SYN=1 (e.g.
a 892 retransmission of an unacknowledged SYN/ACK, or when both ends send 893 SYN/ACKs after AccECN support has been successfully negotiated during 894 a simultaneous open). 896 3.2.2.1. ACE Field on the ACK of the SYN/ACK 898 A TCP client (A) in AccECN mode MUST feed back which of the 4 899 possible values of the IP-ECN field was on the SYN/ACK by writing it 900 into the ACE field of a pure ACK with no SACK blocks using the binary 901 encoding in Table 3 (which is the same as that used on the SYN/ACK in 902 Table 2). This shall be called the handshake encoding of the ACE 903 field, and it is the only exception to the rule that the ACE field 904 carries the 3 least significant bits of the r.cep counter on packets 905 with SYN=0. 907 Normally, a TCP client acknowledges a SYN/ACK with an ACK that 908 satisfies the above conditions anyway (SYN=0, no data, no SACK 909 blocks). If an AccECN TCP client intends to acknowledge the SYN/ACK 910 with a packet that does not satisfy these conditions (e.g. it has 911 data to include on the ACK), it SHOULD first send a pure ACK that 912 does satisfy these conditions (see Section 5.2), so that it can feed 913 back which of the four values of the IP-ECN field arrived on the SYN/ 914 ACK. A valid exception to this "SHOULD" would be where the 915 implementation will only be used in an environment where mangling of 916 the ECN field is unlikely. 
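As a non-normative illustration, the handshake encoding described above could be expressed as the following Python sketch. The mapping reuses the flag values in the top block of Table 2. The names HANDSHAKE_ACE, handshake_ace, ace_to_flags and uses_handshake_encoding are hypothetical, and the sketch assumes that AE is the most significant bit of the ACE field and ECE the least significant, matching the order in which Table 2 lists the flags.

```python
# IP-ECN codepoints as two-bit values (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

# Handshake encoding: IP-ECN codepoint seen on the SYN/ACK -> ACE value
# written on the pure ACK of the SYN/ACK (same values as Table 2, top block).
HANDSHAKE_ACE = {
    NOT_ECT: 0b010,
    ECT1: 0b011,
    ECT0: 0b100,
    CE: 0b110,
}


def handshake_ace(ip_ecn_on_synack: int) -> int:
    """Return the ACE value a client would place on the ACK of the SYN/ACK."""
    return HANDSHAKE_ACE[ip_ecn_on_synack & 0b11]


def ace_to_flags(ace: int) -> tuple:
    """Split a 3-bit ACE value into (AE, CWR, ECE); AE assumed to be the MSB."""
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def uses_handshake_encoding(syn: bool, payload_len: int,
                            sack_blocks: int) -> bool:
    """True if a client's ACK of the SYN/ACK meets the stated conditions:
    SYN=0, no data and no SACK blocks."""
    return (not syn) and payload_len == 0 and sack_blocks == 0
```

A client with data ready to send would, under the "SHOULD" above, first emit a packet for which uses_handshake_encoding() is true and only then send the data segment.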
918 +---------------------+---------------------+-----------------------+
919 | IP-ECN codepoint on | ACE on pure ACK of  | r.cep of client in    |
920 | SYN/ACK             | SYN/ACK             | AccECN mode           |
921 +---------------------+---------------------+-----------------------+
922 | Not-ECT             | 0b010               | 5                     |
923 | ECT(1)              | 0b011               | 5                     |
924 | ECT(0)              | 0b100               | 5                     |
925 | CE                  | 0b110               | 6                     |
926 +---------------------+---------------------+-----------------------+
928 Table 3: The encoding of the ACE field in the ACK of the SYN-ACK to 929 reflect the SYN-ACK's IP-ECN field
931 When an AccECN server in SYN-RCVD state receives a pure ACK with 932 SYN=0 and no SACK blocks, instead of treating the ACE field as a 933 counter, it MUST infer the meaning of each possible value of the ACE 934 field from Table 4, which also shows the value that an AccECN server 935 MUST set s.cep to as a result. 937 Given this encoding of the ACE field on the ACK of a SYN/ACK is 938 exceptional, an AccECN server using large receive offload (LRO) might 939 prefer to disable LRO until such an ACK has transitioned it out of 940 SYN-RCVD state.
942 +---------------+-----------------------------+---------------------+
943 | ACE on ACK of | IP-ECN codepoint on SYN/ACK | s.cep of server in  |
944 | SYN/ACK       | inferred by server          | AccECN mode         |
945 +---------------+-----------------------------+---------------------+
946 | 0b000         | {Notes 1, 3}                | Disable ECN         |
947 | 0b001         | {Notes 2, 3}                | 5                   |
948 | 0b010         | Not-ECT                     | 5                   |
949 | 0b011         | ECT(1)                      | 5                   |
950 | 0b100         | ECT(0)                      | 5                   |
951 | 0b101         | Currently Unused {Note 2}   | 5                   |
952 | 0b110         | CE                          | 6                   |
953 | 0b111         | Currently Unused {Note 2}   | 5                   |
954 +---------------+-----------------------------+---------------------+
956 Table 4: Meaning of the ACE field on the ACK of the SYN/ACK
958 {Note 1}: If the server is in AccECN mode, the value of zero raises 959 suspicion of zeroing of the ACE field on the path (see 960 Section 3.2.2.3).
962 {Note 2}: If the server is in AccECN mode, these values are Currently 963 Unused but the AccECN server's behaviour is still defined for forward 964 compatibility. Then the designer of a future protocol can know for 965 certain what AccECN servers will do with these codepoints. 967 {Note 3}: In the case where a server that implements AccECN is also 968 using a stateless handshake (termed a SYN cookie) it will not 969 remember whether it entered AccECN mode. The values 0b000 or 0b001 970 will remind it that it did not enter AccECN mode, because AccECN does 971 not use them (see Section 5.1 for details). If a stateless server 972 that implements AccECN receives either of these two values in the 973 ACK, its action is implementation-dependent and outside the scope of 974 this spec. It will certainly not take the action in the third column 975 because, after it receives either of these values, it is not in 976 AccECN mode. I.e., it will not disable ECN (at least not just 977 because ACE is 0b000) and it will not set s.cep. 979 3.2.2.2. Encoding and Decoding Feedback in the ACE Field 981 Whenever the Data Receiver sends an ACK with SYN=0 (with or without 982 data), unless the handshake encoding in Section 3.2.2.1 applies, the 983 Data Receiver MUST encode the least significant 3 bits of its r.cep 984 counter into the ACE field (see Appendix A.2). 986 Whenever the Data Sender receives an ACK with SYN=0 (with or without 987 data), it first checks whether it has already been superseded by 988 another ACK, in which case it ignores the ECN feedback. If the ACK 989 has not been superseded, and if the special handshake encoding in 990 Section 3.2.2.1 does not apply, the Data Sender decodes the ACE field 991 as follows (see Appendix A.2 for examples).
993 o It takes the least significant 3 bits of its local s.cep counter 994 and subtracts them from the incoming ACE counter to work out the 995 minimum positive increment it could apply to s.cep (assuming the 996 ACE field only wrapped at most once). 998 o It then follows the safety procedures in Section 3.2.2.5.2 to 999 calculate or estimate how many packets the ACK could have 1000 acknowledged under the prevailing conditions to determine whether 1001 the ACE field might have wrapped more than once. 1003 The encode/decode procedures during the three-way handshake are 1004 exceptions to the general rules given so far, so they are spelled out 1005 step by step below for clarity: 1007 o If a TCP server in AccECN mode receives a CE mark in the IP-ECN 1008 field of a SYN (SYN=1, ACK=0), it MUST NOT increment r.cep (it 1009 remains at its initial value of 5). 1011 Reason: It would be redundant for the server to include CE-marked 1012 SYNs in its r.cep counter, because it already reliably delivers 1013 feedback of any CE marking on the SYN/ACK using the encoding in 1014 Table 2. This also ensures that, when the server starts using the 1015 ACE field, it has not unnecessarily consumed more than one initial 1016 value, given they can be used to negotiate variants of the AccECN 1017 protocol (see Appendix B.3). 1019 o If a TCP client in AccECN mode receives CE feedback in the TCP 1020 flags of a SYN/ACK, it MUST NOT increment s.cep (it remains at its 1021 initial value of 5), so that it stays in step with r.cep on the 1022 server. Nonetheless, the TCP client still triggers the congestion 1023 control actions necessary to respond to the CE feedback. 1025 o If a TCP client in AccECN mode receives a CE mark in the IP-ECN 1026 field of a SYN/ACK, it MUST increment r.cep, but no more than once 1027 no matter how many CE-marked SYN/ACKs it receives (i.e. 1028 incremented from 5 to 6, but no further). 
1030 Reason: Incrementing r.cep ensures the client will eventually 1031 deliver any CE marking to the server reliably when it starts using 1032 the ACE field. Even though the client also feeds back any CE 1033 marking on the ACK of the SYN/ACK using the encoding in Table 3, 1034 this ACK is not delivered reliably, so it can be considered as a 1035 timely notification that is redundant but unreliable. The client 1036 does not increment r.cep more than once, because the server can 1037 only increment s.cep once (see next bullet). Also, this limits 1038 the unnecessarily consumed initial values of the ACE field to two. 1040 o If a TCP server in AccECN mode and in SYN-RCVD state receives CE 1041 feedback in the TCP flags of a pure ACK with no SACK blocks, it 1042 MUST increment s.cep (from 5 to 6). The TCP server then triggers 1043 the congestion control actions necessary to respond to the CE 1044 feedback. 1046 Reasoning: The TCP server can only increment s.cep once, because 1047 the first ACK it receives will cause it to transition out of SYN- 1048 RCVD state. The server's congestion response would be no 1049 different even if it could receive feedback of more than one CE- 1050 marked SYN/ACK. 1052 Once the TCP server transitions to ESTABLISHED state, it might 1053 later receive other pure ACK(s) with the handshake encoding in the 1054 ACE field. The conditions for this to occur are quite unusual, 1055 but not impossible, e.g. a SYN/ACK (or ACK of the SYN/ACK) that is 1056 delayed for longer than the server's retransmission timeout; or 1057 packet duplication by the network. Nonetheless, once in the 1058 ESTABLISHED state, the server will consider the ACE field to be 1059 encoded as the normal ACE counter on all packets with SYN=0 (given 1060 it will be following the above rule in this bullet). The server 1061 MAY include a test to avoid this case. 1063 3.2.2.3. 
Testing for Zeroing of the ACE Field 1065 Section 3.2.1 required the Data Receiver to initialize the r.cep 1066 counter to a non-zero value. Therefore, in either direction the 1067 initial value of the ACE counter ought to be non-zero. 1069 If AccECN has been successfully negotiated, the Data Sender SHOULD 1070 check the value of the ACE counter in the first packet (with or 1071 without data) that arrives with SYN=0. If the value of this ACE 1072 field is zero (0b000), the Data Sender disables sending ECN-capable 1073 packets for the remainder of the half-connection by setting the IP/ 1074 ECN field in all subsequent packets to Not-ECT. 1076 Usually, the server checks the ACK of the SYN/ACK from the client, 1077 while the client checks the first data segment from the server. 1078 However, if reordering occurs, "the first packet ... that arrives" 1079 will not necessarily be the same as the first packet in sequence 1080 order. The test has been specified loosely like this to simplify 1081 implementation, and because it would not have been any more precise 1082 to have specified the first packet in sequence order, which would not 1083 necessarily be the first ACE counter that the Data Receiver fed back 1084 anyway, given it might have been a retransmission. 1086 The possibility of reordering means that there is a small chance 1087 that the ACE field on the first packet to arrive is genuinely zero 1088 (without middlebox interference). This would cause a host to 1089 unnecessarily disable ECN for a half connection. Therefore, in 1090 environments where there is no evidence of the ACE field being 1091 zeroed, implementations can skip this test. 1093 Note that the Data Sender MUST NOT test whether the arriving counter 1094 in the initial ACE field has been initialized to a specific valid 1095 value - the above check solely tests whether the ACE fields have been 1096 incorrectly zeroed.
This allows hosts to use different initial 1097 values as an additional signalling channel in future. 1099 3.2.2.4. Testing for Mangling of the IP/ECN Field 1101 The value of the ACE field on the SYN/ACK indicates the value of the 1102 IP/ECN field when the SYN arrived at the server. The client can 1103 compare this with how it originally set the IP/ECN field on the SYN. 1104 If this comparison implies an unsafe transition (see below) of the 1105 IP/ECN field, for the remainder of the connection the client MUST NOT 1106 send ECN-capable packets, but it MUST continue to feed back any ECN 1107 markings on arriving packets. 1109 The value of the ACE field on the last ACK of the 3WHS indicates the 1110 value of the IP/ECN field when the SYN/ACK arrived at the client. 1111 The server can compare this with how it originally set the IP/ECN 1112 field on the SYN/ACK. If this comparison implies an unsafe 1113 transition of the IP/ECN field, for the remainder of the connection 1114 the server MUST NOT send ECN-capable packets, but it MUST continue to 1115 feed back any ECN markings on arriving packets. 1117 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 1118 count of CE marks is still eventually delivered reliably). If this 1119 ACK does not arrive, the server can continue to send ECN-capable 1120 packets without having tested for mangling of the IP/ECN field on the 1121 SYN/ACK. 1123 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 1124 repeated here for convenience: 1126 o the not-ECT codepoint changes; 1128 o either ECT codepoint transitions to not-ECT; 1130 o the CE codepoint changes. 1132 RFC 3168 says that a router that changes ECT to not-ECT is invalid 1133 but safe. However, from a host's viewpoint, this transition is 1134 unsafe because it could be the result of two transitions at different 1135 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 
1136 This scenario could well happen where an ECN-enabled home router 1137 congests its upstream mobile broadband bottleneck link, then the 1138 ingress to the mobile network clears the ECN field [Mandalari18]. 1140 Once a Data Sender has entered AccECN mode it SHOULD check whether 1141 all feedback received for the first three or four rounds indicated 1142 that every packet it sent was CE-marked. If so, for the remainder of 1143 the connection, the Data Sender SHOULD NOT send ECN-capable packets, 1144 but it MUST continue to feed back any ECN markings on arriving 1145 packets. 1147 The above fall-back behaviours are necessary in case mangling of the 1148 IP/ECN field is asymmetric, which is currently common over some 1149 mobile networks [Mandalari18]. Then one end might see no unsafe 1150 transition and continue sending ECN-capable packets, while the other 1151 end sees an unsafe transition and stops sending ECN-capable packets. 1153 3.2.2.5. Safety against Ambiguity of the ACE Field 1155 If too many CE-marked segments are acknowledged at once, or if a long 1156 run of ACKs is lost or thinned out, the 3-bit counter in the ACE 1157 field might have cycled between two ACKs arriving at the Data Sender. 1158 The following safety procedures minimize this ambiguity. 1160 3.2.2.5.1. Data Receiver Safety Procedures 1162 An AccECN Data Receiver: 1164 o SHOULD immediately send an ACK whenever a data packet marked CE 1165 arrives after the previous packet was not CE. 1167 o MUST immediately send an ACK once 'n' CE marks have arrived since 1168 the previous ACK, where 'n' SHOULD be 2 and MUST be in the range 2 1169 to 6 inclusive. 1171 These rules for when to send an ACK are designed to be complemented 1172 by those in Section 3.2.3.3, which concern whether the AccECN TCP 1173 Option ought to be included on ACKs. 
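The two Data Receiver rules above can be sketched in Python as follows. This is a non-normative illustration under stated assumptions: the class AckTrigger and its methods are hypothetical names, only data packets are fed to it, the state before the first packet is treated as "previous packet was not CE", and CE marks are counted on data packets only. A real stack would also call ack_sent() whenever it sends an ACK for any other reason (for example a delayed-ACK timer).

```python
CE = 0b11  # IP-ECN codepoint for Congestion Experienced (RFC 3168)


class AckTrigger:
    """Hypothetical helper that decides when a Data Receiver ACKs at once."""

    def __init__(self, n: int = 2):
        # 'n' SHOULD be 2 and MUST be in the range 2 to 6 inclusive.
        if not 2 <= n <= 6:
            raise ValueError("n must be between 2 and 6 inclusive")
        self.n = n
        self.ce_since_ack = 0     # CE marks seen since the previous ACK
        self.prev_was_ce = False  # assumption: no CE before the first packet

    def ack_sent(self) -> None:
        """Record that an ACK has just been sent."""
        self.ce_since_ack = 0

    def on_data_packet(self, ip_ecn: int) -> bool:
        """Return True if an ACK is to be sent immediately for this packet."""
        is_ce = ip_ecn == CE
        send_now = False
        if is_ce:
            self.ce_since_ack += 1
            if not self.prev_was_ce:
                send_now = True  # SHOULD: CE arrived after a non-CE packet
            if self.ce_since_ack >= self.n:
                send_now = True  # MUST: 'n' CE marks since the previous ACK
        self.prev_was_ce = is_ce
        if send_now:
            self.ack_sent()
        return send_now
```

With n = 2, a run of CE-marked data packets following an unmarked one triggers an immediate ACK on the first CE packet and then again once two further CE marks have accumulated.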
1175 For the avoidance of doubt, the above change-triggered ACK mechanism 1176 is deliberately worded to solely apply to data packets, and to ignore 1177 the arrival of a control packet with no payload, because it is 1178 important that TCP does not acknowledge pure ACKs. The change- 1179 triggered ACK approach can lead to some additional ACKs but it feeds 1180 back the timing and the order in which ECN marks are received with 1181 minimal additional complexity. If CE marks are infrequent, or 1182 there are multiple marks in a row, the additional load will be low. 1183 Other marking patterns could increase the load significantly. 1185 Even though the first bullet is stated as a "SHOULD", it is important 1186 for a transition to immediately trigger an ACK if at all possible, so 1187 that the Data Sender can rely on change-triggered ACKs to detect 1188 queue growth as soon as possible, e.g. at the start of a flow. This 1189 requirement can only be relaxed if certain offload hardware needed 1190 for high performance cannot support change-triggered ACKs (although 1191 high performance protocols such as DCTCP already successfully use 1192 change-triggered ACKs). One possible compromise would be for the 1193 receiver to heuristically detect whether the sender is in slow-start, 1194 then to implement change-triggered ACKs while the sender is in slow- 1195 start, and offload otherwise. 1197 3.2.2.5.2. Data Sender Safety Procedures 1199 If the Data Sender has not received AccECN TCP Options to give it 1200 more dependable information, and it detects that the ACE field could 1201 have cycled, it SHOULD deem whether it cycled by taking the safest 1202 likely case under the prevailing conditions. It can detect if the 1203 counter could have cycled by using the jump in the acknowledgement 1204 number since the last ACK to calculate or estimate how many segments 1205 could have been acknowledged. An example algorithm to implement this 1206 policy is given in Appendix A.2.
An implementer MAY develop an 1207 alternative algorithm as long as it satisfies these requirements. 1209 If missing acknowledgement numbers arrive later (reordering) and 1210 prove that the counter did not cycle, the Data Sender MAY attempt to 1211 neutralize the effect of any action it took based on a conservative 1212 assumption that it later found to be incorrect. 1214 The Data Sender can estimate how many packets (of any marking) an ACK 1215 acknowledges. If the ACE counter on an ACK seems to imply that the 1216 minimum number of newly CE-marked packets is greater than the number 1217 of newly acknowledged packets, the Data Sender SHOULD believe the ACE 1218 counter, unless it can be sure that it is counting all control 1219 packets correctly. 1221 3.2.3. The AccECN Option 1223 The AccECN Option is defined as shown in Figure 4. The initial 'E' 1224 of each field name stands for 'Echo'.
1226     0                   1                   2                   3
1227     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1228    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1229    |  Kind = TBD0  |  Length = 11  |          EE0B field           |
1230    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1231    | EE0B (cont'd) |                  ECEB field                   |
1232    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1233    |                  EE1B field                   |    Order 0
1234    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1236     0                   1                   2                   3
1237     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1238    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1239    |  Kind = TBD1  |  Length = 11  |          EE1B field           |
1240    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1241    | EE1B (cont'd) |                  ECEB field                   |
1242    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1243    |                  EE0B field                   |    Order 1
1244    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1246 Figure 4: The AccECN TCP Option
1248 Figure 4 shows two option field orders: Order 0 and Order 1. They 1249 both consist of three 24-bit fields.
Order 0 provides the 24 least 1250 significant bits of the r.e0b, r.ceb and r.e1b counters, 1251 respectively. Order 1 provides the same fields, but in the opposite 1252 order. On each packet, the Data Receiver can use whichever order is 1253 more efficient. 1255 When a Data Receiver sends an AccECN Option, it MUST set the Kind 1256 field to TBD0 if using Order 0, or to TBD1 if using Order 1. These 1257 two new TCP Option Kinds are registered in Section 7 and called 1258 AccECN0 and AccECN1 respectively. 1260 Note that there is no field to feed back Not-ECT bytes. Nonetheless, 1261 an algorithm for the Data Sender to calculate the number of payload 1262 bytes received as Not-ECT is given in Appendix A.5. 1264 Whenever a Data Receiver sends an AccECN Option, the rules in 1265 Section 3.2.3.3 expect it to usually send a full-length option. To 1266 cope with option space limitations, it can omit unchanged fields from 1267 the tail of the option, as long as it preserves the order of the 1268 remaining fields and includes any field that has changed. The length 1269 field MUST indicate which fields are present as follows:
1271       +--------+------------------+------------------+
1272       | Length | Order 0          | Order 1          |
1273       +--------+------------------+------------------+
1274       |     11 | EE0B, ECEB, EE1B | EE1B, ECEB, EE0B |
1275       |      8 | EE0B, ECEB       | EE1B, ECEB       |
1276       |      5 | EE0B             | EE1B             |
1277       |      2 | (empty)          | (empty)          |
1278       +--------+------------------+------------------+
1280 The empty option of Length=2 is provided to allow for a case where an 1281 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 1282 but there is very limited space for the option. 1284 All implementations of a Data Sender that read any AccECN Option MUST 1285 be able to read in AccECN Options of any of the above lengths.
For 1286 forward compatibility, if the AccECN Option is of any other length, 1287 implementations MUST use those whole 3-octet fields that fit within 1288 the length and ignore the remainder of the option, treating it as 1289 padding. 1291 The AccECN Option has to be optional to implement, because both 1292 sender and receiver have to be able to cope without the option anyway 1293 - in cases where it does not traverse a network path. It is 1294 RECOMMENDED to implement both sending and receiving of the AccECN 1295 Option. If sending of the AccECN Option is implemented, the fall- 1296 backs described in this document will need to be implemented as well 1297 (unless solely for a controlled environment where path traversal is 1298 not considered a problem). Even if a developer does not implement 1299 sending of the AccECN Option, it is RECOMMENDED that they still 1300 implement logic to receive and understand any AccECN Options sent by 1301 remote peers. 1303 If a Data Receiver intends to send the AccECN Option at any time 1304 during the rest of the connection, it is strongly recommended to also 1305 test path traversal of the AccECN Option as specified in 1306 Section 3.2.3.2. 1308 3.2.3.1. Encoding and Decoding Feedback in the AccECN Option Fields 1310 Whenever the Data Receiver includes any of the counter fields (ECEB, 1311 EE0B, EE1B) in an AccECN Option, it MUST encode the 24 least 1312 significant bits of the current value of the associated counter into 1313 the field (respectively r.ceb, r.e0b, r.e1b). 1315 Whenever the Data Sender receives an ACK carrying an AccECN Option, it 1316 first checks whether the ACK has already been superseded by another 1317 ACK, in which case it ignores the ECN feedback. If the ACK has not 1318 been superseded, the Data Sender MUST decode the fields in the AccECN 1319 Option as follows.
For each field, it takes the least significant 24 1320 bits of its associated local counter (s.ceb, s.e0b or s.e1b) and 1321 subtracts them from the counter in the associated field of the 1322 incoming AccECN Option (respectively ECEB, EE0B, EE1B), to work out 1323 the minimum positive increment it could apply to s.ceb, s.e0b or 1324 s.e1b (assuming the field in the option only wrapped at most once). 1326 Appendix A.1 gives an example algorithm for the Data Receiver to 1327 encode its byte counters into the AccECN Option, and for the Data 1328 Sender to decode the AccECN Option fields into its byte counters. 1330 Note that, as specified in Section 3.2, any data on the SYN (SYN=1, 1331 ACK=0) is not included in any of the locally held octet counters nor 1332 in the AccECN Option on the wire. 1334 3.2.3.2. Path Traversal of the AccECN Option 1336 3.2.3.2.1. Testing the AccECN Option during the Handshake 1338 The TCP client MUST NOT include the AccECN TCP Option on the SYN. (A 1339 fall-back strategy for the loss of the SYN (possibly due to middlebox 1340 interference) is specified in Section 3.1.4.) 1342 A TCP server that confirms its support for AccECN (in response to an 1343 AccECN SYN from the client as described in Section 3.1) SHOULD 1344 include an AccECN TCP Option on the SYN/ACK. 1346 A TCP client that has successfully negotiated AccECN SHOULD include 1347 an AccECN Option in the first ACK at the end of the 3WHS. However, 1348 this first ACK is not delivered reliably, so the TCP client SHOULD 1349 also include an AccECN Option on the first data segment it sends (if 1350 it ever sends one). 1352 A host MAY NOT include an AccECN Option in any of these three cases 1353 if it has cached knowledge that the packet would be likely to be 1354 blocked on the path to the other host if it included an AccECN 1355 Option. 1357 3.2.3.2.2. 
Testing for Loss of Packets Carrying the AccECN Option 1359 If after the normal TCP timeout the TCP server has not received an 1360 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1361 lost, e.g. due to congestion, or a middlebox might be blocking the 1362 AccECN Option. To expedite connection setup, the TCP server SHOULD 1363 retransmit the SYN/ACK repeating the same AE, CWR and ECE TCP flags 1364 as on the original SYN/ACK but with no AccECN Option. If this 1365 retransmission times out, to expedite connection setup, the TCP 1366 server SHOULD disable AccECN and ECN for this connection by 1367 retransmitting the SYN/ACK with AE=CWR=ECE=0 and no AccECN Option. 1369 Implementers MAY use other fall-back strategies if they are found to 1370 be more effective (e.g. retrying the AccECN Option for a second time 1371 before fall-back - most appropriate during high levels of 1372 congestion). However, other fall-back strategies will need to follow 1373 all the rules in Section 3.1.5, which concern behaviour when SYNs or 1374 SYN/ACKs negotiating different types of feedback have been sent 1375 within the same connection. 1377 If the TCP client detects that the first data segment it sent with 1378 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1379 on the retransmission. Again, implementers MAY use other fall-back 1380 strategies such as attempting to retransmit a second segment with the 1381 AccECN Option before fall-back, and/or caching whether the AccECN 1382 Option is blocked for subsequent connections. 1383 [I-D.ietf-tcpm-2140bis] further discusses caching of TCP parameters 1384 and status information. 1386 If a host falls back to not sending the AccECN Option, it will 1387 continue to process any incoming AccECN Options as normal. 1389 Either host MAY include the AccECN Option in a subsequent segment to 1390 retest whether the AccECN Option can traverse the path. 
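The SYN/ACK fall-back sequence described above can be sketched as follows. This is a minimal illustration only, not part of the specification: the names SynAck and synack_for_attempt are hypothetical, the flag values of the original SYN/ACK are taken as an input (they are determined by Section 3.1), and the alternative strategies that implementers MAY use are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynAck:
    """Hypothetical summary of the ECN-related content of a SYN/ACK."""
    ae: int
    cwr: int
    ece: int
    accecn_option: bool


def synack_for_attempt(original: SynAck, timeouts: int) -> SynAck:
    """Return the SYN/ACK to send after `timeouts` retransmission timeouts.

    0 timeouts: the original SYN/ACK (normally carrying an AccECN Option).
    1 timeout:  same AE, CWR and ECE flags, but no AccECN Option.
    2 or more:  AE=CWR=ECE=0 and no AccECN Option (AccECN and ECN disabled).
    """
    if timeouts < 0:
        raise ValueError("timeouts must be non-negative")
    if timeouts == 0:
        return original
    if timeouts == 1:
        return SynAck(original.ae, original.cwr, original.ece,
                      accecn_option=False)
    return SynAck(0, 0, 0, accecn_option=False)
```

The first step removes only the option, so that a middlebox that drops unknown options is ruled out before AccECN itself is abandoned.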
1392 If the TCP server receives a second SYN with a request for AccECN 1393 support, it should resend the SYN/ACK, again confirming its support 1394 for AccECN, but this time without the AccECN Option. This approach 1395 rules out any interference by middleboxes that may drop packets with 1396 unknown options, even though it is more likely that the SYN/ACK would 1397 have been lost due to congestion. The TCP server MAY try to send 1398 another packet with the AccECN Option at a later point during the 1399 connection but should monitor whether that packet is lost as well, in 1400 which case it SHOULD disable the sending of the AccECN Option for 1401 this half-connection. 1403 Similarly, an AccECN end-point MAY separately memorize which data 1404 packets carried an AccECN Option and disable the sending of AccECN 1405 Options if the loss probability of those packets is significantly 1406 higher than that of all other data packets in the same connection. 1408 3.2.3.2.3. Testing for Absence of the AccECN Option 1410 If the TCP client has successfully negotiated AccECN but does not 1411 receive an AccECN Option on the SYN/ACK (e.g. because it has been 1412 stripped by a middlebox or not sent by the server), the client 1413 switches into a mode that assumes that the AccECN Option is not 1414 available for this half connection. 1416 Similarly, if the TCP server has successfully negotiated AccECN but 1417 does not receive an AccECN Option on the first segment that 1418 acknowledges sequence space at least covering the ISN, it switches 1419 into a mode that assumes that the AccECN Option is not available for 1420 this half connection. 1422 While a host is in this mode that assumes incoming AccECN Options are 1423 not available, it MUST adopt the conservative interpretation of the 1424 ACE field discussed in Section 3.2.2.5.
However, it cannot make any 1425 assumption about support of outgoing AccECN Options on the other half 1426 connection, so it SHOULD continue to send the AccECN Option itself 1427 (unless it has established that sending the AccECN Option is causing 1428 packets to be blocked as in Section 3.2.3.2.2). 1430 If a host is in the mode that assumes incoming AccECN Options are not 1431 available, but it receives an AccECN Option at any later point during 1432 the connection, this clearly indicates that the AccECN Option is not 1433 blocked on the respective path, and the AccECN endpoint MAY switch 1434 out of the mode that assumes the AccECN Option is not available for 1435 this half connection. 1437 3.2.3.2.4. Test for Zeroing of the AccECN Option 1439 For a related test for invalid initialization of the ACE field, see 1440 Section 3.2.2.3 1442 Section 3.2 required the Data Receiver to initialize the r.e0b 1443 counter to a non-zero value. Therefore, in either direction the 1444 initial value of the EE0B field in the AccECN Option (if one exists) 1445 ought to be non-zero. If AccECN has been negotiated: 1447 o the TCP server MAY check the initial value of the EE0B field in 1448 the first segment that acknowledges sequence space that at least 1449 covers the ISN plus 1. If the initial value of the EE0B field is 1450 zero, the server will switch into a mode that ignores the AccECN 1451 Option for this half connection. 1453 o the TCP client MAY check the initial value of the EE0B field on 1454 the SYN/ACK. If the initial value of the EE0B field is zero, the 1455 client will switch into a mode that ignores the AccECN Option for 1456 this half connection. 1458 While a host is in the mode that ignores the AccECN Option it MUST 1459 adopt the conservative interpretation of the ACE field discussed in 1460 Section 3.2.2.5. 
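The absence test (Section 3.2.3.2.3) and the zeroing test above can be combined into one decision on the first relevant segment (the SYN/ACK at the client, or the first segment acknowledging the required sequence space at the server). The following is a minimal sketch under stated assumptions: the names OptionMode, initial_option_mode and must_use_conservative_ace are hypothetical, ee0b is None when the option is present but carries no EE0B field, and the zeroing check is applied unconditionally even though the text makes it optional (MAY).

```python
from enum import Enum
from typing import Optional


class OptionMode(Enum):
    USE_OPTION = "use AccECN Option feedback"
    ASSUME_UNAVAILABLE = "assume incoming AccECN Options are not available"
    IGNORE_OPTION = "ignore the AccECN Option (EE0B was zeroed)"


def initial_option_mode(option_present: bool,
                        ee0b: Optional[int]) -> OptionMode:
    """Pick the mode for this half connection from the first relevant segment."""
    if not option_present:
        # Absence test: option stripped by a middlebox or not sent.
        return OptionMode.ASSUME_UNAVAILABLE
    if ee0b is not None and ee0b == 0:
        # Zeroing test: r.e0b is initialized to a non-zero value, so a
        # zero EE0B field indicates that the option has been mangled.
        return OptionMode.IGNORE_OPTION
    return OptionMode.USE_OPTION


def must_use_conservative_ace(mode: OptionMode) -> bool:
    """Both fall-back modes require the conservative ACE interpretation."""
    return mode is not OptionMode.USE_OPTION
```

Only a zero value is rejected; any non-zero initial value is accepted, in line with the rule that specific initial values are not to be tested.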
1462 Note that the Data Sender MUST NOT test whether the arriving byte 1463 counters in the initial AccECN Option have been initialized to 1464 specific valid values - the above checks solely test whether these 1465 fields have been incorrectly zeroed. This allows hosts to use 1466 different initial values as an additional signalling channel in 1467 future. Also note that the initial value of either field might be 1468 greater than its expected initial value, because the counters might 1469 already have been incremented. Nonetheless, the initial values of 1470 the counters have been chosen so that they cannot wrap to zero on 1471 these initial segments. 1473 3.2.3.2.5. Consistency between AccECN Feedback Fields 1475 When the AccECN Option is available it supplements but does not 1476 replace the ACE field. An endpoint using AccECN feedback MUST always 1477 consider the information provided in the ACE field whether or not the 1478 AccECN Option is also available. 1480 If the AccECN option is present, the s.cep counter might increase 1481 while the s.ceb counter does not (e.g. due to a CE-marked control 1482 packet). The sender's response to such a situation is out of scope, 1483 and needs to be dealt with in a specification that uses ECN-capable 1484 control packets. Theoretically, this situation could also occur if a 1485 middlebox mangled the AccECN Option but not the ACE field. However, 1486 the Data Sender has to assume that the integrity of the AccECN Option 1487 is sound, based on the above test of the well-known initial values 1488 and optionally other integrity tests (Section 5.3). 1490 If either end-point detects that the s.ceb counter has increased but 1491 the s.cep has not (and by testing ACK coverage it is certain how much 1492 the ACE field has wrapped), this invalid protocol transition has to 1493 be due to some form of feedback mangling. 
So, the Data Sender MUST 1494 disable sending ECN-capable packets for the remainder of the half- 1495 connection by setting the IP/ECN field in all subsequent packets to 1496 Not-ECT. 1498 3.2.3.3. Usage of the AccECN TCP Option 1500 If the Data Receiver intends to use the AccECN TCP Option to provide 1501 feedback, the following rules determine when a Data Receiver in 1502 AccECN mode sends an ACK with the AccECN TCP Option, and which fields 1503 to include: 1505 Change-Triggered ACKs: If an arriving packet increments a different 1506 byte counter to that incremented by the previous packet, the Data 1507 Receiver SHOULD immediately send an ACK with an AccECN Option, 1508 without waiting for the next delayed ACK (this is in addition to 1509 the safety recommendation in Section 3.2.2.5 against ambiguity of 1510 the ACE field). 1512 Even though this bullet is stated as a "SHOULD", it is important 1513 for a transition to immediately trigger an ACK if at all possible, 1514 as already argued when specifying change-triggered ACKs for the 1515 ACE field. 1517 Continual Repetition: Otherwise, if arriving packets continue to 1518 increment the same byte counter, the Data Receiver can include an 1519 AccECN Option on most or all (delayed) ACKs, but it does not have 1520 to. 1522 * It SHOULD include a counter that has continued to increment on 1523 the next scheduled ACK following a change-triggered ACK; 1525 * While the same counter continues to increment, it SHOULD 1526 include the counter every n ACKs as consistently as possible, 1527 where n can be chosen by the implementer; 1529 * It SHOULD always include an AccECN Option if the r.ceb counter 1530 is incrementing and it MAY include an AccECN Option if r.e0b 1531 or r.e1b is incrementing; 1533 * It SHOULD include each counter at least once for every 2^22 1534 bytes incremented to prevent overflow during continual 1535 repetition.
1537 If the smallest allowed AccECN Option would leave insufficient 1538 space for two SACK blocks on a particular ACK, the Data Receiver 1539 MUST give precedence to the SACK option (total 18 octets), because 1540 loss feedback is more critical. 1542 Necessary Option Length: It MAY exclude counter(s) that have not 1543 changed for the whole connection (but beacons still include all 1544 fields - see below). It SHOULD include counter(s) that have 1545 incremented at some time during the connection. It MUST include 1546 the counter(s) that have incremented since the previous AccECN 1547 Option and it MUST only truncate fields from the right-hand tail 1548 of the option to preserve the order of the remaining fields (see 1549 Section 3.2.3). 1551 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1552 length AccECN TCP Option on at least three ACKs per RTT, or on all 1553 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1554 example algorithm that satisfies this requirement). 1556 The above rules complement those in Section 3.2.2.5, which determine 1557 when to generate an ACK irrespective of whether an AccECN TCP Option 1558 is to be included. 1560 The following example series of arriving IP/ECN fields illustrates 1561 when a Data Receiver will emit an ACK with an AccECN Option if it is 1562 using a delayed ACK factor of 2 segments and change-triggered ACKs: 1563 01 -> ACK, 01, 01 -> ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 1564 -> ACK. 1566 Even though the first bullet is stated as a "SHOULD", it is important for 1567 a transition to immediately trigger an ACK if at all possible, so 1568 that the Data Sender can rely on change-triggered ACKs to detect 1569 queue growth as soon as possible, e.g. at the start of a flow.
This 1570 requirement can only be relaxed if certain offload hardware needed 1571 for high performance cannot support change-triggered ACKs (although 1572 high performance protocols such as DCTCP already successfully use 1573 change-triggered ACKs). One possible experimental compromise would 1574 be for the receiver to heuristically detect whether the sender is in 1575 slow-start, then to implement change-triggered ACKs while the sender 1576 is in slow-start, and offload otherwise. 1578 For the avoidance of doubt, this change-triggered ACK mechanism is 1579 deliberately worded to ignore the arrival of a control packet with no 1580 payload, which therefore does not alter any byte counters, because it 1581 is important that TCP does not acknowledge pure ACKs. The change- 1582 triggered ACK approach can lead to some additional ACKs but it feeds 1583 back the timing and the order in which ECN marks are received with 1584 minimal additional complexity. If only CE marks are infrequent, or 1585 there are multiple marks in a row, the additional load will be low. 1586 Other marking patterns could increase the load significantly. 1587 Investigating the additional load is a goal of the proposed 1588 experiment. 1590 Implementation note: sending an AccECN Option each time a different 1591 counter changes and including a full-length AccECN Option on every 1592 delayed ACK will satisfy the requirements described above and might 1593 be the easiest implementation, as long as sufficient space is 1594 available in each ACK (in total and in the option space). 1596 Appendix A.3 gives an example algorithm to estimate the number of 1597 marked bytes from the ACE field alone, if the AccECN Option is not 1598 available. 1600 If a host has determined that segments with the AccECN Option always 1601 seem to be discarded somewhere along the path, it is no longer 1602 obliged to follow the above rules. 1604 3.3.
AccECN Compliance Requirements for TCP Proxies, Offload Engines 1605 and other Middleboxes 1607 3.3.1. Requirements for TCP Proxies 1609 A large class of middleboxes split TCP connections. Such a middlebox 1610 would be compliant with the AccECN protocol if the TCP implementation 1611 on each side complied with the present AccECN specification and each 1612 side negotiated AccECN independently of the other side. 1614 3.3.2. Requirements for TCP Normalizers 1616 Another large class of middleboxes intervenes to some degree at the 1617 transport layer, but attempts to be transparent (invisible) to the 1618 end-to-end connection. A subset of this class of middleboxes 1619 attempts to `normalize' the TCP wire protocol by checking that all 1620 values in header fields comply with a rather narrow and often 1621 outdated interpretation of the TCP specifications. To comply with 1622 the present AccECN specification, such a middlebox MUST NOT change 1623 the ACE field or the AccECN Option. 1625 A middlebox claiming to be transparent at the transport layer MUST 1626 forward the AccECN TCP Option unaltered, whether or not the length 1627 value matches one of those specified in Section 3.2.3, and whether or 1628 not the initial values of the byte-counter fields are correct. This 1629 is because blocking apparently invalid values does not improve 1630 security (because AccECN hosts are required to ignore invalid values 1631 anyway), while it prevents the standardized set of values being 1632 extended in future (because outdated normalizers would block updated 1633 hosts from using the extended AccECN standard). 1635 3.3.3. Requirements for TCP ACK Filtering 1637 A node that implements ACK filtering (aka. thinning or coalescing) 1638 and itself also implements ECN marking will not need to filter ACKs 1639 from connections that use AccECN feedback. 
Therefore, such a node 1640 SHOULD detect connections that are using AccECN feedback and it 1641 SHOULD refrain from filtering the ACKs of such connections (if it 1642 coalesced ACKs it would not be AccECN-compliant, but the requirement 1643 is stated as a "SHOULD" in order to allow leeway for pre-existing ACK 1644 filtering functions to be brought into line). 1646 A node that implements ACK filtering and does not itself implement 1647 ECN marking does not need to treat AccECN connections any differently 1648 from other TCP connections. Nonetheless, it is RECOMMENDED that such 1649 nodes implement ECN marking and comply with the requirements of the 1650 previous paragraph. This should be a better way than ACK filtering 1651 to improve the performance of AccECN TCP connections. 1653 The rationale for these requirements is that AccECN feedback provides 1654 sufficient information to a Data Receiver for it to be able to 1655 monitor ECN marking of the ACKs it has sent, so that it can thin the 1656 ACK stream itself. This could eventually mean that ACK filtering in 1657 the network gives no performance advantage. Then TCP will be able to 1658 maintain its own control over ACK coalescing. This will also allow 1659 the TCP Data Sender to use the timing of ACK arrivals to more 1660 reliably infer further information about the path congestion level. 1662 Note that the specification of AccECN in TCP does not presume to rely 1663 on any of the above ACK filtering behaviour in the network, because 1664 it has to be robust against pre-existing network nodes that still 1665 filter AccECN ACKs, and robust against ACK loss during overload. 1667 Section 5.2.1 of [RFC3449] gives best current practice on ACK 1668 filtering (aka. thinning or coalescing). It gives no advice on ACKs 1669 carrying ECN feedback (other than that filtering ought to preserve 1670 the correct operation of ECN feedback), because at the time it said 1671 that "ECN remain areas of ongoing research".
This section updates 1672 that advice for a TCP connection that supports AccECN feedback. 1674 3.3.4. Requirements for TCP Segmentation Offload 1676 Hardware to offload certain TCP processing represents another large 1677 class of middleboxes (even though it is often a function of a host's 1678 network interface and rarely in its own 'box'). 1680 The ACE field changes with every received CE marking, so today's 1681 receive offloading could lead to many interrupts in high congestion 1682 situations. Although that would be useful (because congestion 1683 information is received sooner), it could also significantly increase 1684 processor load, particularly in scenarios such as DCTCP or L4S where 1685 the marking rate is generally higher. 1687 Current offload hardware ejects a segment from the coalescing process 1688 whenever the TCP ECN flags change. Thus Classic ECN causes offload 1689 to be inefficient. In data centres it has been fortunate for this 1690 offload hardware that DCTCP-style feedback changes less often when 1691 there are long sequences of CE marks, which is more common with a 1692 step marking threshold (but less likely the more short flows are in 1693 the mix). The ACE counter approach has been designed so that 1694 coalescing can continue over arbitrary patterns of marking and only 1695 needs to stop when the counter wraps. Nonetheless, until the 1696 particular offload hardware in use implements this more efficient 1697 approach, it is likely to be more efficient for AccECN connections to 1698 implement this counter-style logic using software segmentation 1699 offload. 1701 ECN encodes a varying signal in the ACK stream, so it is inevitable 1702 that offload hardware will ultimately need to handle any form of ECN 1703 feedback exceptionally. The ACE field has been designed as a counter 1704 so that it is straightforward for offload hardware to pass on the 1705 highest counter, and to push a segment from its cache before the 1706 counter wraps. 
The purpose of working towards standardized TCP ECN 1707 feedback is to reduce the risk for hardware developers, who would 1708 otherwise have to guess which scheme is likely to become dominant. 1710 The above process has been designed to enable a continuing 1711 incremental deployment path - to more highly dynamic congestion 1712 control. Once DCTCP offload hardware supports AccECN, it will be 1713 able to coalesce efficiently for any sequence of marks, instead of 1714 relying for efficiency on the long marking sequences from step 1715 marking. In the next stage, DCTCP marking can evolve from a step to 1716 a ramp function. That in turn will allow host congestion control 1717 algorithms to respond faster to dynamics, while being backwards 1718 compatible with existing host algorithms. 1720 4. Updates to RFC 3168 1722 Normative statements in the following sections of RFC3168 are updated 1723 by the present AccECN specification: 1725 o The whole of "6.1.1 TCP Initialization" of [RFC3168] is updated by 1726 Section 3.1 of the present specification. 1728 o In "6.1.2. The TCP Sender" of [RFC3168], all mentions of a 1729 congestion response to an ECN-Echo (ECE) ACK packet are updated by 1730 Section 3.2 of the present specification to mean an increment to 1731 the sender's count of CE-marked packets, s.cep. And the 1732 requirements to set the CWR flag no longer apply, as specified in 1733 Section 3.1.5 of the present specification. Otherwise, the 1734 remaining requirements in "6.1.2. The TCP Sender" still stand. 1736 It will be noted that RFC 8311 already updates, or potentially 1737 updates, a number of the requirements in "6.1.2. The TCP Sender". 1738 Section 6.1.2 of RFC 3168 extended standard TCP congestion control 1739 [RFC5681] to cover ECN marking as well as packet drop. Whereas, 1740 RFC 8311 enables experimentation with alternative responses to ECN 1741 marking, if specified for instance by an experimental RFC on the 1742 IETF document stream. 
RFC 8311 also strengthened the statement 1743 that "ECT(0) SHOULD be used" to a "MUST" (see [RFC8311] for the 1744 details). 1746 o The whole of "6.1.3. The TCP Receiver" of [RFC3168] is updated by 1747 Section 3.2 of the present specification, with the exception of 1748 the last paragraph (about congestion response to drop and ECN in 1749 the same round trip), which still stands. Incidentally, this last 1750 paragraph is in the wrong section, because it relates to TCP 1751 sender behaviour. 1753 o The following text within "6.1.5. Retransmitted TCP packets": 1755 "the TCP data receiver SHOULD ignore the ECN field on arriving 1756 data packets that are outside of the receiver's current 1757 window." 1759 is updated by more stringent acceptability tests for any packet 1760 (not just data packets) in the present specification. 1761 Specifically, in the normative specification of AccECN (Section 3) 1762 only 'Acceptable' packets contribute to the ECN counters at the 1763 AccECN receiver and Section 1.3 defines an Acceptable packet as 1764 one that passes the acceptability tests in both [RFC0793] and 1765 [RFC5961]. 1767 o Sections 5.2, 6.1.1, 6.1.4, 6.1.5 and 6.1.6 of [RFC3168] prohibit 1768 use of ECN on TCP control packets and retransmissions. The 1769 present specification does not update that aspect of RFC 3168, but 1770 it does say what feedback an AccECN Data Receiver should provide 1771 if it receives an ECN-capable control packet or retransmission. 1772 This ensures AccECN is forward compatible with any future scheme 1773 that allows ECN on these packets, as provided for in section 4.3 1774 of [RFC8311] and as proposed in [I-D.ietf-tcpm-generalized-ecn]. 1776 5. Interaction with TCP Variants 1778 This section is informative, not normative. 1780 5.1. Compatibility with SYN Cookies 1782 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1783 protect itself from SYN flooding attacks. 
It places minimal commonly 1784 used connection state in the SYN/ACK, and deliberately does not hold 1785 any state while waiting for the subsequent ACK (e.g. it closes the 1786 thread). Therefore it cannot record the fact that it entered AccECN 1787 mode for both half-connections. Indeed, it cannot even remember 1788 whether it negotiated the use of classic ECN [RFC3168]. 1790 Nonetheless, such a server can determine that it negotiated AccECN as 1791 follows. If a TCP server using SYN Cookies supports AccECN and if it 1792 receives a pure ACK that acknowledges an ISN that is a valid SYN 1793 cookie, and if the ACK contains an ACE field with the value 0b010 to 1794 0b111 (decimal 2 to 7), it can assume that: 1796 o the TCP client must have requested AccECN support on the SYN 1797 o it (the server) must have confirmed that it supported AccECN 1799 Therefore the server can switch itself into AccECN mode, and continue 1800 as if it had never forgotten that it switched itself into AccECN mode 1801 earlier. 1803 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1804 with the value 0b000 or 0b001, these values indicate that the client 1805 did not request support for AccECN and therefore the server does not 1806 enter AccECN mode for this connection. Further, 0b001 on the ACK 1807 implies that the server sent an ECN-capable SYN/ACK, which was marked 1808 CE in the network, and the non-AccECN client fed this back by setting 1809 ECE on the ACK of the SYN/ACK. 1811 5.2. Compatibility with TCP Experiments and Common TCP Options 1813 AccECN is compatible (at least on paper) with the most commonly used 1814 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1815 also compatible with the recent promising experimental TCP options 1816 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 
1817 AccECN is friendly to all these protocols, because space for TCP 1818 options is particularly scarce on the SYN, where AccECN consumes zero 1819 additional header space. 1821 When option space is under pressure from other options, 1822 Section 3.2.3.3 provides guidance on how important it is to send an 1823 AccECN Option and whether it needs to be a full-length option. 1825 Implementers of TFO need to take careful note of the recommendation 1826 in Section 3.2.2.1. That section recommends that, if the client has 1827 successfully negotiated AccECN, when acknowledging the SYN/ACK, even 1828 if it has data to send, it sends a pure ACK immediately before the 1829 data. Then it can reflect the IP-ECN field of the SYN/ACK on this 1830 pure ACK, which allows the server to detect ECN mangling. 1832 5.3. Compatibility with Feedback Integrity Mechanisms 1834 Three alternative mechanisms are available to assure the integrity of 1835 ECN and/or loss signals. AccECN is compatible with any of these 1836 approaches: 1838 o The Data Sender can test the integrity of the receiver's ECN (or 1839 loss) feedback by occasionally setting the IP-ECN field to a value 1840 normally only set by the network (and/or deliberately leaving a 1841 sequence number gap). Then it can test whether the Data 1842 Receiver's feedback faithfully reports what it expects (similar to 1843 para 2 of Section 20.2 of [RFC3168]). Unlike the ECN Nonce 1844 [RFC3540], this approach does not waste the ECT(1) codepoint in 1845 the IP header, it does not require standardization and it does not 1846 rely on misbehaving receivers volunteering to reveal feedback 1847 information that allows them to be detected. However, setting the 1848 CE mark by the sender might conceal actual congestion feedback 1849 from the network and should therefore only be done sparingly. 
1851 o Networks generate congestion signals when they are becoming 1852 congested, so networks are more likely than Data Senders to be 1853 concerned about the integrity of the receiver's feedback of these 1854 signals. A network can enforce a congestion response to its ECN 1855 markings (or packet losses) using congestion exposure (ConEx) 1856 audit [RFC7713]. Whether the receiver or a downstream network is 1857 suppressing congestion feedback or the sender is unresponsive to 1858 the feedback, or both, ConEx audit can neutralize any advantage 1859 that any of these three parties would otherwise gain. 1861 ConEx is a change to the Data Sender that is most useful when 1862 combined with AccECN. Without AccECN, the ConEx behaviour of a 1863 Data Sender would have to be more conservative than would be 1864 necessary if it had the accurate feedback of AccECN. 1866 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1867 detect any tampering with AccECN feedback between the Data 1868 Receiver and the Data Sender (whether malicious or accidental). 1869 The AccECN fields are immutable end-to-end, so they are amenable 1870 to TCP-AO protection, which covers TCP options by default. 1871 However, TCP-AO is often too brittle to use on many end-to-end 1872 paths, where middleboxes can make verification fail in their 1873 attempts to improve performance or security, e.g. by 1874 resegmentation or shifting the sequence space. 1876 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1877 of congestion feedback. With minor changes AccECN could be optimized 1878 for the possibility that the ECT(1) codepoint might be used as an ECN 1879 Nonce. However, given RFC 3540 has been reclassified as historic, 1880 the AccECN design has been generalized so that it ought to be able to 1881 support other possible uses of the ECT(1) codepoint, such as a lower 1882 severity or a more instant congestion signal than CE. 1884 6. 
Protocol Properties

1886     This section is informative, not normative. It describes how well the
1887     protocol satisfies the agreed requirements for a more accurate ECN
1888     feedback protocol [RFC7560].

1890     Accuracy: From each ACK, the Data Sender can infer the number of new
1891        CE marked segments since the previous ACK. This provides better
1892        accuracy on CE feedback than classic ECN. In addition, if the
1893        AccECN Option is present (not blocked by the network path), the
1894        numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

1896     Overhead: The AccECN scheme is divided into two parts. The
1897        essential part reuses the 3 flags already assigned to ECN in the
1898        TCP header. The supplementary part adds an additional TCP option
1899        consuming up to 11 bytes. However, no TCP option is consumed in
1900        the SYN.

1902     Ordering: The order in which marks arrive at the Data Receiver is
1903        preserved in AccECN feedback, because the Data Receiver is
1904        expected to send an ACK immediately whenever a different mark
1905        arrives.

1907     Timeliness: While the same ECN markings are arriving continually at
1908        the Data Receiver, it can defer ACKs as TCP does normally, but it
1909        will immediately send an ACK as soon as a different ECN marking
1910        arrives.

1912     Timeliness vs Overhead: Change-Triggered ACKs are intended to enable
1913        latency-sensitive uses of ECN feedback by capturing the timing of
1914        transitions but not wasting resources while the state of the
1915        signalling system is stable. Within the constraints of the
1916        change-triggered ACK rules, the receiver can control how
1917        frequently it sends the AccECN TCP Option and therefore to some
1918        extent it can control the overhead induced by AccECN.

1920     Resilience: All information is provided based on counters.
1921        Therefore if ACKs are lost, the counters on the first ACK
1922        following the losses allow the Data Sender to immediately recover
1923        the number of the ECN markings that it missed.
And if data or 1924 ACKs are reordered, stale congestion information can be identified 1925 and ignored. 1927 Resilience against Bias: Because feedback is based on repetition of 1928 counters, random losses do not remove any information, they only 1929 delay it. Therefore, even though some ACKs are change-triggered, 1930 random losses will not alter the proportions of the different ECN 1931 markings in the feedback. 1933 Resilience vs Overhead: If space is limited in some segments (e.g. 1934 because more options are needed on some segments, such as the SACK 1935 option after loss), the Data Receiver can send AccECN Options less 1936 frequently or truncate fields that have not changed, usually down 1937 to as little as 5 bytes. However, it has to send a full-sized 1938 AccECN Option at least three times per RTT, which the Data Sender 1939 can rely on as a regular beacon or checkpoint. 1941 Resilience vs Timeliness and Ordering: Ordering information and the 1942 timing of transitions cannot be communicated in three cases: i) 1943 during ACK loss; ii) if something on the path strips the AccECN 1944 Option; or iii) if the Data Receiver is unable to support Change- 1945 Triggered ACKs. Following ACK reordering, the Data Sender can 1946 reconstruct the order in which feedback was sent, but not until 1947 all the missing feedback has arrived. 1949 Complexity: An AccECN implementation solely involves simple counter 1950 increments, some modulo arithmetic to communicate the least 1951 significant bits and allow for wrap, and some heuristics for 1952 safety against fields cycling due to prolonged periods of ACK 1953 loss. Each host needs to maintain eight additional counters. The 1954 hosts have to apply some additional tests to detect tampering by 1955 middleboxes, but in general the protocol is simple to understand, 1956 simple to implement and requires few cycles per packet to execute. 
1958 Integrity: AccECN is compatible with at least three approaches that 1959 can assure the integrity of ECN feedback. If the AccECN Option is 1960 stripped the resolution of the feedback is degraded, but the 1961 integrity of this degraded feedback can still be assured. 1963 Backward Compatibility: If only one endpoint supports the AccECN 1964 scheme, it will fall-back to the most advanced ECN feedback scheme 1965 supported by the other end. 1967 Backward Compatibility: If the AccECN Option is stripped by a 1968 middlebox, AccECN still provides basic congestion feedback in the 1969 ACE field. Further, AccECN can be used to detect mangling of the 1970 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 1971 marked segments; and blocking of segments carrying the AccECN 1972 Option. It can detect these conditions during TCP's 3WHS so that 1973 it can fall back to operation without ECN and/or operation without 1974 the AccECN Option. 1976 Forward Compatibility: The behaviour of endpoints and middleboxes is 1977 carefully defined for all reserved or currently unused codepoints 1978 in the scheme. Then, the designers of security devices can 1979 understand which currently unused values might appear in future. 1980 So, even if they choose to treat such values as anomalous while 1981 they are not widely used, any blocking will at least be under 1982 policy control not hard-coded. Then, if previously unused values 1983 start to appear on the Internet (or in standards), such policies 1984 could be quickly reversed. 1986 7. IANA Considerations 1988 This document reassigns bit 7 of the TCP header flags to the AccECN 1989 experiment. This bit was previously called the Nonce Sum (NS) flag 1990 [RFC3540], but RFC 3540 has been reclassified as historic [RFC8311]. 
1991     The flag will now be defined as:

1993                  +-----+-------------------+-----------+
1994                  | Bit | Name              | Reference |
1995                  +-----+-------------------+-----------+
1996                  | 7   | AE (Accurate ECN) | RFC XXXX  |
1997                  +-----+-------------------+-----------+

1999     [TO BE REMOVED: IANA is requested to update the existing entry in the
2000     Transmission Control Protocol (TCP) Header Flags registration
2001     (https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1)
2002     for Bit 7 to "AE (Accurate ECN),
2003     previously used as NS (Nonce Sum) by [RFC3540], which is now Historic
2004     [RFC8311]" and change the reference to this RFC-to-be instead of
2005     RFC8311.]

2007     This document also defines two new TCP options for AccECN, assigned
2008     values of TBD0 and TBD1 (decimal) from the TCP option space. These
2009     values are defined as:

2011       +------+--------+--------------------------------+-----------+
2012       | Kind | Length | Meaning                        | Reference |
2013       +------+--------+--------------------------------+-----------+
2014       | TBD0 | N      | Accurate ECN Order 0 (AccECN0) | RFC XXXX  |
2015       | TBD1 | N      | Accurate ECN Order 1 (AccECN1) | RFC XXXX  |
2016       +------+--------+--------------------------------+-----------+

2018     [TO BE REMOVED: This registration should take place at the following
2019     location:
2020     http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

2022     Early implementations using experimental option 254 per [RFC6994]
2023     with the single magic number 0xACCE (16 bits), as allocated in the
2024     IANA "TCP Experimental Option Experiment Identifiers (TCP ExIDs)"
2025     registry, SHOULD migrate to use these new option kinds (TBD0 & TBD1).
2027 [TO BE REMOVED: The description of the 0xACCE value in the TCP ExIDs 2028 registry should be changed to "AccECN (current and new 2029 implementations SHOULD use option kinds TBD0 and TBD1)" at the 2030 following location: https://www.iana.org/assignments/tcp-parameters/ 2031 tcp-parameters.xhtml#tcp-exids ] 2033 8. Security Considerations 2035 If ever the supplementary part of AccECN based on the new AccECN TCP 2036 Option is unusable (due for example to middlebox interference) the 2037 essential part of AccECN's congestion feedback offers only limited 2038 resilience to long runs of ACK loss (see Section 3.2.2.5). These 2039 problems are unlikely to be due to malicious intervention (because if 2040 an attacker could strip a TCP option or discard a long run of ACKs it 2041 could wreak other arbitrary havoc). However, it would be of concern 2042 if AccECN's resilience could be indirectly compromised during a 2043 flooding attack. AccECN is still considered safe though, because if 2044 the option is not presented, the AccECN Data Sender is then required 2045 to switch to more conservative assumptions about wrap of congestion 2046 indication counters (see Section 3.2.2.5 and Appendix A.2). 2048 Section 5.1 describes how a TCP server can negotiate AccECN and use 2049 the SYN cookie method for mitigating SYN flooding attacks. 2051 There is concern that ECN markings could be altered or suppressed, 2052 particularly because a misbehaving Data Receiver could increase its 2053 own throughput at the expense of others. AccECN is compatible with 2054 the three schemes known to assure the integrity of ECN feedback (see 2055 Section 5.3 for details). If the AccECN Option is stripped by an 2056 incorrectly implemented middlebox, the resolution of the feedback 2057 will be degraded, but the integrity of this degraded information can 2058 still be assured. 
2060 In Section 3.2.3 a Data Sender is allowed to ignore an unrecognized 2061 TCP AccECN Option length and read as many whole 3-octet fields from 2062 it as possible up to a maximum of 3, treating the remainder as 2063 padding. This opens up a potential covert channel of up to 29B (40 - 2064 (2+3*3))B. {ToDo: If necessary this 'forward compatibility' 2065 requirement can be capped or removed in a future revision, in order 2066 to narrow or close the covert channel. Comments are solicited on 2067 whether such a covert channel would be acceptable in TCP (given other 2068 TCP options already open up covert channels). And, if not, whether 2069 the channel should be narrowed or completely closed.} 2071 There is a potential concern that a receiver could deliberately omit 2072 the AccECN Option pretending that it had been stripped by a 2073 middlebox. No known way can yet be contrived to take advantage of 2074 this downgrade attack, but it is mentioned here in case someone else 2075 can contrive one. 2077 The AccECN protocol is not believed to introduce any new privacy 2078 concerns, because it merely counts and feeds back signals at the 2079 transport layer that had already been visible at the IP layer. 2081 9. Acknowledgements 2083 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 2084 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf, 2085 Michael Tuexen, Yuchung Cheng, Kenjiro Cho, Olivier Tilmans, Ilpo 2086 Jaervinen and Neal Cardwell for their input and discussion. The idea 2087 of using the three ECN-related TCP flags as one field for more 2088 accurate TCP-ECN feedback was first introduced in the re-ECN protocol 2089 that was the ancestor of ConEx. 
2091     Bob Briscoe was part-funded by the Comcast Innovation Fund, the
2092     European Community under its Seventh Framework Programme through the
2093     Reducing Internet Transport Latency (RITE) project (ICT-317700) and
2094     through the Trilogy 2 project (ICT-317756), and the Research Council
2095     of Norway through the TimeIn project. The views expressed here are
2096     solely those of the authors.

2098     Mirja Kuehlewind was partly supported by the European Commission
2099     under Horizon 2020 grant agreement no. 688421 Measurement and
2100     Architecture for a Middleboxed Internet (MAMI), and by the Swiss
2101     State Secretariat for Education, Research, and Innovation under
2102     contract no. 15.0268. This support does not imply endorsement.

2104  10. Comments Solicited

2106     Comments and questions are encouraged and very welcome. They can be
2107     addressed to the IETF TCP Maintenance and Minor Extensions (tcpm)
2108     working group mailing list, and/or to the authors.

2110  11. References

2112  11.1. Normative References

2114     [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
2115                RFC 793, DOI 10.17487/RFC0793, September 1981.

2118     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2119                Requirement Levels", BCP 14, RFC 2119,
2120                DOI 10.17487/RFC2119, March 1997.

2123     [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
2124                of Explicit Congestion Notification (ECN) to IP",
2125                RFC 3168, DOI 10.17487/RFC3168, September 2001.

2128     [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
2129                Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

2132     [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2133                2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
2134                May 2017.

2136  11.2. Informative References

2138     [I-D.ietf-tcpm-2140bis]
2139                Touch, J., Welzl, M., and S. Islam, "TCP Control Block
2140                Interdependence", draft-ietf-tcpm-2140bis-07 (work in
2141                progress), December 2020.
2143     [I-D.ietf-tcpm-generalized-ecn]
2144                Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
2145                Congestion Notification (ECN) to TCP Control Packets",
2146                draft-ietf-tcpm-generalized-ecn-06 (work in progress),
2147                October 2020.

2149     [I-D.ietf-tsvwg-l4s-arch]
2150                Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low
2151                Latency, Low Loss, Scalable Throughput (L4S) Internet
2152                Service: Architecture", draft-ietf-tsvwg-l4s-arch-08 (work
2153                in progress), November 2020.

2155     [Mandalari18]
2156                Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe.
2157                Alay, "Measuring ECN++: Good News for ++, Bad News for ECN
2158                over Mobile", IEEE Communications Magazine, March 2018.

2160     [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
2161                Selective Acknowledgment Options", RFC 2018,
2162                DOI 10.17487/RFC2018, October 1996.

2165     [RFC3449]  Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M.
2166                Sooriyabandara, "TCP Performance Implications of Network
2167                Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449,
2168                December 2002.

2170     [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
2171                Congestion Notification (ECN) Signaling with Nonces",
2172                RFC 3540, DOI 10.17487/RFC3540, June 2003.

2175     [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
2176                Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

2179     [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
2180                Ramakrishnan, "Adding Explicit Congestion Notification
2181                (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
2182                DOI 10.17487/RFC5562, June 2009.

2185     [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
2186                Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
2187                June 2010.

2189     [RFC5961]  Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's
2190                Robustness to Blind In-Window Attacks", RFC 5961,
2191                DOI 10.17487/RFC5961, August 2010.

2194     [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O.
Bonaventure,
2195                "TCP Extensions for Multipath Operation with Multiple
2196                Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

2199     [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
2200                RFC 6994, DOI 10.17487/RFC6994, August 2013.

2203     [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
2204                Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

2207     [RFC7560]  Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
2208                "Problem Statement and Requirements for Increased Accuracy
2209                in Explicit Congestion Notification (ECN) Feedback",
2210                RFC 7560, DOI 10.17487/RFC7560, August 2015.

2213     [RFC7713]  Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
2214                Concepts, Abstract Mechanism, and Requirements", RFC 7713,
2215                DOI 10.17487/RFC7713, December 2015.

2218     [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
2219                and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
2220                Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
2221                October 2017.

2223     [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
2224                Notification (ECN) Experimentation", RFC 8311,
2225                DOI 10.17487/RFC8311, January 2018.

2228     [RFC8511]  Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst,
2229                "TCP Alternative Backoff with ECN (ABE)", RFC 8511,
2230                DOI 10.17487/RFC8511, December 2018.

2233  Appendix A. Example Algorithms

2235     This appendix is informative, not normative. It gives example
2236     algorithms that would satisfy the normative requirements of the
2237     AccECN protocol. However, implementers are free to choose other ways
2238     to implement the requirements.

2240  A.1.
Example Algorithm to Encode/Decode the AccECN Option 2242 The example algorithms below show how a Data Receiver in AccECN mode 2243 could encode its CE byte counter r.ceb into the ECEB field within the 2244 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 2245 the ECEB field into its byte counter s.ceb. The other counters for 2246 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 2247 similarly encoded and decoded. 2249 It is assumed that each local byte counter is an unsigned integer 2250 greater than 24b (probably 32b), and that the following constant has 2251 been assigned: 2253 DIVOPT = 2^24 2255 Every time a CE marked data segment arrives, the Data Receiver 2256 increments its local value of r.ceb by the size of the TCP Data. 2257 Whenever it sends an ACK with the AccECN Option, the value it writes 2258 into the ECEB field is 2260 ECEB = r.ceb % DIVOPT 2262 where '%' is the remainder operator. 2264 On the arrival of an AccECN Option, the Data Sender first makes sure 2265 the ACK has not been superseded in order to avoid winding the s.ceb 2266 counter backwards. It uses the TCP acknowledgement number and any 2267 SACK options to calculate newlyAckedB, the amount of new data that 2268 the ACK acknowledges in bytes (newlyAckedB can be zero but not 2269 negative). If newlyAckedB is zero, either the ACK has been 2270 superseded or CE-marked packet(s) without data could have arrived. 2271 To break the tie for the latter case, the Data Sender could use 2272 timestamps (if present) to work out newlyAckedT, the amount of new 2273 time that the ACK acknowledges. If the Data Sender determines that 2274 the ACK has been superseded it ignores the AccECN Option. 
Otherwise,
2275     the Data Sender calculates the minimum non-negative difference d.ceb
2276     between the ECEB field and its local s.ceb counter, using modulo
2277     arithmetic as follows:

2279        if ((newlyAckedB > 0) || (newlyAckedT > 0)) {
2280            d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
2281            s.ceb += d.ceb
2282        }

2284     For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
2285     then

2287        s.ceb % DIVOPT = 1
2288        d.ceb = (1461 + 2^24 - 1) % 2^24
2289              = 1460
2290        s.ceb = 33,554,433 + 1460
2291              = 33,555,893

2293  A.2. Example Algorithm for Safety Against Long Sequences of ACK Loss

2295     The example algorithms below show how a Data Receiver in AccECN mode
2296     could encode its CE packet counter r.cep into the ACE field, and how
2297     the Data Sender in AccECN mode could decode the ACE field into its
2298     s.cep counter. The Data Sender's algorithm includes code to
2299     heuristically detect a long enough unbroken string of ACK losses that
2300     could have concealed a cycle of the congestion counter in the ACE
2301     field of the next ACK to arrive.

2303     Two variants of the algorithm are given: i) a more conservative
2304     variant for a Data Sender to use if it detects that the AccECN Option
2305     is not available (see Section 3.2.2.5 and Section 3.2.3.2); and ii) a
2306     less conservative variant that is feasible when complementary
2307     information is available from the AccECN Option.

2309  A.2.1. Safety Algorithm without the AccECN Option

2311     It is assumed that each local packet counter is a sufficiently sized
2312     unsigned integer (probably 32b) and that the following constant has
2313     been assigned:

2315        DIVACE = 2^3

2317     Every time an Acceptable CE marked packet arrives (Section 3.2.2.2),
2318     the Data Receiver increments its local value of r.cep by 1. It
2319     repeats the same value of ACE in every subsequent ACK until the next
2320     CE marking arrives, where

2322        ACE = r.cep % DIVACE.
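As a minimal sketch only, the modulo encoding above and the decoding step of Appendix A.1 could be written as follows. Python is used purely for illustration and is not the notation of this appendix. The function names encode_field and decode_delta are hypothetical and are not defined by the protocol. The check that an ACK has not been superseded (newlyAckedB, newlyAckedT) is deliberately left out, so a real Data Sender would apply it before calling decode_delta.

```python
DIVOPT = 2 ** 24  # modulus of the 24-bit byte-counter fields in the AccECN Option
DIVACE = 2 ** 3   # modulus of the 3-bit ACE field


def encode_field(counter: int, divisor: int) -> int:
    """Value the Data Receiver writes: the least significant bits of its counter."""
    return counter % divisor


def decode_delta(field: int, local_counter: int, divisor: int) -> int:
    """Minimum non-negative increment that reconciles the field with the local counter."""
    return (field + divisor - (local_counter % divisor)) % divisor


# Worked example from Appendix A.1: s.ceb = 33,554,433 and ECEB = 1461.
s_ceb = 33_554_433
d_ceb = decode_delta(1461, s_ceb, DIVOPT)
s_ceb += d_ceb
print(d_ceb, s_ceb)  # 1460 33555893

# The same arithmetic for the ACE field (illustrative values): r.cep = 10 is
# sent as ACE = 2, and a Data Sender holding s.cep = 7 infers 3 new CE marks.
ace = encode_field(10, DIVACE)
d_cep = decode_delta(ace, 7, DIVACE)
print(ace, d_cep)  # 2 3
```

Because only the least significant bits are sent, decode_delta cannot tell an increment of d from an increment of d plus a multiple of the divisor.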
2324 If the Data Sender received an earlier value of the counter that had 2325 been delayed due to ACK reordering, it might incorrectly calculate 2326 that the ACE field had wrapped. Therefore, on the arrival of every 2327 ACK, the Data Sender ensures the ACK has not been superseded using 2328 the TCP acknowledgement number, any SACK options and timestamps (if 2329 available) to calculate newlyAckedB, as in Appendix A.1. If the ACK 2330 has not been superseded, the Data Sender calculates the minimum 2331 difference d.cep between the ACE field and its local s.cep counter, 2332 using modulo arithmetic as follows: 2334 if ((newlyAckedB > 0) || (newlyAckedT > 0)) 2335 d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE 2337 Section 3.2.2.5 expects the Data Sender to assume that the ACE field 2338 cycled if it is the safest likely case under prevailing conditions. 2339 The 3-bit ACE field in an arriving ACK could have cycled and become 2340 ambiguous to the Data Sender if a row of ACKs goes missing that 2341 covers a stream of data long enough to contain 8 or more CE marks. 2342 We use the word `missing' rather than `lost', because some or all the 2343 missing ACKs might arrive eventually, but out of order. Even if some 2344 of the missing ACKs were piggy-backed on data (i.e. not pure ACKs) 2345 retransmissions will not repair the lost AccECN information, because 2346 AccECN requires retransmissions to carry the latest AccECN counters, 2347 not the original ones. 2349 The phrase `under prevailing conditions' allows for implementation- 2350 dependent interpretation. A Data Sender might take account of the 2351 prevailing size of data segments and the prevailing CE marking rate 2352 just before the sequence of missing ACKs. 
However, we shall start 2353 with the simplest algorithm, which assumes segments are all full- 2354 sized and ultra-conservatively it assumes that ECN marking was 100% 2355 on the forward path when ACKs on the reverse path started to all be 2356 dropped. Specifically, if newlyAckedB is the amount of data that an 2357 ACK acknowledges since the previous ACK, then the Data Sender could 2358 assume that this acknowledges newlyAckedPkt full-sized segments, 2359 where newlyAckedPkt = newlyAckedB/MSS. Then it could assume that the 2360 ACE field incremented by 2362 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 2364 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 2365 size segments than any previous ACK, and that ACE increments by a 2366 minimum of 2 CE marks (d.cep=2). The above formula works out that it 2367 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 2368 2). However, if ACE increases by a minimum of 2 but acknowledges 10 2369 full-sized segments, then it would be necessary to assume that there 2370 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 2372 ACKs that acknowledge a large stretch of packets might be common in 2373 data centres to achieve a high packet rate or might be due to ACK 2374 thinning by a middlebox. In these cases, cycling of the ACE field 2375 would often appear to have been possible, so the above algorithm 2376 would be over-conservative, leading to a false high marking rate and 2377 poor performance. Therefore it would be reasonable to only use 2378 dSafer.cep rather than d.cep if the moving average of newlyAckedPkt 2379 was well below 8. 2381 Implementers could build in more heuristics to estimate prevailing 2382 average segment size and prevailing ECN marking. 
For instance, 2383 newlyAckedPkt in the above formula could be replaced with 2384 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 2385 segment size and p is the prevailing ECN marking probability. 2386 However, ultimately, if TCP's ECN feedback becomes inaccurate it 2387 still has loss detection to fall back on. Therefore, it would seem 2388 safe to implement a simple algorithm, rather than a perfect one. 2390 The simple algorithm for dSafer.cep above requires no monitoring of 2391 prevailing conditions and it would still be safe if, for example, 2392 segments were on average at least 5% of full-sized as long as ECN 2393 marking was 5% or less. Assuming it was used, the Data Sender would 2394 increment its packet counter as follows: 2396 s.cep += dSafer.cep 2398 If missing acknowledgement numbers arrive later (due to reordering), 2399 Section 3.2.2.5 says "the Data Sender MAY attempt to neutralize the 2400 effect of any action it took based on a conservative assumption that 2401 it later found to be incorrect". To do this, the Data Sender would 2402 have to store the values of all the relevant variables whenever it 2403 made assumptions, so that it could re-evaluate them later. Given 2404 this could become complex and it is not required, we do not attempt 2405 to provide an example of how to do this. 2407 A.2.2. Safety Algorithm with the AccECN Option 2409 When the AccECN Option is available on the ACKs before and after the 2410 possible sequence of ACK losses, if the Data Sender only needs CE- 2411 marked bytes, it will have sufficient information in the AccECN 2412 Option without needing to process the ACE field. 
If for some reason
2413     it needs CE-marked packets, if dSafer.cep is different from d.cep, it
2414     can determine whether d.cep is likely to be a safe enough estimate by
2415     checking whether the average marked segment size (s = d.ceb/d.cep) is
2416     less than the MSS (where d.ceb is the amount of newly CE-marked bytes
2417     - see Appendix A.1). Specifically, it could use the following
2418     algorithm:

2420        SAFETY_FACTOR = 2
2421        if (dSafer.cep > d.cep) {
2422           if (d.ceb <= MSS * d.cep) { % Same as (s <= MSS), but no DBZ
2423              sSafer = d.ceb/dSafer.cep
2424              if (sSafer < MSS/SAFETY_FACTOR)
2425                  dSafer.cep = d.cep   % d.cep is a safe enough estimate
2426           } % else
2427               % No need for else; dSafer.cep is already correct,
2428               % because d.cep must have been too small
2429        }

2431     The chart below shows when the above algorithm will consider d.cep
2432     can replace dSafer.cep as a safe enough estimate of the number of CE-
2433     marked packets:

2435                          ^
2436                    sSafer|
2437                          |
2438                       MSS+
2439                          |
2440                          |         dSafer.cep
2441                          |                is
2442         MSS/SAFETY_FACTOR+--------------+ safest
2443                          |              |
2444                          | d.cep is safe|
2445                          |    enough    |
2446                          +-------------------->
2447                                        MSS   s

2449     The following examples give the reasoning behind the algorithm,
2450     assuming MSS=1460 [B]:

2452     o  if d.cep=0, dSafer.cep=8 and d.ceb=1460, then s=infinity and
2453        sSafer=182.5.
2454        Therefore even though the average size of 8 data segments is
2455        unlikely to have been as small as MSS/8, d.cep cannot have been
2456        correct, because it would imply an average segment size greater
2457        than the MSS.

2459     o  if d.cep=2, dSafer.cep=10 and d.ceb=1460, then s=730 and
2460        sSafer=146.
2461        Therefore d.cep is safe enough, because the average size of 10
2462        data segments is unlikely to have been as small as MSS/10.

2464     o  if d.cep=7, dSafer.cep=15 and d.ceb=10200, then s=1457 and
2465        sSafer=680.
2467 Therefore d.cep is safe enough, because the average data segment 2468 size is more likely to have been just less than one MSS, rather 2469 than below MSS/2. 2471 If pure ACKs were allowed to be ECN-capable, missing ACKs would be 2472 far less likely. However, because [RFC3168] currently precludes 2473 this, the above algorithm assumes that pure ACKs are not ECN-capable. 2475 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets 2477 If the AccECN Option is not available, the Data Sender can only 2478 decode CE-marking from the ACE field in packets. Every time an ACK 2479 arrives, to convert this into an estimate of CE-marked bytes, it 2480 needs an average of the segment size, s_ave. Then it can add or 2481 subtract s_ave from the value of d.ceb as the value of d.cep 2482 increments or decrements. Some possible ways to calculate s_ave are 2483 outlined below. The precise details will depend on why an estimate 2484 of marked bytes is needed. 2486 The implementation could keep a record of the byte numbers of all the 2487 boundaries between packets in flight (including control packets), and 2488 recalculate s_ave on every ACK. However it would be simpler to 2489 merely maintain a counter packets_in_flight for the number of packets 2490 in flight (including control packets), which is reset once per RTT. 2491 Either way, it would estimate s_ave as: 2493 s_ave ~= flightsize / packets_in_flight, 2495 where flightsize is the variable that TCP already maintains for the 2496 number of bytes in flight. To avoid floating point arithmetic, it 2497 could right-bit-shift by lg(packets_in_flight), where lg() means log 2498 base 2. 2500 An alternative would be to maintain an exponentially weighted moving 2501 average (EWMA) of the segment size: 2503 s_ave = a * s + (1-a) * s_ave, 2505 where a is the decay constant for the EWMA. 
However, then it is
2506     necessary to choose a good value for this constant, which ought to
2507     depend on the number of packets in flight. Also the decay constant
2508     needs to be a power of two to avoid floating point arithmetic.

2510  A.4. Example Algorithm to Beacon AccECN Options

2512     Section 3.2.3.3 requires a Data Receiver to beacon a full-length
2513     AccECN Option at least 3 times per RTT. This could be implemented by
2514     maintaining a variable to store the number of ACKs (pure and data
2515     ACKs) since a full AccECN Option was last sent and another for the
2516     approximate number of ACKs sent in the last round trip time:

2518        if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
2519            send_full_AccECN_Option()

2521     For optimized integer arithmetic, BEACON_FREQ = 4 could be used,
2522     rather than 3, so that the division could be implemented as an
2523     integer right bit-shift by lg(BEACON_FREQ).

2525     In certain operating systems, it might be too complex to maintain
2526     acks_in_round. In others it might be possible by tagging each data
2527     segment in the retransmit buffer with the number of ACKs sent at the
2528     point that segment was sent. This would not work well if the Data
2529     Receiver was not sending data itself, in which case it might be
2530     necessary to beacon based on time instead, as follows:

2532        if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
2533            send_full_AccECN_Option()

2535     This time-based approach does not work well when all the ACKs are
2536     sent early in each round trip, as is the case during slow-start. In
2537     this case few options will be sent (possibly even fewer than 3 per RTT).
2538     However, when continuously sending data, data packets as well as ACKs
2539     will spread out evenly over the RTT and sufficient ACKs with the
2540     AccECN option will be sent.

2542  A.5.
Example Algorithm to Count Not-ECT Bytes 2544 A Data Sender in AccECN mode can infer the amount of TCP payload data 2545 arriving at the receiver marked Not-ECT from the difference between 2546 the amount of newly ACKed data and the sum of the bytes with the 2547 other three markings, d.ceb, d.e0b and d.e1b. Note that, because 2548 r.e0b is initialized to 1 and the other two counters are initialized 2549 to 0, the initial sum will be 1, which matches the initial offset of 2550 the TCP sequence number on completion of the 3WHS. 2552 For this approach to be precise, it has to be assumed that spurious 2553 (unnecessary) retransmissions do not lead to double counting. This 2554 assumption is currently correct, given that RFC 3168 requires that 2555 the Data Sender marks retransmitted segments as Not-ECT. However, 2556 the converse is not true; necessary retransmissions will result in 2557 under-counting. 2559 However, such precision is unlikely to be necessary. The only known 2560 use of a count of Not-ECT marked bytes is to test whether equipment 2561 on the path is clearing the ECN field (perhaps due to an out-dated 2562 attempt to clear, or bleach, what used to be the ToS field). To 2563 detect bleaching it will be sufficient to detect whether nearly all 2564 bytes arrive marked as Not-ECT. Therefore there should be no need to 2565 keep track of the details of retransmissions. 2567 Appendix B. Rationale for Usage of TCP Header Flags 2569 B.1. Three TCP Header Flags in the SYN-SYN/ACK Handshake 2571 AccECN uses a rather unorthodox approach to negotiate the highest 2572 version TCP ECN feedback scheme that both ends support, as justified 2573 below. It follows from the original TCP ECN capability negotiation 2574 [RFC3168], in which the client set the 2 least significant of the 2575 original reserved flags in the TCP header, and fell back to no ECN 2576 support if the server responded with the 2 flags cleared, which had 2577 previously been the default. 

   ECN originally used header flags rather than a TCP option because
   it was considered more efficient to use a header flag for 1 bit of
   feedback per ACK, and this bit could be overloaded to indicate
   support for ECN during the handshake.  During the development of
   ECN, 1 bit crept up to 2, in order to deliver the feedback reliably
   and to work round some broken hosts that reflected the reserved
   flags during the handshake.

   In order to be backward compatible with RFC 3168, AccECN continues
   this approach, using the 3rd least significant TCP header flag that
   had previously been allocated for the ECN nonce (now historic).
   Then, whatever form of server an AccECN client encounters, the
   connection can fall back to the highest version of feedback
   protocol that both ends support, as explained in Section 3.1.

   If AccECN had used the more orthodox approach of a TCP option, it
   would still have had to set the two ECN flags in the main TCP
   header, in order to be able to fall back to Classic RFC 3168 ECN,
   or to disable ECN support, without another round of negotiation.
   Then AccECN would also have had to handle all the different ways
   that servers currently respond to settings of the ECN flags in the
   main TCP header, including all the conflicting cases where a server
   might have said it supported one approach in the flags and another
   approach in the new TCP option.  And AccECN would have had to deal
   with all the additional possibilities where a middlebox might have
   mangled the ECN flags, or removed the TCP option.  Thus, usage of
   the 3rd reserved TCP header flag simplified the protocol.

   The third flag was used in a way that could be distinguished from
   the ECN nonce, in case any nonce deployment was encountered.
   Previous usage of this flag for the ECN nonce was integrated into
   the original ECN negotiation.
   This further justified the 3rd flag's use for AccECN, because a
   non-ECN usage of this flag would have had to use it as a separate
   single bit, rather than in combination with the other 2 ECN flags.

   Indeed, having overloaded the original uses of these three flags
   for its handshake, AccECN overloads all three bits again as a 3-bit
   counter.

B.2.  Four Codepoints in the SYN/ACK

   Of the 8 possible codepoints that the 3 TCP header flags can
   indicate on the SYN/ACK, 4 already indicated earlier (or broken)
   versions of ECN support.  In the early design of AccECN, an AccECN
   server could use only 2 of the 4 remaining codepoints.  They both
   indicated AccECN support, but one fed back that the SYN had arrived
   marked as CE.  Even though ECN support on a SYN is not yet on the
   standards track, the idea is for either end to act as a dumb
   reflector, so that future capabilities can be unilaterally deployed
   without requiring 2-ended deployment (justified in Section 2.5).

   During traversal testing it was discovered that the ECN field in
   the SYN was mangled on a non-negligible proportion of paths.
   Therefore it was necessary to allow the SYN/ACK to feed back to the
   client all four IP/ECN codepoints that the SYN could arrive with.
   Without this, the client could not know whether to disable ECN for
   the connection due to mangling of the IP/ECN field (also explained
   in Section 2.5).  This development consumed the remaining 2
   codepoints on the SYN/ACK that had been reserved for future use by
   AccECN in earlier versions.

B.3.  Space for Future Evolution

   Although usable TCP header space is extremely scarce, the AccECN
   protocol has taken all possible steps to ensure that there is space
   to negotiate possible future variants of the protocol, either if
   the experiment proves that a variant of AccECN is required, or if a
   completely different ECN feedback approach is needed:

   Future AccECN variants:  When the AccECN capability is negotiated
      during TCP's 3WHS, the rows in Table 2 tagged as 'Nonce' and
      'Broken' in the column for the capability of node B are unused
      by any current protocol in the RFC series.  These could be used
      by TCP servers in future to indicate a variant of the AccECN
      protocol.  In recent measurement studies in which the response
      of large numbers of servers to an AccECN SYN has been tested,
      e.g. [Mandalari18], a very small number of SYN/ACKs arrive with
      the pattern tagged as 'Nonce', and a small but more significant
      number arrive with the pattern tagged as 'Broken'.  The 'Nonce'
      pattern could be a sign that a few servers have implemented the
      ECN Nonce [RFC3540], which has now been reclassified as historic
      [RFC8311], or it could be the random result of some unknown
      middlebox behaviour.  The greater prevalence of the 'Broken'
      pattern suggests that some instances still exist of the broken
      code that reflects the reserved flags on the SYN.

      The requirement not to reject unexpected initial values of the
      ACE counter (in the main TCP header) in the last paragraph of
      Section 3.2.2.3 ensures that 3 unused codepoints on the ACK of
      the SYN/ACK, 6 unused values on the first SYN=0 data packet from
      the client and 7 unused values on the first SYN=0 data packet
      from the server could be used to declare future variants of the
      AccECN protocol.
      The word 'declare' is used rather than 'negotiate' because, at
      this late stage in the 3WHS, it would be too late for a
      negotiation between the endpoints to be completed.  A similar
      requirement not to reject unexpected initial values in the TCP
      option (Section 3.2.3.2.4) is for the same purpose.  If
      traversal of the TCP option were reliable, this would have
      enabled a far wider range of future variation of the whole
      AccECN protocol.  Nonetheless, it could be used to reliably
      negotiate a wide range of variation in the semantics of the
      AccECN Option.

   Future non-AccECN variants:  Five codepoints out of the 8 possible
      in the 3 TCP header flags used by AccECN are unused on the
      initial SYN (in the order AE,CWR,ECE): 001, 010, 100, 101, 110.
      Section 3.1.3 ensures that the installed base of AccECN servers
      will all assume these are equivalent to AccECN negotiation with
      111 on the SYN.  These codepoints would not allow fall-back to
      Classic ECN support for a server that did not understand them,
      but this approach ensures they are available in future, perhaps
      for uses other than ECN alongside the AccECN scheme.  All
      possible combinations of SYN/ACK could be used in response
      except either 000 or reflection of the same values sent on the
      SYN.

      Of course, AccECN or ECN could be extended in other ways in
      future, although their traversal properties are likely to be
      inferior.  They include a new TCP option; using the remaining
      reserved flags in the main TCP header (preferably extending the
      3-bit combinations used by AccECN to 4-bit combinations, rather
      than burning one bit for just one state); a non-zero urgent
      pointer in combination with the URG flag cleared; or some other
      unexpected combination of fields yet to be invented.
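
   The SYN codepoint handling described under 'Future non-AccECN
   variants' above could be sketched as follows.  This is a
   non-normative illustration with hypothetical names (syn_ecn_mode,
   classify_syn, count_unused_syn_codepoints).  It assumes the usual
   RFC 3168 meanings of 000 (no ECN) and 011 (Classic ECN setup) on
   the SYN, and treats every other codepoint as an AccECN negotiation,
   which is the behaviour the text attributes to Section 3.1.3.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum syn_ecn_mode { SYN_NOT_ECN, SYN_CLASSIC_ECN, SYN_ACCECN };

/* flags holds the 3 header flags in the order AE,CWR,ECE
 * (AE is bit 2, CWR is bit 1, ECE is bit 0). */
enum syn_ecn_mode classify_syn(uint8_t flags)
{
    assert(flags <= 7);
    switch (flags) {
    case 0x0:                   /* 000: client does not support ECN */
        return SYN_NOT_ECN;
    case 0x3:                   /* 011: RFC 3168 ECN-setup SYN */
        return SYN_CLASSIC_ECN;
    default:                    /* 111 and the five unused codepoints */
        return SYN_ACCECN;
    }
}

/* Count the codepoints other than 111 that are treated as AccECN. */
unsigned count_unused_syn_codepoints(void)
{
    unsigned count = 0;

    for (uint8_t flags = 0; flags <= 7; flags++) {
        if (flags != 0x7 && classify_syn(flags) == SYN_ACCECN)
            count++;
    }
    return count;
}
```

   In this sketch count_unused_syn_codepoints() returns 5, matching
   the codepoints 001, 010, 100, 101 and 110 listed above.
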

Authors' Addresses

   Bob Briscoe
   Independent
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/

   Mirja Kuehlewind
   Ericsson
   Germany

   EMail: ietf@kuehlewind.net

   Richard Scheffenegger
   NetApp
   Vienna
   Austria

   EMail: Richard.Scheffenegger@netapp.com