TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                                 CableLabs
Intended status: Experimental                              M. Kuehlewind
Expires: May 15, 2018                                         ETH Zurich
                                                        R. Scheffenegger
                                                       November 11, 2017

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-05

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN is specified for TCP in such a way that only one feedback signal
   can be transmitted per Round-Trip Time (RTT). Recently, new TCP
   mechanisms like Congestion Exposure (ConEx) or Data Center TCP
   (DCTCP) need more accurate ECN feedback information whenever more
   than one marking is received in one RTT. This document specifies an
   experimental scheme to provide more than one feedback signal per RTT
   in the TCP header.
   Given TCP header space is scarce, it overloads
   the three existing ECN-related flags in the TCP header and provides
   additional information in a new TCP option.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 15, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Retransmission of the SYN
     3.2.  AccECN Feedback
       3.2.1.  Initialization of Feedback Counters at the Data Sender
       3.2.2.  The ACE Field
       3.2.3.  Testing for Zeroing of the ACE Field
       3.2.4.  Testing for Mangling of the IP/ECN Field
       3.2.5.  Safety against Ambiguity of the ACE Field
       3.2.6.  The AccECN Option
       3.2.7.  Path Traversal of the AccECN Option
       3.2.8.  Usage of the AccECN TCP Option
     3.3.  AccECN Compliance by TCP Proxies, Offload Engines and
           other Middleboxes
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other TCP Options and Experiments
     4.3.  Compatibility with Feedback Integrity Mechanisms
   5.  Protocol Properties
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Authors' Addresses

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need
   more accurate ECN feedback information whenever more than one marking
   is received in one RTT. A fuller treatment of the motivation for
   this specification is given in the associated requirements document
   [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT.
   It will be called the more accurate ECN feedback scheme, or AccECN
   for short. If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic TCP/
   ECN feedback, not a fork in the design of TCP. AccECN feedback
   complements TCP's loss feedback and it supplements classic TCP/ECN
   feedback, so its applicability is intended to include all public and
   private IP networks (and even any non-IP networks over which TCP is
   used today), whether or not any nodes on the path support ECN of
   whatever flavour.

   Until the AccECN experiment succeeds, [RFC3168] will remain as the
   standards track specification for adding ECN to TCP. To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN feedback overloads flags and fields in the main TCP header
   with new definitions, so both ends have to support the new wire
   protocol before it can be used. Therefore during the TCP handshake
   the two ends use the three ECN-related flags in the TCP header to
   negotiate the most advanced feedback protocol that they can both
   support.

   AccECN is solely an (experimental) change to the TCP wire protocol;
   it only specifies the negotiation and signaling of more accurate ECN
   feedback from a TCP Data Receiver to a Data Sender. It is completely
   independent of how TCP might respond to congestion feedback, which is
   out of scope. For that we refer to [RFC3168] or any RFC that
   specifies a different response to TCP ECN feedback, for example:
   [RFC8257]; or the ECN experiments referred to in
   [I-D.ietf-tsvwg-ecn-experimentation], namely: a TCP-based Low Latency
   Low Loss Scalable (L4S) congestion control [I-D.ietf-tsvwg-l4s-arch];
   ECN-capable TCP control packets [I-D.ietf-tcpm-generalized-ecn], or
   Alternative Backoff with ECN (ABE)
   [I-D.ietf-tcpm-alternativebackoff-ecn].

   It is likely (but not required) that the AccECN protocol will be
   implemented along with the following experimental additions to the
   TCP-ECN protocol: ECN-capable TCP control packets and retransmissions
   [I-D.ietf-tcpm-generalized-ecn], which includes the ECN-capable SYN/
   ACK experiment [RFC5562]; and testing receiver non-compliance
   [I-D.moncaster-tcpm-rcv-cheat].

1.1.  Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like. Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification. Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not. Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2.  Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3.  Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested. The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol. The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if it is
   deployed and if it satisfies the requirements of [RFC7560] in the
   consensus opinion of the IETF tcpm working group. In short, this
   requires that it improves the accuracy and timeliness of TCP's ECN
   feedback, as claimed in Section 5, while striking a balance between
   the conflicting requirements of resilience, integrity and
   minimisation of overhead. It also requires that it is not unduly
   complex, and that it is compatible with prevalent equipment
   behaviours in the current Internet (e.g. hardware offloading and
   middleboxes), whether or not they comply with standards.

   Testing will mostly focus on fall-back strategies in case of
   middlebox interference. Current recommended strategies are specified
   in Sections 3.1.2, 3.2.3, 3.2.4 and 3.2.7. The effectiveness of
   these strategies depends on the actual deployment situation of
   middleboxes. Therefore experimental verification to confirm large-
   scale path traversal in the Internet is needed before finalizing this
   specification on the Standards Track.

1.4.  Terminology

   AccECN:  The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN:  the ECN protocol specified in [RFC3168].

   Classic ECN feedback:  the feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK:  A TCP acknowledgement, with or without a data payload.

   Pure ACK:  A TCP acknowledgement without a data payload.

   TCP client:  The TCP stack that originates a connection.

   TCP server:  The TCP stack that responds to a connection request.

   Data Receiver:  The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender:  The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.5.  Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is Not-
   ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1),
   the node can mark the packet by setting both ECN bits, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +-----------------------+---------------+---------------------------+
   | IP-ECN codepoint      | Codepoint     | Description               |
   | (binary)              | name          |                           |
   +-----------------------+---------------+---------------------------+
   | 00                    | Not-ECT       | Not ECN-Capable Transport |
   | 01                    | ECT(1)        | ECN-Capable Transport (1) |
   | 10                    | ECT(0)        | ECN-Capable Transport (0) |
   | 11                    | CE            | Congestion Experienced    |
   +-----------------------+---------------+---------------------------+

                  Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost. The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag. This always leads to a full RTT
   of ACKs with ECE set. Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. RFC 3540 was never deployed so
   it is being reclassified as historic, making this TCP flag available
   for use by the AccECN experiment instead.
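
   The codepoints of Table 1 and the classic ECE/CWR behaviour recapped
   above can be illustrated with a minimal Python sketch. It is not part
   of the specification. The names IpEcn, ip_ecn_from_octet,
   classic_ecn_negotiated and ClassicEcnReceiver are hypothetical, the
   position of the ECN field in the IP header octet is taken from
   [RFC3168] rather than from this section, and the flag masks assume
   the 16-bit word of Figure 1 with bit 0 as the most significant bit.

```python
from enum import IntEnum


class IpEcn(IntEnum):
    """IP-ECN codepoints from Table 1 (2-bit values)."""
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


def ip_ecn_from_octet(tos_or_tclass: int) -> IpEcn:
    # Assumption (from RFC 3168, not this section): the ECN field is the two
    # least significant bits of the IPv4 ToS / IPv6 Traffic Class octet.
    return IpEcn(tos_or_tclass & 0b11)


# Masks within the 16-bit word of Figure 1 (bit 0 is the most significant).
NS = 0x100   # bit 7
CWR = 0x080  # bit 8
ECE = 0x040  # bit 9


def classic_ecn_negotiated(syn_flags: int, synack_flags: int) -> bool:
    """Classic handshake: ECE=CWR=1 on the SYN, ECE=1 and CWR=0 on the SYN/ACK."""
    syn_ok = (syn_flags & (CWR | ECE)) == (CWR | ECE)
    synack_ok = (synack_flags & (CWR | ECE)) == ECE
    return syn_ok and synack_ok


class ClassicEcnReceiver:
    """Hypothetical model of the ECE latch at a classic ECN Data Receiver."""

    def __init__(self) -> None:
        self.ece_latched = False

    def on_segment(self, ip_ecn: IpEcn, tcp_flags: int) -> int:
        """Process one arriving segment; return the ECE bit for the next ACK."""
        if tcp_flags & CWR:        # sender confirmed, so stop echoing
            self.ece_latched = False
        if ip_ecn == IpEcn.CE:     # a CE mark (re)starts the echo
            self.ece_latched = True
        return ECE if self.ece_latched else 0


# One CE mark and two CE marks within the same RTT (no CWR seen yet)
# produce the same stream of ACK flags, so the second mark is not fed back.
rx = ClassicEcnReceiver()
one_mark = [rx.on_segment(e, 0) for e in (IpEcn.CE, IpEcn.ECT0, IpEcn.ECT0)]
rx = ClassicEcnReceiver()
two_marks = [rx.on_segment(e, 0) for e in (IpEcn.CE, IpEcn.CE, IpEcn.ECT0)]
print(one_mark == two_marks)
```

   The final comparison shows the limitation that motivates AccECN: the
   latch carries one bit of state, so the number of CE marks per RTT is
   lost.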

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

     Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2.  AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP half-
   connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks). This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN.
   This design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1.  Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2.  Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 2 later). The LSBs
   of each of the three byte counters are carried in the AccECN Option.

2.3.  Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used. Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle. However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.
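
   The wrap ambiguity described above can be illustrated with a minimal
   Python sketch. It is not the safety algorithm of Appendix A.2. The
   function names are hypothetical, and the counter values are chosen
   only for illustration (the normative initial values are given in
   Section 3.2.1).

```python
from typing import List

ACE_BITS = 3
ACE_MOD = 1 << ACE_BITS  # a 3-bit field counts modulo 8


def ace_field(cep: int) -> int:
    """Three least significant bits of the receiver's CE packet counter."""
    return cep % ACE_MOD


def smallest_delta(prev_ace: int, new_ace: int) -> int:
    """Smallest non-negative increase consistent with two ACE values."""
    return (new_ace - prev_ace) % ACE_MOD


def possible_deltas(prev_ace: int, new_ace: int, max_wraps: int) -> List[int]:
    """All increases consistent with the two values, up to max_wraps cycles."""
    base = smallest_delta(prev_ace, new_ace)
    return [base + k * ACE_MOD for k in range(max_wraps + 1)]


# Illustrative values: the receiver's counter moves from 6 to 15 (nine CE
# packets) while every intervening ACK is lost. The sender sees the ACE
# field go from 6 to 7, which looks like a single new CE mark.
before, after = ace_field(6), ace_field(15)
print(smallest_delta(before, after), possible_deltas(before, after, 1))
```

   A Data Sender that only has the ACE field cannot tell the candidates
   apart from the field alone, which is why additional information (for
   example the number of newly acknowledged bytes) is needed to bound
   the number of wraps.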

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses. Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear
   [I-D.ietf-tsvwg-ecn-experimentation]. Because the 3-bit ACE field is
   so small, when it is the only field available the Data Sender has to
   interpret it conservatively assuming the worst possible wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4.  Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx.
   If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5.  Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and
   [I-D.moncaster-tcpm-rcv-cheat]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on whether it is CE marked. Although RFC 3168
   prohibits an ECN-capable SYN, providing feedback of CE marking on the
   SYN supports future scenarios in which SYNs might be ECN-enabled
   (without prejudging whether they ought to be). For instance,
   [I-D.ietf-tsvwg-ecn-experimentation] updates this aspect of RFC 3168
   to allow experimentation with ECN-capable TCP control packets.

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   Not-ECT in compliance with RFC 3168, feedback on the state of the ECN
   field when it arrives at the receiver could still be useful, because
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field [Mandalari18]. If
   a TCP client has set the SYN to Not-ECT, but receives CE feedback, it
   can detect such middlebox interference and send Not-ECT for the rest
   of the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]). Today,
   if a TCP server receives ECT or CE on a SYN, it cannot know whether
   it is invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT). Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection. Instead, the AccECN protocol allows the server
   to feed back the received ECN field to the client, which then has all
   the information to decide whether the connection has to fall-back
   from supporting ECN (or not).

3.  AccECN Protocol Specification

3.1.  Negotiating to use AccECN

3.1.1.  Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] is being reclassified as historic, the
   present specification renames the TCP flag at bit 7 of the TCP header
   flags from NS (Nonce Sum) to AE (Accurate ECN) (see IANA
   Considerations in Section 6).

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN.
   The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN. This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.2.

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN. Table 2 tabulates all the negotiation
   possibilities for ECN-related capabilities that involve at least one
   AccECN-capable host. The entries in the first two columns have been
   abbreviated, as follows:

   AccECN:  More Accurate ECN Feedback (the present specification)

   Nonce:  ECN Nonce feedback [RFC3540]

   ECN:  'Classic' ECN feedback [RFC3168]

   No ECN:  Not-ECN-capable. Implicit congestion notification using
      packet drop.

   +--------+--------+------------+-------------+----------------------+
   | A      | B      | SYN A->B   | SYN/ACK     | Feedback Mode        |
   |        |        |            | B->A        |                      |
   +--------+--------+------------+-------------+----------------------+
   |        |        | AE CWR ECE | AE CWR ECE  |                      |
   | AccECN | AccECN | 1   1   1  | 0   1   0   | AccECN (Not-ECT on   |
   |        |        |            |             | SYN)                 |
   | AccECN | AccECN | 1   1   1  | 0   1   1   | AccECN (ECT1 on SYN) |
   | AccECN | AccECN | 1   1   1  | 1   0   0   | AccECN (ECT0 on SYN) |
   | AccECN | AccECN | 1   1   1  | 1   1   0   | AccECN (CE on SYN)   |
   |        |        |            |             |                      |
   | AccECN | Nonce  | 1   1   1  | 1   0   1   | classic ECN          |
   | AccECN | ECN    | 1   1   1  | 0   0   1   | classic ECN          |
   | AccECN | No ECN | 1   1   1  | 0   0   0   | Not ECN              |
   |        |        |            |             |                      |
   | Nonce  | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
   | ECN    | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
   | No ECN | AccECN | 0   0   0  | 0   0   0   | Not ECN              |
   |        |        |            |             |                      |
   | AccECN | Broken | 1   1   1  | 1   1   1   | Not ECN              |
   +--------+--------+------------+-------------+----------------------+

   Table 2: ECN capability negotiation between Client (A) and Server (B)

   Table 2 is divided into blocks each separated by an empty row.

   1.  The top block shows the case already described where both
       endpoints support AccECN and how the TCP server (B) indicates
       congestion feedback.

   2.  The second block shows the cases where the TCP client (A)
       supports AccECN but the TCP server (B) supports some earlier
       variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
       soon as an AccECN-capable TCP client (A) receives the SYN/ACK
       shown it MUST set both its half connections into the feedback
       mode shown in the rightmost column.

   3.  The third block shows the cases where the TCP server (B) supports
       AccECN but the TCP client (A) supports some earlier variant of
       TCP feedback, indicated in its SYN.
       Therefore, as soon as an AccECN-enabled TCP server (B) receives
       the SYN shown, it MUST set both its half connections into the
       feedback mode shown in the rightmost column.

   4.  The fourth block displays a combination labelled 'Broken'. Some
       older TCP server implementations incorrectly set the reserved
       flags in the SYN/ACK by reflecting those in the SYN. Such broken
       TCP servers (B) cannot support ECN, so as soon as an
       AccECN-capable TCP client (A) receives such a broken SYN/ACK it
       MUST fall back to Not ECN mode for both its half connections.

   The following exceptional cases need some explanation:

   ECN Nonce: An AccECN implementation, whether client or server,
      sender or receiver, does not need to implement the ECN Nonce
      feedback mode [RFC3540], which is being reclassified as historic
      [I-D.ietf-tsvwg-ecn-experimentation]. AccECN is compatible with
      an alternative ECN feedback integrity approach that does not use
      up the ECT(1) codepoint and can be implemented solely at the
      sender (see Section 4.3).

   Simultaneous Open: An originating AccECN Host (A), having sent a SYN
      with AE=1, CWR=1 and ECE=1, might receive another SYN from host B.
      Host A MUST then enter the same feedback mode as it would have
      entered had it been a responding host and received the same SYN.
      Then host A MUST send the same SYN/ACK as it would have sent had
      it been a responding host.

3.1.2. Retransmission of the SYN

   If the sender of an AccECN SYN times out before receiving the
   SYN/ACK, the sender SHOULD attempt to negotiate the use of AccECN at
   least one more time by continuing to set all three TCP ECN flags on
   the first retransmitted SYN (using the usual retransmission
   time-outs). If this first retransmission also fails to be
   acknowledged, the sender SHOULD send subsequent retransmissions of
   the SYN without any TCP-ECN flags set.
   This adds delay in the case where a middlebox drops an AccECN (or
   ECN) SYN deliberately. However, current measurements imply that a
   drop is less likely to be due to middlebox interference than other
   intermittent causes of loss, e.g. congestion, wireless interference,
   etc.

   Implementers MAY use other fall-back strategies if they are found to
   be more effective (e.g. attempting to negotiate AccECN on the SYN
   only once or more than twice (most appropriate during high levels of
   congestion); or falling back to classic ECN feedback rather than
   non-ECN). Further, it may make sense to also remove any other
   experimental fields or options on the SYN in case a middlebox might
   be blocking them, although the required behaviour will depend on the
   specification of the other option(s) and any attempt to co-ordinate
   fall-back between different modules of the stack. In any case, the
   TCP initiator SHOULD cache failed connection attempts. If it does,
   it SHOULD NOT give up attempting to negotiate AccECN on the SYN of
   subsequent connection attempts until it is clear that the blockage is
   persistently and specifically due to AccECN. The cache should be
   arranged to expire so that the initiator will infrequently attempt to
   check whether the problem has been resolved.

   The fall-back procedure if the TCP server receives no ACK to
   acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
   Section 3.2.7.

3.2. AccECN Feedback

   Each Data Receiver of each half connection maintains four counters,
   r.cep, r.ceb, r.e0b and r.e1b. The CE packet counter (r.cep) counts
   the number of packets the host receives with the CE code point in the
   IP ECN field, including CE marks on control packets without data.
   r.ceb, r.e0b and r.e1b count the number of TCP payload bytes in
   packets marked respectively with the CE, ECT(0) and ECT(1) codepoint
   in their IP-ECN field.
   When a host first enters AccECN mode, it initializes its counters to
   r.cep = 5, r.e0b = 1 and r.ceb = r.e1b = 0 (see Appendix A.5).
   Non-zero initial values are used to support a stateless handshake
   (see Section 4.1) and to be distinct from cases where the fields are
   incorrectly zeroed (e.g. by middleboxes - see Section 3.2.7.4).

   A host feeds back the CE packet counter using the Accurate ECN (ACE)
   field, as explained in the next section. It feeds back all the
   byte counters using the AccECN TCP Option, as specified in
   Section 3.2.6. Whenever a host feeds back the value of any counter,
   it MUST report the most recent value, no matter whether it is in a
   pure ACK, an ACK with new payload data or a retransmission.
   Therefore the feedback carried on a retransmitted packet is unlikely
   to be the same as the feedback on the original packet.

3.2.1. Initialization of Feedback Counters at the Data Sender

   Each Data Sender of each half connection maintains four counters,
   s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent
   counters at the Data Receiver. When a host enters AccECN mode, it
   initializes them to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 0.

   If a TCP client (A) in AccECN mode receives a SYN/ACK with CE
   feedback, i.e. AE=1, CWR=1, ECE=0, it increments s.cep to 6.
   Otherwise, for any of the 3 other combinations of the 3 ECN TCP flags
   (the top 3 rows in Table 2), s.cep remains initialized to 5.

3.2.2. The ACE Field

   After AccECN has been negotiated on the SYN and SYN/ACK, both hosts
   overload the three TCP flags (AE, CWR and ECE) in the main TCP header
   as one 3-bit field. Then the field is given a new name, ACE, as
   shown in Figure 2.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 2: Definition of the ACE field within bytes 13 and 14 of the
         TCP Header (when AccECN has been negotiated and SYN=0).

   The original definition of these three flags in the TCP header,
   including the addition of support for the ECN Nonce, is shown for
   comparison in Figure 1. This specification does not rename these
   three TCP flags to ACE unconditionally; it merely overloads them with
   another name and definition once an AccECN connection has been
   established.

   A host MUST interpret the AE, CWR and ECE flags as the 3-bit ACE
   counter on a segment with the SYN flag cleared (SYN=0) that it sends
   or receives if both of its half-connections are set into AccECN mode
   having successfully negotiated AccECN (see Section 3.1). A host MUST
   NOT interpret the 3 flags as a 3-bit ACE field on any segment with
   SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete
   or has not succeeded.

   Both parts of each of these conditions are equally important. For
   instance, even if AccECN negotiation has been successful, the ACE
   field is not defined on any segments with SYN=1 (e.g. a
   retransmission of an unacknowledged SYN/ACK, or when both ends send
   SYN/ACKs after AccECN support has been successfully negotiated during
   a simultaneous open).

   With only one exception, on any packet with the SYN flag cleared
   (SYN=0), the Data Receiver MUST encode the three least significant
   bits of its r.cep counter into the ACE field it feeds back to the
   Data Sender.
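
   As an illustration only, the general encoding rule above might be
   sketched in Python as follows. The function names are hypothetical
   and not part of this specification; the sketch assumes AE is the most
   significant bit of the ACE field and ECE the least significant, as
   implied by the flag order in Figure 2. It also shows why the field
   is ambiguous: counter values that differ by a multiple of 8 produce
   the same ACE value.

```python
ACE_MASK = 0b111  # the ACE field is 3 bits wide


def encode_ace(r_cep: int) -> tuple:
    """Return (AE, CWR, ECE) carrying the 3 least significant bits of r.cep.

    Assumption for this sketch: AE is the most significant bit.
    """
    ace = r_cep & ACE_MASK
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def decode_ace(ae: int, cwr: int, ece: int) -> int:
    """Reassemble the 3-bit ACE value from the three TCP flags."""
    return (ae << 2) | (cwr << 1) | ece


if __name__ == "__main__":
    # r.cep starts at 5, so the first ACE value is 0b101.
    print(encode_ace(5))
    # Eight further CE marks wrap the field back to the same value.
    print(encode_ace(13))
```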

   There is only one exception to this rule: On the final ACK of the
   3WHS, a TCP client (A) in AccECN mode MUST use the ACE field to feed
   back which of the 4 possible values of the IP-ECN field was on the
   SYN/ACK (the binary encoding is the same as that used on the
   SYN/ACK). Table 3 shows the meaning of each possible value of the ACE
   field on the ACK of the SYN/ACK and the value that an AccECN server
   MUST set s.cep to as a result. The encoding in Table 3 is solely
   applicable on a packet in the client-server direction with an
   acknowledgement number 1 greater than the Initial Sequence Number
   (ISN) that was used by the server.

   +--------------+---------------------------+------------------------+
   | ACE on ACK   | IP-ECN codepoint on       | Initial s.cep of       |
   | of SYN/ACK   | SYN/ACK inferred by       | server in AccECN mode  |
   |              | server                    |                        |
   +--------------+---------------------------+------------------------+
   | 0b000        | {Notes 1, 2}              | Disable ECN            |
   | 0b001        | {Notes 2, 3}              | 5                      |
   | 0b010        | Not-ECT                   | 5                      |
   | 0b011        | ECT(1)                    | 5                      |
   | 0b100        | ECT(0)                    | 5                      |
   | 0b101        | Currently Unused {Note 3} | 5                      |
   | 0b110        | CE                        | 6                      |
   | 0b111        | Currently Unused {Note 3} | 5                      |
   +--------------+---------------------------+------------------------+

   Table 3: Meaning of the ACE field on the ACK of the SYN/ACK

   {Note 1}: If the server is in AccECN mode, the value of zero raises
   suspicion of zeroing of the ACE field on the path (see
   Section 3.2.3).

   {Note 2}: If a server is in AccECN mode, there ought to be no valid
   case where the ACE field on the last ACK of the 3WHS has a value of
   0b000 or 0b001.

   However, in the case where a server that implements AccECN is also
   using a stateless handshake (termed a SYN cookie) it will not
   remember whether it entered AccECN mode. Then these two values
   remind it that it did not enter AccECN mode (see Section 4.1 for
   details).

   {Note 3}: If the server is in AccECN mode, these values are Currently
   Unused but the AccECN server's behaviour is still defined for forward
   compatibility.

3.2.3. Testing for Zeroing of the ACE Field

   Section 3.2 required the Data Receiver to initialize the r.cep
   counter to a non-zero value. Therefore, in either direction the
   initial value of the ACE field ought to be non-zero.

   If AccECN has been successfully negotiated, the Data Sender SHOULD
   check the initial value of the ACE field in the first arriving
   segment with SYN=0. If the initial value of the ACE field is zero
   (0b000), the Data Sender MUST disable sending ECN-capable packets for
   the remainder of the half-connection by setting the IP/ECN field in
   all subsequent packets to Not-ECT.

   For example, the server checks the ACK of the SYN/ACK or the first
   data segment from the client, while the client checks the first data
   segment from the server. More precisely, the "first segment with
   SYN=0" is defined as: the segment with SYN=0 that i) acknowledges
   sequence space at least covering the initial sequence number (ISN)
   plus 1; and ii) arrives before any other segments with SYN=0 so it is
   unlikely to be a retransmission. If no such segment arrives (e.g.
   because it is lost and the ISN is first acknowledged by a subsequent
   segment), no test for invalid initialization can be conducted, and
   the half-connection will continue in AccECN mode.

   Note that the Data Sender MUST NOT test whether the arriving counter
   in the initial ACE field has been initialized to a specific valid
   value - the above check solely tests whether the ACE fields have been
   incorrectly zeroed. This allows hosts to use different initial
   values as an additional signalling channel in future.

3.2.4. Testing for Mangling of the IP/ECN Field

   The value of the ACE field on the SYN/ACK indicates the value of the
   IP/ECN field when the SYN arrived at the server. The client can
   compare this with how it originally set the IP/ECN field on the SYN.
   If this comparison implies an unsafe transition of the IP/ECN field,
   for the remainder of the connection the client MUST NOT send
   ECN-capable packets, but it MUST continue to feed back any ECN
   markings on arriving packets.

   The value of the ACE field on the last ACK of the 3WHS indicates the
   value of the IP/ECN field when the SYN/ACK arrived at the client.
   The server can compare this with how it originally set the IP/ECN
   field on the SYN/ACK. If this comparison implies an unsafe
   transition of the IP/ECN field, for the remainder of the connection
   the server MUST NOT send ECN-capable packets, but it MUST continue to
   feed back any ECN markings on arriving packets.

   The ACK of the SYN/ACK is not reliably delivered (nonetheless, the
   count of CE marks is still eventually delivered reliably). If this
   ACK does not arrive, the server has to continue to send ECN-capable
   packets without having tested for mangling of the IP/ECN field on the
   SYN/ACK. Experiments with AccECN deployment will assess whether this
   limitation has any effect in practice.

   Invalid transitions of the IP/ECN field are defined in [RFC3168] and
   repeated here for convenience:

   o  the not-ECT codepoint changes;

   o  either ECT codepoint transitions to not-ECT;

   o  the CE codepoint changes.

   RFC 3168 says that a router that changes ECT to not-ECT is invalid
   but safe. However, from a host's viewpoint, this transition is
   unsafe because it could be the result of two transitions at different
   routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe).
   This scenario could well happen where an ECN-enabled home router
   congests its upstream mobile broadband bottleneck link, then the
   ingress to the mobile network clears the ECN field [Mandalari18].

   The above fall-back behaviours are necessary in case mangling of the
   IP/ECN field is asymmetric, which is currently common over some
   mobile networks [Mandalari18]. Then one end might see no unsafe
   transition and continue sending ECN-capable packets, while the other
   end sees an unsafe transition and stops sending ECN-capable packets.

3.2.5. Safety against Ambiguity of the ACE Field

   If too many CE-marked segments are acknowledged at once, or if a long
   run of ACKs is lost, the 3-bit counter in the ACE field might have
   cycled between two ACKs arriving at the Data Sender.

   Therefore an AccECN Data Receiver SHOULD immediately send an ACK once
   'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be
   2 and MUST be no greater than 6.

   If the Data Sender has not received AccECN TCP Options to give it
   more dependable information, and it detects that the ACE field could
   have cycled under the prevailing conditions, it SHOULD conservatively
   assume that the counter did cycle. It can detect if the counter
   could have cycled by using the jump in the acknowledgement number
   since the last ACK to calculate or estimate how many segments could
   have been acknowledged. An example algorithm to implement this
   policy is given in Appendix A.2. An implementer MAY develop an
   alternative algorithm as long as it satisfies these requirements.

   If missing acknowledgement numbers arrive later (reordering) and
   prove that the counter did not cycle, the Data Sender MAY attempt to
   neutralise the effect of any action it took based on a conservative
   assumption that it later found to be incorrect.

3.2.6. The AccECN Option

   The AccECN Option is defined as shown below in Figure 3.
   It consists of three 24-bit fields that provide the 24 least
   significant bits of the r.e0b, r.ceb and r.e1b counters,
   respectively. The initial 'E' of each field name stands for 'Echo'.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Kind = TBD1  |  Length = 11  |          EE0B field           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | EE0B (cont'd) |                  ECEB field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  EE1B field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 3: The AccECN Option

   The Data Receiver MUST set the Kind field to TBD1, which is
   registered in Section 6 as a new TCP option Kind called AccECN. An
   experimental TCP option with Kind=254 MAY be used for initial
   experiments, with magic number 0xACCE.

   Appendix A.1 gives an example algorithm for the Data Receiver to
   encode its byte counters into the AccECN Option, and for the Data
   Sender to decode the AccECN Option fields into its byte counters.

   Note that there is no field to feed back Not-ECT bytes. Nonetheless
   an algorithm for the Data Sender to calculate the number of payload
   bytes received as Not-ECT is given in Appendix A.5.

   Whenever a Data Receiver sends an AccECN Option, the rules in
   Section 3.2.8 expect it to always send a full-length option. To cope
   with option space limitations, it can omit unchanged fields from the
   tail of the option, as long as it preserves the order of the
   remaining fields and includes any field that has changed. The length
   field MUST indicate which fields are present as follows:

      Length=11: EE0B, ECEB, EE1B

      Length=8: EE0B, ECEB

      Length=5: EE0B

      Length=2: (empty)

   The empty option of Length=2 is provided to allow for a case where an
   AccECN Option has to be sent (e.g.
   on the SYN/ACK to test the path), but there is very limited space for
   the option. For initial experiments, the Length field MUST be 2
   greater to accommodate the 16-bit magic number.

   All implementations of a Data Sender MUST be able to read in AccECN
   Options of any of the above lengths. If the AccECN Option is of any
   other length, implementations MUST use those whole 3-octet fields
   that fit within the length and ignore the remainder of the option.

3.2.7. Path Traversal of the AccECN Option

3.2.7.1. Testing the AccECN Option during the Handshake

   The TCP client MUST NOT include the AccECN TCP Option on the SYN.
   Nonetheless, if the AccECN negotiation using the ECN flags in the
   main TCP header (Section 3.1) is successful, it implicitly declares
   that the endpoints also support the AccECN TCP Option. A fall-back
   strategy for the loss of the SYN (possibly due to middlebox
   interference) is specified in Section 3.1.2.

   A TCP server that confirms its support for AccECN (in response to an
   AccECN SYN from the client as described in Section 3.1) SHOULD also
   include an AccECN TCP Option in the SYN/ACK.

   A TCP client that has successfully negotiated AccECN SHOULD include
   an AccECN Option in the first ACK at the end of the 3WHS. However,
   this first ACK is not delivered reliably, so the TCP client SHOULD
   also include an AccECN Option on the first data segment it sends (if
   it ever sends one).

   A host MAY omit the AccECN Option in any of these three cases if it
   has cached knowledge that the packet would be likely to be blocked on
   the path to the other host if it included an AccECN Option.

3.2.7.2. Testing for Loss of Packets Carrying the AccECN Option

   If after the normal TCP timeout the TCP server has not received an
   ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been
   lost, e.g.
   due to congestion, or a middlebox might be blocking the AccECN
   Option. To expedite connection setup, the TCP server SHOULD
   retransmit the SYN/ACK with the same TCP flags (AE, CWR and ECE) but
   with no AccECN Option. If this retransmission times out, to expedite
   connection setup, the TCP server SHOULD disable AccECN and ECN for
   this connection by retransmitting the SYN/ACK with AE=CWR=ECE=0 and
   no AccECN Option. Implementers MAY use other fall-back strategies if
   they are found to be more effective (e.g. falling back to classic
   ECN feedback on the first retransmission; retrying the AccECN Option
   for a second time before fall-back (most appropriate during high
   levels of congestion); or falling back to classic ECN feedback rather
   than non-ECN on the third retransmission).

   If the TCP client detects that the first data segment it sent with
   the AccECN Option was lost, it SHOULD fall back to no AccECN Option
   on the retransmission. Again, implementers MAY use other fall-back
   strategies such as attempting to retransmit a second segment with the
   AccECN Option before fall-back, and/or caching whether the AccECN
   Option is blocked for subsequent connections.

   Either host MAY include the AccECN Option in a subsequent segment to
   retest whether the AccECN Option can traverse the path.

   If the TCP server receives a second SYN with a request for AccECN
   support, it should resend the SYN/ACK, again confirming its support
   for AccECN, but this time without the AccECN Option. This approach
   rules out any interference by middleboxes that may drop packets with
   unknown options, even though it is more likely that the SYN/ACK would
   have been lost due to congestion.
   The TCP server MAY try to send another packet with the AccECN Option
   at a later point during the connection but should monitor if that
   packet got lost as well, in which case it SHOULD disable the sending
   of the AccECN Option for this half-connection.

   Similarly, an AccECN end-point MAY separately memorize which data
   packets carried an AccECN Option and disable the sending of AccECN
   Options if the loss probability of those packets is significantly
   higher than that of all other data packets in the same connection.

3.2.7.3. Testing for Stripping of the AccECN Option

   If the TCP client has successfully negotiated AccECN but does not
   receive an AccECN Option on the SYN/ACK, it switches into a mode that
   assumes that the AccECN Option is not available for this half
   connection.

   Similarly, if the TCP server has successfully negotiated AccECN but
   does not receive an AccECN Option on the first segment that
   acknowledges sequence space at least covering the ISN, it switches
   into a mode that assumes that the AccECN Option is not available for
   this half connection.

   While a host is in this mode that assumes incoming AccECN Options are
   not available, it MUST adopt the conservative interpretation of the
   ACE field discussed in Section 3.2.5. However, it cannot make any
   assumption about support of outgoing AccECN Options on the other half
   connection, so it SHOULD continue to send the AccECN Option itself
   (unless it has established that sending the AccECN Option is causing
   packets to be blocked as in Section 3.2.7.2).
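
   One possible reading of the "conservative interpretation" mentioned
   above is sketched below in Python. It is an illustration under
   stated assumptions, not the example algorithm of Appendix A.2: the
   function and parameter names are hypothetical, and the sketch simply
   assumes that the 3-bit counter wrapped as many times as the number of
   newly acknowledged packets allows.

```python
ACE_MODULUS = 8  # the ACE field is a 3-bit counter


def conservative_cep_delta(ace_prev: int, ace_now: int,
                           newly_acked_pkts: int) -> int:
    """Largest plausible increase in the CE packet count between two ACKs.

    ace_prev, ace_now: ACE values on the previous and current ACK.
    newly_acked_pkts: estimate of how many packets the current ACK
        newly acknowledges (hypothetical input; a real stack would
        derive it from the jump in the acknowledgement number).
    """
    # Smallest increase consistent with the two ACE values.
    minimal = (ace_now - ace_prev) % ACE_MODULUS
    if newly_acked_pkts < minimal:
        return minimal
    # Largest value <= newly_acked_pkts that is congruent to `minimal`
    # modulo 8, i.e. assume the counter cycled whenever it could have.
    return newly_acked_pkts - ((newly_acked_pkts - minimal) % ACE_MODULUS)
```

   For instance, if the ACE field moves from 5 to 6 on an ACK that newly
   acknowledges 12 packets, this sketch assumes 9 CE marks rather than
   1.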

   If a host is in the mode that assumes incoming AccECN Options are not
   available, but it receives an AccECN Option at any later point during
   the connection, this clearly indicates that the AccECN Option is not
   blocked on the respective path, and the AccECN endpoint MAY switch
   out of the mode that assumes the AccECN Option is not available for
   this half connection.

3.2.7.4. Test for Zeroing of the AccECN Option

   For a related test for invalid initialization of the ACE field, see
   Section 3.2.3.

   Section 3.2 required the Data Receiver to initialize the r.e0b
   counter to a non-zero value. Therefore, in either direction the
   initial value of the EE0B field in the AccECN Option (if one exists)
   ought to be non-zero. If AccECN has been negotiated:

   o  the TCP server MAY check the initial value of the EE0B field in
      the first segment that acknowledges sequence space that at least
      covers the ISN plus 1. If the initial value of the EE0B field is
      zero, the server will switch into a mode that ignores the AccECN
      Option for this half connection.

   o  the TCP client MAY check the initial value of the EE0B field on
      the SYN/ACK. If the initial value of the EE0B field is zero, the
      client will switch into a mode that ignores the AccECN Option for
      this half connection.

   While a host is in the mode that ignores the AccECN Option it MUST
   adopt the conservative interpretation of the ACE field discussed in
   Section 3.2.5.

   Note that the Data Sender MUST NOT test whether the arriving byte
   counters in the initial AccECN Option have been initialized to
   specific valid values - the above checks solely test whether these
   fields have been incorrectly zeroed. This allows hosts to use
   different initial values as an additional signalling channel in
   future.
   Also note that the initial value of either field might be greater
   than its expected initial value, because the counters might already
   have been incremented. Nonetheless, the initial values of the
   counters have been chosen so that they cannot wrap to zero on these
   initial segments.

3.2.7.5. Consistency between AccECN Feedback Fields

   When the AccECN Option is available it supplements but does not
   replace the ACE field. An endpoint using AccECN feedback MUST always
   consider the information provided in the ACE field whether or not the
   AccECN Option is also available.

   If the AccECN Option is present, the s.cep counter might increase
   while the s.ceb counter does not (e.g. due to a CE-marked control
   packet). The sender's response to such a situation is out of scope,
   and needs to be dealt with in a specification that uses ECN-capable
   control packets. Theoretically, this situation could also occur if a
   middlebox mangled the AccECN Option but not the ACE field. However,
   the Data Sender has to assume that the integrity of the AccECN Option
   is sound, based on the above test of the well-known initial values
   and optionally other integrity tests (Section 4.3).

   If either end-point detects that the s.ceb counter has increased but
   the s.cep has not (and by testing ACK coverage it is certain how much
   the ACE field has wrapped), this invalid protocol transition has to
   be due to some form of feedback mangling. So, the Data Sender MUST
   disable sending ECN-capable packets for the remainder of the
   half-connection by setting the IP/ECN field in all subsequent packets
   to Not-ECT.

3.2.8. Usage of the AccECN TCP Option

   The following rules determine when a Data Receiver in AccECN mode
   sends the AccECN TCP Option, and which fields to include:

   Change-Triggered ACKs: If an arriving packet increments a different
      byte counter to that incremented by the previous packet, the Data
      Receiver MUST immediately send an ACK with an AccECN Option,
      without waiting for the next delayed ACK (this is in addition to
      the safety recommendation in Section 3.2.5 against ambiguity of
      the ACE field).

      This is stated as a "MUST" so that the data sender can rely on
      change-triggered ACKs to detect transitions right from the very
      start of a flow, without first having to detect whether the
      receiver complies. A concern has been raised that certain offload
      hardware needed for high performance might not be able to support
      change-triggered ACKs, although high performance protocols such as
      DCTCP successfully use change-triggered ACKs. One possible
      experimental compromise would be for the receiver to heuristically
      detect whether the sender is in slow-start, then to implement
      change-triggered ACKs in software while the sender is in
      slow-start, and offload to hardware otherwise. If the operator
      disables change-triggered ACKs, whether partially like this or
      otherwise, the operator will also be responsible for ensuring a
      co-ordinated sender algorithm is deployed;

   Continual Repetition: Otherwise, if arriving packets continue to
      increment the same byte counter, the Data Receiver can include an
      AccECN Option on most or all (delayed) ACKs, but it does not have
      to. If option space is limited on a particular ACK, the Data
      Receiver MUST give precedence to SACK information about loss.
      It SHOULD include an AccECN Option if the r.ceb counter has
      incremented and it MAY include an AccECN Option if r.e0b or
      r.e1b has incremented;

   Full-Length Options Preferred: It SHOULD always use full-length
      AccECN Options. It MAY use shorter AccECN Options if space is
      limited, but it MUST include the counter(s) that have incremented
      since the previous AccECN Option and it MUST only truncate fields
      from the right-hand tail of the option to preserve the order of
      the remaining fields (see Section 3.2.6);

   Beaconing Full-Length Options: Nonetheless, it MUST include a
      full-length AccECN TCP Option on at least three ACKs per RTT, or
      on all ACKs if there are fewer than three per RTT (see
      Appendix A.4 for an example algorithm that satisfies this
      requirement).

   The following example series of arriving IP/ECN fields illustrates
   when a Data Receiver will emit an ACK if it is using a delayed ACK
   factor of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 ->
   ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK.

   For the avoidance of doubt, the change-triggered ACK mechanism is
   deliberately worded to ignore the arrival of a control packet with no
   payload, which therefore does not alter any byte counters, because it
   is important that TCP does not acknowledge pure ACKs. The
   change-triggered ACK approach will lead to some additional ACKs but
   it feeds back the timing and the order in which ECN marks are
   received with minimal additional complexity.

   Implementation note: sending an AccECN Option each time a different
   counter changes and including a full-length AccECN Option on every
   delayed ACK will satisfy the requirements described above and might
   be the easiest implementation, as long as sufficient space is
   available in each ACK (in total and in the option space).
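
   The example series of arriving IP/ECN fields given above can be
   reproduced with the following Python sketch. It is illustrative
   only: the function name is hypothetical, every arriving packet is
   assumed to carry payload with an ECT or CE codepoint (so that a
   different IP/ECN field means a different byte counter is
   incremented), and the first packet of the series is treated as a
   change, which is an assumption made here to match the example.

```python
def ack_decisions(ip_ecn_fields, delayed_ack_factor=2):
    """Return one boolean per arriving packet: True if an ACK is emitted.

    ip_ecn_fields: iterable of IP/ECN field values such as "01", "10", "11".
    delayed_ack_factor: number of segments covered by a delayed ACK.
    """
    decisions = []
    previous = None   # IP/ECN field of the previous data packet
    unacked = 0       # data packets received since the last ACK
    for field in ip_ecn_fields:
        unacked += 1
        changed = field != previous          # a different byte counter moved
        send_ack = changed or unacked >= delayed_ack_factor
        if send_ack:
            unacked = 0
        decisions.append(send_ack)
        previous = field
    return decisions


if __name__ == "__main__":
    series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
    for field, acked in zip(series, ack_decisions(series)):
        print(field + (" -> ACK" if acked else ""))
```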

   Appendix A.3 gives an example algorithm to estimate the number of
   marked bytes from the ACE field alone, if the AccECN Option is not
   available.

   If a host has determined that segments with the AccECN Option always
   seem to be discarded somewhere along the path, it is no longer
   obliged to follow the above rules.

3.3. AccECN Compliance by TCP Proxies, Offload Engines and other
     Middleboxes

   A large class of middleboxes split TCP connections. Such a middlebox
   would be compliant with the AccECN protocol if the TCP implementation
   on each side complied with the present AccECN specification and each
   side negotiated AccECN independently of the other side.

   Another large class of middleboxes intervenes to some degree at the
   transport layer, but attempts to be transparent (invisible) to the
   end-to-end connection. A subset of this class of middleboxes
   attempts to 'normalise' the TCP wire protocol by checking that all
   values in header fields comply with a rather narrow interpretation of
   the TCP specifications. To comply with the present AccECN
   specification, such a middlebox MUST NOT change the ACE field or the
   AccECN Option and it MUST attempt to preserve the timing of each ACK
   (for example, if it coalesced ACKs it would not be AccECN-compliant).
   A middlebox claiming to be transparent at the transport layer MUST
   forward the AccECN TCP Option unaltered, whether or not the length
   value matches one of those specified in Section 3.2.6, and whether or
   not the initial values of the byte-counter fields are correct. This
   is because blocking apparently invalid values does not improve
   security (because AccECN hosts are required to ignore invalid values
   anyway), while it prevents the standardised set of values being
   extended in future (because outdated normalisers would block updated
   hosts from using the extended AccECN standard).
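
   The host behaviour that the above argument relies on (Section 3.2.6:
   use those whole 3-octet fields that fit within the length and ignore
   the remainder) might be sketched in Python as follows. This is an
   illustration only. The function name and dictionary keys are
   hypothetical, the Kind octet is not checked because its value is
   still TBD1, and the fields are assumed to be in network byte order.

```python
FIELD_NAMES = ("ee0b", "eceb", "ee1b")  # field order from Figure 3
FIELD_OCTETS = 3                        # each field is 24 bits


def parse_accecn_option(option: bytes, experimental: bool = False) -> dict:
    """Extract whichever whole 24-bit fields fit within the option length.

    option: the raw option, starting at the Kind octet.
    experimental: True if the option uses Kind=254 with the 16-bit
        magic number, which makes the Length 2 greater.
    """
    length = option[1]
    # Kind + Length, plus the magic number for the experimental form.
    header = 4 if experimental else 2
    body = option[header:length]
    fields = {}
    for index, name in enumerate(FIELD_NAMES):
        chunk = body[index * FIELD_OCTETS:(index + 1) * FIELD_OCTETS]
        if len(chunk) < FIELD_OCTETS:
            break  # ignore any trailing partial field
        fields[name] = int.from_bytes(chunk, "big")
    return fields
```

   A Length of 9, for example, yields the EE0B and ECEB fields and
   ignores the trailing odd octet, so a host does not need a middlebox
   to filter such options on its behalf.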

   Hardware to offload certain TCP processing represents another large
   class of middleboxes, even though it is often a function of a host's
   network interface and rarely in its own 'box'. Leeway has been
   allowed in the present AccECN specification in the expectation that
   offload hardware could comply and still serve its function.
   Nonetheless, such hardware MUST attempt to preserve the timing of
   each ACK (for example, if it coalesced ACKs it would not be
   AccECN-compliant).

4. Interaction with Other TCP Variants

   This section is informative, not normative.

4.1. Compatibility with SYN Cookies

   A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to
   protect itself from SYN flooding attacks. It places minimal commonly
   used connection state in the SYN/ACK, and deliberately does not hold
   any state while waiting for the subsequent ACK (e.g. it closes the
   thread). Therefore it cannot record the fact that it entered AccECN
   mode for both half-connections. Indeed, it cannot even remember
   whether it negotiated the use of classic ECN [RFC3168].

   Nonetheless, such a server can determine that it negotiated AccECN as
   follows. If a TCP server using SYN Cookies supports AccECN and if it
   receives a pure ACK that acknowledges an ISN that is a valid SYN
   cookie, and if the ACK contains an ACE field with the value 0b010 to
   0b111 (decimal 2 to 7), it can assume that:

   o  the TCP client must have requested AccECN support on the SYN

   o  it (the server) must have confirmed that it supported AccECN

   Therefore the server can switch itself into AccECN mode, and continue
   as if it had never forgotten that it switched itself into AccECN mode
   earlier.
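
   The inference described above, combined with the initial s.cep values
   of Table 3, might be sketched in Python as follows. This is an
   illustration only: the function name and return convention are
   hypothetical, and validation of the SYN cookie itself is assumed to
   have been done already.

```python
def server_state_from_cookie_ack(ace: int):
    """Infer (accecn_mode, initial_s_cep) from the ACE field of a pure ACK
    that acknowledges a valid SYN cookie.

    Returns (False, None) when the ACE value shows that AccECN was not
    negotiated (0b000 or 0b001).
    """
    if not 0 <= ace <= 0b111:
        raise ValueError("ACE is a 3-bit field")
    if ace in (0b000, 0b001):
        # The client did not request AccECN support.
        return False, None
    # 0b110 feeds back a CE mark on the SYN/ACK, so s.cep starts at 6;
    # every other value in 0b010..0b111 leaves it at the initial 5.
    initial_s_cep = 6 if ace == 0b110 else 5
    return True, initial_s_cep
```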

   If the pure ACK that acknowledges a SYN cookie contains an ACE field
   with the value 0b000 or 0b001, these values indicate that the client
   did not request support for AccECN and therefore the server does not
   enter AccECN mode for this connection. Further, 0b001 on the ACK
   implies that the server sent an ECN-capable SYN/ACK, which was marked
   CE in the network, and the non-AccECN client fed this back by setting
   ECE on the ACK of the SYN/ACK.

4.2.  Compatibility with Other TCP Options and Experiments

   AccECN is compatible (at least on paper) with the most commonly used
   TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is
   also compatible with the recent promising experimental TCP options
   TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]).
   AccECN is friendly to all these protocols, because space for TCP
   options is particularly scarce on the SYN, where AccECN consumes zero
   additional header space.

   When option space is under pressure from other options, Section 3.2.8
   provides guidance on how important it is to send an AccECN Option and
   whether it needs to be a full-length option.

4.3.  Compatibility with Feedback Integrity Mechanisms

   Three alternative mechanisms are available to assure the integrity of
   ECN and/or loss signals. AccECN is compatible with any of these
   approaches:

   o  The Data Sender can test the integrity of the receiver's ECN (or
      loss) feedback by occasionally setting the IP-ECN field to a value
      normally only set by the network (and/or deliberately leaving a
      sequence number gap). Then it can test whether the Data
      Receiver's feedback faithfully reports what it expects
      [I-D.moncaster-tcpm-rcv-cheat].
      Unlike the ECN Nonce [RFC3540],
      this approach does not waste the ECT(1) codepoint in the IP
      header, it does not require standardisation and it does not rely
      on misbehaving receivers volunteering to reveal feedback
      information that allows them to be detected. However, setting the
      CE mark by the sender might conceal actual congestion feedback
      from the network and should therefore only be done sparingly.

   o  Networks generate congestion signals when they are becoming
      congested, so networks are more likely than Data Senders to be
      concerned about the integrity of the receiver's feedback of these
      signals. A network can enforce a congestion response to its ECN
      markings (or packet losses) using congestion exposure (ConEx)
      audit [RFC7713]. Whether the receiver or a downstream network is
      suppressing congestion feedback or the sender is unresponsive to
      the feedback, or both, ConEx audit can neutralise any advantage
      that any of these three parties would otherwise gain.

      ConEx is a change to the Data Sender that is most useful when
      combined with AccECN. Without AccECN, the ConEx behaviour of a
      Data Sender would have to be more conservative than would be
      necessary if it had the accurate feedback of AccECN.

   o  The TCP authentication option (TCP-AO [RFC5925]) can be used to
      detect any tampering with AccECN feedback between the Data
      Receiver and the Data Sender (whether malicious or accidental).
      The AccECN fields are immutable end-to-end, so they are amenable
      to TCP-AO protection, which covers TCP options by default.
      However, TCP-AO is often too brittle to use on many end-to-end
      paths, where middleboxes can make verification fail in their
      attempts to improve performance or security, e.g. by
      resegmentation or shifting the sequence space.

   Originally the ECN Nonce [RFC3540] was proposed to ensure integrity
   of congestion feedback.
   With minor changes AccECN could be optimised
   for the possibility that the ECT(1) codepoint might be used as an ECN
   Nonce. However, given RFC 3540 is being reclassified as historic,
   the AccECN design has been generalised so that it ought to be able to
   support other possible uses of the ECT(1) codepoint, such as a lower
   severity or a more instant congestion signal than CE.

5.  Protocol Properties

   This section is informative, not normative. It describes how well the
   protocol satisfies the agreed requirements for a more accurate ECN
   feedback protocol [RFC7560].

   Accuracy:  From each ACK, the Data Sender can infer the number of new
      CE marked segments since the previous ACK. This provides better
      accuracy on CE feedback than classic ECN. In addition if the
      AccECN Option is present (not blocked by the network path) the
      numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

   Overhead:  The AccECN scheme is divided into two parts. The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header. The supplementary part adds an additional TCP option
      consuming up to 11 bytes. However, no TCP option is consumed in
      the SYN.

   Ordering:  The order in which marks arrive at the Data Receiver is
      preserved in AccECN feedback, because the Data Receiver is
      expected to send an ACK immediately whenever a different mark
      arrives.

   Timeliness:  While the same ECN markings are arriving continually at
      the Data Receiver, it can defer ACKs as TCP does normally, but it
      will immediately send an ACK as soon as a different ECN marking
      arrives.

   Timeliness vs Overhead:  Change-Triggered ACKs are intended to enable
      latency-sensitive uses of ECN feedback by capturing the timing of
      transitions but not wasting resources while the state of the
      signalling system is stable.
      The receiver can control how
      frequently it sends the AccECN TCP Option and therefore it can
      control the overhead induced by AccECN.

   Resilience:  All information is provided based on counters.
      Therefore if ACKs are lost, the counters on the first ACK
      following the losses allow the Data Sender to immediately recover
      the number of the ECN markings that it missed.

   Resilience against Bias:  Because feedback is based on repetition of
      counters, random losses do not remove any information, they only
      delay it. Therefore, even though some ACKs are change-triggered,
      random losses will not alter the proportions of the different ECN
      markings in the feedback.

   Resilience vs Overhead:  If space is limited in some segments (e.g.
      because more options are needed on some segments, such as the SACK
      option after loss), the Data Receiver can send AccECN Options less
      frequently or truncate fields that have not changed, usually down
      to as little as 5 bytes. However, it has to send a full-sized
      AccECN Option at least three times per RTT, which the Data Sender
      can rely on as a regular beacon or checkpoint.

   Resilience vs Timeliness and Ordering:  Ordering information and the
      timing of transitions cannot be communicated in three cases: i)
      during ACK loss; ii) if something on the path strips the AccECN
      Option; or iii) if the Data Receiver is unable to support
      Change-Triggered ACKs.

   Complexity:  An AccECN implementation solely involves simple counter
      increments, some modulo arithmetic to communicate the least
      significant bits and allow for wrap, and some heuristics for
      safety against fields cycling due to prolonged periods of ACK
      loss. Each host needs to maintain eight additional counters.
      The
      hosts have to apply some additional tests to detect tampering by
      middleboxes, but in general the protocol is simple to understand,
      simple to implement and requires few cycles per packet to execute.

   Integrity:  AccECN is compatible with at least three approaches that
      can assure the integrity of ECN feedback. If the AccECN Option is
      stripped the resolution of the feedback is degraded, but the
      integrity of this degraded feedback can still be assured.

   Backward Compatibility:  If only one endpoint supports the AccECN
      scheme, it will fall back to the most advanced ECN feedback scheme
      supported by the other end.

   Backward Compatibility:  If the AccECN Option is stripped by a
      middlebox, AccECN still provides basic congestion feedback in the
      ACE field. Further, AccECN can be used to detect mangling of the
      IP ECN field; mangling of the TCP ECN flags; blocking of
      ECT-marked segments; and blocking of segments carrying the AccECN
      Option. It can detect these conditions during TCP's 3WHS so that
      it can fall back to operation without ECN and/or operation without
      the AccECN Option.

   Forward Compatibility:  The behaviour of endpoints and middleboxes is
      carefully defined for all reserved or currently unused codepoints
      in the scheme, to ensure that any blocking of anomalous values is
      always at least under reversible policy control.

6.  IANA Considerations

   This document reassigns bit 7 of the TCP header flags to the AccECN
   experiment. This bit was previously called the Nonce Sum (NS) flag
   [RFC3540], but RFC 3540 is being reclassified as historic
   [I-D.ietf-tsvwg-ecn-experimentation].
   The flag will now be defined
   as:

                 +-----+-------------------+-----------+
                 | Bit | Name              | Reference |
                 +-----+-------------------+-----------+
                 | 7   | AE (Accurate ECN) | RFC XXXX  |
                 +-----+-------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1 ]

   This document also defines a new TCP option for AccECN, assigned a
   value of TBD1 (decimal) from the TCP option space. This value is
   defined as:

          +------+--------+-----------------------+-----------+
          | Kind | Length | Meaning               | Reference |
          +------+--------+-----------------------+-----------+
          | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
          +------+--------+-----------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

   Early implementation before the IANA allocation MUST follow [RFC6994]
   and use experimental option 254 and magic number 0xACCE (16 bits),
   then migrate to the new option after the allocation.

7.  Security Considerations

   If ever the supplementary part of AccECN based on the new AccECN TCP
   Option is unusable (due for example to middlebox interference) the
   essential part of AccECN's congestion feedback offers only limited
   resilience to long runs of ACK loss (see Section 3.2.5). These
   problems are unlikely to be due to malicious intervention (because if
   an attacker could strip a TCP option or discard a long run of ACKs it
   could wreak other arbitrary havoc). However, it would be of concern
   if AccECN's resilience could be indirectly compromised during a
   flooding attack.
   AccECN is still considered safe though, because if
   the option is not present, the AccECN Data Sender is then required
   to switch to more conservative assumptions about wrap of congestion
   indication counters (see Section 3.2.5 and Appendix A.2).

   Section 4.1 describes how a TCP server can negotiate AccECN and use
   the SYN cookie method for mitigating SYN flooding attacks.

   There is concern that ECN markings could be altered or suppressed,
   particularly because a misbehaving Data Receiver could increase its
   own throughput at the expense of others. AccECN is compatible with
   the three schemes known to assure the integrity of ECN feedback (see
   Section 4.3 for details). If the AccECN Option is stripped by an
   incorrectly implemented middlebox, the resolution of the feedback
   will be degraded, but the integrity of this degraded information can
   still be assured.

   There is a potential concern that a receiver could deliberately omit
   the AccECN Option pretending that it had been stripped by a
   middlebox. No known way can yet be contrived to take advantage of
   this downgrade attack, but it is mentioned here in case someone else
   can contrive one.

   The AccECN protocol is not believed to introduce any new privacy
   concerns, because it merely counts and feeds back signals at the
   transport layer that had already been visible at the IP layer.

8.  Acknowledgements

   We want to thank Koen De Schepper, Praveen Balasubramanian, Michael
   Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf
   and Michael Tuexen for their input and discussion. The idea of using
   the three ECN-related TCP flags as one field for more accurate
   TCP-ECN feedback was first introduced in the re-ECN protocol that was
   the ancestor of ConEx.

   Bob Briscoe was part-funded by the European Community under its
   Seventh Framework Programme through the Reducing Internet Transport
   Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
   (ICT-317756). He was also part-funded by the Research Council of
   Norway through the TimeIn project. The views expressed here are
   solely those of the authors.

   Mirja Kuehlewind was partly supported by the European Commission
   under Horizon 2020 grant agreement no. 688421 Measurement and
   Architecture for a Middleboxed Internet (MAMI), and by the Swiss
   State Secretariat for Education, Research, and Innovation under
   contract no. 15.0268. This support does not imply endorsement.

9.  Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF TCP Maintenance and Minor Extensions (tcpm)
   working group mailing list, and/or to the authors.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, DOI 10.17487/RFC6994, August 2013.

10.2.  Informative References

   [I-D.ietf-tcpm-alternativebackoff-ecn]
              Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst,
              "TCP Alternative Backoff with ECN (ABE)",
              draft-ietf-tcpm-alternativebackoff-ecn-03 (work in
              progress), October 2017.

   [I-D.ietf-tcpm-generalized-ecn]
              Bagnulo, M. and B.
              Briscoe, "ECN++: Adding Explicit
              Congestion Notification (ECN) to TCP Control Packets",
              draft-ietf-tcpm-generalized-ecn-02 (work in progress),
              October 2017.

   [I-D.ietf-tsvwg-ecn-experimentation]
              Black, D., "Relaxing Restrictions on Explicit Congestion
              Notification (ECN) Experimentation",
              draft-ietf-tsvwg-ecn-experimentation-07 (work in
              progress), October 2017.

   [I-D.ietf-tsvwg-l4s-arch]
              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
              Low Loss, Scalable Throughput (L4S) Internet Service:
              Architecture", draft-ietf-tsvwg-l4s-arch-01 (work in
              progress), October 2017.

   [I-D.kuehlewind-tcpm-ecn-fallback]
              Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path
              Probing and Fallback",
              draft-kuehlewind-tcpm-ecn-fallback-01 (work in progress),
              September 2013.

   [I-D.moncaster-tcpm-rcv-cheat]
              Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to
              Allow Senders to Identify Receiver Non-Compliance",
              draft-moncaster-tcpm-rcv-cheat-03 (work in progress),
              July 2014.

   [Mandalari18]
              Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe.
              Alay, "Measuring ECN++: Good News for ++, Bad News for ECN
              over Mobile", IEEE Communications Magazine, March 2018
              (to appear).

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, DOI 10.17487/RFC3540, June 2003.

   [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
              Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010.

   [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

   [RFC7560]  Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
              "Problem Statement and Requirements for Increased Accuracy
              in Explicit Congestion Notification (ECN) Feedback",
              RFC 7560, DOI 10.17487/RFC7560, August 2015.

   [RFC7713]  Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts, Abstract Mechanism, and Requirements", RFC 7713,
              DOI 10.17487/RFC7713, December 2015.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017.

Appendix A.  Example Algorithms

   This appendix is informative, not normative. It gives example
   algorithms that would satisfy the normative requirements of the
   AccECN protocol. However, implementers are free to choose other ways
   to implement the requirements.

A.1.  Example Algorithm to Encode/Decode the AccECN Option

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE byte counter r.ceb into the ECEB field within the
   AccECN TCP Option, and how a Data Sender in AccECN mode could decode
   the ECEB field into its byte counter s.ceb. The other counters for
   bytes marked ECT(0) and ECT(1) in the AccECN Option would be
   similarly encoded and decoded.

   It is assumed that each local byte counter is an unsigned integer
   greater than 24b (probably 32b), and that the following constant has
   been assigned:

      DIVOPT = 2^24

   Every time a CE marked data segment arrives, the Data Receiver
   increments its local value of r.ceb by the size of the TCP Data.
   Whenever it sends an ACK with the AccECN Option, the value it writes
   into the ECEB field is

      ECEB = r.ceb % DIVOPT

   where '%' is the modulo operator.

   On the arrival of an AccECN Option, the Data Sender uses the TCP
   acknowledgement number and any SACK options to calculate newlyAckedB,
   the amount of new data that the ACK acknowledges in bytes. If
   newlyAckedB is negative it means that a more up to date ACK has
   already been processed, so this ACK has been superseded and the Data
   Sender has to ignore the AccECN Option. Then the Data Sender
   calculates the minimum difference d.ceb between the ECEB field and
   its local s.ceb counter, using modulo arithmetic as follows:

      if (newlyAckedB >= 0) {
          d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
          s.ceb += d.ceb
      }

   For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
   then

      s.ceb % DIVOPT = 1
               d.ceb = (1461 + 2^24 - 1) % 2^24
                     = 1460
               s.ceb = 33,554,433 + 1460
                     = 33,555,893

A.2.  Example Algorithm for Safety Against Long Sequences of ACK Loss

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE packet counter r.cep into the ACE field, and how
   the Data Sender in AccECN mode could decode the ACE field into its
   s.cep counter. The Data Sender's algorithm includes code to
   heuristically detect a long enough unbroken string of ACK losses that
   could have concealed a cycle of the congestion counter in the ACE
   field of the next ACK to arrive.
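
   As a runnable companion to the byte-counter pseudocode in
   Appendix A.1, the following Python sketch encodes and decodes the
   ECEB field. The function names are hypothetical, and returning the
   counter unchanged for a superseded ACK is one possible reading of
   "ignore the AccECN Option".

```python
DIVOPT = 2 ** 24  # the ECEB field carries the 24 least significant bits


def encode_eceb(r_ceb: int) -> int:
    """Data Receiver: value written into the ECEB field."""
    return r_ceb % DIVOPT


def decode_eceb(eceb: int, s_ceb: int, newly_acked_b: int):
    """Data Sender: return (updated s.ceb, d.ceb).

    A negative newly_acked_b marks a superseded ACK, so the option is
    ignored and the counter is returned unchanged (an assumption about
    how an implementation would express 'ignore').
    """
    if newly_acked_b < 0:
        return s_ceb, 0
    # Minimum non-negative difference between the field and the local
    # counter, allowing for the field having wrapped.
    d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
    return s_ceb + d_ceb, d_ceb
```

   With s.ceb = 33,554,433 and ECEB = 1461, decode_eceb() yields
   d.ceb = 1460 and s.ceb = 33,555,893, matching the worked example in
   Appendix A.1.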

   Two variants of the algorithm are given: i) a more conservative
   variant for a Data Sender to use if it detects that the AccECN Option
   is not available (see Section 3.2.5 and Section 3.2.7); and ii) a
   less conservative variant that is feasible when complementary
   information is available from the AccECN Option.

A.2.1.  Safety Algorithm without the AccECN Option

   It is assumed that each local packet counter is a sufficiently sized
   unsigned integer (probably 32b) and that the following constant has
   been assigned:

      DIVACE = 2^3

   Every time a CE marked packet arrives, the Data Receiver increments
   its local value of r.cep by 1. It repeats the same value of ACE in
   every subsequent ACK until the next CE marking arrives, where

      ACE = r.cep % DIVACE.

   If the Data Sender received an earlier value of the counter that had
   been delayed due to ACK reordering, it might incorrectly calculate
   that the ACE field had wrapped. Therefore, on the arrival of every
   ACK, the Data Sender uses the TCP acknowledgement number and any SACK
   options to calculate newlyAckedB, the amount of new data that the ACK
   acknowledges. If newlyAckedB is negative it means that a more up to
   date ACK has already been processed, so this ACK has been superseded
   and the Data Sender has to ignore the AccECN Option. If newlyAckedB
   is zero, to break the tie the Data Sender could use timestamps (if
   present) to work out newlyAckedT, the amount of new time that the ACK
   acknowledges. Then the Data Sender calculates the minimum difference
   d.cep between the ACE field and its local s.cep counter, using modulo
   arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
          d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

   Section 3.2.5 requires the Data Sender to assume that the ACE field
   did cycle if it could have cycled under prevailing conditions.
   The
   3-bit ACE field in an arriving ACK could have cycled and become
   ambiguous to the Data Sender if a row of ACKs goes missing that
   covers a stream of data long enough to contain 8 or more CE marks.
   We use the word 'missing' rather than 'lost', because some or all of
   the missing ACKs might arrive eventually, but out of order. Even if
   some of the lost ACKs are piggy-backed on data (i.e. not pure ACKs)
   retransmissions will not repair the lost AccECN information, because
   AccECN requires retransmissions to carry the latest AccECN counters,
   not the original ones.

   The phrase 'under prevailing conditions' allows the Data Sender to
   take account of the prevailing size of data segments and the
   prevailing CE marking rate just before the sequence of ACK losses.
   However, we shall start with the simplest algorithm, which assumes
   segments are all full-sized and ultra-conservatively it assumes that
   ECN marking was 100% on the forward path when ACKs on the reverse
   path started to all be dropped. Specifically, if newlyAckedB is the
   amount of data that an ACK acknowledges since the previous ACK, then
   the Data Sender could assume that this acknowledges newlyAckedPkt
   full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. Then it
   could assume that the ACE field incremented by

      dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE).

   For example, imagine an ACK acknowledges newlyAckedPkt=9 more
   full-size segments than any previous ACK, and that ACE increments by
   a minimum of 2 CE marks (d.cep=2). The above formula works out that
   it would still be safe to assume 2 CE marks (because
   9 - ((9-2) % 8) = 2). However, if ACE increases by a minimum of 2 but
   acknowledges 10 full-sized segments, then it would be necessary to
   assume that there could have been 10 CE marks (because
   10 - ((10-2) % 8) = 10).
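
   The d.cep and dSafer.cep calculations above can be sketched in
   Python as follows. The function names are hypothetical. The guard
   that never returns less than d.cep is an added assumption: it covers
   the case where fewer full-sized segments were acknowledged than the
   ACE field advanced, for which a language's handling of '%' on
   negative operands would otherwise matter.

```python
DIVACE = 2 ** 3  # the ACE field is a 3-bit counter


def encode_ace(r_cep: int) -> int:
    """Data Receiver: value repeated in the ACE field of each ACK."""
    return r_cep % DIVACE


def min_ace_delta(ace: int, s_cep: int) -> int:
    """Data Sender: minimum increase d.cep implied by the ACE field."""
    return (ace + DIVACE - (s_cep % DIVACE)) % DIVACE


def safer_ace_delta(newly_acked_pkt: int, d_cep: int) -> int:
    """Largest increase congruent to d.cep (mod 8) that fits within
    newly_acked_pkt full-sized segments, i.e. assuming 100% marking."""
    d_safer = newly_acked_pkt - ((newly_acked_pkt - d_cep) % DIVACE)
    # Assumption: never report fewer marks than the field itself implies.
    return max(d_cep, d_safer)
```

   The two worked cases above correspond to safer_ace_delta(9, 2),
   which gives 2, and safer_ace_delta(10, 2), which gives 10.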

   Implementers could build in more heuristics to estimate prevailing
   average segment size and prevailing ECN marking. For instance,
   newlyAckedPkt in the above formula could be replaced with
   newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
   segment size and p is the prevailing ECN marking probability.
   However, ultimately, if TCP's ECN feedback becomes inaccurate it
   still has loss detection to fall back on. Therefore, it would seem
   safe to implement a simple algorithm, rather than a perfect one.

   The simple algorithm for dSafer.cep above requires no monitoring of
   prevailing conditions and it would still be safe if, for example,
   segments were on average at least 5% of full-sized as long as ECN
   marking was 5% or less. Assuming it was used, the Data Sender would
   increment its packet counter as follows:

      s.cep += dSafer.cep

   If missing acknowledgement numbers arrive later (due to reordering),
   Section 3.2.5 says "the Data Sender MAY attempt to neutralise the
   effect of any action it took based on a conservative assumption that
   it later found to be incorrect". To do this, the Data Sender would
   have to store the values of all the relevant variables whenever it
   made assumptions, so that it could re-evaluate them later. Given
   this could become complex and it is not required, we do not attempt
   to provide an example of how to do this.

A.2.2.  Safety Algorithm with the AccECN Option

   When the AccECN Option is available on the ACKs before and after the
   possible sequence of ACK losses, if the Data Sender only needs
   CE-marked bytes, it will have sufficient information in the AccECN
   Option without needing to process the ACE field.
   However, if for
   some reason it needs CE-marked packets, if dSafer.cep is different
   from d.cep, it can calculate the average marked segment size that
   each implies to determine whether d.cep is likely to be a safe enough
   estimate. Specifically, it could use the following algorithm, where
   d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

      SAFETY_FACTOR = 2
      if (dSafer.cep > d.cep) {
          s = d.ceb/d.cep           // treated as infinity if d.cep == 0
          if (s <= MSS) {
              sSafer = d.ceb/dSafer.cep
              if (sSafer < MSS/SAFETY_FACTOR)
                  dSafer.cep = d.cep  // d.cep is a safe enough estimate
          } // else
              // No need for else; dSafer.cep is already correct,
              // because d.cep must have been too small
      }

   The chart below shows when the above algorithm will consider d.cep
   can replace dSafer.cep as a safe enough estimate of the number of
   CE-marked packets:

             ^
       sSafer|
             |
          MSS+
             |
             |         dSafer.cep
             |                  is
        MSS/2+--------------+    safest
             |              |
             | d.cep is safe|
             |    enough    |
             +-------------------->
                           MSS   s

   The following examples give the reasoning behind the algorithm,
   assuming MSS=1,460 bytes:

   o  if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
      sSafer=182.5.
      Therefore even though the average size of 8 data segments is
      unlikely to have been as small as MSS/8, d.cep cannot have been
      correct, because it would imply an average segment size greater
      than the MSS.

   o  if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
      sSafer=146.
      Therefore d.cep is safe enough, because the average size of 10
      data segments is unlikely to have been as small as MSS/10.

   o  if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
      sSafer=680.
      Therefore d.cep is safe enough, because the average data segment
      size is more likely to have been just less than one MSS, rather
      than below MSS/2.
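
   The three examples above can be checked with a short Python sketch
   of the same algorithm. The function name is hypothetical, and
   treating d.cep = 0 as an infinite implied segment size follows the
   first example rather than any normative text.

```python
MSS = 1460          # bytes, as assumed in the examples above
SAFETY_FACTOR = 2


def choose_cep_delta(d_cep: int, d_safer_cep: int, d_ceb: int,
                     mss: int = MSS) -> int:
    """Return the CE packet increment to use: d.cep if it is a safe
    enough estimate, otherwise the more conservative dSafer.cep."""
    if d_safer_cep > d_cep:
        # Average marked segment size implied by d.cep.
        s = d_ceb / d_cep if d_cep > 0 else float("inf")
        if s <= mss:
            # Average marked segment size implied by dSafer.cep.
            s_safer = d_ceb / d_safer_cep
            if s_safer < mss / SAFETY_FACTOR:
                return d_cep  # d.cep is a safe enough estimate
        # Otherwise d.cep must have been too small; keep dSafer.cep.
    return d_safer_cep
```

   For the three bullet points above, choose_cep_delta(0, 8, 1460)
   keeps dSafer.cep = 8, while choose_cep_delta(2, 10, 1460) and
   choose_cep_delta(7, 15, 10200) fall back to d.cep (2 and 7
   respectively).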

   If pure ACKs were allowed to be ECN-capable, missing ACKs would be
   far less likely. However, because [RFC3168] currently precludes
   this, the above algorithm assumes that pure ACKs are not ECN-capable.

A.3.  Example Algorithm to Estimate Marked Bytes from Marked Packets

   If the AccECN Option is not available, the Data Sender can only
   decode CE-marking from the ACE field in packets. Every time an ACK
   arrives, to convert this into an estimate of CE-marked bytes, it
   needs an average of the segment size, s_ave. Then it can add or
   subtract s_ave from the value of d.ceb as the value of d.cep
   increments or decrements.

   To calculate s_ave, it could keep a record of the byte numbers of all
   the boundaries between packets in flight (including control packets),
   and recalculate s_ave on every ACK. However it would be simpler to
   merely maintain a counter packets_in_flight for the number of packets
   in flight (including control packets), which it could update once per
   RTT. Either way, it would estimate s_ave as:

      s_ave ~= flightsize / packets_in_flight,

   where flightsize is the variable that TCP already maintains for the
   number of bytes in flight. To avoid floating point arithmetic, it
   could right-bit-shift by lg(packets_in_flight), where lg() means log
   base 2.

   An alternative would be to maintain an exponentially weighted moving
   average (EWMA) of the segment size:

      s_ave = a * s + (1-a) * s_ave,

   where a is the decay constant for the EWMA. However, then it is
   necessary to choose a good value for this constant, which ought to
   depend on the number of packets in flight. Also the decay constant
   needs to be a power of two to avoid floating point arithmetic.

A.4.  Example Algorithm to Beacon AccECN Options

   Section 3.2.8 requires a Data Receiver to beacon a full-length AccECN
   Option at least 3 times per RTT.
   This could be implemented by
   maintaining a variable to store the number of ACKs (pure and data
   ACKs) since a full AccECN Option was last sent and another for the
   approximate number of ACKs sent in the last round trip time:

      if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
          send_full_AccECN_Option()

   For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
   rather than 3, so that the division could be implemented as an
   integer right bit-shift by lg(BEACON_FREQ).

   In certain operating systems, it might be too complex to maintain
   acks_in_round. In others it might be possible by tagging each data
   segment in the retransmit buffer with the number of ACKs sent at the
   point that segment was sent. This would not work well if the Data
   Receiver was not sending data itself, in which case it might be
   necessary to beacon based on time instead, as follows:

      if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
          send_full_AccECN_Option()

   This time-based approach does not work well when all the ACKs are
   sent early in each round trip, as is the case during slow-start. In
   this case few options will be sent (possibly even fewer than 3 per
   RTT). However, when continuously sending data, data packets as well
   as ACKs will spread out equally over the RTT and sufficient ACKs with
   the AccECN option will be sent.

A.5.  Example Algorithm to Count Not-ECT Bytes

   A Data Sender in AccECN mode can infer the amount of TCP payload data
   arriving at the receiver marked Not-ECT from the difference between
   the amount of newly ACKed data and the sum of the bytes with the
   other three markings, d.ceb, d.e0b and d.e1b. Note that, because
   r.e0b is initialized to 1 and the other two counters are initialized
   to 0, the initial sum will be 1, which matches the initial offset of
   the TCP sequence number on completion of the 3WHS.
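
   The subtraction described above can be sketched in Python as
   follows. The function name is hypothetical, and clamping a negative
   result to zero is an added assumption rather than something this
   document specifies.

```python
def not_ect_bytes(newly_acked_b: int, d_ceb: int,
                  d_e0b: int, d_e1b: int) -> int:
    """Estimate the newly acknowledged payload bytes that arrived
    marked Not-ECT, given the increments of the three byte counters."""
    marked = d_ceb + d_e0b + d_e1b
    # Assumption: clamp at zero so that feedback covering more marked
    # bytes than this ACK newly acknowledges cannot go negative.
    return max(0, newly_acked_b - marked)
```

   For instance, if an ACK newly acknowledges 4,380 bytes and the three
   counters advanced by 1,460, 1,460 and 0 bytes, the remaining 1,460
   bytes are inferred to have arrived as Not-ECT.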

   For this approach to be precise, it has to be assumed that spurious
   (unnecessary) retransmissions do not lead to double counting. This
   assumption is currently correct, given that RFC 3168 requires that
   the Data Sender marks retransmitted segments as Not-ECT. However,
   the converse is not true; necessary retransmissions will result in
   under-counting.

   However, such precision is unlikely to be necessary. The only known
   use of a count of Not-ECT marked bytes is to test whether equipment
   on the path is clearing the ECN field (perhaps due to an out-dated
   attempt to clear, or bleach, what used to be the ToS field). To
   detect bleaching it will be sufficient to detect whether nearly all
   bytes arrive marked as Not-ECT. Therefore there should be no need to
   keep track of the details of retransmissions.

Authors' Addresses

   Bob Briscoe
   CableLabs
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/


   Mirja Kuehlewind
   ETH Zurich
   Zurich
   Switzerland

   EMail: mirja.kuehlewind@tik.ee.ethz.ch


   Richard Scheffenegger
   Vienna
   Austria

   EMail: rscheff@gmx.at