idnits 2.17.1

draft-ietf-tcpm-accurate-ecn-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The exact meaning of the all-uppercase expression 'MAY NOT' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  == The expression 'MAY NOT', while looking like RFC 2119 requirements text,
     is not defined in RFC 2119, and should not be used.  Consider using 'MUST
     NOT' instead (if that is what you mean).

     Found 'MAY NOT' in this paragraph:

     A host MAY NOT include an AccECN Option in any of these three cases if
     it has cached knowledge that the packet would be likely to be blocked on
     the path to the other host if it included an AccECN Option.

  -- The document date (October 30, 2017) is 2363 days in the past.  Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Missing Reference: 'B' is mentioned on line 1814, but not defined

  == Outdated reference: A later version (-12) exists of
     draft-ietf-tcpm-alternativebackoff-ecn-02

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tcpm-generalized-ecn-01

  == Outdated reference: A later version (-08) exists of
     draft-ietf-tsvwg-ecn-experimentation-07

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-00

  -- Obsolete informational reference (is this intentional?): RFC 6824
     (Obsoleted by RFC 8684)

     Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                                 CableLabs
Intended status: Experimental                              M. Kuehlewind
Expires: May 3, 2018                                          ETH Zurich
                                                        R. Scheffenegger
                                                        October 30, 2017

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-04

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points.  Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN is specified for TCP in such a way that only one feedback signal
   can be transmitted per Round-Trip Time (RTT).  Recently, new TCP
   mechanisms like Congestion Exposure (ConEx) or Data Center TCP
   (DCTCP) need more accurate ECN feedback information whenever more
   than one marking is received in one RTT.  This document specifies an
   experimental scheme to provide more than one feedback signal per RTT
   in the TCP header.
   Given TCP header space is scarce, it overloads
   the three existing ECN-related flags in the TCP header and provides
   additional information in a new TCP option.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Retransmission of the SYN
     3.2.  AccECN Feedback
       3.2.1.  Initialization of Feedback Counters at the Data Sender
       3.2.2.  The ACE Field
       3.2.3.  Testing for Zeroing of the ACE Field
       3.2.4.  Testing for Mangling of the IP/ECN Field
       3.2.5.  Safety against Ambiguity of the ACE Field
       3.2.6.  The AccECN Option
       3.2.7.  Path Traversal of the AccECN Option
       3.2.8.  Usage of the AccECN TCP Option
     3.3.  AccECN Compliance by TCP Proxies, Offload Engines and other
           Middleboxes
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other TCP Options and Experiments
     4.3.  Compatibility with Feedback Integrity Mechanisms
   5.  Protocol Properties
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of ACK
           Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Authors' Addresses

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points.  Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender.  ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need
   more accurate ECN feedback information whenever more than one marking
   is received in one RTT.  A fuller treatment of the motivation for
   this specification is given in the associated requirements document
   [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT.
   It will be called the more accurate ECN feedback scheme, or AccECN
   for short.  If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic ECN
   feedback, not a fork in the design of TCP.  Thus, the applicability
   of AccECN is intended to include all public and private IP networks
   (and even any non-IP networks over which TCP is used today).  Until
   the AccECN experiment succeeds, [RFC3168] will remain as the
   standards track specification for adding ECN to TCP.  To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN feedback overloads flags and fields in the main TCP header
   with new definitions, so both ends have to support the new wire
   protocol before it can be used.  Therefore during the TCP handshake
   the two ends use the three ECN-related flags in the TCP header to
   negotiate the most advanced feedback protocol that they can both
   support.

   AccECN is solely an (experimental) change to the TCP wire protocol;
   it only specifies the negotiation and signaling of more accurate ECN
   feedback from a TCP Data Receiver to a Data Sender.  It is completely
   independent of how TCP might respond to congestion feedback, which is
   out of scope.  For that we refer to [RFC3168] or any RFC that
   specifies a different response to TCP ECN feedback, for example:
   [RFC8257]; or the ECN experiments referred to in
   [I-D.ietf-tsvwg-ecn-experimentation], namely: a TCP-based Low Latency
   Low Loss Scalable (L4S) congestion control [I-D.ietf-tsvwg-l4s-arch];
   ECN-capable TCP control packets [I-D.ietf-tcpm-generalized-ecn], or
   Alternative Backoff with ECN (ABE)
   [I-D.ietf-tcpm-alternativebackoff-ecn].

   It is likely (but not required) that the AccECN protocol will be
   implemented along with the following experimental additions to the
   TCP-ECN protocol: ECN-capable TCP control packets and retransmissions
   [I-D.ietf-tcpm-generalized-ecn], which includes the ECN-capable SYN/
   ACK experiment [RFC5562]; and testing receiver non-compliance
   [I-D.moncaster-tcpm-rcv-cheat].

1.1.  Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like.  Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol.  Then
   Section 3 gives the normative protocol specification.  Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not.  Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2.  Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely.  Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3.  Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested.  The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol.  The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if it is
   deployed and if it satisfies the requirements of [RFC7560] in the
   consensus opinion of the IETF tcpm working group.  In short, this
   requires that it improves the accuracy and timeliness of TCP's ECN
   feedback, as claimed in Section 5, while striking a balance between
   the conflicting requirements of resilience, integrity and
   minimisation of overhead.  It also requires that it is not unduly
   complex, and that it is compatible with prevalent equipment
   behaviours in the current Internet (e.g. hardware offloading and
   middleboxes), whether or not they comply with standards.

   Testing will mostly focus on fall-back strategies in case of
   middlebox interference.  Current recommended strategies are specified
   in Sections 3.1.2, 3.2.3, 3.2.4 and 3.2.7.  The effectiveness of
   these strategies depends on the actual deployment situation of
   middleboxes.  Therefore experimental verification to confirm large-
   scale path traversal in the Internet is needed before finalizing this
   specification on the Standards Track.

1.4.  Terminology

   AccECN:  The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN:  the ECN protocol specified in [RFC3168].

   Classic ECN feedback:  the feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK:  A TCP acknowledgement, with or without a data payload.

   Pure ACK:  A TCP acknowledgement without a data payload.

   TCP client:  The TCP stack that originates a connection.

   TCP server:  The TCP stack that responds to a connection request.

   Data Receiver:  The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender:  The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.5.  Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header.  Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT).  If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT).  When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint.  If the codepoint is Not-
   ECT, only drop is appropriate.  If the codepoint is ECT(0) or ECT(1),
   the node can mark the packet by setting both ECN bits, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +-----------------------+---------------+---------------------------+
   | IP-ECN codepoint      | Codepoint     | Description               |
   | (binary)              | name          |                           |
   +-----------------------+---------------+---------------------------+
   | 00                    | Not-ECT       | Not ECN-Capable Transport |
   | 01                    | ECT(1)        | ECN-Capable Transport (1) |
   | 10                    | ECT(0)        | ECN-Capable Transport (0) |
   | 11                    | CE            | Congestion Experienced    |
   +-----------------------+---------------+---------------------------+

                 Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]).  A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK.  On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost.  The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag.  This always leads to a full RTT
   of ACKs with ECE set.  Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540].  RFC 3540 was never deployed so
   it is being reclassified as historic, making this TCP flag available
   for use by the AccECN experiment instead.
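
   The classic ECN feedback loop recapped in this section can be
   illustrated with a minimal sketch.  It is not taken from [RFC3168];
   the names IpEcn and ClassicEcnReceiver are hypothetical, and the
   model ignores everything except the ECE latch at the Data Receiver.
   It is intended only to show why several CE marks within one RTT
   collapse into a single feedback signal.

```python
from enum import IntEnum


class IpEcn(IntEnum):
    """The four IP-ECN codepoints of Table 1 (binary values)."""
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


class ClassicEcnReceiver:
    """Hypothetical model of the classic ECN Data Receiver's ECE latch."""

    def __init__(self):
        self.echo = False  # True while ECE is being repeated on ACKs

    def on_segment(self, ip_ecn, cwr=False):
        """Process one arriving segment; return the ECE flag for its ACK."""
        if cwr:
            # The sender has confirmed it saw ECE, so stop echoing.
            self.echo = False
        if ip_ecn == IpEcn.CE:
            # Any CE mark (re)starts the echo; the number of marks is lost.
            self.echo = True
        return self.echo
```

   Under these assumptions, feeding the receiver one CE mark or three
   CE marks before the CWR flag arrives yields the same stream of ECE=1
   ACKs, which is the limitation that AccECN is designed to remove.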

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2.  AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP half-
   connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets.  This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks).  This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two-part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN.
   This design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1.  Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it.  The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK.  The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports.  Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option.  The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2.  Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection.  Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively.  The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 2 later).  The LSBs
   of each of the three byte counters are carried in the AccECN Option.

2.3.  Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters.  There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used.  Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently.  Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one.  The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle.  However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.
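
   The wrap behaviour of the 3-bit ACE field described above can be
   made concrete with a short sketch.  This is not the safety algorithm
   of Appendix A.2; the function names are hypothetical and the
   counters start at zero purely for illustration.  The sender simply
   takes the smallest non-negative increment consistent with the
   received field, which is only correct if fewer than eight CE-marked
   packets arrived since the last ACK it saw.

```python
ACE_BITS = 3
ACE_MASK = (1 << ACE_BITS) - 1  # 0b111


def ace_field(r_cep):
    """Receiver side: the three LSBs of the CE packet counter."""
    return r_cep & ACE_MASK


def update_sender_cep(s_cep, ace):
    """Sender side: advance s_cep by the smallest increment (0..7)
    that makes its three LSBs equal to the received ACE field."""
    delta = (ace - s_cep) & ACE_MASK
    return s_cep + delta


# Illustrative walk-through (hypothetical values).
r_cep, s_cep = 0, 0
r_cep += 3                                   # three CE marks arrive
s_cep = update_sender_cep(s_cep, ace_field(r_cep))
r_cep += 6                                   # six more; the field wraps once
s_cep = update_sender_cep(s_cep, ace_field(r_cep))
in_sync_after_small_steps = (s_cep == r_cep)

r_cep += 9                                   # nine marks with no ACK seen
s_cep = update_sender_cep(s_cep, ace_field(r_cep))
undercount = r_cep - s_cep                   # one whole cycle of 8 is missed
```

   In the last step the sender sees the field advance by one and cannot
   tell that a whole cycle of eight was skipped, which is why the
   protocol needs rules to interpret the field conservatively when ACKs
   may have been lost.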

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets.  Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses.  Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field).  Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear
   [I-D.ietf-tsvwg-ecn-experimentation].  Because the 3-bit ACE field is
   so small, when it is the only field available the Data Sender has to
   interpret it conservatively assuming the worst possible wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK.  The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4.  Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks.  The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx.
   If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5.  Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets.  According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT.  However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard.  Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and
   [I-D.moncaster-tcpm-rcv-cheat]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on whether it is CE marked.  Although RFC 3168
   prohibits an ECN-capable SYN, providing feedback of CE marking on the
   SYN supports future scenarios in which SYNs might be ECN-enabled
   (without prejudging whether they ought to be).  For instance,
   [I-D.ietf-tsvwg-ecn-experimentation] updates this aspect of RFC 3168
   to allow experimentation with ECN-capable TCP control packets.

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   not-ECT in compliance with RFC 3168, feedback on the state of the ECN
   field when it arrives at the receiver could still be useful, because
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field [Mandalari18].  If
   a TCP client has set the SYN to Not-ECT, but receives CE feedback, it
   can detect such middlebox interference and send Not-ECT for the rest
   of the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]).  Today,
   if a TCP server receives ECT or CE on a SYN, it cannot know whether
   it is invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT).  Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection.  Instead, the AccECN protocol allows the server
   to feed back the received ECN field to the client, which then has all
   the information to decide whether the connection has to fall-back
   from supporting ECN (or not).

3.  AccECN Protocol Specification

3.1.  Negotiating to use AccECN

3.1.1.  Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] is being reclassified as historic, the
   present specification renames the TCP flag at bit 7 of the TCP header
   flags from NS (Nonce Sum) to AE (Accurate ECN) (see IANA
   Considerations in Section 6).

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode.  Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN.
   The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN.  This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.2.

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN.  Table 2 tabulates all the negotiation
   possibilities for ECN-related capabilities that involve at least one
   AccECN-capable host.  The entries in the first two columns have been
   abbreviated, as follows:

   AccECN:  More Accurate ECN Feedback (the present specification)

   Nonce:  ECN Nonce feedback [RFC3540]

   ECN:  'Classic' ECN feedback [RFC3168]

   No ECN:  Not-ECN-capable.  Implicit congestion notification using
      packet drop.

   +--------+--------+------------+-------------+----------------------+
   | A      | B      | SYN A->B   | SYN/ACK     | Feedback Mode        |
   |        |        |            | B->A        |                      |
   +--------+--------+------------+-------------+----------------------+
   |        |        | AE CWR ECE | AE CWR ECE  |                      |
   | AccECN | AccECN | 1   1   1  | 0   1   0   | AccECN (Not-ECT on   |
   |        |        |            |             | SYN)                 |
   | AccECN | AccECN | 1   1   1  | 0   1   1   | AccECN (ECT1 on SYN) |
   | AccECN | AccECN | 1   1   1  | 1   0   0   | AccECN (ECT0 on SYN) |
   | AccECN | AccECN | 1   1   1  | 1   1   0   | AccECN (CE on SYN)   |
   |        |        |            |             |                      |
   | AccECN | Nonce  | 1   1   1  | 1   0   1   | classic ECN          |
   | AccECN | ECN    | 1   1   1  | 0   0   1   | classic ECN          |
   | AccECN | No ECN | 1   1   1  | 0   0   0   | Not ECN              |
   |        |        |            |             |                      |
   | Nonce  | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
   | ECN    | AccECN | 0   1   1  | 0   0   1   | classic ECN          |
   | No ECN | AccECN | 0   0   0  | 0   0   0   | Not ECN              |
   |        |        |            |             |                      |
   | AccECN | Broken | 1   1   1  | 1   1   1   | Not ECN              |
   +--------+--------+------------+-------------+----------------------+

   Table 2: ECN capability negotiation between Client (A) and Server (B)

   Table 2 is divided into blocks each separated by an empty row.

   1.  The top block shows the case already described where both
       endpoints support AccECN and how the TCP server (B) indicates
       congestion feedback.

   2.  The second block shows the cases where the TCP client (A)
       supports AccECN but the TCP server (B) supports some earlier
       variant of TCP feedback, indicated in its SYN/ACK.  Therefore, as
       soon as an AccECN-capable TCP client (A) receives the SYN/ACK
       shown it MUST set both its half connections into the feedback
       mode shown in the rightmost column.

   3.  The third block shows the cases where the TCP server (B) supports
       AccECN but the TCP client (A) supports some earlier variant of
       TCP feedback, indicated in its SYN.
Therefore, as soon as an 603 AccECN-enabled TCP server (B) receives the SYN shown, it MUST set 604 both its half connections into the feedback mode shown in the 605 rightmost column. 607 4. The fourth block displays a combination labelled 'Broken'. Some 608 older TCP server implementations incorrectly set the reserved 609 flags in the SYN/ACK by reflecting those in the SYN. Such broken 610 TCP servers (B) cannot support ECN, so as soon as an AccECN- 611 capable TCP client (A) receives such a broken SYN/ACK it MUST 612 fall back to Not ECN mode for both its half connections. 614 The following exceptional cases need some explanation: 616 ECN Nonce: An AccECN implementation, whether client or server, 617 sender or receiver, does not need to implement the ECN Nonce 618 feedback mode [RFC3540], which is being reclassified as historic 619 [I-D.ietf-tsvwg-ecn-experimentation]. AccECN is compatible with 620 an alternative ECN feedback integrity approach that does not use 621 up the ECT(1) codepoint and can be implemented solely at the 622 sender (see Section 4.3). 624 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 625 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 626 Host A MUST then enter the same feedback mode as it would have 627 entered had it been a responding host and received the same SYN. 628 Then host A MUST send the same SYN/ACK as it would have sent had 629 it been a responding host. 631 3.1.2. Retransmission of the SYN 633 If the sender of an AccECN SYN times out before receiving the SYN/ 634 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 635 least one more time by continuing to set all three TCP ECN flags on 636 the first retransmitted SYN (using the usual retransmission time- 637 outs). If this first retransmission also fails to be acknowledged, 638 the sender SHOULD send subsequent retransmissions of the SYN without 639 any TCP-ECN flags set.
This adds delay, in the case where a 640 middlebox drops an AccECN (or ECN) SYN deliberately. However, 641 current measurements imply that a drop is less likely to be due to 642 middlebox interference than other intermittent causes of loss, e.g. 643 congestion, wireless interference, etc. 645 Implementers MAY use other fall-back strategies if they are found to 646 be more effective (e.g. attempting to negotiate AccECN on the SYN 647 only once or more than twice (most appropriate during high levels of 648 congestion); or falling back to classic ECN feedback rather than non- 649 ECN). Further it may make sense to also remove any other 650 experimental fields or options on the SYN in case a middlebox might 651 be blocking them, although the required behaviour will depend on the 652 specification of the other option(s) and any attempt to co-ordinate 653 fall-back between different modules of the stack. In any case, the 654 TCP initiator SHOULD cache failed connection attempts. If it does, 655 it SHOULD NOT give up attempting to negotiate AccECN on the SYN of 656 subsequent connection attempts until it is clear that the blockage is 657 persistently and specifically due to AccECN. The cache should be 658 arranged to expire so that the initiator will infrequently attempt to 659 check whether the problem has been resolved. 661 The fall-back procedure if the TCP server receives no ACK to 662 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 663 Section 3.2.7. 665 3.2. AccECN Feedback 667 Each Data Receiver of each half connection maintains four counters, 668 r.cep, r.ceb, r.e0b and r.e1b. The CE packet counter (r.cep), counts 669 the number of packets the host receives with the CE code point in the 670 IP ECN field, including CE marks on control packets without data. 671 r.ceb, r.e0b and r.e1b count the number of TCP payload bytes in 672 packets marked respectively with the CE, ECT(0) and ECT(1) codepoint 673 in their IP-ECN field. 
When a host first enters AccECN mode, it 674 initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = r.e1b = 675 0 (see Appendix A.5). Non-zero initial values are used to support a 676 stateless handshake (see Section 4.1) and to be distinct from cases 677 where the fields are incorrectly zeroed (e.g. by middleboxes - see 678 Section 3.2.7.4). 680 A host feeds back the CE packet counter using the Accurate ECN (ACE) 681 field, as explained in the next section. And it feeds back all the 682 byte counters using the AccECN TCP Option, as specified in 683 Section 3.2.6. Whenever a host feeds back the value of any counter, 684 it MUST report the most recent value, no matter whether it is in a 685 pure ACK, an ACK with new payload data or a retransmission. 686 Therefore the feedback carried on a retransmitted packet is unlikely 687 to be the same as the feedback on the original packet. 689 3.2.1. Initialization of Feedback Counters at the Data Sender 691 Each Data Sender of each half connection maintains four counters, 692 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 693 counters at the Data Receiver. When a host enters AccECN mode, it 694 initializes them to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 0. 696 If a TCP client (A) in AccECN mode receives a SYN/ACK with CE 697 feedback, i.e. AE=1, CWR=1, ECE=0, it increments s.cep to 6. 698 Otherwise, for any of the 3 other combinations of the 3 ECN TCP flags 699 (the top 3 rows in Table 2), s.cep remains initialized to 5. 701 3.2.2. The ACE Field 703 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 704 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 705 as one 3-bit field. Then the field is given a new name, ACE, as 706 shown in Figure 2.
708   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
709 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
710 |               |           |           | U | A | P | R | S | F |
711 | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
712 |               |           |           | G | K | H | T | N | N |
713 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
715 Figure 2: Definition of the ACE field within bytes 13 and 14 of the 716 TCP Header (when AccECN has been negotiated and SYN=0).
718 The original definition of these three flags in the TCP header, 719 including the addition of support for the ECN Nonce, is shown for 720 comparison in Figure 1. This specification does not rename these 721 three TCP flags to ACE unconditionally; it merely overloads them with 722 another name and definition once an AccECN connection has been 723 established. 725 A host MUST interpret the AE, CWR and ECE flags as the 3-bit ACE 726 counter on a segment with the SYN flag cleared (SYN=0) that it sends 727 or receives if both of its half-connections are set into AccECN mode 728 having successfully negotiated AccECN (see Section 3.1). A host MUST 729 NOT interpret the 3 flags as a 3-bit ACE field on any segment with 730 SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete 731 or has not succeeded. 733 Both parts of each of these conditions are equally important. For 734 instance, even if AccECN negotiation has been successful, the ACE 735 field is not defined on any segments with SYN=1 (e.g. a 736 retransmission of an unacknowledged SYN/ACK, or when both ends send 737 SYN/ACKs after AccECN support has been successfully negotiated during 738 a simultaneous open). 740 With only one exception, on any packet with the SYN flag cleared 741 (SYN=0), the Data Receiver MUST encode the three least significant 742 bits of its r.cep counter into the ACE field it feeds back to the 743 Data Sender.
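As an informal illustration of the encoding rule above, the following Python sketch shows how a Data Receiver might derive the ACE field from r.cep, and how a Data Sender might compute the smallest increase in CE-marked packets consistent with a newly arrived ACE value. The function names are hypothetical and are not part of this specification. The sketch assumes the counter has not cycled between ACKs; Section 3.2.5 covers the case where it might have.

```python
ACE_MASK = 0b111      # the ACE field is 3 bits wide
CEP_INITIAL = 5       # initial value of r.cep and s.cep (Section 3.2)


def ace_from_cep(r_cep):
    """Data Receiver: the three least significant bits of r.cep."""
    return r_cep & ACE_MASK


def min_cep_increase(ace, s_cep):
    """Data Sender: smallest number of new CE marks (modulo 8) that
    explains the arriving ACE value, assuming the field did not wrap."""
    return (ace - s_cep) & ACE_MASK


# Two CE-marked packets arrive after the receiver entered AccECN mode.
r_cep = CEP_INITIAL + 2
print(ace_from_cep(r_cep), min_cep_increase(ace_from_cep(r_cep), CEP_INITIAL))
```

Because only three bits are fed back, a value such as r.cep = 13 yields the same ACE field as r.cep = 5, which is why the safety rules of Section 3.2.5 are needed.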
745 There is only one exception to this rule: On the final ACK of the 746 3WHS, a TCP client (A) in AccECN mode MUST use the ACE field to feed 747 back which of the 4 possible values of the IP-ECN field were on the 748 SYN/ACK (the binary encoding is the same as that used on the SYN/ 749 ACK). Table 3 shows the meaning of each possible value of the ACE 750 field on the ACK of the SYN/ACK and the value that an AccECN server 751 MUST set s.cep to as a result.
753 +--------------+---------------------------+------------------------+
754 | ACE on ACK   | IP-ECN codepoint on       | Initial s.cep of       |
755 | of SYN/ACK   | SYN/ACK inferred by       | server in AccECN mode  |
756 |              | server                    |                        |
757 +--------------+---------------------------+------------------------+
758 | 0b000        | {Notes 1, 2}              | Disable ECN            |
759 | 0b001        | {Notes 2, 3}              | 5                      |
760 | 0b010        | Not-ECT                   | 5                      |
761 | 0b011        | ECT(1)                    | 5                      |
762 | 0b100        | ECT(0)                    | 5                      |
763 | 0b101        | Currently Unused {Note 3} | 5                      |
764 | 0b110        | CE                        | 6                      |
765 | 0b111        | Currently Unused {Note 3} | 5                      |
766 +--------------+---------------------------+------------------------+
768 Table 3: Meaning of the ACE field on the ACK of the SYN/ACK
770 {Note 1}: If the server is in AccECN mode, the value of zero raises 771 suspicion of zeroing of the ACE field on the path (see 772 Section 3.2.3). 774 {Note 2}: If a server is in AccECN mode, there ought to be no valid 775 case where the ACE field on the last ACK of the 3WHS has a value of 776 0b000 or 0b001. 778 However, in the case where a server that implements AccECN is also 779 using a stateless handshake (termed a SYN cookie) it will not 780 remember whether it entered AccECN mode. Then these two values 781 remind it that it did not enter AccECN mode (see Section 4.1 for 782 details). 784 {Note 3}: If the server is in AccECN mode, these values are Currently 785 Unused but the AccECN server's behaviour is still defined for forward 786 compatibility. 788 3.2.3.
Testing for Zeroing of the ACE Field 790 Section 3.2 required the Data Receiver to initialize the r.cep 791 counter to a non-zero value. Therefore, in either direction the 792 initial value of the ACE field ought to be non-zero. 794 If AccECN has been successfully negotiated, the Data Sender SHOULD 795 check the initial value of the ACE field in the first arriving 796 segment with SYN=0. If the initial value of the ACE field is zero 797 (0b000), the Data Sender MUST disable sending ECN-capable packets for 798 the remainder of the half-connection by setting the IP/ECN field in 799 all subsequent packets to Not-ECT. 801 For example, the server checks the ACK of the SYN/ACK or the first 802 data segment from the client, while the client checks the first data 803 segment from the server. More precisely, the "first segment with 804 SYN=0" is defined as: the segment with SYN=0 that i) acknowledges 805 sequence space at least covering the initial sequence number (ISN) 806 plus 1; and ii) arrives before any other segments with SYN=0 so it is 807 unlikely to be a retransmission. If no such segment arrives (e.g. 808 because it is lost and the ISN is first acknowledged by a subsequent 809 segment), no test for invalid initialization can be conducted, and 810 the half-connection will continue in AccECN mode. 812 Note that the Data Sender MUST NOT test whether the arriving counter 813 in the initial ACE field has been initialized to a specific valid 814 value - the above check solely tests whether the ACE fields have been 815 incorrectly zeroed. This allows hosts to use different initial 816 values as an additional signalling channel in future. 818 3.2.4. Testing for Mangling of the IP/ECN Field 820 The value of the ACE field on the SYN/ACK indicates the value of the 821 IP/ECN field when the SYN arrived at the server. The client can 822 compare this with how it originally set the IP/ECN field on the SYN.
823 If this comparison implies an unsafe transition of the IP/ECN field, 824 for the remainder of the connection the client MUST NOT send ECN- 825 capable packets, but it MUST continue to feed back any ECN markings 826 on arriving packets. 828 The value of the ACE field on the last ACK of the 3WHS indicates the 829 value of the IP/ECN field when the SYN/ACK arrived at the client. 830 The server can compare this with how it originally set the IP/ECN 831 field on the SYN/ACK. If this comparison implies an unsafe 832 transition of the IP/ECN field, for the remainder of the connection 833 the server MUST NOT send ECN-capable packets, but it MUST continue to 834 feed back any ECN markings on arriving packets. 836 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 837 repeated here for convenience: 839 o the not-ECT codepoint changes; 841 o either ECT codepoint transitions to not-ECT; 843 o the CE codepoint changes. 845 RFC 3168 says that a router that changes ECT to not-ECT is invalid 846 but safe. However, from a host's viewpoint, this transition is 847 unsafe because it could be the result of two transitions at different 848 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 849 This scenario could well happen where an ECN-enabled home router 850 congests its upstream mobile broadband bottleneck link, then the 851 ingress to the mobile network clears the ECN field [Mandalari18]. 853 The above fall-back behaviours are necessary in case mangling of the 854 IP/ECN field is asymmetric, which is currently common over some 855 mobile networks [Mandalari18]. Then one end might see no unsafe 856 transition and continue sending ECN-capable packets, while the other 857 end sees an unsafe transition and stops sending ECN-capable packets. 859 3.2.5.
Safety against Ambiguity of the ACE Field 861 If too many CE-marked segments are acknowledged at once, or if a long 862 run of ACKs is lost, the 3-bit counter in the ACE field might have 863 cycled between two ACKs arriving at the Data Sender. 865 Therefore an AccECN Data Receiver SHOULD immediately send an ACK once 866 'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be 867 2 and MUST be no greater than 6. 869 If the Data Sender has not received AccECN TCP Options to give it 870 more dependable information, and it detects that the ACE field could 871 have cycled under the prevailing conditions, it SHOULD conservatively 872 assume that the counter did cycle. It can detect if the counter 873 could have cycled by using the jump in the acknowledgement number 874 since the last ACK to calculate or estimate how many segments could 875 have been acknowledged. An example algorithm to implement this 876 policy is given in Appendix A.2. An implementer MAY develop an 877 alternative algorithm as long as it satisfies these requirements. 879 If missing acknowledgement numbers arrive later (reordering) and 880 prove that the counter did not cycle, the Data Sender MAY attempt to 881 neutralise the effect of any action it took based on a conservative 882 assumption that it later found to be incorrect. 884 3.2.6. The AccECN Option 886 The AccECN Option is defined as shown below in Figure 3. It consists 887 of three 24-bit fields that provide the 24 least significant bits of 888 the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E' 889 of each field name stands for 'Echo'. 
891  0                   1                   2                   3
892  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
893 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
894 |  Kind = TBD1  |  Length = 11  |          EE0B field           |
895 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
896 | EE0B (cont'd) |                  ECEB field                   |
897 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
898                 |                  EE1B field                   |
899                 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
901 Figure 3: The AccECN Option
903 The Data Receiver MUST set the Kind field to TBD1, which is 904 registered in Section 6 as a new TCP option Kind called AccECN. An 905 experimental TCP option with Kind=254 MAY be used for initial 906 experiments, with magic number 0xACCE. 908 Appendix A.1 gives an example algorithm for the Data Receiver to 909 encode its byte counters into the AccECN Option, and for the Data 910 Sender to decode the AccECN Option fields into its byte counters. 912 Note that there is no field to feed back Not-ECT bytes. Nonetheless 913 an algorithm for the Data Sender to calculate the number of payload 914 bytes received as Not-ECT is given in Appendix A.5. 916 Whenever a Data Receiver sends an AccECN Option, the rules in 917 Section 3.2.8 expect it to always send a full-length option. To cope 918 with option space limitations, it can omit unchanged fields from the 919 tail of the option, as long as it preserves the order of the 920 remaining fields and includes any field that has changed. The length 921 field MUST indicate which fields are present as follows: 923 Length=11: EE0B, ECEB, EE1B 925 Length=8: EE0B, ECEB 927 Length=5: EE0B 929 Length=2: (empty) 931 The empty option of Length=2 is provided to allow for a case where an 932 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 933 but there is very limited space for the option. For initial 934 experiments, the Length field MUST be 2 greater to accommodate the 935 16-bit magic number.
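The relationship between the Length field and the fields present can be sketched in Python as follows. This is only an illustration, not the algorithm of Appendix A.1: the function name and its arguments are hypothetical, and the fields are assumed to be carried in network byte order (most significant octet first). Using only the whole 3-octet fields that fit within the stated length also gives a defined result for lengths other than 2, 5, 8 and 11.

```python
FIELD_ORDER = ("EE0B", "ECEB", "EE1B")   # order of the fields on the wire
MAGIC = b"\xac\xce"                      # 0xACCE, experimental Kind=254 only


def parse_accecn_option(length, payload, experimental=False):
    """Return the 24-bit fields present in an AccECN Option as a dict.

    length  -- value of the option's Length octet
    payload -- the octets that follow the Kind and Length octets
    """
    if experimental:
        # The experimental form is 2 octets longer to hold the magic number.
        if payload[:2] != MAGIC:
            raise ValueError("missing 0xACCE magic number")
        payload = payload[2:]
        length -= 2
    # Kind and Length occupy 2 octets; each field occupies 3 octets.
    n_fields = min(max(length - 2, 0) // 3, len(FIELD_ORDER))
    return {
        FIELD_ORDER[i]: int.from_bytes(payload[3 * i:3 * i + 3], "big")
        for i in range(n_fields)
    }


# Length=8 carries EE0B and ECEB only; here ECEB reports 1500 bytes.
print(parse_accecn_option(8, bytes.fromhex("0000010005dc")))
```

A field that is absent from a truncated option simply does not appear in the result, which a caller would treat as "unchanged since the last option that carried it".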
937 All implementations of a Data Sender MUST be able to read in AccECN 938 Options of any of the above lengths. If the AccECN Option is of any 939 other length, implementations MUST use those whole 3 octet fields 940 that fit within the length and ignore the remainder of the option. 942 3.2.7. Path Traversal of the AccECN Option 944 3.2.7.1. Testing the AccECN Option during the Handshake 946 The TCP client MUST NOT include the AccECN TCP Option on the SYN. 947 Nonetheless, if the AccECN negotiation using the ECN flags in the 948 main TCP header (Section 3.1) is successful, it implicitly declares 949 that the endpoints also support the AccECN TCP Option. A fall-back 950 strategy for the loss of the SYN (possibly due to middlebox 951 interference) is specified in Section 3.1.2. 953 A TCP server that confirms its support for AccECN (in response to an 954 AccECN SYN from the client as described in Section 3.1) SHOULD also 955 include an AccECN TCP Option in the SYN/ACK. 957 A TCP client that has successfully negotiated AccECN SHOULD include 958 an AccECN Option in the first ACK at the end of the 3WHS. However, 959 this first ACK is not delivered reliably, so the TCP client SHOULD 960 also include an AccECN Option on the first data segment it sends (if 961 it ever sends one). 963 A host MAY omit the AccECN Option in any of these three cases 964 if it has cached knowledge that the packet would be likely to be 965 blocked on the path to the other host if it included an AccECN 966 Option. 968 3.2.7.2. Testing for Loss of Packets Carrying the AccECN Option 970 If after the normal TCP timeout the TCP server has not received an 971 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 972 lost, e.g. due to congestion, or a middlebox might be blocking the 973 AccECN Option. To expedite connection setup, the TCP server SHOULD 974 retransmit the SYN/ACK with the same TCP flags (AE, CWR and ECE) but 975 with no AccECN Option.
If this retransmission times out, to expedite 976 connection setup, the TCP server SHOULD disable AccECN and ECN for 977 this connection by retransmitting the SYN/ACK with AE=CWR=ECE=0 and 978 no AccECN Option. Implementers MAY use other fall-back strategies if 979 they are found to be more effective (e.g. falling back to classic 980 ECN feedback on the first retransmission; retrying the AccECN Option 981 for a second time before fall-back (most appropriate during high 982 levels of congestion); or falling back to classic ECN feedback rather 983 than non-ECN on the third retransmission). 985 If the TCP client detects that the first data segment it sent with 986 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 987 on the retransmission. Again, implementers MAY use other fall-back 988 strategies such as attempting to retransmit a second segment with the 989 AccECN Option before fall-back, and/or caching whether the AccECN 990 Option is blocked for subsequent connections. 992 Either host MAY include the AccECN Option in a subsequent segment to 993 retest whether the AccECN Option can traverse the path. 995 If the TCP server receives a second SYN with a request for AccECN 996 support, it should resend the SYN/ACK, again confirming its support 997 for AccECN, but this time without the AccECN Option. This approach 998 rules out any interference by middleboxes that may drop packets with 999 unknown options, even though it is more likely that the SYN/ACK would 1000 have been lost due to congestion. The TCP server MAY try to send 1001 another packet with the AccECN Option at a later point during the 1002 connection but should monitor if that packet got lost as well, in 1003 which case it SHOULD disable the sending of the AccECN Option for 1004 this half-connection. 
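The default (SHOULD-level) fall-back sequence for the server's SYN/ACK described in this section can be summarised by the following Python sketch. The function name and its arguments are hypothetical, and an implementer remains free to adopt one of the alternative strategies mentioned above.

```python
ECN_DISABLED = (0, 0, 0)   # AE=CWR=ECE=0 on the SYN/ACK


def synack_plan(transmission, accecn_flags):
    """Return ((AE, CWR, ECE), include_accecn_option) for a SYN/ACK.

    transmission -- 0 for the original SYN/ACK, 1 for the first
                    retransmission, 2 or more for later retransmissions
    accecn_flags -- the (AE, CWR, ECE) feedback chosen from Table 2
    """
    if transmission == 0:
        return accecn_flags, True       # test the path with the option
    if transmission == 1:
        return accecn_flags, False      # same flags, option removed
    return ECN_DISABLED, False          # give up on AccECN and ECN


for n in range(3):
    print(n, synack_plan(n, (0, 1, 0)))
```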
1006 Similarly, an AccECN end-point MAY separately memorize which data 1007 packets carried an AccECN Option and disable the sending of AccECN 1008 Options if the loss probability of those packets is significantly 1009 higher than that of all other data packets in the same connection. 1011 3.2.7.3. Testing for Stripping of the AccECN Option 1013 If the TCP client has successfully negotiated AccECN but does not 1014 receive an AccECN Option on the SYN/ACK, it switches into a mode that 1015 assumes that the AccECN Option is not available for this half 1016 connection. 1018 Similarly, if the TCP server has successfully negotiated AccECN but 1019 does not receive an AccECN Option on the first segment that 1020 acknowledges sequence space at least covering the ISN, it switches 1021 into a mode that assumes that the AccECN Option is not available for 1022 this half connection. 1024 While a host is in this mode that assumes incoming AccECN Options are 1025 not available, it MUST adopt the conservative interpretation of the 1026 ACE field discussed in Section 3.2.5. However, it cannot make any 1027 assumption about support of outgoing AccECN Options on the other half 1028 connection, so it SHOULD continue to send the AccECN Option itself 1029 (unless it has established that sending the AccECN Option is causing 1030 packets to be blocked as in Section 3.2.7.2). 1032 If a host is in the mode that assumes incoming AccECN Options are not 1033 available, but it receives an AccECN Option at any later point during 1034 the connection, this clearly indicates that the AccECN Option is not 1035 blocked on the respective path, and the AccECN endpoint MAY switch 1036 out of the mode that assumes the AccECN Option is not available for 1037 this half connection. 1039 3.2.7.4. 
Test for Zeroing of the AccECN Option 1041 For a related test for invalid initialization of the ACE field, see 1042 Section 3.2.3. 1044 Section 3.2 required the Data Receiver to initialize the r.e0b 1045 counter to a non-zero value. Therefore, in either direction the 1046 initial value of the EE0B field in the AccECN Option (if one exists) 1047 ought to be non-zero. If AccECN has been negotiated: 1049 o the TCP server MAY check the initial value of the EE0B field in 1050 the first segment that acknowledges sequence space that at least 1051 covers the ISN plus 1. If the initial value of the EE0B field is 1052 zero, the server will switch into a mode that ignores the AccECN 1053 Option for this half connection. 1055 o the TCP client MAY check the initial value of the EE0B field on 1056 the SYN/ACK. If the initial value of the EE0B field is zero, the 1057 client will switch into a mode that ignores the AccECN Option for 1058 this half connection. 1060 While a host is in the mode that ignores the AccECN Option it MUST 1061 adopt the conservative interpretation of the ACE field discussed in 1062 Section 3.2.5. 1064 Note that the Data Sender MUST NOT test whether the arriving byte 1065 counters in the initial AccECN Option have been initialized to 1066 specific valid values - the above checks solely test whether these 1067 fields have been incorrectly zeroed. This allows hosts to use 1068 different initial values as an additional signalling channel in 1069 future. Also note that the initial value of either field might be 1070 greater than its expected initial value, because the counters might 1071 already have been incremented. Nonetheless, the initial values of 1072 the counters have been chosen so that they cannot wrap to zero on 1073 these initial segments. 1075 3.2.7.5. Consistency between AccECN Feedback Fields 1077 When the AccECN Option is available it supplements but does not 1078 replace the ACE field.
An endpoint using AccECN feedback MUST always 1079 consider the information provided in the ACE field whether or not the 1080 AccECN Option is also available. 1082 If the AccECN option is present, the s.cep counter might increase 1083 while the s.ceb counter does not (e.g. due to a CE-marked control 1084 packet). The sender's response to such a situation is out of scope, 1085 and needs to be dealt with in a specification that uses ECN-capable 1086 control packets. Theoretically, this situation could also occur if a 1087 middlebox mangled the AccECN Option but not the ACE field. However, 1088 the Data Sender has to assume that the integrity of the AccECN Option 1089 is sound, based on the above test of the well-known initial values 1090 and optionally other integrity tests (Section 4.3). 1092 If either end-point detects that the s.ceb counter has increased but 1093 the s.cep has not (and by testing ACK coverage it is certain how much 1094 the ACE field has wrapped), this invalid protocol transition has to 1095 be due to some form of feedback mangling. So, the Data Sender MUST 1096 disable sending ECN-capable packets for the remainder of the half- 1097 connection by setting the IP/ECN field in all subsequent packets to 1098 Not-ECT. 1100 3.2.8. Usage of the AccECN TCP Option 1102 The following rules determine when a Data Receiver in AccECN mode 1103 sends the AccECN TCP Option, and which fields to include: 1105 Change-Triggered ACKs: If an arriving packet increments a different 1106 byte counter to that incremented by the previous packet, the Data 1107 Receiver MUST immediately send an ACK with an AccECN Option, 1108 without waiting for the next delayed ACK (this is in addition to 1109 the safety recommendation in Section 3.2.5 against ambiguity of 1110 the ACE field). 
1112 This is stated as a "MUST" so that the data sender can rely on 1113 change-triggered ACKs to detect transitions right from the very 1114 start of a flow, without first having to detect whether the 1115 receiver complies. A concern has been raised that certain offload 1116 hardware needed for high performance might not be able to support 1117 change-triggered ACKs, although high performance protocols such as 1118 DCTCP successfully use change-triggered ACKs. One possible 1119 compromise would be for the receiver to heuristically detect 1120 whether the sender is in slow-start, then to implement change- 1121 triggered ACKs in software while the sender is in slow-start, and 1122 offload to hardware otherwise. If the operator disables change- 1123 triggered ACKs, whether partially like this or otherwise, the 1124 operator will also be responsible for ensuring a co-ordinated 1125 sender algorithm is deployed; 1127 Continual Repetition: Otherwise, if arriving packets continue to 1128 increment the same byte counter, the Data Receiver can include an 1129 AccECN Option on most or all (delayed) ACKs, but it does not have 1130 to. If option space is limited on a particular ACK, the Data 1131 Receiver MUST give precedence to SACK information about loss. It 1132 SHOULD include an AccECN Option if the r.ceb counter has 1133 incremented and it MAY include an AccECN Option if r.e0b or 1134 r.e1b has incremented; 1136 Full-Length Options Preferred: It SHOULD always use full-length 1137 AccECN Options.
It MAY use shorter AccECN Options if space is 1138 limited, but it MUST include the counter(s) that have incremented 1139 since the previous AccECN Option and it MUST only truncate fields 1140 from the right-hand tail of the option to preserve the order of 1141 the remaining fields (see Section 3.2.6); 1143 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1144 length AccECN TCP Option on at least three ACKs per RTT, or on all 1145 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1146 example algorithm that satisfies this requirement). 1148 The following example series of arriving IP/ECN fields illustrates 1149 when a Data Receiver will emit an ACK if it is using a delayed ACK 1150 factor of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> 1151 ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK. 1153 For the avoidance of doubt, the change-triggered ACK mechanism is 1154 deliberately worded to ignore the arrival of a control packet with no 1155 payload, which therefore does not alter any byte counters, because it 1156 is important that TCP does not acknowledge pure ACKs. The change- 1157 triggered ACK approach will lead to some additional ACKs but it feeds 1158 back the timing and the order in which ECN marks are received with 1159 minimal additional complexity. 1161 Implementation note: sending an AccECN Option each time a different 1162 counter changes and including a full-length AccECN Option on every 1163 delayed ACK will satisfy the requirements described above and might 1164 be the easiest implementation, as long as sufficient space is 1165 available in each ACK (in total and in the option space).
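The example series of arriving IP/ECN fields given above can be reproduced with the following Python sketch of a Data Receiver that combines delayed ACKs with change-triggered ACKs. The sketch is illustrative only and its names are hypothetical. It assumes that every arriving packet carries payload and an ECN-capable or CE codepoint (so that each one increments a byte counter), and it ignores the delayed ACK timer and the CE-mark limit of Section 3.2.5.

```python
def ack_decisions(ecn_fields, delayed_ack_factor=2):
    """For each arriving data packet, decide whether an ACK is emitted.

    ecn_fields -- IP/ECN codepoints of the arriving packets, in order
    Returns a list of booleans, one per packet (True = ACK sent now).
    """
    decisions = []
    previous = None      # codepoint of the previous data packet, if any
    unacked = 0          # packets received since the last ACK
    for ecn in ecn_fields:
        unacked += 1
        changed = ecn != previous          # a different byte counter moved
        if changed or unacked >= delayed_ack_factor:
            decisions.append(True)
            unacked = 0
        else:
            decisions.append(False)
        previous = ecn
    return decisions


series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
for ecn, ack in zip(series, ack_decisions(series)):
    print(ecn, "-> ACK" if ack else "")
```

With a delayed ACK factor of 2, six of the nine packets in the series trigger an ACK: four because of a change of codepoint after the first packet, one because the very first packet counts as a change, and one because two segments had accumulated.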
1171 If a host has determined that segments with the AccECN Option always 1172 seem to be discarded somewhere along the path, it is no longer 1173 obliged to follow the above rules. 1175 3.3. AccECN Compliance by TCP Proxies, Offload Engines and other 1176 Middleboxes 1178 A large class of middleboxes split TCP connections. Such a middlebox 1179 would be compliant with the AccECN protocol if the TCP implementation 1180 on each side complied with the present AccECN specification and each 1181 side negotiated AccECN independently of the other side. 1183 Another large class of middleboxes intervenes to some degree at the 1184 transport layer, but attempts to be transparent (invisible) to the 1185 end-to-end connection. A subset of this class of middleboxes 1186 attempts to `normalise' the TCP wire protocol by checking that all 1187 values in header fields comply with a rather narrow interpretation of 1188 the TCP specifications. To comply with the present AccECN 1189 specification, such a middlebox MUST NOT change the ACE field or the 1190 AccECN Option and it MUST attempt to preserve the timing of each ACK 1191 (for example, if it coalesced ACKs it would not be AccECN-compliant). 1192 A middlebox claiming to be transparent at the transport layer MUST 1193 forward the AccECN TCP Option unaltered, whether or not the length 1194 value matches one of those specified in Section 3.2.6, and whether or 1195 not the initial values of the byte-counter fields are correct. This 1196 is because blocking apparently invalid values does not improve 1197 security (because AccECN hosts are required to ignore invalid values 1198 anyway), while it prevents the standardised set of values being 1199 extended in future (because outdated normalisers would block updated 1200 hosts from using the extended AccECN standard). 
1202 Hardware to offload certain TCP processing represents another large 1203 class of middleboxes, even though it is often a function of a host's 1204 network interface and rarely in its own 'box'. Leeway has been 1205 allowed in the present AccECN specification in the expectation that 1206 offload hardware could comply and still serve its function. 1207 Nonetheless, such hardware MUST attempt to preserve the timing of 1208 each ACK (for example, if it coalesced ACKs it would not be AccECN- 1209 compliant). 1211 4. Interaction with Other TCP Variants 1213 This section is informative, not normative. 1215 4.1. Compatibility with SYN Cookies 1217 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1218 protect itself from SYN flooding attacks. It places minimal commonly 1219 used connection state in the SYN/ACK, and deliberately does not hold 1220 any state while waiting for the subsequent ACK (e.g. it closes the 1221 thread). Therefore it cannot record the fact that it entered AccECN 1222 mode for both half-connections. Indeed, it cannot even remember 1223 whether it negotiated the use of classic ECN [RFC3168]. 1225 Nonetheless, such a server can determine that it negotiated AccECN as 1226 follows. If a TCP server using SYN Cookies supports AccECN and if it 1227 receives a pure ACK that acknowledges an ISN that is a valid SYN 1228 cookie, and if the ACK contains an ACE field with the value 0b010 to 1229 0b111 (decimal 2 to 7), it can assume that: 1231 o the TCP client must have requested AccECN support on the SYN 1233 o it (the server) must have confirmed that it supported AccECN 1235 Therefore the server can switch itself into AccECN mode, and continue 1236 as if it had never forgotten that it switched itself into AccECN mode 1237 earlier. 
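The inference described above can be sketched in Python as follows. The function name is hypothetical; the initial value of s.cep is taken from Table 3, and the sketch assumes that the caller has already validated the SYN cookie in the acknowledgement number.

```python
def state_from_cookie_ack(ace):
    """Recover AccECN state from the ACE field on the ACK of a SYN cookie.

    Returns (accecn_mode, initial_s_cep); initial_s_cep is None when the
    server does not enter AccECN mode.
    """
    if 0b010 <= ace <= 0b111:
        # The client requested AccECN and the server confirmed it.
        # Only 0b110 reports that the SYN/ACK arrived CE-marked (Table 3).
        return True, 6 if ace == 0b110 else 5
    # 0b000 or 0b001: the client did not request AccECN.
    return False, None


for ace in range(8):
    print(format(ace, "03b"), state_from_cookie_ack(ace))
```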

   If the pure ACK that acknowledges a SYN cookie contains an ACE field
   with the value 0b000 or 0b001, these values indicate that the client
   did not request support for AccECN and therefore the server does not
   enter AccECN mode for this connection. Further, 0b001 on the ACK
   implies that the server sent an ECN-capable SYN/ACK, which was marked
   CE in the network, and the non-AccECN client fed this back by setting
   ECE on the ACK of the SYN/ACK.

4.2.  Compatibility with Other TCP Options and Experiments

   AccECN is compatible (at least on paper) with the most commonly used
   TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is
   also compatible with the recent promising experimental TCP options
   TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]).
   AccECN is friendly to all these protocols, because space for TCP
   options is particularly scarce on the SYN, where AccECN consumes zero
   additional header space.

   When option space is under pressure from other options, Section 3.2.8
   provides guidance on how important it is to send an AccECN Option and
   whether it needs to be a full-length option.

4.3.  Compatibility with Feedback Integrity Mechanisms

   Three alternative mechanisms are available to assure the integrity of
   ECN and/or loss signals. AccECN is compatible with any of these
   approaches:

   o  The Data Sender can test the integrity of the receiver's ECN (or
      loss) feedback by occasionally setting the IP-ECN field to a value
      normally only set by the network (and/or deliberately leaving a
      sequence number gap). Then it can test whether the Data
      Receiver's feedback faithfully reports what it expects
      [I-D.moncaster-tcpm-rcv-cheat].
      Unlike the ECN Nonce [RFC3540],
      this approach does not waste the ECT(1) codepoint in the IP
      header, it does not require standardisation and it does not rely
      on misbehaving receivers volunteering to reveal feedback
      information that allows them to be detected. However, setting the
      CE mark by the sender might conceal actual congestion feedback
      from the network and should therefore only be done sparingly.

   o  Networks generate congestion signals when they are becoming
      congested, so networks are more likely than Data Senders to be
      concerned about the integrity of the receiver's feedback of these
      signals. A network can enforce a congestion response to its ECN
      markings (or packet losses) using congestion exposure (ConEx)
      audit [RFC7713]. Whether the receiver or a downstream network is
      suppressing congestion feedback or the sender is unresponsive to
      the feedback, or both, ConEx audit can neutralise any advantage
      that any of these three parties would otherwise gain.

      ConEx is a change to the Data Sender that is most useful when
      combined with AccECN. Without AccECN, the ConEx behaviour of a
      Data Sender would have to be more conservative than would be
      necessary if it had the accurate feedback of AccECN.

   o  The TCP authentication option (TCP-AO [RFC5925]) can be used to
      detect any tampering with AccECN feedback between the Data
      Receiver and the Data Sender (whether malicious or accidental).
      The AccECN fields are immutable end-to-end, so they are amenable
      to TCP-AO protection, which covers TCP options by default.
      However, TCP-AO is often too brittle to use on many end-to-end
      paths, where middleboxes can make verification fail in their
      attempts to improve performance or security, e.g. by
      resegmentation or shifting the sequence space.

   Originally the ECN Nonce [RFC3540] was proposed to ensure integrity
   of congestion feedback.
   With minor changes AccECN could be optimised
   for the possibility that the ECT(1) codepoint might be used as an ECN
   Nonce. However, given RFC 3540 is being reclassified as historic,
   the AccECN design has been generalised so that it ought to be able to
   support other possible uses of the ECT(1) codepoint, such as a lower
   severity or a more instant congestion signal than CE.

5.  Protocol Properties

   This section is informative, not normative. It describes how well the
   protocol satisfies the agreed requirements for a more accurate ECN
   feedback protocol [RFC7560].

   Accuracy:  From each ACK, the Data Sender can infer the number of new
      CE marked segments since the previous ACK. This provides better
      accuracy on CE feedback than classic ECN. In addition, if the
      AccECN Option is present (not blocked by the network path), the
      numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

   Overhead:  The AccECN scheme is divided into two parts. The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header. The supplementary part adds an additional TCP option
      consuming up to 11 bytes. However, no TCP option is consumed in
      the SYN.

   Ordering:  The order in which marks arrive at the Data Receiver is
      preserved in AccECN feedback, because the Data Receiver is
      expected to send an ACK immediately whenever a different mark
      arrives.

   Timeliness:  While the same ECN markings are arriving continually at
      the Data Receiver, it can defer ACKs as TCP does normally, but it
      will immediately send an ACK as soon as a different ECN marking
      arrives.

   Timeliness vs Overhead:  Change-Triggered ACKs are intended to enable
      latency-sensitive uses of ECN feedback by capturing the timing of
      transitions but not wasting resources while the state of the
      signalling system is stable.
      The receiver can control how
      frequently it sends the AccECN TCP Option and therefore it can
      control the overhead induced by AccECN.

   Resilience:  All information is provided based on counters.
      Therefore if ACKs are lost, the counters on the first ACK
      following the losses allow the Data Sender to immediately recover
      the number of the ECN markings that it missed.

   Resilience against Bias:  Because feedback is based on repetition of
      counters, random losses do not remove any information, they only
      delay it. Therefore, even though some ACKs are change-triggered,
      random losses will not alter the proportions of the different ECN
      markings in the feedback.

   Resilience vs Overhead:  If space is limited in some segments (e.g.
      because more options are needed on some segments, such as the SACK
      option after loss), the Data Receiver can send AccECN Options less
      frequently or truncate fields that have not changed, usually down
      to as little as 5 bytes. However, it has to send a full-sized
      AccECN Option at least three times per RTT, which the Data Sender
      can rely on as a regular beacon or checkpoint.

   Resilience vs Timeliness and Ordering:  Ordering information and the
      timing of transitions cannot be communicated in three cases: i)
      during ACK loss; ii) if something on the path strips the AccECN
      Option; or iii) if the Data Receiver is unable to support
      Change-Triggered ACKs.

   Complexity:  An AccECN implementation solely involves simple counter
      increments, some modulo arithmetic to communicate the least
      significant bits and allow for wrap, and some heuristics for
      safety against fields cycling due to prolonged periods of ACK
      loss. Each host needs to maintain eight additional counters.
      The hosts have to apply some additional tests to detect tampering
      by middleboxes, but in general the protocol is simple to
      understand, simple to implement and requires few cycles per packet
      to execute.

   Integrity:  AccECN is compatible with at least three approaches that
      can assure the integrity of ECN feedback. If the AccECN Option is
      stripped the resolution of the feedback is degraded, but the
      integrity of this degraded feedback can still be assured.

   Backward Compatibility:  If only one endpoint supports the AccECN
      scheme, it will fall back to the most advanced ECN feedback scheme
      supported by the other end.

   Backward Compatibility:  If the AccECN Option is stripped by a
      middlebox, AccECN still provides basic congestion feedback in the
      ACE field. Further, AccECN can be used to detect mangling of the
      IP ECN field; mangling of the TCP ECN flags; blocking of
      ECT-marked segments; and blocking of segments carrying the AccECN
      Option. It can detect these conditions during TCP's 3WHS so that
      it can fall back to operation without ECN and/or operation without
      the AccECN Option.

   Forward Compatibility:  The behaviour of endpoints and middleboxes is
      carefully defined for all reserved or currently unused codepoints
      in the scheme, to ensure that any blocking of anomalous values is
      always at least under reversible policy control.

6.  IANA Considerations

   This document reassigns bit 7 of the TCP header flags to the AccECN
   experiment. This bit was previously called the Nonce Sum (NS) flag
   [RFC3540], but RFC 3540 is being reclassified as historic
   [I-D.ietf-tsvwg-ecn-experimentation].
   The flag will now be defined
   as:

                 +-----+-------------------+-----------+
                 | Bit | Name              | Reference |
                 +-----+-------------------+-----------+
                 | 7   | AE (Accurate ECN) | RFC XXXX  |
                 +-----+-------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1 ]

   This document also defines a new TCP option for AccECN, assigned a
   value of TBD1 (decimal) from the TCP option space. This value is
   defined as:

          +------+--------+-----------------------+-----------+
          | Kind | Length | Meaning               | Reference |
          +------+--------+-----------------------+-----------+
          | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
          +------+--------+-----------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

   Early implementation before the IANA allocation MUST follow [RFC6994]
   and use experimental option 254 and magic number 0xACCE (16 bits),
   then migrate to the new option after the allocation.

7.  Security Considerations

   If ever the supplementary part of AccECN based on the new AccECN TCP
   Option is unusable (due for example to middlebox interference) the
   essential part of AccECN's congestion feedback offers only limited
   resilience to long runs of ACK loss (see Section 3.2.5). These
   problems are unlikely to be due to malicious intervention (because if
   an attacker could strip a TCP option or discard a long run of ACKs it
   could wreak other arbitrary havoc). However, it would be of concern
   if AccECN's resilience could be indirectly compromised during a
   flooding attack.
   AccECN is still considered safe though, because if
   the option is not present, the AccECN Data Sender is then required
   to switch to more conservative assumptions about wrap of congestion
   indication counters (see Section 3.2.5 and Appendix A.2).

   Section 4.1 describes how a TCP server can negotiate AccECN and use
   the SYN cookie method for mitigating SYN flooding attacks.

   There is concern that ECN markings could be altered or suppressed,
   particularly because a misbehaving Data Receiver could increase its
   own throughput at the expense of others. AccECN is compatible with
   the three schemes known to assure the integrity of ECN feedback (see
   Section 4.3 for details). If the AccECN Option is stripped by an
   incorrectly implemented middlebox, the resolution of the feedback
   will be degraded, but the integrity of this degraded information can
   still be assured.

   There is a potential concern that a receiver could deliberately omit
   the AccECN Option pretending that it had been stripped by a
   middlebox. No known way can yet be contrived to take advantage of
   this downgrade attack, but it is mentioned here in case someone else
   can contrive one.

   The AccECN protocol is not believed to introduce any new privacy
   concerns, because it merely counts and feeds back signals at the
   transport layer that had already been visible at the IP layer.

8.  Acknowledgements

   We want to thank Koen De Schepper, Praveen Balasubramanian and
   Michael Welzl for their input and discussion. The idea of using the
   three ECN-related TCP flags as one field for more accurate TCP-ECN
   feedback was first introduced in the re-ECN protocol that was the
   ancestor of ConEx.

   Bob Briscoe was part-funded by the European Community under its
   Seventh Framework Programme through the Reducing Internet Transport
   Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
   (ICT-317756). The views expressed here are solely those of the
   authors.

   This work is partly supported by the European Commission under
   Horizon 2020 grant agreement no. 688421 Measurement and Architecture
   for a Middleboxed Internet (MAMI), and by the Swiss State Secretariat
   for Education, Research, and Innovation under contract no. 15.0268.
   This support does not imply endorsement.

9.  Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF TCP maintenance and minor modifications working
   group mailing list, and/or to the authors.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, DOI 10.17487/RFC6994, August 2013.

10.2.  Informative References

   [I-D.ietf-tcpm-alternativebackoff-ecn]
              Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst,
              "TCP Alternative Backoff with ECN (ABE)",
              draft-ietf-tcpm-alternativebackoff-ecn-02 (work in
              progress), October 2017.

   [I-D.ietf-tcpm-generalized-ecn]
              Bagnulo, M. and B.
              Briscoe, "ECN++: Adding Explicit
              Congestion Notification (ECN) to TCP Control Packets",
              draft-ietf-tcpm-generalized-ecn-01 (work in progress),
              September 2017.

   [I-D.ietf-tsvwg-ecn-experimentation]
              Black, D., "Relaxing Restrictions on Explicit Congestion
              Notification (ECN) Experimentation",
              draft-ietf-tsvwg-ecn-experimentation-07 (work in
              progress), October 2017.

   [I-D.ietf-tsvwg-l4s-arch]
              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
              Low Loss, Scalable Throughput (L4S) Internet Service:
              Architecture", draft-ietf-tsvwg-l4s-arch-00 (work in
              progress), May 2017.

   [I-D.kuehlewind-tcpm-ecn-fallback]
              Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path
              Probing and Fallback",
              draft-kuehlewind-tcpm-ecn-fallback-01 (work in progress),
              September 2013.

   [I-D.moncaster-tcpm-rcv-cheat]
              Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to
              Allow Senders to Identify Receiver Non-Compliance",
              draft-moncaster-tcpm-rcv-cheat-03 (work in progress),
              July 2014.

   [Mandalari18]
              Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe.
              Alay, "Measuring ECN++: Good News for ++, Bad News for ECN
              over Mobile", IEEE Communications Magazine, March 2018.

              (to appear)

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, DOI 10.17487/RFC3540, June 2003.

   [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
              Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010.

   [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

   [RFC7560]  Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
              "Problem Statement and Requirements for Increased Accuracy
              in Explicit Congestion Notification (ECN) Feedback",
              RFC 7560, DOI 10.17487/RFC7560, August 2015.

   [RFC7713]  Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts, Abstract Mechanism, and Requirements", RFC 7713,
              DOI 10.17487/RFC7713, December 2015.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017.

Appendix A.  Example Algorithms

   This appendix is informative, not normative. It gives example
   algorithms that would satisfy the normative requirements of the
   AccECN protocol. However, implementers are free to choose other ways
   to implement the requirements.

A.1.  Example Algorithm to Encode/Decode the AccECN Option

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE byte counter r.ceb into the ECEB field within the
   AccECN TCP Option, and how a Data Sender in AccECN mode could decode
   the ECEB field into its byte counter s.ceb. The other counters for
   bytes marked ECT(0) and ECT(1) in the AccECN Option would be
   similarly encoded and decoded.
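
   As a runnable illustration of the encoding and decoding that the
   remainder of this section specifies, the following sketch restates
   the ECEB arithmetic in Python. It is a minimal sketch, not a
   reference implementation: the function names encode_eceb and
   decode_eceb are hypothetical, and the counters are passed as plain
   integers rather than held in per-connection state.

```python
DIVOPT = 1 << 24  # 2^24: the ECEB field carries the 24 least significant bits


def encode_eceb(r_ceb: int) -> int:
    """Data Receiver: value written into the ECEB field of an ACK."""
    return r_ceb % DIVOPT


def decode_eceb(s_ceb: int, eceb: int, newly_acked_b: int) -> int:
    """Data Sender: return the updated s.ceb counter.

    newly_acked_b is the amount of new data (bytes) that the ACK
    acknowledges; a negative value means the ACK has been superseded,
    so the AccECN Option is ignored and s.ceb is left unchanged.
    """
    if newly_acked_b < 0:
        return s_ceb
    # Minimum non-negative difference between the field and the local
    # counter, allowing for wrap of the 24-bit field.
    d_ceb = (eceb + DIVOPT - (s_ceb % DIVOPT)) % DIVOPT
    return s_ceb + d_ceb
```

   With s.ceb = 33,554,433 and ECEB = 1461, decode_eceb() returns
   33,555,893, which is the worked example given in this section.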

   It is assumed that each local byte counter is an unsigned integer
   greater than 24b (probably 32b), and that the following constant has
   been assigned:

      DIVOPT = 2^24

   Every time a CE marked data segment arrives, the Data Receiver
   increments its local value of r.ceb by the size of the TCP Data.
   Whenever it sends an ACK with the AccECN Option, the value it writes
   into the ECEB field is

      ECEB = r.ceb % DIVOPT

   where '%' is the modulo operator.

   On the arrival of an AccECN Option, the Data Sender uses the TCP
   acknowledgement number and any SACK options to calculate newlyAckedB,
   the amount of new data that the ACK acknowledges in bytes. If
   newlyAckedB is negative it means that a more up to date ACK has
   already been processed, so this ACK has been superseded and the Data
   Sender has to ignore the AccECN Option. Then the Data Sender
   calculates the minimum difference d.ceb between the ECEB field and
   its local s.ceb counter, using modulo arithmetic as follows:

      if (newlyAckedB >= 0) {
          d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
          s.ceb += d.ceb
      }

   For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
   then

      s.ceb % DIVOPT = 1
               d.ceb = (1461 + 2^24 - 1) % 2^24
                     = 1460
               s.ceb = 33,554,433 + 1460
                     = 33,555,893

A.2.  Example Algorithm for Safety Against Long Sequences of ACK Loss

   The example algorithms below show how a Data Receiver in AccECN mode
   could encode its CE packet counter r.cep into the ACE field, and how
   the Data Sender in AccECN mode could decode the ACE field into its
   s.cep counter. The Data Sender's algorithm includes code to
   heuristically detect a long enough unbroken string of ACK losses that
   could have concealed a cycle of the congestion counter in the ACE
   field of the next ACK to arrive.

   Two variants of the algorithm are given: i) a more conservative
   variant for a Data Sender to use if it detects that the AccECN Option
   is not available (see Section 3.2.5 and Section 3.2.7); and ii) a
   less conservative variant that is feasible when complementary
   information is available from the AccECN Option.

A.2.1.  Safety Algorithm without the AccECN Option

   It is assumed that each local packet counter is a sufficiently sized
   unsigned integer (probably 32b) and that the following constant has
   been assigned:

      DIVACE = 2^3

   Every time a CE marked packet arrives, the Data Receiver increments
   its local value of r.cep by 1. It repeats the same value of ACE in
   every subsequent ACK until the next CE marking arrives, where

      ACE = r.cep % DIVACE.

   If the Data Sender received an earlier value of the counter that had
   been delayed due to ACK reordering, it might incorrectly calculate
   that the ACE field had wrapped. Therefore, on the arrival of every
   ACK, the Data Sender uses the TCP acknowledgement number and any SACK
   options to calculate newlyAckedB, the amount of new data that the ACK
   acknowledges. If newlyAckedB is negative it means that a more up to
   date ACK has already been processed, so this ACK has been superseded
   and the Data Sender has to ignore the AccECN Option. If newlyAckedB
   is zero, to break the tie the Data Sender could use timestamps (if
   present) to work out newlyAckedT, the amount of new time that the ACK
   acknowledges. Then the Data Sender calculates the minimum difference
   d.cep between the ACE field and its local s.cep counter, using modulo
   arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
          d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

   Section 3.2.5 requires the Data Sender to assume that the ACE field
   did cycle if it could have cycled under prevailing conditions.
   The 3-bit ACE field in an arriving ACK could have cycled and become
   ambiguous to the Data Sender if a row of ACKs goes missing that
   covers a stream of data long enough to contain 8 or more CE marks.
   We use the word `missing' rather than `lost', because some or all the
   missing ACKs might arrive eventually, but out of order. Even if some
   of the lost ACKs are piggy-backed on data (i.e. not pure ACKs),
   retransmissions will not repair the lost AccECN information, because
   AccECN requires retransmissions to carry the latest AccECN counters,
   not the original ones.

   The phrase `under prevailing conditions' allows the Data Sender to
   take account of the prevailing size of data segments and the
   prevailing CE marking rate just before the sequence of ACK losses.
   However, we shall start with the simplest algorithm, which assumes
   segments are all full-sized and ultra-conservatively it assumes that
   ECN marking was 100% on the forward path when ACKs on the reverse
   path started to all be dropped. Specifically, if newlyAckedB is the
   amount of data that an ACK acknowledges since the previous ACK, then
   the Data Sender could assume that this acknowledges newlyAckedPkt
   full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. Then it
   could assume that the ACE field incremented by

      dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE).

   For example, imagine an ACK acknowledges newlyAckedPkt=9 more
   full-size segments than any previous ACK, and that ACE increments by
   a minimum of 2 CE marks (d.cep=2). The above formula works out that
   it would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8)
   = 2). However, if ACE increases by a minimum of 2 but acknowledges 10
   full-sized segments, then it would be necessary to assume that there
   could have been 10 CE marks (because 10 - ((10-2) % 8) = 10).
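
   The conservative estimate above can be sketched in Python as
   follows. This is an illustration under stated assumptions, not a
   definitive implementation: the function names are hypothetical,
   newlyAckedB/MSS is assumed to round down, and an explicit guard is
   added for the case where fewer full-sized segments are acknowledged
   than d.cep, because Python's modulo operator (unlike that of C)
   never returns a negative result.

```python
DIVACE = 1 << 3  # 2^3: the ACE field is 3 bits wide


def min_d_cep(ace: int, s_cep: int) -> int:
    """Minimum increase in CE-marked packets implied by the ACE field."""
    return (ace + DIVACE - (s_cep % DIVACE)) % DIVACE


def d_safer_cep(newly_acked_b: int, mss: int, d_cep: int) -> int:
    """Largest count congruent to d_cep (mod DIVACE) that does not
    exceed the number of full-sized segments newly acknowledged."""
    newly_acked_pkt = newly_acked_b // mss  # assumption: round down
    if newly_acked_pkt < d_cep:
        # Assumed guard: never report fewer marks than ACE itself implies.
        return d_cep
    return newly_acked_pkt - ((newly_acked_pkt - d_cep) % DIVACE)
```

   For instance, with an MSS of 1460 bytes, d_safer_cep(9 * 1460, 1460,
   2) yields 2 and d_safer_cep(10 * 1460, 1460, 2) yields 10, matching
   the two examples above.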

   Implementers could build in more heuristics to estimate prevailing
   average segment size and prevailing ECN marking. For instance,
   newlyAckedPkt in the above formula could be replaced with
   newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
   segment size and p is the prevailing ECN marking probability.
   However, ultimately, if TCP's ECN feedback becomes inaccurate it
   still has loss detection to fall back on. Therefore, it would seem
   safe to implement a simple algorithm, rather than a perfect one.

   The simple algorithm for dSafer.cep above requires no monitoring of
   prevailing conditions and it would still be safe if, for example,
   segments were on average at least 5% of full-sized as long as ECN
   marking was 5% or less. Assuming it was used, the Data Sender would
   increment its packet counter as follows:

      s.cep += dSafer.cep

   If missing acknowledgement numbers arrive later (due to reordering),
   Section 3.2.5 says "the Data Sender MAY attempt to neutralise the
   effect of any action it took based on a conservative assumption that
   it later found to be incorrect". To do this, the Data Sender would
   have to store the values of all the relevant variables whenever it
   made assumptions, so that it could re-evaluate them later. Given
   this could become complex and it is not required, we do not attempt
   to provide an example of how to do this.

A.2.2.  Safety Algorithm with the AccECN Option

   When the AccECN Option is available on the ACKs before and after the
   possible sequence of ACK losses, if the Data Sender only needs
   CE-marked bytes, it will have sufficient information in the AccECN
   Option without needing to process the ACE field.
   However, if for
   some reason it needs CE-marked packets, if dSafer.cep is different
   from d.cep, it can calculate the average marked segment size that
   each implies to determine whether d.cep is likely to be a safe enough
   estimate. Specifically, it could use the following algorithm, where
   d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

      SAFETY_FACTOR = 2
      if (dSafer.cep > d.cep) {
          s = d.ceb/d.cep
          if (s <= MSS) {
              sSafer = d.ceb/dSafer.cep
              if (sSafer < MSS/SAFETY_FACTOR)
                  dSafer.cep = d.cep  % d.cep is a safe enough estimate
          } % else
              % No need for else; dSafer.cep is already correct,
              % because d.cep must have been too small
      }

   The chart below shows when the above algorithm will consider d.cep
   can replace dSafer.cep as a safe enough estimate of the number of
   CE-marked packets:

             ^
       sSafer|
             |
          MSS+
             |
             |         dSafer.cep
             |                is
        MSS/2+--------------+ safest
             |              |
             | d.cep is safe|
             |    enough    |
             +-------------------->
                           MSS   s

   The following examples give the reasoning behind the algorithm,
   assuming MSS=1,460 bytes:

   o  if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
      sSafer=182.5.

      Therefore even though the average size of 8 data segments is
      unlikely to have been as small as MSS/8, d.cep cannot have been
      correct, because it would imply an average segment size greater
      than the MSS.

   o  if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
      sSafer=146.

      Therefore d.cep is safe enough, because the average size of 10
      data segments is unlikely to have been as small as MSS/10.

   o  if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
      sSafer=680.

      Therefore d.cep is safe enough, because the average data segment
      size is more likely to have been just less than one MSS, rather
      than below MSS/2.
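
   The choice between d.cep and dSafer.cep described in this section
   could be sketched in Python as shown below. This is a sketch only:
   the function name choose_d_cep is hypothetical, and the case
   d.cep = 0 is treated as an infinite implied segment size (as in the
   first example above) rather than performing a division by zero.

```python
SAFETY_FACTOR = 2


def choose_d_cep(d_cep: int, d_safer_cep: int, d_ceb: int, mss: int) -> int:
    """Return the number of newly CE-marked packets to assume.

    d_ceb is the amount of newly CE-marked bytes decoded from the
    AccECN Option (see Appendix A.1).
    """
    if d_safer_cep > d_cep:
        # Average marked segment size implied by d.cep; with d.cep == 0
        # it is infinite, so d.cep must have been too small.
        if d_cep > 0 and d_ceb / d_cep <= mss:
            s_safer = d_ceb / d_safer_cep
            if s_safer < mss / SAFETY_FACTOR:
                return d_cep  # d.cep is a safe enough estimate
    return d_safer_cep
```

   Applied to the three examples above with MSS = 1460, the function
   returns 8, 2 and 7 respectively.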

   If pure ACKs were allowed to be ECN-capable, missing ACKs would be
   far less likely. However, because [RFC3168] currently precludes
   this, the above algorithm assumes that pure ACKs are not ECN-capable.

A.3.  Example Algorithm to Estimate Marked Bytes from Marked Packets

   If the AccECN Option is not available, the Data Sender can only
   decode CE-marking from the ACE field in packets. Every time an ACK
   arrives, to convert this into an estimate of CE-marked bytes, it
   needs an average of the segment size, s_ave. Then it can add or
   subtract s_ave from the value of d.ceb as the value of d.cep
   increments or decrements.

   To calculate s_ave, it could keep a record of the byte numbers of all
   the boundaries between packets in flight (including control packets),
   and recalculate s_ave on every ACK. However it would be simpler to
   merely maintain a counter packets_in_flight for the number of packets
   in flight (including control packets), which it could update once per
   RTT. Either way, it would estimate s_ave as:

      s_ave ~= flightsize / packets_in_flight,

   where flightsize is the variable that TCP already maintains for the
   number of bytes in flight. To avoid floating point arithmetic, it
   could right-bit-shift by lg(packets_in_flight), where lg() means log
   base 2.

   An alternative would be to maintain an exponentially weighted moving
   average (EWMA) of the segment size:

      s_ave = a * s + (1-a) * s_ave,

   where a is the decay constant for the EWMA. However, then it is
   necessary to choose a good value for this constant, which ought to
   depend on the number of packets in flight. Also the decay constant
   needs to be a power of two to avoid floating point arithmetic.

A.4.  Example Algorithm to Beacon AccECN Options

   Section 3.2.8 requires a Data Receiver to beacon a full-length AccECN
   Option at least 3 times per RTT.
   This could be implemented by
   maintaining a variable to store the number of ACKs (pure and data
   ACKs) since a full AccECN Option was last sent and another for the
   approximate number of ACKs sent in the last round trip time:

      if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
          send_full_AccECN_Option()

   For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
   rather than 3, so that the division could be implemented as an
   integer right bit-shift by lg(BEACON_FREQ).

   In certain operating systems, it might be too complex to maintain
   acks_in_round. In others it might be possible by tagging each data
   segment in the retransmit buffer with the number of ACKs sent at the
   point that segment was sent. This would not work well if the Data
   Receiver was not sending data itself, in which case it might be
   necessary to beacon based on time instead, as follows:

      if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
          send_full_AccECN_Option()

   This time-based approach does not work well when all the ACKs are
   sent early in each round trip, as is the case during slow-start. In
   this case few options will be sent (possibly even fewer than 3 per
   RTT). However, when continuously sending data, data packets as well
   as ACKs will spread out equally over the RTT and sufficient ACKs with
   the AccECN Option will be sent.

A.5.  Example Algorithm to Count Not-ECT Bytes

   A Data Sender in AccECN mode can infer the amount of TCP payload data
   arriving at the receiver marked Not-ECT from the difference between
   the amount of newly ACKed data and the sum of the bytes with the
   other three markings, d.ceb, d.e0b and d.e1b. Note that, because
   r.e0b is initialized to 1 and the other two counters are initialized
   to 0, the initial sum will be 1, which matches the initial offset of
   the TCP sequence number on completion of the 3WHS.
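
   The difference described above amounts to a single subtraction, as
   the following Python sketch shows. The function name not_ect_bytes
   is hypothetical, and clamping a negative result to zero is an
   assumption made for this illustration, not a behaviour required by
   this specification.

```python
def not_ect_bytes(newly_acked_b: int, d_ceb: int, d_e0b: int, d_e1b: int) -> int:
    """Estimate the newly acknowledged payload bytes that arrived Not-ECT.

    d_ceb, d_e0b and d_e1b are the increases in the CE, ECT(0) and
    ECT(1) byte counters decoded from the AccECN Option.
    """
    marked = d_ceb + d_e0b + d_e1b
    # Assumption: treat a negative difference as zero Not-ECT bytes.
    return max(0, newly_acked_b - marked)
```

   If nearly all newly acknowledged bytes are reported as Not-ECT over
   a sustained period, that would suggest the ECN field is being
   cleared somewhere on the path.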

   For this approach to be precise, it has to be assumed that spurious
   (unnecessary) retransmissions do not lead to double counting. This
   assumption is currently correct, given that RFC 3168 requires that
   the Data Sender marks retransmitted segments as Not-ECT. However,
   the converse is not true; necessary retransmissions will result in
   under-counting.

   However, such precision is unlikely to be necessary. The only known
   use of a count of Not-ECT marked bytes is to test whether equipment
   on the path is clearing the ECN field (perhaps due to an out-dated
   attempt to clear, or bleach, what used to be the ToS field). To
   detect bleaching it will be sufficient to detect whether nearly all
   bytes arrive marked as Not-ECT. Therefore there should be no need to
   keep track of the details of retransmissions.

Authors' Addresses

   Bob Briscoe
   CableLabs
   UK

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/


   Mirja Kuehlewind
   ETH Zurich
   Zurich
   Switzerland

   EMail: mirja.kuehlewind@tik.ee.ethz.ch


   Richard Scheffenegger
   Vienna
   Austria

   EMail: rscheff@gmx.at