TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                                 CableLabs
Intended status: Experimental                              M. Kuehlewind
Expires: September 6, 2018                                    ETH Zurich
                                                        R. Scheffenegger
                                                           March 5, 2018

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-06

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, new TCP mechanisms like Congestion Exposure (ConEx) or
   Data Center TCP (DCTCP) need more accurate ECN feedback information
   whenever more than one marking is received in one RTT. This document
   specifies an experimental scheme to provide more than one feedback
   signal per RTT in the TCP header. Given TCP header space is scarce,
   it overloads the three existing ECN-related flags in the TCP header
   and provides additional information in a new TCP option.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 6, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Retransmission of the SYN
     3.2.  AccECN Feedback
       3.2.1.  Initialization of Feedback Counters at the Data Sender
       3.2.2.  The ACE Field
       3.2.3.  Testing for Zeroing of the ACE Field
       3.2.4.  Testing for Mangling of the IP/ECN Field
       3.2.5.  Safety against Ambiguity of the ACE Field
       3.2.6.  The AccECN Option
       3.2.7.  Path Traversal of the AccECN Option
       3.2.8.  Usage of the AccECN TCP Option
     3.3.  Requirements for TCP Proxies, Offload Engines and other
           Middleboxes on AccECN Compliance
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other TCP Options and Experiments
     4.3.  Compatibility with Feedback Integrity Mechanisms
   5.  Protocol Properties
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Authors' Addresses

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [RFC8257] or L4S [I-D.ietf-tsvwg-l4s-arch] need
   more accurate ECN feedback information than provided by the feedback
   scheme as specified in [RFC3168] whenever more than one marking is
   received in one RTT. This document specifies an alternative feedback
   scheme that provides more accurate information and could be used by
   these new TCP extensions. A fuller treatment of the motivation for
   this specification is given in the associated requirements document
   [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic
   TCP/ECN feedback, not a fork in the design of TCP. AccECN feedback
   complements TCP's loss feedback and it supplements classic TCP/ECN
   feedback, so its applicability is intended to include all public and
   private IP networks (and even any non-IP networks over which TCP is
   used today), whether or not any nodes on the path support ECN of
   whatever flavour.

   Until the AccECN experiment succeeds, [RFC3168] will remain as the
   only standards track specification for adding ECN to TCP. To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN feedback overloads the two existing ECN flags, as well as the
   currently reserved flag previously called NS, in the main TCP header
   with new definitions, so both ends have to support the new wire
   protocol before it can be used. Therefore during the TCP handshake
   the two ends use the three ECN-related flags in the TCP header to
   negotiate the most advanced feedback protocol that they can both
   support.

   AccECN is solely an (experimental) change to the TCP wire protocol;
   it only specifies the negotiation and signaling of more accurate ECN
   feedback from a TCP Data Receiver to a Data Sender. It is completely
   independent of how TCP might respond to congestion feedback, which is
   out of scope. For that we refer to [RFC3168] or any RFC that
   specifies a different response to TCP ECN feedback, for example:
   [RFC8257]; or the ECN experiments referred to in [RFC8311], namely: a
   TCP-based Low Latency Low Loss Scalable (L4S) congestion control
   [I-D.ietf-tsvwg-l4s-arch]; ECN-capable TCP control packets
   [I-D.ietf-tcpm-generalized-ecn], or Alternative Backoff with ECN
   (ABE) [I-D.ietf-tcpm-alternativebackoff-ecn].

   It is likely (but not required) that the AccECN protocol will be
   implemented along with the following experimental additions to the
   TCP-ECN protocol: ECN-capable TCP control packets and retransmissions
   [I-D.ietf-tcpm-generalized-ecn], which includes the ECN-capable
   SYN/ACK experiment [RFC5562]; and testing receiver non-compliance
   [I-D.moncaster-tcpm-rcv-cheat].

1.1.  Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like. Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification. Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not. Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2.  Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3.  Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested. The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol. The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if it is
   deployed and if it satisfies the requirements of [RFC7560] in the
   consensus opinion of the IETF tcpm working group. In short, this
   requires that it improves the accuracy and timeliness of TCP's ECN
   feedback, as claimed in Section 5, while striking a balance between
   the conflicting requirements of resilience, integrity and
   minimisation of overhead. It also requires that it is not unduly
   complex, and that it is compatible with prevalent equipment
   behaviours in the current Internet (e.g. hardware offloading and
   middleboxes), whether or not they comply with standards.

   Testing will mostly focus on fall-back strategies in case of
   middlebox interference. Current recommended strategies are specified
   in Sections 3.1.2, 3.2.3, 3.2.4 and 3.2.7. The effectiveness of
   these strategies depends on the actual deployment situation of
   middleboxes. Therefore experimental verification to confirm
   large-scale path traversal in the Internet is needed before
   finalizing this specification on the Standards Track.

   Another experimentation focus is the implementation feasibility of
   change-triggered ACKs as described in Section 3.2.8. While on
   average this should not lead to a higher ACK rate, it changes the ACK
   pattern, which can especially affect hardware offload. Further
   experimentation is needed to advise whether this should be a hard
   requirement or just preferred behaviour.

1.4.  Terminology

   AccECN: The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN: The ECN protocol specified in [RFC3168].

   Classic ECN feedback: The feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK: A TCP acknowledgement, with or without a data payload.

   Pure ACK: A TCP acknowledgement without a data payload.

   TCP client: The TCP stack that originates a connection.

   TCP server: The TCP stack that responds to a connection request.

   Data Receiver: The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender: The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

1.5.  Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is
   Not-ECT, only drop is appropriate. If the codepoint is ECT(0) or
   ECT(1), the node can mark the packet by setting both ECN bits, which
   is termed 'Congestion Experienced' (CE), or loosely a 'congestion
   mark'. Table 1 summarises these codepoints.

   +-----------------------+---------------+---------------------------+
   | IP-ECN codepoint      | Codepoint     | Description               |
   | (binary)              | name          |                           |
   +-----------------------+---------------+---------------------------+
   | 00                    | Not-ECT       | Not ECN-Capable Transport |
   | 01                    | ECT(1)        | ECN-Capable Transport (1) |
   | 10                    | ECT(0)        | ECN-Capable Transport (0) |
   | 11                    | CE            | Congestion Experienced    |
   +-----------------------+---------------+---------------------------+

                Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost. The TCP sender confirms
   that it has received at least one ECE signal by responding with the
   congestion window reduced (CWR) flag, which allows the TCP receiver
   to stop repeating the ECN-Echo flag. This always leads to a full RTT
   of ACKs with ECE set. Thus any additional CE markings arriving
   within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. In the absence of widespread
   deployment RFC 3540 has been reclassified as historic [RFC8311] and
   the respective flag has been marked as "reserved", making this TCP
   flag available for use by the AccECN experiment instead.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2.  AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP
   half-connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.
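
   As a non-normative illustration of the header fields recapped in
   Section 1.5, the following Python sketch extracts the IP-ECN
   codepoint (Table 1) and the three ECN-related TCP flags (Figure 1)
   from raw header bytes. The function and constant names are
   hypothetical and are not defined by this specification. The sketch
   assumes the caller supplies the IPv4 ToS (or IPv6 Traffic Class)
   octet and at least the first 14 bytes of the TCP header.

```python
# Non-normative sketch; all names here are illustrative assumptions.

# IP-ECN codepoint names, indexed by their 2-bit value (Table 1).
IP_ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def ip_ecn_codepoint(tos_octet):
    """Return the codepoint name held in the two least significant bits
    of the IPv4 ToS (or IPv6 Traffic Class) octet."""
    return IP_ECN_NAMES[tos_octet & 0b11]


def tcp_ecn_flags(tcp_header):
    """Return (NS, CWR, ECE) from a raw TCP header given as bytes.

    Bytes 13 and 14 of the header (counting from 1, as in the text)
    are at offsets 12 and 13 when counting from 0.
    """
    ns = tcp_header[12] & 0x01           # last bit of byte 13
    cwr = (tcp_header[13] >> 7) & 0x01   # first bit of byte 14
    ece = (tcp_header[13] >> 6) & 0x01   # second bit of byte 14
    return ns, cwr, ece
```

   For example, a header whose byte 13 is 0x51 and whose byte 14 is
   0xC2 (a SYN with CWR and ECE set) yields (1, 1, 1), which is the
   flag combination that Section 3.1.1 uses on the SYN to request
   AccECN.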

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks). This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two-part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN. This
   design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1.  Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2.  Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 2 later). The LSBs
   of each of the three byte counters are carried in the AccECN Option.

2.3.  Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. There is no need to acknowledge these continually repeated
   counters, so the congestion window reduced (CWR) mechanism is no
   longer used. Even if some ACKs are lost, the Data Sender should be
   able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle. However cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses. Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear [RFC8311]. Because the
   3-bit ACE field is so small, when it is the only field available the
   Data Sender has to interpret it conservatively assuming the worst
   possible wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4.  Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx. If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5.  Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.
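
   The counting behaviour described in Sections 2.2 to 2.5 can be
   illustrated with the following non-normative Python sketch. It shows
   a Data Receiver that counts CE marks on control packets as well as
   data packets, and a Data Sender helper that computes the smallest
   increment consistent with a change in the 3-bit ACE field. The class,
   method and attribute names are hypothetical, and the zero initial
   counter values are an assumption of the sketch only; the normative
   initial values are those given in Section 3.2.1.

```python
# Non-normative sketch; names and initial values are assumptions.

ACE_BITS = 3
ACE_MOD = 1 << ACE_BITS  # the ACE field wraps modulo 8


class ReceiverCounters:
    """The four counters a Data Receiver keeps for one half-connection."""

    def __init__(self, cep=0, ceb=0, e0b=0, e1b=0):
        self.cep = cep  # packets marked CE (control packets included)
        self.ceb = ceb  # payload bytes marked CE
        self.e0b = e0b  # payload bytes marked ECT(0)
        self.e1b = e1b  # payload bytes marked ECT(1)

    def on_packet(self, ip_ecn, payload_len):
        """Reflect whatever marking arrived, without judging validity."""
        if ip_ecn == "CE":
            self.cep += 1              # counted even if payload_len is 0
            self.ceb += payload_len
        elif ip_ecn == "ECT(0)":
            self.e0b += payload_len
        elif ip_ecn == "ECT(1)":
            self.e1b += payload_len
        # Not-ECT packets change none of the four counters.

    def ace_field(self):
        """Three least significant bits of the CE packet counter."""
        return self.cep % ACE_MOD


def ace_min_increment(prev_ace, new_ace):
    """Smallest number of newly CE-marked packets consistent with the
    ACE field moving from prev_ace to new_ace. The true number could be
    this value plus any multiple of 8 if ACKs were lost."""
    return (new_ace - prev_ace) % ACE_MOD
```

   The helper returns only the minimum increment. If nine CE-marked
   packets arrive between two ACKs that reach the Data Sender, the ACE
   field appears to advance by one, which is the ambiguity that the
   conservative interpretation in Section 2.3 is meant to guard
   against.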

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and
   [I-D.moncaster-tcpm-rcv-cheat]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on whether it is CE marked. Although RFC 3168
   prohibits an ECN-capable SYN, providing feedback of CE marking on the
   SYN supports future scenarios in which SYNs might be ECN-enabled
   (without prejudging whether they ought to be). For instance,
   [RFC8311] updates this aspect of RFC 3168 to allow experimentation
   with ECN-capable TCP control packets.

   Even if the TCP client (or server) has set the SYN (or SYN/ACK) to
   Not-ECT in compliance with RFC 3168, feedback on the state of the ECN
   field when it arrives at the receiver could still be useful, because
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field [Mandalari18]. If
   a TCP client has set the SYN to Not-ECT, but receives CE feedback, it
   can detect such middlebox interference and send Not-ECT for the rest
   of the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]). Today,
   if a TCP server receives ECT or CE on a SYN, it cannot know whether
   it is invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT). Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection. Instead, the AccECN protocol allows the server
   to feed back the received ECN field to the client, which then has all
   the information to decide whether the connection has to fall-back
   from supporting ECN (or not).

3.  AccECN Protocol Specification

3.1.  Negotiating to use AccECN

3.1.1.  Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] has been reclassified as historic
   [RFC8311], the present specification renames the TCP flag at bit 7 of
   the TCP header flags from NS (Nonce Sum) to AE (Accurate ECN) (see
   IANA Considerations in Section 6).

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the TCP flags on the SYN/ACK to one of
   the 4 values shown in the top block of Table 2 to confirm that it
   supports AccECN. The TCP server MUST NOT set one of these 4
   combinations of flags on the SYN/ACK unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST set the AE, CWR and ECE TCP flags on
   the SYN/ACK to the value in Table 2 that feeds back the IP-ECN field
   that arrived on the SYN. This applies whether or not the server
   itself supports setting the IP-ECN field on a SYN or SYN/ACK (see
   Section 2.5 for rationale).

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.2.
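
   The handshake rules above, together with the flag values listed in
   Table 2, can be sketched non-normatively in Python as follows. The
   function and constant names are hypothetical. The treatment of SYN
   flag combinations that Table 2 does not list is an assumption of the
   sketch, not a statement of this specification.

```python
# Non-normative sketch; names are illustrative assumptions.
# Flag tuples are ordered (AE, CWR, ECE), as in Table 2.

ACCECN_SYN = (1, 1, 1)

# SYN/ACK flags that feed back the IP-ECN field seen on the SYN
# (top block of Table 2).
SYNACK_FOR_SYN_ECN = {
    "Not-ECT": (0, 1, 0),
    "ECT(1)": (0, 1, 1),
    "ECT(0)": (1, 0, 0),
    "CE": (1, 1, 0),
}


def server_synack_flags(syn_flags, syn_ip_ecn):
    """Flags an AccECN-enabled server sets on the SYN/ACK."""
    if syn_flags == ACCECN_SYN:
        return SYNACK_FOR_SYN_ECN[syn_ip_ecn]
    if syn_flags == (0, 1, 1):
        return (0, 0, 1)   # classic ECN (or ECN Nonce) client
    # (0, 0, 0) is a Not-ECN client. Any other combination is not
    # listed in Table 2; treating it as Not ECN is an assumption here.
    return (0, 0, 0)


def client_feedback_mode(synack_flags):
    """Mode entered by a client that sent AE=1, CWR=1, ECE=1 on its SYN.

    Returns (mode, ip_ecn_seen_on_syn); the second item is None unless
    the mode is AccECN.
    """
    for ip_ecn, flags in SYNACK_FOR_SYN_ECN.items():
        if flags == synack_flags:
            return "AccECN", ip_ecn
    if synack_flags in ((1, 0, 1), (0, 0, 1)):
        return "classic ECN", None   # Nonce or classic ECN server
    # (0, 0, 0) is a Not-ECN server; (1, 1, 1) is the 'Broken' row.
    return "Not ECN", None
```

   For example, if the SYN arrives at the server CE-marked, the server
   replies with AE=1, CWR=1, ECE=0, and the client then enters AccECN
   mode knowing that its SYN was CE-marked on the path.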
566 The three flags set to 1 to indicate AccECN support on the SYN have 567 been carefully chosen to enable natural fall-back to prior stages in 568 the evolution of ECN. Table 2 tabulates all the negotiation 569 possibilities for ECN-related capabilities that involve at least one 570 AccECN-capable host. The entries in the first two columns have been 571 abbreviated, as follows: 573 AccECN: More Accurate ECN Feedback (the present specification) 575 Nonce: ECN Nonce feedback [RFC3540] 577 ECN: 'Classic' ECN feedback [RFC3168] 579 No ECN: Not-ECN-capable. Implicit congestion notification using 580 packet drop.
582 +--------+--------+------------+-------------+----------------------+
583 | A      | B      |  SYN A->B  |   SYN/ACK   | Feedback Mode        |
584 |        |        |            |     B->A    |                      |
585 +--------+--------+------------+-------------+----------------------+
586 |        |        | AE CWR ECE |  AE CWR ECE |                      |
587 | AccECN | AccECN | 1   1   1  |  0   1   0  | AccECN (Not-ECT on   |
588 |        |        |            |             | SYN)                 |
589 | AccECN | AccECN | 1   1   1  |  0   1   1  | AccECN (ECT1 on SYN) |
590 | AccECN | AccECN | 1   1   1  |  1   0   0  | AccECN (ECT0 on SYN) |
591 | AccECN | AccECN | 1   1   1  |  1   1   0  | AccECN (CE on SYN)   |
592 |        |        |            |             |                      |
593 | AccECN | Nonce  | 1   1   1  |  1   0   1  | classic ECN          |
594 | AccECN | ECN    | 1   1   1  |  0   0   1  | classic ECN          |
595 | AccECN | No ECN | 1   1   1  |  0   0   0  | Not ECN              |
596 |        |        |            |             |                      |
597 | Nonce  | AccECN | 0   1   1  |  0   0   1  | classic ECN          |
598 | ECN    | AccECN | 0   1   1  |  0   0   1  | classic ECN          |
599 | No ECN | AccECN | 0   0   0  |  0   0   0  | Not ECN              |
600 |        |        |            |             |                      |
601 | AccECN | Broken | 1   1   1  |  1   1   1  | Not ECN              |
602 +--------+--------+------------+-------------+----------------------+
604 Table 2: ECN capability negotiation between Client (A) and Server (B) 606 Table 2 is divided into blocks each separated by an empty row. 608 1. The top block shows the case already described where both 609 endpoints support AccECN and how the TCP server (B) indicates 610 congestion feedback. 612 2.
The second block shows the cases where the TCP client (A) 613 supports AccECN but the TCP server (B) supports some earlier 614 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 615 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 616 shown, it MUST set both its half connections into the feedback 617 mode shown in the rightmost column. 619 3. The third block shows the cases where the TCP server (B) supports 620 AccECN but the TCP client (A) supports some earlier variant of 621 TCP feedback, indicated in its SYN. Therefore, as soon as an 622 AccECN-enabled TCP server (B) receives the SYN shown, it MUST set 623 both its half connections into the feedback mode shown in the 624 rightmost column. 626 4. The fourth block displays a combination labelled `Broken'. Some 627 older TCP server implementations incorrectly set the reserved 628 flags in the SYN/ACK by reflecting those in the SYN. Such broken 629 TCP servers (B) cannot support ECN, so as soon as an AccECN- 630 capable TCP client (A) receives such a broken SYN/ACK it MUST 631 fall back to Not ECN mode for both its half connections. 633 The following exceptional cases need some explanation: 635 ECN Nonce: With an AccECN implementation, there is no need for the ECN 636 Nonce feedback mode [RFC3540], which has also been reclassified as 637 historic [RFC8311], as AccECN is compatible with an alternative 638 ECN feedback integrity approach that does not use up the ECT(1) 639 codepoint and can be implemented solely at the sender (see 640 Section 4.3). 642 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 643 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 644 Host A MUST then enter the same feedback mode as it would have 645 entered had it been a responding host and received the same SYN. 646 Then host A MUST send the same SYN/ACK as it would have sent had 647 it been a responding host. 649 3.1.2.
Retransmission of the SYN 651 If the sender of an AccECN SYN times out before receiving the SYN/ 652 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 653 least one more time by continuing to set all three TCP ECN flags on 654 the first retransmitted SYN (using the usual retransmission time- 655 outs). If this first retransmission also fails to be acknowledged, 656 the sender SHOULD send subsequent retransmissions of the SYN without 657 any TCP-ECN flags set. This adds delay, in the case where a 658 middlebox drops an AccECN (or ECN) SYN deliberately. However, 659 current measurements imply that a drop is less likely to be due to 660 middlebox interference than other intermittent causes of loss, e.g. 661 congestion, wireless interference, etc. 663 Implementers MAY use other fall-back strategies if they are found to 664 be more effective (e.g. attempting to negotiate AccECN on the SYN 665 only once or more than twice (most appropriate during high levels of 666 congestion); or falling back to classic ECN feedback rather than non- 667 ECN). Further it may make sense to also remove any other 668 experimental fields or options on the SYN in case a middlebox might 669 be blocking them, although the required behaviour will depend on the 670 specification of the other option(s) and any attempt to co-ordinate 671 fall-back between different modules of the stack. In any case, the 672 TCP initiator SHOULD cache failed connection attempts. If it does, 673 it SHOULD NOT give up attempting to negotiate AccECN on the SYN of 674 subsequent connection attempts until it is clear that the blockage is 675 persistently and specifically due to AccECN. The cache should be 676 arranged to expire so that the initiator will infrequently attempt to 677 check whether the problem has been resolved. 679 The fall-back procedure if the TCP server receives no ACK to 680 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 681 Section 3.2.7. 683 3.2. 
AccECN Feedback 685 Each Data Receiver of each half connection maintains four counters, 686 r.cep, r.ceb, r.e0b and r.e1b. The CE packet counter (r.cep) counts 687 the number of packets the host receives with the CE codepoint in the 688 IP ECN field, including CE marks on control packets without data. 689 r.ceb, r.e0b and r.e1b count the number of TCP payload bytes in 690 packets marked respectively with the CE, ECT(0) and ECT(1) codepoint 691 in their IP-ECN field. When a host first enters AccECN mode, it 692 initializes its counters to r.cep = 5, r.e0b = 1 and r.ceb = r.e1b = 693 0 (see Appendix A.5). Non-zero initial values are used to support a 694 stateless handshake (see Section 4.1) and to be distinct from cases 695 where the fields are incorrectly zeroed (e.g. by middleboxes - see 696 Section 3.2.7.4). 698 A host feeds back the CE packet counter using the Accurate ECN (ACE) 699 field, as explained in Section 3.2.2. And it feeds back all the 700 byte counters using the AccECN TCP Option, as specified in 701 Section 3.2.6. Whenever a host feeds back the value of any counter, 702 it MUST report the most recent value, no matter whether it is in a 703 pure ACK, an ACK with new payload data or a retransmission. 704 Therefore the feedback carried on a retransmitted packet is unlikely 705 to be the same as the feedback on the original packet. 707 3.2.1. Initialization of Feedback Counters at the Data Sender 709 Each Data Sender of each half connection maintains four counters, 710 s.cep, s.ceb, s.e0b and s.e1b intended to track the equivalent 711 counters at the Data Receiver. When a host enters AccECN mode, it 712 initializes them to s.cep = 5, s.e0b = 1 and s.ceb = s.e1b = 0. 714 If a TCP client (A) in AccECN mode receives a SYN/ACK with CE 715 feedback, i.e. AE=1, CWR=1, ECE=0, it increments s.cep to 6. 716 Otherwise, for any of the 3 other combinations of the 3 ECN TCP flags 717 (the top 3 rows in Table 2), s.cep remains initialized to 5. 719 3.2.2.
The ACE Field 721 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 722 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 723 as one 3-bit field. Then the field is given a new name, ACE, as 724 shown in Figure 2.
726   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
727 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
728 |               |           |           | U | A | P | R | S | F |
729 | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
730 |               |           |           | G | K | H | T | N | N |
731 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
733 Figure 2: Definition of the ACE field within bytes 13 and 14 of the 734 TCP Header (when AccECN has been negotiated and SYN=0). 736 The original definition of these three flags in the TCP header, 737 including the addition of support for the ECN Nonce, is shown for 738 comparison in Figure 1. This specification does not rename these 739 three TCP flags to ACE unconditionally; it merely overloads them with 740 another name and definition once an AccECN connection has been 741 established. 743 A host MUST interpret the AE, CWR and ECE flags as the 3-bit ACE 744 counter on a segment with the SYN flag cleared (SYN=0) that it sends 745 or receives if both of its half-connections are set into AccECN mode 746 having successfully negotiated AccECN (see Section 3.1). A host MUST 747 NOT interpret the 3 flags as a 3-bit ACE field on any segment with 748 SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete 749 or has not succeeded. 751 Both parts of each of these conditions are equally important. For 752 instance, even if AccECN negotiation has been successful, the ACE 753 field is not defined on any segments with SYN=1 (e.g. a 754 retransmission of an unacknowledged SYN/ACK, or when both ends send 755 SYN/ACKs after AccECN support has been successfully negotiated during 756 a simultaneous open).
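The two conditions above can be sketched as a small helper that extracts the ACE field from the 16-bit word holding bytes 13 and 14 of the TCP header (Figure 2). The function name and the idea of returning None when the field is undefined are assumptions made for illustration only.

```python
SYN_FLAG = 0x002   # bit 14 of the 16-bit word in Figure 2
ACE_SHIFT = 6      # ACE occupies bits 7-9, above URG/ACK/PSH/RST/SYN/FIN
ACE_MASK = 0b111


def ace_field(offset_flags_word, accecn_negotiated):
    """Return the 3-bit ACE value, or None if ACE is not defined here.

    offset_flags_word: bytes 13 and 14 of the TCP header read as one
    big-endian 16-bit integer (hypothetical calling convention).
    """
    if not accecn_negotiated:
        return None            # negotiation incomplete or unsuccessful
    if offset_flags_word & SYN_FLAG:
        return None            # never interpret the flags as ACE when SYN=1
    return (offset_flags_word >> ACE_SHIFT) & ACE_MASK
```

For example, a pure ACK with a header length of 5 words and AE=1, CWR=0, ECE=0 would be the word `0x5110`, for which the helper yields 4 once AccECN has been negotiated.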
758 With only one exception, on any packet with the SYN flag cleared 759 (SYN=0), the Data Receiver MUST encode the three least significant 760 bits of its r.cep counter into the ACE field it feeds back to the 761 Data Sender. 763 There is only one exception to this rule: On the final ACK of the 764 3WHS, a TCP client (A) in AccECN mode MUST use the ACE field to feed 765 back which of the 4 possible values of the IP-ECN field was on the 766 SYN/ACK (the binary encoding is the same as that used on the SYN/ 767 ACK). Table 3 shows the meaning of each possible value of the ACE 768 field on the ACK of the SYN/ACK and the value that an AccECN server 769 MUST set s.cep to as a result. The encoding in Table 3 is solely 770 applicable on a packet in the client-server direction with an 771 acknowledgement number 1 greater than the Initial Sequence Number 772 (ISN) that was used by the server.
774 +--------------+---------------------------+------------------------+
775 | ACE on ACK   | IP-ECN codepoint on       | Initial s.cep of       |
776 | of SYN/ACK   | SYN/ACK inferred by       | server in AccECN mode  |
777 |              | server                    |                        |
778 +--------------+---------------------------+------------------------+
779 | 0b000        | {Notes 1, 2}              | Disable ECN            |
780 | 0b001        | {Notes 2, 3}              | 5                      |
781 | 0b010        | Not-ECT                   | 5                      |
782 | 0b011        | ECT(1)                    | 5                      |
783 | 0b100        | ECT(0)                    | 5                      |
784 | 0b101        | Currently Unused {Note 3} | 5                      |
785 | 0b110        | CE                        | 6                      |
786 | 0b111        | Currently Unused {Note 3} | 5                      |
787 +--------------+---------------------------+------------------------+
789 Table 3: Meaning of the ACE field on the ACK of the SYN/ACK 791 {Note 1}: If the server is in AccECN mode, the value of zero raises 792 suspicion of zeroing of the ACE field on the path (see 793 Section 3.2.3). 795 {Note 2}: If a server is in AccECN mode, there ought to be no valid 796 case where the ACE field on the last ACK of the 3WHS has a value of 797 0b000 or 0b001.
799 However, in the case where a server that implements AccECN is also 800 using a stateless handshake (termed a SYN cookie) it will not 801 remember whether it entered AccECN mode. Then these two values 802 remind it that it did not enter AccECN mode (see Section 4.1 for 803 details). 805 {Note 3}: If the server is in AccECN mode, these values are Currently 806 Unused but the AccECN server's behaviour is still defined for forward 807 compatibility. 809 3.2.3. Testing for Zeroing of the ACE Field 811 Section 3.2 required the Data Receiver to initialize the r.cep 812 counter to a non-zero value. Therefore, in either direction the 813 initial value of the ACE field ought to be non-zero. 815 If AccECN has been successfully negotiated, the Data Sender SHOULD 816 check the initial value of the ACE field in the first arriving 817 segment with SYN=0. If the initial value of the ACE field is zero 818 (0b000), the Data Sender MUST disable sending ECN-capable packets for 819 the remainder of the half-connection by setting the IP/ECN field in 820 all subsequent packets to Not-ECT. 822 For example, the server checks the ACK of the SYN/ACK or the first 823 data segment from the client, while the client checks the first data 824 segment from the server. More precisely, the "first segment with 825 SYN=0" is defined as: the segment with SYN=0 that i) acknowledges 826 sequence space at least covering the initial sequence number (ISN) 827 plus 1; and ii) arrives before any other segments with SYN=0 so it is 828 unlikely to be a retransmission. If no such segment arrives (e.g. 829 because it is lost and the ISN is first acknowledged by a subsequent 830 segment), no test for invalid initialization can be conducted, and 831 the half-connection will continue in AccECN mode.
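A server in AccECN mode might combine Table 3 with the zeroing test above as in the following sketch. The function name and the use of None to signal "disable ECN" are hypothetical; the stateless (SYN cookie) case of Note 2 is not modelled.

```python
def server_initial_cep(ace_on_final_ack):
    """Map the ACE field on the ACK of the SYN/ACK to the initial s.cep.

    Returns None when ECN-capable sending has to be disabled (ACE zeroed),
    otherwise the initial s.cep from Table 3. Illustrative only.
    """
    if not 0 <= ace_on_final_ack <= 0b111:
        raise ValueError("ACE is a 3-bit field")
    if ace_on_final_ack == 0b000:
        return None   # suspected zeroing on the path: send Not-ECT from now on
    if ace_on_final_ack == 0b110:
        return 6      # the SYN/ACK arrived CE-marked
    return 5          # all other values, including the Currently Unused ones
```

Note that the sketch only distinguishes zero, CE and everything else; it does not check for any particular non-zero initial value.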
833 Note that the Data Sender MUST NOT test whether the arriving counter 834 in the initial ACE field has been initialized to a specific valid 835 value - the above check solely tests whether the ACE fields have been 836 incorrectly zeroed. This allows hosts to use different initial 837 values as an additional signalling channel in future. 839 3.2.4. Testing for Mangling of the IP/ECN Field 841 The value of the ACE field on the SYN/ACK indicates the value of the 842 IP/ECN field when the SYN arrived at the server. The client can 843 compare this with how it originally set the IP/ECN field on the SYN. 844 If this comparison implies an unsafe transition of the IP/ECN field, 845 for the remainder of the connection the client MUST NOT send ECN- 846 capable packets, but it MUST continue to feed back any ECN markings 847 on arriving packets. 849 The value of the ACE field on the last ACK of the 3WHS indicates the 850 value of the IP/ECN field when the SYN/ACK arrived at the client. 851 The server can compare this with how it originally set the IP/ECN 852 field on the SYN/ACK. If this comparison implies an unsafe 853 transition of the IP/ECN field, for the remainder of the connection 854 the server MUST NOT send ECN-capable packets, but it MUST continue to 855 feed back any ECN markings on arriving packets. 857 The ACK of the SYN/ACK is not reliably delivered (nonetheless, the 858 count of CE marks is still eventually delivered reliably). If this 859 ACK does not arrive, the server has to continue to send ECN-capable 860 packets without having tested for mangling of the IP/ECN field on the 861 SYN/ACK. Experiments with AccECN deployment will assess whether this 862 limitation has any effect in practice. 864 Invalid transitions of the IP/ECN field are defined in [RFC3168] and 865 repeated here for convenience: 867 o the not-ECT codepoint changes; 869 o either ECT codepoint transitions to not-ECT; 871 o the CE codepoint changes.
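The three invalid transitions listed above can be expressed as a small predicate. This is a sketch with hypothetical names; the 2-bit codepoint values are those of RFC 3168, and `sent` is the IP/ECN field the host originally set while `received` is the value fed back by the peer.

```python
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def is_invalid_transition(sent, received):
    """Return True if sent -> received is one of the listed invalid transitions."""
    if sent == NOT_ECT:
        return received != NOT_ECT   # the not-ECT codepoint changes
    if sent == CE:
        return received != CE        # the CE codepoint changes
    return received == NOT_ECT       # either ECT codepoint becomes not-ECT
```

Under this predicate a transition from ECT(0) to CE is not flagged, because it is the normal result of congestion marking.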
873 RFC 3168 says that a router that changes ECT to not-ECT is invalid 874 but safe. However, from a host's viewpoint, this transition is 875 unsafe because it could be the result of two transitions at different 876 routers on the path: ECT to CE (safe) then CE to not-ECT (unsafe). 877 This scenario could well happen where an ECN-enabled home router 878 congests its upstream mobile broadband bottleneck link, then the 879 ingress to the mobile network clears the ECN field [Mandalari18]. 881 The above fall-back behaviours are necessary in case mangling of the 882 IP/ECN field is asymmetric, which is currently common over some 883 mobile networks [Mandalari18]. Then one end might see no unsafe 884 transition and continue sending ECN-capable packets, while the other 885 end sees an unsafe transition and stops sending ECN-capable packets. 887 3.2.5. Safety against Ambiguity of the ACE Field 889 If too many CE-marked segments are acknowledged at once, or if a long 890 run of ACKs is lost, the 3-bit counter in the ACE field might have 891 cycled between two ACKs arriving at the Data Sender. 893 Therefore an AccECN Data Receiver SHOULD immediately send an ACK once 894 'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be 895 2 and MUST be no greater than 6. 897 If the Data Sender has not received AccECN TCP Options to give it 898 more dependable information, and it detects that the ACE field could 899 have cycled under the prevailing conditions, it SHOULD conservatively 900 assume that the counter did cycle. It can detect if the counter 901 could have cycled by using the jump in the acknowledgement number 902 since the last ACK to calculate or estimate how many segments could 903 have been acknowledged. An example algorithm to implement this 904 policy is given in Appendix A.2. An implementer MAY develop an 905 alternative algorithm as long as it satisfies these requirements. 
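One way to apply the conservative assumption described in Section 3.2.5 is sketched below. It is an illustration under stated assumptions rather than the algorithm of Appendix A.2: the names are hypothetical, and `newly_acked_pkts` stands for the Data Sender's own estimate of how many segments the arriving ACK newly acknowledges.

```python
DIVACE = 8  # the 3-bit ACE field wraps modulo 8


def ace_delta(ace, s_cep):
    """Smallest non-negative increase in CE packets consistent with ACE."""
    return (ace - s_cep) % DIVACE


def safer_delta(ace, s_cep, newly_acked_pkts):
    """Conservative increase in CE packets when ACE might have cycled.

    If enough segments were acknowledged for the counter to have wrapped,
    assume the largest increase that matches ACE modulo 8 without exceeding
    the number of newly acknowledged segments.
    """
    d = ace_delta(ace, s_cep)
    if newly_acked_pkts < d + DIVACE:
        return d  # too few segments acknowledged for a wrap to be possible
    return newly_acked_pkts - ((newly_acked_pkts - d) % DIVACE)
```

With s.cep = 5 and an arriving ACE of 6, an ACK covering 2 segments yields an increase of 1, while an ACK covering 12 segments is conservatively treated as an increase of 9.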
907 If missing acknowledgement numbers arrive later (reordering) and 908 prove that the counter did not cycle, the Data Sender MAY attempt to 909 neutralise the effect of any action it took based on a conservative 910 assumption that it later found to be incorrect. 912 3.2.6. The AccECN Option 914 The AccECN Option is defined as shown below in Figure 3. It consists 915 of three 24-bit fields that provide the 24 least significant bits of 916 the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E' 917 of each field name stands for 'Echo'.
919  0                   1                   2                   3
920  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
921 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
922 |  Kind = TBD1  |  Length = 11  |          EE0B field           |
923 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
924 | EE0B (cont'd) |                  ECEB field                   |
925 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
926 |                  EE1B field                   |
927 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
929 Figure 3: The AccECN Option 931 The Data Receiver MUST set the Kind field to TBD1, which is 932 registered in Section 6 as a new TCP option Kind called AccECN. An 933 experimental TCP option with Kind=254 MAY be used for initial 934 experiments, with magic number 0xACCE. 936 Appendix A.1 gives an example algorithm for the Data Receiver to 937 encode its byte counters into the AccECN Option, and for the Data 938 Sender to decode the AccECN Option fields into its byte counters. 940 Note that there is no field to feed back Not-ECT bytes. Nonetheless 941 an algorithm for the Data Sender to calculate the number of payload 942 bytes received as Not-ECT is given in Appendix A.5. 944 Whenever a Data Receiver sends an AccECN Option, the rules in 945 Section 3.2.8 expect it to always send a full-length option.
To cope 946 with option space limitations, it can omit unchanged fields from the 947 tail of the option, as long as it preserves the order of the 948 remaining fields and includes any field that has changed. The length 949 field MUST indicate which fields are present as follows: 951 Length=11: EE0B, ECEB, EE1B 952 Length=8: EE0B, ECEB 954 Length=5: EE0B 956 Length=2: (empty) 958 The empty option of Length=2 is provided to allow for a case where an 959 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 960 but there is very limited space for the option. For initial 961 experiments, the Length field MUST be 2 greater to accommodate the 962 16-bit magic number. 964 All implementations of a Data Sender MUST be able to read in AccECN 965 Options of any of the above lengths. If the AccECN Option is of any 966 other length, implementations MUST use those whole 3 octet fields 967 that fit within the length and ignore the remainder of the option. 969 The use of the AccECN option is optional for the Data Receiver. If 970 the Data Receiver intends to use the AccECN option at any time during 971 the rest of the connection it is strongly recommended to also test its 972 path traversal by including it in the SYN/ACK as specified in the 973 next section. By default the use of the AccECN option is 974 RECOMMENDED. 976 3.2.7. Path Traversal of the AccECN Option 978 3.2.7.1. Testing the AccECN Option during the Handshake 980 The TCP client MUST NOT include the AccECN TCP Option on the SYN. 981 Nonetheless, if the AccECN negotiation using the ECN flags in the 982 main TCP header (Section 3.1) is successful, it implicitly declares 983 that the endpoints also support the AccECN TCP Option. A fall-back 984 strategy for the loss of the SYN (possibly due to middlebox 985 interference) is specified in Section 3.1.2.
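The length rules of Section 3.2.6 lend themselves to a short decoder, sketched here under a few assumptions: the function and field names are hypothetical, the Kind octet is not checked because its value (TBD1) is not yet assigned, at most the three defined fields are decoded, and the experimental variant with the 16-bit magic number is not handled.

```python
FIELD_NAMES = ("ee0b", "eceb", "ee1b")  # field order within the option


def parse_accecn_option(opt):
    """Decode the 24-bit fields present in a raw AccECN Option.

    opt: the option as bytes, starting at the Kind octet. Returns a dict
    holding only the fields that fit within the Length octet; any trailing
    octets that do not form a whole 3-octet field are ignored.
    """
    if len(opt) < 2:
        raise ValueError("option shorter than Kind and Length octets")
    length = opt[1]
    if length < 2 or length > len(opt):
        raise ValueError("Length octet inconsistent with option bytes")
    payload = opt[2:length]
    count = min(len(payload) // 3, len(FIELD_NAMES))
    return {
        FIELD_NAMES[i]: int.from_bytes(payload[3 * i:3 * i + 3], "big")
        for i in range(count)
    }
```

An option of Length=8 therefore yields EE0B and ECEB only, and an option of an unlisted length such as 7 yields just EE0B, with the two leftover octets ignored.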
987 A TCP server that confirms its support for AccECN (in response to an 988 AccECN SYN from the client as described in Section 3.1) SHOULD 989 include an AccECN TCP Option in the SYN/ACK. 991 A TCP client that has successfully negotiated AccECN SHOULD include 992 an AccECN Option in the first ACK at the end of the 3WHS. However, 993 this first ACK is not delivered reliably, so the TCP client SHOULD 994 also include an AccECN Option on the first data segment it sends (if 995 it ever sends one). 997 A host MAY omit the AccECN Option in any of these three cases 998 if it has cached knowledge that the packet would be likely to be 999 blocked on the path to the other host if it included an AccECN 1000 Option. 1002 3.2.7.2. Testing for Loss of Packets Carrying the AccECN Option 1004 If after the normal TCP timeout the TCP server has not received an 1005 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 1006 lost, e.g. due to congestion, or a middlebox might be blocking the 1007 AccECN Option. To expedite connection setup, the TCP server SHOULD 1008 retransmit the SYN/ACK with the same TCP flags (AE, CWR and ECE) but 1009 with no AccECN Option. If this retransmission times out, to expedite 1010 connection setup, the TCP server SHOULD disable AccECN and ECN for 1011 this connection by retransmitting the SYN/ACK with AE=CWR=ECE=0 and 1012 no AccECN Option. Implementers MAY use other fall-back strategies if 1013 they are found to be more effective (e.g. falling back to classic 1014 ECN feedback on the first retransmission; retrying the AccECN Option 1015 for a second time before fall-back (most appropriate during high 1016 levels of congestion); or falling back to classic ECN feedback rather 1017 than non-ECN on the third retransmission). 1019 If the TCP client detects that the first data segment it sent with 1020 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 1021 on the retransmission.
Again, implementers MAY use other fall-back 1022 strategies such as attempting to retransmit a second segment with the 1023 AccECN Option before fall-back, and/or caching whether the AccECN 1024 Option is blocked for subsequent connections. 1026 Either host MAY include the AccECN Option in a subsequent segment to 1027 retest whether the AccECN Option can traverse the path. 1029 If the TCP server receives a second SYN with a request for AccECN 1030 support, it should resend the SYN/ACK, again confirming its support 1031 for AccECN, but this time without the AccECN Option. This approach 1032 rules out any interference by middleboxes that may drop packets with 1033 unknown options, even though it is more likely that the SYN/ACK would 1034 have been lost due to congestion. The TCP server MAY try to send 1035 another packet with the AccECN Option at a later point during the 1036 connection but should monitor if that packet got lost as well, in 1037 which case it SHOULD disable the sending of the AccECN Option for 1038 this half-connection. 1040 Similarly, an AccECN end-point MAY separately memorize which data 1041 packets carried an AccECN Option and disable the sending of AccECN 1042 Options if the loss probability of those packets is significantly 1043 higher than that of all other data packets in the same connection. 1045 3.2.7.3. Testing for Stripping of the AccECN Option 1047 If the TCP client has successfully negotiated AccECN but does not 1048 receive an AccECN Option on the SYN/ACK, it switches into a mode that 1049 assumes that the AccECN Option is not available for this half 1050 connection. 1052 Similarly, if the TCP server has successfully negotiated AccECN but 1053 does not receive an AccECN Option on the first segment that 1054 acknowledges sequence space at least covering the ISN, it switches 1055 into a mode that assumes that the AccECN Option is not available for 1056 this half connection. 
1058 While a host is in this mode that assumes incoming AccECN Options are 1059 not available, it MUST adopt the conservative interpretation of the 1060 ACE field discussed in Section 3.2.5. However, it cannot make any 1061 assumption about support of outgoing AccECN Options on the other half 1062 connection, so it SHOULD continue to send the AccECN Option itself 1063 (unless it has established that sending the AccECN Option is causing 1064 packets to be blocked as in Section 3.2.7.2). 1066 If a host is in the mode that assumes incoming AccECN Options are not 1067 available, but it receives an AccECN Option at any later point during 1068 the connection, this clearly indicates that the AccECN Option is not 1069 blocked on the respective path, and the AccECN endpoint MAY switch 1070 out of the mode that assumes the AccECN Option is not available for 1071 this half connection. 1073 3.2.7.4. Test for Zeroing of the AccECN Option 1075 For a related test for invalid initialization of the ACE field, see 1076 Section 3.2.3 1078 Section 3.2 required the Data Receiver to initialize the r.e0b 1079 counter to a non-zero value. Therefore, in either direction the 1080 initial value of the EE0B field in the AccECN Option (if one exists) 1081 ought to be non-zero. If AccECN has been negotiated: 1083 o the TCP server MAY check the initial value of the EE0B field in 1084 the first segment that acknowledges sequence space that at least 1085 covers the ISN plus 1. If the initial value of the EE0B field is 1086 zero, the server will switch into a mode that ignores the AccECN 1087 Option for this half connection. 1089 o the TCP client MAY check the initial value of the EE0B field on 1090 the SYN/ACK. If the initial value of the EE0B field is zero, the 1091 client will switch into a mode that ignores the AccECN Option for 1092 this half connection. 
1094 While a host is in the mode that ignores the AccECN Option it MUST 1095 adopt the conservative interpretation of the ACE field discussed in 1096 Section 3.2.5. 1098 Note that the Data Sender MUST NOT test whether the arriving byte 1099 counters in the initial AccECN Option have been initialized to 1100 specific valid values - the above checks solely test whether these 1101 fields have been incorrectly zeroed. This allows hosts to use 1102 different initial values as an additional signalling channel in 1103 future. Also note that the initial value of either field might be 1104 greater than its expected initial value, because the counters might 1105 already have been incremented. Nonetheless, the initial values of 1106 the counters have been chosen so that they cannot wrap to zero on 1107 these initial segments. 1109 3.2.7.5. Consistency between AccECN Feedback Fields 1111 When the AccECN Option is available it supplements but does not 1112 replace the ACE field. An endpoint using AccECN feedback MUST always 1113 consider the information provided in the ACE field whether or not the 1114 AccECN Option is also available. 1116 If the AccECN option is present, the s.cep counter might increase 1117 while the s.ceb counter does not (e.g. due to a CE-marked control 1118 packet). The sender's response to such a situation is out of scope, 1119 and needs to be dealt with in a specification that uses ECN-capable 1120 control packets. Theoretically, this situation could also occur if a 1121 middlebox mangled the AccECN Option but not the ACE field. However, 1122 the Data Sender has to assume that the integrity of the AccECN Option 1123 is sound, based on the above test of the well-known initial values 1124 and optionally other integrity tests (Section 4.3). 
1126 If either end-point detects that the s.ceb counter has increased but 1127 the s.cep has not (and by testing ACK coverage it is certain how much 1128 the ACE field has wrapped), this invalid protocol transition has to 1129 be due to some form of feedback mangling. So, the Data Sender MUST 1130 disable sending ECN-capable packets for the remainder of the half- 1131 connection by setting the IP/ECN field in all subsequent packets to 1132 Not-ECT. 1134 3.2.8. Usage of the AccECN TCP Option 1136 The following rules determine when a Data Receiver in AccECN mode 1137 sends the AccECN TCP Option, and which fields to include: 1139 Change-Triggered ACKs: If an arriving packet increments a different 1140 byte counter to that incremented by the previous packet, the Data 1141 Receiver MUST immediately send an ACK with an AccECN Option, 1142 without waiting for the next delayed ACK (this is in addition to 1143 the safety recommendation in Section 3.2.5 against ambiguity of 1144 the ACE field). 1146 This is stated as a "MUST" so that the data sender can rely on 1147 change-triggered ACKs to detect transitions right from the very 1148 start of a flow, without first having to detect whether the 1149 receiver complies. A concern has been raised that certain offload 1150 hardware needed for high performance might not be able to support 1151 change-triggered ACKs, although high performance protocols such as 1152 DCTCP successfully use change-triggered ACKs. One possible 1153 experimental compromise would be for the receiver to heuristically 1154 detect whether the sender is in slow-start, then to implement 1155 change-triggered ACKs in software while the sender is in slow- 1156 start, and offload to hardware otherwise. 
If the operator 1157 disables change-triggered ACKs, whether partially like this or 1158 otherwise, the operator will also be responsible for ensuring a 1159 co-ordinated sender algorithm is deployed; 1161 Continual Repetition: Otherwise, if arriving packets continue to 1162 increment the same byte counter, the Data Receiver can include an 1163 AccECN Option on most or all (delayed) ACKs, but it does not have 1164 to. If option space is limited on a particular ACK, the Data 1165 Receiver MUST give precedence to SACK information about loss. It 1166 SHOULD include an AccECN Option if the r.ceb counter has 1167 incremented and it MAY include an AccECN Option if r.e0b or 1168 r.e1b has incremented; 1170 Full-Length Options Preferred: It SHOULD always use full-length 1171 AccECN Options. It MAY use shorter AccECN Options if space is 1172 limited, but it MUST include the counter(s) that have incremented 1173 since the previous AccECN Option and it MUST only truncate fields 1174 from the right-hand tail of the option to preserve the order of 1175 the remaining fields (see Section 3.2.6); 1177 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1178 length AccECN TCP Option on at least three ACKs per RTT, or on all 1179 ACKs if there are fewer than three per RTT (see Appendix A.4 for an 1180 example algorithm that satisfies this requirement). 1182 The following example series of arriving IP/ECN fields illustrates 1183 when a Data Receiver will emit an ACK if it is using a delayed ACK 1184 factor of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> 1185 ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK. 1187 For the avoidance of doubt, the change-triggered ACK mechanism is 1188 deliberately worded to ignore the arrival of a control packet with no 1189 payload, which therefore does not alter any byte counters, because it 1190 is important that TCP does not acknowledge pure ACKs.
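The example series above can be reproduced with a short simulation. This is a sketch with hypothetical names that makes several simplifying assumptions: every arriving packet carries payload and an ECN-capable or CE codepoint (so each one increments a byte counter), the first packet counts as a change, and the separate rule of Section 3.2.5 about acknowledging after 'n' CE marks is not modelled.

```python
def ack_points(ip_ecn_fields, delayed_ack_factor=2):
    """Return the indexes of arriving packets that trigger an ACK.

    An ACK is emitted when a packet increments a different byte counter
    from the previous packet (change-triggered), or when
    delayed_ack_factor packets have arrived since the last ACK.
    """
    acks = []
    previous = None
    pending = 0
    for index, field in enumerate(ip_ecn_fields):
        pending += 1
        if field != previous or pending >= delayed_ack_factor:
            acks.append(index)
            pending = 0
        previous = field
    return acks


# The series from the text: ACKs follow the 1st, 3rd, 4th, 6th, 8th and
# 9th packets.
series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
triggered = ack_points(series)
```

Counting from zero, `triggered` holds the indexes 0, 2, 3, 5, 7 and 8, which matches the positions marked "-> ACK" in the example series.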
The change- 1191 triggered ACK approach will lead to some additional ACKs but it feeds 1192 back the timing and the order in which ECN marks are received with 1193 minimal additional complexity. 1195 Implementation note: sending an AccECN Option each time a different 1196 counter changes and including a full-length AccECN Option on every 1197 delayed ACK will satisfy the requirements described above and might 1198 be the easiest implementation, as long as sufficient space is 1199 available in each ACK (in total and in the option space). 1201 Appendix A.3 gives an example algorithm to estimate the number of 1202 marked bytes from the ACE field alone, if the AccECN Option is not 1203 available. 1205 If a host has determined that segments with the AccECN Option always 1206 seem to be discarded somewhere along the path, it is no longer 1207 obliged to follow the above rules. 1209 3.3. Requirements for TCP Proxies, Offload Engines and other 1210 Middleboxes on AccECN Compliance 1212 A large class of middleboxes split TCP connections. Such a middlebox 1213 would be compliant with the AccECN protocol if the TCP implementation 1214 on each side complied with the present AccECN specification and each 1215 side negotiated AccECN independently of the other side. 1217 Another large class of middleboxes intervenes to some degree at the 1218 transport layer, but attempts to be transparent (invisible) to the 1219 end-to-end connection. A subset of this class of middleboxes 1220 attempts to `normalise' the TCP wire protocol by checking that all 1221 values in header fields comply with a rather narrow interpretation of 1222 the TCP specifications. 
To comply with the present AccECN 1223 specification, such a middlebox MUST NOT change the ACE field or the 1224 AccECN Option and it SHOULD preserve the timing of each ACK (for 1225 example, if it coalesced ACKs it would not be AccECN-compliant) as 1226 these can be used by the Data Sender to infer further information 1227 about the path congestion level. A middlebox claiming to be 1228 transparent at the transport layer MUST forward the AccECN TCP Option 1229 unaltered, whether or not the length value matches one of those 1230 specified in Section 3.2.6, and whether or not the initial values of 1231 the byte-counter fields are correct. This is because blocking 1232 apparently invalid values does not improve security (because AccECN 1233 hosts are required to ignore invalid values anyway), while it 1234 prevents the standardised set of values being extended in future 1235 (because outdated normalisers would block updated hosts from using 1236 the extended AccECN standard). 1238 Hardware to offload certain TCP processing represents another large 1239 class of middleboxes, even though it is often a function of a host's 1240 network interface and rarely in its own 'box'. Leeway has been 1241 allowed in the present AccECN specification in the expectation that 1242 offload hardware could comply and still serve its function. 1243 Nonetheless, such hardware SHOULD also preserve the timing of each 1244 ACK (for example, if it coalesced ACKs it would not be AccECN- 1245 compliant). 1247 4. Interaction with Other TCP Variants 1249 This section is informative, not normative. 1251 4.1. Compatibility with SYN Cookies 1253 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1254 protect itself from SYN flooding attacks. It places minimal commonly 1255 used connection state in the SYN/ACK, and deliberately does not hold 1256 any state while waiting for the subsequent ACK (e.g. it closes the 1257 thread). 
Therefore it cannot record the fact that it entered AccECN 1258 mode for both half-connections. Indeed, it cannot even remember 1259 whether it negotiated the use of classic ECN [RFC3168]. 1261 Nonetheless, such a server can determine that it negotiated AccECN as 1262 follows. If a TCP server using SYN Cookies supports AccECN and if it 1263 receives a pure ACK that acknowledges an ISN that is a valid SYN 1264 cookie, and if the ACK contains an ACE field with the value 0b010 to 1265 0b111 (decimal 2 to 7), it can assume that: 1267 o the TCP client must have requested AccECN support on the SYN 1269 o it (the server) must have confirmed that it supported AccECN 1271 Therefore the server can switch itself into AccECN mode, and continue 1272 as if it had never forgotten that it switched itself into AccECN mode 1273 earlier. 1275 If the pure ACK that acknowledges a SYN cookie contains an ACE field 1276 with the value 0b000 or 0b001, these values indicate that the client 1277 did not request support for AccECN and therefore the server does not 1278 enter AccECN mode for this connection. Further, 0b001 on the ACK 1279 implies that the server sent an ECN-capable SYN/ACK, which was marked 1280 CE in the network, and the non-AccECN client fed this back by setting 1281 ECE on the ACK of the SYN/ACK. 1283 4.2. Compatibility with Other TCP Options and Experiments 1285 AccECN is compatible (at least on paper) with the most commonly used 1286 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1287 also compatible with the recent promising experimental TCP options 1288 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1289 AccECN is friendly to all these protocols, because space for TCP 1290 options is particularly scarce on the SYN, where AccECN consumes zero 1291 additional header space. 
1293 When option space is under pressure from other options, Section 3.2.8 1294 provides guidance on how important it is to send an AccECN Option and 1295 whether it needs to be a full-length option. 1297 4.3. Compatibility with Feedback Integrity Mechanisms 1299 Three alternative mechanisms are available to assure the integrity of 1300 ECN and/or loss signals. AccECN is compatible with any of these 1301 approaches: 1303 o The Data Sender can test the integrity of the receiver's ECN (or 1304 loss) feedback by occasionally setting the IP-ECN field to a value 1305 normally only set by the network (and/or deliberately leaving a 1306 sequence number gap). Then it can test whether the Data 1307 Receiver's feedback faithfully reports what it expects 1308 [I-D.moncaster-tcpm-rcv-cheat]. Unlike the ECN Nonce [RFC3540], 1309 this approach does not waste the ECT(1) codepoint in the IP 1310 header, it does not require standardisation and it does not rely 1311 on misbehaving receivers volunteering to reveal feedback 1312 information that allows them to be detected. However, setting the 1313 CE mark by the sender might conceal actual congestion feedback 1314 from the network and should therefore only be done sparsely. 1316 o Networks generate congestion signals when they are becoming 1317 congested, so networks are more likely than Data Senders to be 1318 concerned about the integrity of the receiver's feedback of these 1319 signals. A network can enforce a congestion response to its ECN 1320 markings (or packet losses) using congestion exposure (ConEx) 1321 audit [RFC7713]. Whether the receiver or a downstream network is 1322 suppressing congestion feedback or the sender is unresponsive to 1323 the feedback, or both, ConEx audit can neutralise any advantage 1324 that any of these three parties would otherwise gain. 1326 ConEx is a change to the Data Sender that is most useful when 1327 combined with AccECN. 
Without AccECN, the ConEx behaviour of a 1328 Data Sender would have to be more conservative than would be 1329 necessary if it had the accurate feedback of AccECN. 1331 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1332 detect any tampering with AccECN feedback between the Data 1333 Receiver and the Data Sender (whether malicious or accidental). 1334 The AccECN fields are immutable end-to-end, so they are amenable 1335 to TCP-AO protection, which covers TCP options by default. 1336 However, TCP-AO is often too brittle to use on many end-to-end 1337 paths, where middleboxes can make verification fail in their 1338 attempts to improve performance or security, e.g. by 1339 resegmentation or shifting the sequence space. 1341 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1342 of congestion feedback. With minor changes AccECN could be optimised 1343 for the possibility that the ECT(1) codepoint might be used as an ECN 1344 Nonce. However, given RFC 3540 has been reclassified as historic, 1345 the AccECN design has been generalised so that it ought to be able to 1346 support other possible uses of the ECT(1) codepoint, such as a lower 1347 severity or a more instant congestion signal than CE. 1349 5. Protocol Properties 1351 This section is informative, not normative. It describes how well the 1352 protocol satisfies the agreed requirements for a more accurate ECN 1353 feedback protocol [RFC7560]. 1355 Accuracy: From each ACK, the Data Sender can infer the number of new 1356 CE marked segments since the previous ACK. This provides better 1357 accuracy on CE feedback than classic ECN. In addition, if the 1358 AccECN Option is present (not blocked by the network path), the 1359 numbers of bytes marked with CE, ECT(1) and ECT(0) are provided. 1361 Overhead: The AccECN scheme is divided into two parts. The 1362 essential part reuses the 3 flags already assigned to ECN in the 1363 TCP header.
The supplementary part adds an additional TCP option 1364 consuming up to 11 bytes. However, no TCP option is consumed in 1365 the SYN. 1367 Ordering: The order in which marks arrive at the Data Receiver is 1368 preserved in AccECN feedback, because the Data Receiver is 1369 expected to send an ACK immediately whenever a different mark 1370 arrives. 1372 Timeliness: While the same ECN markings are arriving continually at 1373 the Data Receiver, it can defer ACKs as TCP does normally, but it 1374 will immediately send an ACK as soon as a different ECN marking 1375 arrives. 1377 Timeliness vs Overhead: Change-Triggered ACKs are intended to enable 1378 latency-sensitive uses of ECN feedback by capturing the timing of 1379 transitions but not wasting resources while the state of the 1380 signalling system is stable. The receiver can control how 1381 frequently it sends the AccECN TCP Option and therefore it can 1382 control the overhead induced by AccECN. 1384 Resilience: All information is provided based on counters. 1385 Therefore if ACKs are lost, the counters on the first ACK 1386 following the losses allow the Data Sender to immediately recover 1387 the number of ECN markings that it missed. 1389 Resilience against Bias: Because feedback is based on repetition of 1390 counters, random losses do not remove any information, they only 1391 delay it. Therefore, even though some ACKs are change-triggered, 1392 random losses will not alter the proportions of the different ECN 1393 markings in the feedback. 1395 Resilience vs Overhead: If space is limited in some segments (e.g. 1396 because more options are needed on some segments, such as the SACK 1397 option after loss), the Data Receiver can send AccECN Options less 1398 frequently or truncate fields that have not changed, usually down 1399 to as little as 5 bytes.
However, it has to send a full-sized 1400 AccECN Option at least three times per RTT, which the Data Sender 1401 can rely on as a regular beacon or checkpoint. 1403 Resilience vs Timeliness and Ordering: Ordering information and the 1404 timing of transitions cannot be communicated in three cases: i) 1405 during ACK loss; ii) if something on the path strips the AccECN 1406 Option; or iii) if the Data Receiver is unable to support Change- 1407 Triggered ACKs. 1409 Complexity: An AccECN implementation solely involves simple counter 1410 increments, some modulo arithmetic to communicate the least 1411 significant bits and allow for wrap, and some heuristics for 1412 safety against fields cycling due to prolonged periods of ACK 1413 loss. Each host needs to maintain eight additional counters. The 1414 hosts have to apply some additional tests to detect tampering by 1415 middleboxes, but in general the protocol is simple to understand, 1416 simple to implement and requires few cycles per packet to execute. 1418 Integrity: AccECN is compatible with at least three approaches that 1419 can assure the integrity of ECN feedback. If the AccECN Option is 1420 stripped the resolution of the feedback is degraded, but the 1421 integrity of this degraded feedback can still be assured. 1423 Backward Compatibility: If only one endpoint supports the AccECN 1424 scheme, it will fall-back to the most advanced ECN feedback scheme 1425 supported by the other end. 1427 Backward Compatibility: If the AccECN Option is stripped by a 1428 middlebox, AccECN still provides basic congestion feedback in the 1429 ACE field. Further, AccECN can be used to detect mangling of the 1430 IP ECN field; mangling of the TCP ECN flags; blocking of ECT- 1431 marked segments; and blocking of segments carrying the AccECN 1432 Option. It can detect these conditions during TCP's 3WHS so that 1433 it can fall back to operation without ECN and/or operation without 1434 the AccECN Option. 
1436 Forward Compatibility: The behaviour of endpoints and middleboxes is 1437 carefully defined for all reserved or currently unused codepoints 1438 in the scheme, to ensure that any blocking of anomalous values is 1439 always at least under reversible policy control. 1441 6. IANA Considerations 1443 This document reassigns bit 7 of the TCP header flags to the AccECN 1444 experiment. This bit was previously called the Nonce Sum (NS) flag 1445 [RFC3540], but RFC 3540 is being reclassified as historic [RFC8311]. 1446 The flag will now be defined as: 1448 +-----+-------------------+-----------+ 1449 | Bit | Name | Reference | 1450 +-----+-------------------+-----------+ 1451 | 7 | AE (Accurate ECN) | RFC XXXX | 1452 +-----+-------------------+-----------+ 1454 [TO BE REMOVED: This registration should take place at the following 1455 location: https://www.iana.org/assignments/tcp-header-flags/tcp- 1456 header-flags.xhtml#tcp-header-flags-1 ] 1458 This document also defines a new TCP option for AccECN, assigned a 1459 value of TBD1 (decimal) from the TCP option space. This value is 1460 defined as: 1462 +------+--------+-----------------------+-----------+ 1463 | Kind | Length | Meaning | Reference | 1464 +------+--------+-----------------------+-----------+ 1465 | TBD1 | N | Accurate ECN (AccECN) | RFC XXXX | 1466 +------+--------+-----------------------+-----------+ 1468 [TO BE REMOVED: This registration should take place at the following 1469 location: http://www.iana.org/assignments/tcp-parameters/tcp- 1470 parameters.xhtml#tcp-parameters-1 ] 1472 Early implementation before the IANA allocation MUST follow [RFC6994] 1473 and use experimental option 254 and magic number 0xACCE (16 bits), 1474 then migrate to the new option after the allocation. 1476 7. 
Security Considerations 1478 If ever the supplementary part of AccECN based on the new AccECN TCP 1479 Option is unusable (due for example to middlebox interference) the 1480 essential part of AccECN's congestion feedback offers only limited 1481 resilience to long runs of ACK loss (see Section 3.2.5). These 1482 problems are unlikely to be due to malicious intervention (because if 1483 an attacker could strip a TCP option or discard a long run of ACKs it 1484 could wreak other arbitrary havoc). However, it would be of concern 1485 if AccECN's resilience could be indirectly compromised during a 1486 flooding attack. AccECN is still considered safe though, because if 1487 the option is not presented, the AccECN Data Sender is then required 1488 to switch to more conservative assumptions about wrap of congestion 1489 indication counters (see Section 3.2.5 and Appendix A.2). 1491 Section 4.1 describes how a TCP server can negotiate AccECN and use 1492 the SYN cookie method for mitigating SYN flooding attacks. 1494 There is concern that ECN markings could be altered or suppressed, 1495 particularly because a misbehaving Data Receiver could increase its 1496 own throughput at the expense of others. AccECN is compatible with 1497 the three schemes known to assure the integrity of ECN feedback (see 1498 Section 4.3 for details). If the AccECN Option is stripped by an 1499 incorrectly implemented middlebox, the resolution of the feedback 1500 will be degraded, but the integrity of this degraded information can 1501 still be assured. 1503 There is a potential concern that a receiver could deliberately omit 1504 the AccECN Option pretending that it had been stripped by a 1505 middlebox. No known way can yet be contrived to take advantage of 1506 this downgrade attack, but it is mentioned here in case someone else 1507 can contrive one. 
1509 The AccECN protocol is not believed to introduce any new privacy 1510 concerns, because it merely counts and feeds back signals at the 1511 transport layer that had already been visible at the IP layer. 1513 8. Acknowledgements 1515 We want to thank Koen De Schepper, Praveen Balasubramanian, Michael 1516 Welzl, Gorry Fairhurst, David Black, Spencer Dawkins, Michael Scharf 1517 and Michael Tuexen for their input and discussion. The idea of using 1518 the three ECN-related TCP flags as one field for more accurate TCP- 1519 ECN feedback was first introduced in the re-ECN protocol that was the 1520 ancestor of ConEx. 1522 Bob Briscoe was part-funded by the European Community under its 1523 Seventh Framework Programme through the Reducing Internet Transport 1524 Latency (RITE) project (ICT-317700) and through the Trilogy 2 project 1525 (ICT-317756). He was also part-funded by the Research Council of 1526 Norway through the TimeIn project. The views expressed here are 1527 solely those of the authors. 1529 Mirja Kuehlewind was partly supported by the European Commission 1530 under Horizon 2020 grant agreement no. 688421 Measurement and 1531 Architecture for a Middleboxed Internet (MAMI), and by the Swiss 1532 State Secretariat for Education, Research, and Innovation under 1533 contract no. 15.0268. This support does not imply endorsement. 1535 9. Comments Solicited 1537 Comments and questions are encouraged and very welcome. They can be 1538 addressed to the IETF TCP maintenance and minor modifications working 1539 group mailing list , and/or to the authors. 1541 10. References 1543 10.1. Normative References 1545 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1546 Requirement Levels", BCP 14, RFC 2119, 1547 DOI 10.17487/RFC2119, March 1997, 1548 . 1550 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1551 of Explicit Congestion Notification (ECN) to IP", 1552 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1553 . 
1555 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1556 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 1557 . 1559 [RFC6994] Touch, J., "Shared Use of Experimental TCP Options", 1560 RFC 6994, DOI 10.17487/RFC6994, August 2013, 1561 . 1563 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1564 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1565 May 2017, . 1567 10.2. Informative References 1569 [I-D.ietf-tcpm-alternativebackoff-ecn] 1570 Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, 1571 "TCP Alternative Backoff with ECN (ABE)", draft-ietf-tcpm- 1572 alternativebackoff-ecn-06 (work in progress), February 1573 2018. 1575 [I-D.ietf-tcpm-generalized-ecn] 1576 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit 1577 Congestion Notification (ECN) to TCP Control Packets", 1578 draft-ietf-tcpm-generalized-ecn-02 (work in progress), 1579 October 2017. 1581 [I-D.ietf-tsvwg-l4s-arch] 1582 Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, 1583 Low Loss, Scalable Throughput (L4S) Internet Service: 1584 Architecture", draft-ietf-tsvwg-l4s-arch-01 (work in 1585 progress), October 2017. 1587 [I-D.kuehlewind-tcpm-ecn-fallback] 1588 Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path 1589 Probing and Fallback", draft-kuehlewind-tcpm-ecn- 1590 fallback-01 (work in progress), September 2013. 1592 [I-D.moncaster-tcpm-rcv-cheat] 1593 Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to 1594 Allow Senders to Identify Receiver Non-Compliance", draft- 1595 moncaster-tcpm-rcv-cheat-03 (work in progress), July 2014. 1597 [Mandalari18] 1598 Mandalari, A., Lutu, A., Briscoe, B., Bagnulo, M., and Oe. 1599 Alay, "Measuring ECN++: Good News for ++, Bad News for ECN 1600 over Mobile", IEEE Communications Magazine , March 2018. 1602 (to appear) 1604 [RFC3540] Spring, N., Wetherall, D., and D. 
Ely, "Robust Explicit 1605 Congestion Notification (ECN) Signaling with Nonces", 1606 RFC 3540, DOI 10.17487/RFC3540, June 2003, 1607 . 1609 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 1610 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 1611 . 1613 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 1614 Ramakrishnan, "Adding Explicit Congestion Notification 1615 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 1616 DOI 10.17487/RFC5562, June 2009, 1617 . 1619 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 1620 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 1621 June 2010, . 1623 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 1624 "TCP Extensions for Multipath Operation with Multiple 1625 Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013, 1626 . 1628 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 1629 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 1630 . 1632 [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe, 1633 "Problem Statement and Requirements for Increased Accuracy 1634 in Explicit Congestion Notification (ECN) Feedback", 1635 RFC 7560, DOI 10.17487/RFC7560, August 2015, 1636 . 1638 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1639 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 1640 DOI 10.17487/RFC7713, December 2015, 1641 . 1643 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1644 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1645 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1646 October 2017, . 1648 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 1649 Notification (ECN) Experimentation", RFC 8311, 1650 DOI 10.17487/RFC8311, January 2018, 1651 . 1653 Appendix A. Example Algorithms 1655 This appendix is informative, not normative. It gives example 1656 algorithms that would satisfy the normative requirements of the 1657 AccECN protocol. 
However, implementers are free to choose other ways 1658 to implement the requirements. 1660 A.1. Example Algorithm to Encode/Decode the AccECN Option 1662 The example algorithms below show how a Data Receiver in AccECN mode 1663 could encode its CE byte counter r.ceb into the ECEB field within the 1664 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 1665 the ECEB field into its byte counter s.ceb. The other counters for 1666 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 1667 similarly encoded and decoded. 1669 It is assumed that each local byte counter is an unsigned integer 1670 greater than 24b (probably 32b), and that the following constant has 1671 been assigned: 1673 DIVOPT = 2^24 1675 Every time a CE marked data segment arrives, the Data Receiver 1676 increments its local value of r.ceb by the size of the TCP Data. 1677 Whenever it sends an ACK with the AccECN Option, the value it writes 1678 into the ECEB field is 1680 ECEB = r.ceb % DIVOPT 1682 where '%' is the modulo operator. 1684 On the arrival of an AccECN Option, the Data Sender uses the TCP 1685 acknowledgement number and any SACK options to calculate newlyAckedB, 1686 the amount of new data that the ACK acknowledges in bytes. If 1687 newlyAckedB is negative it means that a more up to date ACK has 1688 already been processed, so this ACK has been superseded and the Data 1689 Sender has to ignore the AccECN Option. Then the Data Sender 1690 calculates the minimum difference d.ceb between the ECEB field and 1691 its local s.ceb counter, using modulo arithmetic as follows: 1693 if (newlyAckedB >= 0) { 1694 d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT 1695 s.ceb += d.ceb 1696 } 1698 For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), 1699 then 1700 s.ceb % DIVOPT = 1 1701 d.ceb = (1461 + 2^24 - 1) % 2^24 1702 = 1460 1703 s.ceb = 33,554,433 + 1460 1704 = 33,555,893 1706 A.2. 
Example Algorithm for Safety Against Long Sequences of ACK Loss 1708 The example algorithms below show how a Data Receiver in AccECN mode 1709 could encode its CE packet counter r.cep into the ACE field, and how 1710 the Data Sender in AccECN mode could decode the ACE field into its 1711 s.cep counter. The Data Sender's algorithm includes code to 1712 heuristically detect a long enough unbroken string of ACK losses that 1713 could have concealed a cycle of the congestion counter in the ACE 1714 field of the next ACK to arrive. 1716 Two variants of the algorithm are given: i) a more conservative 1717 variant for a Data Sender to use if it detects that the AccECN Option 1718 is not available (see Section 3.2.5 and Section 3.2.7); and ii) a 1719 less conservative variant that is feasible when complementary 1720 information is available from the AccECN Option. 1722 A.2.1. Safety Algorithm without the AccECN Option 1724 It is assumed that each local packet counter is a sufficiently sized 1725 unsigned integer (probably 32b) and that the following constant has 1726 been assigned: 1728 DIVACE = 2^3 1730 Every time a CE marked packet arrives, the Data Receiver increments 1731 its local value of r.cep by 1. It repeats the same value of ACE in 1732 every subsequent ACK until the next CE marking arrives, where 1734 ACE = r.cep % DIVACE. 1736 If the Data Sender received an earlier value of the counter that had 1737 been delayed due to ACK reordering, it might incorrectly calculate 1738 that the ACE field had wrapped. Therefore, on the arrival of every 1739 ACK, the Data Sender uses the TCP acknowledgement number and any SACK 1740 options to calculate newlyAckedB, the amount of new data that the ACK 1741 acknowledges. If newlyAckedB is negative it means that a more up to 1742 date ACK has already been processed, so this ACK has been superseded 1743 and the Data Sender has to ignore the AccECN Option. 
If newlyAckedB 1744 is zero, to break the tie the Data Sender could use timestamps (if 1745 present) to work out newlyAckedT, the amount of new time that the ACK 1746 acknowledges. Then the Data Sender calculates the minimum difference 1747 d.cep between the ACE field and its local s.cep counter, using modulo 1748 arithmetic as follows: 1750 if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0)) 1751 d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE 1753 Section 3.2.5 requires the Data Sender to assume that the ACE field 1754 did cycle if it could have cycled under prevailing conditions. The 1755 3-bit ACE field in an arriving ACK could have cycled and become 1756 ambiguous to the Data Sender if a row of ACKs goes missing that 1757 covers a stream of data long enough to contain 8 or more CE marks. 1758 We use the word `missing' rather than `lost', because some or all the 1759 missing ACKs might arrive eventually, but out of order. Even if some 1760 of the lost ACKs are piggy-backed on data (i.e. not pure ACKs) 1761 retransmissions will not repair the lost AccECN information, because 1762 AccECN requires retransmissions to carry the latest AccECN counters, 1763 not the original ones. 1765 The phrase `under prevailing conditions' allows the Data Sender to 1766 take account of the prevailing size of data segments and the 1767 prevailing CE marking rate just before the sequence of ACK losses. 1768 However, we shall start with the simplest algorithm, which assumes 1769 segments are all full-sized and ultra-conservatively it assumes that 1770 ECN marking was 100% on the forward path when ACKs on the reverse 1771 path started to all be dropped. Specifically, if newlyAckedB is the 1772 amount of data that an ACK acknowledges since the previous ACK, then 1773 the Data Sender could assume that this acknowledges newlyAckedPkt 1774 full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. 
Then it 1775 could assume that the ACE field incremented by 1777 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 1779 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 1780 size segments than any previous ACK, and that ACE increments by a 1781 minimum of 2 CE marks (d.cep=2). The above formula works out that it 1782 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 1783 2). However, if ACE increases by a minimum of 2 but acknowledges 10 1784 full-sized segments, then it would be necessary to assume that there 1785 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 1787 Implementers could build in more heuristics to estimate prevailing 1788 average segment size and prevailing ECN marking. For instance, 1789 newlyAckedPkt in the above formula could be replaced with 1790 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 1791 segment size and p is the prevailing ECN marking probability. 1792 However, ultimately, if TCP's ECN feedback becomes inaccurate it 1793 still has loss detection to fall back on. Therefore, it would seem 1794 safe to implement a simple algorithm, rather than a perfect one. 1796 The simple algorithm for dSafer.cep above requires no monitoring of 1797 prevailing conditions and it would still be safe if, for example, 1798 segments were on average at least 5% of full-sized as long as ECN 1799 marking was 5% or less. Assuming it was used, the Data Sender would 1800 increment its packet counter as follows: 1802 s.cep += dSafer.cep 1804 If missing acknowledgement numbers arrive later (due to reordering), 1805 Section 3.2.5 says "the Data Sender MAY attempt to neutralise the 1806 effect of any action it took based on a conservative assumption that 1807 it later found to be incorrect". To do this, the Data Sender would 1808 have to store the values of all the relevant variables whenever it 1809 made assumptions, so that it could re-evaluate them later. 
Given 1810 this could become complex and it is not required, we do not attempt 1811 to provide an example of how to do this. 1813 A.2.2. Safety Algorithm with the AccECN Option 1815 When the AccECN Option is available on the ACKs before and after the 1816 possible sequence of ACK losses, if the Data Sender only needs CE- 1817 marked bytes, it will have sufficient information in the AccECN 1818 Option without needing to process the ACE field. However, if for 1819 some reason it needs CE-marked packets, if dSafer.cep is different 1820 from d.cep, it can calculate the average marked segment size that 1821 each implies to determine whether d.cep is likely to be a safe enough 1822 estimate. Specifically, it could use the following algorithm, where 1823 d.ceb is the amount of newly CE-marked bytes (see Appendix A.1): 1825 SAFETY_FACTOR = 2 1826 if (dSafer.cep > d.cep) { 1827 s = d.ceb/d.cep 1828 if (s <= MSS) { 1829 sSafer = d.ceb/dSafer.cep 1830 if (sSafer < MSS/SAFETY_FACTOR) 1831 dSafer.cep = d.cep % d.cep is a safe enough estimate 1832 } % else 1833 % No need for else; dSafer.cep is already correct, 1834 % because d.cep must have been too small 1835 } 1837 The chart below shows when the above algorithm will consider d.cep 1838 can replace dSafer.cep as a safe enough estimate of the number of CE- 1839 marked packets: 1841 ^ 1842 sSafer| 1843 | 1844 MSS+ 1845 | 1846 | dSafer.cep 1847 | is 1848 MSS/2+--------------+ safest 1849 | | 1850 | d.cep is safe| 1851 | enough | 1852 +--------------------> 1853 MSS s 1855 The following examples give the reasoning behind the algorithm, 1856 assuming MSS=1,460 [B]: 1858 o if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and 1859 sSafer=182.5. 1860 Therefore even though the average size of 8 data segments is 1861 unlikely to have been as small as MSS/8, d.cep cannot have been 1862 correct, because it would imply an average segment size greater 1863 than the MSS. 
1865 o if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and 1866 sSafer=146. 1867 Therefore d.cep is safe enough, because the average size of 10 1868 data segments is unlikely to have been as small as MSS/10. 1870 o if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and 1871 sSafer=680. 1872 Therefore d.cep is safe enough, because the average data segment 1873 size is more likely to have been just less than one MSS, rather 1874 than below MSS/2. 1876 If pure ACKs were allowed to be ECN-capable, missing ACKs would be 1877 far less likely. However, because [RFC3168] currently precludes 1878 this, the above algorithm assumes that pure ACKs are not ECN-capable. 1880 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets 1882 If the AccECN Option is not available, the Data Sender can only 1883 decode CE-marking from the ACE field in packets. Every time an ACK 1884 arrives, to convert this into an estimate of CE-marked bytes, it 1885 needs an average of the segment size, s_ave. Then it can add or 1886 subtract s_ave from the value of d.ceb as the value of d.cep 1887 increments or decrements. 1889 To calculate s_ave, it could keep a record of the byte numbers of all 1890 the boundaries between packets in flight (including control packets), 1891 and recalculate s_ave on every ACK. However it would be simpler to 1892 merely maintain a counter packets_in_flight for the number of packets 1893 in flight (including control packets), which it could update once per 1894 RTT. Either way, it would estimate s_ave as: 1896 s_ave ~= flightsize / packets_in_flight, 1898 where flightsize is the variable that TCP already maintains for the 1899 number of bytes in flight. To avoid floating point arithmetic, it 1900 could right-bit-shift by lg(packets_in_flight), where lg() means log 1901 base 2. 
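As an illustration of the packets_in_flight approach above, here is a minimal Python sketch; it is not part of the specification. The function names estimate_s_ave and update_ceb are hypothetical. The shift uses floor(lg(packets_in_flight)), which is an assumption about how the rounding would be done. Because the divisor is effectively rounded down to a power of two, the result can overestimate s_ave by a factor of just under 2 when packets_in_flight is not itself a power of two.

```python
def estimate_s_ave(flightsize: int, packets_in_flight: int) -> int:
    """Approximate flightsize / packets_in_flight using only a right shift.

    flightsize is the number of bytes in flight and packets_in_flight the
    number of packets in flight (including control packets).
    """
    if packets_in_flight <= 0:
        # Nothing in flight: no meaningful average (assumed convention).
        return 0
    # floor(log2(n)) for n >= 1, computed without floating point.
    shift = packets_in_flight.bit_length() - 1
    return flightsize >> shift


def update_ceb(d_ceb: int, d_cep_change: int, s_ave: int) -> int:
    """Add or subtract s_ave from d.ceb for each step that d.cep moves.

    d_cep_change is positive when d.cep increments and negative when it
    decrements.
    """
    return d_ceb + d_cep_change * s_ave


if __name__ == "__main__":
    # 8 packets of 1,460 bytes: the shift divides exactly.
    print(estimate_s_ave(8 * 1460, 8))
    # 10 packets of 1,460 bytes: divisor rounds down to 8, so s_ave is
    # overestimated.
    print(estimate_s_ave(10 * 1460, 10))
```

With 8 full-sized segments in flight the estimate is exact; with 10 it is rounded up, so the resulting estimate of CE-marked bytes would err on the high side.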
1903 An alternative would be to maintain an exponentially weighted moving 1904 average (EWMA) of the segment size: 1906 s_ave = a * s + (1-a) * s_ave, 1908 where a is the decay constant for the EWMA. However, then it is 1909 necessary to choose a good value for this constant, which ought to 1910 depend on the number of packets in flight. Also the decay constant 1911 needs to be a power of two to avoid floating point arithmetic. 1913 A.4. Example Algorithm to Beacon AccECN Options 1915 Section 3.2.8 requires a Data Receiver to beacon a full-length AccECN 1916 Option at least 3 times per RTT. This could be implemented by 1917 maintaining a variable to store the number of ACKs (pure and data 1918 ACKs) since a full AccECN Option was last sent and another for the 1919 approximate number of ACKs sent in the last round trip time: 1921 if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ) 1922 send_full_AccECN_Option() 1924 For optimised integer arithmetic, BEACON_FREQ = 4 could be used, 1925 rather than 3, so that the division could be implemented as an 1926 integer right bit-shift by lg(BEACON_FREQ). 1928 In certain operating systems, it might be too complex to maintain 1929 acks_in_round. In others it might be possible by tagging each data 1930 segment in the retransmit buffer with the number of ACKs sent at the 1931 point that segment was sent. This would not work well if the Data 1932 Receiver was not sending data itself, in which case it might be 1933 necessary to beacon based on time instead, as follows: 1935 if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) ) 1936 send_full_AccECN_Option() 1938 This time-based approach does not work well when all the ACKs are 1939 sent early in each round trip, as is the case during slow-start. In 1940 this case few options will be sent (possibly even fewer than 3 per RTT).
1941 However, when continuously sending data, data packets as well as ACKs 1942 will spread out evenly over the RTT and sufficient ACKs with the 1943 AccECN Option will be sent. 1945 A.5. Example Algorithm to Count Not-ECT Bytes 1947 A Data Sender in AccECN mode can infer the amount of TCP payload data 1948 arriving at the receiver marked Not-ECT from the difference between 1949 the amount of newly ACKed data and the sum of the bytes with the 1950 other three markings, d.ceb, d.e0b and d.e1b. Note that, because 1951 r.e0b is initialized to 1 and the other two counters are initialized 1952 to 0, the initial sum will be 1, which matches the initial offset of 1953 the TCP sequence number on completion of the 3WHS. 1955 For this approach to be precise, it has to be assumed that spurious 1956 (unnecessary) retransmissions do not lead to double counting. This 1957 assumption is currently correct, given that RFC 3168 requires that 1958 the Data Sender marks retransmitted segments as Not-ECT. However, 1959 the converse is not true; necessary retransmissions will result in 1960 under-counting. 1962 However, such precision is unlikely to be necessary. The only known 1963 use of a count of Not-ECT marked bytes is to test whether equipment 1964 on the path is clearing the ECN field (perhaps due to an out-dated 1965 attempt to clear, or bleach, what used to be the ToS field). To 1966 detect bleaching it will be sufficient to detect whether nearly all 1967 bytes arrive marked as Not-ECT. Therefore there should be no need to 1968 keep track of the details of retransmissions. 1970 Authors' Addresses 1972 Bob Briscoe 1973 CableLabs 1974 UK 1976 EMail: ietf@bobbriscoe.net 1977 URI: http://bobbriscoe.net/ 1979 Mirja Kuehlewind 1980 ETH Zurich 1981 Zurich 1982 Switzerland 1984 EMail: mirja.kuehlewind@tik.ee.ethz.ch 1985 Richard Scheffenegger 1986 Vienna 1987 Austria 1989 EMail: rscheff@gmx.at