TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                Simula Research Laboratory
Intended status: Experimental                              M. Kuehlewind
Expires: December 1, 2017                                     ETH Zurich
                                                        R. Scheffenegger
                                                            May 30, 2017

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-03

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one feedback
   signal can be transmitted per Round-Trip Time (RTT). Recently, new
   TCP mechanisms like Congestion Exposure (ConEx) or Data Center TCP
   (DCTCP) need more accurate ECN feedback information whenever more
   than one marking is received in one RTT. This document specifies an
   experimental scheme to provide more than one feedback signal per RTT
   in the TCP header.
   Given TCP header space is scarce, it overloads the three existing
   ECN-related flags in the TCP header and provides additional
   information in a new TCP option.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 1, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview and Rationale
     2.1.  Capability Negotiation
     2.2.  Feedback Mechanism
     2.3.  Delayed ACKs and Resilience Against ACK Loss
     2.4.  Feedback Metrics
     2.5.  Generic (Dumb) Reflector
   3.  AccECN Protocol Specification
     3.1.  Negotiating to use AccECN
       3.1.1.  Negotiation during the TCP handshake
       3.1.2.  Retransmission of the SYN
     3.2.  AccECN Feedback
       3.2.1.  The ACE Field
       3.2.2.  Testing for Zeroing of the ACE Field
       3.2.3.  Safety against Ambiguity of the ACE Field
       3.2.4.  The AccECN Option
       3.2.5.  Path Traversal of the AccECN Option
       3.2.6.  Usage of the AccECN TCP Option
     3.3.  AccECN Compliance by TCP Proxies, Offload Engines and
           other Middleboxes
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other TCP Options and Experiments
     4.3.  Compatibility with Feedback Integrity Mechanisms
   5.  Protocol Properties
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm to Encode/Decode the AccECN Option
     A.2.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
       A.2.1.  Safety Algorithm without the AccECN Option
       A.2.2.  Safety Algorithm with the AccECN Option
     A.3.  Example Algorithm to Estimate Marked Bytes from Marked
           Packets
     A.4.  Example Algorithm to Beacon AccECN Options
     A.5.  Example Algorithm to Count Not-ECT Bytes
   Appendix B.  Alternative Design Choices (To Be Removed Before
                Publication)
   Appendix C.  Open Protocol Design Issues (To Be Removed Before
                Publication)
   Appendix D.  Changes in This Version (To Be Removed Before
                Publication)
   Authors' Addresses

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender. ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [RFC7713]), DCTCP [I-D.ietf-tcpm-dctcp] or L4S
   [I-D.ietf-tsvwg-l4s-arch] need more accurate ECN feedback information
   whenever more than one marking is received in one RTT.
   A fuller treatment of the motivation for this specification is given
   in the associated requirements document [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic ECN
   feedback, not a fork in the design of TCP. Thus, the applicability
   of AccECN is intended to include all public and private IP networks
   (and even any non-IP networks over which TCP is used today). Until
   the AccECN experiment succeeds, [RFC3168] will remain as the
   standards track specification for adding ECN to TCP. To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN feedback overloads flags and fields in the main TCP header
   with new definitions, so both ends have to support the new wire
   protocol before it can be used. Therefore during the TCP handshake
   the two ends use the three ECN-related flags in the TCP header to
   negotiate the most advanced feedback protocol that they can both
   support.

   AccECN is solely an (experimental) change to the TCP wire protocol;
   it only specifies the negotiation and signaling of more accurate ECN
   feedback from a TCP Data Receiver to a Data Sender. It is completely
   independent of how TCP might respond to congestion feedback, which is
   out of scope.
   For that we refer to [RFC3168] or any RFC that specifies a different
   response to TCP ECN feedback, for example: [I-D.ietf-tcpm-dctcp]; or
   the ECN experiments referred to in
   [I-D.ietf-tsvwg-ecn-experimentation], namely: a TCP-based Low Latency
   Low Loss Scalable (L4S) congestion control [I-D.ietf-tsvwg-l4s-arch];
   ECN-capable TCP control packets [I-D.bagnulo-tcpm-generalized-ecn];
   or Alternative Backoff with ECN (ABE)
   [I-D.ietf-tcpm-alternativebackoff-ecn].

   It is likely (but not required) that the AccECN protocol will be
   implemented along with the following experimental additions to the
   TCP-ECN protocol: ECN-capable TCP control packets and retransmissions
   [I-D.bagnulo-tcpm-generalized-ecn], which includes the ECN-capable
   SYN-ACK experiment [RFC5562]; and testing receiver non-compliance
   [I-D.moncaster-tcpm-rcv-cheat].

1.1.  Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like. Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification. Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not. Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2.  Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3.  Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested. The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol. The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if it
   satisfies the requirements of [RFC7560] in the consensus opinion of
   the IETF tcpm working group. In short, this requires that it
   improves the accuracy and timeliness of TCP's ECN feedback, as
   claimed in Section 5, while striking a balance between the
   conflicting requirements of resilience, integrity and minimisation of
   overhead.
   It also requires that it is not unduly complex, and that it is
   compatible with prevalent equipment behaviours in the current
   Internet, whether or not they comply with standards.

   Testing will mostly focus on fall-back strategies in case of
   middlebox interference. Current recommended strategies are specified
   in Sections 3.1.2, 3.2.2 and 3.2.5. The effectiveness of these
   strategies depends on the actual deployment situation of middleboxes.
   Therefore experimental verification to confirm large-scale path
   traversal in the Internet is needed to finalize this specification on
   the Standards Track.

1.4.  Terminology

   AccECN:  The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN:  The ECN protocol specified in [RFC3168].

   Classic ECN feedback:  The feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK:  A TCP acknowledgement, with or without a data payload.

   Pure ACK:  A TCP acknowledgement without a data payload.

   TCP client:  The TCP stack that originates a connection.

   TCP server:  The TCP stack that responds to a connection request.

   Data Receiver:  The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender:  The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.5.  Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header.
   Once ECN has been negotiated with the receiver at the transport
   layer, an ECN sender can set two possible codepoints (ECT(0) or
   ECT(1)) in the IP header to indicate an ECN-capable transport (ECT).
   If both ECN bits are zero, the packet is considered to have been sent
   by a Not-ECN-capable Transport (Not-ECT). When a network node
   experiences congestion, it will occasionally either drop or mark a
   packet, with the choice depending on the packet's ECN codepoint. If
   the codepoint is Not-ECT, only drop is appropriate. If the codepoint
   is ECT(0) or ECT(1), the node can mark the packet by setting both ECN
   bits, which is termed 'Congestion Experienced' (CE), or loosely a
   'congestion mark'. Table 1 summarises these codepoints.

   +-----------------------+---------------+---------------------------+
   | IP-ECN codepoint      | Codepoint     | Description               |
   | (binary)              | name          |                           |
   +-----------------------+---------------+---------------------------+
   | 00                    | Not-ECT       | Not ECN-Capable Transport |
   | 01                    | ECT(1)        | ECN-Capable Transport (1) |
   | 10                    | ECT(0)        | ECN-Capable Transport (0) |
   | 11                    | CE            | Congestion Experienced    |
   +-----------------------+---------------+---------------------------+

                  Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client
   indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an
   ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in
   the SYN/ACK. On reception of a CE-marked packet at the IP layer, the
   Data Receiver starts to set the Echo Congestion Experienced (ECE)
   flag continuously in the TCP header of ACKs, which ensures the signal
   is received reliably even if ACKs are lost.
   The TCP sender confirms that it has received at least one ECE signal
   by responding with the congestion window reduced (CWR) flag, which
   allows the TCP receiver to stop repeating the ECN-Echo flag. This
   always leads to a full RTT of ACKs with ECE set. Thus any additional
   CE markings arriving within this RTT cannot be fed back.

   The last bit in byte 13 of the TCP header was defined as the Nonce
   Sum (NS) for the ECN Nonce [RFC3540]. RFC 3540 was never deployed so
   it is being reclassified as historic, making this TCP flag available
   for use by the AccECN experiment instead.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

     Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2.  AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP
   half-connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks).
      This provides greater resilience against ACK loss than the
      essential feedback, but it is more likely to suffer from middlebox
      interference.

   The two part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN. This
   design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1.  Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.
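
   As an informal illustration (not part of the specification), the
   fall-back behaviour described above can be sketched in Python using
   the flag combinations tabulated in Table 2 of Section 3.1.1. The
   names Mode, client_mode and server_synack are hypothetical, flags
   are assumed to be passed as (AE, CWR, ECE) tuples of 0 or 1, and
   raising an error for SYN combinations that Table 2 does not list is
   an assumption of this sketch rather than specified behaviour.

```python
from enum import Enum


class Mode(Enum):
    ACCECN = "AccECN"
    CLASSIC_ECN = "classic ECN"
    NOT_ECN = "Not ECN"


# Flags are (AE, CWR, ECE) tuples, in the column order of Table 2.
ACCECN_SYN = (1, 1, 1)


def client_mode(synack):
    """Mode an AccECN client enters after sending ACCECN_SYN and
    receiving a SYN/ACK carrying the given (AE, CWR, ECE) flags."""
    if synack in ((0, 0, 0), (1, 1, 1)):
        # No ECN support, or the SYN flags were reflected ('Broken').
        return Mode.NOT_ECN
    if synack in ((0, 0, 1), (1, 0, 1)):
        # Classic ECN server, or ECN Nonce server.
        return Mode.CLASSIC_ECN
    # (0,1,0) and (1,1,0) confirm AccECN; (0,1,1) and (1,0,0) are
    # currently unused and treated as AccECN for forward compatibility.
    return Mode.ACCECN


def server_synack(syn, syn_was_ce):
    """(mode, SYN/ACK flags) chosen by an AccECN-enabled server.

    Only the SYN combinations listed in Table 2 are handled; anything
    else raises, which is an assumption of this sketch.
    """
    if syn == ACCECN_SYN:
        # AE on the SYN/ACK reports whether the SYN arrived CE-marked.
        return Mode.ACCECN, (1 if syn_was_ce else 0, 1, 0)
    if syn == (0, 1, 1):
        return Mode.CLASSIC_ECN, (0, 0, 1)
    if syn == (0, 0, 0):
        return Mode.NOT_ECN, (0, 0, 0)
    raise ValueError("SYN flag combination not covered by Table 2")
```

   Expressing the client side as a function of the SYN/ACK flags alone
   reflects the design choice that a single SYN encoding suffices: the
   server's answer selects the mode, and the client follows it.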

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2.  Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field (see Figure 2 later). The LSBs
   of each of the three byte counters are carried in the AccECN Option.

2.3.  Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. Then, even if some ACKs are lost, the Data Sender should
   be able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle.
   However, cycling is still a possibility at the Data Sender because a
   whole sequence of ACKs carrying intervening values of the field might
   all be lost or delayed in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets. Nonetheless,
   their size has been chosen such that a whole cycle of the field would
   never occur between ACKs unless there had been an infeasibly long
   sequence of ACK losses. Therefore, as long as the AccECN Option is
   available, it can be treated as a dependable feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear
   [I-D.ietf-tsvwg-ecn-experimentation]. Because the 3-bit ACE field is
   so small, when it is the only field available the Data Sender has to
   interpret it conservatively assuming the worst possible wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4.  Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx. If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5.  Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and
   [I-D.moncaster-tcpm-rcv-cheat]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on whether it is CE marked. Although RFC 3168
   prohibits an ECN-capable SYN, providing feedback of CE marking on the
   SYN supports future scenarios in which SYNs might be ECN-enabled
   (without prejudging whether they ought to be).
   For instance, [I-D.ietf-tsvwg-ecn-experimentation] updates this
   aspect of RFC 3168 to allow experimentation with ECN-capable TCP
   control packets.

   Even if the TCP client has set the SYN to Not-ECT in compliance with
   RFC 3168, feedback on whether it has been CE-marked could still be
   useful, because middleboxes have been known to overwrite the ECN IP
   field as if it is still part of the old Type of Service (ToS) field.
   If a TCP client has set the SYN to Not-ECT, but receives CE feedback,
   it can detect such middlebox interference and send Not-ECT for the
   rest of the connection (see [I-D.kuehlewind-tcpm-ecn-fallback]).
   Today, if a TCP server receives CE on a SYN, it cannot know whether
   it is invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT). Therefore, prior to
   AccECN, the server's only safe course of action was to disable ECN
   for the connection. Instead, the AccECN protocol allows the server
   to feed back the CE marking to the client, which then has all the
   information to decide whether the connection has to fall back from
   supporting ECN (or not).

3.  AccECN Protocol Specification

3.1.  Negotiating to use AccECN

3.1.1.  Negotiation during the TCP handshake

   Given the ECN Nonce [RFC3540] is being reclassified as historic, the
   present specification renames the TCP flag at bit 7 of the TCP header
   flags from NS (Nonce Sum) to AE (Accurate ECN) (see IANA
   Considerations in Section 6).

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags AE=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN-enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode.
   Then it MUST set the TCP flags CWR=1 and ECE=0 on its response in the
   SYN/ACK segment to confirm that it supports AccECN. The TCP server
   MUST NOT set this combination of flags unless the preceding SYN
   requested support for AccECN as above.

   A TCP server in AccECN mode MUST additionally set the TCP flag AE=1
   on the SYN/ACK if the IP/ECN field of the SYN was CE-marked (see
   Section 2.5 for rationale). If the IP/ECN field of the received SYN
   was Not-ECT, ECT(0) or ECT(1), it MUST clear the TCP AE flag (AE=0)
   on the SYN/ACK.

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   The procedure for the client to follow if a SYN/ACK does not arrive
   before its retransmission timer expires is given in Section 3.1.2.

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN. Table 2 tabulates all the negotiation
   possibilities for ECN-related capabilities that involve at least one
   AccECN-capable host. The entries in the first two columns have been
   abbreviated, as follows:

   AccECN:  More Accurate ECN Feedback (the present specification)

   Nonce:  ECN Nonce feedback [RFC3540]

   ECN:  'Classic' ECN feedback [RFC3168]

   No ECN:  Not-ECN-capable. Implicit congestion notification using
      packet drop.
563 +--------+---------+------------+--------------+--------------------+ 564 | A | B | SYN A->B | SYN/ACK B->A | Feedback Mode | 565 +--------+---------+------------+--------------+--------------------+ 566 | | | AE CWR ECE | AE CWR ECE | | 567 | AccECN | AccECN | 1 1 1 | 0 1 0 | AccECN | 568 | AccECN | AccECN | 1 1 1 | 1 1 0 | AccECN (CE on SYN) | 569 | | | | | | 570 | AccECN | Nonce | 1 1 1 | 1 0 1 | classic ECN | 571 | AccECN | ECN | 1 1 1 | 0 0 1 | classic ECN | 572 | AccECN | No ECN | 1 1 1 | 0 0 0 | Not ECN | 573 | | | | | | 574 | Nonce | AccECN | 0 1 1 | 0 0 1 | classic ECN | 575 | ECN | AccECN | 0 1 1 | 0 0 1 | classic ECN | 576 | No ECN | AccECN | 0 0 0 | 0 0 0 | Not ECN | 577 | | | | | | 578 | AccECN | Broken | 1 1 1 | 1 1 1 | Not ECN | 579 | AccECN | AccECN+ | 1 1 1 | 0 1 1 | AccECN (CU) | 580 | AccECN | AccECN+ | 1 1 1 | 1 0 0 | AccECN (CU) | 581 +--------+---------+------------+--------------+--------------------+ 583 Table 2: ECN capability negotiation between Client (A) and Server (B) 585 Table 2 is divided into blocks each separated by an empty row. 587 1. The top block shows the case already described where both 588 endpoints support AccECN and how the TCP server (B) indicates 589 congestion feedback. 591 2. The second block shows the cases where the TCP client (A) 592 supports AccECN but the TCP server (B) supports some earlier 593 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 594 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 595 shown it MUST set both its half connections into the feedback 596 mode shown in the rightmost column. 598 3. The third block shows the cases where the TCP server (B) supports 599 AccECN but the TCP client (A) supports some earlier variant of 600 TCP feedback, indicated in its SYN. Therefore, as soon as an 601 AccECN-enabled TCP server (B) receives the SYN shown, it MUST set 602 both its half connections into the feedback mode shown in the 603 rightmost column. 605 4. 
The fourth block displays combinations that are either not valid or 606 currently unused. The first case (labelled 'Broken') is where all 607 bits set in the SYN are reflected by the receiver in the SYN/ACK, 608 which happens quite often if the TCP connection is proxied. In 609 this case, both ends MUST fall back to Not ECN for both half 610 connections. The other two cases (labelled 'AccECN (CU)') are 611 currently unassigned and available for an RFC to extend TCP in 612 future, tagged as 'AccECN+' (see Appendix B for possible uses). 613 For forward compatibility, as soon as an AccECN-capable TCP 614 client (A) receives either of these SYN/ACKs it MUST set both its 615 half connections into AccECN mode, as if the SYN/ACK had been 616 AE=0, CWR=1, ECE=0. 618 The following exceptional cases need some explanation: 620 ECN Nonce: An AccECN implementation, whether client or server, 621 sender or receiver, does not need to implement the ECN Nonce 622 feedback mode [RFC3540], which is being reclassified as historic 623 [I-D.ietf-tsvwg-ecn-experimentation]. AccECN is compatible with 624 an alternative ECN feedback integrity approach that does not use 625 up the ECT(1) codepoint and can be implemented solely at the 626 sender (see Section 4.3). 628 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 629 with AE=1, CWR=1 and ECE=1, might receive another SYN from host B. 630 Host A MUST then enter the same feedback mode as it would have 631 entered had it been a responding host and received the same SYN. 632 Then host A MUST send the same SYN/ACK as it would have sent had 633 it been a responding host (see the third block above). 635 3.1.2. Retransmission of the SYN 637 If the sender of an AccECN SYN times out before receiving the SYN/ 638 ACK, the sender SHOULD attempt to negotiate the use of AccECN at 639 least one more time by continuing to set all three TCP ECN flags on 640 the first retransmitted SYN (using the usual retransmission time- 641 outs).
If this first retransmission also fails to be acknowledged, 642 the sender SHOULD send subsequent retransmissions of the SYN without 643 any ECN flags set. This adds delay, in the case where a middlebox 644 drops an AccECN (or ECN) SYN deliberately. However, current 645 measurements imply that a drop is less likely to be due to middlebox 646 interference than other intermittent causes of loss, e.g. congestion, 647 wireless interference, etc. 649 Implementers MAY use other fall-back strategies if they are found to 650 be more effective (e.g. attempting to retransmit an AccECN SYN only 651 once or more than twice (most appropriate during high levels of 652 congestion); or falling back to classic ECN feedback rather than non- 653 ECN). Further it may make sense to also remove any other 654 experimental fields or options on the SYN in case a middlebox might 655 be blocking them, although the required behaviour will depend on the 656 specification of the other option(s) and any attempt to co-ordinate 657 fall-back between different modules of the stack. In any case, the 658 TCP initiator SHOULD cache failed connection attempts. If it does, 659 it SHOULD NOT give up attempting to negotiate AccECN on the SYN of 660 subsequent connection attempts until it is clear that the blockage is 661 persistently and specifically due to AccECN. The cache should be 662 arranged to expire so that the initiator will infrequently attempt to 663 check whether the problem has been resolved. 665 The fall-back procedure if the TCP server receives no ACK to 666 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 667 Section 3.2.5. 669 3.2. AccECN Feedback 671 Each Data Receiver maintains four counters, r.cep, r.ceb, r.e0b and 672 r.e1b. The CE packet counter (r.cep), counts the number of packets 673 the host receives with the CE code point in the IP ECN field, 674 including CE marks on control packets without data. 
r.ceb, r.e0b and 675 r.e1b count the number of TCP payload bytes in packets marked 676 respectively with the CE, ECT(0) and ECT(1) codepoints in their IP-ECN 677 field. When a host first enters AccECN mode, it initialises its 678 counters to r.cep = 6, r.e0b = 1 and r.ceb = r.e1b = 0 (see 679 Appendix A.5). Non-zero initial values are used to support a 680 stateless handshake (see Section 4.1) and to be distinct from cases 681 where the fields are incorrectly zeroed (e.g. by middleboxes - see 682 Section 3.2.5.4). 684 A host feeds back the CE packet counter using the Accurate ECN (ACE) 685 field, as explained in the next section. And it feeds back all the 686 byte counters using the AccECN TCP Option, as specified in 687 Section 3.2.4. Whenever a host feeds back the value of any counter, 688 it MUST report the most recent value, no matter whether it is in a 689 pure ACK, an ACK with new payload data or a retransmission. 691 3.2.1. The ACE Field 693 After AccECN has been negotiated on the SYN and SYN/ACK, both hosts 694 overload the three TCP flags (AE, CWR and ECE) in the main TCP header 695 as one 3-bit field. Then the field is given a new name, ACE, as 696 shown in Figure 2.
698   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
699 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
700 |               |           |           | U | A | P | R | S | F |
701 | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
702 |               |           |           | G | K | H | T | N | N |
703 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
705 Figure 2: Definition of the ACE field within bytes 13 and 14 of the 706 TCP Header (when AccECN has been negotiated and SYN=0).
708 The original definition of these three flags in the TCP header, 709 including the addition of support for the ECN Nonce, is shown for 710 comparison in Figure 1.
This specification does not rename these 711 three TCP flags to ACE for always; it merely overloads them with 712 another name and definition once an AccECN connection has been 713 established. 715 A host MUST interpret the AE, CWR and ECE flags as the 3-bit ACE 716 counter on a segment with the SYN flag cleared (SYN=0) that it sends 717 or receives if both of its half-connections are set into AccECN mode 718 having successfully negotiated AccECN (see Section 3.1). A host MUST 719 NOT interpret the 3 flags as a 3-bit ACE field on any segment with 720 SYN=1 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete 721 or has not succeeded. 723 Both parts of each of these conditions are equally important. For 724 instance, even if AccECN negotiation has been successful, the ACE 725 field is not defined on any segments with SYN=1 (e.g. a 726 retransmission of an unacknowledged SYN/ACK, or when both ends send 727 SYN/ACKs after AccECN support has been successfully negotiated during 728 a simultaneous open). 730 The ACE field encodes the three least significant bits of the r.cep 731 counter, therefore its initial value will be 0b110 (decimal 6). If 732 the SYN/ACK was CE marked, the client MUST increase its r.cep counter 733 before it sends its first ACK, therefore the initial value of the ACE 734 field will be 0b111 (decimal 7). To support a stateless handshake 735 (see Section 4.1), these values have been chosen deliberately so that 736 they are distinct from [RFC5562] behaviour, where the TCP client 737 would set ECE on the first ACK as feedback for a CE mark on the SYN/ 738 ACK. 740 3.2.2. Testing for Zeroing of the ACE Field 742 Section 3.2.1 required the Data Receiver to initialize the r.cep 743 counter to a non-zero value. Therefore, in either direction the 744 initial value of the ACE field ought to be non-zero. 
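The ACE encoding of Section 3.2.1 can be sketched in a few lines of Python. This is an illustration only, not part of the specification: the function names are hypothetical, and the bit order (AE as the most significant of the three bits) is an assumption read from the left-to-right order of Figure 2.

```python
# Illustrative sketch only; names are hypothetical, not defined by the draft.
INITIAL_R_CEP = 6  # initial value of r.cep stated in Section 3.2


def ace_from_counter(r_cep):
    """Return the ACE value: the three least significant bits of r.cep."""
    return r_cep & 0b111


def ace_to_flags(ace):
    """Split a 3-bit ACE value into (AE, CWR, ECE).

    Assumption: AE is the most significant bit, as suggested by Figure 2.
    """
    return (ace >> 2) & 1, (ace >> 1) & 1, ace & 1


def flags_to_ace(ae, cwr, ece):
    """Recombine the three header flags into the 3-bit ACE value."""
    return (ae << 2) | (cwr << 1) | ece


# Initial ACE value, and the value after one CE mark (e.g. a CE-marked SYN/ACK).
first_ace = ace_from_counter(INITIAL_R_CEP)
first_ace_after_ce = ace_from_counter(INITIAL_R_CEP + 1)
```

With these definitions, first_ace is 0b110 and first_ace_after_ce is 0b111, matching the two initial values described in Section 3.2.1, and neither is zero.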
746 If AccECN has been successfully negotiated, the Data Sender SHOULD 747 check the initial value of the ACE field in the first arriving 748 segment with SYN=0. If the initial value of the ACE field is zero 749 (0b000), the Data Sender MUST disable sending ECN-capable packets for 750 the remainder of the half-connection by setting the IP/ECN field in 751 all subsequent packets to Not-ECT. 753 For example, the server checks the ACK of the SYN/ACK or the first 754 data segment from the client, while the client checks the first data 755 segment from the server. More precisely, the "first segment with 756 SYN=0" is defined as: the segment with SYN=0 that i) acknowledges 757 sequence space at least covering the initial sequence number (ISN) 758 plus 1; and ii) arrives before any other segments with SYN=0 so it is 759 unlikely to be a retransmission. If no such segment arrives (e.g. 760 because it is lost and the ISN is first acknowledged by a subsequent 761 segment), no test for invalid initialization can be conducted, and 762 the half-connection will continue in AccECN mode. 764 Note that the Data Sender MUST NOT test whether the arriving counter 765 in the initial ACE field has been initialized to a specific valid 766 value - the above check solely tests whether the ACE fields have been 767 incorrectly zeroed. This allows hosts to use different initial 768 values as an additional signalling channel in future. 770 3.2.3. Safety against Ambiguity of the ACE Field 772 If too many CE-marked segments are acknowledged at once, or if a long 773 run of ACKs is lost, the 3-bit counter in the ACE field might have 774 cycled between two ACKs arriving at the Data Sender. 776 Therefore an AccECN Data Receiver SHOULD immediately send an ACK once 777 'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be 778 2 and MUST be no greater than 6. 
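The ambiguity that this rule guards against follows from simple modulo arithmetic, sketched below in Python. The sketch is illustrative only: the function names are hypothetical, and the lower bound of 1 on 'n' is an assumption (the text above only gives the recommended value and the upper bound).

```python
# Illustrative sketch only; names are hypothetical, not defined by the draft.
ACE_MODULUS = 8  # the ACE field is a 3-bit counter


def ace_delta(prev_ace, new_ace):
    """Smallest number of new CE marks consistent with two successive ACE values."""
    return (new_ace - prev_ace) % ACE_MODULUS


def must_ack_immediately(ce_marks_since_last_ack, n=2):
    """Data Receiver rule: ACK at once when 'n' CE marks have arrived.

    'n' SHOULD be 2 and MUST be no greater than 6; n >= 1 is assumed here.
    """
    if not 1 <= n <= 6:
        raise ValueError("n must be between 1 and 6")
    return ce_marks_since_last_ack >= n


# Eight CE marks between two ACKs look exactly like no CE marks at all.
ace_before = 6
ace_after_eight_marks = (ace_before + 8) % ACE_MODULUS
ambiguous_delta = ace_delta(ace_before, ace_after_eight_marks)
```

Here ambiguous_delta evaluates to 0 even though eight marks arrived, which is why the Data Receiver is asked to bound the number of CE marks covered by any single ACK.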
780 If the Data Sender has not received AccECN TCP Options to give it 781 more dependable information, and it detects that the ACE field could 782 have cycled under the prevailing conditions, it SHOULD conservatively 783 assume that the counter did cycle. It can detect if the counter 784 could have cycled by using the jump in the acknowledgement number 785 since the last ACK to calculate or estimate how many segments could 786 have been acknowledged. An example algorithm to implement this 787 policy is given in Appendix A.2. An implementer MAY develop an 788 alternative algorithm as long as it satisfies these requirements. 790 If missing acknowledgement numbers arrive later (reordering) and 791 prove that the counter did not cycle, the Data Sender MAY attempt to 792 neutralise the effect of any action it took based on a conservative 793 assumption that it later found to be incorrect. 795 3.2.4. The AccECN Option 797 The AccECN Option is defined as shown below in Figure 3. It consists 798 of three 24-bit fields that provide the 24 least significant bits of 799 the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E' 800 of each field name stands for 'Echo'.
802  0                   1                   2                   3
803  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
804 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
805 |  Kind = TBD1  |  Length = 11  |          EE0B field           |
806 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
807 | EE0B (cont'd) |                  ECEB field                   |
808 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
809 |                  EE1B field                   |
810 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
812 Figure 3: The AccECN Option
814 The Data Receiver MUST set the Kind field to TBD1, which is 815 registered in Section 6 as a new TCP option Kind called AccECN. An 816 experimental TCP option with Kind=254 MAY be used for initial 817 experiments, with magic number 0xACCE.
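As an illustration of the layout in Figure 3, the following Python sketch packs and unpacks a full-length AccECN Option. It is not the algorithm of Appendix A.1 and it is not normative: the function names are hypothetical, the Kind value is left to the caller because the draft only gives it as TBD1, and the experimental form with the 0xACCE magic number is not handled.

```python
# Illustrative sketch only; names are hypothetical, not defined by the draft.
FIELD_MASK = 0xFFFFFF   # each field carries the 24 least significant bits
FULL_LENGTH = 11        # Kind (1) + Length (1) + three 3-byte fields


def encode_accecn_option(kind, r_e0b, r_ceb, r_e1b):
    """Build a full-length option: Kind, Length, EE0B, ECEB, EE1B.

    'kind' is supplied by the caller; the draft leaves the value as TBD1.
    Fields are written in network byte order (big-endian).
    """
    fields = b"".join(
        (counter & FIELD_MASK).to_bytes(3, "big")
        for counter in (r_e0b, r_ceb, r_e1b)
    )
    return bytes([kind, FULL_LENGTH]) + fields


def decode_accecn_option(option):
    """Return (EE0B, ECEB, EE1B) from a full-length option.

    Shorter lengths are outside the scope of this sketch and are rejected.
    """
    if len(option) != FULL_LENGTH or option[1] != FULL_LENGTH:
        raise ValueError("not a full-length AccECN Option")
    ee0b = int.from_bytes(option[2:5], "big")
    eceb = int.from_bytes(option[5:8], "big")
    ee1b = int.from_bytes(option[8:11], "big")
    return ee0b, eceb, ee1b
```

Because only the 24 least significant bits are carried, a counter value of 0x1000001 is encoded identically to 1; recovering the full counter at the Data Sender is the subject of Appendix A.1.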
819 Appendix A.1 gives an example algorithm for the Data Receiver to 820 encode its byte counters into the AccECN Option, and for the Data 821 Sender to decode the AccECN Option fields into its byte counters. 823 Note that there is no field to feed back Not-ECT bytes. Nonetheless, 824 an algorithm for the Data Sender to calculate the number of payload 825 bytes received as Not-ECT is given in Appendix A.5. 827 Whenever a Data Receiver sends an AccECN Option, the rules in 828 Section 3.2.6 expect it to always send a full-length option. To cope 829 with option space limitations, it can omit unchanged fields from the 830 tail of the option, as long as it preserves the order of the 831 remaining fields and includes any field that has changed. The length 832 field MUST indicate which fields are present as follows:
834 Length=11: EE0B, ECEB, EE1B
836 Length=8: EE0B, ECEB
838 Length=5: EE0B
840 Length=2: (empty)
842 The empty option of Length=2 is provided to allow for a case where an 843 AccECN Option has to be sent (e.g. on the SYN/ACK to test the path), 844 but there is very limited space for the option. For initial 845 experiments, the Length field MUST be 2 greater to accommodate the 846 16-bit magic number. 848 All implementations of a Data Sender MUST be able to read in AccECN 849 Options of any of the above lengths. They MUST ignore an AccECN 850 Option of any other length. 852 3.2.5. Path Traversal of the AccECN Option 854 3.2.5.1. Testing the AccECN Option during the Handshake 856 The TCP client MUST NOT include the AccECN TCP Option on the SYN. 857 Nonetheless, if the AccECN negotiation using the ECN flags in the 858 main TCP header (Section 3.1) is successful, it implicitly declares 859 that the endpoints also support the AccECN TCP Option. A fall-back 860 strategy for the loss of the SYN (possibly due to middlebox 861 interference) is specified in Section 3.1.2.
863 A TCP server that confirms its support for AccECN (in response to an 864 AccECN SYN from the client as described in Section 3.1) SHOULD also 865 include an AccECN TCP Option in the SYN/ACK. 867 A TCP client that has successfully negotiated AccECN SHOULD include 868 an AccECN Option in the first ACK at the end of the 3WHS. However, 869 this first ACK is not delivered reliably, so the TCP client SHOULD 870 also include an AccECN Option on the first data segment it sends (if 871 it ever sends one). 873 A host MAY omit the AccECN Option in any of these three cases 874 if it has cached knowledge that the packet would be likely to be 875 blocked on the path to the other host if it included an AccECN 876 Option. 878 3.2.5.2. Testing for Loss of Packets Carrying the AccECN Option 880 If after the normal TCP timeout the TCP server has not received an 881 ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been 882 lost, e.g. due to congestion, or a middlebox might be blocking the 883 AccECN Option. To expedite connection setup, the TCP server SHOULD 884 retransmit the SYN/ACK with the same TCP flags (AE, CWR and ECE) but 885 with no AccECN Option. If this retransmission times out, to expedite 886 connection setup, the TCP server SHOULD disable AccECN and ECN for 887 this connection by retransmitting the SYN/ACK with AE=CWR=ECE=0 and 888 no AccECN Option. Implementers MAY use other fall-back strategies if 889 they are found to be more effective (e.g. falling back to classic 890 ECN feedback on the first retransmission; retrying the AccECN Option 891 for a second time before fall-back (most appropriate during high 892 levels of congestion); or falling back to classic ECN feedback rather 893 than non-ECN on the third retransmission). 895 If the TCP client detects that the first data segment it sent with 896 the AccECN Option was lost, it SHOULD fall back to no AccECN Option 897 on the retransmission.
Again, implementers MAY use other fall-back 898 strategies such as attempting to retransmit a second segment with the 899 AccECN Option before fall-back, and/or caching whether the AccECN 900 Option is blocked for subsequent connections. 902 Either host MAY include the AccECN Option in a subsequent segment to 903 retest whether the AccECN Option can traverse the path. 905 If the TCP server receives a second SYN with a request for AccECN 906 support, it should resend the SYN/ACK, again confirming its support 907 for AccECN, but this time without the AccECN Option. This approach 908 rules out any interference by middleboxes that may drop packets with 909 unknown options, even though it is more likely that the SYN/ACK would 910 have been lost due to congestion. The TCP server MAY try to send 911 another packet with the AccECN Option at a later point during the 912 connection but should monitor if that packet got lost as well, in 913 which case it SHOULD disable the sending of the AccECN Option for 914 this half-connection. 916 Similarly, an AccECN end-point MAY separately memorize which data 917 packets carried an AccECN Option and disable the sending of AccECN 918 Options if the loss probability of those packets is significantly 919 higher than that of all other data packets in the same connection. 921 3.2.5.3. Testing for Stripping of the AccECN Option 923 If the TCP client has successfully negotiated AccECN but does not 924 receive an AccECN Option on the SYN/ACK, it switches into a mode that 925 assumes that the AccECN Option is not available for this half 926 connection. 928 Similarly, if the TCP server has successfully negotiated AccECN but 929 does not receive an AccECN Option on the first segment that 930 acknowledges sequence space at least covering the ISN, it switches 931 into a mode that assumes that the AccECN Option is not available for 932 this half connection. 
934 While a host is in this mode that assumes incoming AccECN Options are 935 not available, it MUST adopt the conservative interpretation of the 936 ACE field discussed in Section 3.2.3. However, it cannot make any 937 assumption about support of outgoing AccECN Options on the other half 938 connection, so it SHOULD continue to send the AccECN Option itself 939 (unless it has established that sending the AccECN Option is causing 940 packets to be blocked as in Section 3.2.5.2). 942 If a host is in the mode that assumes incoming AccECN Options are not 943 available, but it receives an AccECN Option at any later point during 944 the connection, this clearly indicates that the AccECN Option is not 945 blocked on the respective path, and the AccECN endpoint MAY switch 946 out of the mode that assumes the AccECN Option is not available for 947 this half connection. 949 3.2.5.4. Test for Zeroing of the AccECN Option 951 For a related test for invalid initialization of the ACE field, see 952 Section 3.2.2. 954 Section 3.2 required the Data Receiver to initialize the r.e0b 955 counter to a non-zero value. Therefore, in either direction the 956 initial value of the EE0B field in the AccECN Option (if one exists) 957 ought to be non-zero. If AccECN has been negotiated: 959 o the TCP server MAY check the initial value of the EE0B field in 960 the first segment that acknowledges sequence space that at least 961 covers the ISN plus 1. If the initial value of the EE0B field is 962 zero, the server will switch into a mode that ignores the AccECN 963 Option for this half connection. 965 o the TCP client MAY check the initial value of the EE0B field on 966 the SYN/ACK. If the initial value of the EE0B field is zero, the 967 client will switch into a mode that ignores the AccECN Option for 968 this half connection. 970 While a host is in the mode that ignores the AccECN Option, it MUST 971 adopt the conservative interpretation of the ACE field discussed in 972 Section 3.2.3.
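The stripping test of Section 3.2.5.3 and the optional zeroing test above both lead to the same consequence for the ACE field, which the following Python sketch makes explicit. It is illustrative only: the enumeration, its member names and the function names are hypothetical, and the sketch assumes the caller passes the EE0B value from the first relevant segment (the SYN/ACK for a client, the first segment covering the ISN for a server), or None if that segment carried no AccECN Option.

```python
# Illustrative sketch only; names are hypothetical, not defined by the draft.
from enum import Enum
from typing import Optional


class OptionMode(Enum):
    USABLE = "usable"                            # option present and EE0B non-zero
    ASSUMED_UNAVAILABLE = "assumed unavailable"  # Section 3.2.5.3: option missing
    IGNORED = "ignored"                          # Section 3.2.5.4: EE0B was zero


def mode_after_first_segment(initial_ee0b: Optional[int]) -> OptionMode:
    """Classify incoming AccECN Options for this half connection."""
    if initial_ee0b is None:
        return OptionMode.ASSUMED_UNAVAILABLE
    if initial_ee0b == 0:
        return OptionMode.IGNORED
    return OptionMode.USABLE


def needs_conservative_ace(mode: OptionMode) -> bool:
    """Both fall-back modes require the conservative ACE interpretation."""
    return mode is not OptionMode.USABLE
```

Note that the sketch only tests for zero; in line with the text, it makes no attempt to check that EE0B equals any specific initial value.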
974 Note that the Data Sender MUST NOT test whether the arriving byte 975 counters in the initial AccECN Option have been initialized to 976 specific valid values - the above checks solely test whether these 977 fields have been incorrectly zeroed. This allows hosts to use 978 different initial values as an additional signalling channel in 979 future. Also note that the initial value of either field might be 980 greater than its expected initial value, because the counters might 981 already have been incremented. Nonetheless, the initial values of 982 the counters have been chosen so that they cannot wrap to zero on 983 these initial segments. 985 3.2.5.5. Consistency between AccECN Feedback Fields 987 When the AccECN Option is available it supplements but does not 988 replace the ACE field. An endpoint using AccECN feedback MUST always 989 consider the information provided in the ACE field whether or not the 990 AccECN Option is also available. 992 If the AccECN option is present, the s.cep counter might increase 993 while the s.ceb counter does not (e.g. due to a CE-marked control 994 packet). The sender's response to such a situation is out of scope, 995 and needs to be dealt with in a specification that uses ECN-capable 996 control packets. Theoretically, this situation could also occur if a 997 middlebox mangled the AccECN Option but not the ACE field. However, 998 the Data Sender has to assume that the integrity of the AccECN Option 999 is sound, based on the above test of the well-known initial values 1000 and optionally other integrity tests (Section 4.3). 1002 If either end-point detects that the s.ceb counter has increased but 1003 the s.cep has not (and by testing ACK coverage it is certain how much 1004 the ACE field has wrapped), this invalid protocol transition has to 1005 be due to some form of feedback mangling. 
So, the Data Sender MUST 1006 disable sending ECN-capable packets for the remainder of the half- 1007 connection by setting the IP/ECN field in all subsequent packets to 1008 Not-ECT. 1010 3.2.6. Usage of the AccECN TCP Option 1012 The following rules determine when a Data Receiver in AccECN mode 1013 sends the AccECN TCP Option, and which fields to include: 1015 Change-Triggered ACKs: If an arriving packet increments a different 1016 byte counter to that incremented by the previous packet, the Data 1017 Receiver SHOULD immediately send an ACK with an AccECN Option, 1018 without waiting for the next delayed ACK (this is in addition to 1019 the safety recommendation in Section 3.2.3 against ambiguity of 1020 the ACE field). Certain offload hardware might not be able to 1021 support change-triggered ACKs, but otherwise it is important to 1022 keep exceptions to this rule to a minimum so that Data Senders can 1023 generally rely on this behaviour; 1025 Continual Repetition: Otherwise, if arriving packets continue to 1026 increment the same byte counter, the Data Receiver can include an 1027 AccECN Option on most or all (delayed) ACKs, but it does not have 1028 to. If option space is limited on a particular ACK, the Data 1029 Receiver MUST give precedence to SACK information about loss. It 1030 SHOULD include an AccECN Option if the r.ceb counter has 1031 incremented and it MAY include an AccECN Option if r.e0b or 1032 r.e1b has incremented; 1034 Full-Length Options Preferred: It SHOULD always use full-length 1035 AccECN Options.
It MAY use shorter AccECN Options if space is 1036 limited, but it MUST include the counter(s) that have incremented 1037 since the previous AccECN Option and it MUST only truncate fields 1038 from the right-hand tail of the option to preserve the order of 1039 the remaining fields (see Section 3.2.4); 1041 Beaconing Full-Length Options: Nonetheless, it MUST include a full- 1042 length AccECN TCP Option on at least three ACKs per RTT, or on all 1043 ACKs if there are less than three per RTT (see Appendix A.4 for an 1044 example algorithm that satisfies this requirement). 1046 The following example series of arriving IP/ECN fields illustrates 1047 when a Data Receiver will emit an ACK if it is using a delayed ACK 1048 factor of 2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> 1049 ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK. 1051 For the avoidance of doubt, the change-triggered ACK mechanism is 1052 deliberately worded to ignore the arrival of a control packet with no 1053 payload, which therefore does not alter any byte counters, because it 1054 is important that TCP does not acknowledge pure ACKs. The change- 1055 triggered ACK approach will lead to some additional ACKs but it feeds 1056 back the timing and the order in which ECN marks are received with 1057 minimal additional complexity. 1059 Implementation note: sending an AccECN Option each time a different 1060 counter changes and including a full-length AccECN Option on every 1061 delayed ACK will satisfy the requirements described above and might 1062 be the easiest implementation, as long as sufficient space is 1063 available in each ACK (in total and in the option space). 1065 Appendix A.3 gives an example algorithm to estimate the number of 1066 marked bytes from the ACE field alone, if the AccECN Option is not 1067 available. 
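The example series of arriving IP/ECN fields given earlier in this section can be reproduced with a short Python simulation. This is a sketch under stated assumptions, not a reference implementation: the function name is hypothetical, every packet is assumed to carry payload and to be marked ECT(1) (01), ECT(0) (10) or CE (11), the very first packet is treated as a change, and the delayed-ACK timer is ignored.

```python
# Illustrative sketch only; names are hypothetical, not defined by the draft.
def ack_decisions(ecn_fields, delayed_ack_factor=2):
    """Return, per arriving packet, whether the Data Receiver emits an ACK.

    An ACK is sent when the packet increments a different byte counter from
    the previous packet (change-triggered), or when 'delayed_ack_factor'
    segments are waiting to be acknowledged.
    """
    decisions = []
    previous = None   # assumption: the first packet counts as a change
    unacked = 0
    for field in ecn_fields:
        unacked += 1
        changed = field != previous
        send_ack = changed or unacked >= delayed_ack_factor
        if send_ack:
            unacked = 0
        decisions.append(send_ack)
        previous = field
    return decisions


series = ["01", "01", "01", "10", "10", "01", "01", "11", "01"]
result = ack_decisions(series)
# result == [True, False, True, True, False, True, False, True, True]
```

The True entries line up with the "-> ACK" positions in the example series: 01 -> ACK, 01, 01 -> ACK, 10 -> ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK.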
1069 If a host has determined that segments with the AccECN Option always 1070 seem to be discarded somewhere along the path, it is no longer 1071 obliged to follow the above rules. 1073 3.3. AccECN Compliance by TCP Proxies, Offload Engines and other 1074 Middleboxes 1076 A large class of middleboxes split TCP connections. Such a middlebox 1077 would be compliant with the AccECN protocol if the TCP implementation 1078 on each side complied with the present AccECN specification and each 1079 side negotiated AccECN independently of the other side. 1081 Another large class of middleboxes intervenes to some degree at the 1082 transport layer, but attempts to be transparent (invisible) to the 1083 end-to-end connection. A subset of this class of middleboxes 1084 attempts to `normalise' the TCP wire protocol by checking that all 1085 values in header fields comply with a rather narrow interpretation of 1086 the TCP specifications. To comply with the present AccECN 1087 specification, such a middlebox MUST NOT change the ACE field or the 1088 AccECN Option and it MUST attempt to preserve the timing of each ACK 1089 (for example, if it coalesced ACKs it would not be AccECN-compliant). 1090 A middlebox claiming to be transparent at the transport layer MUST 1091 forward the AccECN TCP Option unaltered, whether or not the length 1092 value matches one of those specified in Section 3.2.4, and whether or 1093 not the initial values of the byte-counter fields are correct. This 1094 is because blocking apparently invalid values does not improve 1095 security (because AccECN hosts are required to ignore invalid values 1096 anyway), while it prevents the standardised set of values being 1097 extended in future (because outdated normalisers would block updated 1098 hosts from using the extended AccECN standard). 
1100 Hardware to offload certain TCP processing represents another large 1101 class of middleboxes, even though it is often a function of a host's 1102 network interface and rarely in its own 'box'. Leeway has been 1103 allowed in the present AccECN specification in the expectation that 1104 offload hardware could comply and still serve its function. 1105 Nonetheless, such hardware MUST attempt to preserve the timing of 1106 each ACK (for example, if it coalesced ACKs it would not be AccECN- 1107 compliant). 1109 4. Interaction with Other TCP Variants 1111 This section is informative, not normative. 1113 4.1. Compatibility with SYN Cookies 1115 A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to 1116 protect itself from SYN flooding attacks. It places minimal commonly 1117 used connection state in the SYN/ACK, and deliberately does not hold 1118 any state while waiting for the subsequent ACK (e.g. it closes the 1119 thread). Therefore it cannot record the fact that it entered AccECN 1120 mode for both half-connections. Indeed, it cannot even remember 1121 whether it negotiated the use of classic ECN [RFC3168]. 1123 Nonetheless, such a server can determine that it negotiated AccECN as 1124 follows. If a TCP server using SYN Cookies supports AccECN and if 1125 the first segment it receives that at least covers the ISN contains 1126 an ACE field with the value 0b110 or 0b111, it can assume that: 1128 o the TCP client must have requested AccECN support on the SYN 1130 o it (the server) must have confirmed that it supported AccECN 1131 Therefore the server can switch itself into AccECN mode, and continue 1132 as if it had never forgotten that it switched itself into AccECN mode 1133 earlier. For other values of ACE field, heuristics to infer what 1134 other type of ECN the client supports are out of scope. 1136 4.2. 
Compatibility with Other TCP Options and Experiments 1138 AccECN is compatible (at least on paper) with the most commonly used 1139 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1140 also compatible with the recent promising experimental TCP options 1141 TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]). 1142 AccECN is friendly to all these protocols, because space for TCP 1143 options is particularly scarce on the SYN, where AccECN consumes zero 1144 additional header space. 1146 When option space is under pressure from other options, Section 3.2.6 1147 provides guidance on how important it is to send an AccECN Option and 1148 whether it needs to be a full-length option. 1150 4.3. Compatibility with Feedback Integrity Mechanisms 1152 Three alternative mechanisms are available to assure the integrity of 1153 ECN and/or loss signals. AccECN is compatible with any of these 1154 approaches: 1156 o The Data Sender can test the integrity of the receiver's ECN (or 1157 loss) feedback by occasionally setting the IP-ECN field to a value 1158 normally only set by the network (and/or deliberately leaving a 1159 sequence number gap). Then it can test whether the Data 1160 Receiver's feedback faithfully reports what it expects 1161 [I-D.moncaster-tcpm-rcv-cheat]. Unlike the ECN Nonce [RFC3540], 1162 this approach does not waste the ECT(1) codepoint in the IP 1163 header, it does not require standardisation and it does not rely 1164 on misbehaving receivers volunteering to reveal feedback 1165 information that allows them to be detected. However, setting the 1166 CE mark by the sender might conceal actual congestion feedback 1167 from the network and should therefore only be done sparsely. 1169 o Networks generate congestion signals when they are becoming 1170 congested, so networks are more likely than Data Senders to be 1171 concerned about the integrity of the receiver's feedback of these 1172 signals. 
A network can enforce a congestion response to its ECN 1173 markings (or packet losses) using congestion exposure (ConEx) 1174 audit [RFC7713]. Whether the receiver or a downstream network is 1175 suppressing congestion feedback or the sender is unresponsive to 1176 the feedback, or both, ConEx audit can neutralise any advantage 1177 that any of these three parties would otherwise gain. 1179 ConEx is a change to the Data Sender that is most useful when 1180 combined with AccECN. Without AccECN, the ConEx behaviour of a 1181 Data Sender would have to be more conservative than would be 1182 necessary if it had the accurate feedback of AccECN. 1184 o The TCP authentication option (TCP-AO [RFC5925]) can be used to 1185 detect any tampering with AccECN feedback between the Data 1186 Receiver and the Data Sender (whether malicious or accidental). 1187 The AccECN fields are immutable end-to-end, so they are amenable 1188 to TCP-AO protection, which covers TCP options by default. 1189 However, TCP-AO is too brittle to use on many end-to-end 1190 paths, where middleboxes can make verification fail in their 1191 attempts to improve performance or security, e.g. by 1192 resegmentation or shifting the sequence space. 1194 Originally the ECN Nonce [RFC3540] was proposed to ensure integrity 1195 of congestion feedback. With minor changes AccECN could be optimised 1196 for the possibility that the ECT(1) codepoint might be used as an ECN 1197 Nonce. However, given RFC 3540 is being reclassified as historic, 1198 the AccECN design has been generalised so that it ought to be able to 1199 support other possible uses of the ECT(1) codepoint, such as a lower 1200 severity or a more instant congestion signal than CE. 1202 5. Protocol Properties 1204 This section is informative, not normative. It describes how well the 1205 protocol satisfies the agreed requirements for a more accurate ECN 1206 feedback protocol [RFC7560].
1208 Accuracy: From each ACK, the Data Sender can infer the number of new 1209 CE-marked segments since the previous ACK. This provides better 1210 accuracy on CE feedback than classic ECN. In addition, if the 1211 AccECN Option is present (not blocked by the network path), the 1212 numbers of bytes marked with CE, ECT(1) and ECT(0) are provided. 1214 Overhead: The AccECN scheme is divided into two parts. The 1215 essential part reuses the 3 flags already assigned to ECN in the 1216 TCP header. The supplementary part adds a TCP option 1217 consuming up to 11 bytes. However, no TCP option is consumed in 1218 the SYN. 1220 Ordering: The order in which marks arrive at the Data Receiver is 1221 preserved in AccECN feedback, because the Data Receiver is 1222 expected to send an ACK immediately whenever a different mark 1223 arrives. 1225 Timeliness: While the same ECN markings are arriving continually at 1226 the Data Receiver, it can defer ACKs as TCP does normally, but it 1227 will immediately send an ACK as soon as a different ECN marking 1228 arrives. 1230 Timeliness vs Overhead: Change-Triggered ACKs are intended to enable 1231 latency-sensitive uses of ECN feedback by capturing the timing of 1232 transitions but not wasting resources while the state of the 1233 signalling system is stable. The receiver can control how 1234 frequently it sends the AccECN TCP Option and therefore it can 1235 control the overhead induced by AccECN. 1237 Resilience: All information is provided based on counters. 1238 Therefore if ACKs are lost, the counters on the first ACK 1239 following the losses allow the Data Sender to immediately recover 1240 the number of the ECN markings that it missed. 1242 Resilience against Bias: Because feedback is based on repetition of 1243 counters, random losses do not remove any information, they only 1244 delay it.
           Therefore, even though some ACKs are change-triggered,
1245       random losses will not alter the proportions of the different ECN
1246       markings in the feedback.

1248    Resilience vs Overhead: If space is limited in some segments (e.g.
1249       because more options are needed on some segments, such as the SACK
1250       option after loss), the Data Receiver can send AccECN Options less
1251       frequently or truncate fields that have not changed, usually down
1252       to as little as 5 bytes. However, it has to send a full-sized
1253       AccECN Option at least three times per RTT, which the Data Sender
1254       can rely on as a regular beacon or checkpoint.

1256    Resilience vs Timeliness and Ordering: Ordering information and the
1257       timing of transitions cannot be communicated in three cases: i)
1258       during ACK loss; ii) if something on the path strips the AccECN
1259       Option; or iii) if the Data Receiver is unable to support
1260       Change-Triggered ACKs.

1262    Complexity: An AccECN implementation solely involves simple counter
1263       increments, some modulo arithmetic to communicate the least
1264       significant bits and allow for wrap, and some heuristics for
1265       safety against fields cycling due to prolonged periods of ACK
1266       loss. Each host needs to maintain eight additional counters. The
1267       hosts have to apply some additional tests to detect tampering by
1268       middleboxes, but in general the protocol is simple to understand,
1269       simple to implement and requires few cycles per packet to execute.

1271    Integrity: AccECN is compatible with at least three approaches that
1272       can assure the integrity of ECN feedback. If the AccECN Option is
1273       stripped the resolution of the feedback is degraded, but the
1274       integrity of this degraded feedback can still be assured.

1276    Backward Compatibility: If only one endpoint supports the AccECN
1277       scheme, it will fall back to the most advanced ECN feedback scheme
1278       supported by the other end.
1280    Backward Compatibility: If the AccECN Option is stripped by a
1281       middlebox, AccECN still provides basic congestion feedback in the
1282       ACE field. Further, AccECN can be used to detect mangling of the
1283       IP ECN field; mangling of the TCP ECN flags; blocking of
1284       ECT-marked segments; and blocking of segments carrying the AccECN
1285       Option. It can detect these conditions during TCP's 3WHS so that
1286       it can fall back to operation without ECN and/or operation without
1287       the AccECN Option.

1289    Forward Compatibility: The behaviour of endpoints and middleboxes is
1290       carefully defined for all reserved or currently unused codepoints
1291       in the scheme, to ensure that any blocking of anomalous values is
1292       always at least under reversible policy control.

1294 6. IANA Considerations

1296    This document reassigns bit 7 of the TCP header flags to the AccECN
1297    experiment. This bit was previously called the Nonce Sum (NS) flag
1298    [RFC3540], but RFC 3540 is being reclassified as historic. The flag
1299    will now be defined as:

1301                +-----+-------------------+-----------+
1302                | Bit | Name              | Reference |
1303                +-----+-------------------+-----------+
1304                | 7   | AE (Accurate ECN) | RFC XXXX  |
1305                +-----+-------------------+-----------+

1307    [TO BE REMOVED: This registration should take place at the following
1308    location:
1309    https://www.iana.org/assignments/tcp-header-flags/tcp-header-flags.xhtml#tcp-header-flags-1 ]

1311    This document also defines a new TCP option for AccECN, assigned a
1312    value of TBD1 (decimal) from the TCP option space.
        This value is
1313    defined as:

1315         +------+--------+-----------------------+-----------+
1316         | Kind | Length | Meaning               | Reference |
1317         +------+--------+-----------------------+-----------+
1318         | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
1319         +------+--------+-----------------------+-----------+

1321    [TO BE REMOVED: This registration should take place at the following
1322    location:
1323    http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1 ]

1324    Early implementation before the IANA allocation MUST follow [RFC6994]
1325    and use experimental option 254 and magic number 0xACCE (16 bits),
1326    then migrate to the new option after the allocation.

1328 7. Security Considerations

1330    If ever the supplementary part of AccECN based on the new AccECN TCP
1331    Option is unusable (due for example to middlebox interference) the
1332    essential part of AccECN's congestion feedback offers only limited
1333    resilience to long runs of ACK loss (see Section 3.2.3). These
1334    problems are unlikely to be due to malicious intervention (because if
1335    an attacker could strip a TCP option or discard a long run of ACKs it
1336    could wreak other arbitrary havoc). However, it would be of concern
1337    if AccECN's resilience could be indirectly compromised during a
1338    flooding attack. AccECN is still considered safe though, because if
1339    the option is not presented, the AccECN Data Sender is then required
1340    to switch to more conservative assumptions about wrap of congestion
1341    indication counters (see Section 3.2.3 and Appendix A.2).

1343    Section 4.1 describes how a TCP server can negotiate AccECN and use
1344    the SYN cookie method for mitigating SYN flooding attacks.

1346    There is concern that ECN markings could be altered or suppressed,
1347    particularly because a misbehaving Data Receiver could increase its
1348    own throughput at the expense of others.
        AccECN is compatible with
1349    the three schemes known to assure the integrity of ECN feedback (see
1350    Section 4.3 for details). If the AccECN Option is stripped by an
1351    incorrectly implemented middlebox, the resolution of the feedback
1352    will be degraded, but the integrity of this degraded information can
1353    still be assured.

1355    The AccECN protocol is not believed to introduce any new privacy
1356    concerns, because it merely counts and feeds back signals at the
1357    transport layer that had already been visible at the IP layer.

1359 8. Acknowledgements

1361    We want to thank Koen De Schepper, Praveen Balasubramanian and
1362    Michael Welzl for their input and discussion. The idea of using the
1363    three ECN-related TCP flags as one field for more accurate TCP-ECN
1364    feedback was first introduced in the re-ECN protocol that was the
1365    ancestor of ConEx.

1367    Bob Briscoe was part-funded by the European Community under its
1368    Seventh Framework Programme through the Reducing Internet Transport
1369    Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
1370    (ICT-317756). The views expressed here are solely those of the
1371    authors.

1373    This work is partly supported by the European Commission under
1374    Horizon 2020 grant agreement no. 688421 Measurement and Architecture
1375    for a Middleboxed Internet (MAMI), and by the Swiss State Secretariat
1376    for Education, Research, and Innovation under contract no. 15.0268.
1377    This support does not imply endorsement.

1379 9. Comments Solicited

1381    Comments and questions are encouraged and very welcome. They can be
1382    addressed to the IETF TCP maintenance and minor modifications working
1383    group mailing list, and/or to the authors.

1385 10. References

1387 10.1. Normative References

1389    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1390               Requirement Levels", BCP 14, RFC 2119,
1391               DOI 10.17487/RFC2119, March 1997.

1394    [RFC3168]  Ramakrishnan, K., Floyd, S., and D.
               Black, "The Addition
1395               of Explicit Congestion Notification (ECN) to IP",
1396               RFC 3168, DOI 10.17487/RFC3168, September 2001.

1399    [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1400               Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

1403    [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
1404               RFC 6994, DOI 10.17487/RFC6994, August 2013.

1407 10.2. Informative References

1409    [I-D.bagnulo-tcpm-generalized-ecn]
1410               Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
1411               Congestion Notification (ECN) to TCP Control Packets",
1412               draft-bagnulo-tcpm-generalized-ecn-04 (work in progress),
1413               May 2017.

1415    [I-D.ietf-tcpm-alternativebackoff-ecn]
1416               Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst,
1417               "TCP Alternative Backoff with ECN (ABE)",
1418               draft-ietf-tcpm-alternativebackoff-ecn-01 (work in
                  progress), May 2017.

1420    [I-D.ietf-tcpm-dctcp]
1421               Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
1422               and G. Judd, "Datacenter TCP (DCTCP): TCP Congestion
1423               Control for Datacenters", draft-ietf-tcpm-dctcp-06 (work
1424               in progress), May 2017.

1426    [I-D.ietf-tsvwg-ecn-experimentation]
1427               Black, D., "Explicit Congestion Notification (ECN)
1428               Experimentation", draft-ietf-tsvwg-ecn-experimentation-02
1429               (work in progress), April 2017.

1431    [I-D.ietf-tsvwg-l4s-arch]
1432               Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
1433               Low Loss, Scalable Throughput (L4S) Internet Service:
1434               Architecture", draft-ietf-tsvwg-l4s-arch-00 (work in
1435               progress), May 2017.

1437    [I-D.kuehlewind-tcpm-ecn-fallback]
1438               Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path
1439               Probing and Fallback",
1440               draft-kuehlewind-tcpm-ecn-fallback-01 (work in progress),
                  September 2013.

1442    [I-D.moncaster-tcpm-rcv-cheat]
1443               Moncaster, T., Briscoe, B., and A.
               Jacquet, "A TCP Test to
1444               Allow Senders to Identify Receiver Non-Compliance",
1445               draft-moncaster-tcpm-rcv-cheat-03 (work in progress),
                  July 2014.

1447    [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
1448               Congestion Notification (ECN) Signaling with Nonces",
1449               RFC 3540, DOI 10.17487/RFC3540, June 2003.

1452    [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
1453               Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

1456    [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
1457               Ramakrishnan, "Adding Explicit Congestion Notification
1458               (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
1459               DOI 10.17487/RFC5562, June 2009.

1462    [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
1463               Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
1464               June 2010.

1466    [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
1467               "TCP Extensions for Multipath Operation with Multiple
1468               Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

1471    [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
1472               Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

1475    [RFC7560]  Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
1476               "Problem Statement and Requirements for Increased Accuracy
1477               in Explicit Congestion Notification (ECN) Feedback",
1478               RFC 7560, DOI 10.17487/RFC7560, August 2015.

1481    [RFC7713]  Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
1482               Concepts, Abstract Mechanism, and Requirements", RFC 7713,
1483               DOI 10.17487/RFC7713, December 2015.

1486 Appendix A. Example Algorithms

1488    This appendix is informative, not normative. It gives example
1489    algorithms that would satisfy the normative requirements of the
1490    AccECN protocol. However, implementers are free to choose other ways
1491    to implement the requirements.

1493 A.1.
Example Algorithm to Encode/Decode the AccECN Option

1495    The example algorithms below show how a Data Receiver in AccECN mode
1496    could encode its CE byte counter r.ceb into the ECEB field within the
1497    AccECN TCP Option, and how a Data Sender in AccECN mode could decode
1498    the ECEB field into its byte counter s.ceb. The other counters for
1499    bytes marked ECT(0) and ECT(1) in the AccECN Option would be
1500    similarly encoded and decoded.

1502    It is assumed that each local byte counter is an unsigned integer
1503    greater than 24b (probably 32b), and that the following constant has
1504    been assigned:

1506       DIVOPT = 2^24

1508    Every time a CE marked data segment arrives, the Data Receiver
1509    increments its local value of r.ceb by the size of the TCP Data.
1510    Whenever it sends an ACK with the AccECN Option, the value it writes
1511    into the ECEB field is

1513       ECEB = r.ceb % DIVOPT

1515    where '%' is the modulo operator.

1517    On the arrival of an AccECN Option, the Data Sender uses the TCP
1518    acknowledgement number and any SACK options to calculate newlyAckedB,
1519    the amount of new data that the ACK acknowledges in bytes. If
1520    newlyAckedB is negative it means that a more up to date ACK has
1521    already been processed, so this ACK has been superseded and the Data
1522    Sender has to ignore the AccECN Option. Then the Data Sender
1523    calculates the minimum difference d.ceb between the ECEB field and
1524    its local s.ceb counter, using modulo arithmetic as follows:

1526       if (newlyAckedB >= 0) {
1527           d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
1528           s.ceb += d.ceb
1529       }

1531    For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
1532    then
1533       s.ceb % DIVOPT = 1
1534                d.ceb = (1461 + 2^24 - 1) % 2^24
1535                      = 1460
1536                s.ceb = 33,554,433 + 1460
1537                      = 33,555,893

1539 A.2.
Example Algorithm for Safety Against Long Sequences of ACK Loss 1541 The example algorithms below show how a Data Receiver in AccECN mode 1542 could encode its CE packet counter r.cep into the ACE field, and how 1543 the Data Sender in AccECN mode could decode the ACE field into its 1544 s.cep counter. The Data Sender's algorithm includes code to 1545 heuristically detect a long enough unbroken string of ACK losses that 1546 could have concealed a cycle of the congestion counter in the ACE 1547 field of the next ACK to arrive. 1549 Two variants of the algorithm are given: i) a more conservative 1550 variant for a Data Sender to use if it detects that the AccECN Option 1551 is not available (see Section 3.2.3 and Section 3.2.5); and ii) a 1552 less conservative variant that is feasible when complementary 1553 information is available from the AccECN Option. 1555 A.2.1. Safety Algorithm without the AccECN Option 1557 It is assumed that each local packet counter is a sufficiently sized 1558 unsigned integer (probably 32b) and that the following constant has 1559 been assigned: 1561 DIVACE = 2^3 1563 Every time a CE marked packet arrives, the Data Receiver increments 1564 its local value of r.cep by 1. It repeats the same value of ACE in 1565 every subsequent ACK until the next CE marking arrives, where 1567 ACE = r.cep % DIVACE. 1569 If the Data Sender received an earlier value of the counter that had 1570 been delayed due to ACK reordering, it might incorrectly calculate 1571 that the ACE field had wrapped. Therefore, on the arrival of every 1572 ACK, the Data Sender uses the TCP acknowledgement number and any SACK 1573 options to calculate newlyAckedB, the amount of new data that the ACK 1574 acknowledges. If newlyAckedB is negative it means that a more up to 1575 date ACK has already been processed, so this ACK has been superseded 1576 and the Data Sender has to ignore the AccECN Option. 
If newlyAckedB 1577 is zero, to break the tie the Data Sender could use timestamps (if 1578 present) to work out newlyAckedT, the amount of new time that the ACK 1579 acknowledges. Then the Data Sender calculates the minimum difference 1580 d.cep between the ACE field and its local s.cep counter, using modulo 1581 arithmetic as follows: 1583 if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0)) 1584 d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE 1586 Section 3.2.3 requires the Data Sender to assume that the ACE field 1587 did cycle if it could have cycled under prevailing conditions. The 1588 3-bit ACE field in an arriving ACK could have cycled and become 1589 ambiguous to the Data Sender if a row of ACKs goes missing that 1590 covers a stream of data long enough to contain 8 or more CE marks. 1591 We use the word `missing' rather than `lost', because some or all the 1592 missing ACKs might arrive eventually, but out of order. Even if some 1593 of the lost ACKs are piggy-backed on data (i.e. not pure ACKs) 1594 retransmissions will not repair the lost AccECN information, because 1595 AccECN requires retransmissions to carry the latest AccECN counters, 1596 not the original ones. 1598 The phrase `under prevailing conditions' allows the Data Sender to 1599 take account of the prevailing size of data segments and the 1600 prevailing CE marking rate just before the sequence of ACK losses. 1601 However, we shall start with the simplest algorithm, which assumes 1602 segments are all full-sized and ultra-conservatively it assumes that 1603 ECN marking was 100% on the forward path when ACKs on the reverse 1604 path started to all be dropped. Specifically, if newlyAckedB is the 1605 amount of data that an ACK acknowledges since the previous ACK, then 1606 the Data Sender could assume that this acknowledges newlyAckedPkt 1607 full-sized segments, where newlyAckedPkt = newlyAckedB/MSS. 
Then it 1608 could assume that the ACE field incremented by 1610 dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE), 1612 For example, imagine an ACK acknowledges newlyAckedPkt=9 more full- 1613 size segments than any previous ACK, and that ACE increments by a 1614 minimum of 2 CE marks (d.cep=2). The above formula works out that it 1615 would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) = 1616 2). However, if ACE increases by a minimum of 2 but acknowledges 10 1617 full-sized segments, then it would be necessary to assume that there 1618 could have been 10 CE marks (because 10 - ((10-2) % 8) = 10). 1620 Implementers could build in more heuristics to estimate prevailing 1621 average segment size and prevailing ECN marking. For instance, 1622 newlyAckedPkt in the above formula could be replaced with 1623 newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing 1624 segment size and p is the prevailing ECN marking probability. 1625 However, ultimately, if TCP's ECN feedback becomes inaccurate it 1626 still has loss detection to fall back on. Therefore, it would seem 1627 safe to implement a simple algorithm, rather than a perfect one. 1629 The simple algorithm for dSafer.cep above requires no monitoring of 1630 prevailing conditions and it would still be safe if, for example, 1631 segments were on average at least 5% of full-sized as long as ECN 1632 marking was 5% or less. Assuming it was used, the Data Sender would 1633 increment its packet counter as follows: 1635 s.cep += dSafer.cep 1637 If missing acknowledgement numbers arrive later (due to reordering), 1638 Section 3.2.3 says "the Data Sender MAY attempt to neutralise the 1639 effect of any action it took based on a conservative assumption that 1640 it later found to be incorrect". To do this, the Data Sender would 1641 have to store the values of all the relevant variables whenever it 1642 made assumptions, so that it could re-evaluate them later. 
        Given
1643    this could become complex and it is not required, we do not attempt
1644    to provide an example of how to do this.

1646 A.2.2. Safety Algorithm with the AccECN Option

1648    When the AccECN Option is available on the ACKs before and after the
1649    possible sequence of ACK losses, if the Data Sender only needs
1650    CE-marked bytes, it will have sufficient information in the AccECN
1651    Option without needing to process the ACE field. However, if for
1652    some reason it needs CE-marked packets, if dSafer.cep is different
1653    from d.cep, it can calculate the average marked segment size that
1654    each implies to determine whether d.cep is likely to be a safe enough
1655    estimate. Specifically, it could use the following algorithm, where
1656    d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

1658       SAFETY_FACTOR = 2
1659       if (dSafer.cep > d.cep) {
1660           s = d.ceb/d.cep
1661           if (s <= MSS) {
1662               sSafer = d.ceb/dSafer.cep
1663               if (sSafer < MSS/SAFETY_FACTOR)
1664                   dSafer.cep = d.cep  // d.cep is a safe enough estimate
1665           } // else
1666               // No need for else; dSafer.cep is already correct,
1667               // because d.cep must have been too small
1668       }

1670    The chart below shows when the above algorithm will consider d.cep
1671    can replace dSafer.cep as a safe enough estimate of the number of
1672    CE-marked packets:

1674              ^
1675        sSafer|
1676              |
1677           MSS+
1678              |
1679              |         dSafer.cep
1680              |                  is
1681         MSS/2+--------------+ safest
1682              |              |
1683              | d.cep is safe|
1684              |    enough    |
1685              +-------------------->
1686                            MSS   s

1688    The following examples give the reasoning behind the algorithm,
1689    assuming MSS=1,460 [B]:

1691    o  if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
1692       sSafer=182.5.
1693       Therefore even though the average size of 8 data segments is
1694       unlikely to have been as small as MSS/8, d.cep cannot have been
1695       correct, because it would imply an average segment size greater
1696       than the MSS.
1698 o if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and 1699 sSafer=146. 1700 Therefore d.cep is safe enough, because the average size of 10 1701 data segments is unlikely to have been as small as MSS/10. 1703 o if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and 1704 sSafer=680. 1705 Therefore d.cep is safe enough, because the average data segment 1706 size is more likely to have been just less than one MSS, rather 1707 than below MSS/2. 1709 If pure ACKs were allowed to be ECN-capable, missing ACKs would be 1710 far less likely. However, because [RFC3168] currently precludes 1711 this, the above algorithm assumes that pure ACKs are not ECN-capable. 1713 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets 1715 If the AccECN Option is not available, the Data Sender can only 1716 decode CE-marking from the ACE field in packets. Every time an ACK 1717 arrives, to convert this into an estimate of CE-marked bytes, it 1718 needs an average of the segment size, s_ave. Then it can add or 1719 subtract s_ave from the value of d.ceb as the value of d.cep 1720 increments or decrements. 1722 To calculate s_ave, it could keep a record of the byte numbers of all 1723 the boundaries between packets in flight (including control packets), 1724 and recalculate s_ave on every ACK. However it would be simpler to 1725 merely maintain a counter packets_in_flight for the number of packets 1726 in flight (including control packets), which it could update once per 1727 RTT. Either way, it would estimate s_ave as: 1729 s_ave ~= flightsize / packets_in_flight, 1731 where flightsize is the variable that TCP already maintains for the 1732 number of bytes in flight. To avoid floating point arithmetic, it 1733 could right-bit-shift by lg(packets_in_flight), where lg() means log 1734 base 2. 
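As a rough illustration of the estimate above, the following Python sketch (not part of the draft) compares the exact integer average with the right-shift approximation, and then scales a change in the CE packet count into an estimate of CE-marked bytes. The function names s_ave_exact, s_ave_shift and estimate_d_ceb are hypothetical, and taking lg() as the floor of log base 2 is an assumption; the text does not say how to round when packets_in_flight is not a power of two.

```python
def s_ave_exact(flightsize: int, packets_in_flight: int) -> int:
    """Average segment size in bytes, using integer division."""
    if packets_in_flight <= 0:
        return 0  # nothing in flight, so no meaningful average
    return flightsize // packets_in_flight


def s_ave_shift(flightsize: int, packets_in_flight: int) -> int:
    """Approximate average using a right shift by floor(lg(packets_in_flight)).

    Assumption: lg() is rounded down. The result equals the exact average
    when packets_in_flight is a power of two; otherwise it is never smaller
    than the exact average and stays within roughly a factor of two of it.
    """
    if packets_in_flight <= 0:
        return 0
    shift = packets_in_flight.bit_length() - 1  # floor(log2(n)) for n >= 1
    return flightsize >> shift


def estimate_d_ceb(d_cep: int, s_ave: int) -> int:
    """Estimate newly CE-marked bytes from newly CE-marked packets."""
    return d_cep * s_ave
```

With 8 full-sized segments of 1,460 bytes in flight, both functions return 1,460. With 10 segments (14,600 bytes) the shift variant divides by 8 rather than 10 and returns 1,825, so an implementation that prefers to err on the side of counting more marked bytes might accept the shift, whereas one that needs a tighter estimate would use the division.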
1736    An alternative would be to maintain an exponentially weighted moving
1737    average (EWMA) of the segment size:

1739       s_ave = a * s + (1-a) * s_ave,

1741    where a is the decay constant for the EWMA. However, then it is
1742    necessary to choose a good value for this constant, which ought to
1743    depend on the number of packets in flight. Also the decay constant
1744    needs to be a power of two to avoid floating point arithmetic.

1746 A.4. Example Algorithm to Beacon AccECN Options

1748    Section 3.2.6 requires a Data Receiver to beacon a full-length AccECN
1749    Option at least 3 times per RTT. This could be implemented by
1750    maintaining a variable to store the number of ACKs (pure and data
1751    ACKs) since a full AccECN Option was last sent and another for the
1752    approximate number of ACKs sent in the last round trip time:

1754       if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
1755           send_full_AccECN_Option()

1757    For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
1758    rather than 3, so that the division could be implemented as an
1759    integer right bit-shift by lg(BEACON_FREQ).

1761    In certain operating systems, it might be too complex to maintain
1762    acks_in_round. In others it might be possible by tagging each data
1763    segment in the retransmit buffer with the number of ACKs sent at the
1764    point that segment was sent. This would not work well if the Data
1765    Receiver was not sending data itself, in which case it might be
1766    necessary to beacon based on time instead, as follows:

1768       if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
1769           send_full_AccECN_Option()

1771    This time-based approach does not work well when all the ACKs are
1772    sent early in each round trip, as is the case during slow-start. In
1773    this case few options will be sent (possibly even fewer than 3 per RTT).
1774    However, when continuously sending data, data packets as well as ACKs
1775    will spread out equally over the RTT and sufficient ACKs with the
1776    AccECN option will be sent.

1778 A.5. Example Algorithm to Count Not-ECT Bytes

1780    A Data Sender in AccECN mode can infer the amount of TCP payload data
1781    arriving at the receiver marked Not-ECT from the difference between
1782    the amount of newly ACKed data and the sum of the bytes with the
1783    other three markings, d.ceb, d.e0b and d.e1b. Note that, because
1784    r.e0b is initialized to 1 and the other two counters are initialized
1785    to 0, the initial sum will be 1, which matches the initial offset of
1786    the TCP sequence number on completion of the 3WHS.

1788    For this approach to be precise, it has to be assumed that spurious
1789    (unnecessary) retransmissions do not lead to double counting. This
1790    assumption is currently correct, given that RFC 3168 requires that
1791    the Data Sender marks retransmitted segments as Not-ECT. However,
1792    the converse is not true; necessary retransmissions will result in
1793    under-counting.

1795    However, such precision is unlikely to be necessary. The only known
1796    use of a count of Not-ECT marked bytes is to test whether equipment
1797    on the path is clearing the ECN field (perhaps due to an out-dated
1798    attempt to clear, or bleach, what used to be the ToS field). To
1799    detect bleaching it will be sufficient to detect whether nearly all
1800    bytes arrive marked as Not-ECT. Therefore there should be no need to
1801    keep track of the details of retransmissions.

1803 Appendix B. Alternative Design Choices (To Be Removed Before
1804             Publication)

1806    This appendix is informative, not normative.
It records alternative 1807 designs that the authors chose not to include in the normative 1808 specification, but which the IETF might wish to consider for 1809 inclusion: 1811 Feedback all four ECN codepoints on the SYN/ACK: The last two 1812 negotiation combinations in Table 2 could be used to indicate 1813 AccECN support while also feeding back that the arriving SYN was 1814 ECT(0) or ECT(1). This could be used to probe the client to 1815 server path for incorrect forwarding of the ECN field 1816 [I-D.kuehlewind-tcpm-ecn-fallback]. 1818 Feedback all four ECN codepoints on the First ACK: To probe the 1819 server to client path for incorrect ECN forwarding, it could be 1820 useful to have four feedback states on the first ACK from the TCP 1821 client. This could be achieved by assigning four combinations of 1822 the ECN flags in the main TCP header, and only initializing the 1823 ACE field on subsequent segments. 1825 Appendix C. Open Protocol Design Issues (To Be Removed Before 1826 Publication) 1828 1. Currently it is specified that the receiver `SHOULD' use Change- 1829 Triggered ACKs. It is controversial whether this ought to be a 1830 `MUST' instead. A `SHOULD' would leave the Data Sender uncertain 1831 whether it can rely on the timing and ordering information in 1832 ACKs. If the sender guesses wrongly, it will probably introduce 1833 at least 1 RTT of delay before it can use this timing 1834 information. Ironically it will most likely be wanting this 1835 information to reduce ramp-up delay. A `MUST' could make it hard 1836 to implement AccECN in offload hardware. However, it is not 1837 known whether AccECN would be hard to implement in such hardware 1838 even with a `SHOULD' here. For instance, was it hard to offload 1839 DCTCP to hardware because of change-triggered ACKs, or was this 1840 just one of many reasons? The choice between MUST and SHOULD 1841 here is critical. 
Before that choice is made, a clear use-case 1842 for certainty of timing and ordering information is needed, plus 1843 well-informed discussion about hardware offload constraints. 1845 2. There is possibly a concern that a receiver could deliberately 1846 omit the AccECN Option pretending that it had been stripped by a 1847 middlebox. No known way can yet be contrived to take advantage 1848 of this downgrade attack, but it is mentioned here in case 1849 someone else can contrive one. 1851 Appendix D. Changes in This Version (To Be Removed Before Publication) 1853 The difference between any pair of versions can be displayed at 1854 http://datatracker.ietf.org/doc/draft-kuehlewind-tcpm-accurate-ecn/ 1855 history/ 1857 Authors' Addresses 1859 Bob Briscoe 1860 Simula Research Laboratory 1862 EMail: ietf@bobbriscoe.net 1863 URI: http://bobbriscoe.net/ 1864 Mirja Kuehlewind 1865 ETH Zurich 1866 Zurich 1867 Switzerland 1869 EMail: mirja.kuehlewind@tik.ee.ethz.ch 1871 Richard Scheffenegger 1872 Vienna 1873 Austria 1875 EMail: rscheff@gmx.at