idnits 2.17.1 draft-ietf-tcpm-accurate-ecn-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: If the TCP client indicated AccECN support, a TCP server that confirms its support for AccECN (as described in Section 3.1) SHOULD also include an AccECN TCP Option in the SYN/ACK. A TCP client that has successfully negotiated AccECN SHOULD include an AccECN Option in the first ACK at the end of the 3WHS. However, this first ACK is not delivered reliably, so the TCP client SHOULD also include an AccECN Option on the first data segment it sends (if it ever sends one). A host MAY NOT include an AccECN Option in any of these three cases if it has cached knowledge that the packet would be likely to be blocked on the path to the other host if it included an AccECN Option. 
-- The document date (October 31, 2016) is 2727 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'B' is mentioned on line 1501, but not defined -- Obsolete informational reference (is this intentional?): RFC 6824 (Obsoleted by RFC 8684) Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance & Minor Extensions (tcpm) B. Briscoe 3 Internet-Draft Simula Research Laboratory 4 Intended status: Experimental M. Kuehlewind 5 Expires: May 4, 2017 ETH Zurich 6 R. Scheffenegger 7 October 31, 2016 9 More Accurate ECN Feedback in TCP 10 draft-ietf-tcpm-accurate-ecn-02 12 Abstract 14 Explicit Congestion Notification (ECN) is a mechanism where network 15 nodes can mark IP packets instead of dropping them to indicate 16 incipient congestion to the end-points. Receivers with an ECN- 17 capable transport protocol feed back this information to the sender. 18 ECN is specified for TCP in such a way that only one feedback signal 19 can be transmitted per Round-Trip Time (RTT). Recently, new TCP 20 mechanisms like Congestion Exposure (ConEx) or Data Center TCP 21 (DCTCP) need more accurate ECN feedback information whenever more 22 than one marking is received in one RTT. This document specifies an 23 experimental scheme to provide more than one feedback signal per RTT 24 in the TCP header. Given TCP header space is scarce, it overloads 25 the three existing ECN-related flags in the TCP header and provides 26 additional information in a new TCP option. 28 Status of This Memo 30 This Internet-Draft is submitted in full conformance with the 31 provisions of BCP 78 and BCP 79. 
33 Internet-Drafts are working documents of the Internet Engineering 34 Task Force (IETF). Note that other groups may also distribute 35 working documents as Internet-Drafts. The list of current Internet- 36 Drafts is at http://datatracker.ietf.org/drafts/current/. 38 Internet-Drafts are draft documents valid for a maximum of six months 39 and may be updated, replaced, or obsoleted by other documents at any 40 time. It is inappropriate to use Internet-Drafts as reference 41 material or to cite them other than as "work in progress." 43 This Internet-Draft will expire on May 4, 2017. 45 Copyright Notice 47 Copyright (c) 2016 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the Simplified BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 63 1.1. Document Roadmap . . . . . . . . . . . . . . . . . . . . 4 64 1.2. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 4 65 1.3. Experiment Goals . . . . . . . . . . . . . . . . . . . . 5 66 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 67 1.5. Recap of Existing ECN feedback in IP/TCP . . . . . . . . 6 68 2. AccECN Protocol Overview and Rationale . . . . . . . . . . . 7 69 2.1. Capability Negotiation . . . . . . . . . . . . . . . . . 8 70 2.2. Feedback Mechanism . . . . . . . . . . . . . . . . . . . 8 71 2.3. Delayed ACKs and Resilience Against ACK Loss . . . . . . 9 72 2.4. 
Feedback Metrics . . . . . . . . . . . . . . . . . . . . 10 73 2.5. Generic (Dumb) Reflector . . . . . . . . . . . . . . . . 10 74 3. AccECN Protocol Specification . . . . . . . . . . . . . . . . 11 75 3.1. Negotiation during the TCP handshake . . . . . . . . . . 11 76 3.2. AccECN Feedback . . . . . . . . . . . . . . . . . . . . . 14 77 3.2.1. The ACE Field . . . . . . . . . . . . . . . . . . . . 14 78 3.2.2. Safety against Ambiguity of the ACE Field . . . . . . 16 79 3.2.3. The AccECN Option . . . . . . . . . . . . . . . . . . 16 80 3.2.4. Path Traversal of the AccECN Option . . . . . . . . . 17 81 3.2.5. Usage of the AccECN TCP Option . . . . . . . . . . . 19 82 3.3. AccECN Compliance by TCP Proxies, Offload Engines and 83 other Middleboxes . . . . . . . . . . . . . . . . . . . . 20 84 4. Interaction with Other TCP Variants . . . . . . . . . . . . . 21 85 4.1. Compatibility with SYN Cookies . . . . . . . . . . . . . 21 86 4.2. Compatibility with Other TCP Options and Experiments . . 21 87 4.3. Compatibility with Feedback Integrity Mechanisms . . . . 21 88 5. Protocol Properties . . . . . . . . . . . . . . . . . . . . . 23 89 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 90 7. Security Considerations . . . . . . . . . . . . . . . . . . . 25 91 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 92 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . 26 93 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 26 94 10.1. Normative References . . . . . . . . . . . . . . . . . . 27 95 10.2. Informative References . . . . . . . . . . . . . . . . . 27 96 Appendix A. Example Algorithms . . . . . . . . . . . . . . . . . 29 97 A.1. Example Algorithm to Encode/Decode the AccECN Option . . 29 98 A.2. Example Algorithm for Safety Against Long Sequences of 99 ACK Loss . . . . . . . . . . . . . . . . . . . . . . . . 30 100 A.2.1. Safety Algorithm without the AccECN Option . . . . . 30 101 A.2.2. 
Safety Algorithm with the AccECN Option . . . . . . . 32 102 A.3. Example Algorithm to Estimate Marked Bytes from Marked 103 Packets . . . . . . . . . . . . . . . . . . . . . . . . . 33 104 A.4. Example Algorithm to Beacon AccECN Options . . . . . . . 34 105 A.5. Example Algorithm to Count Not-ECT Bytes . . . . . . . . 35 106 Appendix B. Alternative Design Choices (To Be Removed Before 107 Publication) . . . . . . . . . . . . . . . . . . . . 35 108 Appendix C. Open Protocol Design Issues (To Be Removed Before 109 Publication) . . . . . . . . . . . . . . . . . . . . 36 110 Appendix D. Changes in This Version (To Be Removed Before 111 Publication) . . . . . . . . . . . . . . . . . . . . 37 112 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37 114 1. Introduction 116 Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where 117 network nodes can mark IP packets instead of dropping them to 118 indicate incipient congestion to the end-points. Receivers with an 119 ECN-capable transport protocol feed back this information to the 120 sender. ECN is specified for TCP in such a way that only one 121 feedback signal can be transmitted per Round-Trip Time (RTT). 122 Recently, proposed mechanisms like Congestion Exposure (ConEx 123 [I-D.ietf-conex-abstract-mech]) or DCTCP [I-D.bensley-tcpm-dctcp] 124 need more accurate ECN feedback information whenever more than one 125 marking is received in one RTT. A fuller treatment of the motivation 126 for this specification is given in the associated requirements 127 document [RFC7560]. 129 This document specifies an experimental scheme for ECN feedback in 130 the TCP header to provide more than one feedback signal per RTT. It 131 will be called the more accurate ECN feedback scheme, or AccECN for 132 short. If AccECN progresses from experimental to the standards 133 track, it is intended to be a complete replacement for classic ECN 134 feedback, not a fork in the design of TCP. 
Thus, the applicability 135 of AccECN is intended to include all public and private IP networks 136 (and even any non-IP networks over which TCP is used today). Until 137 the AccECN experiment succeeds, [RFC3168] will remain as the 138 standards track specification for adding ECN to TCP. To avoid 139 confusion, in this document we use the term 'classic ECN' for the 140 pre-existing ECN specification [RFC3168]. 142 AccECN is solely an (experimental) change to the TCP wire protocol. 143 It is completely independent of how TCP might respond to congestion 144 feedback. This specification overloads flags and fields in the main 145 TCP header with new definitions, so both ends have to support the new 146 wire protocol before it can be used. Therefore during the TCP 147 handshake the two ends use the three ECN-related flags in the TCP 148 header to negotiate the most advanced feedback protocol that they can 149 both support. 151 It is likely (but not required) that the AccECN protocol will be 152 implemented along with the following experimental additions to the 153 TCP-ECN protocol: ECN-capable SYN/ACK [RFC5562], ECN path-probing and 154 fall-back [I-D.kuehlewind-tcpm-ecn-fallback] and testing receiver 155 non-compliance [I-D.moncaster-tcpm-rcv-cheat]. 157 1.1. Document Roadmap 159 The following introductory sections outline the goals of AccECN 160 (Section 1.2) and the goal of experiments with ECN (Section 1.3) so 161 that it is clear what success would look like. Then terminology is 162 defined (Section 1.4) and a recap of existing prerequisite technology 163 is given (Section 1.5). 165 Section 2 gives an informative overview of the AccECN protocol. Then 166 Section 3 gives the normative protocol specification. Section 4 167 assesses the interaction of AccECN with commonly used variants of 168 TCP, whether standardised or not. Section 5 summarises the features 169 and properties of AccECN. 
171 Section 6 summarises the protocol fields and numbers that IANA will 172 need to assign and Section 7 points to the aspects of the protocol 173 that will be of interest to the security community. 175 Appendix A gives pseudocode examples for the various algorithms that 176 AccECN uses. 178 1.2. Goals 180 [RFC7560] enumerates requirements that a candidate feedback scheme 181 will need to satisfy, under the headings: resilience, timeliness, 182 integrity, accuracy (including ordering and lack of bias), 183 complexity, overhead and compatibility (both backward and forward). 184 It recognises that a perfect scheme that fully satisfies all the 185 requirements is unlikely and trade-offs between requirements are 186 likely. Section 5 presents the properties of AccECN against these 187 requirements and discusses the trade-offs made. 189 The requirements document recognises that a protocol as ubiquitous as 190 TCP needs to be able to serve as-yet-unspecified requirements. 191 Therefore an AccECN receiver aims to act as a generic (dumb) 192 reflector of congestion information so that in future new sender 193 behaviours can be deployed unilaterally. 195 1.3. Experiment Goals 197 TCP is critical to the robust functioning of the Internet, therefore 198 any proposed modifications to TCP need to be thoroughly tested. The 199 present specification describes an experimental protocol that adds 200 more accurate ECN feedback to the TCP protocol. The intention is to 201 specify the protocol sufficiently so that more than one 202 implementation can be built in order to test its function, robustness 203 and interoperability (with itself and with previous versions of ECN 204 and TCP). 206 The experimental protocol will be considered successful if it 207 satisfies the requirements of [RFC7560] in the consensus opinion of 208 the IETF tcpm working group. 
In short, this requires that it 209 improves the accuracy and timeliness of TCP's ECN feedback, as 210 claimed in Section 5, while striking a balance between the 211 conflicting requirements of resilience, integrity and minimisation of 212 overhead. It also requires that it is not unduly complex, and that 213 it is compatible with prevalent equipment behaviours in the current 214 Internet, whether or not they comply with standards. 216 1.4. Terminology 218 AccECN: The more accurate ECN feedback scheme will be called AccECN 219 for short. 221 Classic ECN: the ECN protocol specified in [RFC3168]. 223 Classic ECN feedback: the feedback aspect of the ECN protocol 224 specified in [RFC3168], including generation, encoding, 225 transmission and decoding of feedback, but not the Data Sender's 226 subsequent response to that feedback. 228 ACK: A TCP acknowledgement, with or without a data payload. 230 Pure ACK: A TCP acknowledgement without a data payload. 232 TCP client: The TCP stack that originates a connection. 234 TCP server: The TCP stack that responds to a connection request. 236 Data Receiver: The endpoint of a TCP half-connection that receives 237 data and sends AccECN feedback. 239 Data Sender: The endpoint of a TCP half-connection that sends data 240 and receives AccECN feedback. 242 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 243 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 244 document are to be interpreted as described in RFC 2119 [RFC2119]. 246 1.5. Recap of Existing ECN feedback in IP/TCP 248 ECN [RFC3168] uses two bits in the IP header. Once ECN has been 249 negotiated with the receiver at the transport layer, an ECN sender 250 can set two possible codepoints (ECT(0) or ECT(1)) in the IP header 251 to indicate an ECN-capable transport (ECT). If both ECN bits are 252 zero, the packet is considered to have been sent by a Not-ECN-capable 253 Transport (Not-ECT). 
When a network node experiences congestion, it 254 will occasionally either drop or mark a packet, with the choice 255 depending on the packet's ECN codepoint. If the codepoint is Not- 256 ECT, only drop is appropriate. If the codepoint is ECT(0) or ECT(1), 257 the node can mark the packet by setting both ECN bits, which is 258 termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'. 259 Table 1 summarises these codepoints.
261 +-----------------------+---------------+---------------------------+
262 | IP-ECN codepoint      | Codepoint     | Description               |
263 | (binary)              | name          |                           |
264 +-----------------------+---------------+---------------------------+
265 | 00                    | Not-ECT       | Not ECN-Capable Transport |
266 | 01                    | ECT(1)        | ECN-Capable Transport (1) |
267 | 10                    | ECT(0)        | ECN-Capable Transport (0) |
268 | 11                    | CE            | Congestion Experienced    |
269 +-----------------------+---------------+---------------------------+
271 Table 1: The ECN Field in the IP Header
273 In the TCP header the first two bits in byte 14 are defined as flags 274 for the use of ECN (CWR and ECE in Figure 1 [RFC3168]). A TCP client 275 indicates it supports ECN by setting ECE=CWR=1 in the SYN, and an 276 ECN-enabled server confirms ECN support by setting ECE=1 and CWR=0 in 277 the SYN/ACK. On reception of a CE-marked packet at the IP layer, the 278 Data Receiver starts to set the Echo Congestion Experienced (ECE) 279 flag continuously in the TCP header of ACKs, which ensures the signal 280 is received reliably even if ACKs are lost. The TCP sender confirms 281 that it has received at least one ECE signal by responding with the 282 congestion window reduced (CWR) flag, which allows the TCP receiver 283 to stop repeating the ECN-Echo flag. This always leads to a full RTT 284 of ACKs with ECE set. Thus any additional CE markings arriving 285 within this RTT cannot be fed back. 
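The recap above can be illustrated with a short Python sketch. It is not part of the specification: the names EcnCodepoint, ecn_from_tos and ClassicEcnReceiver are hypothetical, and the receiver model is deliberately simplified (it ignores delayed ACKs, loss and the handshake). The codepoint values are those of Table 1; the position of the ECN field in the two least significant bits of the IPv4 ToS (or IPv6 Traffic Class) octet is taken from [RFC3168].

```python
from enum import IntEnum


class EcnCodepoint(IntEnum):
    """IP-ECN codepoints, with the binary values listed in Table 1."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def ecn_from_tos(tos: int) -> EcnCodepoint:
    """Return the ECN codepoint held in the two low-order bits of the
    IPv4 ToS / IPv6 Traffic Class octet."""
    return EcnCodepoint(tos & 0b11)


class ClassicEcnReceiver:
    """Simplified model of the classic ECN receiver's ECE latch."""

    def __init__(self) -> None:
        self.ece_latched = False

    def on_packet(self, ecn: EcnCodepoint, cwr: bool = False) -> bool:
        """Process one arriving packet and return the ECE flag that the
        resulting ACK would carry."""
        if cwr:
            # The sender has confirmed its response, so stop echoing.
            self.ece_latched = False
        if ecn is EcnCodepoint.CE:
            # A CE mark (re)starts the continuous ECE echo.
            self.ece_latched = True
        return self.ece_latched
```

Feeding this model three CE-marked packets in one RTT yields three ACKs with ECE=1, which is exactly what one CE-marked packet followed by two unmarked packets would yield. The Data Sender cannot tell the two cases apart, which is the limitation that motivates AccECN.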
287 The ECN Nonce [RFC3540] is an optional experimental addition to ECN 288 that the TCP sender can use to protect against accidental or 289 malicious concealment of marked or dropped packets. The sender can 290 send an ECN nonce, which is a continuous pseudo-random pattern of 291 ECT(0) and ECT(1) codepoints in the ECN field. The receiver is 292 required to feed back a 1-bit nonce sum that counts the occurrence of 293 ECT(1) packets using the last bit of byte 13 in the TCP header, which 294 is defined as the Nonce Sum (NS) flag.
296   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
297 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
298 |               |           | N | C | E | U | A | P | R | S | F |
299 | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
300 |               |           |   | R | E | G | K | H | T | N | N |
301 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
303 Figure 1: The (post-ECN Nonce) definition of the TCP header flags
305 2. AccECN Protocol Overview and Rationale 307 This section provides an informative overview of the AccECN protocol 308 that will be normatively specified in Section 3. 310 Like the original TCP approach, the Data Receiver of each TCP half- 311 connection sends AccECN feedback to the Data Sender on TCP 312 acknowledgements, reusing data packets of the other half-connection 313 whenever possible. 315 The AccECN protocol has had to be designed in two parts: 317 o an essential part that re-uses ECN TCP header bits to feed back 318 the number of arriving CE marked packets. This provides more 319 accuracy than classic ECN feedback, but limited resilience against 320 ACK loss; 322 o a supplementary part using a new AccECN TCP Option that provides 323 additional feedback on the number of bytes that arrive marked with 324 each of the three ECN codepoints (not just CE marks). This 325 provides greater resilience against ACK loss than the essential 326 feedback, but it is more likely to suffer from middlebox 327 interference. 
329 The two part design was necessary, given limitations on the space 330 available for TCP options and given the possibility that certain 331 incorrectly designed middleboxes prevent TCP using any new options. 333 The essential part overloads the previous definition of the three 334 flags in the TCP header that had been assigned for use by ECN. This 335 design choice deliberately replaces the classic ECN feedback 336 protocol, rather than leaving classic ECN feedback intact and adding 337 more accurate feedback separately because: 339 o this efficiently reuses scarce TCP header space, given TCP option 340 space is approaching saturation; 342 o a single upgrade path for the TCP protocol is preferable to a fork 343 in the design; 345 o otherwise classic and accurate ECN feedback could give conflicting 346 feedback on the same segment, which could open up new security 347 concerns and make implementations unnecessarily complex; 349 o middleboxes are more likely to faithfully forward the TCP ECN 350 flags than newly defined areas of the TCP header. 352 AccECN is designed to work even if the supplementary part is removed 353 or zeroed out, as long as the essential part gets through. 355 2.1. Capability Negotiation 357 AccECN is a change to the wire protocol of the main TCP header, 358 therefore it can only be used if both endpoints have been upgraded to 359 understand it. The TCP client signals support for AccECN on the 360 initial SYN of a connection and the TCP server signals whether it 361 supports AccECN on the SYN/ACK. The TCP flags on the SYN that the 362 client uses to signal AccECN support have been carefully chosen so 363 that a TCP server will interpret them as a request to support the 364 most recent variant of ECN feedback that it supports. Then the 365 client falls back to the same variant of ECN feedback. 
367 An AccECN TCP client does not send the new AccECN Option on the SYN 368 as SYN option space is limited and successful negotiation using the 369 flags in the main header is taken as sufficient evidence that both 370 ends also support the AccECN Option. The TCP server sends the AccECN 371 Option on the SYN/ACK and the client sends it on the first ACK to 372 test whether the network path forwards the option correctly. 374 2.2. Feedback Mechanism 376 A Data Receiver maintains four counters initialised at the start of 377 the half-connection. Three count the number of arriving payload 378 bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts 379 the number of packets arriving marked with a CE codepoint (including 380 control packets without payload if they are CE-marked). 382 The Data Sender maintains four equivalent counters for the half 383 connection, and the AccECN protocol is designed to ensure they will 384 match the values in the Data Receiver's counters, albeit after a 385 little delay. 387 Each ACK carries the three least significant bits (LSBs) of the 388 packet-based CE counter using the ECN bits in the TCP header, now 389 renamed the Accurate ECN (ACE) field. The LSBs of each of the three 390 byte counters are carried in the AccECN Option. 392 2.3. Delayed ACKs and Resilience Against ACK Loss 394 With both the ACE and the AccECN Option mechanisms, the Data Receiver 395 continually repeats the current LSBs of each of its respective 396 counters. Then, even if some ACKs are lost, the Data Sender should 397 be able to infer how much to increment its own counters, even if the 398 protocol field has wrapped. 400 The 3-bit ACE field can wrap fairly frequently. Therefore, even if 401 it appears to have incremented by one (say), the field might have 402 actually cycled completely then incremented by one. The Data 403 Receiver is required not to delay sending an ACK to such an extent 404 that the ACE field would cycle. 
However, cycling is still a 405 possibility at the Data Sender because a whole sequence of ACKs 406 carrying intervening values of the field might all be lost or delayed 407 in transit. 409 The fields in the AccECN Option are larger, but they will increment 410 in larger steps because they count bytes not packets. Nonetheless, 411 their size has been chosen such that a whole cycle of the field would 412 never occur between ACKs unless there had been an infeasibly long 413 sequence of ACK losses. Therefore, as long as the AccECN Option is 414 available, it can be treated as a dependable feedback channel. 416 If the AccECN Option is not available, e.g. it is being stripped by a 417 middlebox, the AccECN protocol will only feed back information on CE 418 markings (using the ACE field). Although not ideal, this will be 419 sufficient, because it is envisaged that neither ECT(0) nor ECT(1) 420 will ever indicate more severe congestion than CE, even though future 421 uses for ECT(0) or ECT(1) are still unclear. Because the 3-bit ACE 422 field is so small, when it is the only field available the Data 423 Sender has to interpret it conservatively assuming the worst possible 424 wrap. 426 Certain specified events trigger the Data Receiver to include an 427 AccECN Option on an ACK. The rules are designed to ensure that the 428 order in which different markings arrive at the receiver is 429 communicated to the sender (as long as there is no ACK loss). 431 Implementations are encouraged to send an AccECN Option more 432 frequently, but this is left up to the implementer. 434 2.4. Feedback Metrics 436 The CE packet counter in the ACE field and the CE byte counter in the 437 AccECN Option both provide feedback on received CE-marks. The CE 438 packet counter includes control packets that do not have payload 439 data, while the CE byte counter solely includes marked payload bytes. 
440 If both are present, the byte counter in the option will provide the 441 more accurate information needed for modern congestion control and 442 policing schemes, such as DCTCP or ConEx. If the option is stripped, 443 a simple algorithm to estimate the number of marked bytes from the 444 ACE field is given in Appendix A.3. 446 Feedback in bytes is recommended in order to protect against the 447 receiver using attacks similar to 'ACK-Division' to artificially 448 inflate the congestion window, which is why [RFC5681] now recommends 449 that TCP counts acknowledged bytes not packets. 451 2.5. Generic (Dumb) Reflector 453 The ACE field provides information about CE markings on both data and 454 control packets. According to [RFC3168] the Data Sender is meant to 455 set control packets to Not-ECT. However, mechanisms in certain 456 private networks (e.g. data centres) set control packets to be ECN 457 capable because they are precisely the packets that performance 458 depends on most. 460 For this reason, AccECN is designed to be a generic reflector of 461 whatever ECN markings it sees, whether or not they are compliant with 462 a current standard. Then as standards evolve, Data Senders can 463 upgrade unilaterally without any need for receivers to upgrade too. 464 It is also useful to be able to rely on generic reflection behaviour 465 when senders need to test for unexpected interference with markings 466 (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and 467 [I-D.moncaster-tcpm-rcv-cheat]). 469 The initial SYN is the most critical control packet, so AccECN 470 provides feedback on whether it is CE marked, even though it is not 471 allowed to be ECN-capable according to RFC 3168. However, 472 middleboxes have been known to overwrite the ECN IP field as if it is 473 still part of the old Type of Service (ToS) field. 
If a TCP client 474 has set the SYN to Not-ECT, but receives CE feedback, it can detect 475 such middlebox interference and send Not-ECT for the rest of the 476 connection (see [I-D.kuehlewind-tcpm-ecn-fallback] for the detailed 477 fall-back behaviour). 479 Today, if a TCP server receives CE on a SYN, it cannot know whether 480 it is invalid (or valid) because only the TCP client knows whether it 481 originally marked the SYN as Not-ECT (or ECT). Therefore, the 482 server's only safe course of action is to disable ECN for the 483 connection. Instead, the AccECN protocol allows the server to feed 484 back the CE marking to the client, which then has all the information 485 to decide whether the connection has to fall-back from supporting ECN 486 (or not). 488 Providing feedback of CE marking on the SYN also supports future 489 scenarios in which SYNs might be ECN-enabled (without prejudging 490 whether they ought to be). For instance, in certain environments 491 such as data centres, it might be appropriate to allow ECN-capable 492 SYNs. Then, if feedback showed the SYN had been CE marked, the TCP 493 client could reduce its initial window (IW). It could also reduce IW 494 conservatively if feedback showed the receiver did not support ECN 495 (because if there had been a CE marking, the receiver would not have 496 understood it). Note that this text merely motivates dumb reflection 497 of CE on a SYN, it does not judge whether a SYN ought to be ECN- 498 capable. 500 3. AccECN Protocol Specification 502 3.1. Negotiation during the TCP handshake 504 During the TCP handshake at the start of a connection, to request 505 more accurate ECN feedback the TCP client (host A) MUST set the TCP 506 flags NS=1, CWR=1 and ECE=1 in the initial SYN segment. 508 If a TCP server (B) that is AccECN enabled receives a SYN with the 509 above three flags set, it MUST set both its half connections into 510 AccECN mode. 
Then it MUST set the flags CWR=1 and ECE=0 on its 511 response in the SYN/ACK segment to confirm that it supports AccECN. 512 The TCP server MUST NOT set this combination of flags unless the 513 preceding SYN requested support for AccECN as above. 515 A TCP server in AccECN mode MUST additionally set the flag NS=1 on 516 the SYN/ACK if the SYN was CE-marked (see Section 2.5). If the 517 received SYN was Not-ECT, ECT(0) or ECT(1), it MUST clear NS (NS=0) 518 on the SYN/ACK. 520 Once a TCP client (A) has sent the above SYN to declare that it 521 supports AccECN, and once it has received the above SYN/ACK segment 522 that confirms that the TCP server supports AccECN, the TCP client 523 MUST set both its half connections into AccECN mode. 525 If after the normal TCP timeout the TCP client has not received a 526 SYN/ACK to acknowledge its SYN, the SYN might just have been lost, 527 e.g. due to congestion, or a middlebox might be blocking segments 528 with the AccECN flags. To expedite connection setup, the host SHOULD 529 fall back to NS=CWR=ECE=0 on the retransmission of the SYN. It would 530 make sense to also remove any other experimental fields or options on 531 the SYN in case a middlebox might be blocking them, although the 532 required behaviour will depend on the specification of the other 533 option(s) and any attempt to co-ordinate fall-back between different 534 modules of the stack. Implementers MAY use other fall-back 535 strategies if they are found to be more effective (e.g. attempting to 536 retransmit a second AccECN segment before fall-back, falling back to 537 classic ECN feedback rather than non-ECN, and/or caching the result 538 of a previous attempt to access the same host while negotiating 539 AccECN). 541 The fall-back procedure if the TCP server receives no ACK to 542 acknowledge a SYN/ACK that tried to negotiate AccECN is specified in 543 Section 3.2.4. 
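The flag settings described so far in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the names EcnFlags, client_syn_flags, accecn_server_synack_flags and client_enters_accecn are hypothetical, only the case where the server is AccECN-enabled is modelled, and the fall-back shown is the simple SHOULD behaviour (NS=CWR=ECE=0 on the retransmitted SYN) rather than any of the alternative strategies that implementers MAY use.

```python
from typing import NamedTuple, Optional


class EcnFlags(NamedTuple):
    """The three ECN-related TCP header flags."""
    ns: int
    cwr: int
    ece: int


ACCECN_SYN = EcnFlags(ns=1, cwr=1, ece=1)
NON_ECN_SYN = EcnFlags(ns=0, cwr=0, ece=0)


def client_syn_flags(is_retransmission: bool) -> EcnFlags:
    """Flags an AccECN client puts on its SYN; a retransmitted SYN
    falls back to NS=CWR=ECE=0."""
    return NON_ECN_SYN if is_retransmission else ACCECN_SYN


def accecn_server_synack_flags(syn: EcnFlags,
                               syn_was_ce_marked: bool) -> Optional[EcnFlags]:
    """SYN/ACK flags of an AccECN-enabled server. Returns None when the
    SYN did not request AccECN (those cases are not modelled here)."""
    if syn != ACCECN_SYN:
        return None
    # CWR=1, ECE=0 confirms AccECN; NS reflects a CE mark on the SYN.
    return EcnFlags(ns=1 if syn_was_ce_marked else 0, cwr=1, ece=0)


def client_enters_accecn(sent_syn: EcnFlags, synack: EcnFlags) -> bool:
    """True if the client that sent sent_syn sets both half connections
    into AccECN mode on receiving synack."""
    return sent_syn == ACCECN_SYN and synack.cwr == 1 and synack.ece == 0
```

For example, a server that receives a CE-marked AccECN SYN answers with NS=1, CWR=1, ECE=0, and the client still enters AccECN mode because the NS flag is not part of the confirmation.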
545 The three flags set to 1 to indicate AccECN support on the SYN have 546 been carefully chosen to enable natural fall-back to prior stages in 547 the evolution of ECN. Table 2 tabulates all the negotiation 548 possibilities for ECN-related capabilities that involve at least one 549 AccECN-capable host. To compress the width of the table, the 550 headings of the first four columns have been severely abbreviated, as 551 follows:
553 Ac: More *Ac*curate ECN Feedback
555 N: ECN-*N*once [RFC3540]
557 E: *E*CN [RFC3168]
559 I: Not-ECN (*I*mplicit congestion notification using packet drop).
561 +----+---+---+---+------------+--------------+----------------------+
562 | Ac | N | E | I | SYN A->B   | SYN/ACK B->A | Feedback Mode        |
563 +----+---+---+---+------------+--------------+----------------------+
564 |    |   |   |   | NS CWR ECE | NS CWR ECE   |                      |
565 | AB |   |   |   | 1   1   1  | 0   1   0    | AccECN               |
566 | AB |   |   |   | 1   1   1  | 1   1   0    | AccECN (CE on SYN)   |
567 |    |   |   |   |            |              |                      |
568 | A  | B |   |   | 1   1   1  | 1   0   1    | classic ECN          |
569 | A  |   | B |   | 1   1   1  | 0   0   1    | classic ECN          |
570 | A  |   |   | B | 1   1   1  | 0   0   0    | Not ECN              |
571 |    |   |   |   |            |              |                      |
572 | B  | A |   |   | 0   1   1  | 0   0   1    | classic ECN          |
573 | B  |   | A |   | 0   1   1  | 0   0   1    | classic ECN          |
574 | B  |   |   | A | 0   0   0  | 0   0   0    | Not ECN              |
575 |    |   |   |   |            |              |                      |
576 | A  |   |   | B | 1   1   1  | 1   1   1    | Not ECN (broken)     |
577 | A  |   |   |   | 1   1   1  | 0   1   1    | Not ECN (see Appx B) |
578 | A  |   |   |   | 1   1   1  | 1   0   0    | Not ECN (see Appx B) |
579 +----+---+---+---+------------+--------------+----------------------+
581 Table 2: ECN capability negotiation between Originator (A) and 582 Responder (B)
584 Table 2 is divided into blocks each separated by an empty row.
586 1. The top block shows the case already described where both 587 endpoints support AccECN and how the TCP server (B) indicates 588 congestion feedback.
590 2. 
The second block shows the cases where the TCP client (A) 591 supports AccECN but the TCP server (B) supports some earlier 592 variant of TCP feedback, indicated in its SYN/ACK. Therefore, as 593 soon as an AccECN-capable TCP client (A) receives the SYN/ACK 594 shown it MUST set both its half connections into the feedback 595 mode shown in the rightmost column. 597 3. The third block shows the cases where the TCP server (B) supports 598 AccECN but the TCP client (A) supports some earlier variant of 599 TCP feedback, indicated in its SYN. Therefore, as soon as an 600 AccECN-enabled TCP server (B) receives the SYN shown, it MUST set 601 both its half connections into the feedback mode shown in the 602 rightmost column. 604 4. The fourth block displays combinations that are not valid or 605 currently unused and therefore both ends MUST fall-back to Not 606 ECN for both half connections. Especially the first case (marked 607 `broken') where all bits set in the SYN are reflected by the 608 receiver in the SYN/ACK, which happens quite often if the TCP 609 connection is proxied.{ToDo: Consider using the last two cases 610 for AccECN f/b of ECT(0) and ECT(1) on the SYN (Appendix B)} 612 The following exceptional cases need some explanation: 614 ECN Nonce: An AccECN implementation, whether client or server, 615 sender or receiver, does not need to implement the ECN Nonce 616 behaviour [RFC3540]. AccECN is compatible with an alternative ECN 617 feedback integrity approach that does not use up the ECT(1) 618 codepoint and can be implemented solely at the sender (see 619 Section 4.3). 621 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 622 with NS=1, CWR=1 and ECE=1, might receive another SYN from host B. 623 Host A MUST then enter the same feedback mode as it would have 624 entered had it been a responding host and received the same SYN. 
      Then host A MUST send the same SYN/ACK as it would have sent had
      it been a responding host (see the third block above).

3.2.  AccECN Feedback

   Each Data Receiver maintains four counters, r.cep, r.ceb, r.e0b and
   r.e1b. The CE packet counter (r.cep) counts the number of packets
   the host receives with the CE codepoint in the IP-ECN field,
   including CE marks on control packets without data. r.ceb, r.e0b and
   r.e1b count the number of TCP payload bytes in packets marked
   respectively with the CE, ECT(0) and ECT(1) codepoint in their IP-ECN
   field. When a host first enters AccECN mode, it initialises its
   counters to r.cep = 6, r.e0b = 1 and r.ceb = r.e1b = 0 (see
   Appendix A.5). Non-zero initial values are used to be distinct from
   cases where the fields are incorrectly zeroed (e.g. by middleboxes).

   A host feeds back the CE packet counter using the Accurate ECN (ACE)
   field, as explained in the next section. It feeds back all the
   byte counters using the AccECN TCP Option, as specified in
   Section 3.2.3. Whenever a host feeds back the value of any counter,
   it MUST report the most recent value, no matter whether it is in a
   pure ACK, an ACK with new payload data or a retransmission.

3.2.1.  The ACE Field

   After AccECN has been negotiated on the SYN and SYN/ACK, both hosts
   overload the three TCP flags ECE, CWR and NS in the main TCP header
   as one 3-bit field. Then the field is given a new name, ACE, as
   shown in Figure 2.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 2: Definition of the ACE field within bytes 13 and 14 of the
        TCP Header (when AccECN has been negotiated and SYN=0).
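
   The layout in Figure 2 can be read as a simple shift-and-mask on the
   16-bit word formed by bytes 13 and 14 of the TCP header. The Python
   sketch below is only illustrative: the function name and the
   accecn_negotiated parameter are hypothetical, and it assumes the
   word is given in network byte order with bit 0 of the figure as the
   most significant bit, so that the three ACE bits sit in the order
   NS, CWR, ECE.

```python
def ace_from_flags_word(word, accecn_negotiated=True):
    """Extract the 3-bit ACE field from TCP header bytes 13 and 14.

    word: the two bytes as one 16-bit integer (network byte order).
    Returns None when the ACE field is not defined, i.e. on any
    segment with SYN=1 or when AccECN has not been negotiated.
    """
    syn = (word >> 1) & 0x1          # bit 14 in Figure 2
    if syn or not accecn_negotiated:
        return None
    return (word >> 6) & 0x7         # bits 7..9 in Figure 2
```

   For example, a pure ACK with a 20-byte header and ACE=0b110 has the
   word 0x5190, for which the function returns 6.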
   The original definition of these three flags in the TCP header,
   including the addition of support for the ECN Nonce, is shown for
   comparison in Figure 1. This specification does not rename these
   three TCP flags; it merely overloads them with another name and
   definition once an AccECN connection has been established.

   A host MUST interpret the ECE, CWR and NS flags as the 3-bit ACE
   counter on a segment with SYN=0 that it sends or receives if both of
   its half-connections are set into AccECN mode having successfully
   negotiated AccECN (see Section 3.1). A host MUST NOT interpret the 3
   flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is
   0 or 1), or if AccECN negotiation is incomplete or has not succeeded.

   Both parts of each of these conditions are equally important. For
   instance, even if AccECN negotiation has been successful, the ACE
   field is not defined on any segments with SYN=1 (e.g. a
   retransmission of an unacknowledged SYN/ACK, or when both ends send
   SYN/ACKs after AccECN support has been successfully negotiated during
   a simultaneous open).

   The ACE field encodes the three least significant bits of the r.cep
   counter, therefore its initial value will be 0b110 (decimal 6). This
   non-zero initialization allows a TCP server to use a stateless
   handshake (see Section 4.1) but still detect from the TCP client's
   first ACK that the client considers it has successfully negotiated
   AccECN. If the SYN/ACK was CE marked, the client MUST increase its
   r.cep counter before it sends its first ACK, therefore the initial
   value of the ACE field will be 0b111 (decimal 7). These values have
   deliberately been chosen such that they are distinct from [RFC5562]
   behaviour, where the TCP client would set ECE on the first ACK as
   feedback for a CE mark on the SYN/ACK.
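
   To make the relationship between the counters and the ACE field
   concrete, the following Python sketch keeps the four Data Receiver
   counters with the initial values given in Section 3.2 and derives
   the ACE value from r.cep. The class and method names are
   hypothetical. The sketch uses unbounded integers and does not model
   the wrap of the wire fields, which a real implementation would have
   to handle.

```python
# IP-ECN codepoints as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


class AccEcnReceiverCounters:
    """Illustrative Data Receiver state (names are not from the draft)."""

    def __init__(self):
        self.cep = 6   # CE packet counter, initial value 6
        self.e0b = 1   # ECT(0) byte counter, initial value 1
        self.ceb = 0   # CE byte counter
        self.e1b = 0   # ECT(1) byte counter

    def on_packet(self, ip_ecn, payload_len):
        """Update counters for one arriving packet."""
        if ip_ecn == CE:
            self.cep += 1              # counts control packets too
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets change none of the four counters.

    def ace(self):
        """Three least significant bits of r.cep."""
        return self.cep & 0b111
```

   A client whose SYN/ACK arrived CE-marked would call on_packet(CE, 0)
   before sending its first ACK, so ace() moves from 6 to 7, matching
   the two initial values described above.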
   If the value of the ACE field on the first segment with SYN=0 in
   either direction is anything other than 0b110 or 0b111, the Data
   Receiver MUST disable ECN for the remainder of the half-connection by
   marking all subsequent packets as Not-ECT.

3.2.2.  Safety against Ambiguity of the ACE Field

   If too many CE-marked segments are acknowledged at once, or if a long
   run of ACKs is lost, the 3-bit counter in the ACE field might have
   cycled between two ACKs arriving at the Data Sender.

   Therefore an AccECN Data Receiver SHOULD immediately send an ACK once
   'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be
   2 and MUST be no greater than 6.

   If the Data Sender has not received AccECN TCP Options to give it
   more dependable information, and it detects that the ACE field could
   have cycled under the prevailing conditions, it SHOULD conservatively
   assume that the counter did cycle. It can detect if the counter
   could have cycled by using the jump in the acknowledgement number
   since the last ACK to calculate or estimate how many segments could
   have been acknowledged. An example algorithm to implement this
   policy is given in Appendix A.2. An implementer MAY develop an
   alternative algorithm as long as it satisfies these requirements.

   If missing acknowledgement numbers arrive later (reordering) and
   prove that the counter did not cycle, the Data Sender MAY attempt to
   neutralise the effect of any action it took based on a conservative
   assumption that it later found to be incorrect.

3.2.3.  The AccECN Option

   The AccECN Option is defined as shown below in Figure 3. It consists
   of three 24-bit fields that provide the 24 least significant bits of
   the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E'
   of each field name stands for 'Echo'.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Kind = TBD1  |  Length = 11  |          EE0B field           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | EE0B (cont'd) |                  ECEB field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  EE1B field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 3: The AccECN Option

   The Data Receiver MUST set the Kind field to TBD1, which is
   registered in Section 6 as a new TCP option Kind called AccECN. An
   experimental TCP option with Kind=254 MAY be used for initial
   experiments, with magic number 0xACCE.

   Appendix A.1 gives an example algorithm for the Data Receiver to
   encode its byte counters into the AccECN Option, and for the Data
   Sender to decode the AccECN Option fields into its byte counters.

   Note that there is no field to feed back Not-ECT bytes. Nonetheless
   an algorithm for the Data Sender to calculate the number of payload
   bytes received as Not-ECT is given in Appendix A.5.

   Whenever a Data Receiver sends an AccECN Option, the rules in
   Section 3.2.5 expect it to always send a full-length option. To cope
   with option space limitations, it can omit unchanged fields from the
   tail of the option, as long as it preserves the order of the
   remaining fields and includes any field that has changed. The length
   field MUST indicate which fields are present as follows:

      Length=11: EE0B, ECEB, EE1B

      Length=8: EE0B, ECEB

      Length=5: EE0B

      Length=2: (empty)

   The empty option of Length=2 is provided to allow for a case where an
   AccECN Option has to be sent (e.g. on the SYN/ACK to test the path),
   but there is very limited space for the option. For initial
   experiments, the Length field MUST be 2 greater to accommodate the
   16-bit magic number.
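
   The mapping between the Length field and the fields present can be
   sketched in Python as follows. This is an illustration under stated
   assumptions, not the algorithm of Appendix A.1: the function names
   are hypothetical, the Kind value is passed in by the caller because
   TBD1 has not been assigned, fields are assumed to be in network
   byte order, and the experimental variant with the 0xACCE magic
   number is not covered.

```python
FIELD_NAMES = ("ee0b", "eceb", "ee1b")   # order fixed by Figure 3
VALID_LENGTHS = (2, 5, 8, 11)


def encode_accecn_option(kind, e0b, ceb, e1b, n_fields=3):
    """Build the option bytes, truncating fields from the tail only."""
    if n_fields not in (0, 1, 2, 3):
        raise ValueError("n_fields must be 0..3")
    counters = (e0b, ceb, e1b)[:n_fields]
    # Each field carries the 24 least significant bits of its counter.
    body = b"".join((c & 0xFFFFFF).to_bytes(3, "big") for c in counters)
    return bytes([kind, 2 + len(body)]) + body


def decode_accecn_option(option):
    """Return a dict of the fields present, or None if the length is
    not one of those listed above."""
    if len(option) < 2 or option[1] != len(option):
        return None
    length = option[1]
    if length not in VALID_LENGTHS:
        return None
    fields = {}
    for i, name in enumerate(FIELD_NAMES[: (length - 2) // 3]):
        start = 2 + 3 * i
        fields[name] = int.from_bytes(option[start:start + 3], "big")
    return fields
```

   With the initial counter values r.e0b = 1 and r.ceb = r.e1b = 0, a
   full-length option decodes to ee0b = 1, eceb = 0 and ee1b = 0.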
   All implementations of a Data Sender MUST be able to read in AccECN
   Options of any of the above lengths. They MUST ignore an AccECN
   Option of any other length.

3.2.4.  Path Traversal of the AccECN Option

   An AccECN host MUST NOT include the AccECN TCP Option on the SYN.
   Nonetheless, if the AccECN negotiation using the ECN flags in the
   main TCP header (Section 3.1) is successful, it implicitly declares
   that the endpoints also support the AccECN TCP Option.

   If the TCP client indicated AccECN support, a TCP server that
   confirms its support for AccECN (as described in Section 3.1) SHOULD
   also include an AccECN TCP Option in the SYN/ACK. A TCP client that
   has successfully negotiated AccECN SHOULD include an AccECN Option in
   the first ACK at the end of the 3WHS. However, this first ACK is not
   delivered reliably, so the TCP client SHOULD also include an AccECN
   Option on the first data segment it sends (if it ever sends one). A
   host MAY omit the AccECN Option in any of these three cases if
   it has cached knowledge that the packet would be likely to be blocked
   on the path to the other host if it included an AccECN Option.

   If the TCP client has successfully negotiated AccECN but does not
   receive an AccECN Option on the SYN/ACK, it switches into a mode that
   assumes that the AccECN Option is not available for this half
   connection. Similarly, if the TCP server has successfully negotiated
   AccECN but does not receive an AccECN Option on the first ACK or on
   the first data segment, it switches into a mode that assumes that the
   AccECN Option is not available for this half connection.

   While a host is in the mode that assumes the AccECN Option is not
   available, it MUST adopt the conservative interpretation of the ACE
   field discussed in Section 3.2.2.
   However, it cannot make any
   assumption about support of the AccECN Option on the other half
   connection, so it MUST continue to send the AccECN Option itself.

   If after the normal TCP timeout the TCP server has not received an
   ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been
   lost, e.g. due to congestion, or a middlebox might be blocking the
   AccECN Option. To expedite connection setup, the host SHOULD fall
   back to NS=CWR=ECE=0 and no AccECN Option on the retransmission of
   the SYN/ACK. Implementers MAY use other fall-back strategies if they
   are found to be more effective (e.g. retransmitting a SYN/ACK with
   AccECN TCP flags but not the AccECN Option; attempting to retransmit
   a second AccECN segment before fall-back (most appropriate during
   high levels of congestion); or falling back to classic ECN feedback
   rather than non-ECN).

   Similarly, if the TCP client detects that the first data segment it
   sent with the AccECN Option was lost, it SHOULD fall back to no
   AccECN Option on the retransmission. Again, implementers MAY use
   other fall-back strategies such as attempting to retransmit a second
   segment with the AccECN Option before fall-back, and/or caching the
   result of previous attempts.

   Either host MAY include the AccECN Option in a subsequent segment to
   retest whether the AccECN Option can traverse the path.

   Currently the Data Sender is not required to test whether the
   arriving byte counters in the AccECN Option have been correctly
   initialised. This allows different initial values to be used as an
   additional signalling channel in future. If any inappropriate
   zeroing of these fields is discovered during testing, this approach
   will need to be reviewed.

3.2.5.  Usage of the AccECN TCP Option

   The following rules determine when a Data Receiver in AccECN mode
   sends the AccECN TCP Option, and which fields to include:

   Change-Triggered ACKs: If an arriving packet increments a different
      byte counter to that incremented by the previous packet, the Data
      Receiver SHOULD immediately send an ACK with an AccECN Option,
      without waiting for the next delayed ACK. Certain offload
      hardware might not be able to support change-triggered ACKs, but
      otherwise it is important to keep exceptions to this rule to a
      minimum so that Data Senders can generally rely on this behaviour;

   Continual Repetition: Otherwise, if arriving packets continue to
      increment the same byte counter, the Data Receiver can include an
      AccECN Option on most or all (delayed) ACKs, but it does not have
      to. If option space is limited on a particular ACK, the Data
      Receiver MUST give precedence to SACK information about loss. It
      SHOULD include an AccECN Option if the r.ceb counter has
      incremented and it MAY include an AccECN Option if r.e0b or
      r.e1b has incremented;

   Full-Length Options Preferred: It SHOULD always use full-length
      AccECN Options. It MAY use shorter AccECN Options if space is
      limited, but it MUST include the counter(s) that have incremented
      since the previous AccECN Option and it MUST only truncate fields
      from the right-hand tail of the option to preserve the order of
      the remaining fields (see Section 3.2.3);

   Beaconing Full-Length Options: Nonetheless, it MUST include a
      full-length AccECN TCP Option on at least three ACKs per RTT, or
      on all ACKs if there are fewer than three per RTT (see
      Appendix A.4 for an example algorithm that satisfies this
      requirement).
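
   The interaction between delayed ACKs and change-triggered ACKs can
   be sketched with a few lines of Python. The function below is a
   hypothetical illustration, not a normative algorithm. It assumes
   that every arriving packet carries payload, that each mark is an
   IP-ECN codepoint that increments one byte counter, and that the
   very first packet counts as a change because no counter was
   incremented before it.

```python
def acks_for_marks(marks, delayed_ack_factor=2):
    """Return one boolean per arriving packet: True if an ACK is sent.

    marks: sequence of IP-ECN codepoints, e.g. "01", "10" or "11".
    An ACK is sent when the mark differs from the previous packet's
    mark (change-triggered) or when delayed_ack_factor packets have
    arrived since the last ACK (ordinary delayed ACK).
    """
    decisions = []
    previous = None
    unacked = 0
    for mark in marks:
        unacked += 1
        changed = mark != previous
        send = changed or unacked >= delayed_ack_factor
        if send:
            unacked = 0
        decisions.append(send)
        previous = mark
    return decisions
```

   Applied to the example series of marks given in this section (01,
   01, 01, 10, 10, 01, 01, 11, 01), the function yields an ACK on the
   first, third, fourth, sixth, eighth and ninth packets, which is the
   ACK pattern listed there.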
   The following example series of arriving marks illustrates when a
   Data Receiver will emit an ACK if it is using a delayed ACK factor of
   2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> ACK, 10 ->
   ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK.

   For the avoidance of doubt, the change-triggered ACK mechanism
   ignores the arrival of a control packet with no payload, because it
   does not alter any byte counters. The change-triggered ACK approach
   will lead to some additional ACKs but it feeds back the timing and
   the order in which ECN marks are received with minimal additional
   complexity.

   Implementation note: sending an AccECN Option each time a different
   counter changes and including a full-length AccECN Option on every
   delayed ACK will satisfy the requirements described above and might
   be the easiest implementation, as long as sufficient space is
   available in each ACK (in total and in the option space).

   Appendix A.3 gives an example algorithm to estimate the number of
   marked bytes from the ACE field alone, if the AccECN Option is not
   available.

   If a host has determined that segments with the AccECN Option always
   seem to be discarded somewhere along the path, it is no longer
   obliged to follow the above rules.

3.3.  AccECN Compliance by TCP Proxies, Offload Engines and other
      Middleboxes

   A large class of middleboxes splits TCP connections. Such a middlebox
   would be compliant with the AccECN protocol if the TCP implementation
   on each side complied with the present AccECN specification and each
   side negotiated AccECN independently of the other side.

   Another large class of middleboxes intervenes to some degree at the
   transport layer, but attempts to be transparent (invisible) to the
   end-to-end connection.
   A subset of this class of middleboxes
   attempts to `normalise' the TCP wire protocol by checking that all
   values in header fields comply with a rather narrow interpretation of
   the TCP specifications. To comply with the present AccECN
   specification, such a middlebox MUST NOT change the ACE field or the
   AccECN Option and it MUST attempt to preserve the timing of each ACK
   (for example, if it coalesced ACKs it would not be AccECN-compliant).
   A middlebox claiming to be transparent at the transport layer MUST
   forward the AccECN TCP Option unaltered, whether or not the length
   value matches one of those specified in Section 3.2.3, and whether or
   not the initial values of the byte-counter fields are correct. This
   is because blocking apparently invalid values does not improve
   security (because AccECN hosts are required to ignore invalid values
   anyway), while it prevents the standardised set of values being
   extended in future (because outdated normalisers would block updated
   hosts from using the extended AccECN standard).

   Hardware to offload certain TCP processing represents another large
   class of middleboxes, even though it is often a function of a host's
   network interface and rarely in its own 'box'. Leeway has been
   allowed in the present AccECN specification in the expectation that
   offload hardware could comply and still serve its function.
   Nonetheless, such hardware MUST attempt to preserve the timing of
   each ACK (for example, if it coalesced ACKs it would not be
   AccECN-compliant).

4.  Interaction with Other TCP Variants

   This section is informative, not normative.

4.1.  Compatibility with SYN Cookies

   A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to
   protect itself from SYN flooding attacks. It places minimal commonly
   used connection state in the SYN/ACK, and deliberately does not hold
   any state while waiting for the subsequent ACK (e.g. it closes the
   thread). Therefore it cannot record the fact that it entered AccECN
   mode for both half-connections. Indeed, it cannot even remember
   whether it negotiated the use of classic ECN [RFC3168].

   Nonetheless, such a server can determine that it negotiated AccECN as
   follows. If a TCP server using SYN Cookies supports AccECN and if
   the first ACK it receives contains an ACE field with the value 0b110
   or 0b111, it can assume that:

   o  the TCP client must have requested AccECN support on the SYN

   o  it (the server) must have confirmed that it supported AccECN

   Therefore the server can switch itself into AccECN mode, and continue
   as if it had never forgotten that it switched itself into AccECN mode
   earlier.

4.2.  Compatibility with Other TCP Options and Experiments

   AccECN is compatible (at least on paper) with the most commonly used
   TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is
   also compatible with the recent promising experimental TCP options
   TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]).
   AccECN is friendly to all these protocols, because space for TCP
   options is particularly scarce on the SYN, where AccECN consumes zero
   additional header space.

   When option space is under pressure from other options, Section 3.2.5
   provides guidance on how important it is to send an AccECN Option and
   whether it needs to be a full-length option.

4.3.  Compatibility with Feedback Integrity Mechanisms

   The ECN Nonce [RFC3540] is an experimental IETF specification
   intended to allow a sender to test whether ECN CE markings (or
   losses) introduced in one network are being suppressed by the
   receiver or anywhere else in the feedback loop, such as another
   network or a middlebox. The ECN nonce has not been deployed as far
   as can be ascertained.
   The nonce would now be nearly impossible to
   deploy retrospectively, because to catch a misbehaving receiver it
   relies on the receiver volunteering feedback information to
   incriminate itself. A receiver that has been modified to misbehave
   can simply claim that it does not support nonce feedback, which will
   seem unremarkable given so many other hosts do not support it either.

   With minor changes AccECN could be optimised for the possibility that
   the ECT(1) codepoint might be used as a nonce. However, given the
   nonce is now probably undeployable, the AccECN design has been
   generalised so that it ought to be able to support other possible
   uses of the ECT(1) codepoint, such as a lower severity or a more
   instant congestion signal than CE.

   Three alternative mechanisms are available to assure the integrity of
   ECN and/or loss signals. AccECN is compatible with any of these
   approaches:

   o  The Data Sender can test the integrity of the receiver's ECN (or
      loss) feedback by occasionally setting the IP-ECN field to a value
      normally only set by the network (and/or deliberately leaving a
      sequence number gap). Then it can test whether the Data
      Receiver's feedback faithfully reports what it expects
      [I-D.moncaster-tcpm-rcv-cheat]. Unlike the ECN Nonce, this
      approach does not waste the ECT(1) codepoint in the IP header, it
      does not require standardisation and it does not rely on
      misbehaving receivers volunteering to reveal feedback information
      that allows them to be detected. However, setting the CE mark by
      the sender might conceal actual congestion feedback from the
      network and should therefore only be done sparsely.

   o  Networks generate congestion signals when they are becoming
      congested, so they are more likely than Data Senders to be
      concerned about the integrity of the receiver's feedback of these
      signals.
      A network can enforce a congestion response to its ECN
      markings (or packet losses) using congestion exposure (ConEx)
      audit [I-D.ietf-conex-abstract-mech]. Whether the receiver or a
      downstream network is suppressing congestion feedback or the
      sender is unresponsive to the feedback, or both, ConEx audit can
      neutralise any advantage that any of these three parties would
      otherwise gain.

      ConEx is a change to the Data Sender that is most useful when
      combined with AccECN. Without AccECN, the ConEx behaviour of a
      Data Sender would have to be more conservative than would be
      necessary if it had the accurate feedback of AccECN.

   o  The TCP authentication option (TCP-AO [RFC5925]) can be used to
      detect any tampering with AccECN feedback between the Data
      Receiver and the Data Sender (whether malicious or accidental).
      The AccECN fields are immutable end-to-end, so they are amenable
      to TCP-AO protection, which covers TCP options by default.
      However, TCP-AO is often too brittle to use on many end-to-end
      paths, where middleboxes can make verification fail in their
      attempts to improve performance or security, e.g. by
      resegmentation or shifting the sequence space.

5.  Protocol Properties

   This section is informative, not normative. It describes how well the
   protocol satisfies the agreed requirements for a more accurate ECN
   feedback protocol [RFC7560].

   Accuracy: From each ACK, the Data Sender can infer the number of new
      CE marked segments since the previous ACK. This provides better
      accuracy on CE feedback than classic ECN. In addition, if the
      AccECN Option is present (not blocked by the network path), the
      numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

   Overhead: The AccECN scheme is divided into two parts. The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header.
      The supplementary part adds an additional TCP option
      consuming up to 11 bytes. However, no TCP option is consumed in
      the SYN.

   Ordering: The order in which marks arrive at the Data Receiver is
      preserved in AccECN feedback, because the Data Receiver is
      expected to send an ACK immediately whenever a different mark
      arrives.

   Timeliness: While the same ECN markings are arriving continually at
      the Data Receiver, it can defer ACKs as TCP does normally, but it
      will immediately send an ACK as soon as a different ECN marking
      arrives.

   Timeliness vs Overhead: Change-Triggered ACKs are intended to enable
      latency-sensitive uses of ECN feedback by capturing the timing of
      transitions but not wasting resources while the state of the
      signalling system is stable. The receiver can control how
      frequently it sends the AccECN TCP Option and therefore it can
      control the overhead induced by AccECN.

   Resilience: All information is provided based on counters.
      Therefore if ACKs are lost, the counters on the first ACK
      following the losses allow the Data Sender to immediately recover
      the number of the ECN markings that it missed.

   Resilience against Bias: Because feedback is based on repetition of
      counters, random losses do not remove any information, they only
      delay it. Therefore, even though some ACKs are change-triggered,
      random losses will not alter the proportions of the different ECN
      markings in the feedback.

   Resilience vs Overhead: If space is limited in some segments (e.g.
      because more options are needed on some segments, such as the SACK
      option after loss), the Data Receiver can send AccECN Options less
      frequently or truncate fields that have not changed, usually down
      to as little as 5 bytes.
      However, it has to send a full-sized
      AccECN Option at least three times per RTT, which the Data Sender
      can rely on as a regular beacon or checkpoint.

   Resilience vs Timeliness and Ordering: Ordering information and the
      timing of transitions cannot be communicated in three cases: i)
      during ACK loss; ii) if something on the path strips the AccECN
      Option; or iii) if the Data Receiver is unable to support
      Change-Triggered ACKs.

   Complexity: An AccECN implementation solely involves simple counter
      increments, some modulo arithmetic to communicate the least
      significant bits and allow for wrap, and some heuristics for
      safety against fields cycling due to prolonged periods of ACK
      loss. Each host needs to maintain eight additional counters. The
      hosts have to apply some additional tests to detect tampering by
      middleboxes, but in general the protocol is simple to understand,
      simple to implement and requires few cycles per packet to execute.

   Integrity: AccECN is compatible with at least three approaches that
      can assure the integrity of ECN feedback. If the AccECN Option is
      stripped the resolution of the feedback is degraded, but the
      integrity of this degraded feedback can still be assured.

   Backward Compatibility: If only one endpoint supports the AccECN
      scheme, it will fall-back to the most advanced ECN feedback scheme
      supported by the other end.

   Backward Compatibility: If the AccECN Option is stripped by a
      middlebox, AccECN still provides basic congestion feedback in the
      ACE field. Further, AccECN can be used to detect mangling of the
      IP ECN field; mangling of the TCP ECN flags; blocking of
      ECT-marked segments; and blocking of segments carrying the AccECN
      Option. It can detect these conditions during TCP's 3WHS so that
      it can fall back to operation without ECN and/or operation without
      the AccECN Option.
   Forward Compatibility: The behaviour of endpoints and middleboxes is
      carefully defined for all reserved or currently unused codepoints
      in the scheme, to ensure that any blocking of anomalous values is
      always at least under reversible policy control.

6.  IANA Considerations

   This document defines a new TCP option for AccECN, assigned a value
   of TBD1 (decimal) from the TCP option space. This value is defined
   as:

   +------+--------+-----------------------+-----------+
   | Kind | Length | Meaning               | Reference |
   +------+--------+-----------------------+-----------+
   | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
   +------+--------+-----------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1]

   Early implementation before the IANA allocation MUST follow [RFC6994]
   and use experimental option 254 and magic number 0xACCE (16 bits)
   {ToDo register this with IANA}, then migrate to the new option after
   the allocation.

7.  Security Considerations

   If ever the supplementary part of AccECN based on the new AccECN TCP
   Option is unusable (due for example to middlebox interference) the
   essential part of AccECN's congestion feedback offers only limited
   resilience to long runs of ACK loss (see Section 3.2.2). These
   problems are unlikely to be due to malicious intervention (because if
   an attacker could strip a TCP option or discard a long run of ACKs it
   could wreak other arbitrary havoc). However, it would be of concern
   if AccECN's resilience could be indirectly compromised during a
   flooding attack.
   AccECN is still considered safe though, because if
   the option is not presented, the AccECN Data Sender is then required
   to switch to more conservative assumptions about wrap of congestion
   indication counters (see Section 3.2.2 and Appendix A.2).

   Section 4.1 describes how a TCP server can negotiate AccECN and use
   the SYN cookie method for mitigating SYN flooding attacks.

   There is concern that ECN markings could be altered or suppressed,
   particularly because a misbehaving Data Receiver could increase its
   own throughput at the expense of others. Given the experimental ECN
   nonce is now probably undeployable, AccECN has been generalised for
   other possible uses of the ECT(1) codepoint to avoid obsolescence of
   the codepoint even if the nonce mechanism is obsoleted. AccECN is
   compatible with the three other schemes known to assure the integrity
   of ECN feedback (see Section 4.3 for details). If the AccECN Option
   is stripped by an incorrectly implemented middlebox, the resolution
   of the feedback will be degraded, but the integrity of this degraded
   information can still be assured.

   The AccECN protocol is not believed to introduce any new privacy
   concerns, because it merely counts and feeds back signals at the
   transport layer that had already been visible at the IP layer.

8.  Acknowledgements

   We want to thank Koen De Schepper, Praveen Balasubramanian and
   Michael Welzl for their input and discussion. The idea of using the
   three ECN-related TCP flags as one field for more accurate TCP-ECN
   feedback was first introduced in the re-ECN protocol that was the
   ancestor of ConEx.

   Bob Briscoe was part-funded by the European Community under its
   Seventh Framework Programme through the Reducing Internet Transport
   Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
   (ICT-317756).
The views expressed here are solely those of the
1205    authors.

1207    This work is partly supported by the European Commission under
1208    Horizon 2020 grant agreement no. 688421 Measurement and Architecture
1209    for a Middleboxed Internet (MAMI), and by the Swiss State Secretariat
1210    for Education, Research, and Innovation under contract no. 15.0268.
1211    This support does not imply endorsement.

1213 9. Comments Solicited

1215    Comments and questions are encouraged and very welcome. They can be
1216    addressed to the IETF TCP maintenance and minor modifications working
1217    group mailing list, and/or to the authors.

1219 10. References
1220 10.1. Normative References

1222    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1223              Requirement Levels", BCP 14, RFC 2119,
1224              DOI 10.17487/RFC2119, March 1997.

1227    [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1228              of Explicit Congestion Notification (ECN) to IP",
1229              RFC 3168, DOI 10.17487/RFC3168, September 2001.

1232    [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1233              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

1236    [RFC6994] Touch, J., "Shared Use of Experimental TCP Options",
1237              RFC 6994, DOI 10.17487/RFC6994, August 2013.

1240 10.2. Informative References

1242    [I-D.bensley-tcpm-dctcp]
1243              Bensley, S., Eggert, L., Thaler, D., Balasubramanian, P.,
1244              and G. Judd, "Microsoft's Datacenter TCP (DCTCP): TCP
1245              Congestion Control for Datacenters", draft-bensley-tcpm-
1246              dctcp-05 (work in progress), July 2015.

1248    [I-D.ietf-conex-abstract-mech]
1249              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
1250              Concepts, Abstract Mechanism and Requirements", draft-
1251              ietf-conex-abstract-mech-13 (work in progress), October
1252              2014.

1254    [I-D.kuehlewind-tcpm-ecn-fallback]
1255              Kuehlewind, M. and B.
Trammell, "A Mechanism for ECN Path
1256              Probing and Fallback", draft-kuehlewind-tcpm-ecn-
1257              fallback-01 (work in progress), September 2013.

1259    [I-D.moncaster-tcpm-rcv-cheat]
1260              Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to
1261              Allow Senders to Identify Receiver Non-Compliance", draft-
1262              moncaster-tcpm-rcv-cheat-03 (work in progress), July 2014.

1264    [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
1265              Congestion Notification (ECN) Signaling with Nonces",
1266              RFC 3540, DOI 10.17487/RFC3540, June 2003.

1269    [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common
1270              Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

1273    [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K.
1274              Ramakrishnan, "Adding Explicit Congestion Notification
1275              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
1276              DOI 10.17487/RFC5562, June 2009.

1279    [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP
1280              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
1281              June 2010.

1283    [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
1284              "TCP Extensions for Multipath Operation with Multiple
1285              Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

1288    [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
1289              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

1292    [RFC7560] Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
1293              "Problem Statement and Requirements for Increased Accuracy
1294              in Explicit Congestion Notification (ECN) Feedback",
1295              RFC 7560, DOI 10.17487/RFC7560, August 2015.

1298 Appendix A. Example Algorithms

1300    This appendix is informative, not normative. It gives example
1301    algorithms that would satisfy the normative requirements of the
1302    AccECN protocol. However, implementers are free to choose other ways
1303    to implement the requirements.

1305 A.1.
Example Algorithm to Encode/Decode the AccECN Option

1307    The example algorithms below show how a Data Receiver in AccECN mode
1308    could encode its CE byte counter r.ceb into the ECEB field within the
1309    AccECN TCP Option, and how a Data Sender in AccECN mode could decode
1310    the ECEB field into its byte counter s.ceb. The other counters for
1311    bytes marked ECT(0) and ECT(1) in the AccECN Option would be
1312    similarly encoded and decoded.

1314    It is assumed that each local byte counter is an unsigned integer
1315    greater than 24b (probably 32b), and that the following constant has
1316    been assigned:

1318       DIVOPT = 2^24

1320    Every time a CE marked data segment arrives, the Data Receiver
1321    increments its local value of r.ceb by the size of the TCP Data.
1322    Whenever it sends an ACK with the AccECN Option, the value it writes
1323    into the ECEB field is

1325       ECEB = r.ceb % DIVOPT

1327    where '%' is the modulo operator.

1329    On the arrival of an AccECN Option, the Data Sender uses the TCP
1330    acknowledgement number and any SACK options to calculate newlyAckedB,
1331    the amount of new data that the ACK acknowledges in bytes. If
1332    newlyAckedB is negative it means that a more up to date ACK has
1333    already been processed, so this ACK has been superseded and the Data
1334    Sender has to ignore the AccECN Option. Then the Data Sender
1335    calculates the minimum difference d.ceb between the ECEB field and
1336    its local s.ceb counter, using modulo arithmetic as follows:

1338       if (newlyAckedB >= 0) {
1339           d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT
1340           s.ceb += d.ceb
1341       }

1343    For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal),
1344    then
1345       s.ceb % DIVOPT = 1
1346                d.ceb = (1461 + 2^24 - 1) % 2^24
1347                      = 1460
1348                s.ceb = 33,554,433 + 1460
1349                      = 33,555,893

1351 A.2.
Example Algorithm for Safety Against Long Sequences of ACK Loss

1353    The example algorithms below show how a Data Receiver in AccECN mode
1354    could encode its CE packet counter r.cep into the ACE field, and how
1355    the Data Sender in AccECN mode could decode the ACE field into its
1356    s.cep counter. The Data Sender's algorithm includes code to
1357    heuristically detect a long enough unbroken string of ACK losses that
1358    could have concealed a cycle of the congestion counter in the ACE
1359    field of the next ACK to arrive.

1361    Two variants of the algorithm are given: i) a more conservative
1362    variant for a Data Sender to use if it detects that the AccECN Option
1363    is not available (see Section 3.2.2 and Section 3.2.4); and ii) a
1364    less conservative variant that is feasible when complementary
1365    information is available from the AccECN Option.

1367 A.2.1. Safety Algorithm without the AccECN Option

1369    It is assumed that each local packet counter is a sufficiently sized
1370    unsigned integer (probably 32b) and that the following constant has
1371    been assigned:

1373       DIVACE = 2^3

1375    Every time a CE marked packet arrives, the Data Receiver increments
1376    its local value of r.cep by 1. It repeats the same value of ACE in
1377    every subsequent ACK until the next CE marking arrives, where

1379       ACE = r.cep % DIVACE.

1381    If the Data Sender received an earlier value of the counter that had
1382    been delayed due to ACK reordering, it might incorrectly calculate
1383    that the ACE field had wrapped. Therefore, on the arrival of every
1384    ACK, the Data Sender uses the TCP acknowledgement number and any SACK
1385    options to calculate newlyAckedB, the amount of new data that the ACK
1386    acknowledges. If newlyAckedB is negative it means that a more up to
1387    date ACK has already been processed, so this ACK has been superseded
1388    and the Data Sender has to ignore the AccECN Option.
If newlyAckedB
1389    is zero, to break the tie the Data Sender could use timestamps (if
1390    present) to work out newlyAckedT, the amount of new time that the ACK
1391    acknowledges. Then the Data Sender calculates the minimum difference
1392    d.cep between the ACE field and its local s.cep counter, using modulo
1393    arithmetic as follows:

1395       if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
1396           d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

1398    Section 3.2.2 requires the Data Sender to assume that the ACE field
1399    did cycle if it could have cycled under prevailing conditions. The
1400    3-bit ACE field in an arriving ACK could have cycled and become
1401    ambiguous to the Data Sender if a row of ACKs goes missing that
1402    covers a stream of data long enough to contain 8 or more CE marks.
1403    We use the word `missing' rather than `lost', because some or all the
1404    missing ACKs might arrive eventually, but out of order. Even if some
1405    of the lost ACKs are piggy-backed on data (i.e. not pure ACKs)
1406    retransmissions will not repair the lost AccECN information, because
1407    AccECN requires retransmissions to carry the latest AccECN counters,
1408    not the original ones.

1410    The phrase `under prevailing conditions' allows the Data Sender to
1411    take account of the prevailing size of data segments and the
1412    prevailing CE marking rate just before the sequence of ACK losses.
1413    However, we shall start with the simplest algorithm, which assumes
1414    segments are all full-sized and ultra-conservatively it assumes that
1415    ECN marking was 100% on the forward path when ACKs on the reverse
1416    path started to all be dropped. Specifically, if newlyAckedB is the
1417    amount of data that an ACK acknowledges since the previous ACK, then
1418    the Data Sender could assume that this acknowledges newlyAckedPkt
1419    full-sized segments, where newlyAckedPkt = newlyAckedB/MSS.
Then it
1420    could assume that the ACE field incremented by

1422       dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE)

1424    For example, imagine an ACK acknowledges newlyAckedPkt=9 more full-
1425    size segments than any previous ACK, and that ACE increments by a
1426    minimum of 2 CE marks (d.cep=2). The above formula works out that it
1427    would still be safe to assume 2 CE marks (because 9 - ((9-2) % 8) =
1428    2). However, if ACE increases by a minimum of 2 but acknowledges 10
1429    full-sized segments, then it would be necessary to assume that there
1430    could have been 10 CE marks (because 10 - ((10-2) % 8) = 10).

1432    Implementers could build in more heuristics to estimate prevailing
1433    average segment size and prevailing ECN marking. For instance,
1434    newlyAckedPkt in the above formula could be replaced with
1435    newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
1436    segment size and p is the prevailing ECN marking probability.
1437    However, ultimately, if TCP's ECN feedback becomes inaccurate it
1438    still has loss detection to fall back on. Therefore, it would seem
1439    safe to implement a simple algorithm, rather than a perfect one.

1441    The simple algorithm for dSafer.cep above requires no monitoring of
1442    prevailing conditions and it would still be safe if, for example,
1443    segments were on average at least 5% of full-sized as long as ECN
1444    marking was 5% or less. Assuming it was used, the Data Sender would
1445    increment its packet counter as follows:

1447       s.cep += dSafer.cep

1449    If missing acknowledgement numbers arrive later (due to reordering),
1450    Section 3.2.2 says "the Data Sender MAY attempt to neutralise the
1451    effect of any action it took based on a conservative assumption that
1452    it later found to be incorrect". To do this, the Data Sender would
1453    have to store the values of all the relevant variables whenever it
1454    made assumptions, so that it could re-evaluate them later.
Given
1455    this could become complex and it is not required, we do not attempt
1456    to provide an example of how to do this.

1458 A.2.2. Safety Algorithm with the AccECN Option

1460    When the AccECN Option is available on the ACKs before and after the
1461    possible sequence of ACK losses, if the Data Sender only needs CE-
1462    marked bytes, it will have sufficient information in the AccECN
1463    Option without needing to process the ACE field. However, if for
1464    some reason it needs CE-marked packets and dSafer.cep is different
1465    from d.cep, it can calculate the average marked segment size that
1466    each implies to determine whether d.cep is likely to be a safe enough
1467    estimate. Specifically, it could use the following algorithm, where
1468    d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

1470       SAFETY_FACTOR = 2
1471       if (dSafer.cep > d.cep) {
1472           s = d.ceb/d.cep
1473           if (s <= MSS) {
1474               sSafer = d.ceb/dSafer.cep
1475               if (sSafer < MSS/SAFETY_FACTOR)
1476                   dSafer.cep = d.cep    % d.cep is a safe enough estimate
1477           } % else
1478               % No need for else; dSafer.cep is already correct,
1479               % because d.cep must have been too small
1480       }

1482    The chart below shows when the above algorithm will consider d.cep
1483    can replace dSafer.cep as a safe enough estimate of the number of CE-
1484    marked packets:

1486              ^
1487        sSafer|
1488              |
1489           MSS+
1490              |
1491              |         dSafer.cep
1492              |                is
1493         MSS/2+--------------+ safest
1494              |              |
1495              | d.cep is safe|
1496              |    enough    |
1497              +-------------------->
1498                            MSS   s

1500    The following examples give the reasoning behind the algorithm,
1501    assuming MSS=1,460 [B]:

1503    o if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
1504      sSafer=182.5.
1505      Therefore even though the average size of 8 data segments is
1506      unlikely to have been as small as MSS/8, d.cep cannot have been
1507      correct, because it would imply an average segment size greater
1508      than the MSS.
1510    o if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
1511      sSafer=146.
1512      Therefore d.cep is safe enough, because the average size of 10
1513      data segments is unlikely to have been as small as MSS/10.

1515    o if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
1516      sSafer=680.
1517      Therefore d.cep is safe enough, because the average data segment
1518      size is more likely to have been just less than one MSS, rather
1519      than below MSS/2.

1521    If pure ACKs were allowed to be ECN-capable, missing ACKs would be
1522    far less likely. However, because [RFC3168] currently precludes
1523    this, the above algorithm assumes that pure ACKs are not ECN-capable.

1525 A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

1527    If the AccECN Option is not available, the Data Sender can only
1528    decode CE-marking from the ACE field in packets. Every time an ACK
1529    arrives, to convert this into an estimate of CE-marked bytes, it
1530    needs an average of the segment size, s_ave. Then it can add or
1531    subtract s_ave from the value of d.ceb as the value of d.cep
1532    increments or decrements.

1534    To calculate s_ave, it could keep a record of the byte numbers of all
1535    the boundaries between packets in flight (including control packets),
1536    and recalculate s_ave on every ACK. However it would be simpler to
1537    merely maintain a counter packets_in_flight for the number of packets
1538    in flight (including control packets), which it could update once per
1539    RTT. Either way, it would estimate s_ave as:

1541       s_ave ~= flightsize / packets_in_flight,

1543    where flightsize is the variable that TCP already maintains for the
1544    number of bytes in flight. To avoid floating point arithmetic, it
1545    could right-bit-shift by lg(packets_in_flight), where lg() means log
1546    base 2.
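
The s_ave estimate described above could be sketched as follows. This is
only an illustration under stated assumptions, not part of the
specification: the function names average_segment_size() and
estimate_marked_bytes() are hypothetical, and lg() is taken here to mean
the integer part of log base 2, so the shift is exact only when
packets_in_flight is a power of two.

```python
def average_segment_size(flightsize: int, packets_in_flight: int) -> int:
    """Estimate s_ave ~= flightsize / packets_in_flight with integer maths.

    Hypothetical helper; the name and the zero-packet guard are assumptions.
    """
    if packets_in_flight <= 0:
        return 0  # nothing in flight, so there is no segment size to estimate
    # Right shift by floor(lg(packets_in_flight)). When packets_in_flight is
    # not a power of two this divides by a smaller number than the true count,
    # so s_ave is over-estimated rather than under-estimated.
    shift = packets_in_flight.bit_length() - 1
    return flightsize >> shift


def estimate_marked_bytes(d_cep: int, s_ave: int) -> int:
    """Convert a change in CE-marked packets (d.cep) into estimated bytes."""
    return d_cep * s_ave


s_ave = average_segment_size(flightsize=11680, packets_in_flight=8)
d_ceb_estimate = estimate_marked_bytes(d_cep=2, s_ave=s_ave)
```

With 8 full-sized segments of 1,460 bytes in flight, the shift divides
exactly and the estimate for two newly marked packets is two full
segments. For a packet count that is not a power of two, an
implementation that prefers accuracy over speed could use ordinary
integer division instead of the shift.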
1548    An alternative would be to maintain an exponentially weighted moving
1549    average (EWMA) of the segment size:

1551       s_ave = a * s + (1-a) * s_ave,

1553    where a is the decay constant for the EWMA. However, then it is
1554    necessary to choose a good value for this constant, which ought to
1555    depend on the number of packets in flight. Also the decay constant
1556    needs to be a power of two to avoid floating point arithmetic.

1558 A.4. Example Algorithm to Beacon AccECN Options

1560    Section 3.2.5 requires a Data Receiver to beacon a full-length AccECN
1561    Option at least 3 times per RTT. This could be implemented by
1562    maintaining a variable to store the number of ACKs (pure and data
1563    ACKs) since a full AccECN Option was last sent and another for the
1564    approximate number of ACKs sent in the last round trip time:

1566       if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
1567           send_full_AccECN_Option()

1569    For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
1570    rather than 3, so that the division could be implemented as an
1571    integer right bit-shift by lg(BEACON_FREQ).

1573    In certain operating systems, it might be too complex to maintain
1574    acks_in_round. In others it might be possible by tagging each data
1575    segment in the retransmit buffer with the number of ACKs sent at the
1576    point that segment was sent. This would not work well if the Data
1577    Receiver was not sending data itself, in which case it might be
1578    necessary to beacon based on time instead, as follows:

1580       if ( time_now > time_last_option_sent + (RTT / BEACON_FREQ) )
1581           send_full_AccECN_Option()

1583    This time-based approach does not work well when all the ACKs are
1584    sent early in each round trip, as is the case during slow-start. In
1585    this case few options will be sent (possibly even fewer than 3 per
        RTT).
1586    However, when continuously sending data, data packets as well as ACKs
1587    will spread out equally over the RTT and sufficient ACKs with the
1588    AccECN option will be sent.

1590 A.5. Example Algorithm to Count Not-ECT Bytes

1592    A Data Sender in AccECN mode can infer the amount of TCP payload data
1593    arriving at the receiver marked Not-ECT from the difference between
1594    the amount of newly ACKed data and the sum of the bytes with the
1595    other three markings, d.ceb, d.e0b and d.e1b. Note that, because
1596    r.e0b is initialised to 1 and the other two counters are initialised
1597    to 0, the initial sum will be 1, which matches the initial offset of
1598    the TCP sequence number on completion of the 3WHS.

1600    For this approach to be precise, it has to be assumed that spurious
1601    (unnecessary) retransmissions do not lead to double counting. This
1602    assumption is currently correct, given that RFC 3168 requires that
1603    the Data Sender marks retransmitted segments as Not-ECT. However,
1604    the converse is not true; necessary retransmissions will result in
1605    under-counting.

1607    However, such precision is unlikely to be necessary. The only known
1608    use of a count of Not-ECT marked bytes is to test whether equipment
1609    on the path is clearing the ECN field (perhaps due to an out-dated
1610    attempt to clear, or bleach, what used to be the ToS field). To
1611    detect bleaching it will be sufficient to detect whether nearly all
1612    bytes arrive marked as Not-ECT. Therefore there should be no need to
1613    keep track of the details of retransmissions.

1615 Appendix B. Alternative Design Choices (To Be Removed Before
1616             Publication)

1618    This appendix is informative, not normative.
It records alternative
1619    designs that the authors chose not to include in the normative
1620    specification, but which the IETF might wish to consider for
1621    inclusion:

1623    Feedback all four ECN codepoints on the SYN/ACK: The last two
1624       negotiation combinations in Table 2 could also be used to indicate
1625       AccECN support and to feed back that the arriving SYN was ECT(0) or
1626       ECT(1). This could be used to probe the client to server path for
1627       incorrect forwarding of the ECN field
1628       [I-D.kuehlewind-tcpm-ecn-fallback]. Note, however, that it would
1629       be unremarkable if ECN on the SYN was zeroed by security devices,
1630       given RFC 3168 prohibited ECT on SYN because it enables DoS
1631       attacks.

1633    Feedback all four ECN codepoints on the First ACK: To probe the
1634       server to client path for incorrect ECN forwarding, it could be
1635       useful to have four feedback states on the first ACK from the TCP
1636       client. This could be achieved by assigning four combinations of
1637       the ECN flags in the main TCP header, and only initialising the
1638       ACE field on subsequent segments.

1640    Empty AccECN Option: It might be useful to allow an empty (Length=2)
1641       AccECN Option on the SYN/ACK and first ACK. Then if a host had to
1642       omit the option because there was insufficient space for a larger
1643       option, it would not give the impression to the other end that a
1644       middlebox had stripped the option.

1646 Appendix C. Open Protocol Design Issues (To Be Removed Before
1647             Publication)

1649    1. Currently it is specified that the receiver `SHOULD' use Change-
1650       Triggered ACKs. It is controversial whether this ought to be a
1651       `MUST' instead. A `SHOULD' would leave the Data Sender uncertain
1652       whether it can rely on the timing and ordering information in
1653       ACKs. If the sender guesses wrongly, it will probably introduce
1654       at least 1 RTT of delay before it can use this timing
1655       information.
Ironically it will most likely be wanting this
1656       information to reduce ramp-up delay. A `MUST' could make it hard
1657       to implement AccECN in offload hardware. However, it is not
1658       known whether AccECN would be hard to implement in such hardware
1659       even with a `SHOULD' here. For instance, was it hard to offload
1660       DCTCP to hardware because of change-triggered ACKs, or was this
1661       just one of many reasons? The choice between MUST and SHOULD
1662       here is critical. Before that choice is made, a clear use-case
1663       for certainty of timing and ordering information is needed, plus
1664       well-informed discussion about hardware offload constraints.

1666    2. There is possibly a concern that a receiver could deliberately
1667       omit the AccECN Option pretending that it had been stripped by a
1668       middlebox. No known way can yet be contrived to take advantage
1669       of this downgrade attack, but it is mentioned here in case
1670       someone else can contrive one.

1672    3. The s.cep counter might increase even if the s.ceb counter does
1673       not (e.g. due to a CE-marked control packet). The sender's
1674       response to such a situation is considered out of scope, because
1675       this ought to be dealt with in whatever future specification
1676       allows ECN-capable control packets. However, it is possible that
1677       the situation might arise even if the sender has not sent ECN-
1678       capable control packets, in which case, this draft might need to
1679       give some advice on how the sender should respond.

1681 Appendix D. Changes in This Version (To Be Removed Before Publication)

1683    The difference between any pair of versions can be displayed at
1684

1687    From kuehlewind-05 to ietf-00: Filename change to reflect WG
1688       adoption.
1690 Authors' Addresses

1692    Bob Briscoe
1693    Simula Research Laboratory

1695    EMail: ietf@bobbriscoe.net
1696    URI: http://bobbriscoe.net/

1698    Mirja Kuehlewind
1699    ETH Zurich
1700    Zurich
1701    Switzerland

1703    EMail: mirja.kuehlewind@tik.ee.ethz.ch

1705    Richard Scheffenegger
1706    Vienna
1707    Austria

1709    EMail: rscheff@gmx.at