TCP Maintenance & Minor Extensions (tcpm)                     B. Briscoe
Internet-Draft                                Simula Research Laboratory
Intended status: Experimental                              M. Kuehlewind
Expires: January 1, 2017                                      ETH Zurich
                                                        R. Scheffenegger
                                                            NetApp, Inc.
                                                           June 30, 2016

                   More Accurate ECN Feedback in TCP
                    draft-ietf-tcpm-accurate-ecn-01

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points.
   Receivers with an ECN-capable transport protocol feed back this
   information to the sender. ECN is specified for TCP in such a way
   that only one feedback signal can be transmitted per Round-Trip Time
   (RTT). Recently, new TCP mechanisms like Congestion Exposure (ConEx)
   or Data Center TCP (DCTCP) need more accurate ECN feedback
   information whenever more than one marking is received in one RTT.
   This document specifies an experimental scheme to provide more than
   one feedback signal per RTT in the TCP header. Given TCP header
   space is scarce, it overloads the three existing ECN-related flags in
   the TCP header and provides additional information in a new TCP
   option.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 1, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Document Roadmap
     1.2. Goals
     1.3. Experiment Goals
     1.4. Terminology
     1.5. Recap of Existing ECN feedback in IP/TCP
   2. AccECN Protocol Overview and Rationale
     2.1. Capability Negotiation
     2.2. Feedback Mechanism
     2.3. Delayed ACKs and Resilience Against ACK Loss
     2.4. Feedback Metrics
     2.5. Generic (Dumb) Reflector
   3. AccECN Protocol Specification
     3.1. Negotiation during the TCP handshake
     3.2. AccECN Feedback
       3.2.1. The ACE Field
       3.2.2. Safety against Ambiguity of the ACE Field
       3.2.3. The AccECN Option
       3.2.4. Path Traversal of the AccECN Option
       3.2.5. Usage of the AccECN TCP Option
     3.3. AccECN Compliance by TCP Proxies, Offload Engines and other
          Middleboxes
   4. Interaction with Other TCP Variants
     4.1. Compatibility with SYN Cookies
     4.2. Compatibility with Other TCP Options and Experiments
     4.3. Compatibility with Feedback Integrity Mechanisms
   5. Protocol Properties
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. Comments Solicited
   10. References
     10.1. Normative References
     10.2. Informative References
   Appendix A. Example Algorithms
     A.1. Example Algorithm to Encode/Decode the AccECN Option
     A.2. Example Algorithm for Safety Against Long Sequences of ACK
          Loss
       A.2.1. Safety Algorithm without the AccECN Option
       A.2.2. Safety Algorithm with the AccECN Option
     A.3. Example Algorithm to Estimate Marked Bytes from Marked
          Packets
     A.4. Example Algorithm to Beacon AccECN Options
     A.5. Example Algorithm to Count Not-ECT Bytes
   Appendix B. Alternative Design Choices (To Be Removed Before
               Publication)
   Appendix C. Open Protocol Design Issues (To Be Removed Before
               Publication)
   Appendix D. Changes in This Version (To Be Removed Before
               Publication)
   Authors' Addresses

1. Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points. Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender.
   ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [I-D.ietf-conex-abstract-mech]) or DCTCP [I-D.bensley-tcpm-dctcp]
   need more accurate ECN feedback information whenever more than one
   marking is received in one RTT. A fuller treatment of the motivation
   for this specification is given in the associated requirements
   document [RFC7560].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT. It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short. If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic ECN
   feedback, not a fork in the design of TCP. Thus, the applicability
   of AccECN is intended to include all public and private IP networks
   (and even any non-IP networks over which TCP is used today). Until
   the AccECN experiment succeeds, [RFC3168] will remain as the
   standards track specification for adding ECN to TCP. To avoid
   confusion, in this document we use the term 'classic ECN' for the
   pre-existing ECN specification [RFC3168].

   AccECN is solely an (experimental) change to the TCP wire protocol.
   It is completely independent of how TCP might respond to congestion
   feedback. This specification overloads flags and fields in the main
   TCP header with new definitions, so both ends have to support the new
   wire protocol before it can be used. Therefore during the TCP
   handshake the two ends use the three ECN-related flags in the TCP
   header to negotiate the most advanced feedback protocol that they can
   both support.

   It is likely (but not required) that the AccECN protocol will be
   implemented along with the following experimental additions to the
   TCP-ECN protocol: ECN-capable SYN/ACK [RFC5562], ECN path-probing and
   fall-back [I-D.kuehlewind-tcpm-ecn-fallback] and testing receiver
   non-compliance [I-D.moncaster-tcpm-rcv-cheat].

1.1. Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like. Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol. Then
   Section 3 gives the normative protocol specification. Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not. Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community.

   Appendix A gives pseudocode examples for the various algorithms that
   AccECN uses.

1.2. Goals

   [RFC7560] enumerates requirements that a candidate feedback scheme
   will need to satisfy, under the headings: resilience, timeliness,
   integrity, accuracy (including ordering and lack of bias),
   complexity, overhead and compatibility (both backward and forward).
   It recognises that a perfect scheme that fully satisfies all the
   requirements is unlikely and trade-offs between requirements are
   likely. Section 5 presents the properties of AccECN against these
   requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic (dumb)
   reflector of congestion information so that in future new sender
   behaviours can be deployed unilaterally.

1.3. Experiment Goals

   TCP is critical to the robust functioning of the Internet; therefore
   any proposed modifications to TCP need to be thoroughly tested. The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol. The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   The experimental protocol will be considered successful if it
   satisfies the requirements of [RFC7560] in the consensus opinion of
   the IETF tcpm working group. In short, this requires that it
   improves the accuracy and timeliness of TCP's ECN feedback, as
   claimed in Section 5, while striking a balance between the
   conflicting requirements of resilience, integrity and minimisation of
   overhead. It also requires that it is not unduly complex, and that
   it is compatible with prevalent equipment behaviours in the current
   Internet, whether or not they comply with standards.

1.4. Terminology

   AccECN: The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN: The ECN protocol specified in [RFC3168].

   Classic ECN feedback: The feedback aspect of the ECN protocol
      specified in [RFC3168], including generation, encoding,
      transmission and decoding of feedback, but not the Data Sender's
      subsequent response to that feedback.

   ACK: A TCP acknowledgement, with or without a data payload.

   Pure ACK: A TCP acknowledgement without a data payload.

   TCP client: The TCP stack that originates a connection.

   TCP server: The TCP stack that responds to a connection request.

   Data Receiver: The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender: The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.5. Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] uses two bits in the IP header. Once ECN has been
   negotiated with the receiver at the transport layer, an ECN sender
   can set two possible codepoints (ECT(0) or ECT(1)) in the IP header
   to indicate an ECN-capable transport (ECT). If both ECN bits are
   zero, the packet is considered to have been sent by a Not-ECN-capable
   Transport (Not-ECT). When a network node experiences congestion, it
   will occasionally either drop or mark a packet, with the choice
   depending on the packet's ECN codepoint. If the codepoint is
   Not-ECT, only drop is appropriate. If the codepoint is ECT(0) or
   ECT(1), the node can mark the packet by setting both ECN bits, which
   is termed 'Congestion Experienced' (CE), or loosely a 'congestion
   mark'. Table 1 summarises these codepoints.

   +-----------------------+---------------+---------------------------+
   | IP-ECN codepoint      | Codepoint     | Description               |
   | (binary)              | name          |                           |
   +-----------------------+---------------+---------------------------+
   | 00                    | Not-ECT       | Not ECN-Capable Transport |
   | 01                    | ECT(1)        | ECN-Capable Transport (1) |
   | 10                    | ECT(0)        | ECN-Capable Transport (0) |
   | 11                    | CE            | Congestion Experienced    |
   +-----------------------+---------------+---------------------------+

                 Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1 [RFC3168]).
   A TCP client indicates it supports ECN by setting ECE=CWR=1 in the
   SYN, and an ECN-enabled server confirms ECN support by setting ECE=1
   and CWR=0 in the SYN/ACK. On reception of a CE-marked packet at the
   IP layer, the Data Receiver starts to set the Echo Congestion
   Experienced (ECE) flag continuously in the TCP header of ACKs, which
   ensures the signal is received reliably even if ACKs are lost. The
   TCP sender confirms that it has received at least one ECE signal by
   responding with the congestion window reduced (CWR) flag, which
   allows the TCP receiver to stop repeating the ECN-Echo flag. This
   always leads to a full RTT of ACKs with ECE set. Thus any additional
   CE markings arriving within this RTT cannot be fed back.

   The ECN Nonce [RFC3540] is an optional experimental addition to ECN
   that the TCP sender can use to protect against accidental or
   malicious concealment of marked or dropped packets. The sender can
   send an ECN nonce, which is a continuous pseudo-random pattern of
   ECT(0) and ECT(1) codepoints in the ECN field. The receiver is
   required to feed back a 1-bit nonce sum that counts the occurrence of
   ECT(1) packets using the last bit of byte 13 in the TCP header, which
   is defined as the Nonce Sum (NS) flag.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

     Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2. AccECN Protocol Overview and Rationale

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

   Like the original TCP approach, the Data Receiver of each TCP
   half-connection sends AccECN feedback to the Data Sender on TCP
   acknowledgements, reusing data packets of the other half-connection
   whenever possible.

   The AccECN protocol has had to be designed in two parts:

   o  an essential part that re-uses ECN TCP header bits to feed back
      the number of arriving CE marked packets. This provides more
      accuracy than classic ECN feedback, but limited resilience against
      ACK loss;

   o  a supplementary part using a new AccECN TCP Option that provides
      additional feedback on the number of bytes that arrive marked with
      each of the three ECN codepoints (not just CE marks). This
      provides greater resilience against ACK loss than the essential
      feedback, but it is more likely to suffer from middlebox
      interference.

   The two part design was necessary, given limitations on the space
   available for TCP options and given the possibility that certain
   incorrectly designed middleboxes prevent TCP using any new options.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN.
   This design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN feedback intact and adding
   more accurate feedback separately because:

   o  this efficiently reuses scarce TCP header space, given TCP option
      space is approaching saturation;

   o  a single upgrade path for the TCP protocol is preferable to a fork
      in the design;

   o  otherwise classic and accurate ECN feedback could give conflicting
      feedback on the same segment, which could open up new security
      concerns and make implementations unnecessarily complex;

   o  middleboxes are more likely to faithfully forward the TCP ECN
      flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.

2.1. Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it. The TCP client signals support for AccECN on the
   initial SYN of a connection and the TCP server signals whether it
   supports AccECN on the SYN/ACK. The TCP flags on the SYN that the
   client uses to signal AccECN support have been carefully chosen so
   that a TCP server will interpret them as a request to support the
   most recent variant of ECN feedback that it supports. Then the
   client falls back to the same variant of ECN feedback.

   An AccECN TCP client does not send the new AccECN Option on the SYN
   as SYN option space is limited and successful negotiation using the
   flags in the main header is taken as sufficient evidence that both
   ends also support the AccECN Option. The TCP server sends the AccECN
   Option on the SYN/ACK and the client sends it on the first ACK to
   test whether the network path forwards the option correctly.

2.2. Feedback Mechanism

   A Data Receiver maintains four counters initialised at the start of
   the half-connection. Three count the number of arriving payload
   bytes marked CE, ECT(1) and ECT(0) respectively. The fourth counts
   the number of packets arriving marked with a CE codepoint (including
   control packets without payload if they are CE-marked).

   The Data Sender maintains four equivalent counters for the half
   connection, and the AccECN protocol is designed to ensure they will
   match the values in the Data Receiver's counters, albeit after a
   little delay.

   Each ACK carries the three least significant bits (LSBs) of the
   packet-based CE counter using the ECN bits in the TCP header, now
   renamed the Accurate ECN (ACE) field. The LSBs of each of the three
   byte counters are carried in the AccECN Option.

2.3. Delayed ACKs and Resilience Against ACK Loss

   With both the ACE and the AccECN Option mechanisms, the Data Receiver
   continually repeats the current LSBs of each of its respective
   counters. Then, even if some ACKs are lost, the Data Sender should
   be able to infer how much to increment its own counters, even if the
   protocol field has wrapped.

   The 3-bit ACE field can wrap fairly frequently. Therefore, even if
   it appears to have incremented by one (say), the field might have
   actually cycled completely then incremented by one. The Data
   Receiver is required not to delay sending an ACK to such an extent
   that the ACE field would cycle. However, cycling is still a
   possibility at the Data Sender because a whole sequence of ACKs
   carrying intervening values of the field might all be lost or delayed
   in transit.

   The fields in the AccECN Option are larger, but they will increment
   in larger steps because they count bytes not packets.
   Nonetheless, their size has been chosen such that a whole cycle of
   the field would never occur between ACKs unless there had been an
   infeasibly long sequence of ACK losses. Therefore, as long as the
   AccECN Option is available, it can be treated as a dependable
   feedback channel.

   If the AccECN Option is not available, e.g. it is being stripped by a
   middlebox, the AccECN protocol will only feed back information on CE
   markings (using the ACE field). Although not ideal, this will be
   sufficient, because it is envisaged that neither ECT(0) nor ECT(1)
   will ever indicate more severe congestion than CE, even though future
   uses for ECT(0) or ECT(1) are still unclear. Because the 3-bit ACE
   field is so small, when it is the only field available the Data
   Sender has to interpret it conservatively assuming the worst possible
   wrap.

   Certain specified events trigger the Data Receiver to include an
   AccECN Option on an ACK. The rules are designed to ensure that the
   order in which different markings arrive at the receiver is
   communicated to the sender (as long as there is no ACK loss).
   Implementations are encouraged to send an AccECN Option more
   frequently, but this is left up to the implementer.

2.4. Feedback Metrics

   The CE packet counter in the ACE field and the CE byte counter in the
   AccECN Option both provide feedback on received CE-marks. The CE
   packet counter includes control packets that do not have payload
   data, while the CE byte counter solely includes marked payload bytes.
   If both are present, the byte counter in the option will provide the
   more accurate information needed for modern congestion control and
   policing schemes, such as DCTCP or ConEx. If the option is stripped,
   a simple algorithm to estimate the number of marked bytes from the
   ACE field is given in Appendix A.3.
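
   As an informal, non-normative illustration of the counters described
   in Sections 2.2 and 2.4, the following Python sketch shows one way a
   Data Receiver might update its four counters for each arriving
   packet, and how a Data Sender might compute the smallest increment
   consistent with two successive ACE values. The names
   ReceiverCounters, on_packet, ace and ace_delta are hypothetical. The
   counters start from zero here for simplicity; Section 3.2 gives the
   actual initial values.

```python
from dataclasses import dataclass

# IP-ECN codepoints (binary values from Table 1)
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ReceiverCounters:
    """Hypothetical container for a Data Receiver's AccECN counters."""
    cep: int = 0  # packets marked CE (including control packets)
    ceb: int = 0  # payload bytes in CE-marked packets
    e0b: int = 0  # payload bytes in ECT(0)-marked packets
    e1b: int = 0  # payload bytes in ECT(1)-marked packets

    def on_packet(self, ip_ecn: int, payload_len: int) -> None:
        """Update the counters for one arriving packet."""
        if ip_ecn == CE:
            self.cep += 1             # counts packets, with or without payload
            self.ceb += payload_len   # counts payload bytes only
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any of the four counters.

    def ace(self) -> int:
        """Three least significant bits of the CE packet counter."""
        return self.cep & 0b111


def ace_delta(prev_ace: int, new_ace: int) -> int:
    """Smallest increment consistent with two ACE values (modulo 8).

    The true increment could be this value plus any multiple of 8 if
    enough ACKs were lost, which is the ambiguity Section 2.3 describes.
    """
    return (new_ace - prev_ace) % 8
```

   For example, a CE-marked pure ACK increments only the packet counter,
   whereas a CE-marked segment carrying 1000 bytes increments the packet
   counter by one and the CE byte counter by 1000.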

   Feedback in bytes is recommended in order to protect against the
   receiver using attacks similar to 'ACK-Division' to artificially
   inflate the congestion window, which is why [RFC5681] now recommends
   that TCP counts acknowledged bytes not packets.

2.5. Generic (Dumb) Reflector

   The ACE field provides information about CE markings on both data and
   control packets. According to [RFC3168] the Data Sender is meant to
   set control packets to Not-ECT. However, mechanisms in certain
   private networks (e.g. data centres) set control packets to be ECN
   capable because they are precisely the packets that performance
   depends on most.

   For this reason, AccECN is designed to be a generic reflector of
   whatever ECN markings it sees, whether or not they are compliant with
   a current standard. Then as standards evolve, Data Senders can
   upgrade unilaterally without any need for receivers to upgrade too.
   It is also useful to be able to rely on generic reflection behaviour
   when senders need to test for unexpected interference with markings
   (for instance [I-D.kuehlewind-tcpm-ecn-fallback] and
   [I-D.moncaster-tcpm-rcv-cheat]).

   The initial SYN is the most critical control packet, so AccECN
   provides feedback on whether it is CE marked, even though it is not
   allowed to be ECN-capable according to RFC 3168. However,
   middleboxes have been known to overwrite the ECN IP field as if it is
   still part of the old Type of Service (ToS) field. If a TCP client
   has set the SYN to Not-ECT, but receives CE feedback, it can detect
   such middlebox interference and send Not-ECT for the rest of the
   connection (see [I-D.kuehlewind-tcpm-ecn-fallback] for the detailed
   fall-back behaviour).

   Today, if a TCP server receives CE on a SYN, it cannot know whether
   it is invalid (or valid) because only the TCP client knows whether it
   originally marked the SYN as Not-ECT (or ECT).
   Therefore, the server's only safe course of action is to disable ECN
   for the connection. Instead, the AccECN protocol allows the server
   to feed back the CE marking to the client, which then has all the
   information to decide whether the connection has to fall back from
   supporting ECN (or not).

   Providing feedback of CE marking on the SYN also supports future
   scenarios in which SYNs might be ECN-enabled (without prejudging
   whether they ought to be). For instance, in certain environments
   such as data centres, it might be appropriate to allow ECN-capable
   SYNs. Then, if feedback showed the SYN had been CE marked, the TCP
   client could reduce its initial window (IW). It could also reduce IW
   conservatively if feedback showed the receiver did not support ECN
   (because if there had been a CE marking, the receiver would not have
   understood it). Note that this text merely motivates dumb reflection
   of CE on a SYN; it does not judge whether a SYN ought to be
   ECN-capable.

3. AccECN Protocol Specification

3.1. Negotiation during the TCP handshake

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the TCP client (host A) MUST set the TCP
   flags NS=1, CWR=1 and ECE=1 in the initial SYN segment.

   If a TCP server (B) that is AccECN enabled receives a SYN with the
   above three flags set, it MUST set both its half connections into
   AccECN mode. Then it MUST set the flags CWR=1 and ECE=0 on its
   response in the SYN/ACK segment to confirm that it supports AccECN.
   The TCP server MUST NOT set this combination of flags unless the
   preceding SYN requested support for AccECN as above.

   A TCP server in AccECN mode MUST additionally set the flag NS=1 on
   the SYN/ACK if the SYN was CE-marked (see Section 2.5). If the
   received SYN was Not-ECT, ECT(0) or ECT(1), it MUST clear NS (NS=0)
   on the SYN/ACK.
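
   The server-side rules above can be sketched informally in Python as
   follows. This is a non-normative illustration: the names Flags and
   server_synack are hypothetical, the responses to non-AccECN SYNs
   follow the third block of Table 2 in this section, and treating
   every other flag combination as Not ECN is an assumption of the
   sketch rather than a requirement stated here.

```python
from typing import NamedTuple, Tuple

CE = 0b11  # IP-ECN codepoint for Congestion Experienced (Table 1)


class Flags(NamedTuple):
    """The three ECN-related TCP header flags on a SYN or SYN/ACK."""
    ns: int
    cwr: int
    ece: int


def server_synack(syn: Flags, syn_ip_ecn: int) -> Tuple[Flags, str]:
    """Return (SYN/ACK flags, feedback mode) for an AccECN-enabled server."""
    if syn == Flags(1, 1, 1):
        # AccECN requested: confirm with CWR=1, ECE=0.
        # NS=1 only if the SYN itself arrived CE-marked.
        ns = 1 if syn_ip_ecn == CE else 0
        return Flags(ns, 1, 0), "AccECN"
    if syn == Flags(0, 1, 1):
        # Client requested an earlier ECN variant (Table 2, third block).
        return Flags(0, 0, 1), "classic ECN"
    # Assumption of this sketch: anything else is handled as Not ECN.
    return Flags(0, 0, 0), "Not ECN"
```

   Keeping the decision in a single pure function makes it
   straightforward to check each row of the negotiation table against
   an implementation.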

   Once a TCP client (A) has sent the above SYN to declare that it
   supports AccECN, and once it has received the above SYN/ACK segment
   that confirms that the TCP server supports AccECN, the TCP client
   MUST set both its half connections into AccECN mode.

   If after the normal TCP timeout the TCP client has not received a
   SYN/ACK to acknowledge its SYN, the SYN might just have been lost,
   e.g. due to congestion, or a middlebox might be blocking segments
   with the AccECN flags. To expedite connection setup, the host SHOULD
   fall back to NS=CWR=ECE=0 on the retransmission of the SYN. It would
   make sense to also remove any other experimental fields or options on
   the SYN in case a middlebox might be blocking them, although the
   required behaviour will depend on the specification of the other
   option(s) and any attempt to co-ordinate fall-back between different
   modules of the stack. Implementers MAY use other fall-back
   strategies if they are found to be more effective (e.g. attempting to
   retransmit a second AccECN segment before fall-back, falling back to
   classic ECN feedback rather than non-ECN, and/or caching the result
   of a previous attempt to access the same host while negotiating
   AccECN).

   The fall-back procedure if the TCP server receives no ACK to
   acknowledge a SYN/ACK that tried to negotiate AccECN is specified in
   Section 3.2.4.

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN. Table 2 tabulates all the negotiation
   possibilities for ECN-related capabilities that involve at least one
   AccECN-capable host.
   To compress the width of the table, the headings of the first four
   columns have been severely abbreviated, as follows:

   Ac: More *Ac*curate ECN Feedback

   N: ECN-*N*once [RFC3540]

   E: *E*CN [RFC3168]

   I: Not-ECN (*I*mplicit congestion notification using packet drop).

   +----+---+---+---+------------+--------------+----------------------+
   | Ac | N | E | I |  SYN A->B  | SYN/ACK B->A |  Feedback Mode       |
   +----+---+---+---+------------+--------------+----------------------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE   |                      |
   | AB |   |   |   | 1   1   1  | 0   1   0    | AccECN               |
   | AB |   |   |   | 1   1   1  | 1   1   0    | AccECN (CE on SYN)   |
   |    |   |   |   |            |              |                      |
   | A  | B |   |   | 1   1   1  | 1   0   1    | classic ECN          |
   | A  |   | B |   | 1   1   1  | 0   0   1    | classic ECN          |
   | A  |   |   | B | 1   1   1  | 0   0   0    | Not ECN              |
   |    |   |   |   |            |              |                      |
   | B  | A |   |   | 0   1   1  | 0   0   1    | classic ECN          |
   | B  |   | A |   | 0   1   1  | 0   0   1    | classic ECN          |
   | B  |   |   | A | 0   0   0  | 0   0   0    | Not ECN              |
   |    |   |   |   |            |              |                      |
   | A  |   |   | B | 1   1   1  | 1   1   1    | Not ECN (broken)     |
   | A  |   |   |   | 1   1   1  | 0   1   1    | Not ECN (see Appx B) |
   | A  |   |   |   | 1   1   1  | 1   0   0    | Not ECN (see Appx B) |
   +----+---+---+---+------------+--------------+----------------------+

     Table 2: ECN capability negotiation between Originator (A) and
                              Responder (B)

   Table 2 is divided into blocks each separated by an empty row.

   1. The top block shows the case already described where both
      endpoints support AccECN and how the TCP server (B) indicates
      congestion feedback.

   2. The second block shows the cases where the TCP client (A)
      supports AccECN but the TCP server (B) supports some earlier
      variant of TCP feedback, indicated in its SYN/ACK. Therefore, as
      soon as an AccECN-capable TCP client (A) receives the SYN/ACK
      shown it MUST set both its half connections into the feedback
      mode shown in the rightmost column.

   3. The third block shows the cases where the TCP server (B) supports
      AccECN but the TCP client (A) supports some earlier variant of
      TCP feedback, indicated in its SYN. Therefore, as soon as an
      AccECN-enabled TCP server (B) receives the SYN shown, it MUST set
      both its half connections into the feedback mode shown in the
      rightmost column.

   4. The fourth block displays combinations that are not valid or
      currently unused and therefore both ends MUST fall back to Not
      ECN for both half connections. This applies especially to the
      first case (marked 'broken'), where all bits set in the SYN are
      reflected by the receiver in the SYN/ACK, which happens quite
      often if the TCP connection is proxied. {ToDo: Consider using the
      last two cases for AccECN f/b of ECT(0) and ECT(1) on the SYN
      (Appendix B)}

   The following exceptional cases need some explanation:

   ECN Nonce: An AccECN implementation, whether client or server,
      sender or receiver, does not need to implement the ECN Nonce
      behaviour [RFC3540]. AccECN is compatible with an alternative ECN
      feedback integrity approach that does not use up the ECT(1)
      codepoint and can be implemented solely at the sender (see
      Section 4.3).

   Simultaneous Open: An originating AccECN Host (A), having sent a SYN
      with NS=1, CWR=1 and ECE=1, might receive another SYN from host B.
      Host A MUST then enter the same feedback mode as it would have
      entered had it been a responding host and received the same SYN.
      Then host A MUST send the same SYN/ACK as it would have sent had
      it been a responding host (see the third block above).

3.2. AccECN Feedback

   Each Data Receiver maintains four counters, r.cep, r.ceb, r.e0b and
   r.e1b. The CE packet counter (r.cep) counts the number of packets
   the host receives with the CE code point in the IP ECN field,
   including CE marks on control packets without data.
   r.ceb, r.e0b and r.e1b count the number of TCP payload bytes in
   packets marked respectively with the CE, ECT(0) and ECT(1) codepoint
   in their IP-ECN field. When a host first enters AccECN mode, it
   initialises its counters to r.cep = 6, r.e0b = 1 and
   r.ceb = r.e1b = 0 (see Appendix A.5). Non-zero initial values are
   used to be distinct from cases where the fields are incorrectly
   zeroed (e.g. by middleboxes).

   A host feeds back the CE packet counter using the Accurate ECN (ACE)
   field, as explained in the next section. It feeds back all the
   byte counters using the AccECN TCP Option, as specified in
   Section 3.2.3. Whenever a host feeds back the value of any counter,
   it MUST report the most recent value, no matter whether it is in a
   pure ACK, an ACK with new payload data or a retransmission.

3.2.1. The ACE Field

   After AccECN has been negotiated on the SYN and SYN/ACK, both hosts
   overload the three TCP flags ECE, CWR and NS in the main TCP header
   as one 3-bit field. Then the field is given a new name, ACE, as
   shown in Figure 2.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ACE    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 2: Definition of the ACE field within bytes 13 and 14 of the
          TCP Header (when AccECN has been negotiated and SYN=0).

   The original definition of these three flags in the TCP header,
   including the addition of support for the ECN Nonce, is shown for
   comparison in Figure 1. This specification does not rename these
   three TCP flags; it merely overloads them with another name and
   definition once an AccECN connection has been established.
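   The counter rules and the header layout of Figure 2 can be sketched
   as follows. This is a minimal illustration and not part of the
   specification: the names ReceiverCounters, on_packet and
   ace_from_flags are hypothetical, and the sketch assumes that the
   three flag bits are read in the order they appear in Figure 2
   (NS first, then CWR, then ECE).

```python
from dataclasses import dataclass

# IP-ECN codepoints as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ReceiverCounters:
    """Hypothetical holder for the four Data Receiver counters."""
    cep: int = 6  # r.cep: CE-marked packets (initial value from the text)
    ceb: int = 0  # r.ceb: payload bytes in CE-marked packets
    e0b: int = 1  # r.e0b: payload bytes in ECT(0)-marked packets
    e1b: int = 0  # r.e1b: payload bytes in ECT(1)-marked packets

    def on_packet(self, ip_ecn: int, payload_len: int) -> None:
        """Update the counters for one arriving packet."""
        if ip_ecn == CE:
            # The packet counter also counts control packets without data.
            self.cep += 1
            self.ceb += payload_len
        elif ip_ecn == ECT0:
            self.e0b += payload_len
        elif ip_ecn == ECT1:
            self.e1b += payload_len
        # Not-ECT packets do not change any of the four counters.


def ace_from_flags(flags_word: int) -> int:
    """Extract the 3-bit ACE field from the 16-bit word formed by
    bytes 13 and 14 of the TCP header (bits 7..9 in Figure 2).

    Assumption: bit 0 of Figure 2 is the most significant bit of the
    word, so FIN is the least significant bit and ACE sits 6 bits up.
    """
    return (flags_word >> 6) & 0b111
```

   For instance, a CE-marked pure ACK increments only r.cep, whereas a
   CE-marked data segment increments both r.cep and r.ceb.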
   A host MUST interpret the ECE, CWR and NS flags as the 3-bit ACE
   counter on a segment with SYN=0 that it sends or receives if both of
   its half-connections are set into AccECN mode having successfully
   negotiated AccECN (see Section 3.1). A host MUST NOT interpret the 3
   flags as a 3-bit ACE field on any segment with SYN=1 (whether ACK is
   0 or 1), or if AccECN negotiation is incomplete or has not succeeded.

   Both parts of each of these conditions are equally important. For
   instance, even if AccECN negotiation has been successful, the ACE
   field is not defined on any segments with SYN=1 (e.g. a
   retransmission of an unacknowledged SYN/ACK, or when both ends send
   SYN/ACKs after AccECN support has been successfully negotiated during
   a simultaneous open).

   The ACE field encodes the three least significant bits of the r.cep
   counter, therefore its initial value will be 0b110 (decimal 6). This
   non-zero initialization allows a TCP server to use a stateless
   handshake (see Section 4.1) but still detect from the TCP client's
   first ACK that the client considers it has successfully negotiated
   AccECN. If the SYN/ACK was CE marked, the client MUST increase its
   r.cep counter before it sends its first ACK, therefore the initial
   value of the ACE field will be 0b111 (decimal 7). These values have
   deliberately been chosen such that they are distinct from [RFC5562]
   behaviour, where the TCP client would set ECE on the first ACK as
   feedback for a CE mark on the SYN/ACK.

   If the value of the ACE field on the first segment with SYN=0 in
   either direction is anything other than 0b110 or 0b111, the Data
   Receiver MUST disable ECN for the remainder of the half-connection by
   marking all subsequent packets as Not-ECT.

3.2.2.
Safety against Ambiguity of the ACE Field

   If too many CE-marked segments are acknowledged at once, or if a long
   run of ACKs is lost, the 3-bit counter in the ACE field might have
   cycled between two ACKs arriving at the Data Sender.

   Therefore an AccECN Data Receiver SHOULD immediately send an ACK once
   'n' CE marks have arrived since the previous ACK, where 'n' SHOULD be
   2 and MUST be no greater than 6.

   If the Data Sender has not received AccECN TCP Options to give it
   more dependable information, and it detects that the ACE field could
   have cycled under the prevailing conditions, it SHOULD conservatively
   assume that the counter did cycle. It can detect if the counter
   could have cycled by using the jump in the acknowledgement number
   since the last ACK to calculate or estimate how many segments could
   have been acknowledged. An example algorithm to implement this
   policy is given in Appendix A.2. An implementer MAY develop an
   alternative algorithm as long as it satisfies these requirements.

   If missing acknowledgement numbers arrive later (reordering) and
   prove that the counter did not cycle, the Data Sender MAY attempt to
   neutralise the effect of any action it took based on a conservative
   assumption that it later found to be incorrect.

3.2.3. The AccECN Option

   The AccECN Option is defined as shown below in Figure 3. It consists
   of three 24-bit fields that provide the 24 least significant bits of
   the r.e0b, r.ceb and r.e1b counters, respectively. The initial 'E'
   of each field name stands for 'Echo'.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Kind = TBD1  |  Length = 11  |          EE0B field           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | EE0B (cont'd) |                  ECEB field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  EE1B field                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 3: The AccECN Option

   The Data Receiver MUST set the Kind field to TBD1, which is
   registered in Section 6 as a new TCP option Kind called AccECN. An
   experimental TCP option with Kind=254 MAY be used for initial
   experiments, with magic number 0xACCE.

   Appendix A.1 gives an example algorithm for the Data Receiver to
   encode its byte counters into the AccECN Option, and for the Data
   Sender to decode the AccECN Option fields into its byte counters.

   Note that there is no field to feed back Not-ECT bytes. Nonetheless,
   an algorithm for the Data Sender to calculate the number of payload
   bytes received as Not-ECT is given in Appendix A.5.

   Whenever a Data Receiver sends an AccECN Option, the rules in
   Section 3.2.5 expect it to always send a full-length option. To cope
   with option space limitations, it can omit unchanged fields from the
   tail of the option, as long as it preserves the order of the
   remaining fields and includes any field that has changed. The length
   field MUST indicate which fields are present as follows:

   Length=11: EE0B, ECEB, EE1B

   Length=8: EE0B, ECEB

   Length=5: EE0B

   Length=2: (empty)

   The empty option of Length=2 is provided to allow for a case where an
   AccECN Option has to be sent (e.g. on the SYN/ACK to test the path),
   but there is very limited space for the option. For initial
   experiments, the Length field MUST be 2 greater to accommodate the
   16-bit magic number.
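   The option layout and the length rules above can be sketched as
   follows. This is only an illustration under stated assumptions: the
   function names encode_option and decode_option are hypothetical, the
   sketch defaults to the experimental form (Kind=254 with magic number
   0xACCE) because TBD1 has not been assigned, and it treats any other
   Kind value passed by the caller as the assigned Kind without a magic
   number.

```python
DIVOPT = 1 << 24   # each field carries the 24 least significant bits
EXP_KIND = 254     # experimental TCP option Kind
MAGIC = 0xACCE     # 16-bit magic number for the experimental form


def encode_option(r_e0b, r_ceb, r_e1b, n_fields=3, kind=EXP_KIND):
    """Build an AccECN Option carrying the first n_fields of
    EE0B, ECEB, EE1B (fields may only be dropped from the tail)."""
    if not 0 <= n_fields <= 3:
        raise ValueError("n_fields must be between 0 and 3")
    fields = (r_e0b, r_ceb, r_e1b)[:n_fields]
    body = b"".join((v % DIVOPT).to_bytes(3, "big") for v in fields)
    if kind == EXP_KIND:
        # The experimental form is 2 bytes longer for the magic number.
        body = MAGIC.to_bytes(2, "big") + body
    return bytes([kind, 2 + len(body)]) + body


def decode_option(opt):
    """Return the list of field values present, or None if the option
    has a length other than those listed in the text."""
    if len(opt) < 2 or opt[1] != len(opt):
        return None
    body = opt[2:]
    if opt[0] == EXP_KIND:
        if body[:2] != MAGIC.to_bytes(2, "big"):
            return None
        body = body[2:]
    if len(body) not in (0, 3, 6, 9):
        return None
    return [int.from_bytes(body[i:i + 3], "big")
            for i in range(0, len(body), 3)]
```

   With the experimental form, the four permitted lengths become 13, 10,
   7 and 4 bytes instead of 11, 8, 5 and 2.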
   All implementations of a Data Sender MUST be able to read in AccECN
   Options of any of the above lengths. They MUST ignore an AccECN
   Option of any other length.

3.2.4. Path Traversal of the AccECN Option

   An AccECN host MUST NOT include the AccECN TCP Option on the SYN.
   Nonetheless, if the AccECN negotiation using the ECN flags in the
   main TCP header (Section 3.1) is successful, it implicitly declares
   that the endpoints also support the AccECN TCP Option.

   If the TCP client indicated AccECN support, a TCP server that
   confirms its support for AccECN (as described in Section 3.1) SHOULD
   also include an AccECN TCP Option in the SYN/ACK. A TCP client that
   has successfully negotiated AccECN SHOULD include an AccECN Option in
   the first ACK at the end of the 3WHS. However, this first ACK is not
   delivered reliably, so the TCP client SHOULD also include an AccECN
   Option on the first data segment it sends (if it ever sends one). A
   host need not include an AccECN Option in any of these three cases if
   it has cached knowledge that the packet would be likely to be blocked
   on the path to the other host if it included an AccECN Option.

   If the TCP client has successfully negotiated AccECN but does not
   receive an AccECN Option on the SYN/ACK, it switches into a mode that
   assumes that the AccECN Option is not available for this half
   connection. Similarly, if the TCP server has successfully negotiated
   AccECN but does not receive an AccECN Option on the first ACK or on
   the first data segment, it switches into a mode that assumes that the
   AccECN Option is not available for this half connection.

   While a host is in the mode that assumes the AccECN Option is not
   available, it MUST adopt the conservative interpretation of the ACE
   field discussed in Section 3.2.2.
   However, it cannot make any assumption about support of the AccECN
   Option on the other half connection, so it MUST continue to send the
   AccECN Option itself.

   If after the normal TCP timeout the TCP server has not received an
   ACK to acknowledge its SYN/ACK, the SYN/ACK might just have been
   lost, e.g. due to congestion, or a middlebox might be blocking the
   AccECN Option. To expedite connection setup, the host SHOULD fall
   back to NS=CWR=ECE=0 and no AccECN Option on the retransmission of
   the SYN/ACK. Implementers MAY use other fall-back strategies if they
   are found to be more effective (e.g. retransmitting a SYN/ACK with
   AccECN TCP flags but not the AccECN Option; attempting to retransmit
   a second AccECN segment before fall-back (most appropriate during
   high levels of congestion); or falling back to classic ECN feedback
   rather than non-ECN).

   Similarly, if the TCP client detects that the first data segment it
   sent was lost, it SHOULD fall back to no AccECN Option on the
   retransmission. Again, implementers MAY use other fall-back
   strategies such as attempting to retransmit a second segment with the
   AccECN Option before fall-back, and/or caching the result of previous
   attempts.

   Either host MAY include the AccECN Option in a subsequent segment to
   retest whether the AccECN Option can traverse the path.

   Currently the Data Sender is not required to test whether the
   arriving byte counters in the AccECN Option have been correctly
   initialised. This allows different initial values to be used as an
   additional signalling channel in future. If any inappropriate
   zeroing of these fields is discovered during testing, this approach
   will need to be reviewed.

3.2.5.
Usage of the AccECN TCP Option

   The following rules determine when a Data Receiver in AccECN mode
   sends the AccECN TCP Option, and which fields to include:

   Change-Triggered ACKs: If an arriving packet increments a different
      byte counter to that incremented by the previous packet, the Data
      Receiver SHOULD immediately send an ACK with an AccECN Option,
      without waiting for the next delayed ACK. Certain offload
      hardware might not be able to support change-triggered ACKs, but
      otherwise it is important to keep exceptions to this rule to a
      minimum so that Data Senders can generally rely on this behaviour;

   Continual Repetition: Otherwise, if arriving packets continue to
      increment the same byte counter, the Data Receiver can include an
      AccECN Option on most or all (delayed) ACKs, but it does not have
      to. If option space is limited on a particular ACK, the Data
      Receiver MUST give precedence to SACK information about loss. It
      SHOULD include an AccECN Option if the r.ceb counter has
      incremented and it MAY include an AccECN Option if r.e0b or
      r.e1b has incremented;

   Full-Length Options Preferred: It SHOULD always use full-length
      AccECN Options. It MAY use shorter AccECN Options if space is
      limited, but it MUST include the counter(s) that have incremented
      since the previous AccECN Option and it MUST only truncate fields
      from the right-hand tail of the option to preserve the order of
      the remaining fields (see Section 3.2.3);

   Beaconing Full-Length Options: Nonetheless, it MUST include a full-
      length AccECN TCP Option on at least three ACKs per RTT, or on all
      ACKs if there are fewer than three per RTT (see Appendix A.4 for
      an example algorithm that satisfies this requirement).
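   The Change-Triggered ACK rule and the truncation rule above can be
   sketched as follows. This is a simplified illustration, not a
   complete receiver: the function names ack_decisions and
   shortest_option_length are hypothetical, only ECN-capable data
   packets are modelled (control packets and Not-ECT packets are left
   out), and the sketch assumes that the first data packet counts as a
   change because no byte counter has been incremented before it.

```python
def ack_decisions(marks, delack_factor=2):
    """For each arriving data packet, decide whether an ACK is sent.

    marks is a sequence of IP-ECN codepoints written as strings, e.g.
    "01" for ECT(1), "10" for ECT(0) and "11" for CE. Returns a list
    of booleans, one per packet.
    """
    decisions = []
    last_mark = None  # codepoint of the previous data packet, if any
    unacked = 0       # data packets received since the last ACK
    for mark in marks:
        unacked += 1
        changed = mark != last_mark       # a different byte counter moved
        send = changed or unacked >= delack_factor
        if send:
            unacked = 0
        decisions.append(send)
        last_mark = mark
    return decisions


FIELD_ORDER = ("e0b", "ceb", "e1b")  # order of EE0B, ECEB, EE1B


def shortest_option_length(changed):
    """Smallest permitted Length that still carries every counter in
    the set `changed`, truncating only from the tail of the option."""
    n_fields = 0
    for position, name in enumerate(FIELD_ORDER, start=1):
        if name in changed:
            n_fields = position
    return 2 + 3 * n_fields
```

   Calling ack_decisions(["01", "01", "01", "10", "10", "01", "01",
   "11", "01"]) sends an ACK on the first, third, fourth, sixth, eighth
   and ninth packets. Note that a change in r.e1b alone forces a
   full-length option, because EE1B is the last field.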
   The following example series of arriving marks illustrates when a
   Data Receiver will emit an ACK if it is using a delayed ACK factor of
   2 segments and change-triggered ACKs: 01 -> ACK, 01, 01 -> ACK, 10 ->
   ACK, 10, 01 -> ACK, 01, 11 -> ACK, 01 -> ACK.

   For the avoidance of doubt, the change-triggered ACK mechanism
   ignores the arrival of a control packet with no payload, because it
   does not alter any byte counters. The change-triggered ACK approach
   will lead to some additional ACKs but it feeds back the timing and
   the order in which ECN marks are received with minimal additional
   complexity.

   Implementation note: sending an AccECN Option each time a different
   counter changes and including a full-length AccECN Option on every
   delayed ACK will satisfy the requirements described above and might
   be the easiest implementation, as long as sufficient space is
   available in each ACK (in total and in the option space).

   Appendix A.3 gives an example algorithm to estimate the number of
   marked bytes from the ACE field alone, if the AccECN Option is not
   available.

   If a host has determined that segments with the AccECN Option always
   seem to be discarded somewhere along the path, it is no longer
   obliged to follow the above rules.

3.3. AccECN Compliance by TCP Proxies, Offload Engines and other
     Middleboxes

   A large class of middleboxes splits TCP connections. Such a middlebox
   would be compliant with the AccECN protocol if the TCP implementation
   on each side complied with the present AccECN specification and each
   side negotiated AccECN independently of the other side.

   Another large class of middleboxes intervenes to some degree at the
   transport layer, but attempts to be transparent (invisible) to the
   end-to-end connection.
   A subset of this class of middleboxes attempts to 'normalise' the
   TCP wire protocol by checking that all values in header fields comply
   with a rather narrow interpretation of the TCP specifications. To
   comply with the present AccECN specification, such a middlebox MUST
   NOT change the ACE field or the AccECN Option and it MUST attempt to
   preserve the timing of each ACK (for example, if it coalesced ACKs it
   would not be AccECN-compliant). A middlebox claiming to be
   transparent at the transport layer MUST forward the AccECN TCP Option
   unaltered, whether or not the length value matches one of those
   specified in Section 3.2.3, and whether or not the initial values of
   the byte-counter fields are correct. This is because blocking
   apparently invalid values does not improve security (because AccECN
   hosts are required to ignore invalid values anyway), while it
   prevents the standardised set of values being extended in future
   (because outdated normalisers would block updated hosts from using
   the extended AccECN standard).

   Hardware to offload certain TCP processing represents another large
   class of middleboxes, even though it is often a function of a host's
   network interface and rarely in its own 'box'. Leeway has been
   allowed in the present AccECN specification in the expectation that
   offload hardware could comply and still serve its function.
   Nonetheless, such hardware MUST attempt to preserve the timing of
   each ACK (for example, if it coalesced ACKs it would not be
   AccECN-compliant).

4. Interaction with Other TCP Variants

   This section is informative, not normative.

4.1. Compatibility with SYN Cookies

   A TCP server can use SYN Cookies (see Appendix A of [RFC4987]) to
   protect itself from SYN flooding attacks. It places minimal commonly
   used connection state in the SYN/ACK, and deliberately does not hold
   any state while waiting for the subsequent ACK (e.g.
   it closes the thread). Therefore it cannot record the fact that it
   entered AccECN mode for both half-connections. Indeed, it cannot
   even remember whether it negotiated the use of classic ECN [RFC3168].

   Nonetheless, such a server can determine that it negotiated AccECN as
   follows. If a TCP server using SYN Cookies supports AccECN and if
   the first ACK it receives contains an ACE field with the value 0b110
   or 0b111, it can assume that:

   o  the TCP client must have requested AccECN support on the SYN

   o  it (the server) must have confirmed that it supported AccECN

   Therefore the server can switch itself into AccECN mode, and continue
   as if it had never forgotten that it switched itself into AccECN mode
   earlier.

4.2. Compatibility with Other TCP Options and Experiments

   AccECN is compatible (at least on paper) with the most commonly used
   TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is
   also compatible with the recent promising experimental TCP options
   TCP Fast Open (TFO [RFC7413]) and Multipath TCP (MPTCP [RFC6824]).
   AccECN is friendly to all these protocols, because space for TCP
   options is particularly scarce on the SYN, where AccECN consumes zero
   additional header space.

   When option space is under pressure from other options, Section 3.2.5
   provides guidance on how important it is to send an AccECN Option and
   whether it needs to be a full-length option.

4.3. Compatibility with Feedback Integrity Mechanisms

   The ECN Nonce [RFC3540] is an experimental IETF specification
   intended to allow a sender to test whether ECN CE markings (or
   losses) introduced in one network are being suppressed by the
   receiver or anywhere else in the feedback loop, such as another
   network or a middlebox. The ECN nonce has not been deployed as far
   as can be ascertained.
   The nonce would now be nearly impossible to deploy retrospectively,
   because to catch a misbehaving receiver it relies on the receiver
   volunteering feedback information to incriminate itself. A receiver
   that has been modified to misbehave can simply claim that it does not
   support nonce feedback, which will seem unremarkable given so many
   other hosts do not support it either.

   With minor changes AccECN could be optimised for the possibility that
   the ECT(1) codepoint might be used as a nonce. However, given the
   nonce is now probably undeployable, the AccECN design has been
   generalised so that it ought to be able to support other possible
   uses of the ECT(1) codepoint, such as a lower severity or a more
   instant congestion signal than CE.

   Three alternative mechanisms are available to assure the integrity of
   ECN and/or loss signals. AccECN is compatible with any of these
   approaches:

   o  The Data Sender can test the integrity of the receiver's ECN (or
      loss) feedback by occasionally setting the IP-ECN field to a value
      normally only set by the network (and/or deliberately leaving a
      sequence number gap). Then it can test whether the Data
      Receiver's feedback faithfully reports what it expects
      [I-D.moncaster-tcpm-rcv-cheat]. Unlike the ECN Nonce, this
      approach does not waste the ECT(1) codepoint in the IP header, it
      does not require standardisation and it does not rely on
      misbehaving receivers volunteering to reveal feedback information
      that allows them to be detected. However, setting the CE mark by
      the sender might conceal actual congestion feedback from the
      network and should therefore only be done sparsely.

   o  Networks generate congestion signals when they are becoming
      congested, so they are more likely than Data Senders to be
      concerned about the integrity of the receiver's feedback of these
      signals.
      A network can enforce a congestion response to its ECN
      markings (or packet losses) using congestion exposure (ConEx)
      audit [I-D.ietf-conex-abstract-mech]. Whether the receiver or a
      downstream network is suppressing congestion feedback or the
      sender is unresponsive to the feedback, or both, ConEx audit can
      neutralise any advantage that any of these three parties would
      otherwise gain.

      ConEx is a change to the Data Sender that is most useful when
      combined with AccECN. Without AccECN, the ConEx behaviour of a
      Data Sender would have to be more conservative than would be
      necessary if it had the accurate feedback of AccECN.

   o  The TCP authentication option (TCP-AO [RFC5925]) can be used to
      detect any tampering with AccECN feedback between the Data
      Receiver and the Data Sender (whether malicious or accidental).
      The AccECN fields are immutable end-to-end, so they are amenable
      to TCP-AO protection, which covers TCP options by default.
      However, TCP-AO is often too brittle to use on many end-to-end
      paths, where middleboxes can make verification fail in their
      attempts to improve performance or security, e.g. by
      resegmentation or shifting the sequence space.

5. Protocol Properties

   This section is informative, not normative. It describes how well
   the protocol satisfies the agreed requirements for a more accurate
   ECN feedback protocol [RFC7560].

   Accuracy: From each ACK, the Data Sender can infer the number of new
      CE marked segments since the previous ACK. This provides better
      accuracy on CE feedback than classic ECN. In addition, if the
      AccECN Option is present (not blocked by the network path), the
      numbers of bytes marked with CE, ECT(1) and ECT(0) are provided.

   Overhead: The AccECN scheme is divided into two parts. The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header.
      The supplementary part adds an additional TCP option
      consuming up to 11 bytes. However, no TCP option is consumed in
      the SYN.

   Ordering: The order in which marks arrive at the Data Receiver is
      preserved in AccECN feedback, because the Data Receiver is
      expected to send an ACK immediately whenever a different mark
      arrives.

   Timeliness: While the same ECN markings are arriving continually at
      the Data Receiver, it can defer ACKs as TCP does normally, but it
      will immediately send an ACK as soon as a different ECN marking
      arrives.

   Timeliness vs Overhead: Change-Triggered ACKs are intended to enable
      latency-sensitive uses of ECN feedback by capturing the timing of
      transitions but not wasting resources while the state of the
      signalling system is stable. The receiver can control how
      frequently it sends the AccECN TCP Option and therefore it can
      control the overhead induced by AccECN.

   Resilience: All information is provided based on counters.
      Therefore if ACKs are lost, the counters on the first ACK
      following the losses allow the Data Sender to immediately recover
      the number of the ECN markings that it missed.

   Resilience against Bias: Because feedback is based on repetition of
      counters, random losses do not remove any information, they only
      delay it. Therefore, even though some ACKs are change-triggered,
      random losses will not alter the proportions of the different ECN
      markings in the feedback.

   Resilience vs Overhead: If space is limited in some segments (e.g.
      because more options are needed on some segments, such as the SACK
      option after loss), the Data Receiver can send AccECN Options less
      frequently or truncate fields that have not changed, usually down
      to as little as 5 bytes.
      However, it has to send a full-sized
      AccECN Option at least three times per RTT, which the Data Sender
      can rely on as a regular beacon or checkpoint.

   Resilience vs Timeliness and Ordering: Ordering information and the
      timing of transitions cannot be communicated in three cases: i)
      during ACK loss; ii) if something on the path strips the AccECN
      Option; or iii) if the Data Receiver is unable to support
      Change-Triggered ACKs.

   Complexity: An AccECN implementation solely involves simple counter
      increments, some modulo arithmetic to communicate the least
      significant bits and allow for wrap, and some heuristics for
      safety against fields cycling due to prolonged periods of ACK
      loss. Each host needs to maintain eight additional counters. The
      hosts have to apply some additional tests to detect tampering by
      middleboxes, but in general the protocol is simple to understand,
      simple to implement and requires few cycles per packet to execute.

   Integrity: AccECN is compatible with at least three approaches that
      can assure the integrity of ECN feedback. If the AccECN Option is
      stripped the resolution of the feedback is degraded, but the
      integrity of this degraded feedback can still be assured.

   Backward Compatibility: If only one endpoint supports the AccECN
      scheme, it will fall back to the most advanced ECN feedback scheme
      supported by the other end.

   Backward Compatibility: If the AccECN Option is stripped by a
      middlebox, AccECN still provides basic congestion feedback in the
      ACE field. Further, AccECN can be used to detect mangling of the
      IP ECN field; mangling of the TCP ECN flags; blocking of
      ECT-marked segments; and blocking of segments carrying the AccECN
      Option. It can detect these conditions during TCP's 3WHS so that
      it can fall back to operation without ECN and/or operation without
      the AccECN Option.
   Forward Compatibility: The behaviour of endpoints and middleboxes is
      carefully defined for all reserved or currently unused codepoints
      in the scheme, to ensure that any blocking of anomalous values is
      always at least under reversible policy control.

6. IANA Considerations

   This document defines a new TCP option for AccECN, assigned a value
   of TBD1 (decimal) from the TCP option space. This value is defined
   as:

   +------+--------+-----------------------+-----------+
   | Kind | Length | Meaning               | Reference |
   +------+--------+-----------------------+-----------+
   | TBD1 | N      | Accurate ECN (AccECN) | RFC XXXX  |
   +------+--------+-----------------------+-----------+

   [TO BE REMOVED: This registration should take place at the following
   location:
   http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1]

   Early implementation before the IANA allocation MUST follow [RFC6994]
   and use experimental option 254 and magic number 0xACCE (16 bits)
   {ToDo register this with IANA}, then migrate to the new option after
   the allocation.

7. Security Considerations

   If ever the supplementary part of AccECN based on the new AccECN TCP
   Option is unusable (due for example to middlebox interference) the
   essential part of AccECN's congestion feedback offers only limited
   resilience to long runs of ACK loss (see Section 3.2.2). These
   problems are unlikely to be due to malicious intervention (because if
   an attacker could strip a TCP option or discard a long run of ACKs it
   could wreak other arbitrary havoc). However, it would be of concern
   if AccECN's resilience could be indirectly compromised during a
   flooding attack.
   AccECN is still considered safe though, because if
   the option is not present, the AccECN Data Sender is then required
   to switch to more conservative assumptions about wrap of congestion
   indication counters (see Section 3.2.2 and Appendix A.2).

   Section 4.1 describes how a TCP server can negotiate AccECN and use
   the SYN cookie method for mitigating SYN flooding attacks.

   There is concern that ECN markings could be altered or suppressed,
   particularly because a misbehaving Data Receiver could increase its
   own throughput at the expense of others. Given the experimental ECN
   nonce is now probably undeployable, AccECN has been generalised for
   other possible uses of the ECT(1) codepoint to avoid obsolescence of
   the codepoint even if the nonce mechanism is obsoleted. AccECN is
   compatible with the three other schemes known to assure the integrity
   of ECN feedback (see Section 4.3 for details). If the AccECN Option
   is stripped by an incorrectly implemented middlebox, the resolution
   of the feedback will be degraded, but the integrity of this degraded
   information can still be assured.

   The AccECN protocol is not believed to introduce any new privacy
   concerns, because it merely counts and feeds back signals at the
   transport layer that had already been visible at the IP layer.

8. Acknowledgements

   We want to thank Koen De Schepper, Praveen Balasubramanian and
   Michael Welzl for their input and discussion. The idea of using the
   three ECN-related TCP flags as one field for more accurate TCP-ECN
   feedback was first introduced in the re-ECN protocol that was the
   ancestor of ConEx.

   Bob Briscoe was part-funded by the European Community under its
   Seventh Framework Programme through the Reducing Internet Transport
   Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
   (ICT-317756).
   The views expressed here are solely those of the
   authors.

   This work is partly supported by the European Commission under
   Horizon 2020 grant agreement no. 688421 Measurement and Architecture
   for a Middleboxed Internet (MAMI), and by the Swiss State Secretariat
   for Education, Research, and Innovation under contract no. 15.0268.
   This support does not imply endorsement.

9. Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF TCP Maintenance and Minor Extensions working
   group mailing list, and/or to the authors.

10. References

10.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, DOI 10.17487/RFC6994, August 2013.

10.2. Informative References

   [I-D.bensley-tcpm-dctcp]
              Bensley, S., Eggert, L., Thaler, D., Balasubramanian, P.,
              and G. Judd, "Microsoft's Datacenter TCP (DCTCP): TCP
              Congestion Control for Datacenters",
              draft-bensley-tcpm-dctcp-05 (work in progress), July 2015.

   [I-D.ietf-conex-abstract-mech]
              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts, Abstract Mechanism and Requirements",
              draft-ietf-conex-abstract-mech-13 (work in progress),
              October 2014.

   [I-D.kuehlewind-tcpm-ecn-fallback]
              Kuehlewind, M. and B.
              Trammell, "A Mechanism for ECN Path
              Probing and Fallback",
              draft-kuehlewind-tcpm-ecn-fallback-01 (work in progress),
              September 2013.

   [I-D.moncaster-tcpm-rcv-cheat]
              Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to
              Allow Senders to Identify Receiver Non-Compliance",
              draft-moncaster-tcpm-rcv-cheat-03 (work in progress),
              July 2014.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, DOI 10.17487/RFC3540, June 2003.

   [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
              Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010.

   [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

   [RFC7560]  Kuehlewind, M., Ed., Scheffenegger, R., and B. Briscoe,
              "Problem Statement and Requirements for Increased Accuracy
              in Explicit Congestion Notification (ECN) Feedback",
              RFC 7560, DOI 10.17487/RFC7560, August 2015.

Appendix A. Example Algorithms

   This appendix is informative, not normative. It gives example
   algorithms that would satisfy the normative requirements of the
   AccECN protocol. However, implementers are free to choose other ways
   to implement the requirements.

A.1.
Example Algorithm to Encode/Decode the AccECN Option 1308 The example algorithms below show how a Data Receiver in AccECN mode 1309 could encode its CE byte counter r.ceb into the ECEB field within the 1310 AccECN TCP Option, and how a Data Sender in AccECN mode could decode 1311 the ECEB field into its byte counter s.ceb. The other counters for 1312 bytes marked ECT(0) and ECT(1) in the AccECN Option would be 1313 similarly encoded and decoded. 1315 It is assumed that each local byte counter is an unsigned integer 1316 greater than 24b (probably 32b), and that the following constant has 1317 been assigned: 1319 DIVOPT = 2^24 1321 Every time a CE marked data segment arrives, the Data Receiver 1322 increments its local value of r.ceb by the size of the TCP Data. 1323 Whenever it sends an ACK with the AccECN Option, the value it writes 1324 into the ECEB field is 1326 ECEB = r.ceb % DIVOPT 1328 where '%' is the modulo operator. 1330 On the arrival of an AccECN Option, the Data Sender uses the TCP 1331 acknowledgement number and any SACK options to calculate newlyAckedB, 1332 the amount of new data that the ACK acknowledges in bytes. If 1333 newlyAckedB is negative it means that a more up to date ACK has 1334 already been processed, so this ACK has been superseded and the Data 1335 Sender has to ignore the AccECN Option. Then the Data Sender 1336 calculates the minimum difference d.ceb between the ECEB field and 1337 its local s.ceb counter, using modulo arithmetic as follows: 1339 if (newlyAckedB >= 0) { 1340 d.ceb = (ECEB + DIVOPT - (s.ceb % DIVOPT)) % DIVOPT 1341 s.ceb += d.ceb 1342 } 1344 For example, if s.ceb is 33,554,433 and ECEB is 1461 (both decimal), 1345 then 1346 s.ceb % DIVOPT = 1 1347 d.ceb = (1461 + 2^24 - 1) % 2^24 1348 = 1460 1349 s.ceb = 33,554,433 + 1460 1350 = 33,555,893 1352 A.2. 
Example Algorithm for Safety Against Long Sequences of ACK Loss 1354 The example algorithms below show how a Data Receiver in AccECN mode 1355 could encode its CE packet counter r.cep into the ACE field, and how 1356 the Data Sender in AccECN mode could decode the ACE field into its 1357 s.cep counter. The Data Sender's algorithm includes code to 1358 heuristically detect a long enough unbroken string of ACK losses that 1359 could have concealed a cycle of the congestion counter in the ACE 1360 field of the next ACK to arrive. 1362 Two variants of the algorithm are given: i) a more conservative 1363 variant for a Data Sender to use if it detects that the AccECN Option 1364 is not available (see Section 3.2.2 and Section 3.2.4); and ii) a 1365 less conservative variant that is feasible when complementary 1366 information is available from the AccECN Option. 1368 A.2.1. Safety Algorithm without the AccECN Option 1370 It is assumed that each local packet counter is a sufficiently sized 1371 unsigned integer (probably 32b) and that the following constant has 1372 been assigned: 1374 DIVACE = 2^3 1376 Every time a CE marked packet arrives, the Data Receiver increments 1377 its local value of r.cep by 1. It repeats the same value of ACE in 1378 every subsequent ACK until the next CE marking arrives, where 1380 ACE = r.cep % DIVACE. 1382 If the Data Sender received an earlier value of the counter that had 1383 been delayed due to ACK reordering, it might incorrectly calculate 1384 that the ACE field had wrapped. Therefore, on the arrival of every 1385 ACK, the Data Sender uses the TCP acknowledgement number and any SACK 1386 options to calculate newlyAckedB, the amount of new data that the ACK 1387 acknowledges. If newlyAckedB is negative it means that a more up to 1388 date ACK has already been processed, so this ACK has been superseded 1389 and the Data Sender has to ignore the AccECN Option. 
   If newlyAckedB is zero, to break the tie the Data Sender could use
   timestamps (if present) to work out newlyAckedT, the amount of new
   time that the ACK acknowledges. Then the Data Sender calculates the
   minimum difference d.cep between the ACE field and its local s.cep
   counter, using modulo arithmetic as follows:

      if ((newlyAckedB > 0) || (newlyAckedB == 0 && newlyAckedT > 0))
          d.cep = (ACE + DIVACE - (s.cep % DIVACE)) % DIVACE

   Section 3.2.2 requires the Data Sender to assume that the ACE field
   did cycle if it could have cycled under prevailing conditions. The
   3-bit ACE field in an arriving ACK could have cycled and become
   ambiguous to the Data Sender if a row of ACKs goes missing that
   covers a stream of data long enough to contain 8 or more CE marks.
   We use the word 'missing' rather than 'lost', because some or all the
   missing ACKs might arrive eventually, but out of order. Even if some
   of the lost ACKs are piggy-backed on data (i.e. not pure ACKs),
   retransmissions will not repair the lost AccECN information, because
   AccECN requires retransmissions to carry the latest AccECN counters,
   not the original ones.

   The phrase 'under prevailing conditions' allows the Data Sender to
   take account of the prevailing size of data segments and the
   prevailing CE marking rate just before the sequence of ACK losses.
   However, we shall start with the simplest algorithm, which assumes
   that segments are all full-sized and, ultra-conservatively, that ECN
   marking was 100% on the forward path when ACKs on the reverse path
   started to all be dropped. Specifically, if newlyAckedB is the
   amount of data that an ACK acknowledges since the previous ACK, then
   the Data Sender could assume that this acknowledges newlyAckedPkt
   full-sized segments, where newlyAckedPkt = newlyAckedB/MSS.
   Then it could assume that the ACE field incremented by

      dSafer.cep = newlyAckedPkt - ((newlyAckedPkt - d.cep) % DIVACE)

   For example, imagine an ACK acknowledges newlyAckedPkt=9 more
   full-size segments than any previous ACK, and that ACE increments by
   a minimum of 2 CE marks (d.cep=2). The above formula works out that
   it would still be safe to assume 2 CE marks (because
   9 - ((9-2) % 8) = 2). However, if ACE increases by a minimum of 2 but
   acknowledges 10 full-sized segments, then it would be necessary to
   assume that there could have been 10 CE marks (because
   10 - ((10-2) % 8) = 10).

   Implementers could build in more heuristics to estimate prevailing
   average segment size and prevailing ECN marking. For instance,
   newlyAckedPkt in the above formula could be replaced with
   newlyAckedPktHeur = newlyAckedPkt*p*MSS/s, where s is the prevailing
   segment size and p is the prevailing ECN marking probability.
   However, ultimately, if TCP's ECN feedback becomes inaccurate it
   still has loss detection to fall back on. Therefore, it would seem
   safe to implement a simple algorithm, rather than a perfect one.

   The simple algorithm for dSafer.cep above requires no monitoring of
   prevailing conditions and it would still be safe if, for example,
   segments were on average at least 5% of full-sized as long as ECN
   marking was 5% or less. Assuming it was used, the Data Sender would
   increment its packet counter as follows:

      s.cep += dSafer.cep

   If missing acknowledgement numbers arrive later (due to reordering),
   Section 3.2.2 says "the Data Sender MAY attempt to neutralise the
   effect of any action it took based on a conservative assumption that
   it later found to be incorrect". To do this, the Data Sender would
   have to store the values of all the relevant variables whenever it
   made assumptions, so that it could re-evaluate them later.
   Given this could become complex and it is not required, we do not
   attempt to provide an example of how to do this.

A.2.2. Safety Algorithm with the AccECN Option

   When the AccECN Option is available on the ACKs before and after the
   possible sequence of ACK losses, if the Data Sender only needs
   CE-marked bytes, it will have sufficient information in the AccECN
   Option without needing to process the ACE field. However, if for
   some reason it needs CE-marked packets, if dSafer.cep is different
   from d.cep, it can calculate the average marked segment size that
   each implies to determine whether d.cep is likely to be a safe enough
   estimate. Specifically, it could use the following algorithm, where
   d.ceb is the amount of newly CE-marked bytes (see Appendix A.1):

      SAFETY_FACTOR = 2
      if (dSafer.cep > d.cep) {
          % s = d.ceb/d.cep is the average segment size implied by d.cep.
          % Testing (d.ceb <= MSS * d.cep) is the same as (s <= MSS),
          % but avoids dividing by zero when d.cep = 0.
          if (d.ceb <= MSS * d.cep) {
              sSafer = d.ceb/dSafer.cep
              if (sSafer < MSS/SAFETY_FACTOR)
                  dSafer.cep = d.cep   % d.cep is a safe enough estimate
          } % else
              % No need for else; dSafer.cep is already correct,
              % because d.cep must have been too small
      }

   The chart below shows when the above algorithm will consider d.cep
   can replace dSafer.cep as a safe enough estimate of the number of
   CE-marked packets:

             ^
       sSafer|
             |
          MSS+
             |
             |         dSafer.cep
             |                is
        MSS/2+--------------+ safest
             |              |
             | d.cep is safe|
             |    enough    |
             +-------------------->
                           MSS   s

   The following examples give the reasoning behind the algorithm,
   assuming MSS = 1,460 bytes:

   o  if d.cep=0, dSafer.cep=8 and d.ceb=1,460, then s=infinity and
      sSafer=182.5.
      Therefore even though the average size of 8 data segments is
      unlikely to have been as small as MSS/8, d.cep cannot have been
      correct, because it would imply an average segment size greater
      than the MSS.

   o  if d.cep=2, dSafer.cep=10 and d.ceb=1,460, then s=730 and
      sSafer=146.
      Therefore d.cep is safe enough, because the average size of 10
      data segments is unlikely to have been as small as MSS/10.

   o  if d.cep=7, dSafer.cep=15 and d.ceb=10,200, then s=1,457 and
      sSafer=680.
      Therefore d.cep is safe enough, because the average data segment
      size is more likely to have been just less than one MSS, rather
      than below MSS/2.

   If pure ACKs were allowed to be ECN-capable, missing ACKs would be
   far less likely. However, because [RFC3168] currently precludes
   this, the above algorithm assumes that pure ACKs are not ECN-capable.

A.3. Example Algorithm to Estimate Marked Bytes from Marked Packets

   If the AccECN Option is not available, the Data Sender can only
   decode CE-marking from the ACE field in packets. Every time an ACK
   arrives, to convert this into an estimate of CE-marked bytes, it
   needs an average of the segment size, s_ave. Then it can add or
   subtract s_ave from the value of d.ceb as the value of d.cep
   increments or decrements.

   To calculate s_ave, it could keep a record of the byte numbers of all
   the boundaries between packets in flight (including control packets),
   and recalculate s_ave on every ACK. However it would be simpler to
   merely maintain a counter packets_in_flight for the number of packets
   in flight (including control packets), which it could update once per
   RTT. Either way, it would estimate s_ave as:

      s_ave ~= flightsize / packets_in_flight,

   where flightsize is the variable that TCP already maintains for the
   number of bytes in flight. To avoid floating point arithmetic, it
   could right-bit-shift by lg(packets_in_flight), where lg() means log
   base 2.
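
   As a rough illustration of the estimate above, the following Python
   sketch approximates s_ave with a right shift and converts a change in
   the CE packet count into an estimate of CE-marked bytes. The
   function names estimate_s_ave and estimate_d_ceb are hypothetical and
   are not defined by this specification. The sketch also assumes that
   rounding lg(packets_in_flight) down is acceptable, which can
   over-estimate s_ave whenever packets_in_flight is not a power of two.

```python
def estimate_s_ave(flightsize: int, packets_in_flight: int) -> int:
    """Approximate flightsize / packets_in_flight using only a shift.

    Assumption: lg() is rounded down, so the divisor is the largest
    power of two not exceeding packets_in_flight.
    """
    if packets_in_flight <= 0:
        return 0  # nothing in flight, so no segment size to estimate
    shift = packets_in_flight.bit_length() - 1  # floor(lg(packets_in_flight))
    return flightsize >> shift


def estimate_d_ceb(d_cep: int, s_ave: int) -> int:
    """Estimate newly CE-marked bytes from newly CE-marked packets."""
    return d_cep * s_ave


# Hypothetical numbers: 8 full-sized 1460-byte segments in flight,
# and the ACE field shows 2 new CE marks.
s_ave = estimate_s_ave(8 * 1460, 8)
d_ceb = estimate_d_ceb(2, s_ave)
```

   With flightsize = 14,600 and packets_in_flight = 10, the shift
   divides by 8 rather than 10, so this sketch yields 1,825 rather than
   1,460. An implementation that needs a closer estimate would have to
   use a true integer division instead.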

   An alternative would be to maintain an exponentially weighted moving
   average (EWMA) of the segment size:

      s_ave = a * s + (1-a) * s_ave,

   where a is the decay constant for the EWMA. However, then it is
   necessary to choose a good value for this constant, which ought to
   depend on the number of packets in flight. Also the decay constant
   needs to be a power of two to avoid floating point arithmetic.

A.4. Example Algorithm to Beacon AccECN Options

   Section 3.2.5 requires a Data Receiver to beacon a full-length AccECN
   Option at least 3 times per RTT. This could be implemented by
   maintaining a variable to store the number of ACKs (pure and data
   ACKs) since a full AccECN Option was last sent and another for the
   approximate number of ACKs sent in the last round trip time:

      if (acks_since_full_last_sent > acks_in_round / BEACON_FREQ)
          send_full_AccECN_Option()

   For optimised integer arithmetic, BEACON_FREQ = 4 could be used,
   rather than 3, so that the division could be implemented as an
   integer right bit-shift by lg(BEACON_FREQ).

   In certain operating systems, it might be too complex to maintain
   acks_in_round. In others it might be possible by tagging each data
   segment in the retransmit buffer with the number of ACKs sent at the
   point that segment was sent. This would not work well if the Data
   Receiver was not sending data itself, in which case it might be
   necessary to beacon based on time instead, as follows:

      if (time_now > time_last_option_sent + RTT / BEACON_FREQ)
          send_full_AccECN_Option()

   However, this time-based approach does not work well when all the
   ACKs are sent early in each round trip, as is the case during
   slow-start.

   {ToDo: A simple and robust beaconing algorithm for all circumstances
   is still work-in-progress.}

A.5. Example Algorithm to Count Not-ECT Bytes

   A Data Sender in AccECN mode can infer the amount of TCP payload data
   arriving at the receiver marked Not-ECT from the difference between
   the amount of newly ACKed data and the sum of the bytes with the
   other three markings, d.ceb, d.e0b and d.e1b. Note that, because
   r.e0b is initialised to 1 and the other two counters are initialised
   to 0, the initial sum will be 1, which matches the initial offset of
   the TCP sequence number on completion of the 3WHS.

   For this approach to be precise, it has to be assumed that spurious
   (unnecessary) retransmissions do not lead to double counting. This
   assumption is currently correct, given that RFC 3168 requires that
   the Data Sender marks retransmitted segments as Not-ECT. However,
   the converse is not true; necessary retransmissions will result in
   under-counting.

   However, such precision is unlikely to be necessary. The only known
   use of a count of Not-ECT marked bytes is to test whether equipment
   on the path is clearing the ECN field (perhaps due to an out-dated
   attempt to clear, or bleach, what used to be the ToS field). To
   detect bleaching it will be sufficient to detect whether nearly all
   bytes arrive marked as Not-ECT. Therefore there should be no need to
   keep track of the details of retransmissions.

Appendix B. Alternative Design Choices (To Be Removed Before
            Publication)

   This appendix is informative, not normative. It records alternative
   designs that the authors chose not to include in the normative
   specification, but which the IETF might wish to consider for
   inclusion:

   Feedback all four ECN codepoints on the SYN/ACK: The last two
      negotiation combinations in Table 2 could also be used to indicate
      AccECN support and to feed back that the arriving SYN was ECT(0)
      or ECT(1).
      This could be used to probe the client to server path for
      incorrect forwarding of the ECN field
      [I-D.kuehlewind-tcpm-ecn-fallback]. Note, however, that it would
      be unremarkable if ECN on the SYN was zeroed by security devices,
      given RFC 3168 prohibited ECT on SYN because it enables DoS
      attacks.

   Feedback all four ECN codepoints on the First ACK: To probe the
      server to client path for incorrect ECN forwarding, it could be
      useful to have four feedback states on the first ACK from the TCP
      client. This could be achieved by assigning four combinations of
      the ECN flags in the main TCP header, and only initialising the
      ACE field on subsequent segments.

   Empty AccECN Option: It might be useful to allow an empty (Length=2)
      AccECN Option on the SYN/ACK and first ACK. Then if a host had to
      omit the option because there was insufficient space for a larger
      option, it would not give the impression to the other end that a
      middlebox had stripped the option.

Appendix C. Open Protocol Design Issues (To Be Removed Before
            Publication)

   1.  Currently it is specified that the receiver 'SHOULD' use
       Change-Triggered ACKs. It is controversial whether this ought to
       be a 'MUST' instead. A 'SHOULD' would leave the Data Sender
       uncertain whether it can rely on the timing and ordering
       information in ACKs. If the sender guesses wrongly, it will
       probably introduce at least 1 RTT of delay before it can use this
       timing information. Ironically it will most likely be wanting
       this information to reduce ramp-up delay. A 'MUST' could make it
       hard to implement AccECN in offload hardware. However, it is not
       known whether AccECN would be hard to implement in such hardware
       even with a 'SHOULD' here. For instance, was it hard to offload
       DCTCP to hardware because of change-triggered ACKs, or was this
       just one of many reasons?
       The choice between MUST and SHOULD here is critical. Before that
       choice is made, a clear use-case for certainty of timing and
       ordering information is needed, plus well-informed discussion
       about hardware offload constraints.

   2.  There is possibly a concern that a receiver could deliberately
       omit the AccECN Option pretending that it had been stripped by a
       middlebox. No known way can yet be contrived to take advantage
       of this downgrade attack, but it is mentioned here in case
       someone else can contrive one.

   3.  The s.cep counter might increase even if the s.ceb counter does
       not (e.g. due to a CE-marked control packet). The sender's
       response to such a situation is considered out of scope, because
       this ought to be dealt with in whatever future specification
       allows ECN-capable control packets. However, it is possible that
       the situation might arise even if the sender has not sent
       ECN-capable control packets, in which case, this draft might need
       to give some advice on how the sender should respond.

Appendix D. Changes in This Version (To Be Removed Before Publication)

   The difference between any pair of versions can be displayed at

   From kuehlewind-05 to ietf-00: Filename change to reflect WG
      adoption.

Authors' Addresses

   Bob Briscoe
   Simula Research Laboratory

   EMail: ietf@bobbriscoe.net
   URI:   http://bobbriscoe.net/

   Mirja Kuehlewind
   ETH Zurich
   Gloriastrasse 35
   Zurich 8092
   Switzerland

   EMail: mirja.kuehlewind@tik.ee.ethz.ch

   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna 1120
   Austria

   Phone: +43 1 3676811 3146
   EMail: rs@netapp.com