Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                        BT
Intended status: Experimental                           R. Scheffenegger
Expires: January 3, 2015                                    NetApp, Inc.
                                                           M. Kuehlewind
                                                 University of Stuttgart
                                                           July 02, 2014


                   More Accurate ECN Feedback in TCP
                 draft-kuehlewind-tcpm-accurate-ecn-03

Abstract

   Explicit Congestion Notification (ECN) is a mechanism where network
   nodes can mark IP packets instead of dropping them to indicate
   incipient congestion to the end-points.  Receivers with an ECN-
   capable transport protocol feed back this information to the sender.
   ECN is specified for TCP in such a way that only one feedback signal
   can be transmitted per Round-Trip Time (RTT).  Recently, new TCP
   mechanisms like Congestion Exposure (ConEx) or Data Center TCP
   (DCTCP) need more accurate ECN feedback information whenever more
   than one marking is received in one RTT.  This document specifies an
   experimental scheme to provide more than one feedback signal per RTT
   in the TCP header.  Given TCP header space is scarce, it overloads
   the three existing ECN-related flags in the TCP header.  Also, to
   improve robustness it uses 15 more bits if available.  For initial
   experiments it places these in a TCP option.  However, if the Urgent
   flag is cleared, zero header overhead could be achieved by reusing
   the Urgent Pointer opportunistically.  Therefore this document
   reserves space in the Urgent Pointer to be used if the protocol
   progresses to the standards track.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Document Roadmap
     1.2.  Goals
     1.3.  Experiment Goals
     1.4.  Terminology
     1.5.  Recap of Existing ECN feedback in IP/TCP
   2.  AccECN Protocol Overview
     2.1.  Essential and Supplementary Parts
     2.2.  Capability Negotiation
     2.3.  Two Complementary Feedback Methods
     2.4.  Resilience Against ACK Loss
     2.5.  Order of Arrival of IP-ECN Markings
   3.  AccECN Protocol Specification
     3.1.  Negotiation during the TCP handshake
     3.2.  Essential AccECN Feedback
       3.2.1.  The ACE Field
       3.2.2.  Safety against Ambiguity of the ACE Field
       3.2.3.  ACE Counter Selection
     3.3.  The Supplementary AccECN Field (SupAccECN)
       3.3.1.  Placement of the SupAccECN Field
       3.3.2.  Structure of the SupAccECN Field
       3.3.3.  Higher Resilience Congestion Counters (Top-ACE)
       3.3.4.  Accurate ECN Sequence within Delayed ACKs
       3.3.5.  AccECN Feedback Integrity
     3.4.  Accurate ECN Receiver Operation
     3.5.  Accurate ECN Sender Operation
     3.6.  Detection of Legacy Middlebox Interference
     3.7.  Correct Middlebox Operation
   4.  Interaction with Other TCP Variants
     4.1.  Compatibility with SYN Cookies
     4.2.  Compatibility with Other Options and Experiments
   5.  Protocol Properties
   6.  IANA Considerations
     6.1.  SupAccECN TCP Option Allocation
     6.2.  Non-Urgent Field Registry
   7.  Security Considerations
   8.  Acknowledgements
   9.  Comments Solicited
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Example Algorithms
     A.1.  Example Algorithm for Safety Against Long Sequences of
           ACK Loss
     A.2.  Example Counter Selection Algorithms
       A.2.1.  Counter Selection Algorithm Alt#1
       A.2.2.  Counter Selection Algorithm Alt#2
     A.3.  Example Encodings and Decodings of Top-ACE and ACE
       A.3.1.  Encoding Top-ACE and ACE by the Data Receiver
       A.3.2.  Decoding Top-ACE and ACE by the Data Sender
     A.4.  Example ECN Sequence (ESQ) Encoding Algorithms
   Appendix B.  Alternative Design Choices (To Be Removed Before
                Publication)
     B.1.  Supplementary AccECN Field on the SYN/ACK
       B.1.1.  Placement of the Supplementary AccECN Field in a
               SYN/ACK
       B.1.2.  Structure of the Supplementary AccECN Field in a
               SYN/ACK
     B.2.  Remove Not-ECT from ECN Sequence (ESQ) Encoding
     B.3.  ECN Fall-Back
     B.4.  Remote Delayed ACK Control Proposal
   Appendix C.  Open Protocol Design Issues (To Be Removed Before
                Publication)
   Appendix D.  Changes in This Version (To Be Removed Before
                Publication)

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is a mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate incipient congestion to the end-points.  Receivers with an
   ECN-capable transport protocol feed back this information to the
   sender.  ECN is specified for TCP in such a way that only one
   feedback signal can be transmitted per Round-Trip Time (RTT).
   Recently, proposed mechanisms like Congestion Exposure (ConEx
   [I-D.ietf-conex-abstract-mech]) or DCTCP [I-D.bensley-tcpm-dctcp]
   need more accurate ECN feedback information whenever more than one
   marking is received in one RTT.  A fuller treatment of the motivation
   for this specification is given in [I-D.ietf-tcpm-accecn-reqs].

   This document specifies an experimental scheme for ECN feedback in
   the TCP header to provide more than one feedback signal per RTT.  It
   will be called the more accurate ECN feedback scheme, or AccECN for
   short.  If AccECN progresses from experimental to the standards
   track, it is intended to be a complete replacement for classic ECN
   feedback, not a fork in the design of TCP.  Thus, the applicability
   of AccECN is intended to include all public and private IP networks
   (and even any non-IP networks over which TCP is used today).  Until
   the AccECN experiment succeeds, [RFC3168] will remain as the
   standards track specification for adding ECN to TCP.  To avoid
   confusion we call the ECN specification of [RFC3168] 'classic ECN' in
   this document.

   AccECN is solely an (experimental) change to the TCP wire protocol.
   It is completely independent of how TCP might respond to congestion
   feedback.  This specification overloads flags and fields in the main
   TCP header with new definitions, so both ends have to support the new
   wire protocol before it can be used.  Therefore during the TCP
   handshake the two ends use the three ECN-related flags in the TCP
   header to negotiate the most advanced feedback protocol that they can
   both support.

1.1.  Document Roadmap

   The following introductory sections outline the goals of AccECN
   (Section 1.2) and the goal of experiments with ECN (Section 1.3) so
   that it is clear what success would look like.  Then terminology is
   defined (Section 1.4) and a recap of existing prerequisite technology
   is given (Section 1.5).

   Section 2 gives an informative overview of the AccECN protocol.  Then
   Section 3 gives the normative protocol specification.  Section 4
   assesses the interaction of AccECN with commonly used variants of
   TCP, whether standardised or not.  Section 5 summarises the features
   and properties of AccECN.

   Section 6 summarises the protocol fields and numbers that IANA will
   need to assign and Section 7 points to the aspects of the protocol
   that will be of interest to the security community, as well as
   discussing additional security-related issues.

   The following aspects are relegated to appendices:

   o  Appendix A: Pseudocode examples for the various algorithms that
      AccECN uses;

   o  Then three appendices for use during document development that
      will be deleted before publication {ToDo: Delete this list before
      publication}:

      *  Appendix B: Protocol design alternatives that could be
         considered for inclusion in the main specification;

      *  Appendix C: a 'To Do' list of open protocol design issues;

      *  Appendix D: Document change log.

1.2.  Goals

   [I-D.ietf-tcpm-accecn-reqs] enumerates requirements that a candidate
   feedback scheme will need to satisfy, under the headings: resilience,
   timeliness, integrity, accuracy (including ordering and lack of
   bias), complexity, overhead and compatibility (both backward and
   forward).  It recognises that a perfect scheme that fully satisfies
   all the requirements is unlikely and trade-offs between requirements
   are likely.  Section 5 presents the properties of AccECN against
   these requirements and discusses the trade-offs made.

   The requirements document recognises that a protocol as ubiquitous as
   TCP needs to be able to serve as-yet-unspecified requirements.
   Therefore an AccECN receiver aims to act as a generic reflector of
   congestion information so that in future new sender behaviours can be
   deployed unilaterally.

1.3.  Experiment Goals

   TCP is critical to the robust functioning of the Internet, therefore
   any proposed modifications to TCP need to be thoroughly tested.  The
   present specification describes an experimental protocol that adds
   more accurate ECN feedback to the TCP protocol.  The intention is to
   specify the protocol sufficiently so that more than one
   implementation can be built in order to test its function, robustness
   and interoperability (with itself and with previous versions of ECN
   and TCP).

   Success criteria:  The experimental protocol will be considered
      successful if it satisfies the requirements of
      [I-D.ietf-tcpm-accecn-reqs] in the consensus opinion of the IETF
      tcpm working group.  In short, this requires that it improves the
      accuracy and timeliness of TCP's ECN feedback, as claimed in
      Section 5, while striking a balance between the conflicting
      requirements of resilience, integrity and minimisation of
      overhead.  It also requires that it is not unduly complex, and
      that it is compatible with prevalent equipment behaviours in the
      current Internet, whether or not they comply with standards.

   Duration:  To be credible, the experiment will need to last at least
      12 months from publication of the present specification.  At that
      time, a report on the experiment will be written up.  If
      successful, it would then be appropriate to work on a standards
      track specification that adds more accurate ECN feedback to TCP.

1.4.  Terminology

   AccECN:  The more accurate ECN feedback scheme will be called AccECN
      for short.

   Classic ECN:  The ECN scheme as specified in [RFC3168].

   ACK:  A TCP acknowledgement, with or without a data payload.

   Pure ACK:  A TCP acknowledgement without a data payload.

   SupAccECN:  The Supplementary Accurate ECN field that provides
      additional resilience as well as information about the ordering of
      ECN markings covered by a delayed ACK.

   Data Receiver:  The endpoint of a TCP half-connection that receives
      data and sends AccECN feedback.

   Data Sender:  The endpoint of a TCP half-connection that sends data
      and receives AccECN feedback.

   Outgoing AccECN Protocol Handler (or, Outgoing Protocol Handler):
      The protocol handler at the Data Receiver that marshals the AccECN
      fields when sending an ACK.

   Incoming AccECN Protocol Handler (or, Incoming Protocol Handler):
      The protocol handler at the Data Sender that reads the AccECN
      fields when receiving an ACK.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.5.  Recap of Existing ECN feedback in IP/TCP

   ECN [RFC3168] requires two bits in the IP header.  Prior to the
   specification of ECN, these two bits were always zero, which is
   called Not-ECT.  An ECN sender can set two possible codepoints
   (ECT(0) or ECT(1)) to indicate an ECN-capable transport (ECT).  It is
   prohibited from doing so unless it has checked that the receiver will
   understand ECN and be able to feed it back.  A network node can set
   both bits simultaneously when it experiences congestion, which is
   termed 'Congestion Experienced' (CE), or loosely a 'congestion mark'.
   Table 1 summarises these codepoints.

   +---------------+-----------+-----------+---------------------------+
   | IP-ECN        | Codepoint | Abbrev-   | Description               |
   | codepoint     | name      | iation    |                           |
   | (binary)      |           |           |                           |
   +---------------+-----------+-----------+---------------------------+
   | 00            | Not-ECT   | N         | Not ECN-Capable Transport |
   | 01            | ECT(1)    | 1         | ECN-Capable Transport (1) |
   | 10            | ECT(0)    | 0         | ECN-Capable Transport (0) |
   | 11            | CE        | C         | Congestion Experienced    |
   +---------------+-----------+-----------+---------------------------+

                 Table 1: The ECN Field in the IP Header

   In the TCP header the first two bits in byte 14 are defined as flags
   for the use of ECN (CWR and ECE in Figure 1).  On reception of a CE-
   marked packet at the IP layer, the Data Receiver starts to set the
   Echo Congestion Experienced (ECE) flag continuously in the TCP header
   of ACKs, which ensures the signal is received reliably even if ACKs
   are lost.  The TCP sender confirms that it has received at least one
   ECE signal by responding with the congestion window reduced (CWR)
   flag, which allows the TCP receiver to stop repeating the ECN-Echo
   flag.  This always leads to a full RTT of ACKs with ECE set.  Thus
   any additional CE markings arriving within this RTT cannot be fed
   back.

   The ECN Nonce [RFC3540] is an optional experimental addition to ECN
   that the TCP sender can use to protect against accidental or
   malicious concealment of marked or dropped packets.  The sender can
   send an ECN nonce, which is a continuous pseudo-random pattern of
   ECT(0) and ECT(1) codepoints in the ECN field.  The receiver is
   required to feed back a 1-bit nonce sum that counts the occurrence of
   ECT(1) packets using the last bit of byte 13 in the TCP header, which
   is defined as the Nonce Sum (NS) flag.
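
   The limitation of classic ECN feedback recapped above can be
   illustrated with a minimal sketch in Python.  It is a simplified
   model, not part of this specification: the class and function names
   are hypothetical, every packet is assumed to trigger its own ACK (no
   delayed ACKs), and no CWR arrives within the round trip that is
   modelled.

```python
from dataclasses import dataclass

# IP-ECN codepoints from Table 1 (two-bit field in the IP header).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ClassicEcnReceiver:
    """Hypothetical, simplified model of the classic ECN echo latch."""
    echoing: bool = False

    def on_data(self, ip_ecn: int, cwr: bool = False) -> None:
        # CWR from the sender confirms the echo and clears the latch ...
        if cwr:
            self.echoing = False
        # ... while a CE mark on an arriving packet sets it (again).
        if ip_ecn == CE:
            self.echoing = True

    def ack_ece(self) -> bool:
        # ECE is repeated on every ACK while the latch is set.
        return self.echoing


def ece_flags_for(arrivals):
    """Return the ECE flag of one ACK per arriving packet."""
    rx = ClassicEcnReceiver()
    flags = []
    for ip_ecn in arrivals:
        rx.on_data(ip_ecn)
        flags.append(rx.ack_ece())
    return flags


# One CE mark and three CE marks within the same round trip produce
# identical feedback, so the sender cannot tell them apart.
one_mark = ece_flags_for([CE, ECT0, ECT0, ECT0])
three_marks = ece_flags_for([CE, CE, ECT0, CE])
```

   In this model both arrival patterns yield the same sequence of ECE
   flags, which is the sense in which classic ECN carries at most one
   congestion signal per RTT.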

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 1: The (post-ECN Nonce) definition of the TCP header flags

2.  AccECN Protocol Overview

   This section provides an informative overview of the AccECN protocol
   that will be normatively specified in Section 3.

2.1.  Essential and Supplementary Parts

   Given limitations on the space available for TCP options and given
   the possibility that certain incorrectly designed middleboxes prevent
   TCP using any new options, the AccECN protocol has had to be designed
   in two parts:

   o  an essential part that provides more accurate ECN feedback than
      classic ECN with limited resilience against ACK loss;

   o  a supplementary part that serves three functions:

      *  it greatly improves the resilience of AccECN feedback
         information against loss of ACKs;

      *  it provides information about the order in which ECN markings
         in the IP header arrived at the Data Receiver;

      *  it improves the timeliness of AccECN feedback when a delayed
         ACK covers multiple congestion signals.

   The essential part overloads the previous definition of the three
   flags in the TCP header that had been assigned for use by ECN.
   This design choice deliberately replaces the classic ECN feedback
   protocol, rather than leaving classic ECN intact and adding more
   accurate feedback separately:

   o  because this efficiently reuses scarce TCP header space, given TCP
      option space is approaching saturation;

   o  because a single upgrade path for the TCP protocol is preferable
      to a fork in the design;

   o  because otherwise classic and accurate ECN feedback could give
      conflicting feedback on the same segment, which could open up new
      security concerns and make implementations unnecessarily complex;

   o  because middleboxes are more likely to faithfully forward the TCP
      ECN flags than newly defined areas of the TCP header.

   AccECN is designed to work even if the supplementary part is removed
   or zeroed out, as long as the essential part gets through.  The
   supplementary part is carried in a field called Supplementary
   Accurate ECN (SupAccECN).

   It is eventually intended that the SupAccECN field would be placed
   within the main TCP header, by overloading the Urgent Pointer in any
   segment with URG = 0.  However, it would be presumptuous to reassign
   bits in the main TCP header on an experimental basis.  Therefore,
   this specification reserves sufficient bits within the Urgent Pointer
   (when URG = 0) for use by AccECN if it reaches the standards track.
   For the present AccECN experiments, this specification defines an
   experimental TCP option to carry SupAccECN instead.

   When URG = 0, the Urgent Pointer field cannot be used as an Urgent
   Pointer.  Therefore, this specification gives it a new name when URG
   = 0, defining it as the Non-Urgent field.  This specification also
   establishes an IANA registry for future standards actions to assign
   values in this newly defined Non-Urgent field.

   In order to ease a future transition from experiment to standards
   track, the Incoming Protocol Handler of all AccECN implementations is
   required to be able to read the SupAccECN field whether it arrives in
   a TCP Option or within the Non-Urgent field.  However, for the
   present experimental specification, an AccECN implementation is
   forbidden from writing into the Non-Urgent field.

   Reserving the Non-Urgent field for future use by AccECN is justified,
   because the Non-Urgent field cannot always be guaranteed to be
   available.  AccECN is unusual in that it is designed to work
   reasonably well even if the supplementary part is sometimes missing.
   Therefore, on the rare segments when the Urgent Pointer is needed for
   its original purpose, URG=1 can still be set and AccECN will still
   work.  However, a future standards action can overload part of the
   Non-Urgent field for use by AccECN, whenever URG=0.

2.2.  Capability Negotiation

   AccECN is a change to the wire protocol of the main TCP header,
   therefore it can only be used if both endpoints have been upgraded to
   understand it.  The client signals support for AccECN on the initial
   SYN of a connection and the server signals whether it supports AccECN
   on the SYN/ACK.  The TCP flags on the SYN that the client uses to
   signal AccECN support have been carefully chosen so that a server
   will interpret them as a request to support the most advanced variant
   of ECN that it supports.  Then the client falls back to the same ECN
   variant.

   The above negotiation uses the three ECN-related flags in the TCP
   header and determines if both ends support the essential part of
   AccECN.  On segments after the SYN/ACK, the SupAccECN field is used
   to determine whether the supplementary part of AccECN is usable over
   each half-connection.  No supplementary part is needed on the initial
   SYN.
   A proposal to include a supplementary AccECN field on the
   SYN/ACK is included in Appendix B.1.

2.3.  Two Complementary Feedback Methods

   Each AccECN half-connection uses two complementary methods to feed
   back ECN markings:

   Cumulative Counters:  A Data Receiver maintains three counters for
      the number of CE, ECT(1) and Not-ECT codepoints received since the
      start of the half-connection.  In each ACK it places one of these
      counters, reduced in size by a suitable modulo operation.  The
      Data Sender reads each counter in order to update its own three
      respective counters, which it uses to track the three counters at
      the Data Receiver.  Of course, each endpoint takes the role of
      both Data Receiver and Data Sender, so each will maintain three
      counters as a receiver and three as a sender.  AccECN does not
      provide an explicit count of ECT(0) marks, but this can be
      inferred from the other feedback;

   Sequence List:  A list of the codepoints in the IP-ECN field of all
      the segments covered by a delayed ACK, in the order that they
      arrived at the Data Receiver.  This list also provides timely
      feedback of any congestion information other than the one covered
      by the single counter selected.

   TCP's traditional feedback is byte-based, whereas AccECN feedback is
   packet-based, which was a pragmatic choice to reduce feedback
   overhead, given each packet carries only one ECN mark.  AccECN aims
   to act as a sufficiently generic feedback reflector that can be
   applied for different uses by different TCP sender behaviours, both
   existing and in the future.

   If a particular sender behaviour needed to associate AccECN's
   feedback of each ECN marking with the size of the original packet
   that picked up the marking, there is enough information in AccECN
   feedback to do so, although perhaps imperfectly.
   Similarly, if a sender behaviour needed to associate the feedback of
   each ECN marking with the timing of each packet it originally sent,
   that too ought to be possible.  Of course, the order of arrival at
   the receiver is not necessarily the order in which packets were sent,
   and the order in which ACKs return might be different again.  So, to
   apply AccECN to these more challenging tasks, the Data Sender would
   probably have to record the sizes and/or timings of packets in flight
   and combine AccECN feedback with the cumulative acknowledgement
   numbers on each ACK as well as selective ACK (SACK) information
   [RFC2018].

   Whether such calculations are required or not is outside the scope of
   the present AccECN specification.  The role of AccECN is merely to
   ensure it would be possible for a Data Sender to reconstruct which
   segment carried which marking, not to mandate whether it should.  As
   long as AccECN reflects sufficient feedback information without
   excessive overhead, it fulfils its role.  One reason for the
   experimental status of the present specification is to establish
   whether the trade-off between accuracy and overhead has been pitched
   at the right level.

2.4.  Resilience Against ACK Loss

   Because the counter method repeats one of the accumulating counters
   on each ACK, if ACKs are lost, a counter in a subsequent ACK will
   still recover the lost information in a fairly timely fashion.

   There is very little space in the 3 bits available for the essential
   part of an AccECN acknowledgement, so each of the three counters can
   wrap fairly frequently.  Therefore, even if the counter appears to
   have incremented by one (say), the counter might have actually
   wrapped completely then incremented by one.  This is a possibility
   because the whole sequence of ACKs carrying the intervening values of
   the counter might all have been lost or delayed.
   To be able to tell if a counter has wrapped, AccECN feeds back more
   significant bits of the counter within the supplementary part, making
   it resilient to ACK loss.

   The supplementary part includes the sequence of ECN codepoints
   covered by a delayed ACK (see below).  As well as providing ordering
   information, this provides more timely feedback when more than one
   counter has changed within the time covered by one delayed ACK.  It
   also provides resilience against the loss of a counter in a future
   ACK.

2.5.  Order of Arrival of IP-ECN Markings

   [RFC5681] recommends using delayed ACKs, so one acknowledgement will
   often carry feedback about the ECN markings on more than one segment.
   Therefore, ideally, AccECN is required to provide ordering
   information [I-D.ietf-tcpm-accecn-reqs].  However, a counter in each
   ACK only says how many more IP-ECN markings arrived since the last
   ACK, not the order in which they arrived.

   This might seem an unnecessary level of precision given [RFC5681]
   currently advises against delaying acknowledgement for more than two
   full-sized segments.  However, a delayed ACK could cover multiple
   segments that are smaller than full-size.  Also, in practice one
   delayed ACK can cover many tens of packets that have all been
   coalesced into one large segment by large receive offload (LRO)
   hardware before being passed to the Data Receiver.  Therefore, the
   design of AccECN allows for future expansion of the number of
   segments that can be covered by one delayed ACK.

   Once the connection is in progress, in each ACK the Data Receiver
   encodes the sequence of IP-ECN markings covered by that ACK, which
   includes the number of segments covered by the delayed ACK.
   The sequence does not need to include the last segment to arrive,
   because there is already sufficient information in the essential part
   of the feedback to infer that marking (by subtracting the markings in
   the list from the increment of the cumulative counter).

   AccECN uses a fixed size (10b) field for the sequence encoding.  This
   can communicate a sequence of up to 14 codepoints, not including the
   last segment.  The encoding is optimised for a selection of simple
   but common patterns.  If the pattern of arriving codepoints becomes
   too complex to encode in 10b, the Data Receiver has to emit an ACK
   and start a new sequence for the next ACK.  The scheme can always
   encode all the theoretically possible combinations of arriving
   codepoints in a delayed ACK covering 3 segments or less.

3.  AccECN Protocol Specification

3.1.  Negotiation during the TCP handshake

   During the TCP handshake at the start of a connection, to request
   more accurate ECN feedback the originator of the connection (host A)
   MUST set the TCP flags NS=1, CWR=1 and ECE=1 in the initial SYN
   segment.

   If a responding host (B) that implements AccECN receives a SYN with
   the above three flags set, it MUST set both its half connections into
   AccECN mode.  Then it MUST set the flags NS=0, CWR=1 and ECE=0 on its
   response in the SYN/ACK segment to confirm that it supports AccECN.
   The responding host MUST NOT set this combination of flags unless the
   preceding SYN requested support for AccECN as above.

   Once an originating host (A) has sent the above SYN to declare that
   it supports AccECN, and once it has received the above SYN/ACK
   segment that confirms that the responding host supports AccECN, the
   originating host MUST set both its half connections into AccECN mode.
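
   The handshake rules above, together with the fall-back rows of
   Table 2, can be sketched as a pair of lookups.  This is an
   illustrative sketch in Python only, not a normative algorithm: the
   function names and the representation of the flags as (NS, CWR, ECE)
   tuples are assumptions made for the example, and SYN flag
   combinations that Table 2 does not list are deliberately left
   undefined.

```python
# Flag triples are (NS, CWR, ECE), in the column order used by Table 2.
ACCECN_SYN = (1, 1, 1)       # SYN sent by an AccECN originator (host A)
ACCECN_SYNACK = (0, 1, 0)    # SYN/ACK confirming AccECN support


def originator_mode(synack_flags):
    """Mode an AccECN originator enters after having sent ACCECN_SYN.

    Hypothetical helper following the Table 2 rows in which host A is
    AccECN-capable.
    """
    if synack_flags == ACCECN_SYNACK:
        return "AccECN"
    if synack_flags == ACCECN_SYN:
        # A broken server reflected the SYN flags back unchanged.
        return "Not ECN"
    if synack_flags in {(1, 0, 1), (0, 0, 1)}:
        return "classic ECN"
    if synack_flags == (0, 0, 0):
        return "Not ECN"
    # The remaining combinations (0,1,1), (1,0,0) and (1,1,0) are
    # reserved; the originator still enters AccECN mode.
    return "AccECN"


def responder_reply(syn_flags):
    """Return (mode, SYN/ACK flags) for an AccECN-capable responder.

    Returns None for SYN flag combinations that Table 2 does not list.
    """
    table = {
        (1, 1, 1): ("AccECN", ACCECN_SYNACK),
        (0, 1, 1): ("classic ECN", (0, 0, 1)),
        (0, 0, 0): ("Not ECN", (0, 0, 0)),
    }
    return table.get(syn_flags)
```

   For example, under these assumptions responder_reply((1, 1, 1))
   selects AccECN mode with SYN/ACK flags (0, 1, 0), and passing those
   flags to originator_mode() selects AccECN mode at host A as well.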

   The three flags set to 1 to indicate AccECN support on the SYN have
   been carefully chosen to enable natural fall-back to prior stages in
   the evolution of ECN.  Table 2 tabulates all the negotiation
   possibilities for ECN-related capabilities that involve at least one
   AccECN-capable host.  To compress the width of the table, the
   headings of the first four columns have been severely abbreviated, as
   follows:

   Ac:  More *Ac*curate ECN Feedback

   N:  ECN-*N*once [RFC3540]

   E:  *E*CN [RFC3168]

   I:  Not-ECN (*I*mplicit congestion notification using packet drop).

   +----+---+---+---+------------+--------------+------------------+
   | Ac | N | E | I | SYN A->B   | SYN/ACK B->A | Mode             |
   +----+---+---+---+------------+--------------+------------------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE   |                  |
   | AB |   |   |   | 1   1   1  | 0   1   0    | AccECN           |
   |    |   |   |   |            |              |                  |
   | A  | B |   |   | 1   1   1  | 1   0   1    | classic ECN      |
   | A  |   | B |   | 1   1   1  | 0   0   1    | classic ECN      |
   | A  |   |   | B | 1   1   1  | 0   0   0    | Not ECN          |
   | A  |   |   | B | 1   1   1  | 1   1   1    | Not ECN (broken) |
   |    |   |   |   |            |              |                  |
   | B  | A |   |   | 0   1   1  | 0   0   1    | classic ECN      |
   | B  |   | A |   | 0   1   1  | 0   0   1    | classic ECN      |
   | B  |   |   | A | 0   0   0  | 0   0   0    | Not ECN          |
   |    |   |   |   |            |              |                  |
   | A  |   |   |   | 1   1   1  | 0   1   1    | AccECN (Rsvd)    |
   | A  |   |   |   | 1   1   1  | 1   0   0    | AccECN (Rsvd)    |
   | A  |   |   |   | 1   1   1  | 1   1   0    | AccECN (Rsvd)    |
   +----+---+---+---+------------+--------------+------------------+

      Table 2: ECN capability negotiation between Originator (A) and
                              Responder (B)

   Table 2 is divided into blocks each separated by an empty row.

   1.  The top block shows the case already described where both
       endpoints support AccECN.

   2.  The second block shows the cases where the originating host (A)
       supports AccECN but the responding host (B) supports some earlier
       variant of TCP, indicated in its SYN/ACK.
Therefore, as soon as 620 an originating AccECN-capable host (A) receives the SYN/ACK shown 621 it MUST set both its half connections into the mode shown in the 622 rightmost column. 624 3. The third block shows the cases where the responding host (B) 625 supports AccECN but the originating host (A) supports some 626 earlier variant of TCP, indicated in its SYN. Therefore, as soon 627 as responding AccECN-capable host (B) receives the SYN shown it 628 MUST set both its half connections into the mode shown in the 629 rightmost column. 631 4. Forward Compatibility: The fourth block enumerates the remaining 632 combinations of AccECN-related flags that are Reserved for future 633 use by AccECN ('Rsvd'). 635 * If an originating AccECN host (A) sends NS=1, CWR=1 and ECE=1 636 in the initial SYN segment and if it receives any of these 637 Reserved values in a SYN/ACK response, it MUST set both its 638 half connections into AccECN mode. 640 {ToDo: Can we think of anything now that an AccECN server 641 could use any of these Reserved combinations of flags for, to 642 signal something extra for the whole connection? If not, 643 rather than Reserved, we need to decide whether to make these 644 combinations Rsvd and therefore not switch to AccECN mode.} 646 * To comply with the present AccECN protocol, middleboxes MUST 647 forward these Rsvd combinations of flags unaltered (see also 648 Section 3.7). 650 The table is self-explanatory in most respects, but the following 651 exceptional cases need some explanation. 653 Not ECN (broken): [RFC3168] points out that broken TCP server 654 implementations exist that reflect the 'reserved' flags [RFC0793] 655 back to the originator. If the SYN/ACK reflects the same flag 656 settings as the preceding SYN, an AccECN client implementation 657 MUST revert to Not-ECT. 659 ECN Nonce: An AccECN implementation, whether client or server, 660 sender or receiver, does not need to implement the ECN Nonce 661 behaviour [RFC3540]. 
AccECN is compatible with a sender-only ECN 662 feedback integrity approach that does not use up the ECT(1) 663 codepoint (see Section 3.3.5). 665 Simultaneous Open: An originating AccECN Host (A), having sent a SYN 666 with NS=1, CWR=1 and ECE=1, might receive another SYN from host B. 667 Host A MUST then enter the same mode as it would have entered had 668 it been a responding host and received the same SYN. Then host A 669 MUST send the same SYN/ACK as it would have sent had it been a 670 responding host (see the third block above). 672 3.2. Essential AccECN Feedback 674 This section specifies the essential part of AccECN feedback, 675 including its placement and the encoding of the counters. 677 3.2.1. The ACE Field 679 Once AccECN has been negotiated for a connection, it overloads the 680 three TCP flags ECE, CWR and NS in the main TCP header as one 3-bit 681 field to encode 8 distinct codepoints. Then the field is given a new 682 name, ACE, as shown in Figure 2. The original definition of these 683 three flags in the TCP header, including the addition of support for 684 the ECN Nonce, is shown for comparison in Figure 1. This 685 specification does not rename these three TCP flags, it merely 686 overloads them with another name and definition once an AccECN 687 connection has been established. 689 A host MUST interpret the ECE, CWR and NS flags as the 3-bit ACE 690 counter on a segment with SYN=0 that it sends or receives after it 691 has set both its half-connections into AccECN mode having 692 successfully negotiated AccECN (see Section 3.1). A host MUST NOT 693 interpret the 3 flags as a 3-bit ACE field on any segment with SYN=1 694 (whether ACK is 0 or 1), or if AccECN negotiation is incomplete or 695 has not succeeded. 697 Both parts of each of these conditions are equally important. For 698 instance, even if AccECN negotiation has been successful, the ACE 699 field is not defined on any segments with SYN=1 (e.g. 
a 700 retransmission of an unacknowledged SYN/ACK, or when both ends send 701 SYN/ACKs after AccECN support has been successfully negotiated during 702 a simultaneous open). 704 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 705 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 706 | | | | U | A | P | R | S | F | 707 | Header Length | Reserved | ACE | R | C | S | S | Y | I | 708 | | | | G | K | H | T | N | N | 709 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 711 Figure 2: Definition of the ACE field within bytes 13 and 14 of the 712 TCP Header (when AccECN has been negotiated and SYN=0). 714 The Data Receiver maintains three counters, r.ci, r.e1 and r.ni, to 715 count the number of packets it receives with respectively the CE, 716 ECT(1) and Not-ECT codepoint in the IP-ECN field. When a Data 717 Receiver first enters AccECN mode, it MUST initialise its counters to 718 zero. The Outgoing Protocol Handler at the Data Receiver uses the 719 ACE field to encode one of these counters at a time into each ACK. 721 How it determines which counter to signal on any particular ACK is 722 specified later (Section 3.2.3). 724 The 8 possible codepoints of the ACE field are shown in Table 3. A 725 Data Receiver uses four of them to encode a 'Congestion Indication' 726 (CI) counter for CE markings and three to encode E1 for ECT(1) 727 markings. It uses the eighth codepoint to feed back the arrival of 728 Not-ECT in the IP-ECN field using a codepoint termed NI (Not-ECT 729 Indication). We will now use an example to explain how ACE is 730 encoded by the Outgoing Protocol Handler and decoded by the Incoming 731 Protocol Handler. 
733    +-----------+----------------+------------------+-------------------+
734    | ACE (base | CI (base 4)    | E1 (base 3) for  | NI (base 1) for   |
735    | 2)        | for CE         | ECT(1)           | Not-ECT           |
736    +-----------+----------------+------------------+-------------------+
737    | 000       | 0              | -                | -                 |
738    | 001       | 1              | -                | -                 |
739    | 010       | 2              | -                | -                 |
740    | 011       | 3              | -                | -                 |
741    | 100       | -              | 0                | -                 |
742    | 101       | -              | 1                | -                 |
743    | 110       | -              | 2                | -                 |
744    | 111       | -              | -                | 0                 |
745    +-----------+----------------+------------------+-------------------+

747    Table 3: Codepoint assignments in the ACE field for feedback of
748    congestion counters

750    Encode: Imagine that the E1 counter is the next to be signalled and
751    r.e1 = 17.  Then, because the E1 counter is base 3, the Data Receiver
752    calculates

754       E1 = 17 % 3
755          = 2

757    So it looks up E1=2 in Table 3 to get the codepoint to set in ACE,
758    which is 0b110.

760    Decode: The Data Sender maintains three counters, s.ci, s.e1 and s.ni
761    and it uses the incoming codepoints in ACE to ensure these track the
762    equivalent counters at the receiver.  Imagine the s.e1 counter at the
763    Data Sender has currently reached 16 when the 0b110 codepoint arrives
764    via the ACE field.  The Data Sender looks up 0b110 in Table 3 to get
765    E1 = 2.  It finds the difference between s.e1 and E1 using modulo 3
766    arithmetic, then adds the difference to s.e1, as follows:

768       delta_s.e1 = (E1 + 3 - s.e1 % 3) % 3
769                  = (2 + 3 - 16 % 3) % 3
770                  = 1
771       => s.e1    = s.e1 + delta_s.e1
772                  = 16 + 1
773                  = 17

775    3.2.2.  Safety against Ambiguity of the ACE Field

777    Clearly, the CI, E1 and NI counters will frequently wrap given the
778    size of the space available to encode them is so small.  If a number
779    of ACKs in a row are lost, the Data Sender might not be able to tell
780    whether one of these counters has wrapped or not.
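The encode and decode steps of Section 3.2.1, and the wrap ambiguity just described, can be sketched as follows. This is a minimal illustration, assuming the codepoint layout of Table 3; the helper names `encode_ace`, `decode_ace` and `update_sender` are hypothetical and are not defined by this specification.

```python
# Hypothetical sketch of ACE encoding/decoding per Table 3.
BASE = {"CI": 4, "E1": 3, "NI": 1}                  # codepoints per counter
OFFSET = {"CI": 0b000, "E1": 0b100, "NI": 0b111}    # first ACE codepoint of each


def encode_ace(counter, r_value):
    """Data Receiver: 3-bit ACE codepoint for receiver counter r_value."""
    return OFFSET[counter] + r_value % BASE[counter]


def decode_ace(ace):
    """Data Sender: (counter name, signalled value) for an ACE codepoint."""
    if ace <= 0b011:
        return "CI", ace
    if ace <= 0b110:
        return "E1", ace - 0b100
    return "NI", 0


def update_sender(s_value, counter, signalled):
    """Advance the sender's counter by the modulo-base difference."""
    base = BASE[counter]
    delta = (signalled + base - s_value % base) % base
    return s_value + delta


# Worked example from the text: r.e1 = 17, s.e1 = 16.
ace = encode_ace("E1", 17)                 # 0b110
name, e1 = decode_ace(ace)                 # ("E1", 2)
print(bin(ace), update_sender(16, name, e1))

# Ambiguity: if ACKs are lost and r.e1 has reached 20, the same codepoint
# arrives and the sender still infers 17, missing one full wrap of 3.
print(update_sender(16, *decode_ace(encode_ace("E1", 20))))
```

With ACE alone the sender can only recover the increment modulo the counter's base, which is why Section 3.2.2 asks it to assume a wrap when the gap in acknowledgement numbers makes one plausible.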
782 The supplementary part of AccECN provides more space to signal higher 783 bits of these counters, which gives resilience against ACK loss 784 (Section 3.3.3). However, the supplementary part of the AccECN 785 protocol might be unavailable (perhaps due to middlebox 786 interference). 788 Therefore, if the Data Sender detects that these fields could have 789 wrapped, it SHOULD behave conservatively. That is, if the AccECN 790 sender detects that the supplementary part of the AccECN protocol is 791 unavailable, and it detects a jump in the acknowledgement number that 792 implies that so many ACKs are missing that a counter could have 793 wrapped under the prevailing conditions, it SHOULD decode the counter 794 assuming that the counter did wrap. If missing acknowledgement 795 numbers arrive later (reordering) and prove that the counter did not 796 wrap, the Data Sender MAY attempt to neutralise the effect of any 797 action it took based on a conservative assumption that it later found 798 to be incorrect. 800 An example algorithm to implement this policy is given in 801 Appendix A.1. An implementer MAY develop an alternative algorithm as 802 long as it satisfies these requirements. 804 3.2.3. ACE Counter Selection 806 If the Data Receiver implements ACK-withholding as recommended in 807 [RFC5681], more than one counter could have incremented before 808 sending each ACK. The Data Receiver follows the steps below to determine which 809 counter to encode in the ACE field: 811 1. If the last IP-ECN field that arrived was CE, ECT(1) or Not-ECT, 812 the Data Receiver MUST encode the associated counter in the ACE 813 field, i.e. respectively CI, E1 or NI; 815 2.
If the last IP-ECN field that arrived was ECT(0), the Data 816 Receiver can signal either the CI or the E1 counter: 818 * The choice of which to signal SHOULD be based on the principle 819 that the more one counter has changed recently the more it 820 SHOULD be signalled; 822 * If there is a tie between CI and E1, CI MUST take precedence. 824 Appendix A.2 suggests two possible algorithms that could be used to 825 determine which counter to encode in ACE. An implementer MAY develop 826 an alternative algorithm as long as it meets the requirements in the 827 three steps above. 829 If an AccECN Data Sender has to retransmit a packet due to a 830 suspected loss, in its role as a Data Receiver it will piggy-back 831 AccECN feedback on the retransmitted packet. On a retransmitted 832 packet, a Data Receiver MUST select which counter to send using the 833 rules in the above three steps and encode the latest prevailing value 834 of the selected counter, which will not necessarily be the same 835 counter that the packet carried originally, nor the original value of 836 that counter. 838 There is no standards track end-to-end definition of the ECT(1) 839 codepoint of the IP-ECN field. Nonetheless, to comply with this 840 specification, an AccECN Data Receiver MUST implement and reflect the 841 ECT(1) counter as specified here. Then, a standards track definition 842 of the ECT(1) codepoint can be defined in future and be deployed 843 unilaterally in Data Senders, without having to wait for associated 844 receivers to be deployed. The above rules ensure that a Data 845 Receiver will only feed back the ECT(1) counter if some packets 846 marked with ECT(1) are arriving. 848 At the Data Sender, the Incoming AccECN Protocol Handler MUST be able 849 to receive feedback of E1 codepoints, but the Data Sender MAY discard 850 them (it might not have any logic to understand what to do with 851 them). 
However, if an Incoming AccECN Protocol Handler is running 852 back-to-back with an Outgoing AccECN Protocol Handler (e.g. to 853 implement a split TCP connection), it MUST forward the values of all 854 AccECN counters including E1, and not discard any. 856 {ToDo: Refer if necessary to Section 3.4.} 858 3.3. The Supplementary AccECN Field (SupAccECN) 860 This section defines the size, placement and internal structure of 861 the Supplementary AccECN field (SupAccECN), as well as the semantics 862 of the sub-fields within it. The internal structure of the SupAccECN 863 field is agnostic to where it is placed in the TCP header, so that it 864 can be moved during planned evolution of the protocol. The protocol 865 overview in Section 2 explains that the field is placed in a TCP 866 option for initial experiments, but if the protocol progresses to the standards 867 track, the plan is to place the field in the main TCP header, using some 868 of the bits in the Urgent Pointer (when URG=0). 870 3.3.1. Placement of the SupAccECN Field 872 The Outgoing AccECN Protocol Handler at a Data Receiver MUST place 873 the SupAccECN field in a SupAccECN TCP option (Section 3.3.1.1). 875 Forward compatibility: If the SupAccECN TCP option (Section 3.3.1.1) 876 is absent, the Incoming AccECN Protocol Handler at a Data Sender MUST 877 attempt to read the SupAccECN field from within the Non-Urgent field 878 (Section 3.3.1.2). 880 3.3.1.1. The SupAccECN TCP Option 882 The Data Receiver MUST set the Kind field to 0x (TBA), which is 883 registered in Section 6.1 as a new TCP option Kind called SupAccECN. 884 An experimental TCP option with Kind=254 MAY be used for initial 885 experiments, with magic number 0xACCE. 887 The Data Receiver MUST set the Length field to 4 [octets] on any 888 segment with SYN=0. For initial experiments, the Length field MUST 889 be 2 greater to accommodate the 16-bit magic number.
In either case, 890 the Data Receiver MUST pad the most significant bit with zeros up to 891 a whole number of octets, as illustrated in Figure 3. This padding 892 bit is currently unused (CU). 894 Forward compatibility: To comply with the present AccECN 895 specification: 897 o the Incoming AccECN Protocol Handler at the Data Sender MUST 898 ignore the padding bit, whether it is set to zero or not; 900 o if the Length field of the TCP option is greater than that 901 expected from the paragraph above, a Data Sender MUST take the 902 SupAccECN field to be aligned with the right hand end (least 903 significant bit) of the TCP Option as calculated using the Length 904 field; 906 o if the Length value is less than that expected from the paragraph 907 above, the Incoming AccECN Protocol Handler at the Data Sender 908 MUST discard the segment; 910 o a middlebox MUST forward the padding bit unaltered, whether it is 911 set to zero or not; 913 o if the Length value is different to that expected from the 914 paragraph above (whether larger or smaller), a middlebox MUST 915 still forward the TCP option unaltered. 917 0 1 2 3 918 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 919 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ a) 920 | Kind = 0xKK | Length = 4 |0| SupAccECN | 921 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 923 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 924 | Kind = 254 | Length = 6 | magic number = 0xACCE | b) 925 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 926 |0| SupAccECN | 927 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 929 a) Using the permanently assigned TCP option Kind 0x (TBA); b) 930 Using a Shared TCP Option Kind for Initial Experiments 932 Figure 3: Placement of the SupAccECN field within the SupAccECN TCP 933 Option on a Segment with SYN=0 935 3.3.1.2. 
The Non-Urgent Field 937 If the Urgent (URG) flag in the TCP header [RFC0793] is zero, this 938 specification experimentally renames the Urgent Pointer (bytes 19 and 939 20 counting from 1 of the TCP header) as the Non-Urgent field. If 940 URG = 1, this 16 bit field keeps its original name and definition 941 from [RFC0793] as the Urgent Pointer. Bytes 13 to 20 of the TCP 942 header when URG=0 are illustrated in Figure 4, which shows the new 943 experimental definition of the Non-Urgent Field. 945 Note that the new experimental definition of the Non-Urgent field is 946 intended for wider use than just AccECN, which is why it solely 947 depends on the URG flag and it is independent of whether AccECN has 948 been negotiated or not. 950 Section 6.2 establishes a new registry to assign values within this 951 Non-Urgent field. Section 6.2 also reserves space for a future 952 standards track AccECN specification within this field. 954 0 1 2 3 955 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 956 ... 957 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 958 | Data |Res- |N|C|E|U|A|P|R|S|F| | 959 | Offset|erved|S|W|C|R|C|S|S|Y|I| Window | 960 | | | |R|E|G|K|H|T|N|N| | 961 | | | | | |=| | | | | | | 962 | | | | | |0| | | | | | | 963 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 964 | Checksum | Non-Urgent | 965 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 966 ... 968 Figure 4: Experimental Renaming of the TCP Urgent Pointer (bytes 19 & 969 20) as the Non-Urgent field when URG=0 971 As required in Section 3.3.1, the Outgoing Protocol Handler of the 972 present AccECN specification never writes into the Non-Urgent field. 973 Nonetheless, the Incoming AccECN Protocol Handler can read the 974 SupAccECN field from within the Non-Urgent field. 976 When reading the Non-Urgent field, AccECN implementations MUST take 977 the SupAccECN field to be right-justified (i.e. 
the least significant 978 bit of SupAccECN is aligned with the least significant bit of the 979 Non-Urgent Field) as shown in Figure 5. The remaining most 980 significant bit is currently unused (CU). 982 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 983 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 984 | X | SupAccECN | 985 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 987 Figure 5: Placement of the SupAccECN field within the Non-Urgent 988 field of a segment with SYN=0 990 Forward compatibility: To comply with the present AccECN 991 specification: 993 o the Incoming Protocol Handler of an AccECN Data Sender MUST ignore 994 the remaining most significant bit in the Non-Urgent field (shown 995 as X in Figure 5 meaning "Don't care"); 997 o middleboxes MUST forward the most significant bit unaltered, 998 whether it is set to zero or not. 1000 3.3.2. Structure of the SupAccECN Field 1002 This section defines the structure of the Supplementary AccECN field 1003 (SupAccECN) for SYN/ACKs and for subsequent segments within each 1004 half-connection. There is no SupAccECN field in the initial SYN 1005 segment. 1007 The size of the SupAccECN field on a segment with SYN = 0 is always 1008 15 bits. Figure 6 shows the internal structure of the SupAccECN 1009 field on any segment with SYN = 0 including the ACK that ends the 1010 3-way handshake. 1012 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1013 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 1014 |DAC| ESQ | Top-ACE | 1015 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 1017 Figure 6: The Supplementary AccECN Field on a Segment with SYN = 0 1019 The sub-fields of SupAccECN on a segment with SYN = 0 have the 1020 following meanings: 1022 Top-ACE: Higher significant bits of the counter in ACE within the 1023 same segment (defined in Section 3.3.3). 1025 ESQ: The ECN Sequence field (defined in Section 3.3.4). 1027 DAC: Reserved for Delayed ACK Control (see Appendix B.4). 
1029 Forward Compatibility: In the meantime, the Outgoing AccECN 1030 Protocol Handler MUST set DAC to zero (0); the Incoming AccECN 1031 Protocol Handler MUST ignore this flag; and middleboxes MUST 1032 forward this flag unaltered whether or not it is zero. 1034 3.3.3. Higher Resilience Congestion Counters (Top-ACE) 1036 Four codepoints are set aside for the CI counter in the ACE field to 1037 provide reasonable resilience under expected marking and loss 1038 regimes. However, resilience against more extreme levels of CE 1039 marking, return ACK loss or ACK thinning really requires more space 1040 than the 3 bits taken from existing TCP flags for the ACE counter. 1041 At the same time, it is not necessary to deliver higher order bits 1042 with every returned segment, or even reliably at all. 1044 Therefore on segments with SYN=0, the least significant four bits of 1045 the Supplementary AccECN field are defined as the 'Top-ACE' field, as 1046 illustrated in Figure 6. Whenever an AccECN implementation encodes a 1047 counter in ACE, it MUST also encode the higher order bits of the 1048 same counter in the Top-ACE field of the same segment, using the 1049 following rules: 1051 o Top-ACE MUST be initialised to 0 at the start of each 1052 half-connection. 1054 o Whenever the CI counter (base 4) in ACE wraps, the associated 1055 Top-ACE MUST increment by 1. 1057 o Similarly, whenever the E1 counter (base 3) in ACE wraps, Top-ACE 1058 MUST increment by 1. 1060 o The NI counter in ACE is base 1, so it can hardly be called a 1061 counter. The presence of the NI counter in ACE MUST be 1062 interpreted as an indication that the associated Top-ACE field in 1063 the same segment has incremented, because Top-ACE on its own 1064 represents the NI counter. 1066 Formulae for encoding and decoding the counters CI, E1 or NI into the 1067 Top-ACE and ACE fields are given in Appendix A.3, which also includes 1068 numerical examples.
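The wrap rules above imply that Top-ACE carries the number of times the ACE counter has wrapped, itself modulo 16. The sketch below illustrates that reading; it is not a reproduction of the formulae in Appendix A.3, and the function names are hypothetical. The split of the 15-bit SupAccECN field follows Figure 6, assuming bit 0 in the figure is the most significant bit, which is consistent with Top-ACE being described as the least significant four bits.

```python
# Hypothetical sketch: combining Top-ACE with the counter value in ACE.
BASE = {"CI": 4, "E1": 3, "NI": 1}   # codepoints per counter in ACE (Table 3)
TOP_ACE_VALUES = 16                  # 4-bit Top-ACE field


def split_supaccecn(field15):
    """Split the 15-bit SupAccECN field into (DAC, ESQ, Top-ACE)."""
    dac = (field15 >> 14) & 0x1
    esq = (field15 >> 4) & 0x3FF
    top_ace = field15 & 0xF
    return dac, esq, top_ace


def encode_counter(counter, r_value):
    """Data Receiver: (Top-ACE, value signalled in ACE) for a counter."""
    base = BASE[counter]
    return (r_value // base) % TOP_ACE_VALUES, r_value % base


def decode_counter(counter, top_ace, ace_value, s_value):
    """Data Sender: advance s_value to match the signalled counter."""
    base = BASE[counter]
    span = TOP_ACE_VALUES * base             # 64 for CI, 48 for E1, 16 for NI
    signalled = top_ace * base + ace_value
    return s_value + (signalled - s_value) % span


top, low = encode_counter("CI", 70)          # receiver has seen 70 CE marks
print(top, low, decode_counter("CI", top, low, 60))
```

Under these assumptions the sender recovers any increment smaller than the span of the combined field, whereas ACE alone would only recover increments smaller than the counter's base.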
1070 The 4 bits in the Top-ACE field multiply the number of distinct 1071 codepoints for each counter by 2^4 = 16. Using Top-ACE therefore 1072 increases the numbers of distinct codepoints for each counter as 1073 follows: 1075 +---------------------+-----------------+---------------------------+ 1076 | Counter | codepoints in | codepoints in Top-ACE | 1077 | | ACE | with ACE | 1078 +---------------------+-----------------+---------------------------+ 1079 | CI (counts CE) | 4 | 16 * 4 = 64 | 1080 | E1 (counts ECT(1)) | 3 | 16 * 3 = 48 | 1081 | NI (counts Not-ECT) | 1 | 16 * 1 = 16 | 1082 +---------------------+-----------------+---------------------------+ 1084 Top-ACE hugely improves the resilience of AccECN against ambiguity of 1085 counters due to ACK loss, compared with that of ACE alone (quantified 1086 in Appendix A.1). With Top-ACE, the AccECN protocol can lose a whole 1087 string of ACKs covering up to 64 - 1 = 63 congestion indications 1088 without becoming ambiguous. Similarly AccECN is robust to losing a 1089 whole string of ACKs covering 47 ECT(1) markings or 15 Not-ECT 1090 markings. If, for example, about 1 in 100 data packets were marked 1091 with a CE codepoint on the forward path, all the ACKs covering about 1092 100 * 63 = 6,300 segments would have to be missing from the reverse 1093 path before AccECN would become ambiguous. If just one of these ACKs 1094 got through, it would resolve any ambiguity. 1096 3.3.4. Accurate ECN Sequence within Delayed ACKs 1098 Given each delayed ACK can cover multiple segments, a Data Receiver 1099 needs to describe the order in which the ECN codepoints arrived. 1100 AccECN uses a 10-bit ECN Sequence (ESQ) field to encode this 1101 ordering. This section explains the encoding. An example encoding 1102 algorithm in pseudocode is given in Appendix A.4. Implementations 1103 MAY develop their own encoding algorithm as long as it complies with 1104 the requirements in this section. 
1106 Once the TCP 3-way handshake has completed, an AccECN Data Receiver 1107 can defer an ACK until one of the following three tests fails, that is, until one of these conditions becomes true: 1109 1. The number of deferred bytes exceeds a configured limit 1110 (currently two full-sized segments [RFC5681]); 1112 2. The longest time for which an ACK has been delayed exceeds a 1113 configured limit (currently 500 ms [RFC5681]); 1115 3. The sequence of ECN codepoints has become too complex to encode 1116 in the fixed 10 bits available. 1118 AccECN can encode the order of a sequence of up to 15 ECN codepoints 1119 in one ACK. The ACE field in the ACK always encodes the ECN 1120 codepoint of the latest packet to arrive. Using the ESQ field of the 1121 same ACK, the Outgoing AccECN Protocol Handler can encode the order 1122 of arrival of up to 14 ECN codepoints that arrived before this, 1123 making a maximum coverage of 15 packets. 1125 The encoding of the ESQ field is optimised for a selection of simple 1126 sequences that are expected to be common. Even if the first two 1127 tests pass, if a more complex sequence occurs, the third test above 1128 will fail so the Data Receiver will be forced to send an ACK earlier 1129 than it would have otherwise. The most complex sequence that AccECN 1130 can encode is a run of 'spaces' (SP) ending in one 'mark' (MK1), then 1131 another run of 'spaces', followed by a 'mark' that might be different 1132 from the first (MK2). 1134 The internal structure of the 10-bit Accurate ECN Sequence (ESQ) 1135 field is shown in Figure 7.
1137 0 1 2 3 4 5 6 7 8 9 1138 +---+---+---+---+---+---+---+---+---+---+ 1139 | RL1 | RL2 | SP | MK1 | 1140 +---+---+---+---+---+---+---+---+---+---+ 1142 Figure 7: Internal Structure of the Accurate ECN Sequence (ESQ) Field 1143 The sub-fields of ESQ have the following meanings: 1145 RL1: Run-Length #1: a 3-bit field giving the length of a first run 1146 consisting of spaces (SP) ending in one mark (MK1), which is 1147 included in the length of the run; 1149 RL2: Run-Length #2: another 3-bit field giving the length of a 1150 second run of spaces (SP). There is no mark included in this run; 1152 SP: Space: The 2-bit ECN codepoint defined as a space, for the 1153 present ACK only; 1155 MK1: Mark #1: The 2-bit ECN codepoint defined as the first mark, for 1156 the present ACK only. 1158 The Incoming Protocol Handler can always determine the second mark 1159 (MK2) from the counter that the Data Receiver uses in the ACE field, 1160 which has to be the counter associated with the last ECN codepoint to 1161 have arrived (according to the rules in Section 3.2.3). Even though 1162 there is no counter associated with ECT(0), the Incoming Protocol 1163 Handler can tell if the last codepoint to arrive was ECT(0), because 1164 the counter used in ACE will not have changed relative to the 1165 previous packet. 1167 Figure 8 gives example sequences of ECN codepoints and illustrates 1168 how the Data Receiver encodes them. The sequences use the single- 1169 character abbreviations in Table 1 for each ECN codepoint. The last 1170 codepoint to arrive is shown on the right. 
1172        ,----- RL1 = 6 ------>  ,--- RL2=4 -->
1173    a)  0   0   0   0   0   C   0   0   0   0   1
1174        SP  SP  SP  SP  SP  MK1 SP  SP  SP  SP  MK2

1176        ,--- RL1=4 -->  (RL2 = 0)
1177    b)  C   C   C   0   0
1178        SP  SP  SP  MK1 MK2

1180        ,--------- RL1 = 7 ------>  ,--------- RL2 = 7 ------>
1181    c)  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1182        SP  SP  SP  SP  SP  SP  MK1 SP  SP  SP  SP  SP  SP  SP  MK2

1184        RL1=1 ,>  ,--- RL2=4 -->
1185    d)        C   0   0   0   0   C
1186              MK1 SP  SP  SP  SP  MK2

1188        RL2=1 ,>  (RL1 = 0)
1189    e)        N   N
1190              SP  MK2

1192    Figure 8: Example Encodings of Sequences of ECN Codepoints in the
1193    ESQ Field

1195    The examples should be self-explanatory, but the following points
1196    might help:

1198    o  The term 'mark' does not have to mean an 'ECN mark'.  In (a) the
1199       'spaces' are defined as ECT(0) and the first 'mark' is defined as
1200       CE.  However, in (b) it is more efficient to define CE as the
1201       'space' and ECT(0) as the first 'mark';

1203    o  A mark is defined to mean just one codepoint, so two marks in a
1204       row have to be encoded as two different marks, even if they are
1205       the same codepoint (b).  The first and second marks can be defined
1206       as different (a) or the same (b or c);

1208    o  For a long run of the same codepoint, the first mark can be
1209       defined to be the same as a space, and if necessary the second
1210       mark can be the same as well (c);

1212    o  The first run (if non-zero length) always ends in one mark.  So,
1213       if its run-length is 1, it contains a mark but no spaces (d);

1215    o  Either run-length might be zero (b & e), but MK2 will always be
1216       present.  If the first run-length is zero, the definition of MK1
1217       is redundant (e).  If both run-lengths are zero, the definition of
1218       SP would be redundant as well.
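The examples in Figure 8 can be reproduced with a short sketch of how an Incoming Protocol Handler might expand the ESQ field back into an ordered list of codepoints. This is an illustration only, not the algorithm of Appendix A.4. It assumes that bit 0 of Figure 7 is the most significant bit of ESQ, that the single-character abbreviations are N = Not-ECT, 1 = ECT(1), 0 = ECT(0) and C = CE, and that MK2 has already been inferred from the ACE field; the function names are hypothetical.

```python
# Hypothetical sketch of packing and expanding the 10-bit ESQ field.
# IP-ECN codepoint values are those of RFC 3168.
ECN = {"N": 0b00, "1": 0b01, "0": 0b10, "C": 0b11}
ABBREV = {bits: name for name, bits in ECN.items()}


def pack_esq(rl1, rl2, sp, mk1):
    """Build ESQ from RL1 (3 bits), RL2 (3 bits), SP and MK1 (2 bits each)."""
    if not (0 <= rl1 <= 7 and 0 <= rl2 <= 7):
        raise ValueError("run-lengths must fit in 3 bits")
    return (rl1 << 7) | (rl2 << 4) | (ECN[sp] << 2) | ECN[mk1]


def unpack_esq(esq):
    """Return (RL1, RL2, SP, MK1) from a 10-bit ESQ value."""
    return (esq >> 7) & 0x7, (esq >> 4) & 0x7, ABBREV[(esq >> 2) & 0x3], ABBREV[esq & 0x3]


def expand_esq(esq, mk2):
    """Ordered codepoints covered by one ACK, oldest first, ending in MK2."""
    rl1, rl2, sp, mk1 = unpack_esq(esq)
    # The first run, if present, is (RL1 - 1) spaces followed by MK1.
    first_run = [sp] * (rl1 - 1) + [mk1] if rl1 else []
    # The second run is RL2 spaces with no mark; MK2 is always present.
    return first_run + [sp] * rl2 + [mk2]


# Example (a) of Figure 8: RL1 = 6, RL2 = 4, SP = ECT(0), MK1 = CE, MK2 = ECT(1)
print(" ".join(expand_esq(pack_esq(6, 4, "0", "C"), "1")))
```

Because RL1 and RL2 are each limited to 7, the longest sequence this expands to is 7 + 7 + 1 = 15 codepoints, matching example (c).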
1220 The following normative statements govern an implementation of an 1221 AccECN Data Receiver when it defers an ACK: 1223 o The Outgoing Protocol Handler MUST NOT encode the last packet to 1224 be acknowledged into the ESQ field; 1226 o If the Outgoing Protocol Handler cannot encode the last ECN 1227 codepoint to arrive in the ESQ field, it MUST send an ACK 1228 immediately; 1230 o The Outgoing Protocol Handler MUST NOT include a codepoint in the 1231 sequence of codepoints in an ACK that is from any packet already 1232 reported in another ACK; 1234 o If RL1=0, the Outgoing Protocol Handler MUST set MK1 = ECT(0) = 1235 0b10, even though the value of MK1 seems redundant. 1237 o If RL2=0 and RL1=<1, the Outgoing Protocol Handler MUST set SP = 1238 ECT(0) = 0b10, even though the value of SP seems redundant. 1240 The last two rules ensure that the value of ESQ as a whole is never 1241 all-zeros, which allows the Incoming Protocol Handler to detect 1242 interference by middleboxes (see Section 3.6). 1244 The following normative statements govern an implementation of an 1245 AccECN Data Sender: 1247 o The Incoming AccECN Protocol Handler MUST increment the congestion 1248 codepoint counters (other than the one associated with the ACE 1249 field) by counting the codepoints as it decodes the ESQ field; 1251 o If the Incoming AccECN Protocol Handler finds that the value of a 1252 congestion counter calculated using ESQ would be more than that 1253 calculated using Top-ACE/ACE, it SHOULD use the higher of the two 1254 calculations. 1256 o If the Incoming AccECN Protocol Handler finds that the value of a 1257 congestion counter calculated using ESQ would be less than that 1258 calculated using Top-ACE/ACE, it SHOULD use the higher of the two 1259 calculations. 
An example of an exception to this rule would be 1260 where the Incoming Protocol Handler had previously conservatively 1261 assumed counter wrap, but then missing ACKs arriving later filled 1262 the gap in the sequence feedback. 1264 o While the Incoming AccECN Protocol Handler is calculating the 1265 value of a congestion counter using Top-ACE/ACE, if it finds that 1266 the value calculated using ESQ in a previous segment is already 1267 higher, it SHOULD use the lower value calculated using ACE/Top- 1268 ACE. It SHOULD also consider the SupAccECN field in subsequent 1269 segments as suspect {ToDo: suggest what concrete action this 1270 implies}. 1272 Forward Compatibility: 1274 o if RL1=0: 1276 * the Incoming Protocol Handler MUST ignore the value in MK1; 1278 * middleboxes MUST forward the value in MK1 unaltered (whether or 1279 not it is 0b10 as it ought to be). 1281 o if RL2=0 and RL1=<1: 1283 * the Incoming Protocol Handler MUST ignore the value in SP; 1285 * middleboxes MUST forward the value in SP unaltered (whether or 1286 not it is 0b10 as it ought to be). 1288 3.3.5. AccECN Feedback Integrity 1290 The ECN Nonce [RFC3540] is an experimental IETF specification 1291 intended to allow a sender to test whether ECN CE markings (or 1292 losses) are being suppressed by the receiver (or anywhere else in the 1293 feedback loop, such as another network or a middlebox). The ECN 1294 nonce has not been deployed as far as can be ascertained. The nonce 1295 would now be nearly impossible to deploy retrospectively, because to 1296 catch a misbehaving receiver it relies on the receiver volunteering 1297 feedback information to incriminate itself. A receiver that has been 1298 modified to misbehave can simply claim that it does not support nonce 1299 feedback, which will seem unremarkable given so many other hosts do 1300 not support it either. 1302 With minor changes AccECN could be optimised for the possibility that 1303 the ECT(1) codepoint might be used as a nonce. 
However, given the 1304 nonce is now probably undeployable, the AccECN design has been 1305 generalised so that it ought to be able to support other possible 1306 uses of the ECT(1) codepoint, such as a lower severity or a more 1307 instant congestion signal than CE. 1309 Three alternative mechanisms are available to assure the integrity of 1310 ECN and/or loss signals. AccECN is compatible with any of these 1311 approaches: 1313 o The Data Sender can test the integrity of the receiver's ECN (or 1314 loss) feedback by occasionally setting the IP-ECN field to a value 1315 normally only set by the network (and/or deliberately leaving a 1316 sequence number gap). Then it can test whether the Data 1317 Receiver's feedback faithfully reports what it expects 1318 [I-D.moncaster-tcpm-rcv-cheat]. Unlike the ECN Nonce, this 1319 approach does not waste the ECT(1) codepoint in the IP header, it 1320 does not require standardisation and it does not rely on 1321 misbehaving receivers volunteering to reveal feedback information 1322 that allows them to be detected. 1324 o Networks generate congestion signals when they are becoming 1325 congested, so they are more likely than Data Senders to be 1326 concerned about the integrity of the receiver's feedback of these 1327 signals. A network can enforce a congestion response to its ECN 1328 markings (or packet losses) using congestion exposure (ConEx) 1329 audit [I-D.ietf-conex-abstract-mech]. Whether the receiver or a 1330 downstream network is suppressing congestion feedback or the 1331 sender is unresponsive to the feedback, or both, ConEx audit can 1332 neutralise any advantage that any of these three parties would 1333 otherwise gain. 1335 ConEx is a change to the Data Sender that is most useful when 1336 combined with AccECN. Without AccECN, the ConEx behaviour of a 1337 Data Sender would have to be more conservative than would be 1338 necessary if it had the accurate feedback of AccECN. 
   o  The TCP authentication option (TCP-AO [RFC5925]) can be used to
      detect any tampering with AccECN feedback between the Data
      Receiver and the Data Sender.  Although this section of the
      feedback loop is the least likely to come under malicious attack,
      it is increasingly likely to be tampered with accidentally by
      middleboxes intervening at layer 4.  The AccECN fields are
      immutable end-to-end, so whether placed in the Non-Urgent field or
      a TCP option, they are amenable to default TCP-AO protection (but
      not if TCP-AO protection of TCP options is turned off, which is
      non-default but might be necessary for other reasons).

3.4.  Accurate ECN Receiver Operation

   A TCP receiver MUST only feed back ECN information arriving in a
   segment that it deems to be part of the flow, by using regular TCP
   techniques based on sequence numbers.

   {ToDo: It might be useful to describe the receiver end of the
   feedback process, including special cases, e.g. pure ACKs,
   retransmissions, window probes, partial ACKs, etc.  Does AccECN feed
   back each ECN codepoint when a data packet is duplicated?}

3.5.  Accurate ECN Sender Operation

   A TCP sender MUST only accept ECN feedback on ACKs that it deems to
   be part of the flow, by using regular TCP techniques based on
   sequence numbers.

   {ToDo: It might be useful to describe the sender end of the feedback
   process, including special cases, e.g. pure ACKs, retransmissions,
   window probes, partial ACKs, etc.}

3.6.  Detection of Legacy Middlebox Interference

   The definition of the SupAccECN field has been contrived so that the
   value all-zeros is undefined.  Therefore, an Outgoing AccECN Protocol
   Handler MUST NOT ever set the value of SupAccECN to all-zeros.

   Correspondingly, the Incoming AccECN Protocol Handler MUST check that
   the value of ESQ is non-zero (on a segment with SYN=0).
If the Incoming 1380 Protocol Handler detects all-zeros in either of these fields on any 1381 segment, it MUST ignore the whole SupAccECN field on that segment, 1382 and it SHOULD ignore the SupAccECN field on all subsequent segments 1383 in the same half-connection or at least treat each with greater 1384 suspicion. 1386 If a Data Sender ignores the incoming SupAccECN field, it MUST revert 1387 to the conservative behaviour needed when only the essential part of 1388 the AccECN protocol is available, as described in Section 3.2.2. 1389 Nonetheless, the Outgoing AccECN Protocol Handler of the same Data 1390 Sender MUST continue to set the SupAccECN field as normal 1391 (Section 3.3), because any interference might be only in one 1392 direction. The AccECN protocol does not include any requirement for 1393 a Data Sender that detects interference to notify the other end, 1394 because the complexity required to assure message integrity in the 1395 face of interference is not warranted. 1397 3.7. Correct Middlebox Operation 1399 A large class of middleboxes split TCP connections, acting as the 1400 receiver for one connection and the sender for another, passing data 1401 between the two, usually via a buffer. Network interface hardware to 1402 offload certain TCP processing represents another large class of 1403 middleboxes, even though it is rarely in its own 'box'. 1405 To comply with this specification, each side of such a middlebox MUST 1406 comply with the AccECN requirements applicable to a responding host 1407 or an originating host during capability negotiation (Section 3.1) 1408 and the required AccECN behaviours as a Data Receiver or as a Data 1409 Sender throughout this specification. 1411 Another class of middleboxes attempts to 'normalise' the TCP wire 1412 protocol by checking that all values in header fields comply with a 1413 rather narrow interpretation of the TCP specifications. 
To comply 1414 with this specification, such middleboxes MUST be updated to 1415 recognise and forward values in fields that comply with the newly 1416 defined semantics of AccECN. This includes the explicitly stated 1417 requirements to forward Reserved (Rsvd) and Currently Unused (CU) 1418 values unaltered. An 'ideal' TCP normaliser would not have to change 1419 to accommodate AccECN, because AccECN does not directly contravene 1420 any existing TCP specifications, even though it uses existing TCP 1421 fields in unorthodox ways. 1423 4. Interaction with Other TCP Variants 1425 4.1. Compatibility with SYN Cookies 1427 A server can use SYN Cookies (see Appendix A of [RFC4987]) to protect 1428 itself from SYN flooding attacks. It places minimal commonly used 1429 connection state in the SYN/ACK, and deliberately does not hold any 1430 additional state while waiting for the subsequent ACK. Therefore it 1431 cannot record the fact that it entered AccECN mode for both half- 1432 connections. Indeed, it cannot even remember whether it negotiated 1433 the use of classic ECN [RFC3168]. 1435 If the server (host B) receives the final ACK of the 3-way handshake 1436 with a SupAccECN TCP option, it can infer that the originating host 1437 (A) supports AccECN. If host B supports AccECN itself, it can 1438 further infer that it would have entered AccECN mode before sending 1439 the SYN/ACK. 1441 If, on the other hand, the originating host (A) sends the final ACK 1442 of the 3-way handshake with the SupAccECN field in the Non-Urgent 1443 field, responding host B can still infer that host A originally 1444 negotiated AccECN, by checking the fourteen least significant bits of 1445 the Non-Urgent field and the ACE field, as follows: 1447 o Host B knows that host A would not defer the final ACK of the 1448 3-way handshake, because TCP never delays this. 
1450 o Therefore, if host B sends the SYN/ACK with its IP-ECN field set 1451 to ECT(0) [RFC5562], then checks the fourteen least significant 1452 bits of the Non-Urgent field of the final ACK of the 3-way 1453 handshake, it can make the following inferences: 1455 1. lsb(Non-Urgent) == 000010100000 && ACE == 000 implies host A 1456 is AccECN and the SYN/ACK arrived unchanged as ECT(0); 1458 2. lsb(Non-Urgent) == 000010100000 && ACE == 001 implies host A 1459 is AccECN and the SYN-ACK was CE-marked; 1461 3. lsb(Non-Urgent) == 000010100001 && ACE == 111 implies host A 1462 is AccECN and the IP-ECN field of the SYN/ACK was zeroed; 1464 4. lsb(Non-Urgent) == 000000000000 or any value other than those 1465 above implies host A is Not AccECN or a middlebox is 1466 interfering with the Non-Urgent field. 1468 o If, on the other hand, host B sends the SYN/ACK with its IP-ECN 1469 field set to Not-ECT, then checks the fourteen least significant 1470 bits of the Non-Urgent field of the final ACK of the 3-way 1471 handshake, it can make the following inferences: 1473 1. lsb(Non-Urgent) == 000010100001 && ACE == 111 implies host A 1474 is AccECN; 1476 2. lsb(Non-Urgent) == 000000000000 or any value other than that 1477 above implies host A is Not AccECN or a middlebox is 1478 interfering with the Non-Urgent field. 1480 4.2. Compatibility with Other Options and Experiments 1482 AccECN is compatible (at least on paper) with the most commonly used 1483 TCP options: MSS, time-stamp, window scaling, SACK and TCP-AO. It is 1484 also compatible with the recent promising experimental TCP options 1485 TCP Fast Open (TFO [I-D.ietf-tcpm-fastopen]) and Multipath TCP (MPTCP 1486 [RFC6824]). AccECN is particularly friendly to all these protocols, 1487 because space for TCP options is particularly scarce on the SYN, 1488 where AccECN consumes zero additional header space. 1490 5. Protocol Properties 1492 This section is informative not normative. 
   It describes how well the protocol satisfies the agreed requirements
   for a more accurate ECN feedback protocol
   [I-D.ietf-tcpm-accecn-reqs].

   Accuracy:  From each ACK, the Data Sender can infer the number of new
      Not-ECT, ECT(0), ECT(1) and CE markings since the previous ACK.

   Accuracy:  The Data Receiver can feed back to the Data Sender a list
      of the order of the IP-ECN markings covered by each delayed ACK.

   Overhead:  The AccECN scheme is divided into two parts.  The
      essential part reuses the 3 flags already assigned to ECN in the
      TCP header.  The supplementary part requires fifteen bits.

   Overhead:  Two alternative locations for the supplementary protocol
      field are proposed:

      1.  In the 16-bit Urgent Pointer when URG=0.  This specification
          reserves 15 bits of this space, but while the specification is
          only experimental it refrains from using this space in the
          main TCP header.  If AccECN progresses to the standards track
          and uses these 15 bits, it will require zero additional
          overhead, because it will overload a field that already takes
          up space in every TCP header.

      2.  In a TCP option.  This takes up 4 bytes: the fifteen bits have
          to be rounded up to 2 bytes, plus 2 bytes for the TCP option
          Kind and Length.

   Timeliness:  In the absence of lost ACKs, no feedback is deferred to
      a future ACK, which is intended to enable latency-sensitive uses
      of ECN feedback.

   Timeliness:  {ToDo: Add improved timeliness if the Delayed ACK
      Control (DAC) feature is included.}

   Resilience:  Each ACK includes a counter of one of the ECN congestion
      signals.  If ACKs are lost, the counter on the first ACK following
      the losses allows the Data Sender to immediately recover the
      number of one of the ECN markings that it missed.

   Resilience:  Subsequent ACKs will allow it to recover the number of
      other ECN markings that it missed.
1536 Resilience against Bias: Undetected ACK loss is as likely to 1537 decrease as increase congestion signals detected by the Data 1538 Sender. 1540 Resilience against Bias: However, if the supplementary part is 1541 unavailable, the required conservative decoding of feedback during 1542 ACK loss is more likely to increase perceived congestion signals, 1543 which would otherwise be more likely to be under-reported. 1545 Timeliness vs Overhead: For efficiency, each delayed ACK only 1546 includes one of the counters at a time, therefore recovery of the 1547 count of the other signals might not be immediate if an ACK is 1548 lost that covers more than one signal. The receiver cannot 1549 predict which ACKs might get lost, if any. Therefore it repeats 1550 the count of each signal roughly in proportion to how often each 1551 signal changes. 1553 Ordering: The order of arriving ECN codepoints is communicated in a 1554 10-bit field in the supplementary part; 1556 Resilience vs. Ordering: Following an ACK loss, only a count of the 1557 lost ECN signals is recovered, not their order of arrival over the 1558 sequence covered by the loss. 1560 Ordering vs. Overhead: The encoding is tailored for sequences of ECN 1561 codepoints expected to be typical. It can encode sequences of up 1562 to 15 segments but, if the pattern of arrivals becomes too 1563 complex, the protocol forces the Data Receiver to emit an ACK. 
1564 The protocol can always encode any sequence of 3 segments in one 1565 delayed ACK; 1567 Ordering, Timeliness and Resilience: If one delayed ACK covers 1568 changes to more than one congestion counter the supplementary 1569 sequence information provides more timely congestion feedback than 1570 waiting for the other congestion counters on future ACKs, and it 1571 provides resilience against the possibility of those future ACKs 1572 going missing; 1574 Complexity: {ToDo: Once implemented, quantify the code complexity} 1576 Integrity: AccECN is compatible with complementary protocols that 1577 assure the integrity of ECN feedback. 1579 Backward Compatibility: If only one endpoint supports the AccECN 1580 scheme, it will fall-back to the most advanced ECN feedback scheme 1581 supported by the other end. 1583 Backward Compatibility: Each endpoint can detect normalisation of 1584 the Supplementary AccECN field by middleboxes at any time during a 1585 connection. It could then fall-back to the essential part using 1586 only the fewer but safer bits in the TCP header. 1588 Forward Compatibility: The behaviour of endpoints and middleboxes is 1589 carefully defined for all reserved or currently unused codepoints 1590 in the scheme, to ensure that any blocking of anomalous values is 1591 always at least under reversible policy control. 1593 6. IANA Considerations 1595 6.1. SupAccECN TCP Option Allocation 1597 This specification requires IANA to allocate one value from the TCP 1598 option Kind name-space, against the name "Supplementary Accurate ECN" 1599 (SupAccECN). 1601 Early implementation before the IANA allocation MUST follow [RFC6994] 1602 and use experimental option 254 and magic number 0xACCE (16 bits) 1603 {ToDo register this with IANA}, then migrate to the new option after 1604 the allocation. 1606 6.2. Non-Urgent Field Registry 1608 This specification requests that IANA sets up a new TCP parameters 1609 registry in accordance with [RFC5226]. 
This registry enables future 1610 standards track RFCs to assign values to sub-fields of the TCP Non- 1611 Urgent field defined in Section 3.3.1.2. 1613 Name of registry: Non-Urgent field. 1615 Information required for assignments: 1617 * Width and position of sub-field or sub-fields, 1619 * Assignment of values to sub-field(s), 1621 * Confirmation of compliance with additional conditions 1 & 2 1622 below. 1624 Review Process: Standards Action - Values to be assigned for 1625 Standards Track RFCs approved by the IESG. At the IESG's 1626 discretion, values MAY be assigned for Standards Track RFCs still 1627 in the process of approval, in order to resolve the catch-22 where 1628 the assignment needs deployment testing but deployment testing 1629 needs the assignment. 1631 Size, format and syntax of registry entries: Binary values of sub- 1632 fields. 1634 Initial assignments and reservations: This specification reserves 1635 the 15 least significant bits of the Non-Urgent field for use by a 1636 potential future standards action that might define the AccECN 1637 scheme for the standards track. 1639 Additional conditions for assignment: 1641 1. Assignments within the Non-Urgent field MUST be used by a 1642 protocol that is robust to the field being unavailable 1643 occasionally. This is because the Non-Urgent field is unusable 1644 and undefined on segments with URG = 1 in the TCP header 1645 [RFC0793]. The Non-Urgent field overloads the meaning of the 1646 16-bit Urgent Pointer only when URG = 0. 1648 2. The value zero, i.e. all 16 bits of the Non-Urgent field cleared 1649 to zero, SHOULD be undefined, because it is known that certain 1650 'normalising' middleboxes overzealously zero the urgent pointer 1651 when URG = 0. An undefined zero value can be achieved by 1652 requiring that the value all-zeros is undefined for at least one 1653 sub-field of the Non-Urgent field. 
Then even if the value all- 1654 zeros is defined and used in other sub-fields, the value all- 1655 zeros for the whole field will be undefined. 1657 7. Security Considerations 1659 If ever the supplementary part of AccECN is unusable (due for example 1660 to middlebox interference) the essential part of AccECN's congestion 1661 feedback offers only limited resilience to long runs of ACK loss (see 1662 Section 3.2.2). These problems are unlikely to be due to malicious 1663 intervention (because if an attacker could discard a long run of ACKs 1664 it could wreak other arbitrary havoc). However, it would be of 1665 concern if AccECN's resilience could be indirectly compromised during 1666 a flooding attack. AccECN is still considered safe though, because 1667 an AccECN Data Sender can detect when the supplementary part is 1668 unusable, and it is then required to switch to more conservative 1669 assumptions about wrap of congestion indication counters (see 1670 Section 3.2.2 and Appendix A.1). 1672 AccECN does not signal the ordering of ECN codepoints covered by a 1673 delayed ACK reliably, i.e. if one delayed ACK is lost, the ECN 1674 sequence information in that ACK is not retransmitted. The design of 1675 AccECN assumes gaps in this information will not be critical, and 1676 that this information is unlikely to be security-sensitive. However, 1677 this point is mentioned for completeness. 1679 The SYN cookie method for mitigating SYN flooding attacks is not 1680 generally compatible with enhancements to the TCP 3-way handshake. 1681 Nonetheless, Section 4.1 describes how a server can negotiate AccECN 1682 and use SYN cookies. 1684 AccECN is compatible with all the known schemes that ensure the 1685 integrity of ECN feedback (see Section 3.3.5 for details). Given the 1686 experimental ECN nonce is now probably undeployable, AccECN has been 1687 generalised for other possible uses of the ECT(1) codepoint to avoid 1688 any risk of obsolescence. 1690 8. 
Acknowledgements

   We want to thank Michael Welzl for his input and discussion.  The
   idea of using the three ECN-related TCP flags as one field for more
   accurate TCP-ECN feedback was first introduced in the re-ECN protocol
   that was the ancestor of ConEx.

   Bob Briscoe was part-funded by the European Community under its
   Seventh Framework Programme through the Reducing Internet Transport
   Latency (RITE) project (ICT-317700) and through the Trilogy 2 project
   (ICT-317756).  The views expressed here are solely those of the
   authors.

9.  Comments Solicited

   Comments and questions are encouraged and very welcome.  They can be
   addressed to the IETF TCP maintenance and minor modifications working
   group mailing list, and/or to the authors.

10.  References

10.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
              793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP", RFC
              3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options", RFC
              6994, August 2013.

10.2.  Informative References

   [I-D.bensley-tcpm-dctcp]
              Bensley, S., Eggert, L., and D. Thaler, "Microsoft's
              Datacenter TCP (DCTCP): TCP Congestion Control for
              Datacenters", draft-bensley-tcpm-dctcp-01 (work in
              progress), June 2014.

   [I-D.ietf-conex-abstract-mech]
              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts and Abstract Mechanism", draft-ietf-conex-
              abstract-mech-11 (work in progress), March 2014.

   [I-D.ietf-tcpm-accecn-reqs]
              Kuehlewind, M., Scheffenegger, R., and B.
Briscoe, 1744 "Problem Statement and Requirements for a More Accurate 1745 ECN Feedback", draft-ietf-tcpm-accecn-reqs-05 (work in 1746 progress), February 2014. 1748 [I-D.ietf-tcpm-fastopen] 1749 Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 1750 Fast Open", draft-ietf-tcpm-fastopen-09 (work in 1751 progress), July 2014. 1753 [I-D.kuehlewind-tcpm-ecn-fallback] 1754 Kuehlewind, M. and B. Trammell, "A Mechanism for ECN Path 1755 Probing and Fallback", draft-kuehlewind-tcpm-ecn- 1756 fallback-01 (work in progress), September 2013. 1758 [I-D.moncaster-tcpm-rcv-cheat] 1759 Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to 1760 Allow Senders to Identify Receiver Non-Compliance", draft- 1761 moncaster-tcpm-rcv-cheat-02 (work in progress), November 1762 2007. 1764 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1765 Selective Acknowledgment Options", RFC 2018, October 1996. 1767 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 1768 Congestion Notification (ECN) Signaling with Nonces", RFC 1769 3540, June 2003. 1771 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 1772 Mitigations", RFC 4987, August 2007. 1774 [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an 1775 IANA Considerations Section in RFCs", BCP 26, RFC 5226, 1776 May 2008. 1778 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 1779 Ramakrishnan, "Adding Explicit Congestion Notification 1780 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 1781 2009. 1783 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 1784 Authentication Option", RFC 5925, June 2010. 1786 [RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, 1787 "TCP Extensions for Multipath Operation with Multiple 1788 Addresses", RFC 6824, January 2013. 1790 Appendix A. Example Algorithms 1792 This appendix is informative, not normative. It gives examples in 1793 pseudocode for the various algorithms used by AccECN. 1795 A.1. 
Example Algorithm for Safety Against Long Sequences of ACK Loss

   This appendix gives an example algorithm that a Data Sender can use
   to heuristically detect an unbroken string of ACK losses long enough
   to have concealed wrap of the congestion counter in the ACE field of
   the next ACK to arrive.  The Data Sender is unlikely to need to run
   an algorithm like this unless it detects that supplementary AccECN
   feedback is not available (see Section 3.2.2 and Section 3.6).

   It is assumed that the focus is solely safety, not complete protocol
   precision.  Therefore, this example solely detects possible wrap of
   the congestion indication (CI) counter, not E1 or NI.  This is on the
   assumption that, even if ECT(1) is redefined to indicate congestion
   in some way, ECN CE markings will always indicate more severe
   congestion.  It is also assumed that numerous Not-ECT markings imply
   middlebox tampering, which only needs to be detected, not quantified
   perfectly.

   If the supplementary Top-ACE field cannot be used, there is only room
   for 4 values of the congestion indication (CI) counter in the ACE
   field.  The CI counter in an arriving ACK could have wrapped and
   become ambiguous to the Data Sender if a row of ACKs goes missing
   that covers a stream of data long enough to contain 4 or more CE
   marks.  We use the word missing rather than lost, because some or all
   of the missing ACKs might arrive eventually, but out of order.  Even
   if some of the lost ACKs are piggy-backed on data (i.e. not pure
   ACKs), retransmissions will not repair the lost AccECN information,
   because AccECN requires retransmissions to carry the latest AccECN
   counters, not the original ones (Section 3.2.3).

   If the CE marking probability were p on the forward data path,
   ambiguity would arise if every ACK on the reverse path went missing
   over a run that covered at least 4/p packets.
   For example, if p was 5% on the forward path, ambiguity would ensue
   if simultaneously on the reverse path a sequence of ACKs covering
   4/0.05 = 80 packets all went missing.  With a delayed ACK ratio of 2
   that translates to missing 40 ACKs in a row.  Obviously, missing ACKs
   would be far less likely if pure ACKs were allowed to be ECN-capable.
   However, because RFC 3168 currently precludes this, we will assume
   that pure ACKs are not ECN-capable.

   To protect against such an unlikely event, Section 3.2.2 requires the
   Incoming Protocol Handler to assume that the CI field did wrap if it
   could have wrapped under prevailing conditions.  It could be
   extremely conservative and assume that ECN marking suddenly jumped to
   100% on the forward path just when there were no ACKs on the reverse
   path to detect it.

   Specifically, if the Incoming Protocol Handler receives an ACK with
   an acknowledgement number that acknowledges L full-sized segments
   since the previous ACK, it could conservatively assume that the CI
   field incremented by

      D' = L - ((L-D) % 4),

   where D is the apparent increase in the CI field.  This would still
   be safe even if segments were only 5% of full size, as long as ECN
   marking was 5% or less rather than 100%.

   For example, imagine an ACK acknowledges 5 more full-sized segments
   than any previous ACK, and that it apparently increases CI by 2.  The
   above formula works out that a safe increment of CI would still be 2
   (because 5 - ((5-2) % 4) = 2).  However, if CI apparently increases
   by 2 but the ACK acknowledges 11 more full-sized segments, then CI
   should be assumed to have increased by 10 (because
   11 - ((11-2) % 4) = 10).

   Implementers could build in more heuristics to estimate prevailing
   segment sizes and prevailing ECN marking.
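   As an informal illustration of the formula for D' given above, the
   following Python sketch computes the conservative increment.  The
   function and argument names are hypothetical, and the two range
   checks are assumptions of this sketch (that the ACE field can show
   an apparent increase of 0 to 3 only, and that each acknowledged
   segment carries at most one CE mark); they are not requirements of
   the specification.

```python
def conservative_ci_increment(acked_segments: int, apparent_increase: int) -> int:
    """Return D' = L - ((L - D) % 4), the safe assumed increase in CI.

    acked_segments    (L): full-sized segments newly acknowledged by this ACK
    apparent_increase (D): increase in CI read directly from the ACE field
    """
    # Assumption of this sketch: ACE alone distinguishes 4 values of CI.
    if not 0 <= apparent_increase <= 3:
        raise ValueError("apparent increase must be in the range 0..3")
    # Assumption of this sketch: at most one CE mark per acknowledged segment.
    if acked_segments < apparent_increase:
        raise ValueError("apparent increase exceeds acknowledged segments")
    return acked_segments - ((acked_segments - apparent_increase) % 4)


if __name__ == "__main__":
    # The two worked examples from the text.
    print(conservative_ci_increment(5, 2))
    print(conservative_ci_increment(11, 2))
```

   The result is always the largest value not exceeding L that is
   congruent to D modulo 4, which is why it is safe even if marking
   jumped to 100% while ACKs were missing.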
   For instance, L in the formula for D' could be replaced with
   L' = L*p*M/s, where M is the MSS, s is the prevailing segment size
   and p is the prevailing ECN marking probability.  However,
   ultimately, if TCP's ECN feedback becomes inaccurate it still has
   loss detection to fall back on.  Therefore, it would seem safe to
   implement a simple algorithm like that given initially, rather than a
   perfect one.

   If missing acknowledgement numbers arrive later (due to reordering),
   Section 3.2.2 says "the Data Sender MAY attempt to neutralise the
   effect of any action it took based on a conservative assumption that
   it later found to be incorrect".  To do this, the Data Sender would
   have to store the values of all the relevant variables whenever it
   made assumptions, so that it could re-evaluate them later.  Given
   this could become complex and it is not required, we do not attempt
   to provide an example of how to do this.

A.2.  Example Counter Selection Algorithms

   When the Data Receiver sends an ACK, if the last IP-ECN field that
   arrived was ECT(0), Section 3.2.3 says, "...the Data Receiver can
   signal either the CI or the E1 counter.  The choice of which to
   signal SHOULD be based on the principle that the more one counter has
   changed recently the more it SHOULD be signalled."  A couple of
   alternative algorithms are suggested below that would satisfy this
   requirement.

A.2.1.  Counter Selection Algorithm Alt#1

   Counter selection algorithm Alt#1 repeats whichever counter has been
   repeated proportionately less often, relative to how often it has
   changed, with preference for CI if they tie.  Or in pseudocode:

      if ( (e1 / r_e1) > (ci / r_ci) )
          send_ack(e1)
      else
          send_ack(ci)

   where r_e1 and r_ci are counts of how often E1 and CI were already
   repeated when ECT(0) was signalled.
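   Taken literally, the two divisions above are undefined while a
   repeat count is still zero.  As a hedged sketch (the function name
   is hypothetical and not part of the protocol), the same comparison
   can be written in cross-multiplied form, which needs no division
   and resolves ties in favour of CI:

```python
def choose_counter(e1: int, r_e1: int, ci: int, r_ci: int) -> str:
    """Pick which counter to repeat on an ACK triggered by ECT(0).

    Cross-multiplied form of (e1 / r_e1) > (ci / r_ci); returns "e1"
    or "ci".  A tie, including the all-zero start state, selects "ci".
    """
    return "e1" if e1 * r_ci > ci * r_e1 else "ci"


if __name__ == "__main__":
    print(choose_counter(0, 0, 0, 0))   # start state, tie
    print(choose_counter(3, 1, 1, 1))   # E1 changed more per repeat
```

   This form still multiplies on every decision; it only illustrates
   the selection rule itself.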
   The algorithm below implements this comparison between two divisions
   using only integer addition.  It is a little terse, so it is
   explained afterwards.

      ci   = 0   // CE counter
      w_ci = 0   // internal 'weight' variable for CI
      r_ci = 0   // internal count of how often CI has been repeated
      e1   = 0   // ECT(1) counter
      w_e1 = 0   // internal 'weight' variable for E1
      r_e1 = 0   // internal count of how often E1 has been repeated
      ni   = 0   // Not-ECT counter

      dack_to_be_sent()  // shorthand for test if a delayed ACK is needed

      switch (read(pkt.ip.ecn)) {
      case CE :
          ci++
          w_ci += r_e1              // keeps w_ci == ci * r_e1
          if (dack_to_be_sent()) send_ack(ci)
          break
      case ECT1 :
          e1++
          w_e1 += r_ci              // keeps w_e1 == e1 * r_ci
          if (dack_to_be_sent()) send_ack(e1)
          break
      case Not-ECT :
          ni++
          if (dack_to_be_sent()) send_ack(ni)
          break
      case ECT0 :
          if (dack_to_be_sent()) {
              /* Choice between E1 and CI */
              if (w_e1 > w_ci) {    // Preference to CI if they tie
                  send_ack(e1)
                  r_e1++
                  w_ci += ci        // keeps w_ci == ci * r_e1
              } else {
                  send_ack(ci)
                  r_ci++
                  w_e1 += e1        // keeps w_e1 == e1 * r_ci
              }
          }
          break
      }

   {ToDo: Handle wrap of the weights.}

   Explanation: The algorithm ensures that the weights always equal the
   following products:

      w_ci = ci * r_e1,
      w_e1 = e1 * r_ci.

   It does this by incremental addition rather than multiplication:

   o  every time r_e1 increments by 1, w_ci is incremented by 1 * ci;

   o  every time ci increments by 1, w_ci is incremented by 1 * r_e1;

   and likewise for w_e1 and its two factors, e1 and r_ci.

   This ensures that the condition

      w_e1 > w_ci

   used in the algorithm is equivalent to:

      e1 * r_ci > ci * r_e1,

   or rearranging:

      (e1 / r_e1) > (ci / r_ci),

   which is the required proportionality condition.

A.2.2.
Counter Selection Algorithm Alt#2

   Counter selection algorithm Alt#2 implements the policy "Send each
   recently changed codepoint twice, unless the other one has also
   changed, and alternate sending CI, E1 if no counter changes."

   {ToDo: Alt#2 has the disadvantage that it can repeat E1 a lot, even
   if E1 has never been signalled, which unnecessarily reduces the
   resilience of CI.}

      ci     = 0     // CE counter
      q_ci   = 0     // queue of CI's to repeat
      nxt_ci = TRUE  // Signal E1 next if FALSE
      e1     = 0     // ECT(1) counter
      q_e1   = 0     // queue of E1's to repeat
      ni     = 0     // Not-ECT counter

      dack_to_be_sent()  // shorthand for test if a delayed ACK is needed

      switch (read(pkt.ip.ecn)) {
      case CE :
          ci++
          q_ci = 2
          if (dack_to_be_sent()) send_ack(ci)
          break
      case ECT1 :
          e1++
          q_e1 = 2
          if (dack_to_be_sent()) send_ack(e1)
          break
      case Not-ECT :
          ni++
          if (dack_to_be_sent()) send_ack(ni)
          break
      case ECT0 :
          if (dack_to_be_sent()) {
              /* Choice between E1 and CI */
              if (q_ci || q_e1) {       // If either queue is non-zero
                  if (q_e1 > q_ci) {    // Preference to CI if they tie
                      send_ack(e1)
                      q_e1 = max(0, q_e1 - 1)
                  } else {
                      send_ack(ci)
                      q_ci = max(0, q_ci - 1)
                  }
              } else {                  // Both queues are zero
                  if (nxt_ci)
                      send_ack(ci)
                  else
                      send_ack(e1)
                  nxt_ci = !nxt_ci      // Toggle the next signal
              }
          }
          break
      }

A.3.  Example Encodings and Decodings of Top-ACE and ACE

   This appendix gives formulae for encoding and decoding the counters
   CI, E1 or NI with higher resilience to ACK loss by supplementing the
   ACE field with the Top-ACE field, as required in Section 3.3.3.

A.3.1.  Encoding Top-ACE and ACE by the Data Receiver

   The values associated with codepoints in ACE for CI and E1 are
   respectively base 4 and base 3 numbers (see Table 3).
   Although there is only space for one value of NI, mathematically, NI
   can still be treated as a base 1 counter.  Then the following general
   formulae allow a Data Receiver to encode any of the counters CI, E1
   or NI, by calling them all cntr, and defining ACE_base as their
   respective number base:

      Top-ACE  = Int(cntr / ACE_base) % 16,
      ACE_cntr = cntr % ACE_base.

   Then the Data Receiver finds the codepoint to put in the ACE field by
   looking up ACE_cntr in Table 3 in the column of the relevant counter
   (CI, E1 or NI).  Int() means round down to an integer and '%' is the
   modulo operator.

   To implement this without a costly division operation, two counters
   can be maintained while processing the header information for the
   ACK.  The first counter can be mapped into the ACE field via Table 3.
   A wrap every 4 increments of the counter could be implemented as a
   single conditional check, and when it wraps, a secondary, high-order
   counter could be incremented.  This secondary counter could then be
   mapped directly into the Top-ACE field.  For instance, the two
   counters for CE markings would be implemented as follows:

      if (read(pkt.ip.ecn) == CE) {
          if (ACE_cntr.ci == 3) {        // base 4: values 0..3
              ACE_cntr.ci = 0
              if (Top-ACE.ci == 15) {    // modulo 16: values 0..15
                  Top-ACE.ci = 0
              } else
                  Top-ACE.ci++
          } else
              ACE_cntr.ci++
      }

   The three examples below explain how the algorithm determines which
   codepoints to place in Top-ACE and ACE, for each counter in turn.
   For brevity, they use the first mathematical formula above, rather
   than the second conditional logic variant.

   Example #1: if the Data Receiver has determined that it will signal
   its CI counter next and its local value is 73, it encodes this as:

      Top-ACE  = Int(73 / 4) % 16
               = 2
               = 0b0010
      ACE_cntr = 73 % 4
               = 1

   Looking up the codepoint for CI = 1 in Table 3 gives:

      ACE = 0b001.
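   The general encoding formulae above, and the division-free variant,
   can be sketched in Python as follows.  This is an illustration only:
   the names ACE_BASE, encode_counter and increment are hypothetical,
   and the final lookup of ACE_cntr in Table 3 is deliberately left
   out.

```python
# Number bases stated in the text: CI is base 4, E1 is base 3 and NI
# is treated as base 1.
ACE_BASE = {"CI": 4, "E1": 3, "NI": 1}


def encode_counter(name: str, cntr: int) -> tuple:
    """Return (Top-ACE, ACE_cntr) for a counter, using the formulae."""
    base = ACE_BASE[name]
    top_ace = (cntr // base) % 16   # Int(cntr / ACE_base) % 16
    ace_cntr = cntr % base          # cntr % ACE_base
    return top_ace, ace_cntr


def increment(top_ace: int, ace_cntr: int, base: int) -> tuple:
    """Division-free update of (Top-ACE, ACE_cntr) for one new marking."""
    if ace_cntr == base - 1:        # low-order digit is about to wrap
        return (top_ace + 1) % 16, 0
    return top_ace, ace_cntr + 1


if __name__ == "__main__":
    # Example #1 from the text: CI = 73.
    print(encode_counter("CI", 73))

    # The incremental variant reaches the same state after 73 CE marks.
    state = (0, 0)
    for _ in range(73):
        state = increment(state[0], state[1], ACE_BASE["CI"])
    print(state)
```

   Both approaches yield the same pair of values; the incremental one
   simply trades the division for one comparison per marked packet.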

   Example #2: if the Data Receiver has determined that it will signal
   its E1 counter next and its local value is 75, it encodes this as:

      Top-ACE  = Int(75 / 3) % 16
               = 9
               = 0b1001
      ACE_cntr = 75 % 3
               = 0

   Looking up the codepoint for E1 = 0 in Table 3 gives:

      ACE = 0b100.

   Example #3: if the Data Receiver has determined that it will signal
   its NI counter next and its local value is 43, it encodes this as:

      Top-ACE  = Int(43 / 1) % 16
               = 11
               = 0b1011
      ACE_cntr = 43 % 1
               = 0          // Anything modulo 1 is 0

   Looking up the codepoint for NI = 0 in Table 3 gives:

      ACE = 0b111.

A.3.2.  Decoding Top-ACE and ACE by the Data Sender

   An AccECN Data Sender decodes the incoming combination of Top-ACE
   and ACE by looking up the ACE codepoint in Table 3 to get ACE_cntr
   and ACE_base, then:

      cntr = Top-ACE * ACE_base + ACE_cntr.

   For example, if ACE = 0b101 and Top-ACE = 0b0111 = 7, the Data
   Sender looks up ACE = 0b101 in Table 3 to see that this is the E1
   counter and that ACE_cntr = 1 base 3.  Therefore,

      E1 = cntr = 7 * 3 + 1
                = 22

   The Data Sender is likely to be primarily interested in the
   increment in this counter relative to the previous ACK.  In the case
   of E1, it will have to use modulo 48 arithmetic for the difference,
   because the encoding wraps at 48 (see Table 4).  Specifically, if
   the Data Sender's local counter is snd_e1, then the difference is:

      delta_e1 = (E1 + 48 - snd_e1 % 48) % 48

   {ToDo: Provide algorithms that decode correctly with ACK reordering}

A.4.  Example ECN Sequence (ESQ) Encoding Algorithms

   This appendix gives an example algorithm for the Data Receiver to
   encode the arriving sequence of IP-ECN codepoints in the ECN
   Sequence (ESQ) field of a delayed ACK, as required in Section 3.3.4.

   /* Algorithm to encode the arrival sequence of IP-ECN codepoints
    */
   DEFAULT = ECT0       // Any ECN codepoint except Not-ECT
   DACK_T_MAX = 500     // Max time to delay an ACK [ms]
   RL_MAX = 7           // Max run-length that can fit in 3-bit field
   DACK_SEG_MAX = 2     // Max full-sized delayed ACK segments
   MSS = 1500           // Example max segment size [bytes]
   DACK_B_MAX = DACK_SEG_MAX * MSS    // Max deferred bytes

   sp = mk1 = DEFAULT   // 2-bit ECN codepoints: space and mark
   mk2                  // second mark (fed back in ACE, not ESQ)
   rl1 = rl2 = 0        // 3-bit run-lengths
   dack_b = 0           // deferred bytes

   /* Strategy: in readiness for a packet arrival, hold the variables
    * necessary to build the ECN sequence field (ESQ) of the next ACK.
    * If a packet arrives, and it can be added to the held sequence,
    *  do so and return.
    * If it can't be added to the held sequence, send the ACK
    *  with the most recent packet as the second mark.
    * If the delayed ack timer expires, unwind the last packet in the
    *  held sequence to use as the second mark, and send the ACK
    */

   foreach pkt {
       tmp = read(pkt.ip.ecn)        // Store incoming ECN field
       dack_b += read(pkt.ip.size)   // Add to deferred bytes

       if (dack_b >= DACK_B_MAX) {   // Test deferred bytes threshold
           mk2 = tmp                 // Assign incoming ECN to mk2
           send_ack(rl1,rl2,sp,mk1,mk2)   // Encode ESQ and send ACK
       } elif ((rl1 + rl2) <= 0) {   // Is the held sequence empty?
           sp = tmp                  // Initialise with a space in run2
           rl2++
           init_timer(dack_expire, DACK_T_MAX) // Arm delayed ACK timer
       } elif (tmp == sp) {    // Is the incoming ECN another space?
           if (rl2 < RL_MAX) {       // Is there room in run2?
               rl2++                 // Extend run2
           } elif (rl1 <= 0) {       // Otherwise, is run1 empty?
               mk1 = sp        // Shift run2 to run1, making mk1=sp
               rl1 = rl2
               rl2 = 1
           }
       /* In the branches below, incoming ECN is assigned as a mark */
       } elif (rl1 <= 0) {     // If there's room in run1, switch to it
           mk1 = tmp
           rl1 = rl2
           rl2 = 0
       } elif ( (tmp == mk1)   // Is incoming ECN a mark already seen
                && (rl1 == 2)  //  with only one space before it?
                && (rl2 == 0) ) {
           mk1 = sp            // If so, swap marks with spaces
           sp = tmp
           rl1 = 1
           rl2 = 2
       } else {                // Cannot extend sequence
           mk2 = tmp           // Assign the incoming ECN to mk2
           send_ack(rl1,rl2,sp,mk1,mk2)   // Encode ESQ and send ACK
       }
   }

   /* dack_expire()
    * Routine called when the delayed ACK timer expires.
    * There is no incoming packet to fill mk2,
    *  so the last value from the held sequence has to be used instead
    *  (there will always be a held sequence because the timer is only
    *  armed once the sequence is non-empty).
    */
   dack_expire() {
       if (rl2 > 0) {          // run2 contains a value
           rl2--
           mk2 = sp            // copy it into mk2
       } else {                // run2 is empty, therefore run1 is not
           mk2 = mk1           // copy mk1 into mk2
           rl2 = rl1 - 1       // shift run1 into run2 without mk1
           rl1 = 0
       }                       // Last value extraction is complete
       send_ack(rl1,rl2,sp,mk1,mk2)   // Encode ESQ and send ACK
   }

   /* send_ack()
    * Algorithm to encode the arrival sequence of IP-ECN codepoints
    * into the ECN sequence (ESQ) field of a TCP ACK, then send it.
    */
   send_ack(rl1,rl2,sp,mk1,mk2) {
       del_timer(dack_expire)  // Remove any pending delayed ACK timer
       /* Marshal the ECN Sequence field (esq) */
       pkt.tcp.esq = lsb(2,sp) & lsb(2,mk1) & lsb(3,rl1) & lsb(3,rl2)
       /* lsb(n,x): pseudocode for the lowest n significant bits of x */
       /* x & y   : pseudocode for concatenate x and y */
       /*
        * Insert code to send ACK here, with mk2 in pkt.tcp.ace
        */
       /* Reset all variables ready for next packet arrival */
       sp = mk1 = DEFAULT
       rl1 = rl2 = 0
       dack_b = 0              // No bytes are deferred after an ACK
   }

Appendix B.  Alternative Design Choices (To Be Removed Before
             Publication)

   This appendix is informative, not normative.  It records alternative
   designs that the authors chose not to include in the normative
   specification, but which the IETF might wish to consider for
   inclusion.

B.1.  Supplementary AccECN Field on the SYN/ACK

   {ToDo: The tcpm working group is recommended to consider including
   this in an AccECN RFC from the start.  The AccECN protocol defined
   in the body of this specification currently gives no ECN feedback on
   the SYN/ACK on the assumption that the SYN is not ECN-capable.  If
   it is required for the protocol to be future-proofed against the
   possibility that SYNs might one day be ECN-capable, the following
   definition of the SupAccECN field for the SYN/ACK would need to be
   added to Section 3.3.1 and Section 3.3.2.  The text below is written
   as if it is normative, but it is only informative while it is
   demoted to this appendix.}

B.1.1.  Placement of the Supplementary AccECN Field in a SYN/ACK

   To include the SupAccECN field on a SYN/ACK, the Data Receiver MUST
   use the SupAccECN TCP Option with TCP option Kind 0x (TBA) and
   set the Length field to 3 [octets], as illustrated in Figure 9.
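
   As an illustration only, the three-octet option layout described
   above (and drawn in Figure 9) could be assembled as in the Python
   sketch below.  The function name build_supaccecn_option is
   hypothetical, and because the option Kind is still to be assigned,
   the caller has to supply it; the sketch does not assume any
   particular value.

```python
def build_supaccecn_option(kind, sup_accecn):
    """Return the 3 octets of a SupAccECN TCP option for a SYN/ACK.

    kind       -- TCP option Kind (to be assigned; supplied by caller)
    sup_accecn -- 4-bit SupAccECN field value
    """
    if not 0 <= kind <= 0xFF:
        raise ValueError("option Kind must fit in one octet")
    if not 0 <= sup_accecn <= 0xF:
        raise ValueError("SupAccECN on a SYN/ACK is a 4-bit field")
    length = 3                       # Kind + Length + one octet of data
    # Third octet: four zero bits, then SupAccECN right-justified.
    return bytes([kind, length, sup_accecn & 0x0F])
```

   For example, build_supaccecn_option(kind, 0b1000) yields the octets
   kind, 3 and 0x08, for whatever placeholder Kind the caller passes.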

       0                   1                   2
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Kind = 0xKK  |  Length = 3   |0 0 0 0| Sup-  |
      |               |               |       | AccECN|
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Figure 9: Placement of the SupAccECN field within the SupAccECN TCP
                           Option on a SYN/ACK

   If the Data Sender has entered AccECN mode but there is no SupAccECN
   TCP Option on a SYN/ACK, the Incoming AccECN Protocol Handler MUST
   take the SupAccECN field to be right-justified within the Non-Urgent
   field (i.e. the least significant bit of SupAccECN is aligned with
   the least significant bit of the Non-Urgent field) as shown in
   Figure 10.  The remaining most significant bits are currently unused
   (CU).

        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
      | X   X   X   X   X   X   X   X   X   X   X   X |   SupAccECN   |
      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

     Figure 10: Placement of the SupAccECN field within the Non-Urgent
                          field on a SYN/ACK

B.1.2.  Structure of the Supplementary AccECN Field in a SYN/ACK

   The size of the SupAccECN field on a SYN/ACK (i.e. a segment with
   SYN = 1 and ACK = 1) is always 4 bits.  Figure 11 defines the
   sub-fields of the SupAccECN field on a SYN/ACK.

        0   1   2   3
      +---+---+---+---+
      | D-ECN | E-ECN |
      +---+---+---+---+

      Figure 11: The Supplementary AccECN Field on a SYN/ACK Segment

   The sub-fields of SupAccECN on a SYN/ACK segment have the following
   meanings:

   E-ECN:  Echo ECN, for the responding host (B) to echo the IP-ECN
      field that arrives in the SYN.  RFC 3168 requires that the ECN
      field on a SYN must always be Not-ECT (0b00).  Therefore
      initially the E-ECN field is likely to always be 0b00.  However,
      the AccECN wire protocol allows for the possibility that
      ECN-capable SYNs might be allowed in future.
      The responding host (B) MUST echo a copy of the IP-ECN field of
      the SYN in the E-ECN field of the SYN/ACK.

      If the SYN were to arrive carrying a congestion indication, the
      responding host (B) MUST also increment the relevant counter
      (r.ci, r.e1 or r.ni) as specified in Section 3.2.1.  Then the
      counters on subsequent feedback will remain consistent even
      though the SYN/ACK does not have an ACE field to feed back
      congestion counters (because it is still using the same bits as
      flags for capability negotiation).  The E-ECN field has been
      defined within a SYN/ACK because the start of a flow is when it
      is most critical for congestion feedback to be timely.  Without
      the E-ECN field, feedback of any congestion marking on a SYN
      would get deferred for at least a round trip.

   D-ECN:  Reserved for a Duplicate ECN field, meaning a duplicate of
      the ECN field in the IP header of the same packet.  This field is
      not defined in the present specification, but it is reserved for
      possible use by a companion specification about ECN-fall-back
      (see Appendix B.3).

      Forward Compatibility:  In the meantime, the responding host (B)
         MUST set D-ECN to ECT(0) (0b10), the originating host (A) MUST
         ignore this field and middleboxes MUST forward this field
         unaltered whether or not it is 0b10.

B.2.  Remove Not-ECT from ECN Sequence (ESQ) Encoding

   This alternative encoding would allow the ESQ field to be 1 bit
   shorter (9 bits instead of 10).  The trade-off is that the receiver
   has to send an ACK immediately whenever a Not-ECT packet arrives.
   This is because this alternative encoding only caters for one
   Not-ECT codepoint in the ACE field, and none in the ESQ field.

   Once ECN has been negotiated for a connection, the sender ought to
   rarely send data segments with the Not-ECT codepoint.
   The only data segments on which RFC 3168 requires the sender to set
   Not-ECT are retransmissions and window probes.  Pure ACKs also have
   to be sent as Not-ECT, but they are not data segments, so they are
   not included in the feedback sequence.

   If the encoding of the ESQ field has to allow for Not-ECT as well as
   the three ECN-capable codepoints, it needs space to encode 4
   possible spaces and 4 possible marks.  This requires 4 bits for
   4x4=16 combinations (two 2-bit fields for SP and MK1).  If on the
   other hand Not-ECT is excluded, space for only 3x3=9 combinations is
   required.  This many combinations can only be fitted into 3 bits if
   they can be reduced to 8 codepoints by encoding two combinations as
   one symbol.  Two combinations can be encoded as one symbol using the
   same encoding for sp=mk1=ECT(1) and sp=mk1=CE.  This is because
   either an ECT(1) or CE code in the ACE field can be used to
   distinguish which is which.  However, whenever a run of ECT(1) or of
   CE ended, the encoding algorithm would have to send two ACKs at
   once.

   Arguments against this alternative design choice:

   o  Although retransmissions would be expected to be rare in a fully
      ECN-enabled network, there might be frequent losses and
      retransmissions during early deployment of ECN, when many
      bottleneck links might not be ECN-enabled.  Then this alternative
      encoding would reduce the opportunities when a receiver could use
      delayed ACKs.

   o  Even if the sender sets Not-ECT on few data segments, incorrectly
      configured or buggy network equipment exists that clears the
      IP-ECN field to Not-ECT.  With this alternative encoding,
      connections via such equipment would never be able to use delayed
      ACKs.  The consequential extra ACK load might be considered an
      incentive for these networks to fix their bugs.  However, the
      endpoints would also suffer the extra ACK load.

   o  To save 1 bit in the encoding it seems necessary for the
      algorithm to sometimes have to send two ACKs at once.

B.3.  ECN Fall-Back

   {ToDo: consider whether the present specification could be enhanced
   with ECN fall-back on the SYN/ACK to give earlier fall-back than in
   [I-D.kuehlewind-tcpm-ecn-fallback].  Space for a duplicate of the
   IP-ECN field on the SYN/ACK has been reserved in the SupAccECN field
   (Appendix B.1), but the behaviour is still TBA.  A duplicate of the
   IP-ECN field has not been provided on the SYN, because it would be
   unremarkable if ECN on the SYN was zeroed by security devices, given
   RFC 3168 prohibited ECT on SYN because it enables DoS attacks.
   Therefore the IP-ECN field has to be tested on the last ACK of the
   3WHS, IMO}

B.4.  Remote Delayed ACK Control Proposal

   {ToDo: The tcpm working group is recommended to consider including
   this in an AccECN RFC from the start, because it would be less
   useful if it was unpredictable whether it had been implemented.  The
   text below is written as if it is normative, but it is only
   informative while it is demoted to this appendix.}

   {ToDo: Add a use-case.}

   Traditionally, each decision on whether to delay an ACK is taken
   independently by the Data Receiver.  This makes it hard to deploy
   behaviours where the Data Sender would like the Data Receiver not to
   delay feedback, perhaps so that it can measure the effect of subtle
   changes in the timing between packets to more rapidly get up to
   speed during slow-start without overshoot.

   A single bit for a Delayed ACK Control (DAC) flag is defined within
   the SupAccECN field of segments with SYN=0.  Space for this is
   reserved in Section 3.3.2 and illustrated in Figure 6.
   For either half-connection, the Data Sender can use the DAC flag to
   request that the remote Data Receiver turns delayed ACKing on or
   off:

   o  DAC = 0 means the sender requests that the receiver turns Delayed
      ACKing on, using the receiver's choice of delayed ACK factor.

   o  DAC = 1 means the sender requests that the receiver turns Delayed
      ACKing off.

   For resilience, the Data Sender MUST repeat its currently chosen
   value of DAC continuously on every packet.  The Data Receiver SHOULD
   start to honour the request on receipt.  Therefore, as soon as a
   segment arrives with DAC=1, a Data Receiver SHOULD immediately send
   any deferred ACKs and no longer withhold ACKs while it continues to
   receive segments with DAC=1.  The DAC flag is meaningful on every
   packet with SYN=0.  The DAC flag is not needed and therefore not
   present in the SupAccECN field when SYN=1 (Figure 11), because TCP
   never withholds the SYN/ACK or the final ACK of the 3-way handshake.

   A receiver MAY ignore a request from a sender to alter its Delayed
   ACKing behaviour, e.g. a challenged receiver that cannot send ACKs
   fast enough need not turn off Delayed ACKs, or a receiver that has
   not implemented delayed ACKs need not turn them on.

Appendix C.  Open Protocol Design Issues (To Be Removed Before
             Publication)

   1.  A possibility to simplify the protocol would be to remove
       ordering feedback entirely, but require the receiver to disable
       delayed ACKs during slow-start (including within a connection
       after a time-out or idle period) or to provide the DAC flag to
       allow the sender to ask the receiver to disable delayed ACKs
       when it needs more accuracy.  However, not delaying ACKs may
       impact server performance.  Also a new way to identify middlebox
       interference in the remaining SupAccECN field (Top-ACE & DAC)
       would have to be found.

   2.
       The protocol currently gives no ECN feedback on the SYN/ACK on
       the assumption that the SYN is not ECN-capable.  If it is
       required for the protocol to be future-proofed against the
       possibility that SYNs might one day be ECN-capable, the proposal
       in Appendix B.1 could be adopted.  This also provides earlier
       ECN-fall-back than would otherwise be possible.

   3.  Section 3.3.1 says an AccECN implementation has to be prepared
       to read the SupAccECN field from either a TCP option or the
       Non-Urgent field.  If the definition of the SupAccECN field
       changes between this experimental spec and the standards track
       spec, the structure of the Non-Urgent field will have to include
       a version number somehow.

   4.  The Non-Urgent field might be used for something else in future
       rather than SupAccECN, despite the attempt to reserve it in this
       spec.  Section 3.3.1 says "If a SupAccECN TCP option is present,
       the Non-Urgent field MUST be ignored.", which seems to correctly
       ensure that experimental implementations will not read the
       altered Non-Urgent field in this case.  However, they will
       incorrectly read the Non-Urgent field if a future AccECN
       protocol uses a different TCP option.

   5.  There is possibly a concern that, if the supplementary field is
       unavailable, the counter selection (Section 3.2.3) always uses
       the last codepoint in a delayed ACK, which may starve visibility
       of other counters.

   6.  Counter Selection Algorithm Alt#2 (Appendix A.2.2) needs to be
       altered to prevent the E1 counter being continually repeated
       when no ECT(1) codepoints are arriving at the Data Receiver.

   7.  A production version of Counter Selection Algorithm Alt#1
       (Appendix A.2.1) needs to be developed that handles wrapping of
       the variables, without losing proportionality.

   8.  Example algorithms need to be developed that decode the
       Top-ACE:ACE counters correctly when ACKs are reordered.

   9.  The definition of the D-ECN field (Section 3.3.2) and ECN
       fall-back more generally (Appendix B.3) will need to be resolved
       before publication.

Appendix D.  Changes in This Version (To Be Removed Before Publication)

   The difference between any pair of versions can be displayed at

   From 02 to 03:

   *  Extensively rewritten.  No summary of changes has been prepared.

Authors' Addresses

   Bob Briscoe
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/

   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna  1120
   Austria

   Phone: +43 1 3676811 3146
   EMail: rs@netapp.com

   Mirja Kuehlewind
   University of Stuttgart
   Pfaffenwaldring 47
   Stuttgart  70569
   Germany

   EMail: mirja.kuehlewind@ikr.uni-stuttgart.de