Transport Area Working Group                             B. Briscoe, Ed.
Internet-Draft                                                A. Jacquet
Intended status: Historic                                             BT
Expires: January 17, 2014                                   T. Moncaster
                                                           Moncaster.com
                                                                A. Smith
                                                                      BT
                                                           July 16, 2013

     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-conex-re-ecn-tcp-02

Abstract

   This document introduces re-ECN (re-inserted explicit congestion
   notification), which is intended to make a simple but far-reaching
   change to the Internet architecture. The sender uses the IP header
   to reveal the congestion that it expects on the end-to-end path. The
   protocol works by arranging an extended ECN field in each packet so
   that, as it crosses any interface in an internetwork, it will carry a
   truthful prediction of congestion on the remainder of its path. It
   can be deployed incrementally around unmodified routers. The purpose
   of this document is to specify the re-ECN protocol at the IP layer
   and to give guidelines on any consequent changes required to
   transport protocols. It includes the changes required to TCP both as
   an example and as a specification. It briefly gives examples of
   mechanisms that can use the protocol to ensure data sources respond
   sufficiently to congestion, but these are described more fully in a
   companion document.

   Note concerning Intended Status: If this draft were ever published as
   an RFC it would probably have historic status. There is limited
   space in the IP header, so re-ECN had to compromise by requiring the
   receiver to be ECN-enabled otherwise the sender could not use re-ECN.
   Re-ECN was a precursor to chartering of the IETF's Congestion
   Exposure (ConEx) working group, but during chartering there were
   still too few ECN receivers enabled, therefore it was decided to
   pursue other compromises in order to fit a similar capability into
   the IP header.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 17, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Requirements notation
   3. Terminology
   4. Protocol Overview
     4.1. Simplified Re-ECN Protocol
       4.1.1. Congestion Control and Policing the Protocol
       4.1.2. Background and Applicability
     4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     4.3. Re-ECN Protocol Operation
     4.4. Positive and Negative Flows
   5. Network Layer
     5.1. Re-ECN IPv4 Wire Protocol
     5.2. Re-ECN IPv6 Wire Protocol
     5.3. Router Forwarding Behaviour
     5.4. Justification for Setting the First SYN to FNE
     5.5. Control and Management
       5.5.1. Negative Balance Warning
       5.5.2. Rate Response Control
     5.6. IP in IP Tunnels
     5.7. Non-Issues
   6. Transport Layers
     6.1. TCP
       6.1.1. RECN mode: Full Re-ECN capable transport
       6.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 compliant ECN
              Receiver
       6.1.3. Capability Negotiation
       6.1.4. Extended ECN (EECN) Field Settings during Flow Start or
              after Idle Periods
       6.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs
     6.2. Other Transports
       6.2.1. General Guidelines for Adding Re-ECN to Other Transports
       6.2.2. Guidelines for adding Re-ECN to RSVP or NSIS
       6.2.3. Guidelines for adding Re-ECN to DCCP
       6.2.4. Guidelines for adding Re-ECN to SCTP
   7. Incremental Deployment
   8. Related Work
     8.1. Congestion Notification Integrity
   9. Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1. Normative References
     14.2. Informative References
   Appendix A. Precise Re-ECN Protocol Operation
   Appendix B. Justification for Two Codepoints Signifying Zero Worth
               Packets
   Appendix C. ECN Compatibility
   Appendix D. Packet Marking with FNE During Flow Start
   Appendix E. Argument for holding back the ECN nonce
   Appendix F. Alternative Terminology Used in Other Documents

Authors' Statement: (to be removed by the RFC Editor)

   The most immediate priority for the authors is to delay any move of
   the ECN nonce to Proposed Standard status, in order to leave options
   open for the future. The argument for this position is developed in
   Appendix E.

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs from all previous versions (created using the rfcdiff
   tool) are available at

   From draft-briscoe-conex-...-01 to -02 (current version): Re-issued
      to keep alive; updated references

   From draft-briscoe-conex-...-00 to -01: Re-issued to keep alive;
      updated references

   From draft-briscoe-tsvwg-...-08 to draft-briscoe-conex-...-00:

      Re-issued to keep alive for reference by ConEx working group

      Changed working group tag in filename from tsvwg to conex

      Changed intended status to historic and added explanatory note

      Updated references.
      Also, now that RFC6040 has been published,
      the section on tunnelling required a re-write

      Corrected name of CE(0) to Cancelled in Table 2

      Noted errors and omissions (rather than spending time correcting
      them):

      *  Made a few 'ToDo' comments visible that had previously been
         comments within the document source

      *  Identified errors with 'ToDo' comments, referring to correct
         material where possible.

   From -08 to -09:

      Re-issued to keep alive for reference by ConEx working group.

      Hardly any changes to content, even where it is out of date,
      except references updated.

   From -07 to -08:

      Minor changes and consistency checks.

      References updated.

   From -06 to -07:

      Major changes made following splitting this protocol document from
      the related motivations document [I-D.re-ecn-motiv].

      Significant re-ordering of remaining text.

      New terminology introduced for clarity.

      Minor editorial changes throughout.

1. Introduction

   This document provides a complete specification for the addition of
   the re-ECN protocol to IP and guidelines on how to add it to
   transport layer protocols, including a complete specification of
   re-ECN in TCP as an example. The motivation behind this proposal is
   given in [I-D.re-ecn-motiv], but we include a brief summary here.

   Re-ECN is intended to allow senders to inform the network of the
   level of congestion they expect their flows to see. This information
   is currently only visible at the transport layer. ECN [RFC3168]
   reveals the upstream congestion state of any path by monitoring the
   rate of CE marks. The receiver then informs the sender when they
   have seen a marked packet. Re-ECN builds on ECN by providing new
   codepoints that allow the sender to declare the level of congestion
   they expect on the forward path.
   It is closely related to ECN and
   indeed we define a compatibility mode to allow a re-ECN sender to
   communicate with an ECN receiver.

   If a sender understates expected congestion compared to actual
   congestion then the network could discard packets or enact some other
   sanction. A policer can also be introduced at the ingress of
   networks that can limit the level of congestion being caused.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it. But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   It is important to add a few key points.

   o  In any standard network it always takes one round trip before any
      feedback is received. For this reason a sender must make a
      conservative prediction by transmitting IP packets with a special
      Cautious marking when it is unsure of the state of the network.

   o  It should be noted that the prediction is carried in-band in
      normal data packets and for many transports feedback can be
      carried in the normal acknowledgements or control packets.

   o  The re-ECN protocol is independent of the transport. In TCP,
      acknowledgments are used to convey the feedback from receiver to
      sender. This memo concentrates on TCP as an example transport
      protocol, however the re-ECN protocol is compatible with any
      transport where feedback can be sent from receiver to sender.

   This document is structured as follows. First an overview of the
   re-ECN protocol is given (Section 4), outlining its attributes and
   explaining conceptually how it works as a whole.
   The two main parts of the document follow: the protocol
   specification, divided into network (Section 5) and transport
   (Section 6) layers. Deployment issues discussed throughout the
   document are brought together in Section 7. Related work is
   discussed in Section 8.

2. Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. Terminology

   {ToDo: No attempt has been made to bring terminology into line with
   that agreed within the ConEx working group. For instance the term
   dropper remains unchanged, even though the ConEx w-g has decided to
   call it an audit function (which is actually a much better term).}

   The following terminology is used throughout this memo. Some of this
   terminology has changed as this draft has been revised. Therefore,
   to help avoid confusion, Appendix F sets out all the alternative
   terminology that has been used in other re-ECN related documents.

   o  Neutral packet - a packet that is able to be congestion marked by
      an ECN or re-ECN queue.

   o  Negative packet - a Neutral packet that has been congestion marked
      by an ECN or re-ECN queue.

   o  Positive packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path. In
      general Positive packets should only be sent in response to
      feedback received from the receiver.*

   o  Cancelled packet - a Positive packet that has been congestion
      marked by an ECN or re-ECN queue.

   o  Cautious packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path.
      In general Cautious packets should be used when there is
      insufficient feedback to be confident about the congestion state
      of the network.*

   * The difference between Positive and Cautious packets is
   explained in detail later in the document, along with guidelines on
   the use of Cautious packets.

   All the above terms have related IP codepoints as defined in
   Section 5.

4. Protocol Overview

4.1. Simplified Re-ECN Protocol

   We describe here the simplified re-ECN protocol. To simplify the
   description we assume packets and segments are synonymous.

   Packets are sent from a sender to a receiver. In Figure 1 the queues
   (Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168]. If congestion
   occurs then packets are marked with the congestion experienced (CE)
   flag exactly as in the ECN protocol [RFC3168]; the routers do not
   need to be modified and do not need to know the re-ECN protocol. The
   receiver constantly informs the sender of the current count of
   Negative packets it has seen. The sender uses this information to
   determine how many Positive packets it must send into the network.
   The sender's aim is to balance the number of bytes that have been
   congestion marked with the number of Positive bytes it has sent.

       +--------- Feedback----------+
       |                            |
       v                            |
     +---+    +----+    +----+    +---+
     |   |    |    |    |    |    |   |
     | S |--->| Q1 |--->| Q2 |--->| R |
     |   |    |    |    |    |    |   |
     +---+    +----+    +----+    +---+

                     Figure 1: Simple Re-ECN

4.1.1. Congestion Control and Policing the Protocol

   The arrangement of the protocol ensures that packets carry a
   declaration of the amount of congestion that will be experienced on
   the path. The re-ECN protocol is orthogonal to any congestion
   control algorithms, but can be used to ensure that congestion control
   is being applied by the sender.
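
   As a minimal sketch of the balancing described in Section 4.1, the
   Python fragment below models a sender that tracks the running count
   of congestion-marked (Negative) bytes reported by the receiver and
   marks enough later packets Positive to match it. The class and
   method names are hypothetical, byte-granular feedback is an
   assumption made for brevity, and Cautious packets at flow start are
   ignored; it illustrates the idea only and is not the TCP procedure
   specified in Section 6.

```python
class ReEcnSenderSketch:
    """Toy model: balance Positive bytes sent against Negative bytes reported."""

    def __init__(self):
        self.negative_bytes_reported = 0  # running total fed back by the receiver
        self.positive_bytes_sent = 0      # bytes this sender has marked Positive

    def on_feedback(self, negative_bytes_total):
        # Feedback is assumed to be a running total, so keep the largest value.
        self.negative_bytes_reported = max(self.negative_bytes_reported,
                                           negative_bytes_total)

    def next_packet_marking(self, size):
        # Mark Positive while re-echoed bytes lag the reported Negative bytes.
        if self.positive_bytes_sent < self.negative_bytes_reported:
            self.positive_bytes_sent += size
            return "Positive"
        return "Neutral"


sender = ReEcnSenderSketch()
marks = [sender.next_packet_marking(1500)]   # no congestion reported yet
sender.on_feedback(3000)                     # receiver saw 3000 marked bytes
marks += [sender.next_packet_marking(1500) for _ in range(3)]
print(marks)  # ['Neutral', 'Positive', 'Positive', 'Neutral']
```

   Counting bytes rather than packets matters when packet sizes differ,
   since the balance is between marked bytes and Positive bytes.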

   In general we assume that there will be a policer at the network
   ingress which can rate limit traffic based on the amount of
   congestion declared.

   At the network egress there is a dropper which can impose sanctions
   on flows that incorrectly declare congestion.

   Policers and droppers are explained in more detail in
   [I-D.re-ecn-motiv].

4.1.2. Background and Applicability

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion. Re-ECN is not a new congestion control protocol, rather
   it is orthogonal to congestion control itself. Re-ECN is concerned
   with revealing information about congestion so that users and
   networks can be held accountable for the congestion they cause, or
   allow to be caused.

   Re-ECN builds on ECN so we briefly recap the essentials of the ECN
   protocol [RFC3168]. Two bits in the IP header (v4 or v6) are
   assigned to the ECN field. The sender clears the field to "00"
   (Not-ECT) if either end-point transport is not ECN-capable. Otherwise
   it indicates an ECN-capable transport (ECT) using either of the two
   code-points "10" or "01" (ECT(0) and ECT(1) resp.).

   ECN-capable queues probabilistically set this field to "11" if
   congestion is experienced (CE). In general this marking probability
   will increase with the length of the queue at its egress link
   (typically using the RED algorithm [RFC2309]). However, they still
   drop rather than mark Not-ECT packets. With multiple ECN-capable
   queues on a path, a flow of packets accumulates the fraction of CE
   marking that each queue adds. The combined effect of the packet
   marking of all the queues along the path signals congestion of the
   whole path to the receiver.
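
   To illustrate how marking accumulates across queues, the sketch
   below computes the fraction of packets marked by at least one queue
   on a path. It assumes each queue marks packets independently of the
   others, which is an assumption of this sketch; the function name is
   hypothetical.

```python
def combined_marking(fractions):
    """Fraction of packets CE-marked by at least one queue on the path,
    assuming every queue marks independently of the others."""
    unmarked = 1.0
    for p in fractions:
        unmarked *= (1.0 - p)   # probability of passing this queue unmarked
    return 1.0 - unmarked


# Two queues marking 1% and 2% of packets respectively.
print(round(combined_marking([0.01, 0.02]), 4))  # 0.0298
```

   For small marking fractions the result is close to the simple sum of
   the per-queue fractions.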
   So, for example, if one queue early in a
   path is marking 1% of packets and another later in a path is marking
   2%, flows that pass through both queues will experience approximately
   3% marking (see Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback. But Section 8.1 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback. On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called
   `re-feedback' [Re-fb]. The word is short for either receiver-aligned,
   re-inserted or re-echoed feedback. But it actually works even when
   no feedback is available. In fact it has been carefully designed to
   work for single datagram flows. It also encourages aggregation of
   single packet flows by congestion control proxies. Then, even if the
   traffic mix of the Internet were to become dominated by short
   messages, it would still be possible to control congestion
   effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval. But re-ECN can be deployed incrementally at
   the transport layer around unmodified queues using existing fields in
   IP (v4 or v6). However it does also require the last undefined bit
   in the IPv4 header, which it uses in combination with the 2-bit ECN
   field to create four new codepoints. Nonetheless, we RECOMMEND
   adding optional preferential drop to IP queues based on the re-ECN
   fields in order to improve resilience against DoS attacks.
   Similarly, re-ECN works best if both the sender and receiver
   transports are re-ECN-capable, but it can work with just sender
   support (Section 6.1.2).

4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7). This specification
   defines a new re-ECN extension (RE) flag. We will defer the
   definition of the actual position of the RE flag in the IPv4 & v6
   headers until Section 5. When we don't need to choose between IPv4
   and v6 wire protocols it will suffice to call it the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and SHOULD remain unchanged along the path, although it can be read
   by network elements that understand the re-ECN protocol. It is
   feasible that a network element MAY change the setting of the RE
   flag, perhaps acting as a proxy for an end-point, but such a protocol
   would have to be defined in another specification
   (e.g. [I-D.re-pcn-border-cheat]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) giving eight
   codepoints. We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   One of re-ECN's codepoints is an alternative use of the codepoint set
   aside in RFC3168 for the ECN nonce (ECT(1)). Transports using re-ECN
   do not need to use the ECN nonce as long as the sender is also
   checking for transport protocol compliance [tcp-rcv-cheat]. The case
   for doing this is given in Appendix E.
   Two re-ECN codepoints are
   given compatible uses to those defined in RFC3168 (Not-ECT and CE).
   The other codepoint used by RFC3168 (ECT(0)) isn't used for re-ECN.
   Altogether this leaves one codepoint of the eight unused by ECN or
   re-ECN and available for future use.

   +--------+-------------+-------+-----------+------------------------+
   |  ECN   | RFC3168     |  RE   | EECN      | re-ECN meaning         |
   | field  | codepoint   | flag  | codepoint |                        |
   +--------+-------------+-------+-----------+------------------------+
   |   00   | Not-ECT     |   0   | Not-ECT   | Not re-ECN-capable     |
   |        |             |       |           | transport (Legacy)     |
   |   00   | ---         |   1   | FNE       | Feedback not           |
   |        |             |       |           | established (Cautious) |
   |   01   | ECT(1)      |   0   | Re-Echo   | Re-echoed congestion   |
   |        |             |       |           | and RECT (Positive)    |
   |   01   | ---         |   1   | RECT      | Re-ECN capable         |
   |        |             |       |           | transport (Neutral)    |
   |   10   | ECT(0)      |   0   | ECT(0)    | RFC3168 ECN use only   |
   |        |             |       |           |                        |
   |   10   | ---         |   1   | --CU--    | Currently unused       |
   |        |             |       |           |                        |
   |   11   | CE          |   0   | CE(0)     | Re-Echo cancelled by   |
   |        |             |       |           | CE (Cancelled)         |
   |   11   | ---         |   1   | CE(-1)    | Congestion Experienced |
   |        |             |       |           | (Negative)             |
   +--------+-------------+-------+-----------+------------------------+

                    Table 1: Extended ECN Codepoints

4.3. Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections. Other transports will be discussed later.

   {ToDo: This section to be updated to explain that the sender
   re-echoes losses in the same way as ECN markings.}

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol.
   Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sends Cautious packets by setting the
   FNE codepoint. The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once feedback from an ECN or re-ECN capable flow is
   established, a re-ECN sender always initialises the ECN field to
   ECT(1). And it usually sets the RE flag to "1" indicating a Neutral
   packet. Whenever a queue marks a packet to CE, the receiver feeds
   back this event to the sender. On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends (indicating a Positive packet).

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7). To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0". So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

       +---+  +----+                +----+  +---+
       | S |--| Q1 |----------------| Q2 |--| R |
       +---+  +----+                +----+  +---+
         .      .                      .      .
       ^ .      .                      .      .
       | .      .                      .      .
       | .     RE blanking fraction    .      .
    3% |-------------------------------+=======
       | .      .                      |      .
    2% | .      .                      |      .
       | .      .  CE marking fraction |      .
    1% | .      +----------------------+      .
       | .      |                      .      .
    0% +--------------------------------------->
            ^              ^              ^
            L              M              N    Observation points

                Figure 2: A 2-Queue Example (Imprecise)

   Figure 2 uses a simple network to illustrate how re-ECN allows queues
   to measure downstream congestion.
   The receiver views a CE marking
   fraction of 3% which is fed back to the sender. The sender sets an
   RE blanking fraction of 3% to match this. This RE blanking fraction
   can be observed along the path as the RE flag is not changed by
   network nodes once set by the sender. This is shown by the
   horizontal line at 3% in the figure. The CE marked fraction is shown
   by the stepped line which rises to meet the RE blanking fraction line
   with steps at each queue where packets are marked. Two queues are
   shown (Q1 and Q2) that are currently congested. Each time packets
   pass through a queue, a fraction of them is marked (1% at Q1 and 2%
   at Q2). The approximate downstream congestion can be measured at the
   observation points shown along the path by subtracting the CE marking
   fraction from the RE blanking fraction, as shown in the table below
   (Appendix A derives these approximations from a precise analysis).
   NB: due to the unary nature of ECN marking and the equivalent unary
   nature of re-ECN blanking, the precise fraction of marked bytes must
   be calculated by maintaining a moving average of the number of
   packets that have been marked as a proportion of the total number of
   packets.

   Along the path the fraction of packets that had their RE field
   cleared remains unchanged so it can be used as a reference against
   which to compare upstream congestion. The difference predicts
   downstream congestion for the rest of the path. Therefore, measuring
   the fractions of each codepoint at any point in the Internet will
   reveal upstream, downstream and whole path congestion.

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration. We are not saying any protocol
   handler will work with these average fractions directly. In fact the
   protocol actually requires the number of marked and blanked bytes to
   balance by the time the packet reaches the receiver.

4.4. Positive and Negative Flows

   In Section 3 we introduced the terms Positive, Neutral, Negative,
   Cautious and Cancelled. This terminology is based on the requirement
   to balance the proportion of bytes marked as CE with the proportion
   of bytes that are re-echo marked. In the rest of this memo we will
   loosely talk of positive or negative flows, meaning flows where the
   moving average of the downstream congestion metric is persistently
   positive or negative. A negative flow is one where more CE marked
   packets than re-ECN blanked packets arrive. Likewise in positive
   flows more re-ECN blanked packets arrive than CE marked packets. The
   notion of a negative metric arises because it is derived by
   subtracting one metric from another. Of course actual downstream
   congestion cannot be negative, only the metric can (whether due to
   time lags or deliberate malice).

   Therefore we will talk of packets having `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   the downstream congestion metric. The worth of each type of packet
   is given below in Table 2. The idea is that most flows start with
   zero worth. Every time the network decrements the worth of a packet,
   the sender increments the worth of a later packet. Then, over time,
   as many positive octets should arrive at the receiver as negative.
   Note we have said octets not packets, so if packets are of different
   sizes, the worth should be incremented on enough octets to balance
   the octets in negative packets arriving at the receiver. It is this
   balance that will allow the network to hold the sender accountable
   for the congestion it causes.

   If a packet carrying re-echoed congestion happens to also be
   congestion marked, the +1 worth added by the sender will be cancelled
   out by the -1 network congestion marking.
   Although the two worth
   values correctly cancel out, neither the congestion marking nor the
   re-echoed congestion is lost, because the RE bit and the ECN field
   are orthogonal. So, whenever this happens, the receiver will
   correctly detect and re-echo the new congestion event as well.

   The table below specifies unambiguously the worth of each extended
   ECN codepoint. Note the order is different from the previous table
   to better show how the worth increments and decrements.

   +---------+-------+---------------+-------+-------------------------+
   | ECN     | RE    | Extended ECN  | Worth | Re-ECN Term             |
   | field   | bit   | codepoint     |       |                         |
   +---------+-------+---------------+-------+-------------------------+
   | 00      | 0     | Not-RECT      | ...   | ---                     |
   | 00      | 1     | FNE           | +1    | Cautious                |
   | 01      | 0     | Re-Echo       | +1    | Positive                |
   | 10      | 0     | Legacy        | ...   | RFC3168 ECN use only    |
   |         |       |               |       |                         |
   | 11      | 0     | CE(0)         | 0     | Cancelled               |
   | 01      | 1     | RECT          | 0     | Neutral                 |
   | 10      | 1     | --CU--        | ...   | Currently unused        |
   |         |       |               |       |                         |
   | 11      | 1     | CE(-1)        | -1    | Negative                |
   +---------+-------+---------------+-------+-------------------------+

              Table 2: 'Worth' of Extended ECN Codepoints

5. Network Layer

5.1. Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168]. However, an extension to the ECN field we
   call the RE (Re-ECN extension) flag (Section 4.2) is defined in this
   document. It doubles the extended ECN codepoint space, giving 8
   potential codepoints. The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7 collects together and summarises all the changes
   defined in this document).

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0).
   Alternatively, some would call this bit 0 (counting from 0) of byte 7
   (counting from 1) of the IPv4 header (Figure 3).

                              0   1   2
                            +---+---+---+
                            | R | D | M |
                            | E | F | F |
                            +---+---+---+

   Figure 3: New Definition of the Re-ECN Extension (RE) Control Flag at
                 the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 4
   and specified fully in Section 6. The RE flag is always considered
   in conjunction with the 2-bit ECN field, as if they were concatenated
   together to form a 3-bit extended ECN field. If the ECN field is set
   to either the ECT(1) or CE codepoint, when the RE flag is blanked
   (cleared to "0") it represents a re-echo of congestion experienced by
   an earlier packet. If the ECN field is set to the Not-ECT codepoint,
   when the RE flag is set to "1" it represents the feedback not
   established (FNE) codepoint, which signals that the packet was sent
   without the benefit of congestion feedback.

   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow. For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted. It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks. This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS]. The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e. one occurrence of the flag is sufficient but
   further occurrences achieve the same effect if previous ones were
   lost).

   There will probably be other claims pending on the use of bit 48.
   We know of at least two [ARI05], [RFC3514] but neither has been
   pursued in the IETF so far, although the present proposal would meet
   the needs of the latter.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately. The present proposal is backward compatible with
   RFC3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should. So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2. Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 4).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                     Reserved for future use                 |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 4: Definition of a New IPv6 Congestion Hop by Hop Option
         Header containing the re-ECN Extension (RE) Control Flag

                            0 1 2 3 4 5 6 7 8
                           +-+-+-+-+-+-+-+-+-
                           |AIU|C|Option ID|
                           +-+-+-+-+-+-+-+-+-

           Figure 5: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes.
   For re-ECN, the two bits of the Action If Unrecognized (AIU) flag of
   the Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header. But
   routers with these optional re-ECN features or a re-ECN policing
   function will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC4302]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.

   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. Therefore, as
   changes to the RE flag could be detected end-to-end without
   authentication (see Section 9), we set the C flag to '1'.

5.3. Router Forwarding Behaviour

   {ToDo: Consider a section on how whole protocol interworks with drop.
   Perhaps in Protocol Overview.}

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168].
   Specifications for PHBs MAY define different forwarding behaviours
   from this default, but this is not required.
   [I-D.re-pcn-border-cheat] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped. Note that the FNE
      codepoint has been intentionally chosen so that, to RFC3168
      compliant routers (which do not inspect the RE flag) an FNE packet
      appears to be Not-ECT so it will be dropped by legacy AQM
      algorithms.

      A network operator MUST NOT configure a queue to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in [I-D.re-ecn-motiv] would count as rate
      limiters for this purpose.

   Preferential Drop: If a re-ECN capable router queue experiences very
      high load so that it has to drop arriving packets (e.g. a DoS
      attack), it MAY preferentially drop packets within the same
      Diffserv PHB using the preference order for extended ECN
      codepoints given in Table 3. Preferential dropping can be
      difficult to implement on some hardware, but if feasible it would
      discriminate against attack traffic if done as part of the overall
      policing framework of [I-D.re-ecn-motiv]. If nowhere else,
      routers at the egress of a network SHOULD implement preferential
      drop (stronger than the MAY above). For simplicity, preferences 4
      & 5 MAY be merged into one preference level.

      The tabulated drop preferences are arranged to preserve packets
      with more positive worth (Section 4.4), given senders of positive
      packets must have honestly declared downstream congestion.
      A full treatment of this is provided in the companion document
      describing the motivation and architecture for re-ECN
      [I-D.re-ecn-motiv], particularly where the application of re-ECN
      to protect against DDoS attacks is described.

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo cancelled |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | RFC3168 ECN use   |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not               |
   |       |     |            |       |            | Re-ECN-capable    |
   |       |     |            |       |            | transport         |
   +-------+-----+------------+-------+------------+-------------------+

     Table 3: Drop Preference of EECN Codepoints (Sorted by `Worth')

5.4. Justification for Setting the First SYN to FNE

   The initial SYN MUST be set to FNE by Re-ECT client A
   (Section 6.1.4), and Section 5.3 says a queue MAY treat an FNE packet
   as ECN capable, so an initial SYN may be marked CE(-1) rather than
   dropped. This seems dangerous, because the sender has not yet
   established whether the receiver is an RFC3168 one that does not
   understand congestion marking. It also seems to allow malicious
   senders to take advantage of ECN marking to avoid much of the drop
   they would otherwise suffer when launching SYN flooding attacks.
   Below we explain the features of the protocol design that remove both
   these dangers.

   ECN-capable initial SYN with a Not-ECT server: If the TCP server B
      is re-ECN capable, provision is made for it to feed back a
      possible congestion marking of the SYN in the SYN ACK
      (Section 6.1.4). But if the TCP client A finds out from the SYN
      ACK that the server was not ECN-capable, the TCP client MUST
      conservatively consider the first SYN as congestion marked before
      setting itself into Not-ECT mode. Section 6.1.4 mandates that
      such a TCP client MUST also set its initial window to 1 segment.
      In this way we remove the need to cautiously avoid setting the
      first SYN to Not-RECT. This will give worse performance while
      deployment is patchy, but better performance once deployment is
      widespread.

   SYN flooding attacks can't exploit ECN-capability: Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited somewhere. Introduction of the FNE codepoint was a
      deliberate move to enable transport-neutral handling of flow-start
      and flow state set-up in the IP layer where it belongs. It then
      becomes possible to protect against flooding attacks of all forms
      (not just SYN flooding) without transport-specific inspection for
      things like the SYN flag in TCP headers. Then, for instance, SYN
      flooding attacks using IPsec ESP encryption can also be rate
      limited at the IP layer.
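
   Both arguments above rely on the Section 5.3 rule that a queue only
   marks FNE packets when they are known to be rate limited. The
   following Python sketch illustrates that rule; it is not part of
   this specification. The names EECN, congestion_action and ce_mark,
   and the fne_rate_limited parameter, are hypothetical, and the
   treatment of the unused 10/1 codepoint as markable is an assumption
   based on RFC3168 routers ignoring the RE flag.

```python
# Hypothetical lookup built from Tables 2 and 3:
# (ECN field, RE bit) -> (extended ECN codepoint, worth or None if n/a)
EECN = {
    (0b00, 0): ("Not-RECT", None),
    (0b00, 1): ("FNE", +1),
    (0b01, 0): ("Re-Echo", +1),
    (0b10, 0): ("Legacy", None),
    (0b11, 0): ("CE(0)", 0),
    (0b01, 1): ("RECT", 0),
    (0b10, 1): ("--CU--", None),
    (0b11, 1): ("CE(-1)", -1),
}


def congestion_action(ecn, re_bit, fne_rate_limited):
    """Return 'mark' or 'drop' for a packet the AQM wants to signal on.

    fne_rate_limited is an assumed configuration input: True only if the
    operator can guarantee FNE packets are rate limited locally or upstream.
    """
    name, _worth = EECN[(ecn, re_bit)]
    if name == "FNE":
        # FNE looks like Not-ECT to RFC3168 queues; a re-ECN queue MAY mark
        # it, but MUST NOT do so without a rate-limiting guarantee.
        return "mark" if fne_rate_limited else "drop"
    # Any other codepoint follows the 2-bit ECN field as in RFC3168.
    return "mark" if ecn != 0b00 else "drop"


def ce_mark(ecn, re_bit):
    """Set the ECN field to CE; the RE flag is not changed by the network."""
    return (0b11, re_bit)
```

   With this sketch, marking an FNE packet yields CE(-1), which matches
   the statement that an initial SYN may be marked CE(-1) rather than
   dropped.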

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

   {ToDo: Give alternative where initial packet is Not-RECT and last ACK
   of three-way handshake is FNE. Explain this will give better
   performance while deployment is patchy, but worse performance once
   deployment is high.}

5.5. Control and Management

5.5.1. Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops due to sanctions
   from drops due to other causes, such as congestion or transmission
   losses. The dropper would send the message to the sender of the
   flow, not the receiver. If we did define this message type, it would
   be REQUIRED for all re-ECT senders to parse and understand it. Note
   that a sender MUST only use this message to explain why losses are
   occurring. A sender MUST NOT take this message to mean that losses
   have occurred that it was not aware of. Otherwise, spoof messages
   could be sent by malicious sources to slow down a sender (cf. ICMP
   source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2. Rate Response Control

   As discussed in [I-D.re-ecn-motiv] the sender's access operator will
   be expected to use bulk per-user policing, but they might choose to
   introduce a per-flow policer. In cases where operators do introduce
   per-flow policing, there may be a need for a sender to send a request
   to the ingress policer asking for permission to apply a non-default
   response to congestion (where TCP-friendly is assumed to be the
   default). This would require the sender to know what message
   format(s) to use and to be able to discover how to address the
   policer. The required control protocol(s) are outside the scope of
   this document, but will require definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6. IP in IP Tunnels

   Ideally, for re-ECN to work through IP in IP tunnels, the tunnel
   entry should copy both the RE flag and the ECN field from the inner
   to the outer IP header. Then at the tunnel exit, any CE marking of
   the outer ECN field should overwrite the inner ECN field (unless the
   inner field is Not-ECT in which case an alarm should be raised). The
   RE flag shouldn't change along a path, so the outer RE flag should be
   the same as the inner.
   If it isn't, a management alarm should be raised.

   This requirement is satisfied by the latest specification for
   handling ECN through IP tunnels [RFC6040] as well as by IPsec
   [RFC4301]. However, it is not satisfied by the ingress behaviour
   specified in [RFC3168] although at least the full-functionality
   variant of the egress behaviour is fine. RFC6040 updates RFC3168,
   but it is likely that many legacy non-IPsec IP-in-IP tunnels will
   exist.

   If legacy tunnels are left as specified in [RFC3168], whichever of
   the limited or full-functionality variants is used, a problem arises
   with re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be RFC3168 compliant traffic, and therefore may
   be wrongly rate limited. In a full-functionality ECN tunnel, the
   result will depend on whether the tunnel entry copies the inner RE
   flag to the outer header or the RE flag in the outer header is always
   cleared. If the former, the flow will tend to be too positive when
   accounted for at borders. If the latter, it will be too negative.
   If the rules set out in [RFC6040] are followed then this will not be
   an issue.

5.7. Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o  Various link layers support explicit congestion notification, such
      as Frame Relay and ATM. Explicit congestion notification is
      proposed to be added to other link layers, such as Ethernet
      (802.3ar Ethernet congestion management) and MPLS [RFC5129];

   o  Encryption and IPsec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases). Given the
   RE flag is not intended to change along the path, this means that
   downstream congestion will still be measurable at any point where IP
   is processed on the path by subtracting negative from positive
   markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to burrow
   into inner headers. Obfuscation of flow identifiers is not a problem
   for re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid, it only requires them to be unique. So if
   an IPsec encapsulating security payload (ESP [RFC4835]) or an
   authentication header (AH [RFC4302]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are.
   The alternative would be for the sender to make each packet appear to
   be a new flow, which would require them all to be marked FNE in order
   to avoid being treated with the bulk of malicious flows at the egress
   dropper. Given the FNE marking is worth +1 and networks are likely
   to rate limit FNE packets, endpoints are given an incentive not to
   set FNE on each packet. But if the sender really does want to hide
   the flow relationship between packets it can choose to pay the cost
   of multiple FNE packets, which in the long run will compensate for
   the extra memory required on network policing elements to process
   each flow.

   {ToDo: Add a note about it being useful that the AH header does not
   cover the RE flag, referring to Section 9.}

6. Transport Layers

6.1. TCP

   Re-ECN capability at the sender is essential. At the receiver it is
   optional, as long as the receiver has a basic RFC3168-compliant
   ECN-capable transport (ECT) [RFC3168]. Given re-ECN is not the first
   attempt to define the semantics of the ECN field, we give a table
   below summarising what happens for various combinations of
   capabilities of the sender S and receiver R, as indicated in the
   first four columns below. The last column gives the mode a
   half-connection should be in after the first two of the three TCP
   handshakes.

   +--------+--------------+------------+---------+--------------------+
   | Re-ECT | ECT-Nonce    | ECT        | Not-ECT | S-R                |
   |        | (RFC3540)    | (RFC3168)  |         | Half-connection    |
   |        |              |            |         | Mode               |
   +--------+--------------+------------+---------+--------------------+
   | SR     |              |            |         | RECN               |
   | S      | R            |            |         | RECN-Co            |
   | S      |              | R          |         | RECN-Co            |
   | S      |              |            | R       | Not-ECT            |
   +--------+--------------+------------+---------+--------------------+

      Table 4: Modes of TCP Half-connection for Combinations of ECN
               Capabilities of Sender S and Receiver R

   We will describe what happens in each mode, then describe how they
   are negotiated. The abbreviations for the modes in the above table
   mean:

   RECN: Full re-ECN capable transport

   RECN-Co: Re-ECN sender in compatibility mode with an RFC3168
      compliant [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
      receiver. Implementation of this mode is OPTIONAL.

   Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when
      at least one of the transports does not understand even basic ECN
      marking.

   Note that we use the term Re-ECT for a host transport that is
   re-ECN-capable but RECN for the modes of the half connections between
   hosts when they are both Re-ECT. If a host transport is Re-ECT, this
   fact alone does NOT imply either of its half connections will
   necessarily be in RECN mode, at least not until it has confirmed that
   the other host is Re-ECT.

6.1.1. RECN mode: Full Re-ECN capable transport

   In full RECN mode, for each half connection, the sender and the
   receiver each maintain an unsigned integer counter we will call ECC
   (echo congestion counter). The receiver maintains a count of how
   many times a CE marked packet has arrived during the half-connection.
   Once a RECN connection is established, the three TCP option flags
   (ECE, CWR & NS) used for ECN-related functions in other versions of
   ECN are used as a 3-bit field for the receiver to repeatedly tell the
   sender the current value of ECC, modulo 8, whenever it sends a TCP
   ACK. We will call this the echo congestion increment (ECI) field.
   This overloaded use of these 3 option flags as one 3-bit ECI field is
   shown in Figure 7. The actual definition of the TCP header,
   including the addition of support for the ECN nonce, is shown for
   comparison in Figure 6. This specification does not redefine the
   names of these three TCP option flags, it merely overloads them with
   another definition once a flow is established.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 6: The (post-ECN Nonce) definition of bytes 13 and 14 of the
                               TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 7: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header, overloading the current definitions above for established
                               RECN flows.

   Receiver Action in RECN Mode

      Every time a CE marked packet arrives at a receiver in RECN mode,
      the receiver transport increments its local value of ECC and MUST
      echo its value, modulo 8, to the sender in the ECI field of the
      next ACK.
      It MUST repeat the same value of ECI in every subsequent ACK until
      the next CE event, when it increments ECI again.

      The increment of the local ECC values is modulo 8 so the field
      value simply wraps round back to zero when it overflows. The
      least significant bit is to the right (labelled bit 9).

      A receiver in RECN mode MAY delay the echo of a CE to the next
      delayed-ACK, which would be necessary if ACK-withholding were
      implemented.

   Sender Action in RECN Mode

      On the arrival of every ACK, the sender compares the ECI field
      with its own ECC value, then replaces its local value with that
      from the ACK. The difference D (D = (ECI + 8 - ECC mod 8) mod 8)
      is assumed to be the number of CE marked packets that arrived at
      the receiver since it sent the previously received ACK (but see
      below for the sender's safety strategy). Whenever the ECI field
      increments by D (and/or d drops are detected), the sender MUST
      clear the RE flag to "0" in the IP header of the next D' data
      packets it sends (where D' = D + d), effectively re-echoing each
      single increment of ECI. Otherwise the data sender MUST send all
      data packets with RE set to "1".

      As a general rule, once a flow is established, as well as setting
      or clearing the RE flag as above, a data sender in RECN mode MUST
      always set the ECN field to ECT(1). However, the settings of the
      extended ECN field during flow start are defined in Section 6.1.4.

      As we have already emphasised, the re-ECN protocol makes no
      changes to, and has no effect on, the TCP congestion control
      algorithm. So, the first increment of ECI (or detection of a
      drop) in a RTT triggers the standard TCP congestion response, no
      more than one congestion response per round trip, as usual.
      However, the sender re-echoes every increment of ECI irrespective
      of RTTs.
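
      The sender bookkeeping described above can be sketched in Python
      as follows. This is only an illustration under stated
      assumptions: the class RecnSender and its members are
      hypothetical names, drop detection is supplied by the caller, and
      the sender's safety strategy mentioned above is not modelled.

```python
class RecnSender:
    """Hypothetical RE-flag bookkeeping for a data sender in RECN mode."""

    def __init__(self):
        self.ecc = 0            # sender's local copy of the echo congestion counter
        self.blank_pending = 0  # data packets still to be sent with RE = 0

    def on_ack(self, eci, drops=0):
        """Process the 3-bit ECI field of an ACK plus d newly detected drops."""
        if not 0 <= eci <= 7:
            raise ValueError("ECI is a 3-bit field")
        # D = (ECI + 8 - ECC mod 8) mod 8
        d_marks = (eci + 8 - self.ecc % 8) % 8
        # The local value is replaced with the value from the ACK.
        self.ecc = eci
        # D' = D + d packets must re-echo the congestion.
        self.blank_pending += d_marks + drops
        return d_marks

    def next_re_flag(self):
        """RE flag for the next data packet (the ECN field would be ECT(1))."""
        if self.blank_pending > 0:
            self.blank_pending -= 1
            return 0  # blanked: re-echoes one increment of ECI or one drop
        return 1
```

      For example, if the local ECC is 7 and an ACK arrives with ECI = 1,
      the sketch computes D = 2, so the counter wrap is handled by the
      modulo arithmetic.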

      A TCP sender also acts as the receiver for the other
      half-connection. The host will maintain two ECC values S.ECC and
      R.ECC as sender and receiver respectively. Every TCP header sent
      by a host in RECN mode will also repeat the prevailing value of
      R.ECC in its ECI field. If a sender in RECN mode has to
      retransmit a packet due to a suspected loss, the retransmitted
      packet MUST carry the latest prevailing value of R.ECC when it is
      retransmitted, which will not necessarily be the one it carried
      originally.

6.1.2. RECN-Co mode: Re-ECT Sender with an RFC3168 compliant ECN
       Receiver

   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
   differently to that of RFC3168 compliant ECN. In other words, the
   receiver sets the ECE flag repeatedly in the TCP header and the
   sender responds by setting the CWR flag. Although RECN-Co mode is
   used when the receiver has not implemented the re-ECN protocol, the
   sender can infer enough from its RFC3168 compliant ECN feedback to
   set or clear the RE flag reasonably well. Specifically, every time
   the receiver toggles the ECE field from "0" to "1" (or a loss is
   detected), as well as setting CWR in the TCP flags, the re-ECN sender
   MUST blank the RE flag of the next packet to "0" as it would do in
   full RECN mode. Otherwise, the data sender SHOULD send all other
   packets with RE set to "1". Once a flow is established, a re-ECN
   data sender in RECN-Co mode MUST always set the ECN field to ECT(1).

   If a CE marked packet arrives at the receiver within a round trip
   time of a previous mark, the receiver will still be echoing ECE for
   the last CE mark. Therefore, such a mark will be missed by the
   sender. Of course, this isn't of concern for congestion control, but
   it does mean that very occasionally the RE blanking fraction will be
   understated.
   Therefore flows in RECN-Co mode may occasionally be mistaken for very
   lightly cheating flows and consequently might suffer a small number
   of packet drops through an egress dropper. We expect re-ECN would be
   deployed for some time before policers and droppers start to enforce
   it. So, given there is not much ECN deployment yet anyway, this
   minor problem may affect only a very small proportion of flows,
   reducing to nothing over the years as RFC3168 compliant ECN hosts
   upgrade. The use of RECN-Co mode would need to be reviewed in the
   light of experience at the time of re-ECN deployment.

   RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their
   code simple MAY choose not to implement this mode. If they do not,
   a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the
   presence of an ECN-capable receiver. It MAY choose to fall back to
   the ECT-Nonce mode, but if re-ECN implementers don't want to be
   bothered with RECN-Co mode, they probably won't want to add an
   ECT-Nonce mode either.

6.1.2.1. Re-ECN support for the ECN Nonce

   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
   Nonce [RFC3540]. This means that the sending code of a re-ECN
   implementation will never need to include ECN Nonce support. Re-ECN
   is intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, and re-ECN only requires support
   from the sender, therefore it is preferable to specifically rule out
   the need for dual sender implementations. As a consequence, a re-ECN
   capable sender will never set ECT(0), so it will be easier for
   network elements to discriminate re-ECN traffic flows from other ECN
   traffic, which will always contain some ECT(0) packets.

   However, a re-ECN implementation MAY include receiving code that
   complies with the ECN Nonce protocol when interacting with a sender
   that supports the ECN nonce (rather than re-ECN), but this support is
   not required.

   RFC3540 allows an ECN nonce sender to choose whether to sanction a
   receiver that does not ever set the nonce sum. Given re-ECN is
   intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, implementers of re-ECN receivers MAY
   choose not to implement backwards compatibility with the ECN nonce
   capability. This may be because they deem that the risk of sanctions
   is low, perhaps because significant deployment of the ECN nonce seems
   unlikely at implementation time.

6.1.3. Capability Negotiation

   During the TCP handshake at the start of a connection, an originator
   of the connection (host A) with a re-ECN-capable transport MUST
   indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1
   in the initial SYN.

   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
   CWR=1 and ECE=0. The responding host MUST NOT set this combination
   of flags unless the preceding SYN has already indicated Re-ECT
   support as above. Normally a Re-ECT server (B) will reply to a
   Re-ECT client with NS=0, but if the initial SYN from Re-ECT client A
   is marked CE(-1), a Re-ECT server B MUST increment its local value of
   ECC. But B cannot reflect the value of ECC in the SYN ACK, because
   it is still using the 3 bits to negotiate connection capabilities.
   So, server B MUST set the alternative TCP header flags in its SYN
   ACK: NS=1, CWR=1 and ECE=0.

   These handshakes are summarised in Table 5 below, with X indicating
   NS can be either 1 or 0 depending respectively on whether congestion
   had been experienced or not.
   The handshakes used for the other flavours of ECN are also shown for
   comparison. To compress the width of the table, the headings of the
   first four columns have been severely abbreviated, as follows:

   R: *R*e-ECT

   N: ECT-*N*once (RFC3540)

   E: *E*CT (RFC3168)

   I: Not-ECT (*I*mplicit congestion notification).

   These correspond with the same headings used in Table 4. Indeed, the
   resulting modes in the last two columns of the table below are a more
   comprehensive way of saying the same thing as Table 4.

   +----+---+---+---+------------+-------------+-----------+-----------+
   | R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
   +----+---+---+---+------------+-------------+-----------+-----------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
   | AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
   | A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
   | A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
   | A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
   | B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
   | B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
   | B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
   +----+---+---+---+------------+-------------+-----------+-----------+

      Table 5: TCP Capability Negotiation between Originator (A) and
                              Responder (B)

   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
   its two half-connections into the modes given in Table 5. As soon as
   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
   half-connections into the modes given in Table 5. The
   half-connections will remain in these modes for the rest of the
   connection, including for the third segment of TCP's three-way
   handshake (the ACK).
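
   The rows of Table 5 in which the originator A is Re-ECT can be
   sketched in Python as below. This is an illustration only, not a
   normative state machine. The function names are hypothetical, and
   falling back to Not-ECT for any flag combination that Table 5 does
   not list is an assumption made for the sketch.

```python
def server_syn_ack_flags(syn_was_ce_marked):
    """(NS, CWR, ECE) a Re-ECT server B sets in reply to a Re-ECT SYN.

    NS = 1 is the alternative setting that reports a CE(-1) marked SYN.
    """
    return (1 if syn_was_ce_marked else 0, 1, 0)


def client_modes(ns, cwr, ece):
    """Map the SYN ACK flags seen by Re-ECT client A to (A-B mode, B-A mode).

    Assumes A sent its SYN with NS=1, CWR=1 and ECE=1 as required.
    """
    if (cwr, ece) == (1, 0):
        return ("RECN", "RECN")            # NS is X: 1 or 0
    if (ns, cwr, ece) == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")
    if (ns, cwr, ece) == (0, 0, 1):
        return ("RECN-Co", "ECT")
    # Covers 0 0 0 from Table 5; any other combination is treated the
    # same way here, which is an assumption of this sketch.
    return ("Not-ECT", "Not-ECT")


def syn_reported_marked(ns, cwr, ece):
    """True if a Re-ECT server's SYN ACK reports that the SYN was CE marked."""
    return (ns, cwr, ece) == (1, 1, 0)
```

   A client would call client_modes() once, on receipt of the SYN ACK,
   and keep the resulting modes for the rest of the connection.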
1270 {ToDo: Consider delaying mode changes if using SYN cookies (will also 1271 affect next section).} 1273 {ToDo: consider RSTs within a connection.} 1275 Recall that, if the SYN ACK reflects the same flag settings as the 1276 preceding SYN (because there is a broken RFC3168 compliant 1277 implementation that behaves this way), RFC3168 specifies that the 1278 whole connection MUST revert to Not-ECT. 1280 Also note that, whenever the SYN flag of a TCP segment is set 1281 (including when the ACK flag is also set), the NS, CWR and ECE flags 1282 (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the 1283 3-bit ECI value, which is only set as a copy of the local ECC value 1284 in non-SYN packets. 1286 6.1.4. Extended ECN (EECN) Field Settings during Flow Start or after 1287 Idle Periods 1289 If the originator (A) of a TCP connection supports re-ECN, it MUST set 1290 the extended ECN (EECN) field in the IP header of the initial SYN 1291 packet to the feedback not established (FNE) codepoint. 1293 FNE is a new extended ECN codepoint defined by this specification 1294 (Section 4.2). The feedback not established (FNE) codepoint is used 1295 when the transport does not have the benefit of ECN feedback so it 1296 cannot decide whether to set or clear the RE flag. 1298 If, after receiving a SYN, the server B has set its sending half- 1299 connection into RECN mode or RECN-Co mode, it MUST set the extended 1300 ECN field in the IP header of its SYN ACK to the feedback not 1301 established (FNE) codepoint. Note the careful wording here, which 1302 means that Re-ECT server B MUST set FNE on a SYN ACK whether it is 1303 responding to a SYN from a Re-ECT client or from a client that is 1304 merely ECN-capable. This is because FNE indicates the transport is 1305 ECN capable as well as re-ECN capable. 1307 The original ECN specification [RFC3168] required SYNs and SYN ACKs 1308 to use the Not-ECT codepoint of the ECN field.
The aim was to 1309 prevent well-known DoS attacks such as SYN flooding being able to 1310 gain from the advantage that ECN capability afforded over drop at 1311 ECN-capable routers. 1313 For a SYN ACK, Kuzmanovic [RFC5562] has shown that this caution was 1314 unnecessary, and allows a SYN ACK to be ECN-capable to improve 1315 performance. By stipulating the FNE codepoint for the initial SYN, 1316 we comply with RFC3168 in word but not in spirit, because we have 1317 indeed set the ECN field to Not-ECT, but we have extended the ECN 1318 field with another bit. And it will be seen (Section 5.3) that we 1319 have defined one setting of that bit to mean an ECN-capable 1320 transport. Therefore, by proposing that the FNE codepoint MUST be 1321 used on the initial SYN of a connection, we have gone further by 1322 proposing to make the initial SYN ECN-capable too. Section 5.4 1323 justifies deciding to make the initial SYN ECN-capable. 1325 Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will 1326 have already been set on the initial SYN and possibly the SYN ACK as 1327 above. But each re-ECN sender will have to set FNE cautiously on a 1328 few data packets as well, given a number of packets will usually have 1329 to be sent before sufficient congestion feedback is received. The 1330 behaviour will be different depending on the mode of the half- 1331 connection: 1333 RECN mode: Given the constraints on TCP's initial window [RFC3390] 1334 and its exponential window increase during slow start 1335 phase [RFC5681], it turns out that the sender SHOULD set FNE on 1336 the first and third data packets in its flow after the initial 1337 3-way handshake, assuming equal sized data packets once a flow is 1338 established. Appendix D presents the calculation that led to this 1339 conclusion. Below, after running through the start of an example 1340 TCP session, we give the intuition learned from that calculation. 
1341 {ToDo: unfortunately the calculation was based on erroneous 1342 assumptions; see [I-D.conex-tcp-mods] for a better approach.} 1344 RECN-Co mode: A re-ECT sender that switches into re-ECN 1345 compatibility mode or into Not-ECT mode (because it has detected 1346 the corresponding host is not re-ECN capable) MUST limit its 1347 initial window to 1 segment. The reasoning behind this constraint 1348 is given in Section 5.4. Having set this initial window, a re-ECN 1349 sender in RECN-Co mode SHOULD set FNE on the first and third data 1350 packets in a flow, as for RECN mode.
1352 +----+------+----------------+-------+-------+---------------+------+
1353 |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
1354 +----+------+----------------+-------+-------+---------------+------+
1355 |    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
1356 | -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
1357 | 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
1358 |    |      |    CWR,ECE,NS  |       |       |               |      |
1359 | 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
1360 |    |      |                |       |       |   SYN,ACK,CWR |      |
1361 | 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
1362 | 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
1363 | 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
1364 | 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
1365 | 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
1366 | 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
1367 | 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
1368 | 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
1369 | 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
1370 | 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
1371 | 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
1372 |    |      | ...            |       |       |               |      |
1373 +----+------+----------------+-------+-------+---------------+------+
1375 Table 6: TCP Session Example #1
1377 Table 6 shows an example TCP session, where the server B sets FNE on 1378 its first and third data packets (lines 5 & 7) as well as on the 1379 initial SYN ACK as previously described. The left hand half of the 1380 table shows the relevant settings of headers sent by client A in 1381 three layers: the TCP payload size; TCP settings; then IP settings. 1382 The right hand half gives equivalent columns for server B. The only 1383 TCP settings shown are the sequence number (SEQ), acknowledgement 1384 number (ACK) and the relevant control (CTL) flags that the relevant 1385 sending host sets in the TCP header. The IP columns show the setting 1386 of the extended ECN (EECN) field. 1388 Also shown on the receiving side of the table is the value of the 1389 receiver's echo congestion counter (R.ECC) after processing the 1390 incoming EECN header. Note that, once a host sets a half-connection 1391 into RECN mode, it MUST initialise its local value of ECC to zero. 1393 The intuition that Appendix D gives for why a sender should set FNE 1394 on the first and third data packets is as follows. At line 13, a 1395 packet sent by B is shown with an '*', which means it has been 1396 congestion marked by an intermediate queue from RECT to CE(-1). On 1397 receiving this CE marked packet, client A increments its ECC counter 1398 to 1 as shown. This was the 7th data packet B sent, but before 1399 feedback about this event returns to B, it might well have sent many 1400 more packets. Indeed, during exponential slow start, about as many 1401 packets will be in flight (unacknowledged) as have been acknowledged. 1402 So, when the feedback from the congestion event on B's 7th segment 1403 returns, B will have sent about 7 further packets that will still be 1404 in flight. At that stage, B's best estimate of the network's packet 1405 marking fraction will be 1/7.
So, as B will have sent about 14 1406 packets, it should have already marked 2 of them as FNE in order to 1407 have marked 1/7; hence the need to have set the first and third data 1408 packets to FNE. 1410 Client A's behaviour in Table 6 also shows FNE being set on the first 1411 SYN and the first data packet (lines 1 & 4), but in this case it 1412 sends no more data packets, so, of course, it cannot, and does not 1413 need to, set FNE again. Note that in the A-B direction there is no 1414 need to set FNE on the third part of the three-way hand-shake (line 1415 3, the ACK). 1417 Note that in this section we have used the word SHOULD rather than 1418 MUST when specifying how to set FNE on data segments before positive 1419 congestion feedback arrives (but note that the word MUST was used for 1420 FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first 1421 and third data segments to entertain the possibility that the TCP 1422 transport has the benefit of other knowledge of the path, which it 1423 re-uses from one flow for the benefit of a newly starting flow. For 1424 instance, one flow can re-use knowledge of other flows between the 1425 same hosts if using a Congestion Manager [RFC3124] or when a proxy 1426 host aggregates congestion information for large numbers of flows. 1428 {ToDo: There is probably scope for re-writing the above in a 1429 different way so that it says MUST unless some other knowledge of the 1430 path is available. See earlier note pointing out FNE on 1st & 3rd is 1431 too few.} 1433 After an idle period of more than 1 second, a re-ECN sender transport 1434 MUST set the EECN field of the packet that resumes the connection to 1435 FNE. Note that this next packet may be sent a very long time later; 1436 a packet does NOT have to be sent after 1 second of idling.
In order 1437 that the design of network policers can be deterministic, this 1438 specification deliberately puts an absolute lower limit on how long a 1439 connection can be idle before the packet that resumes the connection 1440 must be set to FNE, rather than relating it to the connection round 1441 trip time. We use the lower bound of the retransmission timeout 1442 (RTO) [RFC6298], which is commonly used as the idle period before TCP 1443 must reduce to the restart window [RFC5681]. Note our specification 1444 of re-ECN's idle period is NOT intended to change the idle period for 1445 TCP's restart, nor indeed for any other purposes. 1447 {ToDo: Describe how the sender falls back to RFC3168 modes if packets 1448 don't appear to be getting through (to work round firewalls 1449 discarding packets they consider unusual).} 1451 {ToDo: Possible future capabilities for changing Slow Start} 1453 6.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs 1455 A re-ECN sender MUST clear the RE flag to "0" and set the ECN field 1456 to Not-ECT in pure ACKs, retransmissions and window probes, as 1457 specified in [RFC3168]. Our eventual goal is for all packets to be 1458 sent with re-ECN enabled, and we believe the semantics of the ECI 1459 field go a long way towards being able to achieve this. However, we 1460 have not completed a full security analysis for these cases, 1461 therefore, currently we merely re-state current practice. 1463 We must also reconcile the facts that congestion marking is applied 1464 to packets but acknowledgements cover octet ranges and acknowledged 1465 octet boundaries need not match the transmitted boundaries. The 1466 general principle we work to is to remain compatible with TCP's 1467 congestion control which is driven by congestion events at packet 1468 granularity while at the same time aiming to blank the RE flag on at 1469 least as many octets in a flow as have been marked CE. 
1471 Therefore, a re-ECN TCP receiver MUST increment its ECC value as many 1472 times as CE marked packets have been received. And that value MUST 1473 be echoed to the sender in the first available ACK using the ECI 1474 field. This ensures the TCP sender's congestion control receives 1475 timely feedback on congestion events at the same packet granularity 1476 that they were generated on congested queues. 1478 Then, a re-ECN sender adds the difference D between its own ECC 1479 value and the incoming ECI field to a counter R. Then, R 1480 is decremented by 1 for each subsequent packet that is sent with the RE 1481 flag blanked, until R is no longer positive. Using this technique, 1482 whenever a re-ECN transport sends a packet that is not re-ECN capable (e.g. a 1483 retransmission), the remaining packets required to have the RE flag 1484 blanked will be automatically carried over to subsequent packets, 1485 through the variable R. 1487 This does not ensure precisely the same number of octets have RE 1488 blanked as were CE marked. But we believe positive errors will 1489 cancel negative over a long enough period. {ToDo: However, more 1490 research is needed to prove whether this is so. If it is not, it may 1491 be necessary to increment and decrement R in octets rather than 1492 packets, by incrementing R as the product of D and the size in octets 1493 of packets being sent (typically the MSS).} 1495 6.2. Other Transports 1497 6.2.1. General Guidelines for Adding Re-ECN to Other Transports 1499 As a general rule, Re-ECT sender transports that have established the 1500 receiver transport is at least ECN-capable (not necessarily re-ECN 1501 capable) MUST blank the RE codepoint for at least as many octets as 1502 arrive at the receiver with the CE codepoint set. Re-ECN-capable sender 1503 transports should always initialise the ECN field to the ECT(1) 1504 codepoint once a flow is established.
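As one packet-granularity illustration of this general rule, the counter R described in Section 6.1.5 for TCP might be sketched as below. This is a hedged sketch rather than a definitive implementation: the class and method names are hypothetical, the string "RE-blanked" is just an illustrative label for a re-ECN capable packet sent with the RE flag blanked, and computing the difference D modulo 8 is an assumption made here because the ECI field is only 3 bits wide and therefore wraps.

```python
ECI_MODULUS = 8  # the ECI field is 3 bits wide, so echoed counts wrap


class ReEchoDebt:
    """Tracks how many packets a re-ECN sender still owes with RE blanked."""

    def __init__(self) -> None:
        self.ecc = 0  # sender's record of the last echoed congestion count
        self.r = 0    # counter R: packets still to be sent with RE blanked

    def on_ack(self, eci: int) -> None:
        """Add the difference D between the incoming ECI and our ECC to R."""
        d = (eci - self.ecc) % ECI_MODULUS  # assumption: modulo-8 wrap
        self.ecc = eci % ECI_MODULUS
        self.r += d

    def mark_next_packet(self, re_ecn_capable: bool = True) -> str:
        """Choose the marking for the next packet to be sent."""
        if not re_ecn_capable:
            # e.g. a retransmission: sent Not-ECT, so R is carried over.
            return "Not-ECT"
        if self.r > 0:
            self.r -= 1
            return "RE-blanked"
        return "RECT"
```

For example, after an ACK whose ECI is 2 higher than the sender's ECC, a retransmission would go out as Not-ECT without reducing R, and the next two re-ECN capable packets would then be sent with RE blanked. Counting in octets instead, as the ToDo in Section 6.1.5 contemplates, would need R to be scaled by packet size.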
1506 If the sender transport does not have sufficient feedback to even 1507 estimate the path's CE rate, it SHOULD set FNE continuously. If the 1508 sender transport has some, perhaps stale, feedback to estimate that 1509 the path's CE rate is nearly definitely less than E%, the transport 1510 MAY blank RE in packets for E% of sent octets, and set the RECT 1511 codepoint for the remainder. 1513 The following sections give guidelines on how re-ECN support could be 1514 added to RSVP or NSIS, to DCCP, and to SCTP - although separate 1515 Internet drafts will be necessary to document the exact mechanics of 1516 re-ECN in each of these protocols. 1518 {ToDo: Give a brief outline of what would be expected for each of the 1519 following: 1521 o UDP fire and forget (e.g. DNS) 1523 o UDP streaming with no feedback 1525 o UDP streaming with feedback 1527 } 1529 6.2.2. Guidelines for adding Re-ECN to RSVP or NSIS 1531 A separate I-D has been submitted [I-D.re-pcn-border-cheat] 1532 describing how re-ECN can be used in an edge-to-edge rather than end- 1533 to-end scenario. It can then be used by downstream networks to 1534 police whether upstream networks are blocking new flow reservations 1535 when downstream congestion is too high, even though the congestion is 1536 in other operators' downstream networks. This relates to current 1537 IETF work on Admission Control over Diffserv using Pre-Congestion 1538 Notification (PCN) [RFC5559]. 1540 6.2.3. Guidelines for adding Re-ECN to DCCP 1542 Beside adjusting the initial features negotiation sequence, operating 1543 re-ECN in DCCP [RFC4340] could be achieved by defining a new option 1544 to be added to acknowledgments, that would include a multibit field 1545 where the destination could copy its ECC. 1547 6.2.4. Guidelines for adding Re-ECN to SCTP 1549 Appendix A in [RFC4960] gives the specifications for SCTP to support 1550 ECN. Similar steps should be taken to support re-ECN. 
Beside 1551 adjusting the initial features negotiation sequence, operating re-ECN 1552 in SCTP could be achieved by defining a new control chunk that would 1553 include a multibit field where the destination could copy its ECC. 1555 7. Incremental Deployment 1557 The design of the re-ECN protocol started from the fact that the 1558 current ECN marking behaviour of queues was sufficient and that re- 1559 feedback could be introduced around these queues by changing the 1560 sender behaviour but not the routers. Otherwise, if we had required 1561 routers to be changed, the chance of encountering a path that had 1562 every router upgraded would be vanishingly small during early 1563 deployment, giving no incentive to start deployment. Also, as there 1564 is no new forwarding behaviour, routers and hosts do not have to 1565 signal or negotiate anything. 1567 However, networks that choose to protect themselves using re-ECN do 1568 have to add new security functions at their trust boundaries with 1569 others. They distinguish legacy traffic by its ECN field. Traffic 1570 from Not-ECT transports is distinguishable by its Not-ECT marking. 1571 Traffic from RFC3168 compliant ECN transports is distinguished from 1572 re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1) 1573 for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on 1574 either 50% (the nonce) or 100% (the default) of packets, whereas re- 1575 ECN does not use ECT(0) at all. We can use this distinguishing 1576 feature of RFC3168 compliant ECN traffic to separate it out for 1577 different treatment at the various border security functions: egress 1578 dropping, ingress policing and border policing. 1580 The general principle we adopt is that an egress dropper will not 1581 drop any legacy traffic, but ingress and border policers will limit 1582 the bulk rate of legacy traffic (Not-ECT, ECT(0) and those marked 1583 with the unused codepoint) that can enter each network.
Then, during 1584 early re-ECN deployment, operators can set very permissive (or non- 1585 existent) rate-limits on legacy traffic, but once re-ECN 1586 implementations are generally available, legacy traffic can be rate- 1587 limited increasingly harshly. Ultimately, an operator might choose 1588 to block all legacy traffic entering its network, or at least only 1589 allow through a trickle. 1591 Then, the more strictly the limits are set, the more RFC3168 ECN 1592 sources will gain by upgrading to re-ECN. Thus, towards the end of 1593 the voluntary incremental deployment period, RFC3168 compliant 1594 transports can be given progressively stronger encouragement to 1595 upgrade. 1597 The following list of minor changes brings together all the points 1598 where re-ECN semantics for use of the two-bit ECN field are different 1599 compared to RFC3168:
1601 o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender 1602 sets ECT(0) by default (Section 4.3);
1604 o No provision is necessary for a re-ECN capable source transport to 1605 use the ECN nonce (Section 6.1.2.1);
1607 o Routers MAY preferentially drop different extended ECN codepoints 1608 (Section 5.3);
1610 o Packets carrying the feedback not established (FNE) codepoint MAY 1611 optionally be marked rather than dropped by routers, even though 1612 their ECN field is Not-ECT (with the important caveat in 1613 Section 5.3);
1615 o Packets may be dropped by policing nodes because of apparent 1616 misbehaviour, not just because of congestion;
1618 o Tunnel entry behaviour is still to be defined, but may have to be 1619 different from RFC3168 (Section 5.6).
1621 None of these changes REQUIRE any modifications to routers. Also 1622 none of these changes affect anything about end to end congestion 1623 control; they are all to do with allowing networks to police that end 1624 to end congestion control is well-behaved. 1626 8. Related Work 1627 8.1.
Congestion Notification Integrity 1629 The choice of two ECT code-points in the ECN field [RFC3168] 1630 permitted future flexibility, optionally allowing the sender to 1631 encode the experimental ECN nonce [RFC3540] in the packet stream. 1632 This mechanism has since been included in the specifications of DCCP 1633 [RFC4340]. 1635 {ToDo: DCCP provides nonce support - how does this affect the RFC?} 1637 The ECN nonce is an elegant scheme that allows the sender to detect 1638 if someone in the feedback loop - the receiver especially - tries to 1639 claim no congestion was experienced when in fact congestion led to 1640 packet drops or ECN marks. For each packet it sends, the sender 1641 chooses between the two ECT codepoints in a pseudo-random sequence. 1642 Then, whenever the network marks a packet with CE, if the receiver 1643 wants to deny congestion happened, she has to guess which ECT 1644 codepoint was overwritten. She has only a 50:50 chance of being 1645 correct each time she denies a congestion mark or a drop, which 1646 ultimately will give her away. 1648 The purpose of a network-layer nonce should primarily be protection 1649 of the network, while a transport-layer nonce would be better used to 1650 protect the sender from cheating receivers. Now, the assumption 1651 behind the ECN nonce is that a sender will want to detect whether a 1652 receiver is suppressing congestion feedback. This is only true if 1653 the sender's interests are aligned with the network's, or with the 1654 community of users as a whole. This may be true for certain large 1655 senders, who are under close scrutiny and have a reputation to 1656 maintain. But we have to deal with a more hostile world, where 1657 traffic may be dominated by peer-to-peer transfers, rather than 1658 downloads from a few popular sites. Often the `natural' self- 1659 interest of a sender is not aligned with the interests of other 1660 users. 
The sender often wishes to transfer data quickly to the receiver 1661 just as much as the receiver wants the data quickly. 1663 In contrast, the re-ECN protocol enables policing of an agreed rate- 1664 response to congestion (e.g. TCP-friendliness) at the sender's 1665 interface with the internetwork. It also ensures downstream networks 1666 can police their upstream neighbours, to encourage them to police 1667 their users in turn. But most importantly, it requires the sender to 1668 declare path congestion to the network and it can remove traffic at 1669 the egress if this declaration is dishonest. So it can police 1670 correctly, irrespective of whether the receiver tries to suppress 1671 congestion feedback or whether the sender ignores genuine congestion 1672 feedback. Therefore the re-ECN protocol addresses a much wider range 1673 of cheating problems, which includes the one addressed by the ECN 1674 nonce. 1676 {ToDo: Ensure we address the early ACK problem.} 1678 9. Security Considerations 1680 {ToDo: Describe attacks by networks on flows and by spoofing 1681 sources.} {ToDo: Re-ECN & DNS servers} 1683 This whole memo concerns the deployment of a secure congestion 1684 control framework. However, below we list some specific security 1685 issues that we are still working on:
1687 o Malicious users have the ability to launch dynamically changing 1688 attacks, exploiting the time it takes to detect an attack, given 1689 ECN marking is binary. We are concentrating on subtle 1690 interactions between the ingress policer and the egress dropper in 1691 an effort to make it impossible to game the system.
1693 o There is an inherent need for at least some flow state at the 1694 egress dropper given the binary marking environment, which leads 1695 to an apparent vulnerability to state exhaustion attacks. An 1696 egress dropper design with bounded flow state is being written up.
1698 o A malicious source can spoof another user's address and send 1699 negative traffic to the same destination in order to fool the 1700 dropper into sanctioning the other user's flow. To prevent or 1701 mitigate these two different kinds of DoS attack, against the 1702 dropper and against given flows, we are considering various 1703 protection mechanisms.
1705 o A malicious client can send requests using a spoofed source 1706 address to a server (such as a DNS server) that tends to respond 1707 with single packet responses. This server will then be tricked 1708 into having to set FNE on the first (and only) packet of all these 1709 wasted responses. Given packets marked FNE are worth +1, this 1710 will cause such servers to consume more of their allowance to 1711 cause congestion than they would wish to. In general, re-ECN is 1712 deliberately designed so that single packet flows have to bear the 1713 cost of not discovering the congestion state of their path. One 1714 of the reasons for introducing re-ECN is to encourage short flows 1715 to make use of previous path knowledge by moving the cost of this 1716 lack of knowledge to sources that create short flows. Therefore, 1717 in the long run we might expect services like DNS to aggregate 1718 single packet flows into connections where it brings benefits. 1719 However, this attack where DNS requests are made from spoofed 1720 addresses genuinely forces the server to waste its resources. The 1721 only mitigating feature is that the attacker has to set FNE on 1722 each of its requests if they are to get through an egress dropper 1723 to a DNS server. The attacker therefore has to consume as many 1724 resources as the victim, which at least implies re-ECN does not 1725 unwittingly amplify this attack.
1727 Having highlighted outstanding security issues, we now explain the 1728 design decisions that were taken based on a security-related 1729 rationale.
It may seem that the six codepoints of the eight made 1730 available by extending the ECN field with the RE flag have been used 1731 rather wastefully to encode just five states. In effect the RE flag 1732 has been used as an orthogonal single bit, using up four codepoints 1733 to encode the three states of positive, neutral and negative worth. 1734 The mapping of the codepoints in an earlier version of this proposal 1735 used the codepoint space more efficiently, but the scheme became 1736 vulnerable to network operators bypassing congestion penalties by 1737 focusing congestion marking on positive packets. Appendix B explains 1738 why fixing that problem while allowing for incremental deployment, 1739 would have used another codepoint anyway. So it was better to use 1740 this orthogonal encoding scheme, which greatly simplified the whole 1741 protocol and brought with it some subtle security benefits (see the 1742 last paragraph of Appendix B). 1744 With the scheme as now proposed, once the RE flag is set or cleared 1745 by the sender or its proxy, it should not be written by the network, 1746 only read. So the endpoints can detect if any network maliciously 1747 alters the RE flag. IPsec AH integrity checking does not cover the 1748 IPv4 option flags (they were considered mutable---even the one we 1749 propose using for the RE flag that was `currently unused' when IPsec 1750 was defined). But it would be sufficient for a pair of endpoints to 1751 make random checks on whether the RE flag was the same when it 1752 reached the egress as when it left the ingress. Indeed, if IPsec AH 1753 had covered the RE flag, any network intending to alter sufficient RE 1754 flags to make a gain would have focused its alterations on packets 1755 without authenticating headers (AHs). 1757 The security of re-ECN has been deliberately designed to not rely on 1758 cryptography. 1760 10. IANA Considerations 1762 This memo includes no request to IANA (yet). 
1764 If this memo were to progress to standards track, it would list:
1766 o The new RE flag in IPv4 (Section 5.1) and its extension with the 1767 ECN field to create a new set of extended ECN (EECN) codepoints;
1769 o The definition of the EECN codepoints for default Diffserv PHBs 1770 (Section 4.2);
1772 o The Hop-by-Hop option ID for the new extension header for IPv6 1773 (Section 5.2);
1775 o The new combinations of flags in the TCP header for capability 1776 negotiation (Section 6.1.3).
1778 11. Conclusions 1780 {ToDo:} 1782 12. Acknowledgements 1784 Sebastien Cazalet and Andrea Soppera contributed to the idea of re- 1785 feedback. All the following have given helpful comments: Andrea 1786 Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, 1787 Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, 1788 John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru 1789 Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd 1790 (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark 1791 Handley (who developed the attack with canceled packets), Adam 1792 Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft 1793 (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who 1794 complemented our own dummy traffic attacks with others), Liz Maida 1795 (MIT), Meral Shirazipour (Ericsson) and comments from participants in 1796 the CRN/CFP Broadband and DoS-resistant Internet working groups. A 1797 special thank you to Alessandro Salvatori for coming up with fiendish 1798 attacks on re-ECN. 1800 13. Comments Solicited 1802 Comments and questions are encouraged and very welcome. They can be 1803 addressed to the IETF Congestion Exposure (ConEx) working group's 1804 mailing list, and/or to the authors. 1806 14. References 1808 14.1. Normative References 1810 [RFC2119] Bradner, S., "Key words for use in RFCs to 1811 Indicate Requirement Levels", BCP 14, 1812 RFC 2119, March 1997.
1814 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, 1815 "The Addition of Explicit Congestion 1816 Notification (ECN) to IP", RFC 3168, 1817 September 2001. 1819 [RFC3390] Allman, M., Floyd, S., and C. Partridge, 1820 "Increasing TCP's Initial Window", 1821 RFC 3390, October 2002. 1823 [RFC4302] Kent, S., "IP Authentication Header", 1824 RFC 4302, December 2005. 1826 [RFC4340] Kohler, E., Handley, M., and S. Floyd, 1827 "Datagram Congestion Control Protocol 1828 (DCCP)", RFC 4340, March 2006. 1830 [RFC4341] Floyd, S. and E. Kohler, "Profile for 1831 Datagram Congestion Control Protocol 1832 (DCCP) Congestion Control ID 2: TCP-like 1833 Congestion Control", RFC 4341, March 2006. 1835 [RFC4342] Floyd, S., Kohler, E., and J. Padhye, 1836 "Profile for Datagram Congestion Control 1837 Protocol (DCCP) Congestion Control ID 3: 1838 TCP-Friendly Rate Control (TFRC)", 1839 RFC 4342, March 2006. 1841 [RFC4835] Manral, V., "Cryptographic Algorithm 1842 Implementation Requirements for 1843 Encapsulating Security Payload (ESP) and 1844 Authentication Header (AH)", RFC 4835, 1845 April 2007. 1847 [RFC4960] Stewart, R., "Stream Control Transmission 1848 Protocol", RFC 4960, September 2007. 1850 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and 1851 K. Ramakrishnan, "Adding Explicit 1852 Congestion Notification (ECN) Capability 1853 to TCP's SYN/ACK Packets", RFC 5562, 1854 June 2009. 1856 [RFC5681] Allman, M., Paxson, V., and E. Blanton, 1857 "TCP Congestion Control", RFC 5681, 1858 September 2009. 1860 [RFC6040] Briscoe, B., "Tunnelling of Explicit 1861 Congestion Notification", RFC 6040, 1862 November 2010. 1864 14.2. Informative References 1866 [ARI05] Adams, J., Roberts, L., and A. 1867 IJsselmuiden, "Changing the Internet to 1868 Support Real-Time Content Supply from a 1869 Large Fraction of Broadband Residential 1870 Users", BT Technology Journal 1871 (BTTJ) 23(2), April 2005. 1873 [I-D.conex-tcp-mods] Kuehlewind, M. and R. 
Scheffenegger, "TCP 1874 modifications for Congestion Exposure", 1875 draft-ietf-conex-tcp-modifications-04 1876 (work in progress), July 2013. 1878 [I-D.re-ecn-motiv] Briscoe, B., Jacquet, A., Moncaster, T., 1879 and A. Smith, "Re-ECN: A Framework for 1880 adding Congestion Accountability to 1881 TCP/IP", 1882 draft-briscoe-conex-re-ecn-motiv-02 (work 1883 in progress), July 2013. 1885 [I-D.re-pcn-border-cheat] Briscoe, B., "Emulating Border Flow 1886 Policing using Re-PCN on Bulk Data", 1887 draft-briscoe-re-pcn-border-cheat-03 (work 1888 in progress), October 2009. 1890 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1891 Davie, B., Deering, S., Estrin, D., Floyd, 1892 S., Jacobson, V., Minshall, G., Partridge, 1893 C., Peterson, L., Ramakrishnan, K., 1894 Shenker, S., Wroclawski, J., and L. Zhang, 1895 "Recommendations on Queue Management and 1896 Congestion Avoidance in the Internet", 1897 RFC 2309, April 1998. 1899 [RFC2475] Blake, S., Black, D., Carlson, M., Davies, 1900 E., Wang, Z., and W. Weiss, "An 1901 Architecture for Differentiated Services", 1902 RFC 2475, December 1998. 1904 [RFC3124] Balakrishnan, H. and S. Seshan, "The 1905 Congestion Manager", RFC 3124, June 2001. 1907 [RFC3514] Bellovin, S., "The Security Flag in the 1908 IPv4 Header", RFC 3514, April 2003. 1910 [RFC3540] Spring, N., Wetherall, D., and D. Ely, 1911 "Robust Explicit Congestion Notification 1912 (ECN) Signaling with Nonces", RFC 3540, 1913 June 2003. 1915 [RFC4301] Kent, S. and K. Seo, "Security 1916 Architecture for the Internet Protocol", 1917 RFC 4301, December 2005. 1919 [RFC5129] Davie, B., Briscoe, B., and J. Tay, 1920 "Explicit Congestion Marking in MPLS", 1921 RFC 5129, January 2008. 1923 [RFC5559] Eardley, P., "Pre-Congestion Notification 1924 (PCN) Architecture", RFC 5559, June 2009. 1926 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. 1927 Sargent, "Computing TCP's Retransmission 1928 Timer", RFC 6298, June 2011. 

   [Re-fb]                   Briscoe, B., Jacquet, A., Di Cairano-
                             Gilfedder, C., Salvatori, A., Soppera, A.,
                             and M. Koyabe, "Policing Congestion
                             Response in an Internetwork Using Re-
                             Feedback", ACM SIGCOMM CCR 35(4)277--288,
                             August 2005.

   [Savage99]                Savage, S., Cardwell, N., Wetherall, D.,
                             and T. Anderson, "TCP congestion control
                             with a misbehaving receiver", ACM SIGCOMM
                             CCR 29(5), October 1999.

   [Steps_DoS]               Handley, M. and A. Greenhalgh, "Steps
                             towards a DoS-resistant Internet
                             Architecture", Proc. ACM SIGCOMM workshop
                             on Future directions in network
                             architecture (FDNA'04) pp 49--56,
                             August 2004.

   [tcp-rcv-cheat]           Moncaster, T., Briscoe, B., and A.
                             Jacquet, "A TCP Test to Allow Senders to
                             Identify Receiver Non-Compliance",
                             draft-moncaster-tcpm-rcv-cheat-02 (work in
                             progress), November 2007.

Appendix A.  Precise Re-ECN Protocol Operation

   The protocol operation in Section 4.3 was described as an
   approximation.  In fact, standard ECN marking at a queue combines 1%
   and 2% marking into slightly less than 3% whole-path marking, because
   queues deliberately mark CE whether or not it has already been marked
   by another queue upstream.  So the combined marking fraction would
   actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

   To generalise this we will need some notation.

   o  j represents the index of each resource (typically queues) along a
      path, ranging from 0 at the first queue to n-1 at the last.

   o  m_j represents the fraction of octets to be *m*arked CE by a
      particular queue (whether or not they are already marked) because
      of congestion of resource j.

   o  u_j represents congestion signals arriving from *u*pstream of
      resource j, being the fraction of CE marking in arriving packet
      headers (before marking).

   o  p_j represents *p*ath congestion, being the fraction of packets
      arriving at resource j with the RE flag blanked (excluding
      Not-RECT packets).

   o  v_j denotes expected congestion downstream of resource j, which
      can be thought of as a *v*irtual marking fraction, being derived
      from two other marking fractions.

   Observed fractions of each particular codepoint (u, p and v) and
   queue marking rate m are dimensionless fractions, being the ratio of
   two data volumes (marked and total) over a monitoring period.  All
   measurements are in terms of octets, not packets, assuming that line
   resources are more congestible than packet processing.

   The path congestion (RE blanking fraction) set by the sender should
   reflect upstream congestion (CE marking fraction) from the viewpoint
   of the destination, which it feeds back to the sender.  Therefore in
   the steady state

      p_0 = u_n
          = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

   Similarly, at some point j in the middle of the network, given
   p = 1 - (1 - u_j)(1 - v_j), then

      v_j = 1 - (1 - p)/(1 - u_j)

          ~= p - u_j;                    if u_j << 100%

   So, between the two routers in the example in Section 4.3, congestion
   downstream is

      v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
          = 2.00%,

   or a useful approximation of downstream congestion is

      v_1 ~= 2.98% - 1.00%
           = 1.98%.

Appendix B.  Justification for Two Codepoints Signifying Zero Worth
             Packets

   It may seem a waste of a codepoint to set aside two codepoints of the
   Extended ECN field to signify zero worth (RECT and CE(0) are both
   worth zero).  The justification is subtle, but worth recording.

   The original version of Re-ECN ([Re-fb] and draft-00 of this memo)
   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
   negative (CE) packets.
   The sender set packets to neutral unless re-echoing congestion, when
   it set them positive, in much the same way that it blanks the RE flag
   in the current protocol.  However, routers were meant to mark
   congestion by setting packets negative (CE) irrespective of whether
   they had previously been neutral or positive.

   We did not arrange for senders to remember which packet had been
   sent with which codepoint, or for feedback to say exactly which
   packets arrived with which codepoints.  The transport was meant to
   inflate the number of positive packets it sent to allow for a few
   being wiped out by congestion marking.  We (wrongly) assumed that
   routers would congestion mark packets indiscriminately, so the
   transport could infer how many positive packets had been marked and
   compensate accordingly by re-echoing.  But this created a perverse
   incentive for routers to preferentially congestion mark positive
   packets rather than neutral ones.

   We could have removed this perverse incentive by requiring Re-ECN
   senders to remember which packets they had sent with which codepoint,
   and by requiring feedback from the receiver to identify which packets
   arrived as which.  Then, if a positive packet was congestion marked
   to negative, the sender could have re-echoed twice to maintain the
   balance between positive and negative at the receiver.

   Instead, we chose to make re-echoing congestion (blanking RE)
   orthogonal to congestion notification (marking CE), which required a
   second neutral codepoint.  Then the receiver would be able to detect
   and echo a congestion event even if it arrived on a packet that had
   originally been positive.
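
   The orthogonality described above can be illustrated with a small
   sketch.  It is an illustration only, not part of the specification.
   It assumes an abstract packet with two independent booleans (the
   names re_blanked and ce_marked are hypothetical, and no wire encoding
   is implied), and it assumes a worth of +1 for a blanked RE flag and
   -1 for a CE mark, which is consistent with RECT and CE(0) both being
   worth zero.  FNE and the legacy codepoints are deliberately left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    """Abstract view of the extended ECN field (not the wire encoding)."""
    re_blanked: bool  # sender-owned: True when the RE flag is blanked
    ce_marked: bool   # router-owned: True once a queue has marked CE


def codepoint(pkt: Packet) -> str:
    """Name the packet using the codepoint terms of this memo."""
    if pkt.ce_marked:
        return "CE(0)" if pkt.re_blanked else "CE(-1)"
    return "Re-Echo" if pkt.re_blanked else "RECT"


def worth(pkt: Packet) -> int:
    """Assumed worth: +1 for a blanked RE flag, -1 for a CE mark."""
    return int(pkt.re_blanked) - int(pkt.ce_marked)


def router_mark(pkt: Packet) -> Packet:
    """A congested queue sets CE and leaves the RE flag untouched."""
    return replace(pkt, ce_marked=True)


neutral = Packet(re_blanked=False, ce_marked=False)   # RECT
positive = Packet(re_blanked=True, ce_marked=False)   # Re-Echo

for sent in (neutral, positive):
    marked = router_mark(sent)
    print(codepoint(sent), worth(sent), "->", codepoint(marked), worth(marked))
```

   Under these assumptions a congestion mark on a neutral packet yields
   CE(-1) and a mark on a positive packet yields CE(0), so the receiver
   can still see the CE mark on a packet that was sent as a re-echo.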

   If we had added extra complexity to the sender and receiver
   transports to track changes to individual packets, we could have made
   it work, but then routers would have had an incentive to mark
   positive packets with half the probability of neutral packets.  That
   in turn would have led router algorithms to become more complex.
   Then senders wouldn't know whether a mark had been introduced by a
   simple or a complex router algorithm.  That in turn would have
   required another codepoint to distinguish between RFC3168 ECN and new
   Re-ECN router marking.

   Once the cost of IP header codepoint real-estate was the same for
   both schemes, there was no doubt that the simpler option for
   endpoints and for routers should be chosen.  The resulting protocol
   also no longer needed the tricky inflation/deflation complexity of
   the original (broken) scheme.  It was also much simpler to understand
   conceptually.

   A further advantage of the new orthogonal four-codepoint scheme was
   that senders owned sole rights to change the RE flag and routers
   owned sole rights to change the ECN field.  Although we still arrange
   the incentives so neither party strays outside their dominion, these
   clear lines of authority simplify the matter.

   Finally, a little redundancy can be very powerful in a scheme such as
   this.  In one flow, the proportion of packets changed to CE should be
   the same as the proportion of RECT packets changed to CE(-1) and the
   proportion of Re-Echo packets changed to CE(0).  Double checking
   using such redundant relationships can improve the security of a
   scheme (cf. double-entry book-keeping or the ECN Nonce).
   Alternatively, it might be necessary to exploit the redundancy in the
   future to encode an extra information channel.

Appendix C.  ECN Compatibility

   The rationale for choosing the particular combinations of SYN and SYN
   ACK flags in Section 6.1.3 is as follows.

   Choice of SYN flags:  A Re-ECN sender can work with RFC3168 compliant
      ECN receivers so we wanted to use the same flags as would be used
      in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same
      time, we wanted a server (host B) that is Re-ECT to be able to
      recognise that the client (A) is also Re-ECT.  We believe also
      setting NS=1 in the initial SYN achieves both these objectives, as
      it should be ignored by RFC3168 compliant ECT receivers and by
      ECT-Nonce receivers.  But senders that are not Re-ECT should not
      set NS=1.  At the time ECN was defined, the NS flag was not
      defined, so setting NS=1 should be ignored by existing ECT
      receivers (but testing against implementations may yet prove
      otherwise).  The ECN Nonce RFC [RFC3540] is silent on what the NS
      field might be set to in the TCP SYN, but we believe the intent
      was for a nonce client to set NS=0 in the initial SYN (again only
      testing will tell).  Therefore we define a Re-ECN-setup SYN as one
      with NS=1, CWR=1 & ECE=1.

   Choice of SYN ACK flags:  The client (A) needs to be able to
      determine whether the server (B) is Re-ECT.  The original ECN
      specification required an ECT server to respond to an ECN-setup
      SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There is no
      room to modify this by setting the NS flag, as that is already set
      in the SYN ACK of an ECT-Nonce server.  So we used the only
      combination of CWR and ECE that would not be used by existing TCP
      receivers: CWR=1 and ECE=0.  The original ECN specification
      defines this combination as a non-ECN-setup SYN ACK, which remains
      true for RFC3168 compliant and Nonce ECTs.  But for Re-ECN we
      define it as a Re-ECN-setup SYN ACK.
      We didn't use a SYN ACK with both CWR and ECE cleared to 0
      because that would be the likely response from most Not-ECT
      receivers.  And we didn't use a SYN ACK with both CWR and ECE set
      to 1 either, as at least one broken receiver implementation echoes
      whatever flags were in the SYN into its SYN ACK.  Therefore we
      define a Re-ECN-setup SYN ACK as one with CWR=1 & ECE=0.

   Choice of two alternative SYN ACKs:  the NS flag may take either
      value in a Re-ECN-setup SYN ACK.  Section 5.4 requires that a
      Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK
      to echo congestion experienced (CE) on the initial SYN.  Otherwise
      a Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only
      current known use of the NS flag in a SYN ACK is to indicate
      support for the ECN nonce, which will be negotiated by setting
      CWR=0 & ECE=1.  Given the ECN nonce MUST NOT be used for a RECN
      mode connection, a Re-ECN-setup SYN ACK can use either setting of
      the NS flag without any risk of confusion, because the CWR & ECE
      flags will be reversed relative to those used by an ECN nonce SYN
      ACK.

   {ToDo: include the text below, either here, or in the algorithm
   sections} At an egress dropper, well-behaved RFC3168 compliant flows
   will appear to consist mostly of ECT(0) packets, with a few CE(0)
   packets.  And, if the legacy source is setting the ECN nonce, the
   majority of packets will be an equal mix of ECT(0) and ECT(1) packets
   (the latter appearing to be Re-Echo packets in Re-ECN terms).  None
   of these three packet markings is negative, so an egress dropper can
   handle all legacy flows in bulk and, as long as they don't send any
   packets using Re-ECN markings, it need not drop any legacy packets.
   So, as soon as an ECT(0) packet is seen, its flow ID can be added to
   the set of known legacy flows (a single Bloom filter would suffice).
   But, if any packets in flows classified as RFC3168 compliant are
   marked with any other marking than the three expected, the flow can
   be removed from the RFC3168 set, to be treated in bulk with
   misbehaving Re-ECN flows, the remainder of flow IDs that require no
   flow state to be held.

   To an ingress Re-ECN policer, legacy ECN flows will appear as very
   highly congested paths.  When policers are first deployed they can be
   configured permissively, allowing through both `RFC3168' ECN and
   misbehaving Re-ECN flows.  Then, the more strictly the threshold is
   set, the more RFC3168 ECN sources will gain by upgrading to Re-ECN.
   Thus, towards the end of the voluntary incremental deployment
   period, RFC3168 transports can be given progressively stronger
   encouragement to upgrade.

Appendix D.  Packet Marking with FNE During Flow Start

   FNE (feedback not established) packets have two functions.  Their
   main role is to announce the start of a new flow when feedback has
   not yet been established.  However they also have the role of
   balancing the expected feedback and can be used where there are
   sudden changes in the rate of transmission.  Whilst this should not
   happen under TCP, their use as speculative marking underpins the
   following argument as to why the first and third packets should be
   set to FNE.

   The proportion of FNE packets in each round-trip should be a high
   estimate of the potential error in the balance of number of
   congestion marked packets versus number of re-echo packets already
   issued.

   Let's call:

      S: the number of TCP segments sent so far

      F: the number of FNE packets sent so far

      R: the number of Re-Echo packets sent so far

      A: the number of acknowledgments received so far

      C: the number of acknowledgments echoing a CE packet

   In normal operation, when we want to send packet S+1, we first need
   to check that enough Re-Echo packets have been issued:

      If R 1 FNE

   o  if the acknowledgment doesn't echo a mark

      *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

      *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

   o  if no acknowledgement for these two packets echoes a congestion
      mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source

   o  if no acknowledgement for these four packets echoes a congestion
      mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source
      could send another 8 RECT packets. ==> 8 RECT

   This behaviour happens to match TCP's congestion window control in
   slow start, which is why for TCP sources, only the first and third
   packet need be FNE packets.

   A source that would open the congestion window any quicker would have
   to insert more FNE packets.  As another example a UDP source sending
   VBR traffic might need to send several FNE packets ahead of the
   traffic peaks it generates.

Appendix E.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop (another router
   or the receiver).

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in
   [RFC3540] (currently experimental).  The DCCP specification [RFC4340]
   (currently proposed standard) requires that "Each DCCP sender SHOULD
   set ECN Nonces on its packets...".
   It also mandates as a requirement for all CCID profiles that "Any
   newly defined acknowledgement mechanism MUST include a way to
   transmit ECN Nonce Echoes back to the sender.", therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces in
      its Ack Vectors."

   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The primary function of the ECN nonce is to protect the integrity of
   the information about congestion: ECN marks and packet drops.
   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.  Similarly,
   a transport layer nonce would protect against a receiver sending
   early acknowledgements [Savage99].

   If the ECN nonce reveals integrity problems with the information
   about congestion, the sending transport can use that knowledge for
   two functions:

   o  to protect its own resources, by allocating them in proportion to
      the rates that each network path can sustain, based on congestion
      control,

   o  and to protect congested routers in the network, by drastically
      slowing down its connection to the destination with corrupt
      congestion information.

   If the sending transport chooses to act in the interests of congested
   routers, it can reduce its rate if it detects that some malicious
   party in the feedback loop may be suppressing ECN feedback.  But this
   would only be useful to congested routers when /all/ senders using
   them are trusted to act in the interests of the congested routers.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.  In that case, the nonce allows
   senders to be assured that they aren't being duped into giving more
   of their own resources to a particular flow.  And if congestion
   suppression is detected, the sending transport can rate limit the
   offending connection to protect its own resources.  Certainly, this
   is a useful function, but the IETF should carefully decide whether
   such a single, very specific case warrants IP header space.

   In contrast, Re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone - senders,
   receivers, neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.  It therefore seems prudent to give time for an
   alternative way to be found to do the one function the nonce is
   essential for.

   Moreover, while we have re-designed the Re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.
   If the ECN nonce started to see some deployment (perhaps because it
   was blessed with proposed standard status), incremental deployment of
   Re-ECN would effectively be impossible, because Re-ECN marking
   fractions at inter-domain borders would be polluted by unknown levels
   of nonce traffic.

   The authors are aware that Re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of Re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the Re-ECN protocol is sufficiently
   useful to warrant standards action.

Appendix F.  Alternative Terminology Used in Other Documents

   A number of alternative terms have been used in various documents
   describing re-feedback and re-ECN.  These are set out in the
   following table.

      +---------------------+----------------+------------------+
      | Current Terminology | EECN codepoint | Colour           |
      +---------------------+----------------+------------------+
      | Cautious            | FNE            | Green            |
      | Positive            | Re-Echo        | Black            |
      | Neutral             | RECT           | Grey             |
      | Negative            | CE(-1)         | Red              |
      | Cancelled           | CE(0)          | Red-Black        |
      | Legacy ECN          | ECT(0)         | White            |
      | Currently Unused    | --CU--         | Currently unused |
      |                     |                |                  |
      | Legacy              | Not-ECT        | White            |
      +---------------------+----------------+------------------+

                Table 7: Alternative re-ECN Terminology

Authors' Addresses

   Bob Briscoe (editor)
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   EMail: arnaud.jacquet@bt.com
   URI:

   Toby Moncaster
   Moncaster.com
   Dukes
   Layer Marney
   Colchester  CO5 9UZ
   UK

   EMail: toby@moncaster.com

   Alan Smith
   BT
   B54/76, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 640404
   EMail: alan.p.smith@bt.com