Transport Area Working Group                             B. Briscoe, Ed.
Internet-Draft                                                A. Jacquet
Intended status: Historic                                             BT
Expires: January 5, 2015                                    T. Moncaster
                                                           Moncaster.com
                                                                A.
Smith
                                                                      BT
                                                           July 04, 2014

     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-conex-re-ecn-tcp-04

Abstract

   This document introduces re-ECN (re-inserted explicit congestion
   notification), which is intended to make a simple but far-reaching
   change to the Internet architecture. The sender uses the IP header
   to reveal the congestion that it expects on the end-to-end path. The
   protocol works by arranging an extended ECN field in each packet so
   that, as it crosses any interface in an internetwork, it will carry a
   truthful prediction of congestion on the remainder of its path. It
   can be deployed incrementally around unmodified routers. The purpose
   of this document is to specify the re-ECN protocol at the IP layer
   and to give guidelines on any consequent changes required to
   transport protocols. It includes the changes required to TCP both as
   an example and as a specification. It briefly gives examples of
   mechanisms that can use the protocol to ensure data sources respond
   sufficiently to congestion, but these are described more fully in a
   companion document.

   Note concerning Intended Status: If this draft were ever published as
   an RFC it would probably have historic status. There is limited
   space in the IP header, so re-ECN had to compromise by requiring the
   receiver to be ECN-enabled otherwise the sender could not use re-ECN.
   Re-ECN was a precursor to chartering of the IETF's Congestion
   Exposure (ConEx) working group, but during chartering there were
   still too few ECN receivers enabled, therefore it was decided to
   pursue other compromises in order to fit a similar capability into
   the IP header.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 5, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Terminology
   4.  Protocol Overview
     4.1.  Simplified Re-ECN Protocol
       4.1.1.  Congestion Control and Policing the Protocol
       4.1.2.  Background and Applicability
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     4.3.  Re-ECN Protocol Operation
     4.4.  Positive and Negative Flows
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  IP in IP Tunnels
     5.7.  Non-Issues
   6.  Transport Layers
     6.1.  TCP
       6.1.1.  RECN mode: Full Re-ECN capable transport
       6.1.2.  Drops and Marks
       6.1.3.  Safety against Long Pure ACK Loss Sequences
       6.1.4.  RECN-Co mode: Re-ECT Sender with a RFC3168 compliant
               ECN Receiver
       6.1.5.  Capability Negotiation
       6.1.6.  Extended ECN (EECN) Field Settings during Flow Start
               or after Idle Periods
       6.1.7.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     6.2.  Other Transports
       6.2.1.  General Guidelines for Adding Re-ECN to Other
               Transports
       6.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
       6.2.3.  Guidelines for adding Re-ECN to DCCP
       6.2.4.  Guidelines for adding Re-ECN to SCTP
   7.  Incremental Deployment
   8.  Related Work
     8.1.  Congestion Notification Integrity
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  Justification for Two Codepoints Signifying Zero
                Worth Packets
   Appendix C.  ECN Compatibility
   Appendix D.  Packet Marking with FNE During Flow Start
   Appendix E.  Argument for holding back the ECN nonce
   Appendix F.  Alternative Terminology Used in Other Documents
   Appendix G.  Changes from previous drafts (to be removed by the
                RFC Editor)

1.  Introduction

   AUTHORS' STATEMENT (to be removed by the RFC Editor): The most
   immediate priority for the authors is to delay any move of the ECN
   nonce to Proposed Standard status, in order to leave options open for
   the future. The argument for this position is developed in
   Appendix E.

   This document provides a complete specification for the addition of
   the re-ECN protocol to IP and guidelines on how to add it to
   transport layer protocols, including a complete specification of
   re-ECN in TCP as an example. The motivation behind this proposal is
   given in [I-D.re-ecn-motiv], but we include a brief summary here.

   Re-ECN is intended to allow senders to inform the network of the
   level of congestion they expect their flows to see.
   This information
   is currently only visible at the transport layer. ECN [RFC3168]
   reveals the upstream congestion state of any path by monitoring the
   rate of CE marks. The receiver then informs the sender when they
   have seen a marked packet. Re-ECN builds on ECN by providing new
   codepoints that allow the sender to declare the level of congestion
   they expect on the forward path. It is closely related to ECN and
   indeed we define a compatibility mode to allow a re-ECN sender to
   communicate with an ECN receiver.

   If a sender understates expected congestion compared to actual
   congestion then the network could discard packets or enact some other
   sanction. A policer can also be introduced at the ingress of
   networks that can limit the level of congestion being caused.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it. But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   It is important to add a few key points.

   o  In any standard network it always takes one round trip before any
      feedback is received. For this reason a sender must make a
      conservative prediction by transmitting IP packets with a special
      Cautious marking when it is unsure of the state of the network.

   o  It should be noted that the prediction is carried in-band in
      normal data packets and for many transports feedback can be
      carried in the normal acknowledgements or control packets.

   o  The re-ECN protocol is independent of the transport.
      In TCP,
      acknowledgments are used to convey the feedback from receiver to
      sender. This memo concentrates on TCP as an example transport
      protocol; however, the re-ECN protocol is compatible with any
      transport where feedback can be sent from receiver to sender.

   This document is structured as follows. First an overview of the
   re-ECN protocol is given (Section 4), outlining its attributes and
   explaining conceptually how it works as a whole. The two main parts
   of the document follow: the protocol specification, divided
   into network (Section 5) and transport (Section 6) layers.
   Deployment issues discussed throughout the document are brought
   together in Section 7. Related work is discussed in Section 8.

2.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Terminology

   {ToDo: No attempt has been made to bring terminology into line with
   that agreed within the ConEx working group. For instance the term
   dropper remains unchanged, even though the ConEx w-g has decided to
   call it an audit function (which is actually a much better term).}

   The following terminology is used throughout this memo. Some of this
   terminology has changed as this draft has been revised. Therefore,
   to help avoid confusion, Appendix F sets out all the alternative
   terminology that has been used in other re-ECN related documents.

   o  Neutral packet - a packet that is able to be congestion marked by
      an ECN or re-ECN queue.

   o  Negative packet - a Neutral packet that has been congestion marked
      by an ECN or re-ECN queue.

   o  Positive packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path.
      In
      general Positive packets should only be sent in response to
      feedback received from the receiver.*

   o  Cancelled packet - a Positive Packet that has been congestion
      marked by an ECN or re-ECN queue.

   o  Cautious packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path. In
      general Cautious packets should be used when there is insufficient
      feedback to be confident about the congestion state of the
      network.*

      * The difference between Positive and Cautious packets is
      explained in detail later in the document along with guidelines on
      the use of Cautious packets.

   All the above terms have related IP codepoints as defined in
   Section 5.

4.  Protocol Overview

4.1.  Simplified Re-ECN Protocol

   We describe here the simplified re-ECN protocol. To simplify the
   description we assume packets and segments are synonymous.

   Packets are sent from a sender to a receiver. In Figure 1 the queues
   (Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168]. If congestion
   occurs then packets are marked with the congestion experienced (CE)
   flag exactly as in the ECN protocol [RFC3168]; the routers do not
   need to be modified and do not need to know the re-ECN protocol. The
   receiver constantly informs the sender of the current count of
   Negative packets it has seen. The sender uses this information to
   determine how many Positive packets it must send into the network.
   The sender's aim is to balance the number of bytes that have been
   congestion marked with the number of Positive bytes it has sent.

               +--------- Feedback----------+
               |                            |
               v                            |
             +---+    +----+    +----+    +---+
             |   |    |    |    |    |    |   |
             | S |--->| Q1 |--->| Q2 |--->| R |
             |   |    |    |    |    |    |   |
             +---+    +----+    +----+    +---+

                        Figure 1: Simple Re-ECN

4.1.1.
Congestion Control and Policing the Protocol

   The arrangement of the protocol ensures that packets carry a
   declaration of the amount of congestion that will be experienced on
   the path. The re-ECN protocol is orthogonal to any congestion
   control algorithms, but can be used to ensure that congestion control
   is being applied by the sender.

   In general we assume that there will be a policer at the network
   ingress which can rate limit traffic based on the amount of
   congestion declared.

   At the network egress there is a dropper which can impose sanctions
   on flows that incorrectly declare congestion.

   Policers and droppers are explained in more detail in
   [I-D.re-ecn-motiv].

4.1.2.  Background and Applicability

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion. Re-ECN is not a new congestion control protocol, rather
   it is orthogonal to congestion control itself. Re-ECN is concerned
   with revealing information about congestion so that users and
   networks can be held accountable for the congestion they cause, or
   allow to be caused.

   Re-ECN builds on ECN so we briefly recap the essentials of the ECN
   protocol [RFC3168]. Two bits in the IP header (v4 or v6) are
   assigned to the ECN field. The sender clears the field to "00"
   (Not-ECT) if either end-point transport is not ECN-capable. Otherwise
   it indicates an ECN-capable transport (ECT) using either of the two
   code-points "10" or "01" (ECT(0) and ECT(1) resp.).

   ECN-capable queues probabilistically set this field to "11" if
   congestion is experienced (CE). In general this marking probability
   will increase with the length of the queue at its egress link
   (typically using the RED algorithm [RFC2309]). However, they still
   drop rather than mark Not-ECT packets.
   With multiple ECN-capable
   queues on a path, a flow of packets accumulates the fraction of CE
   marking that each queue adds. The combined effect of the packet
   marking of all the queues along the path signals congestion of the
   whole path to the receiver. So, for example, if one queue early in a
   path is marking 1% of packets and another later in a path is marking
   2%, flows that pass through both queues will experience approximately
   3% marking (see Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback. But Section 8.1 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback. On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called
   `re-feedback' [Re-fb]. The word is short for either receiver-aligned,
   re-inserted or re-echoed feedback. But it actually works even when
   no feedback is available. In fact it has been carefully designed to
   work for single datagram flows. It also encourages aggregation of
   single packet flows by congestion control proxies. Then, even if the
   traffic mix of the Internet were to become dominated by short
   messages, it would still be possible to control congestion
   effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval. But re-ECN can be deployed incrementally at
   the transport layer around unmodified queues using existing fields in
   IP (v4 or v6).
   However, it does also require the last undefined bit
   in the IPv4 header, which it uses in combination with the 2-bit ECN
   field to create four new codepoints. Nonetheless, we RECOMMEND
   adding optional preferential drop to IP queues based on the re-ECN
   fields in order to improve resilience against DoS attacks.
   Similarly, re-ECN works best if both the sender and receiver
   transports are re-ECN-capable, but it can work with just sender
   support (Section 6.1.4).

4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7). This specification
   defines a new re-ECN extension (RE) flag. We will defer the
   definition of the actual position of the RE flag in the IPv4 & v6
   headers until Section 5. When we don't need to choose between IPv4
   and v6 wire protocols it will suffice to call it the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and SHOULD remain unchanged along the path, although it can be read
   by network elements that understand the re-ECN protocol. It is
   feasible that a network element MAY change the setting of the RE
   flag, perhaps acting as a proxy for an end-point, but such a protocol
   would have to be defined in another specification
   (e.g. [I-D.re-pcn-border-cheat]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) giving eight
   codepoints. We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   One of re-ECN's codepoints is an alternative use of the codepoint set
   aside in RFC3168 for the ECN nonce (ECT(1)). Transports using re-ECN
   do not need to use the ECN nonce as long as the sender is also
   checking for transport protocol compliance [tcp-rcv-cheat]. The case
   for doing this is given in Appendix E. Two re-ECN codepoints are
   given compatible uses to those defined in RFC3168 (Not-ECT and CE).
   The other codepoint used by RFC3168 (ECT(0)) isn't used for re-ECN.
   Altogether this leaves one codepoint of the eight unused by ECN or
   re-ECN and available for future use.

   +--------+-------------+-------+-----------+------------------------+
   | ECN    | RFC3168     | RE    | EECN      | re-ECN meaning         |
   | field  | codepoint   | flag  | codepoint |                        |
   +--------+-------------+-------+-----------+------------------------+
   | 00     | Not-ECT     | 0     | Not-ECT   | Not re-ECN-capable     |
   |        |             |       |           | transport (Legacy)     |
   | 00     | ---         | 1     | FNE       | Feedback not           |
   |        |             |       |           | established (Cautious) |
   | 01     | ECT(1)      | 0     | Re-Echo   | Re-echoed congestion   |
   |        |             |       |           | and RECT (Positive)    |
   | 01     | ---         | 1     | RECT      | Re-ECN capable         |
   |        |             |       |           | transport (Neutral)    |
   | 10     | ECT(0)      | 0     | ECT(0)    | RFC3168 ECN use only   |
   | 10     | ---         | 1     | --CU--    | Currently unused       |
   | 11     | CE          | 0     | CE(0)     | Re-Echo cancelled by   |
   |        |             |       |           | CE (Cancelled)         |
   | 11     | ---         | 1     | CE(-1)    | Congestion Experienced |
   |        |             |       |           | (Negative)             |
   +--------+-------------+-------+-----------+------------------------+

                    Table 1: Extended ECN Codepoints

4.3.  Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections. Other transports will be discussed later.
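
   As an illustration only, the codepoint assignments of Table 1 can be
   sketched as a lookup keyed on the 2-bit ECN field and the RE flag.
   The names EecnCodepoint, EECN_TABLE, classify and ce_mark are
   hypothetical and not part of this specification. The ce_mark helper
   assumes an unmodified RFC 3168 queue, which changes only the ECN
   field and treats an ECN field of "00" as Not-ECT.

```python
from typing import Dict, NamedTuple, Tuple


class EecnCodepoint(NamedTuple):
    name: str  # EECN codepoint name from Table 1
    term: str  # short term given in Table 1's "re-ECN meaning" column


# Keys are (2-bit ECN field, RE flag), transcribed from Table 1.
EECN_TABLE: Dict[Tuple[int, int], EecnCodepoint] = {
    (0b00, 0): EecnCodepoint("Not-ECT", "Legacy"),
    (0b00, 1): EecnCodepoint("FNE", "Cautious"),
    (0b01, 0): EecnCodepoint("Re-Echo", "Positive"),
    (0b01, 1): EecnCodepoint("RECT", "Neutral"),
    (0b10, 0): EecnCodepoint("ECT(0)", "RFC3168 ECN use only"),
    (0b10, 1): EecnCodepoint("--CU--", "Currently unused"),
    (0b11, 0): EecnCodepoint("CE(0)", "Cancelled"),
    (0b11, 1): EecnCodepoint("CE(-1)", "Negative"),
}


def classify(ecn: int, re_flag: int) -> EecnCodepoint:
    """Look up the extended ECN codepoint for an ECN field and RE flag."""
    if ecn not in (0, 1, 2, 3) or re_flag not in (0, 1):
        raise ValueError("ecn must be 0..3 and re_flag must be 0 or 1")
    return EECN_TABLE[(ecn, re_flag)]


def ce_mark(ecn: int, re_flag: int) -> Tuple[int, int]:
    """Model an unmodified RFC 3168 queue congestion marking a packet.

    The queue sets the ECN field to CE ("11") and leaves the RE flag
    alone. Assumption of this sketch: a packet with ECN field "00" is
    seen as Not-ECT and would be dropped rather than marked.
    """
    if ecn == 0b00:
        raise ValueError("ECN field 00 is dropped, not marked")
    return 0b11, re_flag


if __name__ == "__main__":
    for (ecn, re_flag), cp in EECN_TABLE.items():
        print(f"ECN={ecn:02b} RE={re_flag} -> {cp.name} ({cp.term})")
```

   Marking a Neutral packet this way yields a Negative one, and marking
   a Positive packet yields a Cancelled one, matching the terminology
   of Section 3.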

   {ToDo: This section to be updated to explain that the sender
   re-echoes losses in the same way as ECN markings.}

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol. Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sends Cautious packets by setting the
   FNE codepoint. The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once feedback from an ECN or re-ECN capable flow is
   established, a re-ECN sender always initialises the ECN field to
   ECT(1). And it usually sets the RE flag to "1" indicating a Neutral
   packet. Whenever a queue marks a packet to CE, the receiver feeds
   back this event to the sender. On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends (indicating a Positive packet).

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7). To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0". So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

       +---+  +----+                +----+  +---+
       | S |--| Q1 |----------------| Q2 |--| R |
       +---+  +----+                +----+  +---+
         .      .                      .      .
       ^ .      .                      .      .
       | .      .                      .      .
       | .     RE blanking fraction    .      .
    3% |-------------------------------+=======
       | .      .                      |      .
    2% | .      .                      |      .
       | .      .  CE marking fraction |      .
    1% | .      +----------------------+      .
       | .      |                      .      .
    0% +--------------------------------------->
            ^           ^                     ^
            L           M                     N    Observation points

                Figure 2: A 2-Queue Example (Imprecise)

   Figure 2 uses a simple network to illustrate how re-ECN allows queues
   to measure downstream congestion. The receiver views a CE marking
   fraction of 3% which is fed back to the sender. The sender sets an
   RE blanking fraction of 3% to match this. This RE blanking fraction
   can be observed along the path as the RE flag is not changed by
   network nodes once set by the sender. This is shown by the
   horizontal line at 3% in the figure. The CE marked fraction is shown
   by the stepped line which rises to meet the RE blanking fraction line
   with steps at each queue where packets are marked. Two queues are
   shown (Q1 and Q2) that are currently congested. Each time packets
   pass through one of these queues a fraction of them is marked (1% at
   Q1 and 2% at Q2). The
   approximate downstream congestion can be measured at the observation
   points shown along the path by subtracting the CE marking fraction
   from the RE blanking fraction, as shown in the table below
   (Appendix A derives these approximations from a precise analysis).
   NB: due to the unary nature of ECN marking and the equivalent unary
   nature of re-ECN blanking, the precise fraction of marked bytes must
   be calculated by maintaining a moving average of the number of
   packets that have been marked as a proportion of the total number of
   packets.

   Along the path the fraction of packets that had their RE field
   cleared remains unchanged so it can be used as a reference against
   which to compare upstream congestion. The difference predicts
   downstream congestion for the rest of the path. Therefore, measuring
   the fractions of each codepoint at any point in the Internet will
   reveal upstream, downstream and whole path congestion.

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration.
   We are not saying any protocol
   handler will work with these average fractions directly. In fact the
   protocol actually requires the number of marked and blanked bytes to
   balance by the time the packet reaches the receiver.

4.4.  Positive and Negative Flows

   In Section 3 we introduced the terms Positive, Neutral, Negative,
   Cautious and Cancelled. This terminology is based on the requirement
   to balance the proportion of bytes marked as CE with the proportion
   of bytes that are re-echo marked. In the rest of this memo we will
   loosely talk of positive or negative flows, meaning flows where the
   moving average of the downstream congestion metric is persistently
   positive or negative. A negative flow is one where more CE marked
   packets than re-ECN blanked packets arrive. Likewise in positive
   flows more re-ECN blanked packets arrive than CE marked packets. The
   notion of a negative metric arises because it is derived by
   subtracting one metric from another. Of course actual downstream
   congestion cannot be negative, only the metric can (whether due to
   time lags or deliberate malice).

   Therefore we will talk of packets having `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   the downstream congestion metric. The worth of each type of packet
   is given below in Table 2. The idea is that most flows start with
   zero worth. Every time the network decrements the worth of a packet,
   the sender increments the worth of a later packet. Then, over time,
   as many positive octets should arrive at the receiver as negative.

   Note we have said octets not packets, so if packets are of different
   sizes, the worth should be incremented on enough octets to balance
   the octets in negative packets arriving at the receiver. It is this
   balance that will allow the network to hold the sender accountable
   for the congestion it causes.
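
   The octet-weighted balance described above can be sketched as
   follows. This is a minimal illustration, not a specified algorithm:
   the names WORTH and flow_balance are hypothetical, the worth values
   are those of Table 2, and skipping codepoints whose worth Table 2
   leaves as "..." is an assumption of the sketch.

```python
from typing import Iterable, Tuple

# Worth per extended ECN codepoint, as listed in Table 2. Codepoints
# whose worth the table leaves as "..." are deliberately absent.
WORTH = {
    "FNE": +1,      # Cautious
    "Re-Echo": +1,  # Positive
    "CE(0)": 0,     # Cancelled
    "RECT": 0,      # Neutral
    "CE(-1)": -1,   # Negative
}


def flow_balance(packets: Iterable[Tuple[str, int]]) -> int:
    """Sum worth * size over (codepoint name, size in octets) pairs.

    Assumption of this sketch: packets carrying a codepoint that is
    not in WORTH are skipped rather than counted.
    """
    return sum(WORTH[name] * size for name, size in packets if name in WORTH)


if __name__ == "__main__":
    # A Cautious first packet, one congestion mark, one re-echo.
    arrivals = [
        ("FNE", 60),
        ("RECT", 1500),
        ("CE(-1)", 1500),
        ("RECT", 1500),
        ("Re-Echo", 1500),
    ]
    print(flow_balance(arrivals))
```

   A persistently negative result over a moving window would correspond
   to what this memo loosely calls a negative flow. A real dropper
   would use a moving average rather than a running total.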

   If a packet carrying re-echoed congestion happens to also be
   congestion marked, the +1 worth added by the sender will be cancelled
   out by the -1 network congestion marking. Although the two worth
   values correctly cancel out, neither the congestion marking nor the
   re-echoed congestion is lost, because the RE bit and the ECN field
   are orthogonal. So, whenever this happens, the receiver will
   correctly detect and re-echo the new congestion event as well.

   The table below specifies unambiguously the worth of each extended
   ECN codepoint. Note the order is different from the previous table
   to better show how the worth increments and decrements.

   +---------+-------+---------------+-------+-------------------------+
   | ECN     | RE    | Extended ECN  | Worth | Re-ECN Term             |
   | field   | bit   | codepoint     |       |                         |
   +---------+-------+---------------+-------+-------------------------+
   | 00      | 0     | Not-RECT      | ...   | ---                     |
   | 00      | 1     | FNE           | +1    | Cautious                |
   | 01      | 0     | Re-Echo       | +1    | Positive                |
   | 10      | 0     | Legacy        | ...   | RFC3168 ECN use only    |
   | 11      | 0     | CE(0)         | 0     | Cancelled               |
   | 01      | 1     | RECT          | 0     | Neutral                 |
   | 10      | 1     | --CU--        | ...   | Currently unused        |
   | 11      | 1     | CE(-1)        | -1    | Negative                |
   +---------+-------+---------------+-------+-------------------------+

              Table 2: 'Worth' of Extended ECN Codepoints

5.  Network Layer

5.1.  Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168]. However, an extension to the ECN field we
   call the RE (Re-ECN extension) flag (Section 4.2) is defined in this
   document. It doubles the extended ECN codepoint space, giving 8
   potential codepoints. The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7 collects together and summarises all the changes
   defined in this document).

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0). Alternatively, some would call this
   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
   header (Figure 3).

                                 0   1   2
                               +---+---+---+
                               | R | D | M |
                               | E | F | F |
                               +---+---+---+

   Figure 3: New Definition of the Re-ECN Extension (RE) Control Flag at
                 the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 4
   and specified fully in Section 6. The RE flag is always considered
   in conjunction with the 2-bit ECN field, as if they were concatenated
   together to form a 3-bit extended ECN field. If the ECN field is set
   to either the ECT(1) or CE codepoint, when the RE flag is blanked
   (cleared to "0") it represents a re-echo of congestion experienced by
   an earlier packet. If the ECN field is set to the Not-ECT codepoint,
   when the RE flag is set to "1" it represents the feedback not
   established (FNE) codepoint, which signals that the packet was sent
   without the benefit of congestion feedback.

   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow. For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted. It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks. This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS]. The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e. one occurrence of the flag is sufficient but
   further occurrences achieve the same effect if previous ones were
   lost).
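
   The bit positions described for IPv4 can be illustrated with a short
   sketch that reads and writes the extended ECN field in a raw header.
   The function names read_eecn and write_eecn are hypothetical. The
   sketch assumes the ECN field occupies the two low-order bits of the
   second header octet, as in RFC 3168, and it does not recompute the
   header checksum.

```python
from typing import Tuple

ECN_MASK = 0b0000_0011  # low two bits of header octet 1 (counting from 0)
RE_MASK = 0b1000_0000   # bit 48 of the header: top bit of octet 6 (from 0)


def read_eecn(ipv4_header: bytes) -> Tuple[int, int]:
    """Return (ECN field, RE flag) from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets")
    ecn = ipv4_header[1] & ECN_MASK
    re_flag = 1 if ipv4_header[6] & RE_MASK else 0
    return ecn, re_flag


def write_eecn(ipv4_header: bytes, ecn: int, re_flag: int) -> bytes:
    """Return a copy of the header with the ECN field and RE flag set.

    The DF and MF flags and the fragment offset are preserved. The
    header checksum is NOT updated by this sketch.
    """
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets")
    if ecn not in (0, 1, 2, 3) or re_flag not in (0, 1):
        raise ValueError("ecn must be 0..3 and re_flag must be 0 or 1")
    buf = bytearray(ipv4_header)
    buf[1] = (buf[1] & ~ECN_MASK & 0xFF) | ecn
    buf[6] = (buf[6] & ~RE_MASK & 0xFF) | (re_flag << 7)
    return bytes(buf)


if __name__ == "__main__":
    # Minimal 20-octet header with only version/IHL and DF set.
    header = bytes([0x45, 0x00]) + bytes(4) + bytes([0x40, 0x00]) + bytes(12)
    neutral = write_eecn(header, 0b01, 1)  # ECT(1) with RE set
    print(read_eecn(neutral))
```

   Here writing ECN "01" with the RE flag set produces a Neutral packet
   in the terms of Section 3, and clearing the RE flag on the same
   header would produce a Positive one.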

   There will probably be other claims pending on the use of bit 48.
   We know of at least two [ARI05], [RFC3514], but neither has been
   pursued in the IETF so far, although the present proposal would
   meet the needs of the latter.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately. The present proposal is backward compatible with
   RFC3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should. So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2.  Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 4).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                     Reserved for future use                 |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 4: Definition of a New IPv6 Congestion Hop by Hop Option
         Header containing the re-ECN Extension (RE) Control Flag

                            0 1 2 3 4 5 6 7
                           +-+-+-+-+-+-+-+-+
                           |AIU|C|Option ID|
                           +-+-+-+-+-+-+-+-+

          Figure 5: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes. For
   re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the
   Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header. But
   routers with these optional re-ECN features or a re-ECN policing
   function will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC4302]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.
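
   The encoding in Figures 4 and 5 can be sketched as follows. This is
   a minimal, non-normative illustration: no Option ID has been assigned
   for the Congestion option, so HYPOTHETICAL_OPTION_ID is a placeholder
   chosen purely for the example, and the function names are likewise
   assumptions. The Hdr Ext Len of 0 reflects the usual IPv6 convention
   that the length is counted in 8-octet units beyond the first 8 octets.

```python
# Placeholder only: no Option ID has been assigned for the Congestion option.
HYPOTHETICAL_OPTION_ID = 0b11110


def option_type(aiu: int = 0b00, c: int = 1,
                option_id: int = HYPOTHETICAL_OPTION_ID) -> int:
    """Pack the Option Type octet of Figure 5: AIU (2 bits), C (1 bit), ID (5 bits)."""
    if not (0 <= aiu <= 3 and c in (0, 1) and 0 <= option_id <= 31):
        raise ValueError("field out of range")
    return (aiu << 6) | (c << 5) | option_id


def congestion_hbh_header(next_header: int, re_flag: int) -> bytes:
    """Build the 8-octet hop-by-hop header of Figure 4 carrying the RE flag."""
    if re_flag not in (0, 1):
        raise ValueError("RE flag is a single bit")
    # RE is the first bit of the 4-octet Option Data; the rest is reserved (zero).
    option_data = bytes([re_flag << 7, 0, 0, 0])
    hdr_ext_len = 0  # whole header is 8 octets, so no additional 8-octet units
    return bytes([next_header, hdr_ext_len, option_type(), 4]) + option_data


# Example: a header preceding TCP (next header 6) with the RE flag set.
print(congestion_hbh_header(6, 1).hex())  # prints 00... see note below
```

   With the placeholder Option ID above, the Option Type octet evaluates
   to 0x3e (AIU "00", C "1"), so the example prints 06003e0480000000.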

   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. Therefore, as
   changes to the RE flag could be detected end-to-end without
   authentication (see Section 9), we set the C flag to '1'.

5.3.  Router Forwarding Behaviour

   {ToDo: Consider a section on how whole protocol interworks with drop.
   Perhaps in Protocol Overview.}

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168]. Specifications for PHBs MAY define
   different forwarding behaviours from this default, but this is not
   required. [I-D.re-pcn-border-cheat] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped. Note that the FNE
      codepoint has been intentionally chosen so that, to RFC3168
      compliant routers (which do not inspect the RE flag) an FNE packet
      appears to be Not-ECT so it will be dropped by legacy AQM
      algorithms.

      A network operator MUST NOT configure a queue to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in [I-D.re-ecn-motiv] would count as rate
      limiters for this purpose.

   Preferential Drop: If a re-ECN capable router queue experiences very
      high load so that it has to drop arriving packets (e.g. a DoS
      attack), it MAY preferentially drop packets within the same
      Diffserv PHB using the preference order for extended ECN
      codepoints given in Table 3. Preferential dropping can be
      difficult to implement on some hardware, but if feasible it would
      discriminate against attack traffic if done as part of the overall
      policing framework of [I-D.re-ecn-motiv]. If nowhere else,
      routers at the egress of a network SHOULD implement preferential
      drop (stronger than the MAY above). For simplicity, preferences 4
      & 5 MAY be merged into one preference level.

      The tabulated drop preferences are arranged to preserve packets
      with more positive worth (Section 4.4), given senders of positive
      packets must have honestly declared downstream congestion. A full
      treatment of this is provided in the companion document describing
      the motivation and architecture for re-ECN [I-D.re-ecn-motiv]
      particularly when the application of re-ECN to protect against
      DDoS attacks is described.
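
   The preferential drop rule can be illustrated with a short,
   non-normative sketch. It encodes the drop preferences of Table 3
   (1 = drop first) and picks which queued packet to discard under
   overload. Two points are assumptions of this example rather than
   statements of the specification: the "5/4" entry for Re-Echo is read
   as preference 5 that collapses to 4 when levels 4 and 5 are merged,
   and ties are broken in favour of the packet nearest the head of the
   queue. DROP_PREF and pick_victim are hypothetical names.

```python
# Drop preference per extended ECN codepoint, from Table 3 (1 = drop first).
DROP_PREF = {
    "Re-Echo": 5,   # tabulated as "5/4"
    "FNE": 4,
    "CE(0)": 3,
    "RECT": 3,
    "CE(-1)": 3,
    "--CU--": 2,
    "Legacy": 2,
    "Not-RECT": 1,
}


def pick_victim(codepoints, merge_4_and_5=False):
    """Return the index of the queued packet to drop first.

    `codepoints` lists the extended ECN codepoint of each queued packet
    within one Diffserv PHB. Ties go to the earliest packet (an assumption).
    """
    if not codepoints:
        raise ValueError("queue is empty")

    def pref(codepoint):
        level = DROP_PREF[codepoint]
        return 4 if merge_4_and_5 and level == 5 else level

    return min(range(len(codepoints)), key=lambda i: pref(codepoints[i]))


queue = ["RECT", "Re-Echo", "Not-RECT", "FNE"]
print(pick_victim(queue))  # prints 2
```

   In the example the Not-RECT packet at index 2 is chosen, since it
   carries the lowest preference number in the queue.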

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo canceled  |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | RFC3168 ECN use   |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not Re-ECN-       |
   |       |     |            |       |            | capable transport |
   +-------+-----+------------+-------+------------+-------------------+

     Table 3: Drop Preference of EECN Codepoints (Sorted by `Worth')

5.4.  Justification for Setting the First SYN to FNE

   The initial SYN MUST be set to FNE by Re-ECT client A (Section 6.1.6),
   and Section 5.3 says a queue MAY optionally treat an FNE packet as
   ECN capable, so an initial SYN may be marked CE(-1) rather than
   dropped. This seems dangerous, because the sender has not yet
   established whether the receiver is an RFC3168 one that does not
   understand congestion marking. It also seems to allow malicious
   senders to take advantage of ECN marking to avoid so much drop when
   launching SYN flooding attacks. Below we explain the features of the
   protocol design that remove both these dangers.

   ECN-capable initial SYN with a Not-ECT server: If the TCP server B
      is re-ECN capable, provision is made for it to feed back a possible
      congestion marked SYN in the SYN ACK (Section 6.1.6).
      But if the TCP client A finds out from the SYN ACK that the server
      was not ECN-capable, the TCP client MUST conservatively consider
      the first SYN as congestion marked before setting itself into
      Not-ECT mode. Section 6.1.6 mandates that such a TCP client MUST
      also set its initial window to 1 segment. In this way we remove
      the need to cautiously avoid setting the first SYN to Not-RECT.
      This will give worse performance while deployment is patchy, but
      better performance once deployment is widespread.

   SYN flooding attacks can't exploit ECN-capability: Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited somewhere. Introduction of the FNE codepoint was a
      deliberate move to enable transport-neutral handling of flow-start
      and flow state set-up in the IP layer where it belongs. It then
      becomes possible to protect against flooding attacks of all forms
      (not just SYN flooding) without transport-specific inspection for
      things like the SYN flag in TCP headers. Then, for instance, SYN
      flooding attacks using IPsec ESP encryption can also be rate
      limited at the IP layer.

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

   {ToDo: Give alternative where initial packet is Not-RECT and last ACK
   of three-way handshake is FNE.
   Explain this will give better performance while deployment is patchy,
   but worse performance once deployment is high.}

5.5.  Control and Management

5.5.1.  Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops caused by the
   dropper from drops with other causes, such as congestion or
   transmission losses. The dropper would send the message to the
   sender of the flow, not the receiver. If we did define this message
   type, it would be REQUIRED for all re-ECT senders to parse and
   understand it. Note that a sender MUST only use this message to
   explain why losses are occurring. A sender MUST NOT take this
   message to mean that losses have occurred that it was not aware of.
   Otherwise, spoof messages could be sent by malicious sources to slow
   down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2.  Rate Response Control

   As discussed in [I-D.re-ecn-motiv] the sender's access operator will
   be expected to use bulk per-user policing, but they might choose to
   introduce a per-flow policer.
   In cases where operators do introduce per-flow policing, there may be
   a need for a sender to send a request to the ingress policer asking
   for permission to apply a non-default response to congestion (where
   TCP-friendly is assumed to be the default). This would require the
   sender to know what message format(s) to use and to be able to
   discover how to address the policer. The required control
   protocol(s) are outside the scope of this document, but will require
   definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6.  IP in IP Tunnels

   Ideally, for re-ECN to work through IP in IP tunnels, the tunnel
   entry should copy both the RE flag and the ECN field from the inner
   to the outer IP header. Then at the tunnel exit, any CE marking of
   the outer ECN field should overwrite the inner ECN field (unless the
   inner field is Not-ECT in which case an alarm should be raised). The
   RE flag shouldn't change along a path, so the outer RE flag should be
   the same as the inner. If it isn't, a management alarm should be
   raised.

   This requirement is satisfied by the latest specification for
   handling ECN through IP tunnels [RFC6040] as well as by IPsec
   [RFC4301]. However, it is not satisfied by the ingress behaviour
   specified in [RFC3168] although at least the full-functionality
   variant of the egress behaviour is fine.
   RFC6040 updates RFC3168, but it is likely that many legacy non-IPsec
   IP-in-IP tunnels will exist.

   If legacy tunnels are left as specified in [RFC3168], whether the
   limited or full-functionality variant is used, a problem arises with
   re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be RFC3168 compliant traffic, and therefore may
   be wrongly rate limited. In a full-functionality ECN tunnel, the
   result will depend on whether the tunnel entry copies the inner RE
   flag to the outer header or the RE flag in the outer header is always
   cleared. If the former, the flow will tend to be too positive when
   accounted for at borders. If the latter, it will be too negative.
   If the rules set out in [RFC6040] are followed then this will not be
   an issue.

5.7.  Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o  Various link layers support explicit congestion notification, such
      as Frame Relay and ATM. Explicit congestion notification is
      proposed to be added to other link layers, such as Ethernet
      (802.3ar Ethernet congestion management) and MPLS [RFC5129];

   o  Encryption and IPsec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases).
   Given the RE flag is not intended to change along the path, this
   means that downstream congestion will still be measurable at any
   point where IP is processed on the path by subtracting negative from
   positive markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to burrow
   into inner headers. Obfuscation of flow identifiers is not a problem
   for re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid; it only requires them to be unique. So if
   an IPsec encapsulating security payload (ESP [RFC4835]) or an
   authentication header (AH [RFC4302]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are. The alternative would be for the sender to make each packet
   appear to be a new flow, which would require them all to be marked
   FNE in order to avoid being treated with the bulk of malicious flows
   at the egress dropper. Given the FNE marking is worth +1 and
   networks are likely to rate limit FNE packets, endpoints are given an
   incentive not to set FNE on each packet.
   But if the sender really does want to hide the flow relationship
   between packets it can choose to pay the cost of multiple FNE
   packets, which in the long run will compensate for the extra memory
   required on network policing elements to process each flow.

   {ToDo: Add a note about it being useful that the AH header does not
   cover the RE flag, referring to Section 9.}

6.  Transport Layers

6.1.  TCP

   Re-ECN capability at the sender is essential. At the receiver it is
   optional, as long as the receiver has a basic RFC3168-compliant ECN-
   capable transport (ECT) [RFC3168]. Given re-ECN is not the first
   attempt to define the semantics of the ECN field, we give a table
   below summarising what happens for various combinations of
   capabilities of the sender S and receiver R, as indicated in the
   first four columns below. The last column gives the mode a half-
   connection should be in after the first two of the three TCP
   handshakes.

   +--------+--------------+------------+---------+--------------------+
   | Re-ECT | ECT-Nonce    | ECT        | Not-ECT | S-R Half-          |
   |        | (RFC3540)    | (RFC3168)  |         | connection Mode    |
   +--------+--------------+------------+---------+--------------------+
   | SR     |              |            |         | RECN               |
   | S      | R            |            |         | RECN-Co            |
   | S      |              | R          |         | RECN-Co            |
   | S      |              |            | R       | Not-ECT            |
   +--------+--------------+------------+---------+--------------------+

      Table 4: Modes of TCP Half-connection for Combinations of ECN
                 Capabilities of Sender S and Receiver R

   We will describe what happens in each mode, then describe how they
   are negotiated. The abbreviations for the modes in the above table
   mean:

   RECN:  Full re-ECN capable transport

   RECN-Co:  Re-ECN sender in compatibility mode with an RFC3168
      compliant [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
      receiver. Implementation of this mode is OPTIONAL.

   Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when
      at least one of the transports does not understand even basic ECN
      marking.

   Note that we use the term Re-ECT for a host transport that is re-ECN-
   capable but RECN for the modes of the half connections between hosts
   when they are both Re-ECT. If a host transport is Re-ECT, this fact
   alone does NOT imply either of its half connections will necessarily
   be in RECN mode, at least not until it has confirmed that the other
   host is Re-ECT.

6.1.1.  RECN mode: Full Re-ECN capable transport

   In full RECN mode, for each half connection, the sender and the
   receiver each maintain an unsigned integer counter we will call ECC
   (echo congestion counter). The receiver maintains a count of how
   many times a CE marked packet has arrived during the half-connection.
   Once a RECN connection is established, the three TCP option flags
   (ECE, CWR & NS) used for ECN-related functions in other versions of
   ECN are used as a 3-bit field for the receiver to repeatedly tell the
   sender the current value of ECC, modulo 8, whenever it sends a TCP
   ACK. We will call this the echo congestion increment (ECI) field.
   This overloaded use of these 3 option flags as one 3-bit ECI field is
   shown in Figure 7. The actual definition of the TCP header,
   including the addition of support for the ECN nonce, is shown for
   comparison in Figure 6. This specification does not redefine the
   names of these three TCP option flags; it merely overloads them with
   another definition once a flow is established.
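
   As a non-normative illustration of the overloading, the sketch below
   reads and writes the ECI field within the 16-bit word formed by bytes
   13 and 14 of the TCP header. It assumes the field is laid out as
   drawn in Figure 7, with NS as the most significant bit and ECE as the
   least significant bit. The names ECI_SHIFT, ECI_MASK, get_eci and
   set_eci are hypothetical and exist only for this example.

```python
# Bytes 13 and 14 of the TCP header taken as one 16-bit big-endian word.
# NS = 0x0100, CWR = 0x0080, ECE = 0x0040, so the ECI field is bits 7 to 9.
ECI_SHIFT = 6
ECI_MASK = 0b111 << ECI_SHIFT  # 0x01C0


def get_eci(flags_word: int) -> int:
    """Extract the 3-bit ECI value from the 16-bit flags word."""
    return (flags_word & ECI_MASK) >> ECI_SHIFT


def set_eci(flags_word: int, ecc: int) -> int:
    """Return the flags word with ECI set to ECC modulo 8, other bits untouched."""
    return (flags_word & 0xFFFF & ~ECI_MASK) | ((ecc % 8) << ECI_SHIFT)


# Example: header length 5 with ACK set (0x5010), echoing ECC = 10.
word = set_eci(0x5010, 10)
print(hex(word), get_eci(word))  # prints 0x5090 2
```

   Masking before writing keeps the header length, reserved bits and the
   remaining flags intact, which matters because the same word carries
   SYN, ACK and FIN.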

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 6: The (post-ECN Nonce) definition of bytes 13 and 14 of the
                               TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 7: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header, overloading the current definitions above for established
                              RECN flows.

   Receiver Action in RECN Mode

      Every time a CE marked packet arrives at a receiver in RECN mode,
      the receiver transport increments its local value of ECC and MUST
      echo its value, modulo 8, to the sender in the ECI field of the
      next ACK. It MUST repeat the same value of ECI in every
      subsequent ACK until the next CE event, when it increments ECI
      again.

      The increment of the local ECC values is modulo 8 so the field
      value simply wraps round back to zero when it overflows. The
      least significant bit is to the right (labelled bit 9).

      A receiver in RECN mode MAY delay the echo of a CE to the next
      delayed-ACK, which would be necessary if ACK-withholding were
      implemented.

   Sender Action in RECN Mode

      On the arrival of every ACK, the sender compares the ECI field
      with its own ECC value, then replaces its local value with that
      from the ACK.
      The difference D (D = (ECI + 8 - ECC mod 8) mod 8)
      is assumed to be the number of CE marked packets that arrived at
      the receiver since it sent the previously received ACK (but see
      below for the sender's safety strategy). Whenever the ECI field
      increments by D (and/or d drops are detected), the sender MUST
      clear the RE flag to "0" in the IP header of the next D' data
      packets it sends (where D' = D + d), effectively re-echoing each
      single increment of ECI. Otherwise the data sender MUST send all
      data packets with RE set to "1".

      As a general rule, once a flow is established, as well as setting
      or clearing the RE flag as above, a data sender in RECN mode MUST
      always set the ECN field to ECT(1). However, the settings of the
      extended ECN field during flow start are defined in Section 6.1.6.

      As we have already emphasised, the re-ECN protocol makes no
      changes and has no effect on the TCP congestion control algorithm.
      So, the first increment of ECI (or detection of a drop) in a RTT
      triggers the standard TCP congestion response, no more than one
      congestion response per round trip, as usual. However, the sender
      re-echoes every increment of ECI irrespective of RTTs.

      A TCP sender also acts as the receiver for the other half-
      connection. The host will maintain two ECC values S.ECC and R.ECC
      as sender and receiver respectively. Every TCP header sent by a
      host in RECN mode will also repeat the prevailing value of R.ECC
      in its ECI field. If a sender in RECN mode has to retransmit a
      packet due to a suspected loss, the re-transmitted packet MUST
      carry the latest prevailing value of R.ECC when it is re-
      transmitted, which will not necessarily be the one it carried
      originally.

6.1.2.  Drops and Marks

   Re-ECN is based on the ECN protocol [RFC3168].
   In turn the congestion markings ECN uses are typically based on the
   RED algorithm [RFC2309]. This algorithm marks packets as CE with a
   probability that increases as the size of the router queue increases.
   However, if the queue becomes too full then it will revert to
   dropping packets. Because of this it is important that a re-ECN
   sender treats each packet drop it detects as if it were actually a CE
   mark. This ensures that it can continue to correctly echo congestion
   even through a highly congested path.

   In order to ensure that drops are correctly echoed the sender needs
   to add the number of drops detected per RTT to the difference in ECI
   value waiting to be echoed. Drop detection is defined as set out in
   [RFC2581] -- if the connection is in slow start then a single
   duplicate acknowledgement will be treated as an indication of a drop.
   When the system is in the congestion avoidance stage then 3 duplicate
   acknowledgements will be treated as a sign of a drop. In all cases,
   if a re-transmission time-out occurs then that will be treated as a
   drop.

6.1.3.  Safety against Long Pure ACK Loss Sequences

   The ECI method was chosen for echoing congestion marking because a
   re-ECN sender needs to know about every CE mark arriving at the
   receiver, not just whether at least one arrives within a round trip
   time (which is all the ECE/CWR mechanism supported). And, as pure
   ACKs are not protected by TCP reliable delivery, we repeat the same
   ECI value in every ACK until it changes. Even if many ACKs in a row
   are lost, as soon as one gets through, the ECI field it repeats from
   previous ACKs that didn't get through will update the sender on how
   many CE marks arrived since the last ACK got through.
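
   The echo mechanism can be illustrated with a minimal, non-normative
   sketch. The receiver side keeps ECC modulo 8, and eci_delta applies
   the sender's formula D = (ECI + 8 - ECC mod 8) mod 8 from
   Section 6.1.1. The last function, safe_delta, anticipates the
   conservative rule D' = L - ((L - D) mod 8) that the remainder of this
   section specifies for when pure ACKs have been lost. All class and
   function names are hypothetical, and the max() guard for the case
   where L is smaller than D is an assumption of this example, not part
   of the specification.

```python
class EchoReceiver:
    """Receiver side of a half-connection in RECN mode (illustrative only)."""

    def __init__(self):
        self.ecc = 0  # echo congestion counter, held modulo 8

    def on_data_packet(self, ce_marked: bool) -> None:
        if ce_marked:
            self.ecc = (self.ecc + 1) % 8  # wraps back to zero on overflow

    def eci_for_next_ack(self) -> int:
        return self.ecc  # repeated in every ACK until the next CE mark


def eci_delta(eci: int, ecc: int) -> int:
    """D = (ECI + 8 - ECC mod 8) mod 8: apparent CE marks since the last ACK."""
    return (eci + 8 - ecc % 8) % 8


def safe_delta(acked_segments: int, apparent_delta: int) -> int:
    """D' = L - ((L - D) mod 8): assume ECI wrapped as often as possible."""
    wrapped = acked_segments - ((acked_segments - apparent_delta) % 8)
    return max(apparent_delta, wrapped)  # guard for L < D is an assumption


# Ten CE marks arrive while every pure ACK is lost; the sender's ECC is still 0.
rx = EchoReceiver()
for _ in range(10):
    rx.on_data_packet(ce_marked=True)
d = eci_delta(rx.eci_for_next_ack(), 0)
print(d, safe_delta(11, d))  # prints 2 10
```

   The example shows why the plain difference is not enough on its own:
   the ECI field has wrapped once, so the apparent increase of 2 hides 8
   further marks that only the conservative rule recovers.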

   The sender will only lose a record of the arrival of a CE mark if all
   the ACKs are lost (and all of them were pure ACKs) for a stream of
   data long enough to contain 8 or more CE marks. So, if the marking
   fraction was p, at least 8/p pure ACKs would have to be lost. For
   example, if p was 5%, a sequence of 160 pure ACKs would all have to
   be lost. To protect against such extremely unlikely events, if a re-
   ECN sender detects a sequence of pure ACKs has been lost it SHOULD
   assume the ECI field wrapped as many times as possible within the
   sequence.

   Specifically, if a re-ECN sender receives an ACK with an
   acknowledgement number that acknowledges L segments since the
   previous ACK but with a sequence number unchanged from the previously
   received ACK, it SHOULD conservatively assume that the ECI field
   incremented by D' = L - ((L-D) mod 8), where D is the apparent
   increase in the ECI field. For example if the ACK arriving after 9
   pure ACK losses apparently increased ECI by 2, the assumed increment
   of ECI would still be 2. But if ECI apparently increased by 2 after
   11 pure ACK losses, ECI should be assumed to have increased by 10.

   A re-ECN sender MAY implement a heuristic algorithm to predict beyond
   reasonable doubt that the ECI field probably did not wrap within a
   sequence of lost pure ACKs. But such an algorithm is OPTIONAL. Such
   an algorithm MUST NOT be used unless it is proven to work even in the
   presence of correlation between high ACK loss rate on the back
   channel and high CE marking rate on the forward channel.

   Whatever assumption a re-ECN sender makes about potentially lost CE
   marks, both its congestion control and its re-echoing behaviour
   SHOULD be consistent with the assumption it makes.

6.1.4.  RECN-Co mode: Re-ECT Sender with an RFC3168 compliant ECN
        Receiver

   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
   differently to that of RFC3168 compliant ECN. In other words, the
   receiver sets the ECE flag repeatedly in the TCP header and the
   sender responds by setting the CWR flag. Although RECN-Co mode is
   used when the receiver has not implemented the re-ECN protocol, the
   sender can infer enough from its RFC3168 compliant ECN feedback to
   set or clear the RE flag reasonably well. Specifically, every time
   the receiver toggles the ECE field from "0" to "1" (or a loss is
   detected), as well as setting CWR in the TCP flags, the re-ECN sender
   MUST blank the RE flag of the next packet to "0" as it would do in
   full RECN mode. Otherwise, the data sender SHOULD send all other
   packets with RE set to "1". Once a flow is established, a re-ECN
   data sender in RECN-Co mode MUST always set the ECN field to ECT(1).

   If a CE marked packet arrives at the receiver within a round trip
   time of a previous mark, the receiver will still be echoing ECE for
   the last CE mark. Therefore, such a mark will be missed by the
   sender. Of course, this isn't of concern for congestion control, but
   it does mean that very occasionally the RE blanking fraction will be
   understated. Therefore flows in RECN-Co mode may occasionally be
   mistaken for very lightly cheating flows and consequently might
   suffer a small number of packet drops through an egress dropper. We
   expect re-ECN would be deployed for some time before policers and
   droppers start to enforce it. So, given there is not much ECN
   deployment yet anyway, this minor problem may affect only a very
   small proportion of flows, reducing to nothing over the years as
   RFC3168 compliant ECN hosts upgrade.
   The use of RECN-Co mode would need to be reviewed in the light of
   experience at the time of re-ECN deployment.

   RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their
   code simple MAY choose not to implement this mode. If they do not,
   a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the
   presence of an ECN-capable receiver. It MAY choose to fall back to
   the ECT-Nonce mode, but if re-ECN implementers don't want to be
   bothered with RECN-Co mode, they probably won't want to add an ECT-
   Nonce mode either.

6.1.4.1.  Re-ECN support for the ECN Nonce

   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
   Nonce [RFC3540]. This means that the sending code of a re-ECN
   implementation will never need to include ECN Nonce support. Re-ECN
   is intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, and re-ECN only requires support
   from the sender, therefore it is preferable to specifically rule out
   the need for dual sender implementations. As a consequence, a re-ECN
   capable sender will never set ECT(0), so it will be easier for
   network elements to discriminate re-ECN traffic flows from other ECN
   traffic, which will always contain some ECT(0) packets.

   However, a re-ECN implementation MAY include receiving code that
   complies with the ECN Nonce protocol when interacting with a sender
   that supports the ECN nonce (rather than re-ECN), but this support is
   not required.

   RFC3540 allows an ECN nonce sender to choose whether to sanction a
   receiver that does not ever set the nonce sum. Given re-ECN is
   intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, implementers of re-ECN receivers MAY
   choose not to implement backwards compatibility with the ECN nonce
   capability.
   This may be because they deem that the risk of sanctions
   is low, perhaps because significant deployment of the ECN nonce seems
   unlikely at implementation time.

6.1.5. Capability Negotiation

   During the TCP hand-shake at the start of a connection, an originator
   of the connection (host A) with a re-ECN-capable transport MUST
   indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1
   in the initial SYN.

   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
   CWR=1 and ECE=0. The responding host MUST NOT set this combination
   of flags unless the preceding SYN has already indicated Re-ECT
   support as above. Normally a Re-ECT server (B) will reply to a
   Re-ECT client with NS=0, but if the initial SYN from Re-ECT client A
   is marked CE(-1), a Re-ECT server B MUST increment its local value of
   ECC. But B cannot reflect the value of ECC in the SYN ACK, because
   it is still using the 3 bits to negotiate connection capabilities.
   So, server B MUST set the alternative TCP header flags in its SYN
   ACK: NS=1, CWR=1 and ECE=0.

   These handshakes are summarised in Table 5 below, with X indicating
   NS can be either 1 or 0 depending respectively on whether congestion
   had been experienced or not. The handshakes used for the other
   flavours of ECN are also shown for comparison. To compress the width
   of the table, the headings of the first four columns have been
   severely abbreviated, as follows:

   R: |*R|e-ECT

   N: ECT-|*N|once (RFC3540)

   E: |*E|CT (RFC3168)

   I: Not-ECT (|*I|mplicit congestion notification).

   These correspond with the same headings used in Table 4. Indeed, the
   resulting modes in the last two columns of the table below are a more
   comprehensive way of saying the same thing as Table 4.

   +----+---+---+---+------------+-------------+-----------+-----------+
   | R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
   +----+---+---+---+------------+-------------+-----------+-----------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
   | AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
   | A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
   | A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
   | A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
   | B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
   | B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
   | B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
   +----+---+---+---+------------+-------------+-----------+-----------+

      Table 5: TCP Capability Negotiation between Originator (A) and
                              Responder (B)

   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
   its two half-connections into the modes given in Table 5. As soon as
   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
   half-connections into the modes given in Table 5. The
   half-connections will remain in these modes for the rest of the
   connection, including for the third segment of TCP's three-way
   hand-shake (the ACK).

   {ToDo: Consider delaying mode changes if using SYN cookies (will also
   affect next section).}

   {ToDo: consider RSTs within a connection.}

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken RFC3168 compliant
   implementation that behaves this way), RFC3168 specifies that the
   whole connection MUST revert to Not-ECT.

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the
   3-bit ECI value, which is only set as a copy of the local ECC value
   in non-SYN packets.

6.1.6. Extended ECN (EECN) Field Settings during Flow Start or after
       Idle Periods

   If the originator (A) of a TCP connection supports re-ECN it MUST set
   the extended ECN (EECN) field in the IP header of the initial SYN
   packet to the feedback not established (FNE) codepoint.

   FNE is a new extended ECN codepoint defined by this specification
   (Section 4.2). The feedback not established (FNE) codepoint is used
   when the transport does not have the benefit of ECN feedback so it
   cannot decide whether to set or clear the RE flag.

   If after receiving a SYN the server B has set its sending
   half-connection into RECN mode or RECN-Co mode, it MUST set the
   extended ECN field in the IP header of its SYN ACK to the feedback
   not established (FNE) codepoint. Note the careful wording here, which
   means that Re-ECT server B MUST set FNE on a SYN ACK whether it is
   responding to a SYN from a Re-ECT client or from a client that is
   merely ECN-capable. This is because FNE indicates the transport is
   ECN capable as well as re-ECN capable.

   The original ECN specification [RFC3168] required SYNs and SYN ACKs
   to use the Not-ECT codepoint of the ECN field. The aim was to
   prevent well-known DoS attacks such as SYN flooding being able to
   gain from the advantage that ECN capability afforded over drop at
   ECN-capable routers.

   For a SYN ACK, Kuzmanovic [RFC5562] has shown that this caution was
   unnecessary, and allows a SYN ACK to be ECN-capable to improve
   performance. By stipulating the FNE codepoint for the initial SYN,
   we comply with RFC3168 in word but not in spirit, because we have
   indeed set the ECN field to Not-ECT, but we have extended the ECN
   field with another bit. And it will be seen (Section 5.3) that we
   have defined one setting of that bit to mean an ECN-capable
   transport.
   Therefore, by proposing that the FNE codepoint MUST be
   used on the initial SYN of a connection, we have gone further by
   proposing to make the initial SYN ECN-capable too. Section 5.4
   justifies deciding to make the initial SYN ECN-capable.

   Once a TCP half-connection is in RECN mode or RECN-Co mode, FNE will
   have already been set on the initial SYN and possibly the SYN ACK as
   above. But each re-ECN sender will have to set FNE cautiously on a
   few data packets as well, given a number of packets will usually have
   to be sent before sufficient congestion feedback is received. The
   behaviour will be different depending on the mode of the
   half-connection:

   RECN mode: Given the constraints on TCP's initial window [RFC3390]
      and its exponential window increase during slow start
      phase [RFC5681], it turns out that the sender SHOULD set FNE on
      the first and third data packets in its flow after the initial
      3-way handshake, assuming equal sized data packets once a flow is
      established. Appendix D presents the calculation that led to this
      conclusion. Below, after running through the start of an example
      TCP session, we give the intuition learned from that calculation.
      {ToDo: unfortunately the calculation was based on erroneous
      assumptions; see [I-D.conex-tcp-mods] for a better approach.}

   RECN-Co mode: A Re-ECT sender that switches into re-ECN
      compatibility mode or into Not-ECT mode (because it has detected
      the corresponding host is not re-ECN capable) MUST limit its
      initial window to 1 segment. The reasoning behind this constraint
      is given in Section 5.4. Having set this initial window, a re-ECN
      sender in RECN-Co mode SHOULD set FNE on the first and third data
      packets in a flow, as for RECN mode.
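
   As an informal illustration of the flow-start rules above, the
   following Python sketch returns the EECN codepoint for the SYN, the
   SYN ACK and the early data packets of a half-connection in RECN or
   RECN-Co mode. The function name, its arguments and the string
   labels are assumptions made for this example only; they are not
   defined by this specification. The sketch also inherits the caveat
   in the ToDo above that FNE on only the first and third data packets
   may be too few.

```python
def flow_start_eecn(packet_kind: str, data_index: int = 0) -> str:
    """Return the EECN codepoint label for a packet at flow start.

    packet_kind: "SYN", "SYN-ACK" or "DATA" (hypothetical labels).
    data_index:  1-based position among the data packets sent by this
                 half-connection; ignored for SYN and SYN-ACK.
    """
    if packet_kind in ("SYN", "SYN-ACK"):
        # MUST: no feedback is established yet during the handshake.
        return "FNE"
    if packet_kind == "DATA" and data_index in (1, 3):
        # SHOULD: first and third data packets after the handshake.
        return "FNE"
    # All other data packets use the default re-ECN codepoint.
    return "RECT"


# Codepoints for a sender's first seven data packets.
print([flow_start_eecn("DATA", i) for i in range(1, 8)])
```

   A transport that re-uses path knowledge from earlier flows could
   replace the fixed (1, 3) rule, which is why the text uses SHOULD
   rather than MUST for data packets.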

   +----+------+----------------+-------+-------+---------------+------+
   |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
   +----+------+----------------+-------+-------+---------------+------+
   |    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
   | -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
   | 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
   |    |      |    CWR,ECE,NS  |       |       |               |      |
   | 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
   |    |      |                |       |       |   SYN,ACK,CWR |      |
   | 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
   | 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
   | 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
   | 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
   | 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
   | 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
   | 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
   | 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
   |    |      | ...            |       |       |               |      |
   +----+------+----------------+-------+-------+---------------+------+

                     Table 6: TCP Session Example #1

   Table 6 shows an example TCP session, where the server B sets FNE on
   its first and third data packets (lines 5 & 7) as well as on the
   initial SYN ACK as previously described. The left hand half of the
   table shows the relevant settings of headers sent by client A in
   three layers: the TCP payload size; TCP settings; then IP settings.
   The right hand half gives equivalent columns for server B. The only
   TCP settings shown are the sequence number (SEQ), acknowledgement
   number (ACK) and the relevant control (CTL) flags that the relevant
   sending host sets in the TCP header. The IP columns show the setting
   of the extended ECN (EECN) field.

   Also shown on the receiving side of the table is the value of the
   receiver's echo congestion counter (R.ECC) after processing the
   incoming EECN header. Note that, once a host sets a half-connection
   into RECN mode, it MUST initialise its local value of ECC to zero.

   The intuition that Appendix D gives for why a sender should set FNE
   on the first and third data packets is as follows. At line 13, a
   packet sent by B is shown with an '*', which means it has been
   congestion marked by an intermediate queue from RECT to CE(-1). On
   receiving this CE marked packet, client A increments its ECC counter
   to 1 as shown. This was the 7th data packet B sent, but before
   feedback about this event returns to B, it might well have sent many
   more packets. Indeed, during exponential slow start, about as many
   packets will be in flight (unacknowledged) as have been acknowledged.
   So, when the feedback from the congestion event on B's 7th segment
   returns, B will have sent about 7 further packets that will still be
   in flight. At that stage, B's best estimate of the network's packet
   marking fraction will be 1/7. So, as B will have sent about 14
   packets, it should have already marked 2 of them as FNE in order to
   have marked 1/7; hence the need to have set the first and third data
   packets to FNE.

   Client A's behaviour in Table 6 also shows FNE being set on the first
   SYN and the first data packet (lines 1 & 4), but in this case it
   sends no more data packets, so of course, it cannot, and does not
   need to, set FNE again. Note that in the A-B direction there is no
   need to set FNE on the third part of the three-way hand-shake (line
   3, the ACK).
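
   The slow-start intuition above can be checked with a few lines of
   arithmetic. The following Python sketch is illustrative only: the
   function name is hypothetical, and it encodes just the two
   approximations stated in the text (as many packets in flight as
   have been sent up to the marked one, and a marking fraction
   estimated from a single mark).

```python
import math
from fractions import Fraction


def fne_packets_needed(first_marked_packet: int) -> int:
    """Packets that should already carry FNE when feedback about the
    first CE mark returns to the sender, under the approximations in
    the text."""
    sent_up_to_mark = first_marked_packet      # e.g. B's 7th data packet
    still_in_flight = first_marked_packet      # slow-start approximation
    total_sent = sent_up_to_mark + still_in_flight
    # Best estimate of the marking fraction: one mark in that many packets.
    marking_fraction = Fraction(1, first_marked_packet)
    return math.ceil(total_sent * marking_fraction)


print(fne_packets_needed(7))  # 14 packets sent * 1/7 -> prints 2
```

   Under these approximations the result is 2 whichever packet is the
   first to be marked, because the number of packets sent is always
   about twice the reciprocal of the estimated marking fraction.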

   Note that in this section we have used the word SHOULD rather than
   MUST when specifying how to set FNE on data segments before positive
   congestion feedback arrives (but note that the word MUST was used for
   FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first
   and third data segments to entertain the possibility that the TCP
   transport has the benefit of other knowledge of the path, which it
   re-uses from one flow for the benefit of a newly starting flow. For
   instance, one flow can re-use knowledge of other flows between the
   same hosts if using a Congestion Manager [RFC3124] or when a proxy
   host aggregates congestion information for large numbers of flows.

   {ToDo: There is probably scope for re-writing the above in a
   different way so that it says MUST unless some other knowledge of the
   path is available. See earlier note pointing out FNE on 1st & 3rd is
   too few.}

   After an idle period of more than 1 second, a re-ECN sender transport
   MUST set the EECN field of the packet that resumes the connection to
   FNE. Note that this next packet may be sent a very long time later;
   a packet does NOT have to be sent after 1 second of idling. In order
   that the design of network policers can be deterministic, this
   specification deliberately puts an absolute lower limit on how long a
   connection can be idle before the packet that resumes the connection
   must be set to FNE, rather than relating it to the connection round
   trip time. We use the lower bound of the retransmission timeout
   (RTO) [RFC6298], which is commonly used as the idle period before TCP
   must reduce to the restart window [RFC5681]. Note our specification
   of re-ECN's idle period is NOT intended to change the idle period for
   TCP's restart, nor indeed for any other purposes.
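
   The idle-period rule above can be sketched as follows. This Python
   fragment is only an illustration: the constant and function names
   are hypothetical, timestamps are assumed to be in seconds from a
   monotonic clock, and "RECT" stands in for whatever codepoint the
   sender would otherwise have chosen for the packet.

```python
IDLE_LIMIT_SECONDS = 1.0  # absolute limit; deliberately not tied to the RTT


def eecn_for_resuming_packet(now: float, last_sent: float,
                             normal_codepoint: str = "RECT") -> str:
    """Return the EECN label for the next packet to be sent.

    now:        current time in seconds (assumed monotonic clock).
    last_sent:  time the previous packet of this half-connection left.
    """
    idle_for = now - last_sent
    if idle_for > IDLE_LIMIT_SECONDS:
        # MUST: the packet that resumes the connection carries FNE.
        return "FNE"
    return normal_codepoint


print(eecn_for_resuming_packet(now=11.5, last_sent=10.0))
print(eecn_for_resuming_packet(now=10.5, last_sent=10.0))
```

   Because the limit is a fixed constant, a policer can apply the same
   test without knowing the round trip time of the connection.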

   {ToDo: Describe how the sender falls back to RFC3168 modes if packets
   don't appear to be getting through (to work round firewalls
   discarding packets they consider unusual).}

   {ToDo: Possible future capabilities for changing Slow Start}

6.1.7. Pure ACKs, Retransmissions, Window Probes and Partial ACKs

   A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
   to Not-ECT in pure ACKs, retransmissions and window probes, as
   specified in [RFC3168]. Our eventual goal is for all packets to be
   sent with re-ECN enabled, and we believe the semantics of the ECI
   field go a long way towards being able to achieve this. However, we
   have not completed a full security analysis for these cases,
   therefore, currently we merely re-state current practice.

   We must also reconcile the facts that congestion marking is applied
   to packets but acknowledgements cover octet ranges and acknowledged
   octet boundaries need not match the transmitted boundaries. The
   general principle we work to is to remain compatible with TCP's
   congestion control which is driven by congestion events at packet
   granularity while at the same time aiming to blank the RE flag on at
   least as many octets in a flow as have been marked CE.

   Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
   times as CE marked packets have been received. And that value MUST
   be echoed to the sender in the first available ACK using the ECI
   field. This ensures the TCP sender's congestion control receives
   timely feedback on congestion events at the same packet granularity
   that they were generated on congested queues.

   Then, a re-ECN sender stores the difference D between its own ECC
   value and the incoming ECI field by incrementing a counter R by D.
   Then, R is decremented by 1 for each subsequent packet that is sent
   with the RE flag blanked, until R is no longer positive.
   Using this technique,
   whenever a re-ECN transport sends a not re-ECN capable packet (e.g. a
   retransmission), the remaining packets required to have the RE flag
   blanked will be automatically carried over to subsequent packets,
   through the variable R.

   This does not ensure precisely the same number of octets have RE
   blanked as were CE marked. But we believe positive errors will
   cancel negative over a long enough period. {ToDo: However, more
   research is needed to prove whether this is so. If it is not, it may
   be necessary to increment and decrement R in octets rather than
   packets, by incrementing R as the product of D and the size in octets
   of packets being sent (typically the MSS).}

6.2. Other Transports

6.2.1. General Guidelines for Adding Re-ECN to Other Transports

   As a general rule, Re-ECT sender transports that have established the
   receiver transport is at least ECN-capable (not necessarily re-ECN
   capable) MUST blank the RE codepoint for at least as many octets as
   arrive at the receiver with the CE codepoint set. Re-ECN-capable
   sender transports should always initialise the ECN field to the
   ECT(1) codepoint once a flow is established.

   If the sender transport does not have sufficient feedback to even
   estimate the path's CE rate, it SHOULD set FNE continuously. If the
   sender transport has some, perhaps stale, feedback to estimate that
   the path's CE rate is almost certainly less than E%, the transport
   MAY blank RE in packets for E% of sent octets, and set the RECT
   codepoint for the remainder.

   The following sections give guidelines on how re-ECN support could be
   added to RSVP or NSIS, to DCCP, and to SCTP - although separate
   Internet drafts will be necessary to document the exact mechanics of
   re-ECN in each of these protocols.
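
   For a transport that counts in packets, one way to meet the general
   rule above is the carry-over counter R described for TCP in
   Section 6.1.7. The following Python sketch is a minimal
   illustration under stated assumptions: the class and method names
   are hypothetical, and the difference D is taken modulo 8 on the
   assumption that the 3-bit ECI field wraps, which the text does not
   spell out.

```python
class ReEcnSenderState:
    """Tracks how many packets still owe a blanked RE flag."""

    ECI_MODULUS = 8  # assumption: ECI is a 3-bit field that wraps

    def __init__(self) -> None:
        self.ecc = 0  # sender's copy of the echoed congestion counter
        self.r = 0    # packets still to be sent with RE blanked

    def on_feedback(self, eci: int) -> int:
        """Process an incoming ECI value; return the difference D."""
        d = (eci - self.ecc) % self.ECI_MODULUS
        self.ecc = (self.ecc + d) % self.ECI_MODULUS
        self.r += d
        return d

    def next_packet(self, re_ecn_capable: bool = True):
        """Return (ECN field, RE flag) for the next outgoing packet."""
        if not re_ecn_capable:
            # Pure ACK, retransmission or window probe: Not-ECT with
            # RE cleared. R is untouched, so the debt carries over.
            return ("Not-ECT", 0)
        if self.r > 0:
            self.r -= 1
            return ("ECT(1)", 0)  # RE blanked
        return ("ECT(1)", 1)      # default: RE set


state = ReEcnSenderState()
state.on_feedback(2)                            # two new CE marks echoed
print(state.next_packet())                      # RE blanked, R is now 1
print(state.next_packet(re_ecn_capable=False))  # retransmission, R stays 1
print(state.next_packet())                      # RE blanked, R is now 0
print(state.next_packet())                      # back to RE set
```

   Counting in octets instead, as the ToDo in Section 6.1.7 suggests
   might be necessary, would mean adding D times the packet size to R
   and subtracting the size of each packet sent with RE blanked.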

   {ToDo: Give a brief outline of what would be expected for each of the
   following:

   o  UDP fire and forget (e.g. DNS)

   o  UDP streaming with no feedback

   o  UDP streaming with feedback

   }

6.2.2. Guidelines for adding Re-ECN to RSVP or NSIS

   A separate I-D has been submitted [I-D.re-pcn-border-cheat]
   describing how re-ECN can be used in an edge-to-edge rather than
   end-to-end scenario. It can then be used by downstream networks to
   police whether upstream networks are blocking new flow reservations
   when downstream congestion is too high, even though the congestion is
   in other operators' downstream networks. This relates to current
   IETF work on Admission Control over Diffserv using Pre-Congestion
   Notification (PCN) [RFC5559].

6.2.3. Guidelines for adding Re-ECN to DCCP

   Besides adjusting the initial features negotiation sequence,
   operating re-ECN in DCCP [RFC4340] could be achieved by defining a
   new option to be added to acknowledgments, which would include a
   multibit field where the destination could copy its ECC.

6.2.4. Guidelines for adding Re-ECN to SCTP

   Appendix A in [RFC4960] gives the specifications for SCTP to support
   ECN. Similar steps should be taken to support re-ECN. Besides
   adjusting the initial features negotiation sequence, operating re-ECN
   in SCTP could be achieved by defining a new control chunk, which
   would include a multibit field where the destination could copy its
   ECC.

7. Incremental Deployment

   The design of the re-ECN protocol started from the fact that the
   current ECN marking behaviour of queues was sufficient and that
   re-feedback could be introduced around these queues by changing the
   sender behaviour but not the routers.
   Otherwise, if we had required
   routers to be changed, the chance of encountering a path that had
   every router upgraded would be vanishingly small during early
   deployment, giving no incentive to start deployment. Also, as there
   is no new forwarding behaviour, routers and hosts do not have to
   signal or negotiate anything.

   However, networks that choose to protect themselves using re-ECN do
   have to add new security functions at their trust boundaries with
   others. They distinguish legacy traffic by its ECN field. Traffic
   from Not-ECT transports is distinguishable by its Not-ECT marking.

   Traffic from RFC3168 compliant ECN transports is distinguished from
   re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1)
   for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on
   either 50% (the nonce) or 100% (the default) of packets, whereas
   re-ECN does not use ECT(0) at all. We can use this distinguishing
   feature of RFC3168 compliant ECN traffic to separate it out for
   different treatment at the various border security functions: egress
   dropping, ingress policing and border policing.

   The general principle we adopt is that an egress dropper will not
   drop any legacy traffic, but ingress and border policers will limit
   the bulk rate of legacy traffic (Not-ECT, ECT(0) and those marked
   with the unused codepoint) that can enter each network. Then, during
   early re-ECN deployment, operators can set very permissive (or
   non-existent) rate-limits on legacy traffic, but once re-ECN
   implementations are generally available, legacy traffic can be
   rate-limited increasingly harshly. Ultimately, an operator might
   choose to block all legacy traffic entering its network, or at least
   only allow through a trickle.

   Then, the more strictly the limits are set, the more RFC3168 ECN
   sources will gain by upgrading to re-ECN.
   Thus, towards the end of
   the voluntary incremental deployment period, RFC3168 compliant
   transports can be given progressively stronger encouragement to
   upgrade.

   The following list of minor changes brings together all the points
   where re-ECN semantics for use of the two-bit ECN field are different
   compared to RFC3168:

   o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender
      sets ECT(0) by default (Section 4.3);

   o  No provision is necessary for a re-ECN capable source transport to
      use the ECN nonce (Section 6.1.4.1);

   o  Routers MAY preferentially drop different extended ECN codepoints
      (Section 5.3);

   o  Packets carrying the feedback not established (FNE) codepoint MAY
      optionally be marked rather than dropped by routers, even though
      their ECN field is Not-ECT (with the important caveat in
      Section 5.3);

   o  Packets may be dropped by policing nodes because of apparent
      misbehaviour, not just because of congestion;

   o  Tunnel entry behaviour is still to be defined, but may have to be
      different from RFC3168 (Section 5.6).

   None of these changes REQUIRE any modifications to routers. Also
   none of these changes affect anything about end to end congestion
   control; they are all to do with allowing networks to police that end
   to end congestion control is well-behaved.

8. Related Work

8.1. Congestion Notification Integrity

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   This mechanism has since been included in the specifications of DCCP
   [RFC4340].

   {ToDo: DCCP provides nonce support - how does this affect the RFC?}

   The ECN nonce is an elegant scheme that allows the sender to detect
   if someone in the feedback loop - the receiver especially - tries to
   claim no congestion was experienced when in fact congestion led to
   packet drops or ECN marks. For each packet it sends, the sender
   chooses between the two ECT codepoints in a pseudo-random sequence.
   Then, whenever the network marks a packet with CE, if the receiver
   wants to deny congestion happened, she has to guess which ECT
   codepoint was overwritten. She has only a 50:50 chance of being
   correct each time she denies a congestion mark or a drop, which
   ultimately will give her away.

   The purpose of a network-layer nonce should primarily be protection
   of the network, while a transport-layer nonce would be better used to
   protect the sender from cheating receivers. Now, the assumption
   behind the ECN nonce is that a sender will want to detect whether a
   receiver is suppressing congestion feedback. This is only true if
   the sender's interests are aligned with the network's, or with the
   community of users as a whole. This may be true for certain large
   senders, who are under close scrutiny and have a reputation to
   maintain. But we have to deal with a more hostile world, where
   traffic may be dominated by peer-to-peer transfers, rather than
   downloads from a few popular sites. Often the 'natural'
   self-interest of a sender is not aligned with the interests of other
   users. The sender often wishes to transfer data quickly just as much
   as the receiver wants to receive it quickly.

   In contrast, the re-ECN protocol enables policing of an agreed
   rate-response to congestion (e.g. TCP-friendliness) at the sender's
   interface with the internetwork.
   It also ensures downstream networks
   can police their upstream neighbours, to encourage them to police
   their users in turn. But most importantly, it requires the sender to
   declare path congestion to the network and it can remove traffic at
   the egress if this declaration is dishonest. So it can police
   correctly, irrespective of whether the receiver tries to suppress
   congestion feedback or whether the sender ignores genuine congestion
   feedback. Therefore the re-ECN protocol addresses a much wider range
   of cheating problems, which includes the one addressed by the ECN
   nonce.

   {ToDo: Ensure we address the early ACK problem.}

9. Security Considerations

   {ToDo: Describe attacks by networks on flows and by spoofing
   sources.} {ToDo: Re-ECN & DNS servers}

   This whole memo concerns the deployment of a secure congestion
   control framework. However, below we list some specific security
   issues that we are still working on:

   o  Malicious users have the ability to launch dynamically changing
      attacks, exploiting the time it takes to detect an attack, given
      ECN marking is binary. We are concentrating on subtle
      interactions between the ingress policer and the egress dropper in
      an effort to make it impossible to game the system.

   o  There is an inherent need for at least some flow state at the
      egress dropper given the binary marking environment, which leads
      to an apparent vulnerability to state exhaustion attacks. An
      egress dropper design with bounded flow state is in write-up.

   o  A malicious source can spoof another user's address and send
      negative traffic to the same destination in order to fool the
      dropper into sanctioning the other user's flow. To prevent or
      mitigate these two different kinds of DoS attack, against the
      dropper and against given flows, we are considering various
      protection mechanisms.

   o  A malicious client can send requests using a spoofed source
      address to a server (such as a DNS server) that tends to respond
      with single packet responses. This server will then be tricked
      into having to set FNE on the first (and only) packet of all these
      wasted responses. Given packets marked FNE are worth +1, this
      will cause such servers to consume more of their allowance to
      cause congestion than they would wish to. In general, re-ECN is
      deliberately designed so that single packet flows have to bear the
      cost of not discovering the congestion state of their path. One
      of the reasons for introducing re-ECN is to encourage short flows
      to make use of previous path knowledge by moving the cost of this
      lack of knowledge to sources that create short flows. Therefore,
      in the long run we might expect services like DNS to aggregate
      single packet flows into connections where it brings benefits.
      However, this attack where DNS requests are made from spoofed
      addresses genuinely forces the server to waste its resources. The
      only mitigating feature is that the attacker has to set FNE on
      each of its requests if they are to get through an egress dropper
      to a DNS server. The attacker therefore has to consume as many
      resources as the victim, which at least implies re-ECN does not
      unwittingly amplify this attack.

   Having highlighted outstanding security issues, we now explain the
   design decisions that were taken based on a security-related
   rationale. It may seem that the six codepoints of the eight made
   available by extending the ECN field with the RE flag have been used
   rather wastefully to encode just five states. In effect the RE flag
   has been used as an orthogonal single bit, using up four codepoints
   to encode the three states of positive, neutral and negative worth.
   The mapping of the codepoints in an earlier version of this proposal
   used the codepoint space more efficiently, but the scheme became
   vulnerable to network operators bypassing congestion penalties by
   focusing congestion marking on positive packets. Appendix B explains
   why fixing that problem, while allowing for incremental deployment,
   would have used another codepoint anyway. So it was better to use
   this orthogonal encoding scheme, which greatly simplified the whole
   protocol and brought with it some subtle security benefits (see the
   last paragraph of Appendix B).

   With the scheme as now proposed, once the RE flag is set or cleared
   by the sender or its proxy, it should not be written by the network,
   only read. So the endpoints can detect if any network maliciously
   alters the RE flag. IPsec AH integrity checking does not cover the
   IPv4 option flags (they were considered mutable, even the one we
   propose using for the RE flag, which was 'currently unused' when
   IPsec was defined). But it would be sufficient for a pair of
   endpoints to make random checks on whether the RE flag was the same
   when it reached the egress as when it left the ingress. Indeed, if
   IPsec AH had covered the RE flag, any network intending to alter
   sufficient RE flags to make a gain would have focused its alterations
   on packets without authenticating headers (AHs).

   The security of re-ECN has been deliberately designed to not rely on
   cryptography.

10. IANA Considerations

   This memo includes no request to IANA (yet).

   If this memo were to progress to standards track, it would list:

   o  The new RE flag in IPv4 (Section 5.1) and its extension with the
      ECN field to create a new set of extended ECN (EECN) codepoints;

   o  The definition of the EECN codepoints for default Diffserv PHBs
      (Section 4.2);

   o  The Hop-by-Hop option ID for the new extension header for IPv6
      (Section 5.2);

   o  The new combinations of flags in the TCP header for capability
      negotiation (Section 6.1.5).

11. Conclusions

   {ToDo:}

12. Acknowledgements

   Sebastien Cazalet and Andrea Soppera contributed to the idea of
   re-feedback. All the following have given helpful comments: Andrea
   Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
   Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
   John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
   Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
   (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
   Handley (who developed the attack with canceled packets), Adam
   Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
   (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
   complemented our own dummy traffic attacks with others), Liz Maida
   (MIT), Meral Shirazipour (Ericsson) and comments from participants in
   the CRN/CFP Broadband and DoS-resistant Internet working groups. A
   special thank you to Alessandro Salvatori for coming up with fiendish
   attacks on re-ECN.

13. Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF Congestion Exposure (ConEx) working group's
   mailing list, and/or to the authors.

14. References

14.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP", RFC
              3168, September 2001.

   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
              Initial Window", RFC 3390, October 2002.

   [RFC4302]  Kent, S., "IP Authentication Header", RFC 4302, December
              2005.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [RFC4341]  Floyd, S. and E. Kohler, "Profile for Datagram Congestion
              Control Protocol (DCCP) Congestion Control ID 2: TCP-like
              Congestion Control", RFC 4341, March 2006.

   [RFC4342]  Floyd, S., Kohler, E., and J. Padhye, "Profile for
              Datagram Congestion Control Protocol (DCCP) Congestion
              Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342,
              March 2006.

   [RFC4835]  Manral, V., "Cryptographic Algorithm Implementation
              Requirements for Encapsulating Security Payload (ESP) and
              Authentication Header (AH)", RFC 4835, April 2007.

   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol", RFC
              4960, September 2007.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June
              2009.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, November 2010.

14.2.  Informative References

   [ARI05]    Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the
              Internet to Support Real-Time Content Supply from a Large
              Fraction of Broadband Residential Users", BT Technology
              Journal (BTTJ) 23(2), April 2005.

   [I-D.conex-tcp-mods]
              Kuehlewind, M. and R. Scheffenegger, "TCP modifications
              for Congestion Exposure",
              draft-ietf-conex-tcp-modifications-05 (work in progress),
              February 2014.

   [I-D.re-ecn-motiv]
              Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
              "Re-ECN: A Framework for adding Congestion Accountability
              to TCP/IP", draft-briscoe-conex-re-ecn-motiv-03 (work in
              progress), March 2014.

   [I-D.re-pcn-border-cheat]
              Briscoe, B., "Emulating Border Flow Policing using Re-PCN
              on Bulk Data", draft-briscoe-re-pcn-border-cheat-03 (work
              in progress), October 2009.

   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
              S., Wroclawski, J., and L. Zhang, "Recommendations on
              Queue Management and Congestion Avoidance in the
              Internet", RFC 2309, April 1998.

   [RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z.,
              and W. Weiss, "An Architecture for Differentiated
              Services", RFC 2475, December 1998.

   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
              RFC 3124, June 2001.

   [RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header", RFC
              3514, April 1 2003.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces", RFC
              3540, June 2003.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, December 2005.

   [RFC5129]  Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
              Marking in MPLS", RFC 5129, January 2008.

   [RFC5559]  Eardley, P., "Pre-Congestion Notification (PCN)
              Architecture", RFC 5559, June 2009.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
              Congestion Response in an Internetwork Using Re-Feedback",
              ACM SIGCOMM CCR 35(4)277--288, August 2005.

   [Savage99]
              Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
              "TCP congestion control with a misbehaving receiver", ACM
              SIGCOMM CCR 29(5), October 1999.

   [Steps_DoS]
              Handley, M. and A. Greenhalgh, "Steps towards a DoS-
              resistant Internet Architecture", Proc. ACM SIGCOMM
              workshop on Future directions in network architecture
              (FDNA'04) pp 49--56, August 2004.

   [tcp-rcv-cheat]
              Moncaster, T., Briscoe, B., and A. Jacquet, "A TCP Test to
              Allow Senders to Identify Receiver Non-Compliance",
              draft-moncaster-tcpm-rcv-cheat-03 (work in progress), July
              2014.

Appendix A.  Precise Re-ECN Protocol Operation

   The protocol operation in Section 4.3 was described as an
   approximation.  In fact, standard ECN marking at a queue combines 1%
   and 2% marking into slightly less than 3% whole-path marking, because
   queues deliberately mark CE whether or not it has already been marked
   by another queue upstream.  So the combined marking fraction would
   actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

   To generalise this we will need some notation.

   o  j represents the index of each resource (typically queues) along a
      path, ranging from 0 at the first queue to n-1 at the last.

   o  m_j represents the fraction of octets to be marked CE by a
      particular queue (whether or not they are already marked) because
      of congestion of resource j.

   o  u_j represents congestion signals arriving from upstream of
      resource j, being the fraction of CE marking in arriving packet
      headers (before marking).

   o  p_j represents path congestion, being the fraction of packets
      arriving at resource j with the RE flag blanked (excluding Not-
      RECT packets).

   o  v_j denotes expected congestion downstream of resource j, which
      can be thought of as a virtual marking fraction, being derived
      from two other marking fractions.

   Observed fractions of each particular codepoint (u, p and v) and
   queue marking rate m are dimensionless fractions, being the ratio of
   two data volumes (marked and total) over a monitoring period.  All
   measurements are in terms of octets, not packets, assuming that line
   resources are more congestible than packet processing.

   The path congestion (RE blanking fraction) set by the sender should
   reflect upstream congestion (CE marking fraction) from the viewpoint
   of the destination, which it feeds back to the sender.  Therefore in
   the steady state

      p_0 = u_n
          = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

   Similarly, at some point j in the middle of the network, given
   p_j = 1 - (1 - u_j)(1 - v_j), then

      v_j  = 1 - (1 - p_j)/(1 - u_j)

          ~= p_j - u_j;                 if u_j << 100%

   So, between the two routers in the example in Section 4.3, congestion
   downstream is

      v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
          = 2.00%,

   or a useful approximation of downstream congestion is

      v_1 ~= 2.98% - 1.00%
          ~= 1.98%.

Appendix B.  Justification for Two Codepoints Signifying Zero Worth
             Packets

   It may seem a waste of a codepoint to set aside two codepoints of the
   Extended ECN field to signify zero worth (RECT and CE(0) are both
   worth zero).  The justification is subtle, but worth recording.

   The original version of Re-ECN ([Re-fb] and draft-00 of this memo)
   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
   negative (CE) packets.
   The sender set packets to neutral unless re-echoing congestion, when
   it set them positive, in much the same way that it blanks the RE flag
   in the current protocol.  However, routers were meant to mark
   congestion by setting packets negative (CE) irrespective of whether
   they had previously been neutral or positive.

   However, we did not arrange for senders to remember which packet had
   been sent with which codepoint, or for feedback to say exactly which
   packets arrived with which codepoints.  The transport was meant to
   inflate the number of positive packets it sent to allow for a few
   being wiped out by congestion marking.  We (wrongly) assumed that
   routers would congestion mark packets indiscriminately, so the
   transport could infer how many positive packets had been marked and
   compensate accordingly by re-echoing.  But this created a perverse
   incentive for routers to preferentially congestion mark positive
   packets rather than neutral ones.

   We could have removed this perverse incentive by requiring Re-ECN
   senders to remember which packets they had sent with which codepoint,
   and by requiring feedback from the receiver to identify which packets
   arrived as which.  Then, if a positive packet was congestion marked
   to negative, the sender could have re-echoed twice to maintain the
   balance between positive and negative at the receiver.

   Instead, we chose to make re-echoing congestion (blanking RE)
   orthogonal to congestion notification (marking CE), which required a
   second neutral codepoint.  Then the receiver would be able to detect
   and echo a congestion event even if it arrived on a packet that had
   originally been positive.
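
   The orthogonality can be illustrated with a minimal sketch.  It
   assumes the extended ECN encoding implied by this memo (ECT(1) with
   RE blanked is Re-Echo, ECT(1) with RE set is RECT, CE with RE blanked
   is CE(0), CE with RE set is CE(-1)).  The names Packet, mark_ce,
   eecn_name, worth and was_marked are hypothetical, chosen only for
   illustration, and the sketch is not part of the protocol
   specification.

```python
from dataclasses import dataclass, replace

# ECN field values as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

# Assumed EECN mapping: (ECN field, RE flag) -> (codepoint name, worth).
# Only the four codepoints discussed in this appendix are listed.
EECN = {
    (ECT1, 0): ("Re-Echo", +1),  # positive: sender blanked RE
    (ECT1, 1): ("RECT", 0),      # neutral
    (CE, 0): ("CE(0)", 0),       # cancelled: Re-Echo that was marked
    (CE, 1): ("CE(-1)", -1),     # negative: RECT that was marked
}


@dataclass(frozen=True)
class Packet:
    ecn: int      # two-bit ECN field, written only by routers
    re_flag: int  # RE flag, written only by the sender


def mark_ce(pkt: Packet) -> Packet:
    """Router congestion marking: sets the ECN field, never touches RE."""
    return replace(pkt, ecn=CE)


def eecn_name(pkt: Packet) -> str:
    return EECN[(pkt.ecn, pkt.re_flag)][0]


def worth(pkt: Packet) -> int:
    return EECN[(pkt.ecn, pkt.re_flag)][1]


def was_marked(pkt: Packet) -> bool:
    """Receiver's test for congestion, independent of the RE flag."""
    return pkt.ecn == CE


re_echo = Packet(ecn=ECT1, re_flag=0)
rect = Packet(ecn=ECT1, re_flag=1)
print(eecn_name(mark_ce(re_echo)))  # CE(0)
print(eecn_name(mark_ce(rect)))     # CE(-1)
```

   In this sketch a congestion mark lowers the worth of a packet by one
   whichever codepoint the sender chose, and the receiver can still see
   both the mark and the sender's original RE setting.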

   If we had added extra complexity to the sender and receiver
   transports to track changes to individual packets, we could have made
   it work, but then routers would have had an incentive to mark
   positive packets with half the probability of neutral packets.  That
   in turn would have led router algorithms to become more complex.
   Then senders wouldn't know whether a mark had been introduced by a
   simple or a complex router algorithm.  That in turn would have
   required another codepoint to distinguish between RFC3168 ECN and new
   Re-ECN router marking.

   Once the cost of IP header codepoint real-estate was the same for
   both schemes, there was no doubt that the simpler option for
   endpoints and for routers should be chosen.  The resulting protocol
   also no longer needed the tricky inflation/deflation complexity of
   the original (broken) scheme.  It was also much simpler to understand
   conceptually.

   A further advantage of the new orthogonal four-codepoint scheme was
   that senders owned sole rights to change the RE flag and routers
   owned sole rights to change the ECN field.  Although we still arrange
   the incentives so neither party strays outside their dominion, these
   clear lines of authority simplify the matter.

   Finally, a little redundancy can be very powerful in a scheme such as
   this.  In one flow, the proportion of packets changed to CE should be
   the same as the proportion of RECT packets changed to CE(-1) and the
   proportion of Re-Echo packets changed to CE(0).  Double checking
   using such redundant relationships can improve the security of a
   scheme (cf. double-entry book-keeping or the ECN Nonce).
   Alternatively, it might be necessary to exploit the redundancy in the
   future to encode an extra information channel.

Appendix C.  ECN Compatibility

   The rationale for choosing the particular combinations of SYN and SYN
   ACK flags in Section 6.1.5 is as follows.

   Choice of SYN flags:  A Re-ECN sender can work with RFC3168 compliant
      ECN receivers so we wanted to use the same flags as would be used
      in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same
      time, we wanted a server (host B) that is Re-ECT to be able to
      recognise that the client (A) is also Re-ECT.  We believe also
      setting NS=1 in the initial SYN achieves both these objectives, as
      it should be ignored by RFC3168 compliant ECT receivers and by
      ECT-Nonce receivers.  But senders that are not Re-ECT should not
      set NS=1.  At the time ECN was defined, the NS flag was not
      defined, so setting NS=1 should be ignored by existing ECT
      receivers (but testing against implementations may yet prove
      otherwise).  The ECN Nonce RFC [RFC3540] is silent on what the NS
      field might be set to in the TCP SYN, but we believe the intent
      was for a nonce client to set NS=0 in the initial SYN (again only
      testing will tell).  Therefore we define a Re-ECN-setup SYN as one
      with NS=1, CWR=1 & ECE=1.

   Choice of SYN ACK flags:  The client (A) needs to be able to
      determine whether the server (B) is Re-ECT.  The original ECN
      specification required an ECT server to respond to an ECN-setup
      SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There is no
      room to modify this by setting the NS flag, as that is already set
      in the SYN ACK of an ECT-Nonce server.  So we used the only
      combination of CWR and ECE that would not be used by existing TCP
      receivers: CWR=1 and ECE=0.  The original ECN specification
      defines this combination as a non-ECN-setup SYN ACK, which remains
      true for RFC3168 compliant and Nonce ECTs.  But for Re-ECN we
      define it as a Re-ECN-setup SYN ACK.
      We didn't use a SYN ACK with both CWR and ECE cleared to 0 because
      that would be the likely response from most Not-ECT receivers.
      And we didn't use a SYN ACK with both CWR and ECE set to 1 either,
      as at least one broken receiver implementation echoes whatever
      flags were in the SYN into its SYN ACK.  Therefore we define a
      Re-ECN-setup SYN ACK as one with CWR=1 & ECE=0.

   Choice of two alternative SYN ACKs:  The NS flag may take either
      value in a Re-ECN-setup SYN ACK.  Section 5.4 specifies that a
      Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK
      to echo congestion experienced (CE) on the initial SYN.  Otherwise
      a Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only
      current known use of the NS flag in a SYN ACK is to indicate
      support for the ECN nonce, which will be negotiated by setting
      CWR=0 & ECE=1.  Given the ECN nonce MUST NOT be used for a RECN
      mode connection, a Re-ECN-setup SYN ACK can use either setting of
      the NS flag without any risk of confusion, because the CWR & ECE
      flags will be reversed relative to those used by an ECN nonce SYN
      ACK.

   {ToDo: include the text below, either here, or in the algorithm
   sections} At an egress dropper, well-behaved RFC3168 compliant flows
   will appear to consist mostly of ECT(0) packets, with a few CE(0)
   packets.  And, if the legacy source is setting the ECN nonce, the
   majority of packets will be an equal mix of ECT(0) and ECT(1) packets
   (the latter appearing to be Re-Echo packets in Re-ECN terms).  None
   of these three packet markings is negative, so an egress dropper can
   handle all legacy flows in bulk and, as long as they don't send any
   packets using Re-ECN markings, it need not drop any legacy packets.
   So, as soon as an ECT(0) packet is seen, its flow ID can be added to
   the set of known legacy flows (a single Bloom filter would suffice).
   But, if any packets in flows classified as RFC3168 compliant are
   marked with any other marking than the three expected, the flow can
   be removed from the RFC3168 set, to be treated in bulk with
   misbehaving Re-ECN flows (the remainder of flow IDs, which require no
   flow state to be held).

   To an ingress Re-ECN policer, legacy ECN flows will appear as very
   highly congested paths.  When policers are first deployed they can be
   configured permissively, allowing through both RFC3168 ECN and
   misbehaving Re-ECN flows.  Then, the more strictly the threshold is
   set, the more RFC3168 ECN sources will gain by upgrading to Re-ECN.
   Thus, towards the end of the voluntary incremental deployment period,
   RFC3168 transports can be given progressively stronger encouragement
   to upgrade.

Appendix D.  Packet Marking with FNE During Flow Start

   FNE (feedback not established) packets have two functions.  Their
   main role is to announce the start of a new flow when feedback has
   not yet been established.  However, they also have the role of
   balancing the expected feedback and can be used where there are
   sudden changes in the rate of transmission.  Although this should not
   happen under TCP, this speculative use of FNE marking underpins the
   following argument as to why the first and third packets should be
   set to FNE.

   The proportion of FNE packets in each round-trip should be a high
   estimate of the potential error in the balance between the number of
   congestion-marked packets and the number of Re-Echo packets already
   issued.

   Let's call:

      S: the number of TCP segments sent so far

      F: the number of FNE packets sent so far

      R: the number of Re-Echo packets sent so far

      A: the number of acknowledgments received so far

      C: the number of acknowledgments echoing a CE packet

   In normal operation, when we want to send packet S+1, we first need
   to check that enough Re-Echo packets have been issued:

   If R 1 FNE

   o  if the acknowledgment doesn't echo a mark

      *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

      *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

   o  if no acknowledgement for these two packets echoes a congestion
      mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source
      could send another 4 RECT packets. ==> 4 RECT

   o  if no acknowledgement for these four packets echoes a congestion
      mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source
      could send another 8 RECT packets. ==> 8 RECT

   This behaviour happens to match TCP's congestion window control in
   slow start, which is why, for TCP sources, only the first and third
   packets need be FNE packets.

   A source that opened the congestion window any quicker would have to
   insert more FNE packets.  As another example, a UDP source sending
   VBR traffic might need to send several FNE packets ahead of the
   traffic peaks it generates.

Appendix E.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop, whether another
   router or the receiver.

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in
   [RFC3540] (currently experimental).  The specification for [RFC4340]
   (currently proposed standard) requires that "Each DCCP sender SHOULD
   set ECN Nonces on its packets...".
   It also mandates as a requirement for all CCID profiles that "Any
   newly defined acknowledgement mechanism MUST include a way to
   transmit ECN Nonce Echoes back to the sender.", therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces in
      its Ack Vectors."

   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The primary function of the ECN nonce is to protect the integrity of
   the information about congestion: ECN marks and packet drops.
   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.  Similarly,
   a transport layer nonce would protect against a receiver sending
   early acknowledgements [Savage99].

   If the ECN nonce reveals integrity problems with the information
   about congestion, the sending transport can use that knowledge for
   two functions:

   o  to protect its own resources, by allocating them in proportion to
      the rates that each network path can sustain, based on congestion
      control,

   o  and to protect congested routers in the network, by drastically
      slowing down its connection to the destination with corrupt
      congestion information.

   If the sending transport chooses to act in the interests of congested
   routers, it can reduce its rate if it detects that some malicious
   party in the feedback loop may be suppressing ECN feedback.  But this
   would only be useful to congested routers when /all/ senders using
   them are trusted to act in the interest of the congested routers.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.  In that case, the nonce allows
   senders to be assured that they aren't being duped into giving more
   of their own resources to a particular flow.  And if congestion
   suppression is detected, the sending transport can rate limit the
   offending connection to protect its own resources.  Certainly, this
   is a useful function, but the IETF should carefully decide whether
   such a single, very specific case warrants IP header space.

   In contrast, Re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone - senders,
   receivers, neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.  It therefore seems prudent to give time for an
   alternative way to be found to do the one function the nonce is
   essential for.

   Moreover, while we have re-designed the Re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.
   If the ECN nonce started to see some deployment (perhaps because it
   was blessed with proposed standard status), incremental deployment of
   Re-ECN would effectively be impossible, because Re-ECN marking
   fractions at inter-domain borders would be polluted by unknown levels
   of nonce traffic.

   The authors are aware that Re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of Re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the Re-ECN protocol is sufficiently
   useful to warrant standards action.

Appendix F.  Alternative Terminology Used in Other Documents

   A number of alternative terms have been used in various documents
   describing re-feedback and re-ECN.  These are set out in the
   following table.

      +---------------------+----------------+------------------+
      | Current Terminology | EECN codepoint | Colour           |
      +---------------------+----------------+------------------+
      | Cautious            | FNE            | Green            |
      | Positive            | Re-Echo        | Black            |
      | Neutral             | RECT           | Grey             |
      | Negative            | CE(-1)         | Red              |
      | Cancelled           | CE(0)          | Red-Black        |
      | Legacy ECN          | ECT(0)         | White            |
      | Currently Unused    | --CU--         | Currently unused |
      |                     |                |                  |
      | Legacy              | Not-ECT        | White            |
      +---------------------+----------------+------------------+

                 Table 7: Alternative re-ECN Terminology

Appendix G.  Changes from previous drafts (to be removed by the RFC
             Editor)

   Full diffs from all previous versions (created using the rfcdiff
   tool) are available at

   From draft-briscoe-conex-...-03 to -04 (current version):  Re-issued
      to keep alive; updated references (but protocol specification
      remains frozen as it was at draft-briscoe-tsvwg-...-08);
      reinstated section on "Safety against Long Pure ACK Loss
      Sequences" about wrap of the ECI field that had accidentally been
      commented out in draft-briscoe-tsvwg-...-07

   From draft-briscoe-conex-...-02 to -03:  Re-issued to keep alive;
      updated references

   From draft-briscoe-conex-...-01 to -02:  Re-issued to keep alive;
      updated references

   From draft-briscoe-conex-...-00 to -01:  Re-issued to keep alive;
      updated references

   From draft-briscoe-tsvwg-...-08 to draft-briscoe-conex-...-00:

      Re-issued to keep alive for reference by ConEx working group

      Changed working group tag in filename from tsvwg to conex

      Changed intended status to historic and added explanatory note

      Updated references.  Also, now that RFC6040 has been published,
      the section on tunnelling required a re-write

      Corrected name of CE(0) to Cancelled in Table 2

      Noted errors and omissions (rather than spending time correcting
      them):

      *  Made a few 'ToDo' comments visible that had previously been
         comments within the document source

      *  Identified errors with 'ToDo' comments, referring to correct
         material where possible.

   From -08 to -09:

      Re-issued to keep alive for reference by ConEx working group.

      Hardly any changes to content, even where it is out of date,
      except references updated.

   From -07 to -08:

      Minor changes and consistency checks.

      References updated.

   From -06 to -07:

      Major changes made following splitting this protocol document from
      the related motivations document [I-D.re-ecn-motiv].

      Significant re-ordering of remaining text.

      New terminology introduced for clarity.

      Minor editorial changes throughout.

Authors' Addresses

   Bob Briscoe (editor)
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   EMail: arnaud.jacquet@bt.com

   Toby Moncaster
   Moncaster.com
   Dukes
   Layer Marney
   Colchester  CO5 9UZ
   UK

   EMail: toby@moncaster.com

   Alan Smith
   BT
   B54/76, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 640404
   EMail: alan.p.smith@bt.com