Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Standards Track                              A. Jacquet
Expires: September 4, 2009                                  T. Moncaster
                                                                A. Smith
                                                                      BT
                                                           March 3, 2009


     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-tsvwg-re-ecn-tcp-07

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 4, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   This document introduces a new protocol for explicit congestion
   notification (ECN), termed re-ECN, which can be deployed
   incrementally around unmodified routers.  The protocol works by
   arranging an extended ECN field in each packet so that, as it crosses
   any interface in an internetwork, it will carry a truthful prediction
   of congestion on the remainder of its path.  The purpose of this
   document is to specify the re-ECN protocol at the IP layer and to
   give guidelines on any consequent changes required to transport
   protocols.  It includes the changes required to TCP both as an
   example and as a specification.  It briefly gives examples of
   mechanisms that can use the protocol to ensure data sources respond
   correctly to congestion, and these are described more fully in a
   companion document [re-ecn-motive].

Authors' Statement: Status (to be removed by the RFC Editor)

   Although the re-ECN protocol is intended to make a simple but
   far-reaching change to the Internet architecture, the most immediate
   priority for the authors is to delay any move of the ECN nonce to
   Proposed Standard status.  The argument for this position is
   developed in Appendix E.

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs created using the rfcdiff tool are available at

   From -06 to -07 (current version):

      Major changes made following splitting this protocol document from
      the related motivations document [re-ecn-motive].

      Significant re-ordering of remaining text.

      New terminology introduced for clarity.

      Minor editorial changes throughout.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Terminology
   4.  Protocol Overview
     4.1.  Simplified Re-ECN Protocol
       4.1.1.  Congestion Control and Policing the Protocol
       4.1.2.  Background and Applicability
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     4.3.  Re-ECN Protocol Operation
     4.4.  Positive and Negative Flows
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  IP in IP Tunnels
     5.7.  Non-Issues
   6.  Transport Layers
     6.1.  TCP
       6.1.1.  RECN mode: Full Re-ECN capable transport
       6.1.2.  RECN-Co mode: Re-ECT Sender with a RFC3168 compliant ECN
               Receiver
       6.1.3.  Capability Negotiation
       6.1.4.  Extended ECN (EECN) Field Settings during Flow Start or
               after Idle Periods
       6.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     6.2.  Other Transports
       6.2.1.  General Guidelines for Adding Re-ECN to Other Transports
       6.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
       6.2.3.  Guidelines for adding Re-ECN to DCCP
       6.2.4.  Guidelines for adding Re-ECN to SCTP
   7.  Incremental Deployment
   8.  Related Work
     8.1.  Congestion Notification Integrity
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  Justification for Two Codepoints Signifying Zero Worth
                Packets
   Appendix C.  ECN Compatibility
   Appendix D.  Packet Marking with FNE During Flow Start
   Appendix E.  Argument for holding back the ECN nonce
   Appendix F.  Alternative Terminology Used in Other Documents

1.  Introduction

   This document aims to provide a complete specification of the
   addition of the re-ECN protocol to IP and guidelines on how to add it
   to transport layer protocols, including a complete specification of
   re-ECN in TCP as an example.  The motivation behind this proposal is
   given in [re-ecn-motive], but we include a brief summary here.

   Re-ECN is intended to allow senders to inform the network of the
   level of congestion they expect their flows to see.  This information
   is currently only visible at the transport layer.  ECN [RFC3168]
   reveals the upstream congestion state of any path by monitoring the
   rate of CE marks.  The receiver then informs the sender when it has
   seen a marked packet.  Re-ECN builds on ECN by providing new
   codepoints that allow the sender to declare the level of congestion
   it expects on the forward path.  It is closely related to ECN and
   indeed we define a compatibility mode to allow a re-ECN sender to
   communicate with an ECN receiver [xref].

   If a sender understates expected congestion compared to actual
   congestion then the network could discard packets or enact some other
   sanction.  A policer can also be introduced at the ingress of
   networks that can limit the level of congestion being caused.

   Stated generally, the problem solved by re-ECN is how to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it.  But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   It is important to add a few key points.

   o  In any standard network it always takes one round trip before any
      feedback is received.  For this reason a sender must make a
      conservative prediction by transmitting IP packets with a special
      Cautious marking.

   o  The prediction is carried in-band in normal data packets and, for
      many transports, feedback can be carried in the normal
      acknowledgements or control packets.

   o  The re-ECN protocol is independent of the transport.  In TCP,
      acknowledgments are used to convey the feedback from receiver to
      sender.  This memo concentrates on TCP as an example transport
      protocol; however, the re-ECN protocol is compatible with any
      transport where feedback can be sent from receiver to sender.

   This document is structured as follows.  First an overview of the
   re-ECN protocol is given (Section 4), outlining its attributes and
   explaining conceptually how it works as a whole.  The two main parts
   of the document follow: the protocol specification, divided into
   network (Section 5) and transport (Section 6) layers.  Deployment
   issues discussed throughout the document are brought together in
   Section 7.  Related work is discussed in Section 8.

2.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Terminology

   The following terminology is used throughout this memo.  Some of this
   terminology is new and, to avoid confusion, Appendix F sets out all
   the alternative terminology that has been used in other re-ECN
   related documents.

   o  Neutral packet - a packet that is able to be congestion marked by
      an ECN or re-ECN queue.

   o  Negative packet - a Neutral packet that has been congestion marked
      by an ECN or re-ECN queue.

   o  Positive packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path.  In
      general Positive packets should only be sent in response to
      feedback received from the receiver.*

   o  Cancelled packet - a Positive packet that has been congestion
      marked by an ECN or re-ECN queue.

   o  Cautious packet - a packet that has been marked by the sender to
      indicate the expected level of congestion along its path.  In
      general Cautious packets should be used when there is insufficient
      feedback to be confident about the congestion state of the
      network.*

   *  The difference between Positive and Cautious packets is explained
      in detail later in the document, along with guidelines on the use
      of Cautious packets.

   All the above terms have related IP codepoints as defined in
   Section 5.

4.  Protocol Overview

4.1.  Simplified Re-ECN Protocol

   We describe here the simplified re-ECN protocol.  To simplify the
   description we assume packets and segments are synonymous.

   Packets are sent from a sender to a receiver.  In Figure 1 the queues
   (Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168].
   If congestion occurs then packets are marked with the congestion
   experienced (CE) flag exactly as in the ECN protocol [RFC3168]; the
   routers do not need to be modified and do not need to know the re-ECN
   protocol.  The receiver constantly informs the sender of the current
   count of congestion-marked packets it has seen.  The sender uses this
   information to determine how many Positive packets it must send into
   the network.  The sender's aim is to balance the number of bytes that
   have been congestion marked with the number of Positive bytes it has
   sent.

        +--------- Feedback----------+
        |                            |
        v                            |
      +---+    +----+    +----+    +---+
      |   |    |    |    |    |    |   |
      | S |--->| Q1 |--->| Q2 |--->| R |
      |   |    |    |    |    |    |   |
      +---+    +----+    +----+    +---+

                      Figure 1: Simple Re-ECN

4.1.1.  Congestion Control and Policing the Protocol

   The arrangement of the protocol ensures that packets carry a
   declaration of the amount of congestion that will be experienced on
   the path.  The re-ECN protocol is orthogonal to any congestion
   control algorithm, but can be used to ensure that congestion control
   is being applied by the sender.

   In general we assume that there will be a policer at the network
   ingress which can rate limit traffic based on the amount of
   congestion declared.

   At the network egress there is a dropper which can impose sanctions
   on flows that incorrectly declare congestion.

   Policers and droppers are explained in more detail in
   [re-ecn-motive].

4.1.2.  Background and Applicability

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion.  Re-ECN is not a new congestion control protocol; rather
   it is orthogonal to congestion control itself.  Re-ECN is concerned
   with revealing information about congestion so that users and
   networks can be held accountable for the congestion they cause, or
   allow to be caused.

   Re-ECN builds on ECN so we briefly recap the essentials of the ECN
   protocol [RFC3168].  Two bits in the IP header (v4 or v6) are
   assigned to the ECN field.  The sender clears the field to "00"
   (Not-ECT) if either end-point transport is not ECN-capable.
   Otherwise it indicates an ECN-capable transport (ECT) using either of
   the two codepoints "10" or "01" (ECT(0) and ECT(1) respectively).

   ECN-capable queues probabilistically set this field to "11" if
   congestion is experienced (CE).  In general this marking probability
   will increase with the length of the queue at its egress link
   (typically using the RED algorithm [RFC2309]).  However, they still
   drop rather than mark Not-ECT packets.  With multiple ECN-capable
   queues on a path, a flow of packets accumulates the fraction of CE
   marking that each queue adds.  The combined effect of the packet
   marking of all the queues along the path signals congestion of the
   whole path to the receiver.  So, for example, if one queue early in a
   path is marking 1% of packets and another later in a path is marking
   2%, flows that pass through both queues will experience approximately
   3% marking (see Appendix A for a precise treatment).

   The choice of two ECT codepoints in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback.  But Section 8.1 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback.  On the other hand, re-ECN is designed to ensure both that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called `re-feedback'
   [Re-fb].  The word is short for either receiver-aligned, re-inserted
   or re-echoed feedback.
   But it actually works even when no feedback is available.  In fact it
   has been carefully designed to work for single datagram flows.  It
   also encourages aggregation of single packet flows by congestion
   control proxies.  Then, even if the traffic mix of the Internet were
   to become dominated by short messages, it would still be possible to
   control congestion effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval.  But re-ECN can be deployed incrementally at
   the transport layer around unmodified queues using existing fields in
   IP (v4 or v6).  However it does also require the last undefined bit
   in the IPv4 header, which it uses in combination with the 2-bit ECN
   field to create four new codepoints.  Nonetheless, we RECOMMEND
   adding optional preferential drop to IP queues based on the re-ECN
   fields in order to improve resilience against DoS attacks.
   Similarly, re-ECN works best if both the sender and receiver
   transports are re-ECN-capable, but it can work with just sender
   support (Section 6.1.2).

4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7).  This specification
   defines a new re-ECN extension (RE) flag.  We will defer the
   definition of the actual position of the RE flag in the IPv4 & v6
   headers until Section 5.  When we don't need to choose between IPv4
   and v6 wire protocols it will suffice to call it the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and SHOULD remain unchanged along the path, although it can be read
   by network elements that understand the re-ECN protocol.
   It is feasible that a network element MAY change the setting of the
   RE flag, perhaps acting as a proxy for an end-point, but such a
   protocol would have to be defined in another specification (e.g.
   [Re-PCN]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) giving eight
   codepoints.  We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   One of re-ECN's codepoints is an alternative use of the codepoint set
   aside in RFC3168 for the ECN nonce (ECT(1)).  Transports using re-ECN
   do not need to use the ECN nonce as long as the sender is also
   checking for transport protocol compliance
   [I-D.moncaster-tcpm-rcv-cheat].  The case for doing this is given in
   Appendix E.  Two re-ECN codepoints are given compatible uses to those
   defined in RFC3168 (Not-ECT and CE).  The other codepoint used by
   RFC3168 (ECT(0)) isn't used for re-ECN.  Altogether this leaves one
   codepoint of the eight unused by ECN or re-ECN and available for
   future use.

   +--------+-------------+-------+-----------+------------------------+
   | ECN    | RFC3168     | RE    | EECN      | re-ECN meaning         |
   | field  | codepoint   | flag  | codepoint |                        |
   +--------+-------------+-------+-----------+------------------------+
   | 00     | Not-ECT     | 0     | Not-ECT   | Not re-ECN-capable     |
   |        |             |       |           | transport (Legacy)     |
   | 00     | ---         | 1     | FNE       | Feedback not           |
   |        |             |       |           | established (Cautious) |
   | 01     | ECT(1)      | 0     | Re-Echo   | Re-echoed congestion   |
   |        |             |       |           | and RECT (Positive)    |
   | 01     | ---         | 1     | RECT      | Re-ECN capable         |
   |        |             |       |           | transport (Neutral)    |
   | 10     | ECT(0)      | 0     | ECT(0)    | RFC3168 ECN use only   |
   |        |             |       |           |                        |
   | 10     | ---         | 1     | --CU--    | Currently unused       |
   |        |             |       |           |                        |
   | 11     | CE          | 0     | CE(0)     | Re-Echo cancelled by   |
   |        |             |       |           | CE (Cancelled)         |
   | 11     | ---         | 1     | CE(-1)    | Congestion Experienced |
   |        |             |       |           | (Negative)             |
   +--------+-------------+-------+-----------+------------------------+

                     Table 1: Extended ECN Codepoints

4.3.  Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections.  Other transports will be discussed later.

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol.  Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sends Cautious packets by setting the
   FNE codepoint.  The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.
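
   As an illustration only, the codepoint names of Table 1 can be
   written as a lookup keyed on the 2-bit ECN field and the RE flag.
   The sketch below is not part of the protocol specification; the
   names EECN_CODEPOINTS, eecn_name and congestion_mark are
   hypothetical, and congestion_mark assumes the caller applies it only
   to packets that a queue is allowed to mark.

```python
from typing import Tuple

# Hypothetical lookup of Table 1: (ECN field, RE flag) -> EECN codepoint name.
EECN_CODEPOINTS = {
    (0b00, 0): "Not-ECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "ECT(0)",
    (0b10, 1): "--CU--",
    (0b11, 0): "CE(0)",
    (0b11, 1): "CE(-1)",
}


def eecn_name(ecn: int, re_flag: int) -> str:
    """Return the Table 1 name for a 2-bit ECN field and a 1-bit RE flag."""
    if ecn not in range(4) or re_flag not in (0, 1):
        raise ValueError("ecn must be 0..3 and re_flag must be 0 or 1")
    return EECN_CODEPOINTS[(ecn, re_flag)]


def congestion_mark(ecn: int, re_flag: int) -> Tuple[int, int]:
    """Model an RFC 3168 queue marking a packet.

    The queue sets the ECN field to CE ("11") and never touches the RE
    flag, so the sender's setting of the RE flag survives the marking.
    """
    return (0b11, re_flag)
```

   With these definitions, marking a Neutral packet (RECT) yields
   CE(-1) and marking a Positive packet (Re-Echo) yields CE(0), which
   matches the Negative and Cancelled rows of Table 1.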

   Specifically, once feedback from an ECN or re-ECN capable flow is
   established, a re-ECN sender always initialises the ECN field to
   ECT(1).  And it usually sets the RE flag to "1" indicating a Neutral
   packet.  Whenever a queue marks a packet to CE, the receiver feeds
   back this event to the sender.  On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends (indicating a Positive packet).

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7).  To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0".  So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

      +---+  +----+                +----+  +---+
      | S |--| Q1 |----------------| Q2 |--| R |
      +---+  +----+                +----+  +---+
        .      .                      .      .
      ^ .      .                      .      .
      | .      .                      .      .
      | .     RE blanking fraction    .      .
   3% |-------------------------------+=======
      | .      .                      |      .
   2% | .      .                      |      .
      | .      . CE marking fraction  |      .
   1% | .      +----------------------+      .
      | .      |                      .      .
   0% +--------------------------------------->
        ^           ^                    ^
        L           M                    N   Observation points

                Figure 2: A 2-Queue Example (Imprecise)

   Figure 2 uses a simple network to illustrate how re-ECN allows queues
   to measure downstream congestion.  The receiver views a CE marking
   fraction of 3% which is fed back to the sender.  The sender sets an
   RE blanking fraction of 3% to match this.  This RE blanking fraction
   can be observed along the path as the RE flag is not changed by
   network nodes once set by the sender.  This is shown by the
   horizontal line at 3% in the figure.  The CE marked fraction is shown
   by the stepped line which rises to meet the RE blanking fraction line
   with steps at each queue where packets are marked.
   Two queues are shown (Q1 and Q2) that are currently congested.  Each
   time packets pass through one of them, a fraction is marked (1% at Q1
   and 2% at Q2).  The approximate downstream congestion can be measured
   at the observation points shown along the path by subtracting the CE
   marking fraction from the RE blanking fraction: in Figure 2 this
   gives approximately 3% - 0% = 3% at L, 3% - 1% = 2% at M and
   3% - 3% = 0% at N (Appendix A derives these approximations from a
   precise analysis).

   NB due to the unary nature of ECN marking and the equivalent unary
   nature of re-ECN blanking, the precise fraction of marked bytes must
   be calculated by maintaining a moving average of the number of
   packets that have been marked as a proportion of the total number of
   packets.

   Along the path the fraction of packets that had their RE field
   cleared remains unchanged so it can be used as a reference against
   which to compare upstream congestion.  The difference predicts
   downstream congestion for the rest of the path.  Therefore, measuring
   the fractions of each codepoint at any point in the Internet will
   reveal upstream, downstream and whole path congestion.

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration.  We are not saying any protocol
   handler will work with these average fractions directly.  In fact the
   protocol actually requires the number of marked and blanked bytes to
   balance by the time the packet reaches the receiver.

4.4.  Positive and Negative Flows

   In Section 3 we introduced the terms Positive, Neutral, Negative,
   Cautious and Cancelled.  This terminology is based on the requirement
   to balance the proportion of bytes marked as CE with the proportion
   of bytes that are re-echo marked.  In the rest of this memo we will
   loosely talk of positive or negative flows, meaning flows where the
   moving average of the downstream congestion metric is persistently
   positive or negative.
   A negative flow is one where more CE marked packets than re-ECN
   blanked packets arrive.  Likewise, in positive flows more re-ECN
   blanked packets arrive than CE marked packets.  The notion of a
   negative metric arises because it is derived by subtracting one
   metric from another.  Of course actual downstream congestion cannot
   be negative, only the metric can (whether due to time lags or
   deliberate malice).

   Therefore we will talk of packets having `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   the downstream congestion metric.  The worth of each type of packet
   is given below in Table 2.  The idea is that most flows start with
   zero worth.  Every time the network decrements the worth of a packet,
   the sender increments the worth of a later packet.  Then, over time,
   as many positive octets should arrive at the receiver as negative.
   Note we have said octets not packets, so if packets are of different
   sizes, the worth should be incremented on enough octets to balance
   the octets in negative packets arriving at the receiver.  It is this
   balance that will allow the network to hold the sender accountable
   for the congestion it causes.

   If a packet carrying re-echoed congestion happens to also be
   congestion marked, the +1 worth added by the sender will be cancelled
   out by the -1 network congestion marking.  Although the two worth
   values correctly cancel out, neither the congestion marking nor the
   re-echoed congestion is lost, because the RE bit and the ECN field
   are orthogonal.  So, whenever this happens, the receiver will
   correctly detect and re-echo the new congestion event as well.

   The table below specifies unambiguously the worth of each extended
   ECN codepoint.  Note the order is different from the previous table
   to better show how the worth increments and decrements.

   +---------+-------+---------------+-------+-------------------------+
   | ECN     | RE    | Extended ECN  | Worth | Re-ECN Term             |
   | field   | bit   | codepoint     |       |                         |
   +---------+-------+---------------+-------+-------------------------+
   | 00      | 0     | Not-RECT      | ...   | ---                     |
   | 00      | 1     | FNE           | +1    | Cautious                |
   | 01      | 0     | Re-Echo       | +1    | Positive                |
   | 10      | 0     | Legacy        | ...   | RFC3168 ECN use only    |
   |         |       |               |       |                         |
   | 11      | 0     | CE(0)         | 0     | Cancelled               |
   | 01      | 1     | RECT          | 0     | Neutral                 |
   | 10      | 1     | --CU--        | ...   | Currently unused        |
   |         |       |               |       |                         |
   | 11      | 1     | CE(-1)        | -1    | Negative                |
   +---------+-------+---------------+-------+-------------------------+

               Table 2: 'Worth' of Extended ECN Codepoints

5.  Network Layer

5.1.  Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168].  However, an extension to the ECN field we
   call the RE (Re-ECN extension) flag (Section 4.2) is defined in this
   document.  It doubles the extended ECN codepoint space, giving 8
   potential codepoints.  The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7 collects together and summarises all the changes
   defined in this document).

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0).  Alternatively, some would call this
   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
   header (Figure 3).

                                0   1   2
                              +---+---+---+
                              | R | D | M |
                              | E | F | F |
                              +---+---+---+

   Figure 3: New Definition of the Re-ECN Extension (RE) Control Flag at
                 the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 4
   and specified fully in Section 6.
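
   The bit position given above and the worth values of Table 2 can be
   combined in a short sketch.  It is illustrative only and makes two
   assumptions that are not stated in this section: that the 2-bit ECN
   field occupies the two least significant bits of the second octet of
   the IPv4 header, as in [RFC3168], and that a header without options
   is at least 20 octets long.  The names WORTH, extended_ecn and worth
   are hypothetical.

```python
from typing import Optional, Tuple

# Worth per (ECN field, RE flag), following Table 2.
# None stands for the "..." rows, where no worth is defined.
WORTH = {
    (0b00, 0): None,  # Not re-ECN-capable
    (0b00, 1): +1,    # FNE (Cautious)
    (0b01, 0): +1,    # Re-Echo (Positive)
    (0b01, 1): 0,     # RECT (Neutral)
    (0b10, 0): None,  # RFC3168 ECN use only
    (0b10, 1): None,  # currently unused
    (0b11, 0): 0,     # CE(0) (Cancelled)
    (0b11, 1): -1,    # CE(-1) (Negative)
}


def extended_ecn(ipv4_header: bytes) -> Tuple[int, int]:
    """Return (ECN field, RE flag) read from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets long")
    # Assumption from RFC 3168: ECN is the low two bits of octet 1.
    ecn = ipv4_header[1] & 0b11
    # Bit 48 counting from 0 is the most significant bit of octet 6
    # (counting from 0), i.e. the start of byte 7 counting from 1.
    re_flag = ipv4_header[6] >> 7
    return ecn, re_flag


def worth(ipv4_header: bytes) -> Optional[int]:
    """Return the Table 2 worth of the packet, or None if undefined."""
    return WORTH[extended_ecn(ipv4_header)]
```

   A meter that sums worth multiplied by packet size over a flow would
   obtain the downstream congestion metric discussed in Section 4.4;
   that accumulation is left out of the sketch.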
   The RE flag is always considered in conjunction with the 2-bit ECN
   field, as if they were concatenated together to form a 3-bit extended
   ECN field.  If the ECN field is set to either the ECT(1) or CE
   codepoint, when the RE flag is blanked (cleared to "0") it represents
   a re-echo of congestion experienced by an earlier packet.  If the ECN
   field is set to the Not-ECT codepoint, when the RE flag is set to "1"
   it represents the feedback not established (FNE) codepoint, which
   signals that the packet was sent without the benefit of congestion
   feedback.

   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow.  For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted.  It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks.  This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS].  The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e. one occurrence of the flag is sufficient but
   further occurrences achieve the same effect if previous ones were
   lost).

   There will probably be other claims pending on the use of bit 48.  We
   know of at least two, [ARI05] and [RFC3514], but so far neither has
   been pursued in the IETF, although the present proposal would meet
   the needs of the latter.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately.
   The present proposal is backward compatible with RFC3514 because if
   re-ECN compliant senders were benign they would correctly clear the
   evil bit to honestly declare that they had just received congestion
   feedback. Whereas evil-doers would hide congestion feedback by
   setting the evil bit continuously, or at least more often than they
   should. So, evil senders can be identified, because they declare
   that they are good less often than they should.

5.2. Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 4).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                   Reserved for future use                   |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 4: Definition of a New IPv6 Congestion Hop by Hop Option
        Header containing the re-ECN Extension (RE) Control Flag

                           0 1 2 3 4 5 6 7 8
                          +-+-+-+-+-+-+-+-+-
                          |AIU|C|Option ID|
                          +-+-+-+-+-+-+-+-+-

          Figure 5: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes. For
   re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the
   Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header.
   But routers with these optional re-ECN features or a re-ECN policing
   function will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC4302]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.

   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. As the RE flag does
   not need end-to-end authentication, we set the C flag to '1'.

   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
   with IANA.}

5.3. Router Forwarding Behaviour

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168]. Specifications for PHBs MAY define
   different forwarding behaviours from this default, but this is not
   required. [Re-PCN] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped.
      Note that the FNE codepoint has been intentionally chosen so that,
      to RFC3168 compliant routers (which do not inspect the RE flag),
      an FNE packet appears to be Not-ECT so it will be dropped by
      legacy AQM algorithms.

      A network operator MUST NOT configure a queue to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in [re-ecn-motive] would count as rate limiters
      for this purpose.

   Preferential Drop: If a re-ECN capable router queue experiences very
      high load so that it has to drop arriving packets (e.g. a DoS
      attack), it MAY preferentially drop packets within the same
      Diffserv PHB using the preference order for extended ECN
      codepoints given in Table 3. Preferential dropping can be
      difficult to implement on some hardware, but if feasible it would
      discriminate against attack traffic if done as part of the overall
      policing framework of [re-ecn-motive]. If nowhere else, routers
      at the egress of a network SHOULD implement preferential drop
      (stronger than the MAY above). For simplicity, preferences 4 & 5
      MAY be merged into one preference level.

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo canceled  |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | RFC3168 ECN use   |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not               |
   |       |     |            |       |            | Re-ECN-capable    |
   |       |     |            |       |            | transport         |
   +-------+-----+------------+-------+------------+-------------------+

     Table 3: Drop Preference of EECN Codepoints (Sorted by `Worth')

      The above drop preferences are arranged to preserve packets with
      more positive worth (Section 4.4), given senders of positive
      packets must have honestly declared downstream congestion. A full
      treatment of this is provided in the companion document describing
      the motivation and architecture for re-ECN [re-ecn-motive],
      particularly when the application of re-ECN to protect against
      DDoS attacks is described.

5.4. Justification for Setting the First SYN to FNE

   The initial SYN MUST be set to FNE by Re-ECT client A
   (Section 6.1.4), and Section 5.3 says a queue MAY optionally treat an
   FNE packet as ECN capable, so an initial SYN may be marked CE(-1)
   rather than dropped. This seems dangerous, because the sender has
   not yet established whether the receiver is a RFC3168 one that does
   not understand congestion marking.
   It also seems to allow malicious senders to take advantage of ECN
   marking to avoid so much drop when launching SYN flooding attacks.
   Below we explain the features of the protocol design that remove both
   these dangers.

   ECN-capable initial SYN with a Not-ECT server: If the TCP server B
      is re-ECN capable, provision is made for it to feed back a
      possible congestion marked SYN in the SYN ACK (Section 6.1.4).
      But if the TCP client A finds out from the SYN ACK that the server
      was not ECN-capable, the TCP client MUST conservatively consider
      the first SYN as congestion marked before setting itself into
      Not-ECT mode. Section 6.1.4 mandates that such a TCP client MUST
      also set its initial window to 1 segment. In this way we remove
      the need to cautiously set the first SYN to Not-RECT. This will
      give worse performance while deployment is patchy, but better
      performance once deployment is widespread.

   SYN flooding attacks can't exploit ECN-capability: Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited somewhere. Introduction of the FNE codepoint was a
      deliberate move to enable transport-neutral handling of flow-start
      and flow state set-up in the IP layer where it belongs. It then
      becomes possible to protect against flooding attacks of all forms
      (not just SYN flooding) without transport-specific inspection for
      things like the SYN flag in TCP headers. Then, for instance, SYN
      flooding attacks using IPSec ESP encryption can also be rate
      limited at the IP layer.
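
   The interaction between the FNE codepoint and the rate-limiting
   condition of Section 5.3 can be sketched as follows. This is only an
   illustration, not part of the protocol definition: the function name
   and the `fne_rate_limited` setting are hypothetical, and the sketch
   assumes the queue has already selected the packet for a congestion
   signal.

```python
from typing import Optional, Tuple

NOT_ECT = 0b00  # ECN field value shared by Not-RECT (RE=0) and FNE (RE=1)
CE = 0b11       # ECN field value for congestion experienced


def on_congestion(ecn: int, re: int,
                  fne_rate_limited: bool) -> Optional[Tuple[int, int]]:
    """Return the (ecn, re) pair to forward, or None to drop the packet.

    fne_rate_limited is a hypothetical operator setting meaning "FNE
    packets are known to be rate limited locally or upstream".
    """
    if ecn == NOT_ECT:
        is_fne = (re == 1)
        if is_fne and fne_rate_limited:
            return (CE, re)  # FNE marked instead of dropped: becomes CE(-1)
        return None          # Not-RECT, or FNE without the rate-limit guarantee
    return (CE, re)          # ECN-capable packet is marked; RE is left unchanged
```

   A queue that cannot guarantee rate limiting behaves like a legacy
   RFC3168 queue towards FNE packets, so a SYN flood gains nothing from
   carrying the FNE codepoint.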

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

5.5. Control and Management

5.5.1. Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops caused by
   sanctions from drops due to other causes, such as congestion or
   transmission losses. The dropper would send the message to the
   sender of the flow, not the receiver.

   If we did define this message type, it would be REQUIRED for all
   re-ECT senders to parse and understand it. Note that a sender MUST
   only use this message to explain why losses are occurring. A sender
   MUST NOT take this message to mean that losses have occurred that it
   was not aware of. Otherwise, spoof messages could be sent by
   malicious sources to slow down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2. Rate Response Control

   As discussed in [re-ecn-motive] the sender's access operator will be
   expected to use bulk per-user policing, but they might choose to
   introduce a per-flow policer. In cases where operators do introduce
   per-flow policing, there may be a need for a sender to send a request
   to the ingress policer asking for permission to apply a non-default
   response to congestion (where TCP-friendly is assumed to be the
   default). This would require the sender to know what message
   format(s) to use and to be able to discover how to address the
   policer. The required control protocol(s) are outside the scope of
   this document, but will require definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6. IP in IP Tunnels

   For re-ECN to work correctly through IP in IP tunnels, it needs
   slightly different tunnel handling to regular ECN [RFC3168].
   Currently there is some inconsistency between how the handling of IP
   in IP tunnels is defined in [RFC3168] and how it is defined in
   [RFC4301], but re-ECN would work fine with the IPsec behaviour. This
   inconsistency is addressed in a new Internet Draft [ECN-tunnel] that
   proposes to update RFC3168 tunnel behaviour to bring it into line
   with IPsec. Ideally, for re-ECN to work through a tunnel, the tunnel
   entry should copy both the RE flag and the ECN field from the inner
   to the outer IP header.
   Then at the tunnel exit, any congestion marking of the outer ECN
   field should overwrite the inner ECN field (unless the inner field is
   Not-ECT in which case an alarm should be raised). The RE flag
   shouldn't change along a path, so the outer RE flag should be the
   same as the inner. If it isn't, a management alarm should be raised.
   This behaviour is the same as the full-functionality variant of
   [RFC3168] at tunnel exit, but different at tunnel entry.

   If tunnels are left as they are specified in [RFC3168], whether the
   limited or full-functionality variants are used, a problem arises
   with re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be RFC3168 compliant traffic, and therefore may
   be wrongly rate limited. In a full-functionality ECN tunnel, the
   result will depend on whether the tunnel entry copies the inner RE
   flag to the outer header or the RE flag in the outer header is always
   cleared. If the former, the flow will tend to be too positive when
   accounted for at borders. If the latter, it will be too negative.
   If the rules set out in [ECN-tunnel] are followed then this will not
   be an issue.

5.7. Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o  Various link layers support explicit congestion notification, such
      as Frame Relay and ATM. Explicit congestion notification is
      proposed to be added to other link layers, such as Ethernet
      (802.3ar Ethernet congestion management) and MPLS [RFC5129];

   o  Encryption and IPSec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases). Given the
   RE flag is not intended to change along the path, this means that
   downstream congestion will still be measurable at any point where IP
   is processed on the path by subtracting negative from positive
   markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to dig into
   inner headers. Obfuscation of flow identifiers is not a problem for
   re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid, it only requires them to be unique. So if
   an IPSec encapsulating security payload (ESP [RFC4835]) or an
   authentication header (AH [RFC4302]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are.
   The alternative would be for the sender to make each packet appear to
   be a new flow, which would require them all to be marked FNE in order
   to avoid being treated with the bulk of malicious flows at the egress
   dropper. Given the FNE marking is worth +1 and networks are likely
   to rate limit FNE packets, endpoints are given an incentive not to
   set FNE on each packet. But if the sender really does want to hide
   the flow relationship between packets it can choose to pay the cost
   of multiple FNE packets, which in the long run will compensate for
   the extra memory required on network policing elements to process
   each flow.

6. Transport Layers

6.1. TCP

   Re-ECN capability at the sender is essential. At the receiver it is
   optional, as long as the receiver has a basic RFC3168-compliant
   ECN-capable transport (ECT) [RFC3168]. Given re-ECN is not the first
   attempt to define the semantics of the ECN field, we give a table
   below summarising what happens for various combinations of
   capabilities of the sender S and receiver R, as indicated in the
   first four columns below. The last column gives the mode a
   half-connection should be in after the first two of the three TCP
   handshakes.

   +--------+--------------+------------+---------+--------------------+
   | Re-ECT | ECT-Nonce    | ECT        | Not-ECT | S-R                |
   |        | (RFC3540)    | (RFC3168)  |         | Half-connection    |
   |        |              |            |         | Mode               |
   +--------+--------------+------------+---------+--------------------+
   | SR     |              |            |         | RECN               |
   | S      | R            |            |         | RECN-Co            |
   | S      |              | R          |         | RECN-Co            |
   | S      |              |            | R       | Not-ECT            |
   +--------+--------------+------------+---------+--------------------+

       Table 4: Modes of TCP Half-connection for Combinations of ECN
                Capabilities of Sender S and Receiver R

   We will describe what happens in each mode, then describe how they
   are negotiated.
   The abbreviations for the modes in the above table mean:

   RECN: Full re-ECN capable transport

   RECN-Co: Re-ECN sender in compatibility mode with a RFC3168
      compliant [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
      receiver. Implementation of this mode is OPTIONAL.

   Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when
      at least one of the transports does not understand even basic ECN
      marking.

   Note that we use the term Re-ECT for a host transport that is
   re-ECN-capable but RECN for the modes of the half connections between
   hosts when they are both Re-ECT. If a host transport is Re-ECT, this
   fact alone does NOT imply either of its half connections will
   necessarily be in RECN mode, at least not until it has confirmed that
   the other host is Re-ECT.

6.1.1. RECN mode: Full Re-ECN capable transport

   In full RECN mode, for each half connection, both the sender and the
   receiver each maintain an unsigned integer counter we will call ECC
   (echo congestion counter). The receiver maintains a count of how
   many times a CE marked packet has arrived during the half-connection.
   Once a RECN connection is established, the three TCP option flags
   (ECE, CWR & NS) used for ECN-related functions in other versions of
   ECN are used as a 3-bit field for the receiver to repeatedly tell the
   sender the current value of ECC, modulo 8, whenever it sends a TCP
   ACK. We will call this the echo congestion increment (ECI) field.
   This overloaded use of these 3 option flags as one 3-bit ECI field is
   shown in Figure 7. The actual definition of the TCP header,
   including the addition of support for the ECN nonce, is shown for
   comparison in Figure 6. This specification does not redefine the
   names of these three TCP option flags, it merely overloads them with
   another definition once a flow is established.
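
   As an illustration of the overloading just described, the sketch
   below packs ECC modulo 8 into the three flag positions of the 16-bit
   word formed by bytes 13 and 14 of the TCP header, and applies the
   difference formula that this section gives for the sender on each
   ACK. It is a sketch only: the function names are hypothetical, and it
   assumes bit 0 of the word is the most significant bit, so that NS,
   CWR and ECE (bits 7, 8 and 9) carry ECI with ECE least significant.
   The ECI interpretation applies only to segments without the SYN flag.

```python
ECI_SHIFT = 6                 # ECE (bit 9 of 16, MSB = bit 0) is the ECI LSB
ECI_MASK = 0x7 << ECI_SHIFT   # NS = 0x0100, CWR = 0x0080, ECE = 0x0040


def set_eci(flags_word: int, ecc: int) -> int:
    """Receiver side: write ECC modulo 8 into the ECI field of the word."""
    return (flags_word & ~ECI_MASK & 0xFFFF) | ((ecc % 8) << ECI_SHIFT)


def get_eci(flags_word: int) -> int:
    """Read the 3-bit ECI field from the word (non-SYN segments only)."""
    return (flags_word & ECI_MASK) >> ECI_SHIFT


def eci_delta(eci: int, local_ecc: int) -> int:
    """Sender side: D = (ECI + 8 - ECC mod 8) mod 8."""
    return (eci + 8 - local_ecc % 8) % 8


# Example: a plain ACK (header length 5, ACK flag set) echoing ECC = 10.
word = set_eci(0x5010, 10)
```

   In this example `word` is 0x5090 and `get_eci(word)` is 2. A sender
   whose local ECC is 7 would compute `eci_delta(2, 7)` as 3, showing
   that the modulo-8 wrap is handled without a special case.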

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 6: The (post-ECN Nonce) definition of bytes 13 and 14 of the
                               TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 7: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header, overloading the current definitions above for established
                               RECN flows.

   Receiver Action in RECN Mode

      Every time a CE marked packet arrives at a receiver in RECN mode,
      the receiver transport increments its local value of ECC and MUST
      echo its value, modulo 8, to the sender in the ECI field of the
      next ACK. It MUST repeat the same value of ECI in every
      subsequent ACK until the next CE event, when it increments ECI
      again.

      The increment of the local ECC values is modulo 8 so the field
      value simply wraps round back to zero when it overflows. The
      least significant bit is to the right (labelled bit 9).

      A receiver in RECN mode MAY delay the echo of a CE to the next
      delayed-ACK, which would be necessary if ACK-withholding were
      implemented.

   Sender Action in RECN Mode

      On the arrival of every ACK, the sender compares the ECI field
      with its own ECC value, then replaces its local value with that
      from the ACK.
      The difference D (D = (ECI + 8 - ECC mod 8) mod 8) is assumed to
      be the number of CE marked packets that arrived at the receiver
      since it sent the previously received ACK (but see below for the
      sender's safety strategy). Whenever the ECI field increments by D
      (and/or d drops are detected), the sender MUST clear the RE flag
      to "0" in the IP header of the next D' data packets it sends
      (where D' = D + d), effectively re-echoing each single increment
      of ECI. Otherwise the data sender MUST send all data packets with
      RE set to "1".

      As a general rule, once a flow is established, as well as setting
      or clearing the RE flag as above, a data sender in RECN mode MUST
      always set the ECN field to ECT(1). However, the settings of the
      extended ECN field during flow start are defined in Section 6.1.4.

      As we have already emphasised, the re-ECN protocol makes no
      changes and has no effect on the TCP congestion control algorithm.
      So, the first increment of ECI (or detection of a drop) in a RTT
      triggers the standard TCP congestion response, no more than one
      congestion response per round trip, as usual. However, the sender
      re-echoes every increment of ECI irrespective of RTTs.

      A TCP sender also acts as the receiver for the other
      half-connection. The host will maintain two ECC values S.ECC and
      R.ECC as sender and receiver respectively. Every TCP header sent
      by a host in RECN mode will also repeat the prevailing value of
      R.ECC in its ECI field. If a sender in RECN mode has to
      retransmit a packet due to a suspected loss, the re-transmitted
      packet MUST carry the latest prevailing value of R.ECC when it is
      re-transmitted, which will not necessarily be the one it carried
      originally.

6.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 compliant ECN
       Receiver

   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
   differently to that of RFC3168 compliant ECN. In other words, the
   receiver sets the ECE flag repeatedly in the TCP header and the
   sender responds by setting the CWR flag. Although RECN-Co mode is
   used when the receiver has not implemented the re-ECN protocol, the
   sender can infer enough from its RFC3168 compliant ECN feedback to
   set or clear the RE flag reasonably well. Specifically, every time
   the receiver toggles the ECE field from "0" to "1" (or a loss is
   detected), as well as setting CWR in the TCP flags, the re-ECN sender
   MUST blank the RE flag of the next packet to "0" as it would do in
   full RECN mode. Otherwise, the data sender SHOULD send all other
   packets with RE set to "1". Once a flow is established, a re-ECN
   data sender in RECN-Co mode MUST always set the ECN field to ECT(1).

   If a CE marked packet arrives at the receiver within a round trip
   time of a previous mark, the receiver will still be echoing ECE for
   the last CE mark. Therefore, such a mark will be missed by the
   sender. Of course, this isn't of concern for congestion control, but
   it does mean that very occasionally the RE blanking fraction will be
   understated. Therefore flows in RECN-Co mode may occasionally be
   mistaken for very lightly cheating flows and consequently might
   suffer a small number of packet drops through an egress dropper. We
   expect re-ECN would be deployed for some time before policers and
   droppers start to enforce it. So, given there is not much ECN
   deployment yet anyway, this minor problem may affect only a very
   small proportion of flows, reducing to nothing over the years as
   RFC3168 compliant ECN hosts upgrade.
   The use of RECN-Co mode would need to be reviewed in the light of
   experience at the time of re-ECN deployment.

   RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their
   code simple MAY choose not to implement this mode. If they do not,
   a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the
   presence of an ECN-capable receiver. It MAY choose to fall back to
   the ECT-Nonce mode, but if re-ECN implementers don't want to be
   bothered with RECN-Co mode, they probably won't want to add an
   ECT-Nonce mode either.

6.1.2.1. Re-ECN support for the ECN Nonce

   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
   Nonce [RFC3540]. This means that the sending code of a re-ECN
   implementation will never need to include ECN Nonce support. Re-ECN
   is intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, and re-ECN only requires support
   from the sender, therefore it is preferable to specifically rule out
   the need for dual sender implementations. As a consequence, a re-ECN
   capable sender will never set ECT(0), so it will be easier for
   network elements to discriminate re-ECN traffic flows from other ECN
   traffic, which will always contain some ECT(0) packets.

   However, a re-ECN implementation MAY OPTIONALLY include receiving
   code that complies with the ECN Nonce protocol when interacting with
   a sender that supports the ECN nonce (rather than re-ECN), but this
   support is not required.

   RFC3540 allows an ECN nonce sender to choose whether to sanction a
   receiver that does not ever set the nonce sum. Given re-ECN is
   intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, implementers of re-ECN receivers MAY
   choose not to implement backwards compatibility with the ECN nonce
   capability.
   This may be because they deem that the risk of sanctions is low,
   perhaps because significant deployment of the ECN nonce seems
   unlikely at implementation time.

6.1.3. Capability Negotiation

   During the TCP hand-shake at the start of a connection, an originator
   of the connection (host A) with a re-ECN-capable transport MUST
   indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1
   in the initial SYN.

   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
   CWR=1 and ECE=0. The responding host MUST NOT set this combination
   of flags unless the preceding SYN has already indicated Re-ECT
   support as above. Normally a Re-ECT server (B) will reply to a
   Re-ECT client with NS=0, but if the initial SYN from Re-ECT client A
   is marked CE(-1), a Re-ECT server B MUST increment its local value of
   ECC. But B cannot reflect the value of ECC in the SYN ACK, because
   it is still using the 3 bits to negotiate connection capabilities.
   So, server B MUST set the alternative TCP header flags in its SYN
   ACK: NS=1, CWR=1 and ECE=0.

   These handshakes are summarised in Table 5 below, with X indicating
   NS can be either 0 or 1 depending on whether congestion had been
   experienced. The handshakes used for the other flavours of ECN are
   also shown for comparison. To compress the width of the table, the
   headings of the first four columns have been severely abbreviated, as
   follows:

   R: *R*e-ECT

   N: ECT-*N*once (RFC3540)

   E: *E*CT (RFC3168)

   I: Not-ECT (*I*mplicit congestion notification).

   These correspond with the same headings used in Table 4. Indeed, the
   resulting modes in the last two columns of the table below are a more
   comprehensive way of saying the same thing as Table 4.
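
   The originator's view of this negotiation can be sketched as a small
   lookup over the SYN ACK flags, following the rows of Table 5 in which
   A is Re-ECT. This is an illustrative sketch, not a normative
   algorithm: the function names are hypothetical, and treating every
   flag combination not listed in Table 5 as Not-ECT is an assumption
   made here for safety (it also covers a SYN ACK that merely reflects
   the SYN's flags).

```python
from typing import Tuple


def originator_modes(ns: int, cwr: int, ece: int) -> Tuple[str, str]:
    """Return (A-B mode, B-A mode) for Re-ECT originator A, given the
    NS, CWR and ECE flags of the SYN ACK returned by responder B."""
    if (cwr, ece) == (1, 0):
        return ("RECN", "RECN")          # B is Re-ECT; NS is the 'X' column
    if (ns, cwr, ece) == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")  # B is ECN nonce capable (RFC3540)
    if (ns, cwr, ece) == (0, 0, 1):
        return ("RECN-Co", "ECT")        # B is RFC3168 ECN capable
    return ("Not-ECT", "Not-ECT")        # assumption: anything else, incl. 1 1 1


def syn_mark_echoed(ns: int, cwr: int, ece: int) -> bool:
    """True if Re-ECT responder B signalled that the SYN arrived CE marked."""
    return (ns, cwr, ece) == (1, 1, 0)
```

   An implementation that omits the OPTIONAL RECN-Co mode would map the
   two RECN-Co outcomes to its chosen fall-back mode instead.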

   +----+---+---+---+------------+-------------+-----------+-----------+
   | R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
   +----+---+---+---+------------+-------------+-----------+-----------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
   | AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
   | A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
   | A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
   | A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
   | B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
   | B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
   | B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
   +----+---+---+---+------------+-------------+-----------+-----------+

      Table 5: TCP Capability Negotiation between Originator (A) and
                             Responder (B)

   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
   its two half-connections into the modes given in Table 5. As soon as
   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
   half-connections into the modes given in Table 5. The
   half-connections will remain in these modes for the rest of the
   connection, including for the third segment of TCP's three-way
   hand-shake (the ACK).

   {ToDo: Consider RSTs within a connection.}

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken RFC3168 compliant
   implementation that behaves this way), RFC3168 specifies that the
   whole connection MUST revert to Not-ECT.

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the
   3-bit ECI value, which is only set as a copy of the local ECC value
   in non-SYN packets.

6.1.4. Extended ECN (EECN) Field Settings during Flow Start or after
       Idle Periods

   If the originator (A) of a TCP connection supports re-ECN it MUST set
   the extended ECN (EECN) field in the IP header of the initial SYN
   packet to the feedback not established (FNE) codepoint.

   FNE is a new extended ECN codepoint defined by this specification
   (Section 4.2). The feedback not established (FNE) codepoint is used
   when the transport does not have the benefit of ECN feedback so it
   cannot decide whether to set or clear the RE flag.

   If after receiving a SYN the server B has set its sending
   half-connection into RECN mode or RECN-Co mode, it MUST set the
   extended ECN field in the IP header of its SYN ACK to the feedback
   not established (FNE) codepoint. Note the careful wording here,
   which means that Re-ECT server B MUST set FNE on a SYN ACK whether it
   is responding to a SYN from a Re-ECT client or from a client that is
   merely ECN-capable. This is because FNE indicates the transport is
   ECN capable.

   The original ECN specification [RFC3168] required SYNs and SYN ACKs
   to use the Not-ECT codepoint of the ECN field. The aim was to
   prevent well-known DoS attacks such as SYN flooding being able to
   gain from the advantage that ECN capability afforded over drop at
   ECN-capable routers.

   For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this
   caution was unnecessary, and proposes to allow a SYN ACK to be
   ECN-capable to improve performance. By stipulating the FNE codepoint
   for the initial SYN, we comply with RFC3168 in word but not in
   spirit, because we have indeed set the ECN field to Not-ECT, but we
   have extended the ECN field with another bit. And it will be seen
   (Section 5.3) that we have defined one setting of that bit to mean an
   ECN-capable transport.
Therefore, by proposing that the FNE 1253 codepoint MUST be used on the initial SYN of a connection, we have 1254 gone further by proposing to make the initial SYN ECN-capable too. 1255 Section 5.4 justifies deciding to make the initial SYN ECN-capable. 1257 Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will 1258 have already been set on the initial SYN and possibly the SYN ACK as 1259 above. But each re-ECN sender will have to set FNE cautiously on a 1260 few data packets as well, given a number of packets will usually have 1261 to be sent before sufficient congestion feedback is received. The 1262 behaviour will be different depending on the mode of the half- 1263 connection: 1265 RECN mode: Given the constraints on TCP's initial window [RFC3390] 1266 and its exponential window increase during slow start 1267 phase [RFC2581], it turns out that the sender SHOULD set FNE on 1268 the first and third data packets in its flow after the initial 1269 3-way handshake, assuming equal sized data packets once a flow is 1270 established. Appendix D presents the calculation that led to this 1271 conclusion. Below, after running through the start of an example 1272 TCP session, we give the intuition learned from that calculation. 1274 RECN-Co mode: A re-ECT sender that switches into re-ECN 1275 compatibility mode or into Not-ECT mode (because it has detected 1276 the corresponding host is not re-ECN capable) MUST limit its 1277 initial window to 1 segment. The reasoning behind this constraint 1278 is given in Section 5.4. Having set this initial window, a re-ECN 1279 sender in RECN-Co mode SHOULD set FNE on the first and third data 1280 packets in a flow, as for RECN mode. 
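The rule in the list above can be sketched as follows. This is only an
illustration: the function and constant names are hypothetical and not
defined by this specification, and the sketch assumes that no congestion
feedback has arrived yet and that the sender has no other knowledge of
the path, so the SHOULD is followed literally.

```python
# Hypothetical sketch; none of these names are defined by the specification.
FNE = "FNE"    # feedback not established
RECT = "RECT"  # neutral (zero worth) re-ECN codepoint

# 1-based positions of the data packets (after the 3-way handshake)
# on which the sender SHOULD set FNE.
FNE_DATA_PACKETS = {1, 3}


def eecn_for_data_packet(n, mode):
    """Return the EECN codepoint for the n-th data packet of a flow,
    assuming no congestion feedback has been received so far."""
    if mode not in ("RECN", "RECN-Co"):
        raise ValueError("sketch only covers RECN and RECN-Co half-connections")
    if n < 1:
        raise ValueError("data packets are counted from 1")
    return FNE if n in FNE_DATA_PACKETS else RECT


def initial_window_segments(mode):
    """RECN-Co senders MUST limit the initial window to 1 segment.
    For RECN the usual RFC 3390 limit applies; the sketch returns None
    to mean 'no extra restriction'."""
    return 1 if mode == "RECN-Co" else None
```

For the first five data packets of server B in RECN mode this gives FNE,
RECT, FNE, RECT, RECT.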
+----+------+----------------+-------+-------+---------------+------+
|    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
+----+------+----------------+-------+-------+---------------+------+
|    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
| -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
|  1 |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
|    |      | CWR,ECE,NS     |       |       |               |      |
|  2 |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
|    |      |                |       |       | SYN,ACK,CWR   |      |
|  3 |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
|  4 | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
|  5 |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
|  6 |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
|  7 |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
|  8 |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
|  9 |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
| 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
| 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
| 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
| 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
|    |      | ...            |       |       |               |      |
+----+------+----------------+-------+-------+---------------+------+

Table 6: TCP Session Example #1

Table 6 shows an example TCP session, where the server B sets FNE on
its first and third data packets (lines 5 & 7) as well as on the
initial SYN ACK as previously described. The left hand half of the
table shows the relevant settings of headers sent by client A in
three layers: the TCP payload size; TCP settings; then IP settings.
The right hand half gives equivalent columns for server B. The only
TCP settings shown are the sequence number (SEQ), acknowledgement
number (ACK) and the relevant control (CTL) flags that A sets in the
TCP header. The IP columns show the setting of the extended ECN
(EECN) field.
1318 Also shown on the receiving side of the table is the value of the 1319 receiver's echo congestion counter (R.ECC) after processing the 1320 incoming EECN header. Note that, once a host sets a half-connection 1321 into RECN mode, it MUST initialise its local value of ECC to zero. 1323 The intuition that Appendix D gives for why a sender should set FNE 1324 on the first and third data packets is as follows. At line 13, a 1325 packet sent by B is shown with an '*', which means it has been 1326 congestion marked by an intermediate queue from RECT to CE(-1). On 1327 receiving this CE marked packet, client A increments its ECC counter 1328 to 1 as shown. This was the 7th data packet B sent, but before 1329 feedback about this event returns to B, it might well have sent many 1330 more packets. Indeed, during exponential slow start, about as many 1331 packets will be in flight (unacknowledged) as have been acknowledged. 1332 So, when the feedback from the congestion event on B's 7th segment 1333 returns, B will have sent about 7 further packets that will still be 1334 in flight. At that stage, B's best estimate of the network's packet 1335 marking fraction will be 1/7. So, as B will have sent about 14 1336 packets, it should have already marked 2 of them as FNE in order to 1337 have marked 1/7; hence the need to have set the first and third data 1338 packets to FNE. 1340 Client A's behaviour in Table 6 also shows FNE being set on the first 1341 SYN and the first data packet (lines 1 & 4), but in this case it 1342 sends no more data packets, so of course, it cannot, and does not 1343 need to, set FNE again. Note that in the A-B direction there is no 1344 need to set FNE on the third part of the three-way hand-shake (line 1345 3---the ACK). 
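The arithmetic behind this intuition can be written out as a small
sketch. It uses only the idealisation stated above (during slow start
about as many packets are in flight as have been acknowledged); the
function name is hypothetical, and the sketch is not a substitute for
the calculation in Appendix D.

```python
from fractions import Fraction


def fne_packets_needed(first_marked_packet):
    """Idealised slow-start estimate: how many packets should already have
    carried FNE by the time feedback about the first CE mark returns."""
    # About as many packets are in flight as have been acknowledged,
    # e.g. 7 acknowledged + 7 in flight = 14 sent.
    sent_by_then = 2 * first_marked_packet
    # Best estimate of the marking fraction: 1 mark in the first 7 packets.
    marking_fraction = Fraction(1, first_marked_packet)
    return sent_by_then * marking_fraction


print(fne_packets_needed(7))  # prints 2
```

Under this idealisation the result is 2 whichever packet is the first to
be marked, which is consistent with setting FNE on two early data
packets.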
1347 Note that in this section we have used the word SHOULD rather than 1348 MUST when specifying how to set FNE on data segments before positive 1349 congestion feedback arrives (but note that the word MUST was used for 1350 FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first 1351 and third data segments to entertain the possibility that the TCP 1352 transport has the benefit of other knowledge of the path, which it 1353 re-uses from one flow for the benefit of a newly starting flow. For 1354 instance, one flow can re-use knowledge of other flows between the 1355 same hosts if using a Congestion Manager [RFC3124] or when a proxy 1356 host aggregates congestion information for large numbers of flows. 1358 After an idle period of more than 1 second, a re-ECN sender transport 1359 MUST set the EECN field of the packet that resumes the connection to 1360 FNE. Note that this next packet may be sent a very long time later, 1361 a packet does NOT have to be sent after 1 second of idling. In order 1362 that the design of network policers can be deterministic, this 1363 specification deliberately puts an absolute lower limit on how long a 1364 connection can be idle before the packet that resumes the connection 1365 must be set to FNE, rather than relating it to the connection round 1366 trip time. We use the lower bound of the retransmission timeout 1367 (RTO) [RFC2988], which is commonly used as the idle period before TCP 1368 must reduce to the restart window [RFC2581]. Note our specification 1369 of re-ECN's idle period is NOT intended to change the idle period for 1370 TCP's restart, nor indeed for any other purposes. 1372 {ToDo: Describe how the sender falls back to RFC3168 modes if packets 1373 don't appear to be getting through (to work round firewalls 1374 discarding packets they consider unusual).} 1376 6.1.5. 
Pure ACKs, Retransmissions, Window Probes and Partial ACKs

A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
to Not-ECT in pure ACKs, retransmissions and window probes, as
specified in [RFC3168]. Our eventual goal is for all packets to be
sent with re-ECN enabled, and we believe the semantics of the ECI
field go a long way towards being able to achieve this. However, we
have not completed a full security analysis for these cases, so for
now we merely re-state current practice.

We must also reconcile the facts that congestion marking is applied
to packets, whereas acknowledgements cover octet ranges, and
acknowledged octet boundaries need not match the transmitted
boundaries. The general principle we work to is to remain compatible
with TCP's congestion control, which is driven by congestion events
at packet granularity, while at the same time aiming to blank the RE
flag on at least as many octets in a flow as have been marked CE.

Therefore, a re-ECN TCP receiver MUST increment its ECC value once
for each CE marked packet it receives. And that value MUST be echoed
to the sender in the first available ACK using the ECI field. This
ensures the TCP sender's congestion control receives timely feedback
on congestion events at the same packet granularity that they were
generated on congested queues.

Then, a re-ECN sender calculates the difference D between the
incoming ECI field and its own ECC value, and stores it by
incrementing a counter R by D. R is then decremented by 1 for each
subsequent packet that is sent with the RE flag blanked, until R is
no longer positive. Using this technique, whenever a re-ECN transport
sends a packet that is not re-ECN capable (e.g. a retransmission),
the remaining packets required to have the RE flag blanked will be
automatically carried over to subsequent packets, through the
variable R.
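The bookkeeping with the counter R described above can be sketched as
follows. The class and method names are hypothetical. The sketch also
assumes that the 3-bit ECI field wraps modulo 8 and that the sender
updates its own ECC to the latest echoed value; neither detail is
spelled out in this section.

```python
class ReEcnSenderSketch:
    """Hypothetical sender-side bookkeeping for the counter R."""

    ECI_MODULUS = 8  # assumption: the 3-bit ECI field wraps modulo 8

    def __init__(self):
        self.ecc = 0  # sender's own copy of the echo congestion counter
        self.r = 0    # packets still owed with the RE flag blanked

    def on_ack(self, eci):
        """Process the ECI field of an incoming non-SYN ACK."""
        d = (eci - self.ecc) % self.ECI_MODULUS  # new CE marks reported
        self.ecc = (self.ecc + d) % self.ECI_MODULUS
        self.r += d
        return d

    def next_packet_codepoint(self, re_ecn_capable=True):
        """Choose the EECN codepoint for the next outgoing packet."""
        if not re_ecn_capable:
            # Pure ACK, retransmission or window probe: sent as Not-ECT
            # with RE cleared; R is left untouched and carries over.
            return "Not-ECT"
        if self.r > 0:
            self.r -= 1
            return "Re-Echo"  # RE flag blanked
        return "RECT"
```

For example, after an ACK reporting two new marks, an intervening
retransmission leaves R at 2, and the next two re-ECN capable packets
are sent with the RE flag blanked.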
This does not ensure precisely the same number of octets have RE
blanked as were CE marked. But we believe positive errors will
cancel negative over a long enough period. {ToDo: However, more
research is needed to prove whether this is so. If it is not, it may
be necessary to increment and decrement R in octets rather than
packets, by incrementing R as the product of D and the size in octets
of packets being sent (typically the MSS).}

6.2. Other Transports

6.2.1. General Guidelines for Adding Re-ECN to Other Transports

As a general rule, Re-ECT sender transports that have established the
receiver transport is at least ECN-capable (not necessarily re-ECN
capable) MUST blank the RE codepoint for at least as many octets as
arrive at the receiver with the CE codepoint set. Re-ECN-capable
sender transports should always initialise the ECN field to the
ECT(1) codepoint once a flow is established.

If the sender transport does not have sufficient feedback to even
estimate the path's CE rate, it SHOULD set FNE continuously. If the
sender transport has some, perhaps stale, feedback to estimate that
the path's CE rate is almost certainly less than E%, the transport
MAY blank RE in packets for E% of sent octets, and set the RECT
codepoint for the remainder.

The following sections give guidelines on how re-ECN support could be
added to RSVP or NSIS, to DCCP, and to SCTP - although separate
Internet drafts will be necessary to document the exact mechanics of
re-ECN in each of these protocols.

{ToDo: Give a brief outline of what would be expected for each of the
following:

o  UDP fire and forget (e.g. DNS)

o  UDP streaming with no feedback

o  UDP streaming with feedback

}

6.2.2.
Guidelines for adding Re-ECN to RSVP or NSIS

A separate I-D has been submitted [Re-PCN] describing how re-ECN can
be used in an edge-to-edge rather than end-to-end scenario. It can
then be used by downstream networks to police whether upstream
networks are blocking new flow reservations when downstream
congestion is too high, even though the congestion is in other
operators' downstream networks. This relates to current IETF work on
Admission Control over Diffserv using Pre-Congestion Notification
(PCN) [PCN-arch].

6.2.3. Guidelines for adding Re-ECN to DCCP

Besides adjusting the initial features negotiation sequence,
operating re-ECN in DCCP [RFC4340] could be achieved by defining a
new option to be added to acknowledgments that would include a
multibit field where the destination could copy its ECC.

6.2.4. Guidelines for adding Re-ECN to SCTP

Appendix A in [RFC4960] gives the specifications for SCTP to support
ECN. Similar steps should be taken to support re-ECN. Besides
adjusting the initial features negotiation sequence, operating re-ECN
in SCTP could be achieved by defining a new control chunk that would
include a multibit field where the destination could copy its ECC.

7. Incremental Deployment

The design of the re-ECN protocol started from the fact that the
current ECN marking behaviour of queues was sufficient and that
re-feedback could be introduced around these queues by changing the
sender behaviour but not the routers. Otherwise, if we had required
routers to be changed, the chance of encountering a path that had
every router upgraded would be vanishingly small during early
deployment, giving no incentive to start deployment. Also, as there
is no new forwarding behaviour, routers and hosts do not have to
signal or negotiate anything.
However, networks that choose to protect themselves using re-ECN do
have to add new security functions at their trust boundaries with
others. They distinguish legacy traffic by its ECN field. Traffic
from Not-ECT transports is distinguishable by its Not-ECT marking.
Traffic from RFC3168 compliant ECN transports is distinguished from
re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1)
for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on
either 50% (the nonce) or 100% (the default) of packets, whereas
re-ECN does not use ECT(0) at all. We can use this distinguishing
feature of RFC3168 compliant ECN traffic to separate it out for
different treatment at the various border security functions: egress
dropping, ingress policing and border policing.

The general principle we adopt is that an egress dropper will not
drop any legacy traffic, but ingress and border policers will limit
the bulk rate of legacy traffic (Not-ECT, ECT(0) and those marked
with the unused codepoint) that can enter each network. Then, during
early re-ECN deployment, operators can set very permissive (or
non-existent) rate-limits on legacy traffic, but once re-ECN
implementations are generally available, legacy traffic can be
rate-limited increasingly harshly. Ultimately, an operator might
choose to block all legacy traffic entering its network, or at least
only allow through a trickle.

Then, the more strictly the limits are set, the more RFC3168 ECN
sources will gain by upgrading to re-ECN. Thus, towards the end of
the voluntary incremental deployment period, RFC3168 compliant
transports can be given progressively stronger encouragement to
upgrade.
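A per-packet view of this distinction might look like the sketch below.
The two-bit ECN field values are those of RFC 3168; the function name
and the returned labels are hypothetical. The sketch assumes that FNE
is carried as an ECN field of Not-ECT with the RE flag set, and it does
not try to attribute CE marked packets to either kind of source, since
the original ECT codepoint has been overwritten.

```python
# Two-bit ECN field values defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def classify_at_border(ecn, re_flag):
    """Hypothetical per-packet classifier; labels are this sketch's own."""
    if ecn == ECT1:
        return "re-ecn"   # re-ECN senders use ECT(1) and never ECT(0)
    if ecn == ECT0:
        # RFC3168 ECN sender, or the unused codepoint if RE is set.
        return "legacy"
    if ecn == NOT_ECT:
        # Assumption: FNE is Not-ECT with RE set; plain Not-ECT is legacy.
        return "re-ecn" if re_flag else "legacy"
    # CE: the original ECT codepoint is no longer visible in the header.
    return "congestion-marked"
```

Note that an individual ECT(1) packet from an ECN nonce sender would
look like re-ECN traffic here, so a real policer would presumably have
to judge by the share of ECT(0) packets rather than by single packets.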
The following list of minor changes brings together all the points
where re-ECN semantics for use of the two-bit ECN field differ from
RFC3168:

o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender
   sets ECT(0) by default (Section 4.3);

o  No provision is necessary for a re-ECN capable source transport to
   use the ECN nonce (Section 6.1.2.1);

o  Routers MAY preferentially drop different extended ECN codepoints
   (Section 5.3);

o  Packets carrying the feedback not established (FNE) codepoint MAY
   optionally be marked rather than dropped by routers, even though
   their ECN field is Not-ECT (with the important caveat in
   Section 5.3);

o  Packets may be dropped by policing nodes because of apparent
   misbehaviour, not just because of congestion;

o  Tunnel entry behaviour is still to be defined, but may have to be
   different from RFC3168 (Section 5.6).

None of these changes REQUIRE any modifications to routers. Also
none of these changes affect anything about end to end congestion
control; they are all to do with allowing networks to police that end
to end congestion control is well-behaved.

8. Related Work

8.1. Congestion Notification Integrity

The choice of two ECT code-points in the ECN field [RFC3168]
permitted future flexibility, optionally allowing the sender to
encode the experimental ECN nonce [RFC3540] in the packet stream.
This mechanism has since been included in the specifications of DCCP
[RFC4340].

The ECN nonce is an elegant scheme that allows the sender to detect
if someone in the feedback loop - the receiver especially - tries to
claim no congestion was experienced when in fact congestion led to
packet drops or ECN marks. For each packet it sends, the sender
chooses between the two ECT codepoints in a pseudo-random sequence.
1564 Then, whenever the network marks a packet with CE, if the receiver 1565 wants to deny congestion happened, she has to guess which ECT 1566 codepoint was overwritten. She has only a 50:50 chance of being 1567 correct each time she denies a congestion mark or a drop, which 1568 ultimately will give her away. 1570 The purpose of a network-layer nonce should primarily be protection 1571 of the network, while a transport-layer nonce would be better used to 1572 protect the sender from cheating receivers. Now, the assumption 1573 behind the ECN nonce is that a sender will want to detect whether a 1574 receiver is suppressing congestion feedback. This is only true if 1575 the sender's interests are aligned with the network's, or with the 1576 community of users as a whole. This may be true for certain large 1577 senders, who are under close scrutiny and have a reputation to 1578 maintain. But we have to deal with a more hostile world, where 1579 traffic may be dominated by peer-to-peer transfers, rather than 1580 downloads from a few popular sites. Often the `natural' self- 1581 interest of a sender is not aligned with the interests of other 1582 users. It often wishes to transfer data quickly to the receiver as 1583 much as the receiver wants the data quickly. 1585 In contrast, the re-ECN protocol enables policing of an agreed rate- 1586 response to congestion (e.g. TCP-friendliness) at the sender's 1587 interface with the internetwork. It also ensures downstream networks 1588 can police their upstream neighbours, to encourage them to police 1589 their users in turn. But most importantly, it requires the sender to 1590 declare path congestion to the network and it can remove traffic at 1591 the egress if this declaration is dishonest. So it can police 1592 correctly, irrespective of whether the receiver tries to suppress 1593 congestion feedback or whether the sender ignores genuine congestion 1594 feedback. 
Therefore the re-ECN protocol addresses a much wider range of
cheating problems, which includes the one addressed by the ECN nonce.

9. Security Considerations

This whole memo concerns the deployment of a secure congestion
control framework. However, below we list some specific security
issues that we are still working on:

o  Malicious users have the ability to launch dynamically changing
   attacks, exploiting the time it takes to detect an attack, given
   ECN marking is binary. We are concentrating on subtle
   interactions between the ingress policer and the egress dropper in
   an effort to make it impossible to game the system.

o  There is an inherent need for at least some flow state at the
   egress dropper given the binary marking environment, which leads
   to an apparent vulnerability to state exhaustion attacks. An
   egress dropper design with bounded flow state is in write-up.

o  A malicious source can spoof another user's address and send
   negative traffic to the same destination in order to fool the
   dropper into sanctioning the other user's flow. To prevent or
   mitigate these two different kinds of DoS attack, against the
   dropper and against given flows, we are considering various
   protection mechanisms.

o  A malicious client can send requests using a spoofed source
   address to a server (such as a DNS server) that tends to respond
   with single packet responses. This server will then be tricked
   into having to set FNE on the first (and only) packet of all these
   wasted responses. Given packets marked FNE are worth +1, this
   will cause such servers to consume more of their allowance to
   cause congestion than they would wish to. In general, re-ECN is
   deliberately designed so that single packet flows have to bear the
   cost of not discovering the congestion state of their path.
   One of the reasons for introducing re-ECN is to encourage short
   flows to make use of previous path knowledge by moving the cost of
   this lack of knowledge to sources that create short flows.
   Therefore, in the long run we might expect services like DNS to
   aggregate single packet flows into connections where it brings
   benefits. However, this attack where DNS requests are made from
   spoofed addresses genuinely forces the server to waste its
   resources. The only mitigating feature is that the attacker has to
   set FNE on each of its requests if they are to get through an
   egress dropper to a DNS server. The attacker therefore has to
   consume as many resources as the victim, which at least implies
   re-ECN does not unwittingly amplify this attack.

Having highlighted outstanding security issues, we now explain the
design decisions that were taken based on a security-related
rationale. It may seem that the six codepoints of the eight made
available by extending the ECN field with the RE flag have been used
rather wastefully to encode just five states. In effect the RE flag
has been used as an orthogonal single bit, using up four codepoints
to encode the three states of positive, neutral and negative worth.
The mapping of the codepoints in an earlier version of this proposal
used the codepoint space more efficiently, but the scheme became
vulnerable to network operators bypassing congestion penalties by
focusing congestion marking on positive packets. Appendix B explains
why fixing that problem, while allowing for incremental deployment,
would have used another codepoint anyway. So it was better to use
this orthogonal encoding scheme, which greatly simplified the whole
protocol and brought with it some subtle security benefits (see the
last paragraph of Appendix B).
1661 With the scheme as now proposed, once the RE flag is set or cleared 1662 by the sender or its proxy, it should not be written by the network, 1663 only read. So the endpoints can detect if any network maliciously 1664 alters the RE flag. IPSec AH integrity checking does not cover the 1665 IPv4 option flags (they were considered mutable---even the one we 1666 propose using for the RE flag that was `currently unused' when IPSec 1667 was defined). But it would be sufficient for a pair of endpoints to 1668 make random checks on whether the RE flag was the same when it 1669 reached the egress as when it left the ingress. Indeed, if IPSec AH 1670 had covered the RE flag, any network intending to alter sufficient RE 1671 flags to make a gain would have focused its alterations on packets 1672 without authenticating headers (AHs). 1674 The security of re-ECN has been deliberately designed to not rely on 1675 cryptography. 1677 10. IANA Considerations 1679 This memo includes no request to IANA (yet). 1681 If this memo was to progress to standards track, it would list: 1683 o The new RE flag in IPv4 (Section 5.1) and its extension with the 1684 ECN field to create a new set of extended ECN (EECN) codepoints; 1686 o The definition of the EECN codepoints for default Diffserv PHBs 1687 (Section 4.2) 1689 o The new extension header for IPv6 (Section 5.2); 1691 o The new combinations of flags in the TCP header for capability 1692 negotiation (Section 6.1.3); 1694 11. Conclusions 1696 {ToDo:} 1698 12. Acknowledgements 1700 Sebastien Cazalet and Andrea Soppera contributed to the idea of re- 1701 feedback. 
All the following have given helpful comments: Andrea Soppera, David
Songhurst, Peter Hovell, Louise Burness, Phil Eardley, Steve Rudkin,
Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, John Davey,
Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru Murgu, Nigel
Geffen, Pete Willis, John Adams (BT), Sally Floyd (ICIR), Joe
Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark Handley (who
developed the attack with canceled packets), Adam Greenhalgh (who
developed the attack on DNS) (UCL), Jon Crowcroft (Uni Cam), David
Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who complemented our
own dummy traffic attacks with others), Liz Maida (MIT), and comments
from participants in the CRN/CFP Broadband and DoS-resistant Internet
working groups. A special thank you to Alessandro Salvatori for
coming up with fiendish attacks on re-ECN.

13. Comments Solicited

Comments and questions are encouraged and very welcome. They can be
addressed to the IETF Transport Area working group's mailing list,
and/or to the authors.

14. References

14.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
           Control", RFC 2581, April 1999.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
           of Explicit Congestion Notification (ECN) to IP",
           RFC 3168, September 2001.

[RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
           Initial Window", RFC 3390, October 2002.

[RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
           Congestion Control Protocol (DCCP)", RFC 4340,
           March 2006.

[RFC4341]  Floyd, S. and E.
Kohler, "Profile for 1748 Datagram Congestion Control Protocol 1749 (DCCP) Congestion Control ID 2: TCP- 1750 like Congestion Control", RFC 4341, 1751 March 2006. 1753 [RFC4342] Floyd, S., Kohler, E., and J. Padhye, 1754 "Profile for Datagram Congestion 1755 Control Protocol (DCCP) Congestion 1756 Control ID 3: TCP-Friendly Rate 1757 Control (TFRC)", RFC 4342, 1758 March 2006. 1760 [RFC4960] Stewart, R., "Stream Control 1761 Transmission Protocol", RFC 4960, 1762 September 2007. 1764 14.2. Informative References 1766 [ARI05] Adams, J., Roberts, L., and A. 1767 IJsselmuiden, "Changing the Internet 1768 to Support Real-Time Content Supply 1769 from a Large Fraction of Broadband 1770 Residential Users", BT Technology 1771 Journal (BTTJ) 23(2), April 2005. 1773 [ECN-tunnel] Briscoe, B., "Layered Encapsulation 1774 of Congestion Notification", 1775 draft-briscoe-tsvwg-ecn-tunnel-01 1776 (work in progress), July 2008. 1778 [I-D.ietf-tcpm-ecnsyn] Kuzmanovic, A., "Adding Explicit 1779 Congestion Notification (ECN) 1780 Capability to TCP's SYN/ACK 1781 Packets", draft-ietf-tcpm-ecnsyn-07 1782 (work in progress), November 2008. 1784 [I-D.moncaster-tcpm-rcv-cheat] Moncaster, T., "A TCP Test to Allow 1785 Senders to Identify Receiver Non- 1786 Compliance", 1787 draft-moncaster-tcpm-rcv-cheat-02 1788 (work in progress), November 2007. 1790 [PCN-arch] Eardley, P., Babiarz, J., Chan, K., 1791 Charny, A., Geib, R., Karagiannis, 1792 G., Menth, M., and T. Tsou, "Pre- 1793 Congestion Notification 1794 Architecture", 1795 draft-ietf-pcn-architecture-09 (work 1796 in progress), January 2008. 1798 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1799 Davie, B., Deering, S., Estrin, D., 1800 Floyd, S., Jacobson, V., Minshall, 1801 G., Partridge, C., Peterson, L., 1802 Ramakrishnan, K., Shenker, S., 1803 Wroclawski, J., and L. Zhang, 1804 "Recommendations on Queue Management 1805 and Congestion Avoidance in the 1806 Internet", RFC 2309, April 1998. 
1808 [RFC2475] Blake, S., Black, D., Carlson, M., 1809 Davies, E., Wang, Z., and W. Weiss, 1810 "An Architecture for Differentiated 1811 Services", RFC 2475, December 1998. 1813 [RFC2988] Paxson, V. and M. Allman, "Computing 1814 TCP's Retransmission Timer", 1815 RFC 2988, November 2000. 1817 [RFC3124] Balakrishnan, H. and S. Seshan, "The 1818 Congestion Manager", RFC 3124, 1819 June 2001. 1821 [RFC3514] Bellovin, S., "The Security Flag in 1822 the IPv4 Header", RFC 3514, 1823 April 2003. 1825 [RFC3540] Spring, N., Wetherall, D., and D. 1826 Ely, "Robust Explicit Congestion 1827 Notification (ECN) Signaling with 1828 Nonces", RFC 3540, June 2003. 1830 [RFC4301] Kent, S. and K. Seo, "Security 1831 Architecture for the Internet 1832 Protocol", RFC 4301, December 2005. 1834 [RFC4302] Kent, S., "IP Authentication Header", 1835 RFC 4302, December 2005. 1837 [RFC4835] Eastlake, D., "Cryptographic 1838 Algorithm Implementation Requirements 1839 for Encapsulating Security Payload 1840 (ESP) and Authentication Header 1841 (AH)", RFC 4835, April 2007. 1843 [RFC5129] Davie, B., Briscoe, B., and J. Tay, 1844 "Explicit Congestion Marking in 1845 MPLS", RFC 5129, January 2008. 1847 [Re-PCN] Briscoe, B., "Emulating Border Flow 1848 Policing using Re-ECN on Bulk Data", 1849 draft-briscoe-re-pcn-border-cheat-02 1850 (work in progress), September 2008. 1852 [Re-fb] Briscoe, B., Jacquet, A., Di Cairano- 1853 Gilfedder, C., Salvatori, A., 1854 Soppera, A., and M. Koyabe, "Policing 1855 Congestion Response in an 1856 Internetwork Using Re-Feedback", ACM 1857 SIGCOMM CCR 35(4)277--288, 1858 August 2005, . 1862 [Savage99] Savage, S., Cardwell, N., Wetherall, 1863 D., and T. Anderson, "TCP congestion 1864 control with a misbehaving receiver", 1865 ACM SIGCOMM CCR 29(5), October 1999, 1866 . 1869 [Steps_DoS] Handley, M. and A. Greenhalgh, "Steps 1870 towards a DoS-resistant Internet 1871 Architecture", Proc. 
ACM SIGCOMM 1872 workshop on Future directions in 1873 network architecture (FDNA'04) pp 1874 49--56, August 2004. 1876 [re-ecn-motive] Briscoe, B., "Re-ECN: The Motivation 1877 for Adding Congestion Accountability 1878 to TCP/IP", draft-briscoe-tsvwg-re- 1879 ecn-tcp-motivation-00 (work in 1880 progress), March 2009. 1882 Appendix A. Precise Re-ECN Protocol Operation 1884 {ToDo: fix this} 1886 The protocol operation in the middle described in Section 4.3 was an 1887 approximation. In fact, standard ECN router marking combines 1% and 1888 2% marking into slightly less than 3% whole-path marking, because 1889 routers deliberately mark CE whether or not it has already been 1890 marked by another router upstream. So the combined marking fraction 1891 would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. 1893 To generalise this we will need some notation. 1895 o j represents the index of each resource (typically queues) along a 1896 path, ranging from 0 at the first router to n-1 at the last. 1898 o m_j represents the fraction of octets *m*arked CE by a particular 1899 router (whether or not they are already marked) because of 1900 congestion of resource j. 1902 o u_j represents congestion *u*pstream of resource j, being the 1903 fraction of CE marking in arriving packet headers (before 1904 marking). 1906 o p_j represents *p*ath congestion, being the fraction of packets 1907 arriving at resource j with the RE flag blanked (excluding Not- 1908 RECT packets). 1910 o v_j denotes expected congestion downstream of resource j, which 1911 can be thought of as a *v*irtual marking fraction, being derived 1912 from two other marking fractions. 1914 Observed fractions of each particular codepoint (u, p and v) and 1915 router marking rate m are dimensionless fractions, being the ratio of 1916 two data volumes (marked and total) over a monitoring period. 
All measurements are in terms of octets, not packets, assuming that
line resources are more congestible than packet processing.

The path congestion (RE blanking fraction) set by the sender should
reflect the upstream congestion (CE marking fraction) fed back from
the destination. Therefore in the steady state

   p_0 = u_n
       = 1 - (1 - m_0)(1 - m_1)...(1 - m_(n-1))

Similarly, at some point j in the middle of the network, if p = 1 -
(1 - u_j)(1 - v_j), then

   v_j = 1 - (1 - p)/(1 - u_j)

       ~= p - u_j;     if u_j << 100%

So, between the two routers in the example in Section 4.3, congestion
downstream is

   v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
       = 2.00%,

or a useful approximation of downstream congestion is

   v_1 ~= 2.98% - 1.00%
       ~= 1.98%.

Appendix B. Justification for Two Codepoints Signifying Zero Worth
Packets

It may seem a waste of a codepoint to set aside two codepoints of the
Extended ECN field to signify zero worth (RECT and CE(0) are both
worth zero). The justification is subtle, but worth recording.

The original version of Re-ECN ([Re-fb] and draft-00 of this memo)
used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
negative (CE) packets. The sender set packets to neutral unless
re-echoing congestion, when it set them positive, in much the same
way that it blanks the RE flag in the current protocol. However,
routers were meant to mark congestion by setting packets negative
(CE) irrespective of whether they had previously been neutral or
positive.

However, we did not arrange for senders to remember which packet had
been sent with which codepoint, or for feedback to say exactly which
packets arrived with which codepoints. The transport was meant to
inflate the number of positive packets it sent to allow for a few
being wiped out by congestion marking.
   We (wrongly) assumed that routers would congestion mark packets
   indiscriminately, so the transport could infer how many positive
   packets had been marked and compensate accordingly by re-echoing.
   But this created a perverse incentive for routers to preferentially
   congestion mark positive packets rather than neutral ones.

   We could have removed this perverse incentive by requiring Re-ECN
   senders to remember which packets they had sent with which codepoint,
   and by requiring feedback from the receiver to identify which packets
   arrived as which.  Then, if a positive packet was congestion marked
   to negative, the sender could have re-echoed twice to maintain the
   balance between positive and negative at the receiver.

   Instead, we chose to make re-echoing congestion (blanking RE)
   orthogonal to congestion notification (marking CE), which required a
   second neutral codepoint.  Then the receiver would be able to detect
   and echo a congestion event even if it arrived on a packet that had
   originally been positive.

   If we had added extra complexity to the sender and receiver
   transports to track changes to individual packets, we could have made
   it work, but then routers would have had an incentive to mark
   positive packets with half the probability of neutral packets.  That
   in turn would have led router algorithms to become more complex.
   Then senders wouldn't know whether a mark had been introduced by a
   simple or a complex router algorithm.  That in turn would have
   required another codepoint to distinguish between RFC3168 ECN and new
   Re-ECN router marking.

   Once the cost of IP header codepoint real-estate was the same for
   both schemes, there was no doubt that the simpler option for
   endpoints and for routers should be chosen.  The resulting protocol
   also no longer needed the tricky inflation/deflation complexity of
   the original (broken) scheme.
   It was also much simpler to understand conceptually.

   A further advantage of the new orthogonal four-codepoint scheme was
   that senders owned sole rights to change the RE flag and routers
   owned sole rights to change the ECN field.  Although we still arrange
   the incentives so neither party strays outside their dominion, these
   clear lines of authority simplify the matter.

   Finally, a little redundancy can be very powerful in a scheme such as
   this.  In one flow, the proportion of packets changed to CE should be
   the same as the proportion of RECT packets changed to CE(-1) and the
   proportion of Re-Echo packets changed to CE(0).  Double checking
   using such redundant relationships can improve the security of a
   scheme (cf. double-entry book-keeping or the ECN Nonce).
   Alternatively, it might be necessary to exploit the redundancy in the
   future to encode an extra information channel.

Appendix C.  ECN Compatibility

   The rationale for choosing the particular combinations of SYN and SYN
   ACK flags in Section 6.1.3 is as follows.

   Choice of SYN flags:  A Re-ECN sender can work with RFC3168 compliant
      ECN receivers so we wanted to use the same flags as would be used
      in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same
      time, we wanted a server (host B) that is Re-ECT to be able to
      recognise that the client (A) is also Re-ECT.  We believe also
      setting NS=1 in the initial SYN achieves both these objectives, as
      it should be ignored by RFC3168 compliant ECT receivers and by
      ECT-Nonce receivers.  But senders that are not Re-ECT should not
      set NS=1.  At the time ECN was defined, the NS flag was not
      defined, so setting NS=1 should be ignored by existing ECT
      receivers (but testing against implementations may yet prove
      otherwise).
      The ECN Nonce RFC [RFC3540] is silent on what the NS field might
      be set to in the TCP SYN, but we believe the intent was for a
      nonce client to set NS=0 in the initial SYN (again only testing
      will tell).  Therefore we define a Re-ECN-setup SYN as one with
      NS=1, CWR=1 & ECE=1.

   Choice of SYN ACK flags:  The client (A) needs to be able to
      determine whether the server (B) is Re-ECT.  The original ECN
      specification required an ECT server to respond to an ECN-setup
      SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There is no
      room to modify this by setting the NS flag, as that is already set
      in the SYN ACK of an ECT-Nonce server.  So we used the only
      combination of CWR and ECE that would not be used by existing TCP
      receivers: CWR=1 and ECE=0.  The original ECN specification
      defines this combination as a non-ECN-setup SYN ACK, which remains
      true for RFC3168 compliant and Nonce ECTs.  But for Re-ECN we
      define it as a Re-ECN-setup SYN ACK.  We didn't use a SYN ACK with
      both CWR and ECE cleared to 0 because that would be the likely
      response from most Not-ECT receivers.  And we didn't use a SYN ACK
      with both CWR and ECE set to 1 either, as at least one broken
      receiver implementation echoes whatever flags were in the SYN into
      its SYN ACK.  Therefore we define a Re-ECN-setup SYN ACK as one
      with CWR=1 & ECE=0.

   Choice of two alternative SYN ACKs:  The NS flag may take either
      value in a Re-ECN-setup SYN ACK.  Section 5.4 REQUIRES that a
      Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK
      to echo congestion experienced (CE) on the initial SYN.  Otherwise
      a Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only
      current known use of the NS flag in a SYN ACK is to indicate
      support for the ECN nonce, which will be negotiated by setting
      CWR=0 & ECE=1.
      Given the ECN nonce MUST NOT be used for a RECN mode connection, a
      Re-ECN-setup SYN ACK can use either setting of the NS flag without
      any risk of confusion, because the CWR & ECE flags will be
      reversed relative to those used by an ECN nonce SYN ACK.

Appendix D.  Packet Marking with FNE During Flow Start

   FNE (feedback not established) packets have two functions.  Their
   main role is to announce the start of a new flow when feedback has
   not yet been established.  However they also have the role of
   balancing the expected feedback and can be used where there are
   sudden changes in the rate of transmission.  Whilst such changes
   should not happen under TCP, this use of FNE as speculative marking
   underpins the following argument as to why the first and third
   packets should be set to FNE.

   The proportion of FNE packets in each roundtrip should be a high
   estimate of the potential error in the balance of number of
   congestion marked packets versus number of re-echo packets already
   issued.

   Let's call:

      S: the number of the TCP segments sent so far

      F: the number of FNE packets sent so far

      R: the number of Re-Echo packets sent so far

      A: the number of acknowledgments received so far

      C: the number of acknowledgments echoing a CE packet

   In normal operation, when we want to send packet S+1, we first need
   to check that enough Re-Echo packets have been issued:

      If R [...] 1 FNE

   o  if the acknowledgment doesn't echo a mark

      *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

      *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

   o  if no acknowledgement for these two packets echoes a congestion
      mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source
      could send another 4 RECT packets.  ==> 4 RECT

   o  if no acknowledgement for these four packets echoes a congestion
      mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source
      could send another 8 RECT packets.
      ==> 8 RECT

   This behaviour happens to match TCP's congestion window control in
   slow start, which is why for TCP sources, only the first and third
   packet need be FNE packets.

   A source that opened the congestion window any quicker would have to
   insert more FNE packets.  As another example, a UDP source sending
   VBR traffic might need to send several FNE packets ahead of the
   traffic peaks it generates.

Appendix E.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop (another router
   or the receiver).

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in
   [RFC3540] (currently experimental).  The DCCP specification [RFC4340]
   (currently proposed standard) requires that "Each DCCP sender SHOULD
   set ECN Nonces on its packets...".  It also mandates as a requirement
   for all CCID profiles that "Any newly defined acknowledgement
   mechanism MUST include a way to transmit ECN Nonce Echoes back to the
   sender.", therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces in
      its Ack Vectors."

   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The primary function of the ECN nonce is to protect the integrity of
   the information about congestion: ECN marks and packet drops.
   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.  Similarly,
   a transport layer nonce would protect against a receiver sending
   early acknowledgements [Savage99].

   If the ECN nonce reveals integrity problems with the information
   about congestion, the sending transport can use that knowledge for
   two functions:

   o  to protect its own resources, by allocating them in proportion to
      the rates that each network path can sustain, based on congestion
      control,

   o  and to protect congested routers in the network, by drastically
      slowing down its connection to the destination with corrupt
      congestion information.

   If the sending transport chooses to act in the interests of congested
   routers, it can reduce its rate if it detects that some malicious
   party in the feedback loop may be suppressing ECN feedback.  But this
   would only be useful to congested routers when /all/ senders using
   them are trusted to act in the interests of the congested routers.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.  In that case, the nonce allows
   senders to be assured that they aren't being duped into giving more
   of their own resources to a particular flow.  And if congestion
   suppression is detected, the sending transport can rate limit the
   offending connection to protect its own resources.
   Certainly, this is a useful function, but the IETF should carefully
   decide whether such a single, very specific case warrants IP header
   space.

   In contrast, Re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone - senders,
   receivers, neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.  It therefore seems prudent to give time for an
   alternative way to be found to do the one function the nonce is
   essential for.

   Moreover, while we have re-designed the Re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.  If the ECN nonce started to see some deployment
   (perhaps because it was blessed with proposed standard status),
   incremental deployment of Re-ECN would effectively be impossible,
   because Re-ECN marking fractions at inter-domain borders would be
   polluted by unknown levels of nonce traffic.

   The authors are aware that Re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of Re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the Re-ECN protocol is sufficiently
   useful to warrant standards action.

Appendix F.  Alternative Terminology Used in Other Documents

   A number of alternative terms have been used in various documents
   describing re-feedback and re-ECN.
   These are set out in the following table.

   +-------------------+---------------+-------------------------------+
   | Current           | EECN          | Colour                        |
   | Terminology       | codepoint     |                               |
   +-------------------+---------------+-------------------------------+
   | Cautious          | FNE           | Green                         |
   | Positive          | Re-Echo       | Black                         |
   | Neutral           | RECT          | Grey                          |
   | Negative          | CE(-1)        | Red                           |
   | Cancelled         | CE(0)         | Red-Black                     |
   | Legacy ECN        | ECT(0)        | White                         |
   | Currently Unused  | --CU--        | Currently unused              |
   |                   |               |                               |
   | Legacy            | Not-ECT       | White                         |
   +-------------------+---------------+-------------------------------+

                Table 7: Alternative re-ECN Terminology

Authors' Addresses

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   EMail: arnaud.jacquet@bt.com
   URI:

   Toby Moncaster
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 648734
   EMail: toby.moncaster@bt.com

   Alan Smith
   BT
   B54/76, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 640404
   EMail: alan.p.smith@bt.com