Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Informational                                A. Jacquet
Expires: April 26, 2007                                     A. Salvatori
                                                               M. Koyabe
                                                                      BT
                                                        October 23, 2006


     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-tsvwg-re-ecn-tcp-03

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 26, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document introduces a new protocol for explicit congestion
   notification (ECN), termed re-ECN, which can be deployed
   incrementally around unmodified routers. The protocol arranges an
   extended ECN field in each packet so that, as it crosses any
   interface in an internetwork, it will carry a truthful prediction of
   congestion on the remainder of its path. Then the upstream party at
   any trust boundary in the internetwork can be held responsible for
   the congestion they cause, or allow to be caused. So, networks can
   introduce straightforward accountability and policing mechanisms for
   incoming traffic from end-customers or from neighbouring network
   domains.
   The purpose of this document is to specify the re-ECN
   protocol at the IP layer and to give guidelines on any consequent
   changes required to transport protocols. It includes the changes
   required to TCP both as an example and as a specification. It also
   gives examples of mechanisms that can use the protocol to ensure data
   sources respond correctly to congestion. And it describes example
   mechanisms that ensure the dominant selfish strategy of both network
   domains and end-points will be to set the extended ECN field
   honestly.

Authors' Statement: Status (to be removed by the RFC Editor)

   This document is posted as an Internet-Draft with the intent (at
   least that of the authors) to eventually progress to standards track.

   Although the re-ECN protocol is intended to make a simple but
   far-reaching change to the Internet architecture, the most immediate
   priority for the authors is to delay any move of the ECN nonce to
   Proposed Standard status. The argument for this position is
   developed in Appendix I.

Changes from previous drafts (to be removed by the RFC Editor)

   From -00 to -01:

      Encoding of re-ECN wire protocol changed for reasons given in
      Appendix B and consequently draft substantially re-written.

      Substantial text added in sections on applications, incremental
      deployment, architectural rationale and security considerations.

   From -01 to -02:

      Explanation on informal terminology in Section 3.4 clarified.

      IPv6 wire protocol encoding added (Section 5.2).

      Text on (non-)issues with tunnels, encryption and link layer
      congestion notification added (Section 5.6 & Section 5.7).

      Section added giving evolvability arguments against encouraging
      bottleneck policing (Section 6.1.2). And text on re-ECN's
      evolvability by design added to Section 6.1.3.

      Text on inter-domain policing (Section 6.1.6) and inter-domain
      fail-safes (Section 6.1.7) added.

   From -02 to -03:

      Started guidelines for re-ECN support in DCCP and SCTP.

      Added annex on limitations of nonce mechanism.

      Minor editorial changes throughout.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Protocol Overview
     3.1.  Background and Applicability
     3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     3.3.  Re-ECN Protocol Operation
     3.4.  Informal Terminology
   4.  Transport Layers
     4.1.  TCP
       4.1.1.  RECN mode: Full re-ECN capable transport
       4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT
               Receiver
       4.1.3.  Capability Negotiation
       4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or
               after Idle Periods
       4.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     4.2.  Other Transports
       4.2.1.  General Guidelines for Adding Re-ECN to Other Transports
       4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
       4.2.3.  Guidelines for adding Re-ECN to DCCP
       4.2.4.  Guidelines for adding Re-ECN to SCTP
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  IP in IP Tunnels
     5.7.  Non-Issues
   6.  Applications
     6.1.  Policing Congestion Response
       6.1.1.  The Policing Problem
       6.1.2.  The Case Against Bottleneck Policing
       6.1.3.  Re-ECN Incentive Framework
       6.1.4.  Egress Dropper
       6.1.5.  Rate Policing
       6.1.6.  Inter-domain Policing
       6.1.7.  Inter-domain Fail-safes
       6.1.8.  Simulations
     6.2.  Other Applications
       6.2.1.  DDoS Mitigation
       6.2.2.  End-to-end QoS
       6.2.3.  Traffic Engineering
       6.2.4.  Inter-Provider Service Monitoring
     6.3.  Limitations
   7.  Incremental Deployment
     7.1.  Incremental Deployment Features
     7.2.  Incremental Deployment Incentives
   8.  Architectural Rationale
   9.  Related Work
     9.1.  Policing Rate Response to Congestion
     9.2.  Congestion Notification Integrity
     9.3.  Identifying Upstream and Downstream Congestion
   10. Security Considerations
   11. IANA Considerations
   12. Conclusions
   13. Acknowledgements
   14. Comments Solicited
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  Justification for Two Codepoints Signifying Zero Worth
                Packets
   Appendix C.  ECN Compatibility
   Appendix D.  Packet Marking During Flow Start
   Appendix E.  Example Egress Dropper Algorithm
   Appendix F.  Re-TTL
   Appendix G.  Policer Designs to ensure Congestion Responsiveness
     G.1.  Per-user Policing
     G.2.  Per-flow Rate Policing
   Appendix H.  Downstream Congestion Metering Algorithms
     H.1.  Bulk Downstream Congestion Metering Algorithm
     H.2.  Inflation Factor for Persistently Negative Flows
   Appendix I.  Argument for holding back the ECN nonce
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   This document aims:

   o  To provide a complete specification of the addition of the re-ECN
      protocol to IP and guidelines on how to add it to transport layer
      protocols, including a complete specification of re-ECN in TCP as
      an example;

   o  To show how a number of hard problems become much easier to solve
      once re-ECN is available in IP.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it. But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   Uniquely, re-ECN manages to enable solutions to these problems
   without unduly stifling innovative new ways to use the Internet.
   This was a hard balance to strike, given it could be argued that DDoS
   is an innovative way to use the Internet. The most valuable insight
   was to allow each network to choose the level of constraint it wishes
   to impose. Also re-ECN has been carefully designed so that networks
   that choose to use it conservatively can protect themselves against
   the congestion caused in their network by users on other networks
   with more liberal policies.

   For instance, some network owners want to block applications like
   voice and video unless their network is compensated for the extra
   share of bottleneck bandwidth taken. These real-time applications
   tend to be unresponsive when congestion arises. Whereas elastic
   TCP-based applications back away quickly, ending up taking a much
   smaller share of congested capacity for themselves.
   Other network owners
   want to invest in large amounts of capacity and make their gains from
   simplicity of operation and economies of scale.

   Re-ECN allows the more conservative networks to police out flows that
   have not asked to be unresponsive to congestion, not because they
   are voice or video, but simply because they don't respond to
   congestion. But it also allows other networks to choose not to
   police. Crucially, when flows from liberal networks cross into a
   conservative network, re-ECN enables the conservative network to
   apply penalties to its neighbouring networks for the congestion they
   allow to be caused. And these penalties can be applied to bulk data,
   without regard to flows.

   Then, if unresponsive applications become so dominant that some of
   the more liberal networks experience congestion collapse [RFC3714],
   they can change their minds and use re-ECN to apply tighter controls
   in order to bring congestion back under control.

   Re-ECN works by arranging that each packet arrives at each network
   element carrying a view of expected congestion on its own downstream
   path, albeit averaged over multiple packets. Most usefully,
   congestion on the remainder of the path becomes visible in the IP
   header at the first ingress. Many of the applications of re-ECN
   involve a policer at this ingress using the view of downstream
   congestion arriving in packets to police or control the packet rate.

   Importantly, the scheme is recursive: a whole network harbouring
   users causing congestion in downstream networks can be held
   responsible or policed by its downstream neighbour.

   This document is structured as follows. First an overview of the
   re-ECN protocol is given (Section 3), outlining its attributes and
   explaining conceptually how it works as a whole. The two main parts
   of the document follow, as described above.
   That is, first the protocol
   specification, divided into transport (Section 4) and network
   (Section 5) layers, then the applications it can be put to, such as
   policing DDoS, QoS and congestion control (Section 6). Although
   these applications do not require standardisation themselves, they
   are described in a fair degree of detail in order to explain how
   re-ECN can be used. Given that re-ECN proposes to use the last
   undefined bit in the IPv4 header, we felt it necessary to outline the
   potential that re-ECN could release in return for being given that
   bit.

   Deployment issues discussed throughout the document are brought
   together in Section 7, which is followed by a brief section
   explaining the somewhat subtle rationale for the design from an
   architectural perspective (Section 8). We end by describing related
   work (Section 9), listing security considerations (Section 10) and
   finally drawing conclusions (Section 12).

2.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   This document first specifies a protocol, then describes a framework
   that creates the right incentives to ensure compliance with the
   protocol. This could cause confusion because the second part of the
   document considers many cases where malicious nodes may not comply
   with the protocol. When such contingencies are described, if any of
   the above keywords are not capitalised, that is deliberate. So, for
   instance, the following two apparently contradictory sentences would
   be perfectly consistent: i) x MUST do this; ii) x may not do this.

3.  Protocol Overview

3.1.  Background and Applicability

   First we briefly recap the essentials of the ECN protocol [RFC3168].
   Two bits in the IP header (v4 or v6) are assigned to the ECN field.
   The sender clears the field to "00" (Not-ECT) if either end-point
   transport is not ECN-capable. Otherwise it indicates an ECN-capable
   transport (ECT) using either of the two code-points "10" or "01"
   (ECT(0) and ECT(1) respectively).

   ECN-capable routers probabilistically set "11" if congestion is
   experienced (CE), the marking probability increasing with the length
   of the queue at the router's egress link (typically using the RED
   algorithm [RFC2309]). However, they still drop rather than mark
   Not-ECT packets. With multiple ECN-capable routers on a path, a flow
   of packets accumulates the fraction of CE marking that each router
   adds. The combined effect of the packet marking of all the routers
   along the path signals congestion of the whole path to the receiver.
   So, for example, if one router early in a path is marking 1% of
   packets and another later in a path is marking 2%, flows that pass
   through both routers will experience approximately 3% marking (see
   Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback. But Section 9.2 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback. On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called
   `re-feedback' [Re-fb]. The word is short for either receiver-aligned,
   re-inserted or re-echoed feedback. But it actually works even when
   no feedback is available. In fact it has been carefully designed to
   work for single datagram flows.
   Indeed, it even encourages
   aggregation of single packet flows by congestion control proxies.

   Then, even if the traffic mix of the Internet were to become
   dominated by short messages, it would still be possible to control
   congestion effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval. But re-ECN can be deployed incrementally at
   the transport layer around unmodified routers using existing fields
   in IP (v4 or v6). However, it does also require the last undefined
   bit in the IPv4 header, which it uses in combination with the 2-bit
   ECN field to create four new codepoints. Nonetheless, changes to IP
   routers are RECOMMENDED in order to improve resilience against DoS
   attacks. Similarly, re-ECN works best if both the sender and
   receiver transports are re-ECN-capable, but it can work with just
   sender support. Section 7.1 summarises the incremental deployment
   strategy.

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion. Re-ECN is only concerned with enabling the ingress
   network to police that a source is complying with a congestion
   control algorithm, which is orthogonal to congestion control itself.

   Before re-ECN can be considered worthy of using up the last bit in
   the IP header, we must be sure that all our claims are robust. We
   have gradually been reducing the list of outstanding issues, but the
   few that still remain are listed in Section 6.3. We expect new
   attacks may still be found, but we offer the re-ECN protocol on the
   basis that it is built on fairly solid theoretical foundations and,
   so far, it has proved possible to keep it relatively robust.

3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7.1). This
   specification defines a new re-ECN extension (RE) flag. We will
   defer the definition of the actual position of the RE flag in the
   IPv4 & v6 headers until Section 5. Until then it will suffice to use
   an abstraction of the IPv4 and v6 wire protocols by just calling it
   the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and remain unchanged along the path, although it can be read by
   network elements that understand the re-ECN protocol. It is feasible
   that a network element MAY change the setting of the RE flag, perhaps
   acting as a proxy for an end-point, but such a protocol would have to
   be defined in another specification (e.g. [Re-PCN]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) making eight
   codepoints. We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   +-------+------------+------+--------------+------------------------+
   | ECN   | RFC3168    | RE   | Extended ECN | Re-ECN meaning         |
   | field | codepoint  | flag | codepoint    |                        |
   +-------+------------+------+--------------+------------------------+
   | 00    | Not-ECT    | 0    | Not-RECT     | Not re-ECN-capable     |
   |       |            |      |              | transport              |
   | 00    | Not-ECT    | 1    | FNE          | Feedback not           |
   |       |            |      |              | established            |
   | 01    | ECT(1)     | 0    | Re-Echo      | Re-echoed congestion   |
   |       |            |      |              | and RECT               |
   | 01    | ECT(1)     | 1    | RECT         | Re-ECN capable         |
   |       |            |      |              | transport              |
   | 10    | ECT(0)     | 0    | ---          | Legacy ECN use only    |
   |       |            |      |              |                        |
   | 10    | ECT(0)     | 1    | --CU--       | Currently unused       |
   |       |            |      |              |                        |
   | 11    | CE         | 0    | CE(0)        | Re-Echo canceled by    |
   |       |            |      |              | congestion experienced |
   | 11    | CE         | 1    | CE(-1)       | Congestion experienced |
   +-------+------------+------+--------------+------------------------+

                     Table 1: Extended ECN Codepoints

3.3.  Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections. Other transports will be discussed later.

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol. Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sets the feedback not established (FNE)
   codepoint. The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once a flow is established, a re-ECN sender always
   initialises the ECN field to ECT(1). And it usually sets the RE flag
   to "1".
   Whenever a router re-marks a packet to CE, the receiver
   feeds back this event to the sender. On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends.

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7.1). To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0". So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

      ^
      |
      |     RE blanking fraction
   3% |--------------------------------+=====
      |                                |
   2% |                                |
      |    CE marking fraction         |
   1% |        +-----------------------+
      |        |
   0% +---------------------------------------->
      ^        0           ^           i     ^   resource index
      |        ^           |           ^     |
      0        |           1           |     2   observation points
             1.00%                   2.00%       marking fraction

                 Figure 1: A 2-Router Example (Imprecise)

   Figure 1 uses the two router example introduced earlier to illustrate
   why re-ECN allows routers to measure downstream congestion. The
   horizontal axis represents the index of each congestible resource
   (typically queues) along a path through the Internet. There may be
   many routers on the path, but we assume only two are currently
   congested (those with resource index 0 and i). The two superimposed
   plots show the fraction of each extended ECN codepoint in a flow
   observed along this path. Given about 3% of packets reaching the
   destination are marked CE, in response to feedback the sender will
   blank the RE flag in about 3% of packets it sends. Then approximate
   downstream congestion can be measured at the observation points shown
   along the path by subtracting the CE marking fraction from the RE
   blanking fraction, as shown in the table below (Appendix A derives
   these approximations from a precise analysis).

           +-------------------+------------------------------+
           | Observation point | Approx downstream congestion |
           +-------------------+------------------------------+
           | 0                 | 3% - 0% = 3%                 |
           | 1                 | 3% - 1% = 2%                 |
           | 2                 | 3% - 3% = 0%                 |
           +-------------------+------------------------------+

   Table 2: Downstream Congestion Measured at Example Observation Points

   All along the path, whole-path congestion remains unchanged so it can
   be used as a reference against which to compare upstream congestion.
   The difference predicts downstream congestion for the rest of the
   path. Therefore, measuring the fractions of each codepoint at any
   point in the Internet will reveal upstream, downstream and whole path
   congestion.

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration. To be absolutely clear, these
   fractions are averages that would result from the behaviour of a TCP
   protocol handler mechanically blanking outgoing packets in direct
   response to incoming feedback; we are not saying any protocol
   handler works with these average fractions directly.

3.4.  Informal Terminology

   In the rest of this memo we will loosely talk of positive or negative
   flows, meaning flows where the moving average of the downstream
   congestion metric is persistently positive or negative. The notion
   of a negative metric arises because it is derived by subtracting one
   metric from another. Of course actual downstream congestion cannot
   be negative, only the metric can (whether due to time lags or
   deliberate malice).

   Just as we will loosely talk of positive and negative flows, we will
   also talk of positive or negative packets, meaning packets that
   contribute positively or negatively to the downstream congestion
   metric.
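
   As a rough illustration of the subtraction used for Table 2, and of
   how the metric can turn negative, the following Python sketch
   computes the downstream congestion metric from octet counts seen at
   an observation point. It is not part of the protocol specification:
   the function name downstream_congestion and the counts per 10,000
   octets are hypothetical, chosen only to reproduce the approximate
   figures of the two-router example.

```python
def downstream_congestion(total_octets, blanked_octets, ce_octets):
    """Approximate downstream congestion at one observation point.

    Hypothetical helper: returns the RE blanking fraction minus the
    CE marking fraction, both measured in octets.
    """
    if total_octets <= 0:
        raise ValueError("total_octets must be positive")
    return (blanked_octets - ce_octets) / total_octets


# Illustrative counts per 10,000 octets for the two-router example:
# about 3% of octets have the RE flag blanked all along the path, while
# the CE-marked share grows from 0% to about 1% to about 3%.
TOTAL = 10_000
BLANKED = 300
CE_AT_POINT = {0: 0, 1: 100, 2: 300}

table2 = {
    point: downstream_congestion(TOTAL, BLANKED, ce)
    for point, ce in CE_AT_POINT.items()
}

for point, metric in table2.items():
    print(f"observation point {point}: {metric:.0%}")

# A flow that blanks nothing but still collects CE marks yields a
# negative metric, which is what the text calls a negative flow.
understated = downstream_congestion(1_000, 0, 10)
```

   The sketch uses the simple additive approximation of the example
   rather than the precise treatment that Appendix A is said to give.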

   Therefore we will talk of packets having `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   the downstream congestion metric.

   Figure 2 shows the main state transitions of the system once a flow
   is established, showing the worth of packets in each state. When the
   network congestion marks a packet it decrements its worth (moving
   from the left of the main square to the right). When the sender
   blanks the RE flag in order to re-echo congestion it increments the
   worth of a packet (moving from the bottom of the main square to the
   top).

   Sender state         Sent   Worth       Received    Worth
                        packet             packet
          +----------------------------------------------------+
          |                                                    ^
          V                                                    |
   Congestion echoed -->Re-Echo +1 --+---> CE(0)          0  --+
                       (positive)    |   (canceled)            |
                                     V network                 |
                                     | congestion              |
                                     |                         |
   Flow established  --> RECT    0 ----+-> CE(-1)        -1  --+
          ^            (neutral)     | | (negative)
          |                          | |
          |                 no       V V
          |             congestion   | |
          +-----------<--------------+-+

        Figure 2: Re-ECN System State Diagram (bootstrap not shown)

   The idea is that every time the network decrements the worth of a
   packet, the sender increments the worth of a later packet. Then,
   over time, as many positive octets should arrive at the receiver as
   negative. Note we have said octets not packets, so if packets are of
   different sizes, the worth should be incremented on enough octets to
   balance the octets in negative packets arriving at the receiver. It
   is this balance that will allow the network to hold the sender
   accountable for the congestion it causes, as we shall see. The
   informal outline below uses TCP as an example transport, but the idea
   would be broadly similar for any transport that adapts its rate to
   congestion.

   We will start with the sender in `flow established' state.
   Normally, as acknowledgements of earlier packets arrive that do not
   feed back any congestion, the congestion window can be opened, so the
   sender goes round the smaller sub-loop, sending RECT packets (worth
   0) and returning to the flow established state to send another one.
   If a router congestion marks one of the packets, it decrements the
   packet's worth.  The sender will have been continuing to traverse
   round the smaller feedback loop every time acknowledgements arrive.
   But when congestion feedback returns from this packet that was marked
   with -1 worth (the largest loop in the figure) the sender jumps to
   the congestion echoed state in order to re-echo the congestion,
   incrementing the worth of the next packet to +1 by blanking its RE
   flag.  The sender then returns to the flow established state and
   continues round the smaller loop, sending packets worth 0.  Note that
   the size of the loops is just an artefact of the figure; it is not
   meant to imply that one loop is slower than the other - they are both
   the same end to end feedback loop.

   If a packet carrying re-echoed congestion happens to also be
   congestion marked, the +1 worth added by the sender will be cancelled
   out by the -1 network congestion marking.  Although the two worth
   values correctly cancel out, neither the congestion marking nor the
   re-echoed congestion are lost, because the RE bit and the ECN field
   are orthogonal.  So, whenever this happens, the receiver will
   correctly detect and re-echo the new congestion event as well (the
   top sub-loop).  When we need to distinguish, we will sometimes call a
   packet marked RECT 'neutral' (0 worth), while we will call the CE(0)
   marking 'canceled' (also 0 worth).  If a re-echoed packet isn't
   unlucky enough to be further congestion marked, the sender will
   return to the flow established state and continue to send RECT
   packets (worth 0).
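
   The balance between positive and negative octets described above can
   be sketched in a few lines of Python.  This is only an illustration:
   the function name and the representation of a packet as a (worth,
   size in octets) pair are assumptions, and the worth values are those
   observed on arrival at the receiver.

```python
def octet_balance(packets):
    """Sum worth * size over an iterable of (worth, size_in_octets) pairs.

    A result of zero means as many positive octets arrived as negative
    ones; a persistently negative result is what the text calls a
    negative flow.
    """
    return sum(worth * size for worth, size in packets)


# One packet is congestion marked (-1) and a later one re-echoes it (+1).
balanced_flow = [(0, 1460), (-1, 1460), (0, 1460), (+1, 1460), (0, 1460)]

# The same mark, but the sender never re-echoes it.
negative_flow = [(0, 1460), (-1, 1460), (0, 1460), (0, 1460)]

# Unequal sizes: a 1460-octet mark needs 1460 re-echoed octets in total.
mixed_sizes = [(-1, 1460), (+1, 730), (+1, 730)]
```

   Because the sum is weighted by size, the third example balances only
   once two smaller re-echoed packets have been sent.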

   The table below specifies unambiguously the worth of each extended
   ECN codepoint.  Note the order is different from the previous table
   to better show how the worth increments and decrements.  The FNE
   codepoint is an exception.  It is used in the flow bootstrap process
   (explained later) and has the same positive (+1) worth as a packet
   with the Re-Echo codepoint.

   +--------+------+----------------+-------+--------------------------+
   |  ECN   |  RE  | Extended ECN   | Worth | Re-ECN meaning           |
   | field  | bit  | codepoint      |       |                          |
   +--------+------+----------------+-------+--------------------------+
   |   00   |  0   | Not-RECT       |  ...  | Not re-ECN-capable       |
   |        |      |                |       | transport                |
   |   01   |  0   | Re-Echo        |  +1   | Re-echoed congestion and |
   |        |      |                |       | RECT                     |
   |   10   |  0   | ---            |  ...  | Legacy ECN use only      |
   |   11   |  0   | CE(0)          |   0   | Re-Echo canceled by      |
   |        |      |                |       | congestion experienced   |
   |   00   |  1   | FNE            |  +1   | Feedback not established |
   |   01   |  1   | RECT           |   0   | Re-ECN capable transport |
   |   10   |  1   | --CU--         |  ...  | Currently unused         |
   |        |      |                |       |                          |
   |   11   |  1   | CE(-1)         |  -1   | Congestion experienced   |
   +--------+------+----------------+-------+--------------------------+

                Table 3: 'Worth' of Extended ECN Codepoints

4.  Transport Layers

4.1.  TCP

   Re-ECN capability at the sender is essential.  At the receiver it is
   optional, as long as the receiver has a basic (`vanilla flavour')
   RFC3168-compliant ECN-capable transport (ECT) [RFC3168].  Given
   re-ECN is not the first attempt to define the semantics of the ECN
   field, we give a table below summarising what happens for various
   combinations of capabilities of the sender S and receiver R, as
   indicated in the first four columns below.  The last column gives the
   mode a half-connection should be in after the first two segments of
   TCP's three-way handshake.

   +--------+--------------+------------+---------+--------------------+
   | Re-ECT |  ECT-Nonce   |    ECT     | Not-ECT |        S-R         |
   |        |  (RFC3540)   | (RFC3168)  |         |  Half-connection   |
   |        |              |            |         |        Mode        |
   +--------+--------------+------------+---------+--------------------+
   |   SR   |              |            |         |        RECN        |
   |   S    |      R       |            |         |      RECN-Co       |
   |   S    |              |     R      |         |      RECN-Co       |
   |   S    |              |            |    R    |      Not-ECT       |
   +--------+--------------+------------+---------+--------------------+

      Table 4: Modes of TCP Half-connection for Combinations of ECN
                 Capabilities of Sender S and Receiver R

   We will describe what happens in each mode, then describe how they
   are negotiated.  The abbreviations for the modes in the above table
   mean:

   RECN:  Full re-ECN capable transport

   RECN-Co:  Re-ECN sender in compatibility mode with a
      vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
      receiver.  Implementation of this mode is OPTIONAL.

   Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when
      at least one of the transports does not understand even basic ECN
      marking.

   Note that we use the term Re-ECT for a host transport that is re-ECN-
   capable but RECN for the modes of the half connections between hosts
   when they are both Re-ECT.  If a host transport is Re-ECT, this fact
   alone does NOT imply either of its half connections will necessarily
   be in RECN mode, at least not until it has confirmed that the other
   host is Re-ECT.

4.1.1.  RECN mode: Full re-ECN capable transport

   In full RECN mode, for each half connection, the sender and the
   receiver each maintain an unsigned integer counter we will call ECC
   (echo congestion counter).  The receiver maintains a count, modulo 8,
   of how many times a CE marked packet has arrived during the
   half-connection.
   Once a RECN connection is established, the three TCP option flags
   (ECE, CWR & NS) used for ECN-related functions in previous versions
   of ECN are used as a 3-bit field for the receiver to repeatedly tell
   the sender the current value of ECC whenever it sends a TCP ACK.  We
   will call this the echo congestion increment (ECI) field.  This
   overloaded use of these 3 option flags as one 3-bit ECI field is
   shown in Figure 4.  The actual definition of the TCP header,
   including the addition of support for the ECN nonce, is shown for
   comparison in Figure 3.  This specification does not redefine the
   names of these three TCP option flags; it merely overloads them with
   another definition once a flow is established.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the
                                TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 4: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header, overloading the current definitions above for established
                                RECN flows.

   Receiver Action in RECN Mode

      Every time a CE marked packet arrives at a receiver in RECN mode,
      the receiver transport increments its local value of ECC modulo 8
      and MUST echo its value to the sender in the ECI field of the next
      ACK.
      It MUST repeat the same value of ECI in every subsequent ACK until
      the next CE event, when it increments ECI again.

      The increment of the local ECC values is modulo 8 so the field
      value simply wraps round back to zero when it overflows.  The
      least significant bit is to the right (labelled bit 9).

      A receiver in RECN mode MAY delay the echo of a CE to the next
      delayed-ACK, which would be necessary if ACK-withholding were
      implemented.

   Sender Action in RECN Mode

      On the arrival of every ACK, the sender compares the ECI field
      with its own ECC value, then replaces its local value with that
      from the ACK.  The difference D is assumed to be the number of CE
      marked packets that arrived at the receiver since it sent the
      previously received ACK (but see below for the sender's safety
      strategy).  Whenever the ECI field increments by D (or D drops are
      detected), the sender MUST clear the RE flag to "0" in the IP
      header of the next D data packets it sends, effectively re-echoing
      each single increment of ECI.  Otherwise the data sender MUST send
      all data packets with RE set to "1".

      As a general rule, once a flow is established, as well as setting
      or clearing the RE flag as above, a data sender in RECN mode MUST
      always set the ECN field to ECT(1).  However, the settings of the
      extended ECN field during flow start are defined in Section 4.1.4.

      As we have already emphasised, the re-ECN protocol makes no
      changes to, and has no effect on, the TCP congestion control
      algorithm.  So, each increment of ECI (or detection of a drop)
      also triggers the standard TCP congestion response, but with no
      more than one congestion response per round trip, as usual.

      A TCP sender also acts as the receiver for the other
      half-connection.  The host will maintain two ECC values S.ECC and
      R.ECC as sender and receiver respectively.
      Every TCP header sent by a host in RECN mode will also repeat the
      prevailing value of R.ECC in its ECI field.  If a sender in RECN
      mode has to retransmit a packet due to a suspected loss, the
      re-transmitted packet MUST carry the latest prevailing value of
      R.ECC when it is re-transmitted, which will not necessarily be the
      one it carried originally.

4.1.1.1.  Safety against Long Pure ACK Loss Sequences

   The ECI method was chosen for echoing congestion marking because a
   re-ECN sender needs to know about every CE mark arriving at the
   receiver, not just whether at least one arrives within a round trip
   time (which is all the ECE/CWR mechanism supported).  And, as pure
   ACKs are not protected by TCP reliable delivery, we repeat the same
   ECI value in every ACK until it changes.  Even if many ACKs in a row
   are lost, as soon as one gets through, the ECI field it repeats from
   previous ACKs that didn't get through will update the sender on how
   many CE marks arrived since the last ACK got through.

   The sender will only lose a record of the arrival of a CE mark if all
   the ACKs are lost (and all of them were pure ACKs) for a stream of
   data long enough to contain 8 or more CE marks.  So, if the marking
   fraction was p, at least 8/p pure ACKs would have to be lost.  For
   example, if p was 5%, a sequence of 160 pure ACKs would all have to
   be lost.  To protect against such extremely unlikely events, if a
   re-ECN sender detects a sequence of pure ACKs has been lost it SHOULD
   assume the ECI field wrapped as many times as possible within the
   sequence.
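
   The conservative assumption can be sketched in Python as follows.
   The sketch applies the rule D' = L - ((L-D) mod 8) that this section
   specifies and reproduces the 8/p estimate above.  The function and
   parameter names are illustrative assumptions, and the sketch assumes
   L is at least as large as the apparent increase D.

```python
ECI_MODULUS = 8  # ECI is a 3-bit field, so it wraps modulo 8


def assumed_eci_increment(segments_acked, apparent_increase):
    """Return D' = L - ((L - D) mod 8).

    segments_acked    -- L, segments newly acknowledged by this ACK
    apparent_increase -- D, apparent increase of ECI (0..7)

    Assumes L >= D; the result is the largest value not exceeding L
    that is congruent to D modulo 8.
    """
    L, D = segments_acked, apparent_increase
    return L - ((L - D) % ECI_MODULUS)


def min_lost_pure_acks(marking_percent):
    """Pure ACKs that must all be lost before a CE mark goes unrecorded (8/p)."""
    return ECI_MODULUS * 100 / marking_percent
```

   With L = 9 and D = 2 the function returns 2, and with L = 11 and
   D = 2 it returns 10, matching the two examples given in this section.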

   Specifically, if a re-ECN sender receives an ACK with an
   acknowledgement number that acknowledges L segments since the
   previous ACK but with a sequence number unchanged from the previously
   received ACK, it SHOULD conservatively assume that the ECI field
   incremented by D' = L - ((L-D) mod 8), where D is the apparent
   increase in the ECI field.  For example if the ACK arriving after 9
   pure ACK losses apparently increased ECI by 2, the assumed increment
   of ECI would still be 2.  But if ECI apparently increased by 2 after
   11 pure ACK losses, ECI should be assumed to have increased by 10.

   A re-ECN sender MAY implement a heuristic algorithm to predict beyond
   reasonable doubt that the ECI field probably did not wrap within a
   sequence of lost pure ACKs.  But such an algorithm is NOT REQUIRED.
   Such an algorithm MUST NOT be used unless it is proven to work even
   in the presence of correlation between high ACK loss rate on the back
   channel and high CE marking rate on the forward channel.

   Whatever assumption a re-ECN sender makes about potentially lost CE
   marks, both its congestion control and its re-echoing behaviour
   SHOULD be consistent with the assumption it makes.

4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver

   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
   differently to that of vanilla ECN.  In other words, the receiver
   sets the ECE flag repeatedly in the TCP header and the sender
   responds by setting the CWR flag.  Although RECN-Co mode is used when
   the receiver has not implemented the re-ECN protocol, the sender can
   infer enough from its vanilla ECN feedback to set or clear the RE
   flag reasonably well.
   Specifically, every time the receiver toggles the ECE field from "0"
   to "1" (or a loss is detected), as well as setting CWR in the TCP
   flags, the re-ECN sender MUST blank the RE flag of the next packet to
   "0" as it would do in full RECN mode.  Otherwise, the data sender
   SHOULD send all other packets with RE set to "1".  Once a flow is
   established, a re-ECN data sender in RECN-Co mode MUST always set the
   ECN field to ECT(1).

   If a CE marked packet arrives at the receiver within a round trip
   time of a previous mark, the receiver will still be echoing ECE for
   the last CE mark.  Therefore, such a mark will be missed by the
   sender.  Of course, this isn't of concern for congestion control, but
   it does mean that very occasionally the RE blanking fraction will be
   understated.  Therefore flows in RECN-Co mode may occasionally be
   mistaken for very lightly cheating flows and consequently might
   suffer a small number of packet drops through an egress dropper
   (Section 6.1.4).  We expect re-ECN would be deployed for some time
   before policers and droppers start to enforce it.  So, given there is
   not much ECN deployment yet anyway, this minor problem may affect
   only a very small proportion of flows, reducing to nothing over the
   years as vanilla ECN hosts upgrade.  The use of RECN-Co mode would
   need to be reviewed in the light of experience at the time of re-ECN
   deployment.

   RECN-Co mode is OPTIONAL.  Re-ECN implementers who want to keep their
   code simple MAY choose not to implement this mode.  If they do not,
   a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence
   of an ECN-capable receiver.  It MAY choose to fall back to the
   ECT-Nonce mode, but if re-ECN implementers don't want to be bothered
   with RECN-Co mode, they probably won't want to add an ECT-Nonce mode
   either.

4.1.2.1.  Re-ECN support for the ECN Nonce

   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
   Nonce [RFC3540].  This means that the sending code of a re-ECN
   implementation will never need to include ECN Nonce support.  Re-ECN
   is intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, and re-ECN only requires support
   from the sender, therefore it is preferable to specifically rule out
   the need for dual sender implementations.  As a consequence, a re-ECN
   capable sender will never set ECT(0), so it will be easier for
   network elements to discriminate re-ECN traffic flows from other ECN
   traffic, which will always contain some ECT(0) packets.

   However, a re-ECN implementation MAY include receiving code that
   complies with the ECN Nonce protocol when interacting with a sender
   that supports the ECN nonce (rather than re-ECN), but this support is
   NOT REQUIRED.

   RFC3540 allows an ECN nonce sender to choose whether to sanction a
   receiver that does not ever set the nonce sum.  Given re-ECN is
   intended to provide wider protection than the ECN nonce against
   congestion control misbehaviour, implementers of re-ECN receivers MAY
   choose not to implement backwards compatibility with the ECN nonce
   capability.  This may be because they deem that the risk of sanctions
   is low, perhaps because significant deployment of the ECN nonce seems
   unlikely at implementation time.

4.1.3.  Capability Negotiation

   During the TCP hand-shake at the start of a connection, an originator
   of the connection (host A) with a re-ECN-capable transport MUST
   indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and
   ECE=1 in the initial SYN.

   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
   CWR=1 and ECE=0.
   The responding host MUST NOT set this combination of flags unless the
   preceding SYN has already indicated Re-ECT support as above.  A
   Re-ECT server (B) can use either setting of the NS flag combined with
   this type of SYN ACK in response to a SYN from a Re-ECT client (A).
   Normally a Re-ECT server will reply to a Re-ECT client with NS=0, but
   in the special circumstance below it can return a SYN ACK with NS=1.

   If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT
   server B MUST increment its local value of ECC.  But B cannot reflect
   the value of ECC in the SYN ACK, because it is still using the 3 bits
   to negotiate connection capabilities.  So, server B MUST set the
   alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0.

   These handshakes are summarised in Table 5 below, with X meaning
   `don't care'.  The handshakes used for the other flavours of ECN are
   also shown for comparison.  To compress the width of the table, the
   headings of the first four columns have been severely abbreviated, as
   follows:

   R:  *R*e-ECT

   N:  ECT-*N*once (RFC3540)

   E:  *E*CT (RFC3168)

   I:  Not-ECT (*I*mplicit congestion notification).

   These correspond with the same headings used in Table 4.  Indeed, the
   resulting modes in the last two columns of the table below are a more
   comprehensive way of saying the same thing as Table 4.
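
   The handshake rules of this section, as seen by a Re-ECT client A and
   a Re-ECT server B, can be sketched in Python as below.  This is a
   simplified illustration rather than a full implementation: the
   function names and the (NS, CWR, ECE) tuple representation are
   assumptions, only the rows of Table 5 in which A is Re-ECT are
   covered, and the reflected-SYN case follows the RFC3168 rule that
   this section recalls.

```python
RE_ECT_SYN = (1, 1, 1)  # (NS, CWR, ECE) set by a Re-ECT originator


def re_ect_server_syn_ack(syn_flags, syn_marked_ce=False):
    """Flags a Re-ECT server B returns in its SYN ACK to a Re-ECT client.

    NS=1 is used only when the SYN itself arrived marked CE(-1), because
    the 3 bits cannot yet carry the ECC value.
    """
    if syn_flags != RE_ECT_SYN:
        raise ValueError("this sketch only covers a SYN from a Re-ECT client")
    return (1 if syn_marked_ce else 0, 1, 0)


def client_a_to_b_mode(syn_ack_flags):
    """Mode of the A-B half-connection chosen by Re-ECT client A."""
    ns, cwr, ece = syn_ack_flags
    if syn_ack_flags == RE_ECT_SYN:
        return "Not-ECT"   # broken legacy host reflected the SYN flags
    if cwr == 1 and ece == 0:
        return "RECN"      # B is Re-ECT (NS is "don't care")
    if cwr == 0 and ece == 1:
        return "RECN-Co"   # B is ECT or ECT-Nonce
    return "Not-ECT"       # B is not ECN-capable
```

   For example, a SYN that arrives marked CE(-1) yields the alternative
   SYN ACK flags NS=1, CWR=1, ECE=0, which the client still treats as
   RECN mode.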

   +----+---+---+---+------------+-------------+-----------+-----------+
   | R  | N | E | I |  SYN A-B   | SYN ACK B-A | A-B Mode  | B-A Mode  |
   +----+---+---+---+------------+-------------+-----------+-----------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
   | AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
   | A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
   | A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
   | A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
   | B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
   | B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
   | B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
   +----+---+---+---+------------+-------------+-----------+-----------+

       Table 5: TCP Capability Negotiation between Originator (A) and
                               Responder (B)

   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
   its two half-connections into the modes given in Table 5.  As soon as
   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
   half-connections into the modes given in Table 5.  The
   half-connections will remain in these modes for the rest of the
   connection, including for the third segment of TCP's three-way
   hand-shake (the ACK).

   {ToDo: Consider SYNs within a connection.}

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken legacy implementation that
   behaves this way), RFC3168 specifies that the whole connection MUST
   revert to Not-ECT.

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   MUST NOT be interpreted as the 3-bit ECI value, which is only set as
   a copy of the local ECC value in non-SYN packets.

4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or after
        Idle Periods

   If the originator (A) of a TCP connection supports re-ECN it MUST set
   the extended ECN (EECN) field in the IP header of the initial SYN
   packet to the feedback not established (FNE) codepoint.

   FNE is a new extended ECN codepoint defined by this specification
   (Section 3.2).  The feedback not established (FNE) codepoint is used
   when the transport does not have the benefit of ECN feedback so it
   cannot decide whether to set or clear the RE flag.

   If after receiving a SYN the server B has set its sending
   half-connection into RECN mode or RECN-Co mode, it MUST set the
   extended ECN field in the IP header of its SYN ACK to the feedback
   not established (FNE) codepoint.  Note the careful wording here,
   which means that Re-ECT server B MUST set FNE on a SYN ACK whether it
   is responding to a SYN from a Re-ECT client or from a client that is
   merely ECN-capable.

   The original ECN specification [RFC3168] required SYNs and SYN ACKs
   to use the Not-ECT codepoint of the ECN field.  The aim was to
   prevent well-known DoS attacks such as SYN flooding being able to
   gain from the advantage that ECN capability afforded over drop at
   ECN-capable routers.

   For a SYN ACK, Kuzmanovic [I-D.ietf-tsvwg-ecnsyn] has shown that this
   caution was unnecessary, and proposes to allow a SYN ACK to be
   ECN-capable to improve performance.  We have gone further by
   proposing to make the initial SYN ECN-capable too.  By stipulating
   the FNE codepoint for the initial SYN, we comply with RFC3168 in word
   but not in spirit, because we have indeed set the ECN field to
   Not-ECT, but we have extended the ECN field with another bit.  And it
   will be seen (Section 5.3) that we have defined one setting of that
   bit to mean an ECN-capable transport.
   Therefore, by proposing that the FNE codepoint MUST be used on the
   initial SYN of a connection, we have (deliberately) made the initial
   SYN ECN-capable.  Section 5.4 justifies deciding to make the initial
   SYN ECN-capable.

   Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will
   have already been set on the initial SYN and possibly the SYN ACK as
   above.  But each re-ECN sender will have to set FNE cautiously on a
   few data packets as well, given a number of packets will usually have
   to be sent before sufficient congestion feedback is received.  The
   behaviour will be different depending on the mode of the
   half-connection:

   RECN mode:  Given the constraints on TCP's initial window [RFC3390]
      and its exponential window increase during slow start
      phase [RFC2581], it turns out that the sender SHOULD set FNE on
      the first and third data packets in its flow, assuming equal sized
      data packets once a flow is established.  Appendix D presents the
      calculation that led to this conclusion.  Below, after running
      through the start of an example TCP session, we give the intuition
      learned from that calculation.

   RECN-Co mode:  A re-ECT sender that switches into re-ECN
      compatibility mode or into Not-ECT mode (because it has detected
      the corresponding host is not re-ECN capable) MUST limit its
      initial window to 1 segment.  The reasoning behind this constraint
      is given in Section 5.4.  Having set this initial window, a re-ECN
      sender in RECN-Co mode SHOULD set FNE on the first and third data
      packets in a flow, as for RECN mode.

   +----+------+----------------+-------+-------+---------------+------+
   |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
   +----+------+----------------+-------+-------+---------------+------+
   |    | Byte |  SEQ  ACK CTL  | EECN  | EECN  |  SEQ  ACK CTL | Byte |
   | -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
   | 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
   |    |      |  CWR,ECE,NS    |       |       |               |      |
   | 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
   |    |      |                |       |       | SYN,ACK,CWR   |      |
   | 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
   | 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
   | 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
   | 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
   | 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
   | 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
   | 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
   | 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
   | 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
   |    |      | ...            |       |       |               |      |
   +----+------+----------------+-------+-------+---------------+------+

                      Table 6: TCP Session Example #1

   Table 6 shows an example TCP session, where the server B sets FNE on
   its first and third data packets (lines 5 & 7) as well as on the
   initial SYN ACK as previously described.  The left hand half of the
   table shows the relevant settings of headers sent by client A in
   three layers: the TCP payload size; TCP settings; then IP settings.
   The right hand half gives equivalent columns for server B.  The only
   TCP settings shown are the sequence number (SEQ), acknowledgement
   number (ACK) and the relevant control (CTL) flags that A sets in the
   TCP header.  The IP columns show the setting of the extended ECN
   (EECN) field.

   Also shown on the receiving side of the table is the value of the
   receiver's echo congestion counter (R.ECC) after processing the
   incoming EECN header.  Note that, once a host sets a half-connection
   into RECN mode, it MUST initialise its local value of ECC to zero.

   The intuition that Appendix D gives for why a sender should set FNE
   on the first and third data packets is as follows.  At line 13, a
   packet sent by B is shown with an '*', which means it has been
   congestion marked by an intermediate router from RECT to CE(-1).  On
   receiving this CE marked packet, client A increments its ECC counter
   to 1 as shown.  This was the 7th data packet B sent, but before
   feedback about this event returns to B, it might well have sent many
   more packets.  Indeed, during exponential slow start, about as many
   packets will be in flight (unacknowledged) as have been acknowledged.
   So, when the feedback from the congestion event on B's 7th segment
   returns, B will have sent about 7 further packets that will still be
   in flight.  At that stage, B's best estimate of the network's packet
   marking fraction will be 1/7.  So, as B will have sent about 14
   packets, it should have already marked 2 of them as FNE in order to
   have marked 1/7; hence the need to have set the first and third data
   packets to FNE.

   Client A's behaviour in Table 6 also shows FNE being set on the first
   SYN and the first data packet (lines 1 & 4), but in this case it
   sends no more data packets, so of course, it cannot, and does not
   need to, set FNE again.  Note that in the A-B direction there is no
   need to set FNE on the third part of the three-way hand-shake (line
   3---the ACK).
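
   The arithmetic behind this intuition can be checked with the short
   Python sketch below.  It only restates the approximation used in the
   text (during slow start about as many packets are in flight as have
   been acknowledged); it is not the calculation of Appendix D, and the
   function name is an assumption.

```python
from fractions import Fraction


def fne_packets_needed(first_marked_packet):
    """FNE packets needed by the time feedback on the first mark returns.

    first_marked_packet -- index (counting from 1) of the first data
                           packet that was congestion marked
    """
    in_flight = first_marked_packet            # about as many again in flight
    sent = first_marked_packet + in_flight     # total sent when feedback returns
    marking_estimate = Fraction(1, first_marked_packet)
    return sent * marking_estimate


# Example from the text: the 7th data packet is marked, about 14 packets
# have been sent, and the estimated marking fraction is 1/7.
needed_for_7th = fne_packets_needed(7)
```

   Under this approximation the answer does not depend on which packet
   is marked first, which is consistent with a fixed recommendation to
   set FNE on two early data packets.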

   Note that in this section we have used the word SHOULD rather than
   MUST when specifying how to set FNE on data segments before positive
   congestion feedback arrives (but note that the word MUST was used for
   FNE on the SYN and SYN ACK).  FNE is only RECOMMENDED for the first
   and third data segments to entertain the possibility that the TCP
   transport has the benefit of other knowledge of the path, which it
   re-uses from one flow for the benefit of a newly starting flow.  For
   instance, one flow can re-use knowledge of other flows between the
   same hosts if using a Congestion Manager [RFC3124] or when a proxy
   host aggregates congestion information for large numbers of flows.

   After an idle period of more than 1 second, a re-ECN sender transport
   MUST set the EECN field of the packet that resumes the connection to
   FNE.  Note that this next packet may be sent a very long time later;
   a packet does NOT have to be sent after 1 second of idling.  In order
   that the design of network policers can be deterministic, this
   specification deliberately puts an absolute lower limit on how long a
   connection can be idle before the packet that resumes the connection
   must be set to FNE, rather than relating it to the connection round
   trip time.  We use the lower bound of the retransmission timeout
   (RTO) [RFC2988], which is commonly used as the idle period before TCP
   must reduce to the restart window [RFC2581].  Note our specification
   of re-ECN's idle period is NOT intended to change the idle period for
   TCP's restart, nor indeed for any other purposes.

   {ToDo: Describe how the sender falls back to legacy modes if packets
   don't appear to be getting through (to work round firewalls
   discarding packets they consider unusual).}

4.1.5.  Pure ACKs, Retransmissions, Window Probes and Partial ACKs

   A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
   to Not-ECT in pure ACKs, retransmissions and window probes, as
   specified in [RFC3168].  Our eventual goal is for all packets to be
   sent with re-ECN enabled, and we believe the semantics of the ECI
   field go a long way towards being able to achieve this.  However, we
   have not completed a full security analysis for these cases,
   therefore, currently we merely re-state current practice.

   We must also reconcile the facts that congestion marking is applied
   to packets but acknowledgements cover octet ranges and acknowledged
   octet boundaries need not match the transmitted boundaries.  The
   general principle we work to is to remain compatible with TCP's
   congestion control which is driven by congestion events at packet
   granularity while at the same time aiming to blank the RE flag on at
   least as many octets in a flow as have been marked CE.

   Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
   times as CE marked packets have been received.  And that value MUST
   be echoed to the sender in the first available ACK using the ECI
   field.  This ensures the TCP sender's congestion control receives
   timely feedback on congestion events at the same packet granularity
   that they were generated on congested routers.

   Then, a re-ECN sender stores the difference D between its own ECC
   value and the incoming ECI field by incrementing a counter R.  Then,
   R is decremented by 1 for each subsequent packet that is sent with
   the RE flag blanked, until R is no longer positive.  Using this
   technique, whenever a re-ECN transport sends a not re-ECN capable
   (NRECN) packet (e.g. a retransmission), the remaining packets
   required to have the RE flag blanked will be automatically carried
   over to subsequent packets, through the variable R.
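
   The carry-over through the variable R can be sketched in Python as
   follows.  This is a minimal illustration under stated assumptions:
   the class and method names are hypothetical, the sender's safety
   strategy of Section 4.1.1.1 is omitted, and packets are counted
   rather than octets.

```python
class ReEchoSender:
    """Tracks the sender's ECC copy and the re-echo counter R."""

    def __init__(self):
        self.ecc = 0  # sender's local copy of the receiver's ECC
        self.r = 0    # packets still owed a blanked RE flag

    def on_ack(self, eci):
        """Add the apparent ECI increase D (modulo 8) to R."""
        d = (eci - self.ecc) % 8
        self.ecc = eci
        self.r += d
        return d

    def next_codepoint(self, re_ecn_capable=True):
        """Extended ECN codepoint for the next packet sent.

        Pure ACKs, retransmissions and window probes are sent Not-RECT
        and do not reduce R, so the debt carries over to later packets.
        """
        if not re_ecn_capable:
            return "Not-RECT"
        if self.r > 0:
            self.r -= 1
            return "Re-Echo"   # RE flag blanked
        return "RECT"          # RE flag set


sender = ReEchoSender()
sender.on_ack(eci=2)           # two new CE marks reported
sent = [
    sender.next_codepoint(),
    sender.next_codepoint(re_ecn_capable=False),  # e.g. a retransmission
    sender.next_codepoint(),
    sender.next_codepoint(),
]
```

   In this run the retransmission is sent Not-RECT, so the second
   Re-Echo is deferred to the following data packet.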
1136 This does not ensure precisely the same number of octets have RE 1137 blanked as were CE marked. But we believe positive errors will 1138 cancel negative over a long enough period. {ToDo: However, more 1139 research is needed to prove whether this is so. If it is not, it may 1140 be necessary to increment and decrement R in octets rather than 1141 packets, by incrementing R as the product of D and the size in octets 1142 of packets being sent (typically the MSS).} 1144 4.2. Other Transports 1146 4.2.1. General Guidelines for Adding Re-ECN to Other Transports 1148 Re-ECT sender transports that have established the receiver transport 1149 is at least ECN-capable (not necessarily re-ECN capable) MUST blank 1150 the RE codepoint in packets carrying at least as many octets as 1151 arrive at receiver with the CE codepoint set. Re-ECN-capable sender 1152 transports should always initialise the ECN field to the ECT(1) 1153 codepoint once a flow is established. 1155 If the sender transport does not have sufficient feedback to even 1156 estimate the path's CE rate, it SHOULD set FNE continuously. If the 1157 sender transport has some, perhaps stale, feedback to estimate that 1158 the path's CE rate is nearly definitely less than E%, the transport 1159 MAY blank RE in packets for E% of sent octets, and set the RECT 1160 codepoint for the remainder. 1162 The following sections give guidelines on how re-ECN support could be 1163 added to RSVP or NSIS, to DCCP, and to SCTP - although separate 1164 Internet drafts will be necessary to document the exact mechanics of 1165 re-ECN if each of these protocols. 1167 {ToDo: Give a brief outline of what would be expected for each of the 1168 following: 1170 o UDP fire and forget (e.g. DNS) 1172 o UDP streaming with no feedback 1174 o UDP streaming with feedback 1176 } 1178 4.2.2. 
Guidelines for adding Re-ECN to RSVP or NSIS

   A separate I-D has been submitted [Re-PCN] describing how re-ECN can
   be used in an edge-to-edge rather than end-to-end scenario. It can
   then be used by downstream networks to police whether upstream
   networks are blocking new flow reservations when downstream
   congestion is too high, even though the congestion is in other
   operators' downstream networks. This relates to current work in
   progress on Admission Control over Diffserv using Pre-Congestion
   Notification, being reported to the IETF TSVWG [CL-deploy].

4.2.3. Guidelines for adding Re-ECN to DCCP

   Besides adjusting the initial features negotiation sequence,
   operating re-ECN in DCCP could be achieved by defining a new option
   to be added to acknowledgments, which would include a multibit field
   where the destination could copy its ECC.

4.2.4. Guidelines for adding Re-ECN to SCTP

   Appendix A of [RFC2960] gives the specifications for SCTP to support
   ECN. Similar steps should be taken to support re-ECN. Besides
   adjusting the initial features negotiation sequence, operating re-ECN
   in SCTP could be achieved by defining a new control chunk, which
   would include a multibit field where the destination could copy its
   ECC.

5. Network Layer

5.1. Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168]. However, an extension to the ECN field we
   call the RE (re-ECN extension) flag (Section 3.2) is defined in this
   document. It doubles the extended ECN codepoint space, giving 8
   potential codepoints. The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7.1 collects together and summarises all the
   changes defined in this document).
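
   As an illustration of how the 2-bit ECN field and the RE flag combine
   into 8 extended ECN codepoints, the following sketch maps each
   combination to the codepoint name and worth listed in Table 7. The
   function and table names are hypothetical, chosen only for this
   example, and a worth of None stands for the "n/a" entries.

```python
# (ECN field, RE flag) -> (extended ECN codepoint, worth), as listed in Table 7.
# The names EXTENDED_ECN and extended_ecn are hypothetical helpers.
EXTENDED_ECN = {
    (0b00, 0): ("Not-RECT", None),
    (0b00, 1): ("FNE", +1),
    (0b01, 0): ("Re-Echo", +1),
    (0b01, 1): ("RECT", 0),
    (0b10, 0): ("Legacy ECN use only", None),
    (0b10, 1): ("Currently Unused", None),
    (0b11, 0): ("CE(0)", 0),
    (0b11, 1): ("CE(-1)", -1),
}


def extended_ecn(ecn_field, re_flag):
    """Treat the ECN field and RE flag as one 3-bit extended ECN field."""
    if ecn_field not in (0, 1, 2, 3) or re_flag not in (0, 1):
        raise ValueError("ECN field must be 2 bits and RE flag 1 bit")
    return EXTENDED_ECN[(ecn_field, re_flag)]
```

   For example, extended_ecn(0b00, 1) yields the FNE codepoint, which a
   legacy router that ignores the RE flag would read as Not-ECT.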

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0). Alternatively, some would call this
   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
   header (Figure 5).

             0   1   2
           +---+---+---+
           | R | D | M |
           | E | F | F |
           +---+---+---+

   Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at
   the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 3
   and specified fully in Section 4. The RE flag is always considered
   in conjunction with the 2-bit ECN field, as if they were concatenated
   together to form a 3-bit extended ECN field. If the ECN field is set
   to either the ECT(1) or CE codepoint, when the RE flag is blanked
   (cleared to "0") it represents a re-echo of congestion experienced by
   an earlier packet. If the ECN field is set to the Not-ECT codepoint,
   when the RE flag is set to "1" it represents the feedback not
   established (FNE) codepoint, which signals that the packet was sent
   without the benefit of congestion feedback.

   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow. For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted. It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks. This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS]. The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e.
   one occurrence of the flag is sufficient but further occurrences
   achieve the same effect if previous ones were lost).

   There will probably be other claims pending on the use of bit 48. We
   know of at least two, [ARI05] and [RFC3514], but neither has been
   pursued in the IETF so far, although the present proposal would meet
   the needs of the former.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately. The present proposal is backward compatible with
   RFC3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should. So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2. Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 6).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  |  Option Len   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                   Reserved for future use                   |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option
   Header containing the Re-ECN Extension (RE) Control Flag

    0 1 2 3 4 5 6 7 8
   +-+-+-+-+-+-+-+-+-
   |AIU|C|Option ID|
   +-+-+-+-+-+-+-+-+-

   Figure 7: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes. For
   re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the
   Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header. But
   routers with these optional re-ECN features or a re-ECN policing
   function, will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC2402]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.
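
   The encoding in Figures 6 and 7 can be sketched as below. This is a
   minimal sketch under stated assumptions: no Option ID has been
   assigned, so PLACEHOLDER_OPTION_ID is a made-up value, the function
   names are hypothetical, and we assume the header is exactly the 8
   octets drawn in Figure 6 (so Hdr ext Len is 0 and Option Len is 4).

```python
import struct

PLACEHOLDER_OPTION_ID = 0x1E  # hypothetical: no Option ID is assigned yet


def option_type(aiu, c, option_id):
    """Pack the Option Type octet of Figure 7: 2-bit AIU, 1-bit C, 5-bit ID."""
    return ((aiu & 0b11) << 6) | ((c & 0b1) << 5) | (option_id & 0b11111)


def build_congestion_header(next_header, re_flag,
                            option_id=PLACEHOLDER_OPTION_ID):
    """Build the 8-octet hop by hop header sketched in Figure 6."""
    # AIU = 00: skip over the option if unrecognized.
    # C = 1: the Option Data may change en route.
    opt_type = option_type(0b00, 1, option_id)
    # The RE flag is the first (most significant) bit of the option data;
    # the remaining 31 bits are reserved and sent as zero here.
    option_data = (re_flag & 0b1) << 31
    hdr_ext_len = 0  # assumption: no further 8-octet units follow
    option_len = 4   # assumption: 4 octets of option data
    return struct.pack("!BBBBI", next_header, hdr_ext_len,
                       opt_type, option_len, option_data)
```

   Calling build_congestion_header(6, 1) would give an 8-octet header
   announcing TCP as the next header, with the RE flag set in the top
   bit of the fifth octet.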

   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. As the RE flag does
   not need end-to-end authentication, we set the C flag to "1".

   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
   with IANA.}

5.3. Router Forwarding Behaviour

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168]. Specifications for PHBs MAY define
   different forwarding behaviours from this default, but they are not
   required to. [Re-PCN] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped. Note that the FNE
      codepoint has been intentionally chosen so that, to legacy routers
      (which do not inspect the RE flag) an FNE packet appears to be
      Not-ECT so it will be dropped by legacy AQM algorithms.

      A network operator MUST NOT configure a router to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in Section 6.1.5 would count as rate limiters
      for this purpose.

   Preferential Drop: If a re-ECN capable router experiences very high
      load so that it has to drop arriving packets (e.g.
      a DoS attack), it MAY preferentially drop packets within the same
      Diffserv PHB using the preference order for extended ECN
      codepoints given in Table 7. Preferential dropping can be
      difficult to implement on some hardware, but if feasible it would
      discriminate against attack traffic if done as part of the overall
      policing framework of Section 6.1.3. If nowhere else, routers at
      the egress of a network SHOULD implement preferential drop
      (stronger than the MAY above). For simplicity, preferences 4 & 5
      MAY be merged into one preference level.

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo canceled  |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | Legacy ECN use    |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not               |
   |       |     |            |       |            | re-ECN-capable    |
   |       |     |            |       |            | transport         |
   +-------+-----+------------+-------+------------+-------------------+

      Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

      The above drop preferences are arranged to preserve packets with
      more positive worth (Section 3.4), given senders of positive
      packets must have honestly declared downstream congestion.
      This is explained fully in Section 6 on applications, particularly
      when the application of re-ECN to protect against DDoS attacks is
      described.

5.4. Justification for Setting the First SYN to FNE

   Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and
   the initial SYN MUST be set to FNE by Re-ECT client A
   (Section 4.1.4). So an initial SYN may be marked CE(-1) rather than
   dropped. This seems dangerous, because the sender has not yet
   established whether the receiver is a legacy one that does not
   understand congestion marking. It also seems to allow malicious
   senders to take advantage of ECN marking to avoid much of the drop
   they would otherwise suffer when launching SYN flooding attacks.
   Below we explain the features of the protocol design that remove both
   these dangers.

   ECN-capable initial SYN with a Not-ECT server: If the TCP server B
      is re-ECN capable, provision is made for it to feed back a
      possible congestion marked SYN in the SYN ACK (Section 4.1.4).
      But if the TCP client A finds out from the SYN ACK that the server
      was not ECN-capable, the TCP client MUST consider the first SYN as
      congestion marked before setting itself into Not-ECT mode.
      Section 4.1.4 mandates that such a TCP client MUST also set its
      initial window to 1 segment. In this way we remove the need to be
      cautious by setting the first SYN to Not-RECT. This will give
      worse performance while deployment is patchy, but better
      performance once deployment is widespread.

   SYN flooding attacks can't exploit ECN-capability: Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited.
      Introduction of the FNE codepoint was a deliberate move to enable
      transport-neutral handling of flow-start and flow state set-up in
      the IP layer where it belongs. It then becomes possible to
      protect against flooding attacks of all forms (not just SYN
      flooding) without transport-specific inspection for things like
      the SYN flag in TCP headers. Then, for instance, SYN flooding
      attacks using IPSec ESP encryption can also be rate limited at the
      IP layer.

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

5.5. Control and Management

5.5.1. Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops caused by the
   dropper from drops with other causes, such as congestion or
   transmission losses. The dropper would send the message to the
   sender of the flow, not the receiver.

   If we did define this message type, it would be REQUIRED for all
   re-ECT senders to parse and understand it. Note that a sender MUST
   only use this message to explain why losses are occurring.
   A sender MUST NOT take this message to mean that losses have occurred
   that it was not aware of. Otherwise, spoof messages could be sent by
   malicious sources to slow down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2. Rate Response Control

   The incentive framework of Section 6.1.3 implies there may be a need
   for a sender to send a request to an ingress policer asking that it
   be allowed to apply a non-default response to congestion (where
   TCP-friendly is assumed to be the default). This would require the
   sender to know what message format(s) to use and to be able to
   discover how to address the policer. The required control
   protocol(s) are outside the scope of this document, but will require
   definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6. IP in IP Tunnels

   For re-ECN to work correctly through IP in IP tunnels, it needs
   slightly different tunnel handling to regular ECN [RFC3168].
   Ideally, for re-ECN to work through a tunnel, the tunnel entry should
   copy both the RE flag and the ECN field from the inner to the outer
   IP header.
   Then at the tunnel exit, any congestion marking of the outer ECN
   field should overwrite the inner ECN field (unless the inner field is
   Not-ECT in which case an alarm should be raised). The RE flag
   shouldn't change along a path, so the outer RE flag should be the
   same as the inner. If it isn't, a management alarm should be raised.
   This behaviour is the same as the full-functionality variant of
   [RFC3168] at tunnel exit, but different at tunnel entry.

   If tunnels are left as they are specified in [RFC3168], whether the
   limited or full-functionality variants are used, a problem arises
   with re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be legacy traffic, and therefore may be wrongly
   rate limited. In a full-functionality ECN tunnel, the result will
   depend on whether the tunnel entry copies the inner RE flag to the
   outer header or the RE flag in the outer header is always cleared.
   If the former, the flow will tend to be too positive when accounted
   for at borders. If the latter, it will be too negative.

   {ToDo: A future version of this draft will discuss the necessary
   changes to IP in IP tunnels in more depth.}

5.7. Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o Various link layers support explicit congestion notification, such
     as Frame Relay and ATM. Explicit congestion notification is
     proposed to be added to other link layers, such as Ethernet
     (802.3ar Ethernet congestion management) and MPLS [ECN-MPLS];

   o Encryption and IPSec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases). Given the
   RE flag is not intended to change along the path, this means that
   downstream congestion will still be measurable at any point where IP
   is processed on the path by subtracting negative from positive
   markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to burrow
   into inner headers. Obfuscation of flow identifiers is not a problem
   for re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid, it only requires them to be unique. So if
   an IPSec encapsulating security payload (ESP [RFC2406]) or an
   authentication header (AH [RFC2402]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are.
   The alternative would be for the sender to make each packet appear to
   be a new flow, which would require them all to be marked FNE in order
   to avoid being treated with the bulk of malicious flows at the egress
   dropper. Given the FNE marking is worth +1 and networks are likely
   to rate limit FNE packets, endpoints are given an incentive not to
   set FNE on each packet. But if the sender really does want to hide
   the flow relationship between packets it can choose to pay the cost
   of multiple FNE packets, which in the long run will compensate for
   the extra memory required on network policing elements to process
   each flow.

6. Applications

6.1. Policing Congestion Response

6.1.1. The Policing Problem

   The current Internet architecture trusts hosts to respond voluntarily
   to congestion. Limited evidence shows that the large majority of
   end-points on the Internet comply with a TCP-friendly response to
   congestion. But telephony (and increasingly video) services over the
   best-effort Internet are attracting the interest of major commercial
   operations. Most of these applications do not respond to congestion
   at all. Those that can switch to lower rate codecs still have a
   lower bound below which they must become unresponsive to congestion.

   Of course, the Internet is intended to support many different
   application behaviours. But the problem is that this freedom can be
   exercised irresponsibly. The greater problem is that we will never
   be able to agree on where the boundary is between responsible and
   irresponsible. Therefore re-ECN is designed to allow different
   networks to set their own view of the limit to irresponsibility, and
   to allow networks that choose a more conservative limit to push back
   against congestion caused in more liberal networks.

   As an example of the impossibility of setting a standard for
   fairness, mandating TCP-friendliness would set the bar too high for
   unresponsive streaming media, but still some would say the bar was
   too low. Even though all known peer-to-peer filesharing applications
   are TCP-compatible, they can cause a disproportionate amount of
   congestion, simply by using multiple flows and by transferring data
   continuously relative to other short-lived sessions. On the other
   hand, if we swung the other way and set the bar low enough to allow
   streaming media to be unresponsive, we would also allow denial of
   service attacks, which are typically unresponsive to congestion and
   consist of multiple continuous flows.

   Applications that need (or choose) to be unresponsive to congestion
   can effectively take (some would say steal) whatever share of
   bottleneck resources they want from responsive flows. Whether or not
   such free-riding is common, inability to prevent it increases the
   risk of poor returns for investors in network infrastructure, leading
   to under-investment. An increasing proportion of unresponsive or
   free-riding demand coupled with persistent under-supply is a broken
   economic cycle. Therefore, if the current, largely co-operative
   consensus continues to erode, congestion collapse could become more
   common in more areas of the Internet [RFC3714].

   While we have designed re-ECN so that networks can choose to deploy
   stringent policing, this does not imply we advocate that every
   network should introduce tight controls on those that cause
   congestion. Re-ECN has been specifically designed to allow different
   networks to choose how conservative or liberal they wish to be with
   respect to policing congestion. But those that choose to be
   conservative can protect themselves from the excesses that liberal
   networks allow their users.

6.1.2.
The Case Against Bottleneck Policing

   The state of the art in rate policing is the bottleneck policer,
   which is intended to be deployed at any forwarding resource that may
   become congested. Its aim is to detect flows that cause
   significantly more local congestion than others. Although operators
   might solve their immediate problems by deploying bottleneck
   policers, we are concerned that widespread deployment would make it
   extremely hard to evolve new application behaviours. We believe the
   IETF should offer re-ECN as the preferred protocol on which to base
   solutions to the policing problems of operators, because it would not
   harm evolvability and, frankly, it would be far more effective (see
   later for why).

   Approaches like [XCHOKe] & [pBox] are good ways of rate policing
   traffic without the benefit of whole path information (such as could
   be provided by re-ECN). But they must be deployed at bottlenecks in
   order to work. Unfortunately, a large proportion of traffic
   traverses at least two bottlenecks (in two access networks),
   particularly with the current traffic mix where peer-to-peer
   file-sharing is prevalent. If ECN were deployed, we believe it would
   be likely that these bottleneck policers would be adapted to combine
   ECN congestion marking from the upstream path with local congestion
   knowledge. But then the only useful placement for such policers
   would be close to the egress of the internetwork.

   But then, if these bottleneck policers were widely deployed (which
   would require them to be more effective than they are now), the
   Internet would find itself with one universal rate adaptation policy
   (probably TCP-friendliness) embedded throughout the network.
   Given TCP's congestion control algorithm is already known to be
   hitting its scalability limits and new algorithms are being developed
   for high-speed congestion control, embedding TCP policing into the
   Internet would make evolution to new algorithms extremely painful.
   If a source wanted to use a different algorithm, it would have to
   first discover then negotiate with all the policers on its path,
   particularly those in the far access network. The IETF has already
   traveled that path with the Intserv architecture and found it
   constrains scalability [RFC2208].

   Anyway, if bottleneck policers were ever widely deployed, they would
   be likely to be bypassed by determined attackers. They inherently
   have to police fairness per flow or per source-destination pair.
   Therefore they can easily be circumvented either by opening multiple
   flows (by varying the end-point port number); or by spoofing the
   source address but arranging with the receiver to hide the true
   return address at a higher layer.

6.1.3. Re-ECN Incentive Framework

   The aim is to create an incentive environment that ensures optimal
   sharing of capacity despite everyone acting selfishly (including
   lying and cheating). Of course, the mechanisms put in place for this
   can lie dormant wherever co-operation is the norm.

   Throughout this document we focus on path congestion. But some forms
   of fairness, particularly TCP's, also depend on round trip time. So,
   we also propose to measure downstream path delay using re-feedback.
   This proposal will be published in a very simple future draft, but
   for now we give an outline in Appendix F.

   Figure 8 sketches the incentive framework that we will describe piece
   by piece throughout this section. We will do a first pass in
   overview, then return to each piece in detail.
   We re-use the earlier example of how downstream congestion is derived
   by subtracting upstream congestion from path congestion (Figure 1)
   but depict multiple trust boundaries to turn it into an internetwork.
   For clarity, only downstream congestion is shown (the difference
   between the two earlier plots). The graph displays downstream path
   congestion seen in a typical flow as it traverses an example path
   from sender S to receiver R, across networks N1, N2 & N4. Everyone
   is shown using re-ECN correctly, but we intend to show why everyone
   would /choose/ to use it correctly, and honestly.

   Three main types of self-interest can be identified:

   o Users want to transmit data across the network as fast as
     possible, paying as little as possible for the privilege. In this
     respect, there is no distinction between senders and receivers,
     but we must be wary of potential malice by one on the other;

   o Network operators want to maximise revenues from the resources
     they invest in. They compete amongst themselves for the custom of
     users;

   o Attackers (whether users or networks) want to use any opportunity
     to subvert the new re-ECN system for their own gain or to damage
     the service of their victims, whether targeted or random.

       policer
          |
          |
       S <-----N1----> <---N2---> <---N4--> R        domain
          | :                                 :
        A\|/:                                 :
        | V :                                 :
     3% |---------+                           :
        |   :     |                           :
     2% |   :     +-----------------------+   :
        |   :       downstream congestion |   :
     1% |   :                             |   :
        |   :                             |   :
     0% +---------------------------------+=====-->
        0                                 i   ^      resource index
                  |                       |  /|\
                1.00%                   2.00% |      marking fraction
                                              |
                                           dropper

   Figure 8: Incentive Framework, showing creation of opposing pressures
   to under-declare and over-declare downstream congestion, using a
   policer and a dropper

   Source congestion control: We want to ensure that the sender will
      throttle its rate as downstream congestion increases. Whatever
      the agreed congestion response (whether TCP-compatible or some
      enhanced QoS), to some extent it will always be against the
      sender's interest to comply.

   Ingress policing: But it is in all the network operators' interests
      to encourage fair congestion response, so that their investments
      are employed to satisfy the most valuable demand. The re-ECN
      protocol ensures packets carry the necessary information about
      their own expected downstream congestion so that N1 can deploy a
      policer at its ingress to check that S is complying with whatever
      congestion control it should be using (Section 6.1.5). If N1 is
      extremely conservative it may police each flow, but it can choose
      to just police the bulk amount of congestion each customer causes
      without regard to flows, or if it is extremely liberal it need not
      police congestion control at all. In any case, it is always
      preferable to police traffic at the very first ingress into an
      internetwork, before non-compliant traffic can cause any damage.

   Edge egress dropper: If the policer ensures the source has less
      right to a high rate the higher it declares downstream congestion,
      the source has a clear incentive to understate downstream
      congestion. But, if flows of packets are understated when they
      enter the internetwork, they will have become negative by the time
      they leave. So, we introduce a dropper at the last network
      egress, which drops packets in flows that persistently declare
      negative downstream congestion (see Section 6.1.4 for details).

                  ..competitive routing
                .'           :      '.
              .'    p e n a l:t i e s '.
             :          |    :    \     :
        A    :          |    :     |    :
        |S <-----N1----> <---N2---> <---N4--> R      domain
        |    :          |    :     |    :
        |    V          |    :     |    :
     3% |--------+      |    :     |    :
        |        |      V    V     V    V
     2% |        +-----------------------+
        |          downstream congestion |
     1% |                    :           |
        |                    :           |
     0% +--------------------------------+=====-->
        0                    ^           i           resource index
                 |          /|\          |
               1.00%         |         2.00%         marking fraction
                             |
                         sanctions

   Figure 9: Incentives at Inter-domain Borders

   Inter-domain traffic policing: But next we must ask, if congestion
      arises downstream (say in N4), what is the ingress network's
      (N1's) incentive to police its customers' response? If N1 turns a
      blind eye, its own customers benefit while other networks suffer.
      This is why all inter-domain QoS architectures (e.g. Intserv,
      Diffserv) police traffic each time it crosses a trust boundary.
      We have already shown that re-ECN gives a trustworthy measure of
      the expected downstream congestion that a flow will cause by
      subtracting negative volume from positive at any intermediate
      point on a path. N4 (say) can use this measure to police all the
      responses to congestion of all the sources beyond its upstream
      neighbour (N2), but in bulk with one very simple passive
      mechanism, rather than per flow, as we will now explain using
      Figure 9.
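
      The bulk measure described above (negative volume subtracted from
      positive) can be sketched as a passive border meter. This is an
      illustrative sketch only: the function name and the representation
      of packets as (codepoint, size) pairs are assumptions made for the
      example, and the worth values are those of Table 7.

```python
# Worth of each extended ECN codepoint, taken from Table 7. Codepoints whose
# worth is "n/a" (Not-RECT, legacy ECN, currently unused) are left out.
WORTH = {"Re-Echo": +1, "FNE": +1, "CE(0)": 0, "RECT": 0, "CE(-1)": -1}


def downstream_congestion_volume(packets):
    """Return positive minus negative volume, in octets, for a bulk of packets.

    `packets` is an iterable of (codepoint, size_in_octets) pairs, a
    hypothetical representation chosen for this sketch. No per-flow state
    is kept: every packet crossing the border feeds the same two counters.
    """
    positive = 0
    negative = 0
    for codepoint, size in packets:
        worth = WORTH.get(codepoint)
        if worth is None:
            continue  # not re-ECN traffic, so it carries no worth
        if worth > 0:
            positive += size
        elif worth < 0:
            negative += size
    return positive - negative
```

      For instance, if 100 equal-sized packets cross a border, of which
      3 are Re-Echo and 1 is CE(-1), the meter reports a volume equal to
      2 packets, i.e. 2% downstream congestion for that aggregate.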
Emulating policing with inter-domain congestion penalties: Between high-speed networks, we would rather avoid per-flow policing, and we would rather avoid holding back traffic while it is policed. Instead, once re-ECN has arranged headers to carry downstream congestion honestly, N2 can contract to pay N4 penalties in proportion to a single bulk count of the congestion metrics crossing their mutual trust boundary (Section 6.1.6). In this way, N4 puts pressure on N2 to suppress downstream congestion, for every flow passing through the border interface, even though they will all start and end in different places, and even though they may all be allowed different responses to congestion. The figure depicts this downward pressure on N2 by the solid downward arrow at the egress of N2. Then N2 has an incentive either to police the congestion response of its own ingress traffic (from N1) or to emulate policing by applying penalties to N1 in turn on the basis of congestion counted at their mutual boundary. In this recursive way, the incentives for each flow to respond correctly to congestion trace back with each flow precisely to each source, despite the mechanism not recognising flows (see Section 6.2.2).

Inter-domain congestion charging diversity: Any two networks are free to agree any of a range of penalty regimes between themselves within the following reasonable constraints. N2 should expect to have to pay penalties to N4 where penalties monotonically increase with the volume of congestion and negative penalties are not allowed. For instance, they may agree an SLA with tiered congestion thresholds, where higher penalties apply the higher the threshold that is broken. But the most obvious (and useful) form of penalty is where N4 levies a charge on N2 proportional to the volume of downstream congestion N2 dumps into N4.
In the explanation that follows, we assume this specific variant of volume charging between networks: charging proportionate to the volume of congestion.

We must make clear that we are not advocating that everyone should use this form of contract. We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. And we strongly share this desire to encourage diversity. But our aim is merely to show that border policing can at least work with this one model; we can then assume that operators might experiment with the metric in other models (see Section 6.1.6 for examples). Of course, operators are free to complement this usage element of their charges with traditional capacity charging, and we expect they will.

No congestion charging to users: Bulk congestion penalties at trust boundaries are passive and extremely simple, and lose none of their per-packet precision from one boundary to the next (unlike Diffserv all-address traffic conditioning agreements, which dissipate their effectiveness across long topologies). But at any trust boundary, there is no imperative to use congestion charging. Traditional traffic policing can be used, if its complexity and cost are preferred. In particular, at the boundary with end customers (e.g. between S and N1), traffic policing will most likely be more appropriate. Policer complexity is less of a concern at the edge of the network. And end-customers are known to be highly averse to the unpredictability of congestion charging.

NOTE WELL: This document neither advocates nor requires congestion charging for end customers and advocates but does not require inter-domain congestion charging.

Competitive discipline of inter-domain traffic engineering: With inter-domain congestion charging, a domain seems to have a perverse incentive to fake congestion; N2's profit depends on the difference between congestion at its ingress (its revenue) and at its egress (its cost). So, overstating internal congestion seems to increase profit. However, smart border routing [Smart_rtg] by N1 will bias its multipath routing towards the least cost routes. So, N2 risks losing all its revenue to competitive routes if it overstates congestion (see Section 6.2.3). In other words, if N2 is the least congested route, its ability to raise excess profits is limited by the congestion on the next least congested route. This pressure on N2 to remain competitive is represented by the dotted downward arrow at the ingress to N2 in Figure 9.

Closing the loop: All the above elements conspire to trap everyone between two opposing pressures (the downward and upward arrows in Figure 8 & Figure 9), ensuring the downstream congestion metric arrives at the destination neither above nor below zero. So, we have arrived back where we started in our argument. The ingress edge network can rely on downstream congestion declared in the packet headers presented by the sender. So it can police the sender's congestion response accordingly.

Evolvability of congestion control: We have seen that re-ECN enables policing at the very first ingress. We have also seen that, as flows continue on their path through further networks downstream, re-ECN removes the need for further per-domain ingress policing of all the different congestion responses allowed to each different flow. This is why the evolvability of re-ECN policing is so superior to bottleneck policing or to any policing of different QoS for different flows.
Even if all access networks choose to conservatively police congestion per flow, each will want to compete with the others to allow new responses to congestion for new types of application. With re-ECN, each can introduce new controls independently, without coordinating with other networks and without having to standardise anything. But, as we have just seen, by making inter-domain penalties proportionate to bulk downstream congestion, downstream networks can be agnostic to the specific congestion response for each flow, but they can still apply more back-pressure the more liberal the ingress access network has been in the response to congestion it allowed for each flow.

6.1.3.1. The Case against Classic Feedback

A system that produces an optimal outcome as a result of everyone's selfish actions is extremely powerful, especially one that enables evolvability of congestion control. But why do we have to change to re-ECN to achieve it? Can't classic congestion feedback (as used already by standard ECN) be arranged to provide similar incentives and similar evolvability? Superficially it can. Kelly's seminal work showed how we can allow everyone the freedom to evolve whatever congestion control behaviour is in their application's best interest but still optimise the whole system of networks and users by placing a price on congestion to ensure responsible use of this freedom [Evol_cc]. Kelly used ECN with its classic congestion feedback model as the mechanism to convey congestion price information. The mechanism was nearly identical to volume charging, except that only the volume of packets marked with congestion experienced (CE) was counted.

However, below we explain why relying on classic feedback /requires/ congestion charging to be used, while re-ECN achieves the same powerful outcome (given it is built on Kelly's foundations), but does not /require/ congestion charging. In brief, the problem with classic feedback is that the incentives have to trace the indirect path back to the sender---the long way round the feedback loop. For example, if classic feedback were used in Figure 8, N2 would have had to influence N1 via all of N4, R & S rather than directly.

Inability to agree what is happening downstream: For a network to police its upstream neighbour's congestion response, the two neighbours should be able to agree on the congestion to be responded to. Whatever the feedback regime, as packets change hands at each trust boundary, any path metrics they carry are verifiable by both neighbours. But, with a classic path metric, they can only agree on the /upstream/ path congestion.

Inaccessible back-channel: The network needs a whole-path congestion metric if it wants to control the source. Classically, whole-path congestion emerges at the destination, to be fed back from receiver to sender in a back-channel. But, in any data network, back-channels need not be visible to relays, as they are essentially communications between the end-points. They may be encrypted, asymmetrically routed or simply omitted, so no network element can reliably intercept them. The congestion charging literature solves this problem by charging the receiver and assuming this will cause the receiver to refer the charges to the sender. But, of course, this creates unintended side-effects...

`Receiver pays' unacceptable: In connectionless datagram networks, receivers and receiving networks cannot prevent reception from malicious senders, so `receiver pays' opens them to `denial of funds' attacks.

End-user congestion charging unacceptable: Even if `denial of funds' were not a problem, we know that end-users are highly averse to the unpredictability of congestion charging and anyway, we want to avoid restricting network operators to just one retail tariff. But with classic feedback only an upstream metric is available, so we cannot avoid having to wrap the `receiver pays' money flow around the feedback loop, necessarily forcing end-users to be subjected to congestion charging.

To summarise so far, with classic feedback, policing congestion response without losing evolvability /requires/ congestion charging of end-users and a `receiver pays' model, whereas, with re-ECN, it is still possible to influence incentives using congestion charging but using the safer `sender pays' model. However, congestion charging is only likely to be appropriate between domains. So, without losing evolvability, re-ECN enables technical policing mechanisms that are more appropriate for end users than congestion pricing.

We now take a second pass over the incentive framework, filling in the detail.

6.1.4. Egress Dropper

As traffic leaves the last network before the receiver (domain N4 in Figure 8), the fraction of positive octets in a flow should match the fraction of negative octets introduced by congestion marking, leaving a balance of zero. If it is less (a negative flow), it implies that the source is understating path congestion (which will reduce the penalties that N2 owes N4).

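The balance test described above can be illustrated with a minimal sketch. It is not the dropper algorithm of Appendix E; it only shows the per-flow arithmetic, assuming the worths used in this section (Re-Echo and FNE positive, RECT neutral, CE(0) canceled, CE(-1) negative). The names `WORTH`, `FlowBalance` and `flow_from_counts` are hypothetical and chosen for illustration only.

```python
# Worth of each re-ECN marking as described in this section.
# The names WORTH, FlowBalance and flow_from_counts are illustrative
# assumptions; they are not defined by this document.
WORTH = {
    "Re-Echo": +1,  # positive
    "FNE": +1,      # positive
    "RECT": 0,      # neutral
    "CE(0)": 0,     # canceled
    "CE(-1)": -1,   # negative
}


class FlowBalance:
    """Running balance of positive minus negative octets for one flow."""

    def __init__(self):
        self.total_octets = 0
        self.balance_octets = 0

    def observe(self, marking, size_octets):
        """Account for one packet seen at the egress."""
        self.total_octets += size_octets
        self.balance_octets += WORTH[marking] * size_octets

    def balance_fraction(self):
        """Positive fraction minus negative fraction of the flow's octets."""
        if self.total_octets == 0:
            return 0.0
        return self.balance_octets / self.total_octets


def flow_from_counts(counts, size_octets=1000):
    """Build a flow from a {marking: packet_count} summary (equal sizes)."""
    flow = FlowBalance()
    for marking, packets in counts.items():
        for _ in range(packets):
            flow.observe(marking, size_octets)
    return flow


# An honest source re-echoes as much congestion as the path marked.
honest = flow_from_counts({"Re-Echo": 3, "CE(-1)": 3, "RECT": 94})
# An understating source re-echoes less than the path marked.
understating = flow_from_counts({"Re-Echo": 1, "CE(-1)": 3, "RECT": 96})

print(honest.balance_fraction())
print(understating.balance_fraction())
```

In this example the honest flow arrives at the egress with a zero balance, while the understating flow arrives with a negative one. A real dropper would also need to judge persistence over time and to avoid unbounded per-flow state, which this sketch does not attempt.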
If flows are positive, N4 need take no action---this simply means its upstream neighbour is paying more penalties than it needs to, and the source is going slower than it needs to. But, to protect itself against persistently negative flows, N4 will need to install a dropper at its egress. Appendix E gives a suggested algorithm for this dropper. There is no intention that the dropper algorithm needs to be standardised; it is merely provided to show that an efficient, robust algorithm is possible. But whatever algorithm is used must meet the criteria below:

o  It SHOULD introduce minimal false positives for honest flows;

o  It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o  It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o  It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate for expected losses.

Note that the dropper operates on flows but we would like it not to require per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a dropper is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to the sender of setting FNE (positive `worth').
Indeed, with the FNE codepoint, the rate at which a sender can generate new flows can be limited (Appendix G). In this respect, the FNE codepoint works like Handley's state set-up bit [Steps_DoS].

Appendix E also gives an example dropper implementation that aggregates flow state. Dropper algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a dropper SHOULD only allow flows into the average if they start with FNE, but it SHOULD NOT include packets with the FNE codepoint set in the average. A sender sets the FNE codepoint when it does not have the benefit of feedback from the receiver. So, counting packets with FNE set would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the dropper detects a persistently negative flow, it SHOULD drop sufficient negative and neutral packets to force the flow to not be negative. Drops SHOULD be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal extra harm.

6.1.5. Rate Policing

Access operators who wish to check that a sender is complying with a particular rate response to congestion can deploy rate policers at the very first ingress to the internetwork. Re-ECN has been designed to avoid the need for bottleneck policing so that we can avoid a future where a single rate adaptation policy is embedded throughout the network. Instead, re-ECN allows the particular rate adaptation policy to be solely agreed bilaterally between the sender and its ingress access provider (Section 5.5.2 discusses possible ways to signal between them), which allows congestion control to be policed, but maintains its evolvability, requiring only a single, local box to be updated.

If desired, the re-ECN protocol allows these ingress policers to perform per-flow policing according to the widely adopted TCP rate adaptation, perhaps as a default. But it also allows new rate adaptation policies beyond TCP to be enforced. Perhaps more usefully, it also allows the flexibility for networks to choose to police users as a whole, rather than flows.

Appendix G gives examples of per-user and per-flow policing algorithms. But there is no implication that these algorithms are to be standardised, or that they are ideal. The ingress rate policer is the part of the re-ECN incentive framework that is intended to be the most flexible. Once endpoint protocol handlers for re-ECN and egress droppers are in place, operators can choose exactly which congestion response they want to police, and whether they want to do it per user, per flow or not at all.

However, if a rate policer is used, it should use path (not downstream) congestion as the relevant metric, which is represented by the fraction of octets in packets with positive (Re-Echo and FNE) and canceled (CE(0)) markings. Of course, re-ECN provides all the information a policer needs directly in the packets being policed. So, even policing TCP's AIMD algorithm is relatively straightforward. Appendix G presents an example design, but the choice of preferred mechanism is up to the implementer.

Note that we have included canceled packets in the measure of path congestion. Canceled packets arise when the sender re-echoes earlier congestion, but then this Re-Echo packet just happens to be congestion marked itself. One would not normally expect many canceled packets at the first ingress because one would not normally expect much congestion marking to have been necessary that soon in the path.
However, a home network or campus network may well sit between the sending endpoint and the ingress policer, so some congestion may occur upstream of the policer. And if congestion does occur upstream, some canceled packets should be visible, and should be taken into account in the measure of path congestion.

But a much more important reason for including canceled packets in the measure of path congestion at an ingress policer is that a sender might otherwise subvert the protocol by sending canceled packets instead of neutral (RECT) packets. Like neutral, canceled packets are worth zero, so the sender knows they won't be counted against any quota it might have been allowed. But unlike neutral packets, canceled packets are immune to congestion marking, because they have already been congestion marked. So, it is both correct and useful that canceled packets should be included in a policer's measure of path congestion, as this removes the incentive the sender would otherwise have to mark more packets as canceled than it should.

An ingress policer should also ensure that flows are not already negative when they enter the access network. As with canceled packets, the presence of negative packets will typically be unusual. Therefore it will be easy to detect negative flows at the ingress by just detecting negative packets then monitoring the flow they belong to.

Of course, even if the sender does operate its own network, it may arrange not to congestion mark traffic. Whether the sender does this or not is of no concern to anyone else except the sender. Such a sender will not be policed against its own network's contribution to congestion, but the only resulting problem would be overload in the sender's own network.

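The path congestion metric that an ingress policer would measure, as described above, can be sketched as follows. This is only an illustration of the metric, not one of the policer designs in Appendix G. The function names and the representation of traffic as (marking, size) pairs are assumptions made for the example.

```python
# Markings counted towards path congestion by an ingress policer,
# following the text above: positive (Re-Echo, FNE) and canceled (CE(0)).
# All names in this sketch are illustrative assumptions.
PATH_CONGESTION_MARKINGS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion_fraction(packets):
    """Fraction of octets carried in positive or canceled packets.

    `packets` is a list of (marking, size_octets) pairs for the
    re-ECN-capable traffic of one sender (an assumed representation).
    """
    counted = 0
    total = 0
    for marking, size_octets in packets:
        total += size_octets
        if marking in PATH_CONGESTION_MARKINGS:
            counted += size_octets
    return counted / total if total else 0.0


def has_negative_packets(packets):
    """True if any packet is already negative, CE(-1), at the ingress."""
    return any(marking == "CE(-1)" for marking, _ in packets)


# An honest sender: 3 Re-Echo packets in every 100, the rest neutral.
honest = [("Re-Echo", 1500)] * 3 + [("RECT", 1500)] * 97
# A sender that sends canceled packets in place of neutral ones.
evasive = [("Re-Echo", 1500)] * 3 + [("CE(0)", 1500)] * 97

print(path_congestion_fraction(honest))
print(path_congestion_fraction(evasive))
```

Because canceled packets are included in the count, the second sender gains nothing by substituting CE(0) for RECT: the policer sees it as declaring far more path congestion, not less. The `has_negative_packets` check corresponds to the simple trigger described above for starting to monitor a flow that is already negative at the ingress.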
Finally, we must not forget that an easy way to circumvent re-ECN's defences is for the source to turn off re-ECN support, by setting the Not-RECT codepoint, implying legacy traffic. Therefore an ingress policer must put a general rate-limit on Not-RECT traffic, which SHOULD be lax during early, patchy deployment, but will have to become stricter as deployment widens. Similarly, flows starting without an FNE packet can be confined by a strict rate-limit used for the remainder of flows that haven't proved they are well-behaved by starting correctly (therefore they need not consume any flow state---they are just confined to the `misbehaving' bin if they carry an unrecognised flow ID).

6.1.6. Inter-domain Policing

One of the main design goals of re-ECN is for border security mechanisms to be as simple as possible, otherwise they will become the pinch-points that limit scalability of the whole internetwork. We want to avoid per-flow processing at borders and to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding.

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing.

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: CE(-1). Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to.
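The bulk accounting just described might be sketched as below. This is not the pseudo-code of Appendix H.1, which is not reproduced here; it is a minimal illustration under the assumption that the meter sees each packet's marking and size. The class name `BorderMeter` and its attribute are hypothetical.

```python
# Markings with positive and negative worth, as named in the text above.
# BorderMeter and its attribute name are illustrative assumptions.
POSITIVE = {"Re-Echo", "FNE"}
NEGATIVE = {"CE(-1)"}


class BorderMeter:
    """Bulk, flow-agnostic congestion counter for one border interface."""

    def __init__(self):
        # Positive volume minus negative volume over the accounting period.
        self.downstream_congestion_octets = 0

    def count(self, marking, size_octets):
        """Passively account for one packet crossing the border."""
        if marking in POSITIVE:
            self.downstream_congestion_octets += size_octets
        elif marking in NEGATIVE:
            self.downstream_congestion_octets -= size_octets
        # In this sketch, all other markings leave the counter unchanged.


meter = BorderMeter()
# Packets from several interleaved flows; the meter never inspects a flow ID.
crossing = [
    ("Re-Echo", 1500), ("RECT", 1500), ("CE(-1)", 1500),
    ("Re-Echo", 1500), ("FNE", 40), ("CE(0)", 1500),
]
for marking, size_octets in crossing:
    meter.count(marking, size_octets)

print(meter.downstream_congestion_octets)
```

The single counter is all the state the border needs, which is what allows the mechanism to run passively alongside forwarding. How the resulting volume is turned into a penalty is left to the agreement between the two networks.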
The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network. Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix H.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 6.1.4 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion. But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination.
A number of attacks have been identified where a sender gains from sending dummy traffic, or can attack someone or something using dummy traffic, even though it isn't communicating any information to anyone:

o  A host can send traffic with no positive markings towards its intended destination, aiming to transmit as much traffic as any dropper will allow [Bauer06]. It may add forward error correction (FEC) to repair as much drop as it experiences.

o  A host can send dummy traffic into the network with no positive markings and with no intention of communicating with anyone, but merely to cause higher levels of congestion for others who do want to communicate (DoS). So, to ride over the extra congestion, everyone else has to spend more of whatever rights to cause congestion they have been allowed.

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 5.3 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour. It could even initialise the TTL so that it expires shortly after entering the neighbouring network, reducing the chance of detection further downstream.
This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix H.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows.
Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders---they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical---best endeavours will be sufficient.

For instance, let us consider the case where a source sends traffic with no positive markings at all, hoping to at least get as much traffic delivered as network-based droppers will allow. The flow is likely to go at least slightly negative in the first network on the path (N1 if we use the example network layout in Figure 9). If all networks use the algorithm in Appendix H.2 to inflate penalties at their border with an upstream network, they will remove the effect of negative flows. So, for instance, N2 will not be paying a penalty to N1 for this flow. Further, because the flow contributes no positive markings at all, a dropper at the egress will completely remove it.

The remaining problem is that every network is carrying a flow that is causing congestion to others but not being held to account for the congestion it is causing.
Whenever the fail-safe border algorithm (Section 6.1.7) or the border algorithm to compensate for negative flows (Appendix H.2) detects a negative flow, it can instantiate a focused dropper for that flow locally. It may be some time before the flow is detected, but the more strongly negative the flow is, the more quickly it will be detected by the fail-safe algorithm. But, in the meantime, it will not be distorting border incentives. Until it is detected, if it contributes to drop anywhere, its packets will tend to be dropped before others if routers use the preferential drop rules in Section 5.3, which discriminate against non-positive packets. All networks below the point where a flow goes negative (N1, N2 and N4 in this case) have an incentive to remove this flow, but the router where it first goes negative (in N1) can of course remove the problem for everyone downstream.

In the case of DDoS attacks, Section 6.2.1 describes how re-ECN mitigates their force.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but any attempt to discard such flows further upstream while they are still positive is NOT RECOMMENDED. Such over-zealous push-back is unnecessary and potentially dangerous. These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far.
If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion. Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine, but not a message claiming that a positive flow will go negative later and so should be dropped.

6.1.7. Inter-domain Fail-safes

The mechanisms described so far create incentives for rational network operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows: A small sample of congestion marked (negative) packets should be picked randomly as they cross a border interface.
Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the balance of positive minus negative markings is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear more quickly in the sample by selecting randomly solely from positive (or negative) packets.

6.1.8.  Simulations

Simulations of policer and dropper performance done for the multi-bit version of re-feedback have been included in section 5 "Dropper Performance" of [Re-fb]. Simulations of policer and dropper for the re-ECN version described in this document are work in progress.

6.2.  Other Applications

6.2.1.  DDoS Mitigation

A flooding attack is inherently about congestion of a resource. Because re-ECN ensures the sources causing network congestion experience the cost of their own actions, it acts as a first line of defence against DDoS. As load focuses on a victim, upstream queues grow, requiring honest sources to pre-load packets with a higher fraction of positive packets. Once downstream routers are so congested that they are dropping traffic, they will be CE marking 100% of the traffic they do forward. Honest sources will therefore be sending 100% Re-Echo (and therefore being severely rate-limited at the ingress).

Senders under malicious control can either do the same as honest sources, and be rate-limited at ingress, or they can understate congestion by sending more neutral RECT packets than they should. If sources understate congestion (i.e. do not re-echo sufficient positive packets) and the preferential drop ranking is implemented on routers (Section 5.3), these routers will preserve positive traffic until last.
So, the neutral traffic from malicious sources will all be automatically dropped first. Either way, the malicious sources cannot send more than honest sources.

Further, hosts under malicious control will tend to be re-used for many different attacks. They will therefore build up a long term history of causing congestion. So, as long as the population of potentially compromisable hosts around the Internet is limited, the per-user policing algorithms in Appendix G.1 will gradually throttle down zombies and other launchpads for attacks. Widespread deployment of re-ECN could therefore considerably dampen the force of DDoS. Certainly, zombie armies could hold their fire for long enough to be able to build up enough credit in the per-user policers to launch an attack. But they would then still be limited to no more throughput than other, honest users.

Inter-domain traffic policing (see Section 6.1.6) ensures that any network that harbours compromised `zombie' hosts will have to bear the cost of the congestion caused by traffic from zombies in downstream networks. Such networks will be incentivised to deploy per-user policers that rate-limit hosts that are unresponsive to congestion so they can only send very slowly into congested paths. As well as protecting other networks, the extremely poor performance at any sign of congestion will incentivise the zombie's owner to clean it up. However, the host should behave normally when using uncongested paths.

Uniquely, re-ECN handles DDoS traffic without relying on the validity of identifiers in packets. Certainly the egress dropper relies on uniqueness of flow identifiers, but not their validity.
So if a source spoofs another address, re-ECN works just as well, as long as the attacker cannot imitate all the flow identifiers of another active flow passing through the same dropper (see Section 6.3). Similarly, the ingress policer relies on uniqueness of flow IDs, not their validity. This is because a new flow will only be allowed any rate at all if it starts with FNE, and the more FNE packets there are starting new flows, the more they will be limited. Essentially a re-ECN policer limits the bulk of all congestion entering the network through a physical interface; limiting the congestion caused by each flow is merely an optional extra.

6.2.2.  End-to-end QoS

{ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline of the text that will be added here).}

6.2.3.  Traffic Engineering

{ToDo: }

6.2.4.  Inter-Provider Service Monitoring

{ToDo: }

6.3.  Limitations

The known limitations of the re-ECN approach are:

o  We still cannot defend against the attack described in Section 10 where a malicious source sends negative traffic through the same egress dropper as another flow and imitates its flow identifiers, allowing a malicious source to cause an innocent flow to experience heavy drop.

o  Re-feedback for TTL (re-TTL) would also be desirable at the same time as re-ECN. Unfortunately this requires a further standards action for the mechanisms briefly described in Appendix F.

o  Traffic must be ECN-capable for re-ECN to be effective. The only defence against malicious users who turn off ECN capability is that networks are expected to rate limit Not-ECT traffic and to apply higher drop preference to it during congestion.
Although these are blunt instruments, they at least represent a feasible scenario for the future Internet where Not-ECT traffic co-exists with re-ECN traffic, but as a severely hobbled under-class. We recommend (Section 7.1) that while accommodating a smooth initial transition to re-ECN, policing policies should gradually be tightened to rate limit Not-ECT traffic more strictly in the longer term.

o  When checking whether a flow is balancing positive markings with congestion marking, re-ECN can only account for congestion marking, not drops. So, whenever a sender experiences drop, it does not have to re-echo the congestion event. Nonetheless, it is hardly any advantage to be able to send faster than other flows only if your traffic is dropped and the other traffic isn't.

o  We are considering the issue of whether it would be useful to truncate rather than drop packets that appear to be malicious, so that the feedback loop is not broken but useful data can be removed.

7.  Incremental Deployment

7.1.  Incremental Deployment Features

The design of the re-ECN protocol started from the fact that the current ECN marking behaviour of routers was sufficient and that re-feedback could be introduced around these routers by changing the sender behaviour but not the routers. Otherwise, if we had required routers to be changed, the chance of encountering a path that had every router upgraded would be vanishingly small during early deployment, giving no incentive to start deployment. Also, as there is no new forwarding behaviour, routers and hosts do not have to signal or negotiate anything.

However, networks that choose to protect themselves using re-ECN do have to add new security functions at their trust boundaries with others. They distinguish legacy traffic by its ECN field.
Traffic from Not-ECT transports is distinguishable by its Not-RECT marking. Traffic from legacy ECN transports is distinguished from re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1) for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on either 50% (the nonce) or 100% (the default) of packets, whereas re-ECN does not use ECT(0) at all. We can use this distinguishing feature of legacy ECN traffic to separate it out for different treatment at the various border security functions: egress dropping, ingress policing and border policing.

The general principle we adopt is that an egress dropper will not drop any legacy traffic, but ingress and border policers will limit the bulk rate of legacy traffic that can enter each network. Then, during early re-ECN deployment, operators can set very permissive (or non-existent) rate-limits on legacy traffic, but once re-ECN implementations are generally available, legacy traffic can be rate-limited increasingly harshly. Ultimately, an operator might choose to block all legacy traffic entering its network, or at least only allow through a trickle.

The more strictly the limits are set, the more legacy ECN sources will gain by upgrading to re-ECN. Thus, towards the end of the voluntary incremental deployment period, legacy transports can be given progressively stronger encouragement to upgrade.
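The separation of legacy from re-ECN traffic described above can be illustrated with a minimal sketch. It is not part of the protocol specification. The names classify, LegacyBulkLimiter and ingress_forward are hypothetical, the token bucket is merely one convenient way to express a bulk rate limit (this document does not mandate a limiter design), and the sketch assumes that FNE is encoded as Not-ECT with the RE flag set.

```python
from dataclasses import dataclass, field

# Two-bit ECN field values as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def classify(ecn_field: int, re_flag: bool) -> str:
    """Sort a packet into 're-ecn', 'legacy-ecn', 'legacy-not-ect' or 'ambiguous-ce'."""
    if ecn_field == ECT0:
        return "legacy-ecn"      # re-ECN does not use ECT(0) at all
    if ecn_field == ECT1:
        return "re-ecn"          # ECT(1) is the re-ECN default
    if ecn_field == NOT_ECT:
        # Assumption: FNE is Not-ECT with the RE flag set; Not-RECT has it clear.
        return "re-ecn" if re_flag else "legacy-not-ect"
    return "ambiguous-ce"        # CE by itself does not reveal the sender type


@dataclass
class LegacyBulkLimiter:
    """Hypothetical token bucket shared by all legacy traffic on one interface."""
    rate_bytes_per_s: float      # operators would lower this as re-ECN spreads
    burst_bytes: float
    tokens: float = field(init=False, default=0.0)
    last_time: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst_bytes   # start with a full bucket

    def allow(self, now: float, size_bytes: int) -> bool:
        elapsed = max(0.0, now - self.last_time)
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now
        if size_bytes > self.tokens:
            return False                 # over the bulk limit: drop
        self.tokens -= size_bytes
        return True


def ingress_forward(limiter: LegacyBulkLimiter, now: float,
                    ecn_field: int, re_flag: bool, size_bytes: int) -> bool:
    """Rate-limit legacy packets in bulk; leave other packets to the re-ECN policer."""
    if classify(ecn_field, re_flag).startswith("legacy"):
        return limiter.allow(now, size_bytes)
    return True
```

In this sketch, tightening policy over the deployment period amounts to lowering rate_bytes_per_s. A packet that arrives already CE-marked is left unclassified, because the ECN field alone no longer shows which ECT codepoint the sender used.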
The following list of minor changes brings together all the points where re-ECN semantics for use of the two-bit ECN field differ from RFC3168:

o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender sets ECT(0) by default (Section 3.3);

o  No provision is necessary for a re-ECN capable source transport to use the ECN nonce (Section 4.1.2.1);

o  Routers MAY preferentially drop different extended ECN codepoints (Section 5.3);

o  Packets carrying the feedback not established (FNE) codepoint MAY be marked rather than dropped by routers, even though their ECN field is Not-ECT (with the important caveat in Section 5.3);

o  Packets may be dropped by policing nodes because of apparent misbehaviour, not just because of congestion (Section 6);

o  Tunnel entry behaviour is still to be defined, but may have to be different from RFC3168 (Section 5.6).

None of these changes requires any modifications to routers. Also, none of these changes affects anything about end to end congestion control; they are all to do with allowing networks to police that end to end congestion control is well-behaved.

7.2.  Incremental Deployment Incentives

It would only be worth standardising the re-ECN protocol if there existed a coherent story for how it might be incrementally deployed. In order for it to have a chance of deployment, everyone who needs to act must have a strong incentive to act, and the incentives must arise in the order that deployment would have to happen. Re-ECN works around unmodified ECN routers, but we can't just discuss why and how re-ECN deployment might build on ECN deployment, because there is precious little to build on in the first place. Instead, we aim to show that re-ECN deployment could carry ECN with it.
We focus on commercial deployment incentives, although some of the arguments apply equally to academic or government sectors.

ECN deployment:

   ECN is largely implemented in commercial routers, but generally not as a supported feature, and it has largely not been deployed by commercial network operators. It has been released in many Unix-based operating systems, but not in proprietary OSs like Windows or those in many mobile devices. For detailed deployment status, see [ECN-Deploy]. We believe the reason ECN deployment has not happened is twofold:

   *  ECN requires changes to both routers and hosts. If someone wanted to sell the improvement that ECN offers, they would have to co-ordinate deployment of their product with others. An ECN server only gives any improvement on an ECN network. An ECN network only gives any improvement if used by ECN devices. Deployment that requires co-ordination adds cost and delay and tends to dilute any competitive advantage that might be gained.

   *  ECN `only' gives a performance improvement. Making a product a bit faster (whether the product is a device or a network) isn't usually a sufficient selling point to be worth the cost of co-ordinating across the industry to deploy it. Network operators tend to avoid re-configuring a working network unless launching a new product.

ECN and re-ECN for Edge-to-edge Assured QoS:

   We believe the proposal to provide assured QoS sessions using a form of ECN called pre-congestion notification (PCN) [CL-deploy] is most likely to break the deadlock in ECN deployment first. It only requires edge-to-edge deployment so it does not require endpoint support. It can be deployed in a single network, then grow incrementally to interconnected networks.
   And it provides a different `product' (internetworked assured QoS), rather than merely making an existing product a bit faster.

   Not only could this assured QoS application kick-start ECN deployment, it could also carry re-ECN deployment with it, because re-ECN can enable the assured QoS region to expand to a large internetwork where neighbouring networks do not trust each other. [Re-PCN] argues that re-ECN security should be built in to the QoS system from the start, explaining why and how.

   If ECN and re-ECN were deployed edge-to-edge for assured QoS, operators would gain valuable experience. They would also clear away many technical obstacles such as firewall configurations that block all but the legacy settings of the ECN field and the RE flag.

ECN in Access Networks:

   The next obstacle to ECN deployment would be extension to access and backhaul networks, where considerable link layer differences make implementation non-trivial, particularly on congested wireless links. ECN and re-ECN work fine during partial deployment, but they will not be very useful if the most congested elements in networks are the last to support them. Access network support is one of the weakest parts of this deployment story. All we can hope is that, once the benefits of ECN are better understood by operators, they will push for the necessary link layer implementations as deployment proceeds.

Policing Unresponsive Flows:

   Re-ECN allows a network to offer differentiated quality of service as explained in Section 6.2.2. But we do not believe this will motivate initial deployment of re-ECN, because the industry is already set on alternative ways of doing QoS. Despite being much more complicated and expensive, the alternative approaches are here and now.

   But re-ECN is critical to QoS deployment in another respect.
   It can be used to prevent applications from taking whatever bandwidth they choose without asking.

   Currently, applications that remain resolute in their lack of response to congestion are rewarded by other TCP applications. In other words, TCP is naively friendly, in that it reduces its rate in response to congestion whether it is competing with friends (other TCPs) or with enemies (unresponsive applications).

   Therefore, those network owners that want to sell QoS will be keen to ensure that their users can't help themselves to QoS for free. Given the very large revenues at stake, we believe effective policing of congestion response will become highly sought after by network owners.

   But this does not necessarily argue for re-ECN deployment. Network owners might choose to deploy bottleneck policers rather than re-ECN-based policing. However, under Related Work (Section 9) we argue that bottleneck policers are inherently vulnerable to circumvention.

   Therefore we believe there will be a strong demand from network owners for re-ECN deployment so they can police flows that do not ask to be unresponsive to congestion, in order to protect their revenues from flows that do ask (QoS). In particular, we suspect that the operators of cellular networks will want to prevent VoIP and video applications being used freely on their networks as a more open market develops in GPRS and 3G devices.

   Initial deployments are likely to be isolated to single cellular networks. Cellular operators would first place requirements on device manufacturers to include re-ECN in the standards for mobile devices. In parallel, they would put out tenders for ingress and egress policers.
   Then, after a while they would start to tighten rate limits on Not-ECT traffic from non-standard devices and they would start policing whatever non-accredited applications people might install on mobile devices with re-ECN support in the operating system. This would force even independent mobile device manufacturers to provide re-ECN support. Early standardisation across the cellular operators is likely, including interconnection agreements with penalties for excess downstream congestion.

   We suspect some fixed broadband networks (whether cable or DSL) would follow a similar path. However, we also believe that larger parts of the fixed Internet would not choose to police on a per-flow basis. Some might choose to police congestion on a per-user basis in order to manage heavy peer-to-peer file-sharing, but it seems likely that a sizeable majority would not deploy any form of policing.

   This hybrid situation raises the question, "How does re-ECN work for networks that choose to use policing if they connect with others that don't?" Traffic from non-ECN capable sources will arrive from other networks and cause congestion within the policed, ECN-capable networks. So networks that chose to police congestion would rate-limit Not-ECT traffic throughout their network, particularly at their borders. They would probably also set higher usage prices in their interconnection contracts for incoming Not-ECT and Not-RECT traffic. We assume that interconnection contracts between networks in the same tier will include congestion penalties before contracts with provider backbones do.

   A hybrid situation could remain for all time. As was explained in the introduction, we believe in healthy competition between policing and not policing, with no imperative to convert the whole world to the religion of policing.
   Networks that chose not to deploy egress droppers would leave themselves open to being congested by senders in other networks. But that would be their choice.

   The important aspect of the egress dropper though is that it most protects the network that deploys it. If a network does not deploy an egress dropper, sources sending into it from other networks will be able to understate the congestion they are causing. Whereas, if a network deploys an egress dropper, it can know how much congestion other networks are dumping into it, and apply penalties or charges accordingly. So, whether or not a network polices its own sources at ingress, it is in its interests to deploy an egress dropper.

Host support:

   In the above deployment scenario, host operating system support for re-ECN came about through the cellular operators demanding it in device standards (i.e. 3GPP). Of course, increasingly, mobile devices are being built to support multiple wireless technologies. So, if re-ECN were stipulated for cellular devices, it would automatically appear in those devices connected to the wireless fringes of fixed networks if they coupled cellular with WiFi or Bluetooth technology, for instance. Also, once implemented in the operating system of one mobile device, it would tend to be found in other devices using the same family of operating system.

   Therefore, whether or not a fixed network deployed ECN, or deployed re-ECN policers and droppers, many of its hosts might well be using re-ECN over it. Indeed, they would be at an advantage when communicating with hosts across re-ECN policed networks that rate limited Not-RECT traffic.

Other possible scenarios:

   The above is thankfully not the only plausible scenario we can think of.
   One of the many clubs of operators that meet regularly around the world might decide to act together to persuade a major operating system manufacturer to implement re-ECN. And they may agree between them on an interconnection model that includes congestion penalties.

   Re-ECN provides an interesting opportunity for device manufacturers as well as network operators. Policers can be configured loosely when first deployed. Then as re-ECN take-up increases, they can be tightened up, so that a network with re-ECN deployed can gradually squeeze down the service provided to legacy devices that have not upgraded to re-ECN. Many device vendors rely on replacement sales. And operating system companies rely heavily on new release sales. Also support services would like to be able to force stragglers to upgrade. So, the ability to throttle service to legacy operating systems is quite valuable.

   Also, policing unresponsive sources may not be the only or even the first application that drives deployment. It may be policing causes of heavy congestion (e.g. peer-to-peer file-sharing). Or it may be mitigation of denial of service. Or we may be wrong in thinking simpler QoS will not be the initial motivation for re-ECN deployment. Indeed, the combined pressure for all these may be the motivator, but it seems optimistic to expect such a level of joined-up thinking from today's communications industry. We believe a single application alone must be a sufficient motivator.

   In short, everyone gains from adding accountability to TCP/IP, except the selfish or malicious. So, deployment incentives tend to be strong.

8.  Architectural Rationale

In the Internet's technical community, the danger of not responding to congestion is well-understood, as well as its attendant risk of congestion collapse [RFC3714].
However, one side of the Internet's commercial community considers that the very essence of IP is to provide open access to the internetwork for all applications. They see congestion as a symptom of over-conservative investment, and rely on revising application designs to find novel ways to keep applications working despite congestion. They argue that the Internet was never intended to be solely for TCP-friendly applications. Meanwhile, another side of the Internet's commercial community believes that it is worthwhile providing a network for novel applications only if it has sufficient capacity, which can happen only if a greater share of application revenues can be /assured/ for the infrastructure provider. Otherwise the major investments required would carry too much risk and wouldn't happen.

The lesson articulated in [Tussle] is that we shouldn't embed our view on these arguments into the Internet at design time. Instead we should design the Internet so that the outcome of these arguments can get decided at run-time. Re-ECN is designed in that spirit. Once the protocol is available, different network operators can choose how liberal they want to be in holding people accountable for the congestion they cause. Some might boldly invest in capacity and not police its use at all, hoping that novel applications will result. Others might use re-ECN for fine-grained flow policing, expecting to make money selling vertically integrated services. Yet others might sit somewhere half-way, perhaps doing coarse, per-user policing. All might change their minds later. But re-ECN always allows them to interconnect so that the careful ones can protect themselves from the liberal ones.
The incentive-based approach used for re-ECN is based on Gibbens and Kelly's arguments [Evol_cc] on allowing endpoints the freedom to evolve new congestion control algorithms for new applications. They ensured responsible behaviour despite everyone's self-interest by applying pricing to ECN marking, and Kelly had proved stability and optimality in an earlier paper.

Re-ECN keeps all the underlying economic incentives, but rearranges the feedback. The idea is to allow a network operator (if it chooses) to deploy engineering mechanisms like policers at the front of the network which can be designed to behave /as if/ they are responding to congestion prices. Rather than having to subject users to congestion pricing, networks can then use more traditional charging regimes (or novel ones). But the engineering can constrain the overall amount of congestion a user can cause. This provides a buffer against completely outrageous congestion control, but still makes it easy for novel applications to evolve if they need different congestion control to the norms. It also allows novel charging regimes to evolve.

Despite being achieved with a relatively minor protocol change, re-ECN is an architectural change. Previously, Internet congestion could only be controlled by the data sender, because it was the only one both in a position to control the load and in a position to see information on congestion. Re-ECN levels the playing field. It recognises that the network also has a role to play in moderating (policing) congestion control. But policing is only truly effective at the first ingress into an internetwork, whereas path congestion was previously only visible at the last egress. So, re-ECN democratises congestion information.
Then the choice over who actually controls congestion can be made at run-time, not design time---a bit like an aircraft with dual controls. And different operators can make different choices. We believe non-architectural approaches to this problem are unlikely to offer more than partial solutions (see Section 9).

Importantly, re-ECN does not require assumptions about specific congestion responses to be embedded in any network elements, except at the first ingress to the internetwork if that level of control is desired by the ingress operator. But such tight policing will be a matter of agreement between the source and its access network operator. The ingress operator need not police congestion response at flow granularity; it can simply hold a source responsible for the aggregate congestion it causes, perhaps keeping it within a monthly congestion quota. Or if the ingress network trusts the source, it can do nothing.

Therefore, the aim of the re-ECN protocol is NOT solely to police TCP-friendliness. Re-ECN preserves IP as a generic network layer for all sorts of responses to congestion, for all sorts of transports. Re-ECN merely ensures truthful downstream congestion information is available in the network layer for all sorts of accountability applications.

The end to end design principle does not say that all functions should be moved out of the lower layers---only those functions that are not generic to all higher layers. Re-ECN adds a function to the network layer that is generic, but was omitted: accountability for causing congestion. Accountability is not something that an end-user can provide to themselves. We believe re-ECN adds no more than is sufficient to hold each flow accountable, even if it consists of a single datagram.
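The coarse, aggregate form of ingress policing mentioned above (holding a source to a monthly congestion quota rather than policing each flow) can be sketched as follows. This is only an illustration under stated assumptions: the class CongestionQuota and its methods are hypothetical, congestion volume is taken to be the number of bytes a user sends in positively marked packets (worth +1), and what the operator does once the quota is exhausted is left as local policy.

```python
from collections import defaultdict


class CongestionQuota:
    """Hypothetical per-user accounting of declared congestion volume."""

    def __init__(self, quota_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.used = defaultdict(int)   # user id -> bytes of positive packets

    def account(self, user: str, size_bytes: int, worth: int) -> bool:
        """Record one packet; return True while the user is within quota."""
        if worth > 0:                  # only positive packets declare congestion
            self.used[user] += size_bytes
        return self.used[user] <= self.quota_bytes

    def start_new_period(self) -> None:
        """Reset all counters, e.g. at the start of each month."""
        self.used.clear()
```

Because neutral packets do not consume quota in this sketch, a user sending over uncongested paths is never restricted, however much volume is transferred.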
"Accountability" implies being able to identify who is responsible for causing congestion. However, at the network layer it would NOT be useful to identify the cause of congestion by adding individual or organisational identity information, NOR by using source IP addresses. Rather than bringing identity information to the point of congestion, we bring downstream congestion information to the point where the cause can be most easily identified and dealt with. That is, at any trust boundary congestion can be associated with the physically connected upstream neighbour that is directly responsible for causing it (whether intentionally or not). A trust boundary interface is exactly the place to police or throttle in order to directly mitigate congestion, rather than having to trace the (ir)responsible party in order to shut them down.

Some considered that ECN itself was a layering violation. The reasoning went that the interface to a layer should provide a service to the higher layer and hide how the lower layer does it. However, ECN reveals the state of the network layer and below to the transport layer. A more positive way to describe ECN is that it is like the return value of a function call to the network layer. It explicitly returns the status of the request to deliver a packet, by returning a value representing the current risk that a packet will not be served. Re-ECN has similar semantics, except the transport layer must try to guess the return value, then it can use the actual return value from the network layer to modify the next guess.

9.  Related Work

{Due to lack of time, this section is incomplete. The reader is referred to the Related Work section of [Re-fb] for a brief selection of related ideas.}

9.1.  Policing Rate Response to Congestion

ATM network elements send congestion back-pressure messages [ITU-T.I.371] along each connection, duplicating any end to end feedback because they don't trust it. On the other hand, re-ECN ensures information in forwarded packets can be used for congestion management without requiring a connection-oriented architecture, and it re-uses the overhead of fields that are already set aside for end to end congestion control (and routing loop detection in the case of re-TTL in Appendix F).

We borrowed ideas from policers in the literature [pBox], [XCHOKe], AFD etc. for our rate equation policer. However, without the benefit of re-ECN they don't police the correct rate for the condition of their path. They detect unusually high /absolute/ rates, but only while the policer itself is congested, because they work by detecting prevalent flows in the discards from the local RED queue. These policers must sit at every potential bottleneck, whereas our policer need only be located at each ingress to the internetwork. As Floyd & Fall explain [pBox], the limitation of their approach is that a high sending rate might be perfectly legitimate, if the rest of the path is uncongested or the round trip time is short. Commercially available rate policers cap the rate of any one flow. Or they enforce monthly volume caps in an attempt to control high volume file-sharing. They limit the value a customer derives. They might also limit the congestion customers can cause, but only as an accidental side-effect. They actually punish traffic that fills troughs as much as traffic that causes peaks in utilisation. In practice network operators need to be able to allocate service by cost during congestion, and by value at other times.

9.2.  Congestion Notification Integrity

The choice of two ECT code-points in the ECN field [RFC3168] permitted future flexibility, optionally allowing the sender to encode the experimental ECN nonce [RFC3540] in the packet stream. This mechanism has since been included in the specifications of DCCP [RFC4340].

The ECN nonce is an elegant scheme that allows the sender to detect if someone in the feedback loop - the receiver especially - tries to claim no congestion was experienced when in fact congestion led to packet drops or ECN marks. For each packet it sends, the sender chooses between the two ECT codepoints in a pseudo-random sequence. Then, whenever the network marks a packet with CE, if the receiver wants to deny congestion happened, she has to guess which ECT codepoint was overwritten. She has only a 50:50 chance of being correct each time she denies a congestion mark or a drop, which ultimately will give her away.

The purpose of a network-layer nonce has to be the protection of the network in the first place, while a transport-layer nonce is better used to protect the sender from cheating receivers. Now, the assumption behind the ECN nonce is that a sender will want to detect whether a receiver is suppressing congestion feedback. This is only true if the sender's interests are aligned with the network's, or with the community of users as a whole. This may be true for certain large senders, who are under close scrutiny and have a reputation to maintain. But we have to deal with a more hostile world, where traffic may be dominated by peer-to-peer transfers, rather than downloads from a few popular sites. Often the `natural' self-interest of a sender is not aligned with the interests of other users. The sender often wishes to transfer data quickly just as much as the receiver wants to receive it quickly.
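The 50:50 guessing argument made for the ECN nonce earlier in this section can be made concrete with a small sketch. It deliberately simplifies the nonce of [RFC3540] to one independent guess per denied congestion event, and the function names are hypothetical. Under that assumption, a receiver that denies n congestion events escapes detection with probability (1/2)^n.

```python
import random


def evasion_probability(denials: int) -> float:
    """Chance that a cheating receiver guesses every overwritten codepoint."""
    return 0.5 ** denials


def survives(denials: int, rng: random.Random) -> bool:
    """Simulate one cheating receiver; True if every guess was correct."""
    for _ in range(denials):
        sent = rng.choice(("ECT(0)", "ECT(1)"))   # sender's pseudo-random choice
        guess = rng.choice(("ECT(0)", "ECT(1)"))  # CE has overwritten it, so guess
        if guess != sent:
            return False                          # a wrong guess gives her away
    return True
```

Even ten denied congestion events leave the receiver with less than a 0.1% chance of going unnoticed, which is why persistent suppression of feedback is detectable by a sender that cares to look.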

   In contrast, the re-ECN protocol enables policing of an agreed
   rate-response to congestion (e.g. TCP-friendliness) at the sender's
   interface with the internetwork.  It also ensures downstream networks
   can police their upstream neighbours, to encourage them to police
   their users in turn.  But most importantly, it requires the sender to
   declare path congestion to the network and it can remove traffic at
   the egress if this declaration is dishonest.  So it can police
   correctly, irrespective of whether the receiver tries to suppress
   congestion feedback or whether the sender ignores genuine congestion
   feedback.  Therefore the re-ECN protocol addresses a much wider range
   of cheating problems, which includes the one addressed by the ECN
   nonce.

9.3.  Identifying Upstream and Downstream Congestion

   Purple [Purple] proposes that routers should use the CWR flag in the
   TCP header of ECN-capable flows to work out path congestion and
   therefore downstream congestion in a similar way to re-ECN.  However,
   because CWR is in the transport layer, it is not always visible to
   network layer routers and policers.  Purple's motivation was to
   improve AQM, not policing.  But, of course, nodes trying to avoid a
   policer would not be expected to allow CWR to be visible.

10.  Security Considerations

   This whole memo concerns the deployment of a secure congestion
   control framework.  However, below we list some specific security
   issues that we are still working on:

   o  Malicious users have the ability to launch dynamically changing
      attacks, exploiting the time it takes to detect an attack, given
      ECN marking is binary.  We are concentrating on subtle
      interactions between the ingress policer and the egress dropper in
      an effort to make it impossible to game the system.

   o  There is an inherent need for at least some flow state at the
      egress dropper given the binary marking environment, which leads
      to an apparent vulnerability to state exhaustion attacks.  An
      egress dropper design with bounded flow state is in write-up.

   o  A malicious source can spoof another user's address and send
      negative traffic to the same destination in order to fool the
      dropper into sanctioning the other user's flow.  To prevent or
      mitigate these two different kinds of DoS attack, against the
      dropper and against given flows, we are considering various
      protection mechanisms.  Section 5.5.1 discusses one of these.

   o  A malicious client can send requests using a spoofed source
      address to a server (such as a DNS server) that tends to respond
      with single packet responses.  This server will then be tricked
      into having to set FNE on the first (and only) packet of all these
      wasted responses.  Given packets marked FNE are worth +1, this
      will cause such servers to consume more of their allowance to
      cause congestion than they would wish to.  In general, re-ECN is
      deliberately designed so that single packet flows have to bear the
      cost of not discovering the congestion state of their path.  One
      of the reasons for introducing re-ECN is to encourage short flows
      to make use of previous path knowledge by moving the cost of this
      lack of knowledge to sources that create short flows.  Therefore,
      in the long run we might expect services like DNS to aggregate
      single packet flows into connections where it brings benefits.
      However, this attack where DNS requests are made from spoofed
      addresses genuinely forces the server to waste its resources.  The
      only mitigating feature is that the attacker has to set FNE on
      each of its requests if they are to get through an egress dropper
      to a DNS server.
      The attacker therefore has to consume as many resources as the
      victim, which at least implies re-ECN does not unwittingly amplify
      this attack.

   Having highlighted outstanding security issues, we now explain the
   design decisions that were taken based on a security-related
   rationale.  It may seem that the six codepoints of the eight made
   available by extending the ECN field with the RE flag have been used
   rather wastefully to encode just five states.  In effect the RE flag
   has been used as an orthogonal single bit, using up four codepoints
   to encode the three states of positive, neutral and negative worth.
   The mapping of the codepoints in an earlier version of this proposal
   used the codepoint space more efficiently, but the scheme became
   vulnerable to network operators bypassing congestion penalties by
   focusing congestion marking on positive packets.  Appendix B explains
   why fixing that problem, while allowing for incremental deployment,
   would have used another codepoint anyway.  So it was better to use
   this orthogonal encoding scheme, which greatly simplified the whole
   protocol and brought with it some subtle security benefits.

   With the scheme as now proposed, once the RE flag is set or cleared
   by the sender or its proxy, it should not be written by the network,
   only read.  So the gateways can detect if any network maliciously
   alters the RE flag.  IPSec AH integrity checking does not cover the
   IPv4 option flags (they were considered mutable---even the one we
   propose using for the RE flag that was `currently unused' when IPSec
   was defined).  But it would be sufficient for a pair of gateways to
   make random checks on whether the RE flag was the same when it
   reached the egress gateway as when it left the ingress.
   Indeed, if IPSec AH had covered the RE flag, any network intending to
   alter sufficient RE flags to make a gain would have focused its
   alterations on packets without authenticating headers (AHs).

   The security of re-ECN has been deliberately designed to not rely on
   cryptography.

11.  IANA Considerations

   This memo includes no request to IANA (yet).

   If this memo were to progress to standards track, it would list:

   o  The new RE flag in IPv4 (Section 5.1) and its extension with the
      ECN field to create a new set of extended ECN (EECN) codepoints;

   o  The definition of the EECN codepoints for default Diffserv PHBs
      (Section 3.2);

   o  The new extension header for IPv6 (Section 5.2);

   o  The new combinations of flags in the TCP header for capability
      negotiation (Section 4.1.3);

   o  The new ICMP message type (Section 5.5.1).

12.  Conclusions

   {ToDo:}

13.  Acknowledgements

   Sebastien Cazalet and Andrea Soppera contributed to the idea of
   re-feedback.  All the following have given helpful comments: Andrea
   Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
   Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
   John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
   Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
   (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
   Handley (who developed the attack with canceled packets), Adam
   Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
   (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
   complemented our own dummy traffic attacks with others), Liz Maida
   (MIT), and comments from participants in the CRN/CFP Broadband and
   DoS-resistant Internet working groups.

14.  Comments Solicited

   Comments and questions are encouraged and very welcome.
   They can be addressed to the IETF Transport Area working group's
   mailing list, and/or to the authors.

15.  References

15.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
              S., Wroclawski, J., and L. Zhang, "Recommendations on
              Queue Management and Congestion Avoidance in the
              Internet", RFC 2309, April 1998.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
              Zhang, L., and V. Paxson, "Stream Control Transmission
              Protocol", RFC 2960, October 2000.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
              Initial Window", RFC 3390, October 2002.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [RFC4341]  Floyd, S. and E. Kohler, "Profile for Datagram Congestion
              Control Protocol (DCCP) Congestion Control ID 2: TCP-like
              Congestion Control", RFC 4341, March 2006.

   [RFC4342]  Floyd, S., Kohler, E., and J. Padhye, "Profile for
              Datagram Congestion Control Protocol (DCCP) Congestion
              Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342,
              March 2006.

15.2.  Informative References

   [ARI05]    Adams, J., Roberts, L., and A.
              IJsselmuiden, "Changing the Internet to Support Real-Time
              Content Supply from a Large Fraction of Broadband
              Residential Users", BT Technology Journal (BTTJ) 23(2),
              April 2005.

   [Bauer06]  Bauer, S., Faratin, P., and R. Beverly, "Assessing the
              assumptions underlying mechanism design for the Internet",
              Proc. Workshop on the Economics of Networked Systems
              (NetEcon06), June 2006.

   [CL-deploy]
              Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Babiarz, J., Chan, K., Westberg, L., Bader,
              A., and G. Karagiannis, "A Deployment Model for Admission
              Control over DiffServ using Pre-Congestion Notification",
              draft-briscoe-tsvwg-cl-architecture-03 (work in progress),
              June 2006.

   [CLoop_pol]
              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
              Torino and Institut Eurecom Masters Thesis,
              September 2005.

   [ECN-Deploy]
              Floyd, S., "ECN (Explicit Congestion Notification) in
              TCP/IP; Implementation and Deployment of ECN", Web-page,
              May 2004.

   [ECN-MPLS]
              Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
              Marking in MPLS", draft-davie-ecn-mpls-00 (work in
              progress), June 2006.

   [Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the
              evolution of congestion control", Automatica
              35(12)1969--1985, December 1999.

   [I-D.ietf-tsvwg-ecnsyn]
              Kuzmanovic, A., "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets",
              draft-ietf-tsvwg-ecnsyn-00 (work in progress),
              November 2005.

   [ITU-T.I.371]
              ITU-T, "Traffic Control and Congestion Control in
              {B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004.

   [Jiang02]  Jiang, H. and C. Dovrolis, "Passive Estimation of TCP
              Round-Trip Times", ACM SIGCOMM CCR 32(3)75--88,
              July 2002.

   [Mathis97]
              Mathis, M., Semke, J., Mahdavi, J., and T.
              Ott, "The Macroscopic Behavior of the TCP Congestion
              Avoidance Algorithm", ACM SIGCOMM CCR 27(3)67--82,
              July 1997.

   [Purple]   Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE:
              Predictive Active Queue Management Utilizing Congestion
              Information", Proc. Local Computer Networks (LCN 2003),
              October 2003.

   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
              ReSerVation Protocol (RSVP) Version 1 Applicability
              Statement Some Guidelines on Deployment", RFC 2208,
              September 1997.

   [RFC2402]  Kent, S. and R. Atkinson, "IP Authentication Header",
              RFC 2402, November 1998.

   [RFC2406]  Kent, S. and R. Atkinson, "IP Encapsulating Security
              Payload (ESP)", RFC 2406, November 1998.

   [RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z.,
              and W. Weiss, "An Architecture for Differentiated
              Services", RFC 2475, December 1998.

   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
              Timer", RFC 2988, November 2000.

   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
              RFC 3124, June 2001.

   [RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header",
              RFC 3514, April 2003.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, June 2003.

   [RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion
              Control for Voice Traffic in the Internet", RFC 3714,
              March 2004.

   [Re-PCN]   Briscoe, B., "Emulating Border Flow Policing using Re-ECN
              on Bulk Data", draft-briscoe-tsvwg-re-ecn-border-cheat-01
              (work in progress), March 2006.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M.
              Koyabe, "Policing Congestion Response in an Internetwork
              Using Re-Feedback", ACM SIGCOMM CCR 35(4)277--288,
              August 2005.

   [Smart_rtg]
              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
              "Optimizing Cost and Performance for Multihoming", ACM
              SIGCOMM CCR 34(4)79--92, October 2004.

   [Steps_DoS]
              Handley, M. and A. Greenhalgh, "Steps towards a DoS-
              resistant Internet Architecture", Proc. ACM SIGCOMM
              workshop on Future directions in network architecture
              (FDNA'04) pp 49--56, August 2004.

   [Tussle]   Clark, D., Sollins, K., Wroclawski, J., and R. Braden,
              "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM
              SIGCOMM CCR 32(4)347--356, October 2002.

   [XCHOKe]   Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A.,
              Saran, H., and R. Shorey, "XCHOKe: Malicious Source
              Control for Congestion Avoidance at Internet Gateways",
              Proceedings of IEEE International Conference on Network
              Protocols (ICNP-02), November 2002.

   [pBox]     Floyd, S. and K. Fall, "Promoting the Use of End-to-End
              Congestion Control in the Internet", IEEE/ACM Transactions
              on Networking 7(4) 458--472, August 1999.

Appendix A.  Precise Re-ECN Protocol Operation

   {ToDo: fix this}

   The protocol operation described in Section 3.3 was an approximation.
   In fact, standard ECN router marking combines 1% and 2% marking into
   slightly less than 3% whole-path marking, because routers
   deliberately mark CE whether or not it has already been marked by
   another router upstream.  So the combined marking fraction would
   actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

   To generalise this we will need some notation.

   o  j represents the index of each resource (typically queues) along a
      path, ranging from 0 at the first router to n-1 at the last.

   o  m_j represents the fraction of octets *m*arked CE by a particular
      router (whether or not they are already marked) because of
      congestion of resource j.

   o  u_j represents congestion *u*pstream of resource j, being the
      fraction of CE marking in arriving packet headers (before
      marking).

   o  p_j represents *p*ath congestion, being the fraction of packets
      arriving at resource j with the RE flag blanked (excluding
      Not-RECT packets).

   o  v_j denotes expected congestion downstream of resource j, which
      can be thought of as a *v*irtual marking fraction, being derived
      from two other marking fractions.

   Observed fractions of each particular codepoint (u, p and v) and
   router marking rate m are dimensionless fractions, being the ratio of
   two data volumes (marked and total) over a monitoring period.  All
   measurements are in terms of octets, not packets, assuming that line
   resources are more congestible than packet processing.

   The path congestion (RE blanking fraction) set by the sender should
   reflect the upstream congestion (CE marking fraction) fed back from
   the destination.  Therefore in the steady state

      p_0 = u_n
          = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

   Similarly, at some point j in the middle of the network, if
   p = 1 - (1 - u_j)(1 - v_j), then

      v_j  = 1 - (1 - p)/(1 - u_j)

          ~= p - u_j;     if u_j << 100%

   So, between the two routers in the example in Section 3.3, congestion
   downstream is

      v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
          = 2.00%,

   or a useful approximation of downstream congestion is

      v_1 ~= 2.98% - 1.00%
          ~= 1.98%.

Appendix B.  Justification for Two Codepoints Signifying Zero Worth
             Packets

   It may seem a waste of a codepoint to set aside two codepoints of the
   Extended ECN field to signify zero worth (RECT and CE(0) are both
   worth zero).  The justification is subtle, but worth recording.

   The original version of re-ECN ([Re-fb] and draft-00 of this memo)
   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
   negative (CE) packets.  The sender set packets to neutral unless
   re-echoing congestion, when it set them positive, in much the same
   way that it blanks the RE flag in the current protocol.  However,
   routers were meant to mark congestion by setting packets negative
   (CE) irrespective of whether they had previously been neutral or
   positive.

   However, we did not arrange for senders to remember which packet had
   been sent with which codepoint, or for feedback to say exactly which
   packets arrived with which codepoints.  The transport was meant to
   inflate the number of positive packets it sent to allow for a few
   being wiped out by congestion marking.  We (wrongly) assumed that
   routers would congestion mark packets indiscriminately, so the
   transport could infer how many positive packets had been marked and
   compensate accordingly by re-echoing.  But this created a perverse
   incentive for routers to preferentially congestion mark positive
   packets rather than neutral ones.

   We could have removed this perverse incentive by requiring re-ECN
   senders to remember which packets they had sent with which codepoint,
   and feedback from the receiver to identify which packets arrived as
   which.  Then, if a positive packet was congestion marked to negative,
   the sender could have re-echoed twice to maintain the balance between
   positive and negative at the receiver.

   Instead, we chose to make re-echoing congestion (blanking RE)
   orthogonal to congestion notification (marking CE), which required a
   second neutral codepoint (the orthogonal scheme forms the main square
   of four codepoints in Figure 2).
   Then the receiver would be able to detect and echo a congestion event
   even if it arrived on a packet that had originally been positive.

   If we had added extra complexity to the sender and receiver
   transports to track changes to individual packets, we could have made
   it work, but then routers would have had an incentive to mark
   positive packets with half the probability of neutral packets.  That
   in turn would have led router algorithms to become more complex.
   Then senders wouldn't know whether a mark had been introduced by a
   simple or a complex router algorithm.  That in turn would have
   required another codepoint to distinguish between legacy ECN and new
   re-ECN router marking.

   Once the cost of IP header codepoint real-estate was the same for
   both schemes, there was no doubt that the simpler option for
   endpoints and for routers should be chosen.  The resulting protocol
   also no longer needed the tricky inflation/deflation complexity of
   the original (broken) scheme.  It was also much simpler to understand
   conceptually.

   A further advantage of the new orthogonal four-codepoint scheme was
   that senders owned sole rights to change the RE flag and routers
   owned sole rights to change the ECN field.  Although we still arrange
   the incentives so neither party strays outside their dominion, these
   clear lines of authority simplify the matter.

   Finally, a little redundancy can be very powerful in a scheme such as
   this.  In one flow, the proportion of packets changed to CE should be
   the same as the proportion of RECT packets changed to CE(-1) and the
   proportion of Re-Echo packets changed to CE(0).  Double checking
   using such redundant relationships can improve the security of a
   scheme (cf. double-entry book-keeping or the ECN Nonce).
   Alternatively, it might be necessary to exploit the redundancy in the
   future to encode an extra information channel.

Appendix C.  ECN Compatibility

   The rationale for choosing the particular combinations of SYN and SYN
   ACK flags in Section 4.1.3 is as follows.

   Choice of SYN flags:  A re-ECN sender can work with vanilla ECN
      receivers so we wanted to use the same flags as would be used in
      an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same time,
      we wanted a server (host B) that is Re-ECT to be able to recognise
      that the client (A) is also Re-ECT.  We believe also setting NS=1
      in the initial SYN achieves both these objectives, as it should be
      ignored by vanilla ECT receivers and by ECT-Nonce receivers.  But
      senders that are not Re-ECT should not set NS=1.  At the time ECN
      was defined, the NS flag was not defined, so setting NS=1 should
      be ignored by existing ECT receivers (but testing against
      implementations may yet prove otherwise).  The ECN Nonce RFC
      [RFC3540] is silent on what the NS field might be set to in the
      TCP SYN, but we believe the intent was for a nonce client to set
      NS=0 in the initial SYN (again only testing will tell).  Therefore
      we define a Re-ECN-setup SYN as one with NS=1, CWR=1 & ECE=1.

   Choice of SYN ACK flags:  The client (A) needs to be able to
      determine whether the server (B) is Re-ECT.  The original ECN
      specification required an ECT server to respond to an ECN-setup
      SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There is no
      room to modify this by setting the NS flag, as that is already set
      in the SYN ACK of an ECT-Nonce server.  So we used the only
      combination of CWR and ECE that would not be used by existing TCP
      receivers: CWR=1 and ECE=0.  The original ECN specification
      defines this combination as a non-ECN-setup SYN ACK, which remains
      true for vanilla and Nonce ECTs.
      But for re-ECN we define it as a Re-ECN-setup SYN ACK.  We didn't
      use a SYN ACK with both CWR and ECE cleared to 0 because that
      would be the likely response from most Not-ECT receivers.  And we
      didn't use a SYN ACK with both CWR and ECE set to 1 either, as at
      least one broken receiver implementation echoes whatever flags
      were in the SYN into its SYN ACK.  Therefore we define a
      Re-ECN-setup SYN ACK as one with CWR=1 & ECE=0.

   Choice of two alternative SYN ACKs:  The NS flag may take either
      value in a Re-ECN-setup SYN ACK.  Section 5.4 requires that a
      Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK
      to echo congestion experienced (CE) on the initial SYN.  Otherwise
      a Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only
      current known use of the NS flag in a SYN ACK is to indicate
      support for the ECN nonce, which will be negotiated by setting
      CWR=0 & ECE=1.  Given the ECN nonce MUST NOT be used for a RECN
      mode connection, a Re-ECN-setup SYN ACK can use either setting of
      the NS flag without any risk of confusion, because the CWR & ECE
      flags will be reversed relative to those used by an ECN nonce SYN
      ACK.

Appendix D.  Packet Marking During Flow Start

   {ToDo: Write up proof that sender should mark FNE on first and third
   data packets, even with the largest allowed initial window.}

Appendix E.  Example Egress Dropper Algorithm

   {ToDo: Write up the basic algorithm with flow state, then the
   aggregated one.}

Appendix F.  Re-TTL

   This Appendix gives an overview of a proposal to overload the TTL
   field in the IP header to monitor downstream propagation delay.  It
   is planned to fully write up this proposal in a future Internet
   Draft.

   Delay re-feedback can be achieved by overloading the TTL field,
   without changing IP or router TTL processing.
   A target value for TTL at the destination would need standardising,
   say 16.  If the path hop count increased by more than 16 during a
   routing change, it would temporarily be mistaken for a routing loop,
   so this target would need to be chosen to exceed typical hop count
   increases.  The TCP wire protocol and handlers would need modifying
   to feed back the destination TTL and initialise it.  It would be
   necessary to standardise the unit of TTL in terms of real time (as
   was the original intent in the early days of the Internet).

   In the longer term, precision could be improved if routers
   decremented TTL to represent exact propagation delay to the next
   router.  That is, for a router to decrement TTL by, say, 1.8 time
   units it would alternate the decrement of every packet between 1 & 2
   at a ratio of 1:4.  Although this might sometimes require a seemingly
   dangerous null decrement, a packet in a loop would still decrement to
   zero after 255 time units on average.  As more routers were upgraded
   to this more accurate TTL decrement, path delay estimates would
   become increasingly accurate despite the presence of some legacy
   routers that continued to always decrement the TTL by 1.

Appendix G.  Policer Designs to ensure Congestion Responsiveness

G.1.  Per-user Policing

   User policing requires a policer on the ingress interface of the
   access router associated with the user.  At that point, the traffic
   of the user hasn't diverged on different routes yet; nor has it mixed
   with traffic from other sources.

   In order to ensure that a user doesn't generate more congestion in
   the network than her due share, a modified bulk token-bucket is
   maintained with the following parameters:

   o  b_0 the initial token level

   o  r the filling rate

   o  b_max the bucket depth

   The same token bucket algorithm is used as in many areas of
   networking, but how it is used is very different:

   o  all traffic from a user over the lifetime of their subscription is
      policed in the same token bucket.

   o  only positive and canceled packets (Re-Echo, FNE and CE(0))
      consume tokens.

   Such a policer will allow network operators to throttle the
   contribution of their users to network congestion.  This will require
   the appropriate contractual terms to be in place between operators
   and users.  For instance: a condition for a user to subscribe to a
   given network service may be that she should not cause more than a
   volume C_user of congestion over a reference period T_user, although
   she may carry forward up to N_user times her allowance at the end of
   each period.  These terms directly set the parameters of the user
   policer:

   o  b_0 = C_user

   o  r = C_user/T_user

   o  b_max = b_0 * (N_user + 1)

   Besides the congestion budget policer above, another user policer may
   be necessary to further rate-limit FNE packets, if they are to be
   marked rather than dropped (see discussion in Section 5.3).  Rate-
   limiting FNE packets will prevent high bursts of new flow arrivals,
   which is a very useful feature in DoS prevention.  A condition to
   subscribe to a given network service would have to be that a user
   should not generate more than C_FNE FNE packets, over a reference
   period T_FNE, with no option to carry forward any of the allowance at
   the end of each period.
   These terms directly set the parameters of the FNE policer:

   o  b_0 = C_FNE

   o  r = C_FNE/T_FNE

   o  b_max = b_0

   T_FNE should be a much shorter period than T_user: for instance T_FNE
   could be in the order of minutes while T_user could be in the order
   of weeks.

G.2.  Per-flow Rate Policing

   Per-flow policing aims to enforce congestion responsiveness on the
   shortest information timescale on a network path: packet roundtrips.

   This again requires that the appropriate terms be agreed between a
   network operator and its users, where a congestion responsiveness
   policy might be required for the use of a given network service
   (perhaps unless the user specifically requests otherwise).

   As an example, we describe below how a rate adaptation policer can be
   designed when the applicable rate adaptation policy is TCP-
   compliance.  In that context, the average throughput of a flow will
   be expected to be bounded by the value of the TCP throughput during
   congestion avoidance, given by Mathis' formula [Mathis97]

      x_TCP = k * s / ( T * sqrt(m) )

   where:

   o  x_TCP is the throughput of the TCP flow in octets per second (so
      that x_TCP/s is its rate in packets per second),

   o  k is a constant upper-bounded by sqrt(3/2),

   o  s is the average packet size of the flow,

   o  T is the roundtrip time of the flow,

   o  m is the congestion level experienced by the flow.

   We define the marking period N=1/m which represents the average
   number of packets between two positive or canceled packets.  Mathis'
   formula can be re-written as:

      x_TCP = k*s*sqrt(N)/T

   We can then get the average inter-mark time in a compliant TCP flow,
   dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

      dt_TCP = sqrt(N)*T/k

   We rely on this equation for the design of a rate-adaptation policer
   as a variation of a token bucket.  In that case a policer has to be
   set up for each policed flow.
   This may be triggered by FNE packets, with the remainder of flows
   being all rate limited together if they do not start with an FNE
   packet.

   Where maintaining per flow state is not a problem, for instance on
   some access routers, systematic per-flow policing may be considered.
   Should per-flow state be more constrained, rate adaptation policing
   could be limited to a random sample of flows exhibiting positive or
   canceled packets.

   As in the case of user policing, only positive or canceled packets
   will consume tokens, however the amount of tokens consumed will
   depend on the congestion signal.

   When a new rate adaptation policer is set up for flow j, the
   following state is created:

   o  a token bucket b_j of depth b_max starting at level b_0

   o  a timestamp t_j = timenow()

   o  a counter N_j = 0

   o  a roundtrip estimate T_j

   o  a filling rate r

   When the policing node forwards a packet of flow j with no Re-Echo:

   o  the counter is incremented: N_j += 1

   When the policing node forwards a packet of flow j carrying a
   congestion mark (CE):

   o  the counter is incremented: N_j += 1

   o  the token level is adjusted:
      b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k

   o  the counter is reset: N_j = 0

   o  the timer is reset: t_j = timenow()

   An implementation example will be given in a later draft that avoids
   having to extract the square root.

   Analysis: For a TCP flow, for r = 1 token/sec, on average,

      r*(timenow()-t_j) - sqrt(N_j)*T_j/k = dt_TCP - sqrt(N)*T/k = 0

   This means that the token level will fluctuate around its initial
   level.
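
   The per-flow update rules above can be sketched as follows.  This is
   a minimal illustration, not the implementation promised for a later
   draft: the class and function names are hypothetical, time is passed
   in explicitly instead of calling timenow(), k is fixed at its upper
   bound sqrt(3/2), and capping the level at b_max is an assumption the
   text does not spell out.

```python
import math

K = math.sqrt(3 / 2)  # upper bound on the constant k in Mathis' formula


class RateAdaptationPolicer:
    """Per-flow state from the text: b_j, t_j, N_j, T_j and r."""

    def __init__(self, now, rtt, b_0, b_max, r=1.0):
        self.level = b_0    # b_j, the token level
        self.b_max = b_max  # bucket depth
        self.t = now        # t_j, time of the last mark
        self.n = 0          # N_j, packets since the last mark
        self.rtt = rtt      # T_j, roundtrip estimate in seconds
        self.r = r          # filling rate in tokens per second

    def on_packet(self, now, marked):
        """Update the state for one forwarded packet; return the level."""
        self.n += 1
        if marked:
            earned = self.r * (now - self.t)
            spent = math.sqrt(self.n) * self.rtt / K
            # Capping at b_max is an assumption of this sketch.
            self.level = min(self.b_max, self.level + earned - spent)
            self.n = 0
            self.t = now
        return self.level


def final_level(inter_mark_time, packets_per_mark=100, marks=20):
    """Feed a regular marking pattern to a fresh policer (illustration)."""
    policer = RateAdaptationPolicer(now=0.0, rtt=0.1, b_0=5.0, b_max=10.0)
    now = 0.0
    for _ in range(marks):
        for _ in range(packets_per_mark - 1):
            policer.on_packet(now, marked=False)
        now += inter_mark_time
        policer.on_packet(now, marked=True)
    return policer.level
```

   With these assumed values, a flow whose marks arrive every
   sqrt(N)*T/k seconds leaves the level at b_0, while a flow that sends
   the same N packets between marks in half that time drains the
   bucket.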
   The depth b_max of the bucket sets the timescale on which the rate
   adaptation policy is performed, while the filling rate r sets the
   trade-off between responsiveness and robustness:

   o  the higher b_max, the longer it will take to catch greedy flows

   o  the higher r, the fewer false positives (greedy verdict on
      compliant flows) but the more false negatives (compliant verdict
      on greedy flows)

   This rate adaptation policer requires the availability of a roundtrip
   estimate, which may be obtained, for instance, from the application
   of re-feedback to the downstream delay (Appendix F) or from passive
   estimation [Jiang02].

   When the bucket of a policer located at the access router (whether it
   is a per-user policer or a per-flow policer) becomes empty, the
   access router SHOULD drop at least all packets causing the token
   level to become negative.  The network operator MAY take further
   sanctions if the token level of the per-flow policers associated with
   a user becomes negative.

Appendix H.  Downstream Congestion Metering Algorithms

H.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream congestion in traffic crossing
   an inter-domain border, an algorithm is needed that accumulates the
   size of positive packets and subtracts the size of negative packets.
   We maintain two counters:

      V_b: accumulated congestion volume

      B: total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each re-ECN-capable packet {
       b = readLength(packet)          /* set b to packet size       */
       B += b                          /* accumulate total volume    */
       eecn = readEECN(packet)         /* read the EECN field once   */
       if (eecn == Re-Echo || eecn == FNE) {
           V_b += b                    /* increment...               */
       } elseif (eecn == CE(-1)) {
           V_b -= b                    /* ...or decrement V_b...     */
       }                               /* ...depending on EECN field */
   }
   ====================================================================

   At the end of an accounting period this counter V_b represents the
   congestion volume that penalties could be applied to, as described in
   Section 6.1.6.

   For instance, the accumulated volume of congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream congestion level
   of 1% on an accumulated total data volume of B = 500PB.

H.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple algorithm
   above in order to protect against the various attacks from
   persistently negative flows described in Section 6.1.6.  As explained
   in that section, the first and most important step is to estimate the
   contribution of persistently negative flows to the bulk volume of
   downstream pre-congestion and to inflate this bulk volume as if these
   flows weren't there.  The process below has been designed to give an
   unbiased estimate, but it may be possible to define other processes
   that achieve similar ends.

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be able to
   realistically measure but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should be
   taken at different times during the accounting period, preferably
   covering the whole ID space.  During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative, maintaining a separate account for each flow in the sample.
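   The bulk counters of Appendix H.1 and the per-flow sample accounts
   just described can be sketched together in Python.  This is a
   minimal sketch under stated assumptions: the codepoint labels, the
   BorderMeter class and the inflation_factor helper are hypothetical
   names, flows are identified by an opaque key, and a single sample is
   held at a time.  The helper computes the ratio that the rest of this
   section calls a_S: the total of the sampled accounts excluding flows
   with a negative account, divided by the total of all sampled
   accounts.

```python
from collections import defaultdict

# Hypothetical labels for the extended ECN (EECN) codepoints used here.
RE_ECHO, FNE, CE_NEG, NEUTRAL = "Re-Echo", "FNE", "CE(-1)", "neutral"


class BorderMeter:
    """Bulk congestion meter plus per-flow accounts for one sample."""

    def __init__(self, sampled_flows=()):
        self.v_b = 0                        # V_b: congestion volume, bytes
        self.b = 0                          # B: total data volume, bytes
        self.sampled = set(sampled_flows)   # flow IDs in the current sample
        self.accounts = defaultdict(int)    # per-flow account, bytes

    def meter(self, flow_id, length, eecn):
        """Account for one re-ECN-capable packet of the given length."""
        self.b += length
        if eecn in (RE_ECHO, FNE):
            delta = length                  # positive packet
        elif eecn == CE_NEG:
            delta = -length                 # negative packet
        else:
            delta = 0                       # other codepoints leave V_b alone
        self.v_b += delta
        if flow_id in self.sampled:
            self.accounts[flow_id] += delta


def inflation_factor(samples):
    """Return sum_I V_fI / sum_I V_bI over a list of per-sample accounts."""
    v_f = sum(v for acc in samples for v in acc.values() if v >= 0)
    v_b = sum(v for acc in samples for v in acc.values())
    if v_b <= 0:
        raise ValueError("sampled accounts do not sum to a positive volume")
    return v_f / v_b
```

   As a toy illustration, suppose sampled flow A sends three positive
   and one negative 1500-byte packets while sampled flow B sends a
   single negative 1500-byte packet.  The accounts are then 3000 and
   -1500 bytes, and the helper returns 3000/1500 = 2.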
   Each sample should run a lot longer than the large majority of flows,
   to avoid a bias from missing the starts and ends of flows, which tend
   to be positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the sample,
   and the total of the accounts V_{fI} excluding flows with a negative
   account from the subset I.  Then the weighted mean of all these
   samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix H.1), it can be inflated by this factor
   a_S to get a good unbiased estimate of the volume of downstream
   congestion over the accounting period, a_S * V_b, without being
   polluted by the effect of persistently negative flows.

Appendix I.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop (another router
   or the receiver).

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in RFC
   3540 (currently experimental).  The specification of DCCP [RFC4340]
   (currently proposed standard) requires that "Each DCCP sender SHOULD
   set ECN Nonces on its packets...".  It also mandates as a requirement
   for all CCID profiles that "Any newly defined acknowledgement
   mechanism MUST include a way to transmit ECN Nonce Echoes back to the
   sender."  Therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces in
      its Ack Vectors."
   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The ECN nonce is used for three types of functions:

   o  if the sender wants to ensure the integrity of the information
      about packet drops,

   o  if the sending transport chooses to act in the interests of a
      congested router,

   o  if the sending transport wants to allocate its own resources in
      proportion to the rates that each network path can sustain, based
      on congestion control.

   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.  Similarly,
   a transport layer nonce would protect against a receiver sending
   early acknowledgements.

   The other two functions need the ECN nonce to be in the network
   layer, but both require rather optimistic trust assumptions in order
   to be useful.  If the sending transport chooses to act in the
   interests of a congested router, it can reduce its rate if it detects
   some malicious party in the feedback loop may be suppressing ECN
   feedback.  But it would only be useful to a router when /all/ senders
   using the router are trusted to act in the router's interest.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.
   In that case, the nonce allows senders to be assured that they aren't
   being duped into giving more of their own resources to a particular
   flow.  And if congestion suppression is detected, the sending
   transport can rate limit the offending connection to protect its own
   resources.  Certainly, this is a useful function, but the IETF should
   carefully decide whether such a single, very specific case warrants
   IP header space.

   In contrast, re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone - senders,
   receivers, neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.

   Moreover, while we have re-designed the re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.  If the ECN nonce started to see some deployment
   (perhaps because it was blessed with proposed standard status),
   incremental deployment of re-ECN would effectively be impossible,
   because re-ECN marking fractions at inter-domain borders would be
   polluted by unknown levels of nonce traffic.

   The authors are aware that re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the re-ECN protocol is sufficiently
   useful to warrant standards action.

Authors' Addresses

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   Email: arnaud.jacquet@bt.com

   Alessandro Salvatori
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Email: sandr8@gmail.com

   Martin Koyabe
   BT
   B54/69, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 646923
   Email: martin.koyabe@bt.com

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC documents
   can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).