idnits 2.17.1 draft-briscoe-tsvwg-re-ecn-tcp-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 18. -- Found old boilerplate from RFC 3978, Section 5.5 on line 3818. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 3795. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 3802. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 3808. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. 
== Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'SHOULD not' in this paragraph: Appendix E also gives an example dropper implementation that aggregates flow state. Dropper algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a dropper SHOULD only allow flows into the average if they start with FNE, but it SHOULD not include packets with the FNE codepoint set in the average. A sender sets the FNE codepoint when it does not have the benefit of feedback from the receiver. So, counting packets with FNE cleared would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 26, 2006) is 6513 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'ITU-T.I.371' is defined on line 3170, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2309 (Obsoleted by RFC 7567) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Downref: Normative reference to an Historic RFC: RFC 3540 == Outdated reference: A later version (-04) exists of draft-briscoe-tsvwg-cl-architecture-03 == Outdated reference: A later version (-01) exists of draft-davie-ecn-mpls-00 -- Obsolete informational reference (is this intentional?): RFC 2402 (Obsoleted by RFC 4302, RFC 4305) -- Obsolete informational reference (is this intentional?): RFC 2406 (Obsoleted by RFC 4303, RFC 4305) -- Obsolete informational reference (is this intentional?): RFC 2988 (Obsoleted by RFC 6298) Summary: 6 errors (**), 0 flaws (~~), 6 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Transport Area Working Group B. Briscoe 3 Internet-Draft BT & UCL 4 Expires: December 28, 2006 A. Jacquet 5 A. Salvatori 6 M. Koyabe 7 BT 8 June 26, 2006 10 Re-ECN: Adding Accountability for Causing Congestion to TCP/IP 11 draft-briscoe-tsvwg-re-ecn-tcp-02 13 Status of this Memo 15 By submitting this Internet-Draft, each author represents that any 16 applicable patent or other IPR claims of which he or she is aware 17 have been or will be disclosed, and any of which he or she becomes 18 aware will be disclosed, in accordance with Section 6 of BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. 
Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt. 33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet-Draft will expire on December 28, 2006. 38 Copyright Notice 40 Copyright (C) The Internet Society (2006). 42 Abstract 44 This document introduces a new protocol for explicit congestion 45 notification (ECN), termed re-ECN, which can be deployed 46 incrementally around unmodified routers. The protocol arranges an 47 extended ECN field in each packet so that, as it crosses any 48 interface in an internetwork, it will carry a truthful prediction of 49 congestion on the remainder of its path. Then the upstream party at 50 any trust boundary in the internetwork can be held responsible for 51 the congestion they cause, or allow to be caused. So, networks can 52 introduce straightforward accountability and policing mechanisms for 53 incoming traffic from end-customers or from neighbouring network 54 domains. The purpose of this document is to specify the re-ECN 55 protocol at the IP layer and to give guidelines on any consequent 56 changes required to transport protocols. It includes the changes 57 required to TCP both as an example and as a specification. It also 58 gives examples of mechanisms that can use the protocol to ensure data 59 sources respond correctly to congestion. And it describes example 60 mechanisms that ensure the dominant selfish strategy of both network 61 domains and end-points will be to set the extended ECN field 62 honestly. 
64 Authors' Statement: Status (to be removed by the RFC Editor) 66 This document is posted as an Internet-Draft with the intent (at 67 least that of the authors) to eventually progress to standards track. 69 Although the re-ECN protocol is intended to make a simple but far- 70 reaching change to the Internet architecture, the most immediate 71 priority for the authors is to delay any move of the ECN nonce to 72 Proposed Standard status. 74 The ECN nonce is an experimental RFC that allows /senders/ to check 75 the integrity of congestion feedback from /networks/. Therefore the 76 nonce only helps in scenarios where the sender is trusted to control 77 network congestion. On the other hand, the re-ECN protocol aims to 78 allow networks themselves to be able to police cheating senders and 79 receivers and to police neighbouring networks. Re-ECN is therefore 80 proposed in preference to the ECN nonce on the basis that it 81 addresses the generic problem of accountability for congestion of a 82 network's resources at the IP layer. 84 Delaying the ECN nonce is justified by two factors: 86 o The ECN nonce would permanently consume a two-bit codepoint in 87 the IP header for a purpose specific to a limited trust model. 88 Although the nonce is a neat idea, its applicability seems too 89 limited to warrant space in the IP header; 91 o Although we have re-designed the re-ECN codepoints so that they do 92 not prevent the ECN nonce progressing, the same is not true the 93 other way round. If the ECN nonce started to see some deployment 94 (perhaps because it was blessed with proposed standard status), 95 incremental deployment of re-ECN would effectively be impossible, 96 because re-ECN marking fractions at inter-domain borders would be 97 polluted by unknown levels of nonce traffic. 99 The authors are aware that re-ECN must prove it has the potential it 100 claims if it is to displace the nonce. 
Therefore, every effort has 101 been made to complete a comprehensive specification of re-ECN so that 102 its potential can be assessed. We therefore seek the opinion of the 103 Internet community on whether the re-ECN protocol is sufficiently 104 useful to warrant standards action. 106 Changes from previous drafts (to be removed by the RFC Editor) 108 From -00 to -01: 110 Encoding of re-ECN wire protocol changed for reasons given in 111 Appendix B and consequently draft substantially re-written. 113 Substantial text added in sections on applications, incremental 114 deployment, architectural rationale and security considerations. 116 From -01 to -02: 118 Explanation on informal terminology in Section 3.4 clarified. 120 IPv6 wire protocol encoding added (Section 5.2). 122 Text on (non-)issues with tunnels, encryption and link layer 123 congestion notification added (Section 5.6 & Section 5.7). 125 Section added giving evolvability arguments against encouraging 126 bottleneck policing (Section 6.1.2). And text on re-ECN's 127 evolvability by design added to Section 6.1.3 129 Text on inter-domain policing (Section 6.1.6) and inter-domain 130 fail-safes (Section 6.1.7) added. 132 Minor editorial changes throughout.

134 Table of Contents

136 1. Introduction
137 2. Requirements notation
138 3. Protocol Overview
139   3.1. Background and Applicability
140   3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
142   3.3. Re-ECN Protocol Operation
143   3.4. Informal Terminology
144 4. Transport Layers
145   4.1. TCP
146     4.1.1. RECN mode: Full re-ECN capable transport
147     4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver
149     4.1.3. Capability Negotiation
150     4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after Idle Periods
152     4.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs
154   4.2. Other Transports
155     4.2.1. Guidelines for Adding Re-ECN to Other Transports
156 5. Network Layer
157   5.1. Re-ECN IPv4 Wire Protocol
158   5.2. Re-ECN IPv6 Wire Protocol
159   5.3. Router Forwarding Behaviour
160   5.4. Justification for Setting the First SYN to FNE
161   5.5. Control and Management
162     5.5.1. Negative Balance Warning
163     5.5.2. Rate Response Control
164   5.6. IP in IP Tunnels
165   5.7. Non-Issues
166 6. Applications
167   6.1. Policing Congestion Response
168     6.1.1. The Policing Problem
169     6.1.2. The Case Against Bottleneck Policing
170     6.1.3. Re-ECN Incentive Framework
171     6.1.4. Egress Dropper
172     6.1.5. Rate Policing
173     6.1.6. Inter-domain Policing
174     6.1.7. Inter-domain Fail-safes
175     6.1.8. Simulations
176   6.2. Other Applications
177     6.2.1. DDoS Mitigation
178     6.2.2. End-to-end QoS
179     6.2.3. Traffic Engineering
180     6.2.4. Inter-Provider Service Monitoring
181   6.3. Limitations
183 7. Incremental Deployment
184   7.1. Incremental Deployment Features
185   7.2. Incremental Deployment Incentives
186 8. Architectural Rationale
187 9. Related Work
188   9.1. Policing Rate Response to Congestion
189   9.2. Congestion Notification Integrity
190   9.3. Identifying Upstream and Downstream Congestion
191 10. Security Considerations
192 11. IANA Considerations
193 12. Conclusions
194 13. Acknowledgements
195 14. Comments Solicited
196 15. References
197   15.1. Normative References
198   15.2. Informative References
199 Appendix A. Precise Re-ECN Protocol Operation
200 Appendix B. Justification for Two Codepoints Signifying Zero Worth Packets
202 Appendix C. ECN Compatibility
203 Appendix D. Packet Marking During Flow Start
204 Appendix E. Example Egress Dropper Algorithm
205 Appendix F. Re-TTL
206 Appendix G. Policer Designs to ensure Congestion Responsiveness
208   G.1. Per-user Policing
209   G.2. Per-flow Rate Policing
210 Appendix H. Downstream Congestion Metering Algorithms
211   H.1. Bulk Downstream Congestion Metering Algorithm
212   H.2. Inflation Factor for Persistently Negative Flows
213 Authors' Addresses
214 Intellectual Property and Copyright Statements

216 1. Introduction 218 This document aims: 220 o To provide a complete specification of the addition of the re-ECN 221 protocol to IP and guidelines on how to add it to transport layer 222 protocols, including a complete specification of re-ECN in TCP as 223 an example; 225 o To show how a number of hard problems become much easier to solve 226 once re-ECN is available in IP. 228 A general statement of the problem solved by re-ECN is to provide 229 sufficient information in each IP datagram to be able to hold senders 230 and whole networks accountable for the congestion they cause 231 downstream, before they cause it. But the every-day problems that 232 re-ECN can solve are much more recognisable than this rather generic 233 statement: mitigating distributed denial of service (DDoS); 234 simplifying differentiation of quality of service (QoS); policing 235 compliance to congestion control; and so on. 237 Uniquely, re-ECN manages to enable solutions to these problems 238 without unduly stifling innovative new ways to use the Internet. 239 This was a hard balance to strike, given it could be argued that DDoS 240 is an innovative way to use the Internet. The most valuable insight 241 was to allow each network to choose the level of constraint it wishes 242 to impose. 
Also re-ECN has been carefully designed so that networks 243 that choose to use it conservatively can protect themselves against 244 the congestion caused in their network by users on other networks 245 with more liberal policies. 247 For instance, some network owners want to block applications like 248 voice and video unless their network is compensated for the extra 249 share of bottleneck bandwidth taken. These real-time applications 250 tend to be unresponsive when congestion arises. Whereas elastic TCP- 251 based applications back away quickly, ending up taking a much smaller 252 share of congested capacity for themselves. Other network owners 253 want to invest in large amounts of capacity and make their gains from 254 simplicity of operation and economies of scale. 256 Re-ECN allows the more conservative networks to police out flows that 257 have not asked to be unresponsive to congestion---not because they 258 are voice or video---just because they don't respond to congestion. 259 But it also allows other networks to choose not to police. 260 Crucially, when flows from liberal networks cross into a conservative 261 network, re-ECN enables the conservative network to apply penalties 262 to its neighbouring networks for the congestion they allow to be 263 caused. And these penalties can be applied to bulk data, without 264 regard to flows. 266 Then, if unresponsive applications become so dominant that some of 267 the more liberal networks experience congestion collapse [RFC3714], 268 they can change their minds and use re-ECN to apply tighter controls 269 in order to bring congestion back under control. 271 Re-ECN works by arranging that each packet arrives at each network 272 element carrying a view of expected congestion on its own downstream 273 path, albeit averaged over multiple packets. Most usefully, 274 congestion on the remainder of the path becomes visible in the IP 275 header at the first ingress. 
Many of the applications of re-ECN 276 involve a policer at this ingress using the view of downstream 277 congestion arriving in packets to police or control the packet rate. 279 Importantly, the scheme is recursive: a whole network harbouring 280 users causing congestion in downstream networks can be held 281 responsible or policed by its downstream neighbour. 283 This document is structured as follows. First an overview of the re- 284 ECN protocol is given (Section 3), outlining its attributes and 285 explaining conceptually how it works as a whole. The two main parts 286 of the document follow, as described above. That is, the protocol 287 specification divided into transport (Section 4) and network 288 (Section 5) layers, then the applications it can be put to, such as 289 policing DDoS, QoS and congestion control (Section 6). Although 290 these applications do not require standardisation themselves, they 291 are described in a fair degree of detail in order to explain how re- 292 ECN can be used. Given that re-ECN proposes to use the last undefined 293 bit in the IPv4 header, we felt it necessary to outline the potential 294 that re-ECN could release in return for being given that bit. 296 Deployment issues discussed throughout the document are brought 297 together in Section 7, which is followed by a brief section 298 explaining the somewhat subtle rationale for the design, from an 299 architectural perspective (Section 8). We end by describing related 300 work (Section 9), listing security considerations (Section 10) and 301 finally drawing conclusions (Section 12). 303 2. Requirements notation 305 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 306 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 307 document are to be interpreted as described in [RFC2119]. 309 This document first specifies a protocol, then describes a framework 310 that creates the right incentives to ensure compliance to the 311 protocol. 
This could cause confusion because the second part of the 312 document considers many cases where malicious nodes may not comply 313 with the protocol. When such contingencies are described, if any of 314 the above keywords are not capitalised, that is deliberate. So, for 315 instance, the following two apparently contradictory sentences would 316 be perfectly consistent: i) x MUST do this; ii) x may not do this. 318 3. Protocol Overview 320 3.1. Background and Applicability 322 First we briefly recap the essentials of the ECN protocol [RFC3168]. 323 Two bits in the IP protocol (v4 or v6) are assigned to the ECN field. 324 The sender clears the field to "00" (Not-ECT) if either end-point 325 transport is not ECN-capable. Otherwise it indicates an ECN-capable 326 transport (ECT) using either of the two code-points "10" or "01" 327 (ECT(0) and ECT(1) resp.). 329 ECN-capable routers probabilistically set "11" if congestion is 330 experienced (CE), the marking probability increasing with the length 331 of the queue at its egress link (typically using the RED 332 algorithm [RFC2309]). However, they still drop rather than mark Not- 333 ECT packets. With multiple ECN-capable routers on a path, a flow of 334 packets accumulates the fraction of CE marking that each router adds. 335 The combined effect of the packet marking of all the routers along 336 the path signals congestion of the whole path to the receiver. So, 337 for example, if one router early in a path is marking 1% of packets 338 and another later in a path is marking 2%, flows that pass through 339 both routers will experience approximately 3% marking (see Appendix A 340 for a precise treatment). 342 The choice of two ECT code-points in the ECN field [RFC3168] 343 permitted future flexibility, optionally allowing the sender to 344 encode the experimental ECN nonce [RFC3540] in the packet stream. 345 The nonce is designed to allow a sender to check the integrity of 346 congestion feedback. 
But Section 9.2 explains that it still gives no 347 control over how fast the sender transmits as a result of the 348 feedback. On the other hand, re-ECN is designed both to ensure that 349 congestion is declared honestly and that the sender's rate responds 350 appropriately. 352 Re-ECN is based on a feedback arrangement called 353 `re-feedback' [Re-fb]. The word is short for either receiver- 354 aligned, re-inserted or re-echoed feedback. But it actually works 355 even when no feedback is available. In fact it has been carefully 356 designed to work for single datagram flows. Indeed, it even 357 encourages aggregation of single packet flows by congestion control 358 proxies. Then, even if the traffic mix of the Internet were to 359 become dominated by short messages, it would still be possible to 360 control congestion effectively and efficiently. 362 Changing the Internet's feedback architecture seems to imply 363 considerable upheaval. But re-ECN can be deployed incrementally at 364 the transport layer around unmodified routers using existing fields 365 in IP (v4 or v6). However it does also require the last undefined 366 bit in the IPv4 header, which it uses in combination with the 2-bit 367 ECN field to create four new codepoints. Nonetheless, changes to IP 368 routers are RECOMMENDED in order to improve resilience against DoS 369 attacks. Similarly, re-ECN works best if both the sender and 370 receiver transports are re-ECN-capable, but it can work with just 371 sender support. Section 7.1 summarises the incremental deployment 372 strategy. 374 The re-ECN protocol makes no changes and has no effect on the TCP 375 congestion control algorithm or on other rate responses to 376 congestion. Re-ECN is only concerned with enabling the ingress 377 network to police that a source is complying with a congestion 378 control algorithm, which is orthogonal to congestion control itself. 
380 Before re-ECN can be considered worthy of using up the last bit in 381 the IP header, we must be sure that all our claims are robust. We 382 have gradually been reducing the list of outstanding issues, but the 383 few that still remain are listed in Section 6.3. We expect new 384 attacks may still be found, but we offer the re-ECN protocol on the 385 basis that it is built on fairly solid theoretical foundations and, 386 so far, it has proved possible to keep it relatively robust. 388 3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) 390 The re-ECN wire protocol uses the two bit ECN field broadly as in 391 RFC3168 [RFC3168] as described above, but with five differences of 392 detail (brought together in a list in Section 7.1). This 393 specification defines a new re-ECN extension (RE) flag. We will 394 defer the definition of the actual position of the RE flag in the 395 IPv4 & v6 headers until Section 5. Until then it will suffice to use 396 an abstraction of the IPv4 and v6 wire protocols by just calling it 397 the RE flag. 399 Unlike the ECN field, the RE flag is intended to be set by the sender 400 and remain unchanged along the path, although it can be read by 401 network elements that understand the re-ECN protocol. It is feasible 402 that a network element MAY change the setting of the RE flag, perhaps 403 acting as a proxy for an end-point, but such a protocol would have to 404 be defined in another specification (e.g. [Re-PCN]). 406 Although the RE flag is a separate, single bit field, it can be read 407 as an extension to the two-bit ECN field; the three concatenated bits 408 in what we will call the extended ECN field (EECN) making eight 409 codepoints. We will use the RFC3168 names of the ECN codepoints to 410 describe settings of the ECN field when the RE flag setting is "don't 411 care", but we also define the following six extended ECN codepoint 412 names for when we need to be more specific. 
414 +-------+-----------+------+--------------+-------------------------+
415 | ECN   | RFC3168   | RE   | Extended ECN | Re-ECN meaning          |
416 | field | codepoint | flag | codepoint    |                         |
417 +-------+-----------+------+--------------+-------------------------+
418 | 00    | Not-ECT   | 0    | Not-RECT     | Not re-ECN-capable      |
419 |       |           |      |              | transport               |
420 | 00    | Not-ECT   | 1    | FNE          | Feedback not            |
421 |       |           |      |              | established             |
422 | 01    | ECT(1)    | 0    | Re-Echo      | Re-echoed congestion    |
423 |       |           |      |              | and RECT                |
424 | 01    | ECT(1)    | 1    | RECT         | Re-ECN capable          |
425 |       |           |      |              | transport               |
426 | 10    | ECT(0)    | 0    | ---          | Legacy ECN use only     |
427 | 10    | ECT(0)    | 1    | --CU--       | Currently unused        |
428 |       |           |      |              |                         |
429 | 11    | CE        | 0    | CE(0)        | Re-Echo canceled by     |
430 |       |           |      |              | congestion experienced  |
431 | 11    | CE        | 1    | CE(-1)       | Congestion experienced  |
432 +-------+-----------+------+--------------+-------------------------+

434 Table 1: Extended ECN Codepoints

436 3.3. Re-ECN Protocol Operation 438 In this section we will give an overview of the operation of the re- 439 ECN protocol for TCP/IP, leaving a detailed specification to the 440 following sections. Other transports will be discussed later. 442 In summary, the protocol adds a third `re-echo' stage to the existing 443 TCP/IP ECN protocol. Whenever the network adds CE congestion 444 signalling to the IP header on the forward data path, the receiver 445 feeds it back to the ingress using TCP, then the sender re-echoes it 446 into the forward data path using the RE flag in the next packet. 448 Prior to receiving any feedback a sender will not know which setting 449 of the RE flag to use, so it sets the feedback not established (FNE) 450 codepoint. The network reads the FNE codepoint conservatively as 451 equivalent to re-echoed congestion. 453 Specifically, once a flow is established, a re-ECN sender always 454 initialises the ECN field to ECT(1). And it usually sets the RE flag 455 to "1". 
Whenever a router re-marks a packet to CE, the receiver 456 feeds back this event to the sender. On receiving this feedback, the 457 re-ECN sender will clear the RE flag to "0" in the next packet it 458 sends. 460 We chose to set and clear the RE flag this way round to ease 461 incremental deployment (see Section 7.1). To avoid confusion we will 462 use the term `blanking' (rather than marking) when the RE flag is 463 cleared to "0". So, over a stream of packets, we will talk of the 464 `RE blanking fraction' as the fraction of octets in packets with the 465 RE flag cleared to "0".

467    ^
468    |
469    |      RE blanking fraction
470 3% |--------------------------------+=====
471    |                                |
472 2% |                                |
473    |      CE marking fraction       |
474 1% |        +-----------------------+
475    |        |
476 0% +---------------------------------------->
477    ^        0       ^               i   ^   resource index
478    |        ^       |               ^   |
479    0        |       1               |   2   observation points
480           1.00%                   2.00%     marking fraction

482 Figure 1: A 2-Router Example (Imprecise)

484 Figure 1 uses the two router example introduced earlier to illustrate 485 why re-ECN allows routers to measure downstream congestion. The 486 horizontal axis represents the index of each congestible resource 487 (typically queues) along a path through the Internet. There may be 488 many routers on the path, but we assume only two are currently 489 congested (those with resource index 0 and i). The two superimposed 490 plots show the fraction of each extended ECN codepoint in a flow 491 observed along this path. Given about 3% of packets reaching the 492 destination are marked CE, in response to feedback the sender will 493 blank the RE flag in about 3% of packets it sends. Then approximate 494 downstream congestion can be measured at the observation points shown 495 along the path by subtracting the CE marking fraction from the RE 496 blanking fraction, as shown in the table below (Appendix A derives 497 these approximations from a precise analysis). 
499 +-------------------+------------------------------+
500 | Observation point | Approx downstream congestion |
501 +-------------------+------------------------------+
502 | 0                 | 3% - 0% = 3%                 |
503 | 1                 | 3% - 1% = 2%                 |
504 | 2                 | 3% - 3% = 0%                 |
505 +-------------------+------------------------------+

507 Table 2: Downstream Congestion Measured at Example Observation Points

509 All along the path, whole-path congestion remains unchanged so it can 510 be used as a reference against which to compare upstream congestion. 511 The difference predicts downstream congestion for the rest of the 512 path. Therefore, measuring the fractions of each codepoint at any 513 point in the Internet will reveal upstream, downstream and whole path 514 congestion. 516 Note that we have introduced discussion of marking and blanking 517 fractions solely for illustration. To be absolutely clear, these 518 fractions are averages that would result from the behaviour of a TCP 519 protocol handler mechanically blanking outgoing packets in direct 520 response to incoming feedback---we are not saying any protocol 521 handler works with these average fractions directly. 523 3.4. Informal Terminology 525 In the rest of this memo we will loosely talk of positive or negative 526 flows, meaning flows where the moving average of the downstream 527 congestion metric is persistently positive or negative. The notion 528 of a negative metric arises because it is derived by subtracting one 529 metric from another. Of course actual downstream congestion cannot 530 be negative, only the metric can (whether due to time lags or 531 deliberate malice). 533 Just as we will loosely talk of positive and negative flows, we will 534 also talk of positive or negative packets, meaning packets that 535 contribute positively or negatively to the downstream congestion 536 metric. 
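The subtraction used in Table 2 can be sketched in Python as follows. This is only an illustration of the approximation described above, not the precise treatment that Appendix A is said to give. The function names are hypothetical, and the sketch assumes that each congested router marks packets independently with a fixed probability and that the sender's RE blanking fraction equals the whole-path CE marking fraction.

```python
from math import prod


def cumulative_marking(router_probs):
    """CE marking fraction seen at each observation point.

    Observation point k sits after the first k congested routers.
    Assumption: routers mark independently with fixed probabilities.
    """
    return [1.0 - prod(1.0 - p for p in router_probs[:k])
            for k in range(len(router_probs) + 1)]


def approx_downstream(router_probs):
    """Approximate downstream congestion at each observation point.

    Follows the subtraction in Table 2: RE blanking fraction minus the
    CE marking fraction accumulated upstream of the observation point.
    """
    ce = cumulative_marking(router_probs)
    re_blanking = ce[-1]  # assumed equal to whole-path CE marking
    return [re_blanking - upstream for upstream in ce]


# The two-router example: 1% marking at resource 0, 2% at resource i.
fractions = approx_downstream([0.01, 0.02])
for point, value in enumerate(fractions):
    print(f"observation point {point}: {value:.2%}")
```

Under these assumptions the whole-path fraction is slightly below 3%, because the two marking probabilities combine multiplicatively rather than additively, which is consistent with the table describing its values as approximate.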

Therefore we will talk of packets having `worth' of +1, 0 or -1,
which, when multiplied by their size, indicates their contribution to
the downstream congestion metric.

Figure 2 shows the main state transitions of the system once a flow
is established, showing the worth of packets in each state. When the
network congestion marks a packet it decrements its worth (moving
from the left of the main square to the right). When the sender
blanks the RE flag in order to re-echo congestion it increments the
worth of a packet (moving from the bottom of the main square to the
top).

  Sender state       Sent   Worth      Received      Worth
                     packet            packet
       +----------------------------------------------------+
       |                                                    ^
       V                                                    |
Congestion echoed -->Re-Echo +1 --+---> CE(0)           0 --+
                    (positive)    |    (canceled)           |
                                  V network                 |
                                  | congestion              |
                                  |                         |
Flow established  --> RECT    0 ----+-> CE(-1)         -1 --+
       ^            (neutral)     | | (negative)
       |                          | |
       |             no           V V
       |             congestion   | |
       +-----------<--------------+-+

Figure 2: Re-ECN System State Diagram (bootstrap not shown)

The idea is that every time the network decrements the worth of a
packet, the sender increments the worth of a later packet. Then,
over time, as many positive octets should arrive at the receiver as
negative. Note we have said octets not packets, so if packets are of
different sizes, the worth should be incremented on enough octets to
balance the octets in negative packets arriving at the receiver. It
is this balance that will allow the network to hold the sender
accountable for the congestion it causes, as we shall see. The
informal outline below uses TCP as an example transport, but the idea
would be broadly similar for any transport that adapts its rate to
congestion.
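
The octet balance described above can be illustrated with a short
sketch. It is not part of the protocol: the function name, the packet
representation and the 1000-octet packet size are assumptions made
purely for illustration, and only the four codepoints that appear in
Figure 2 are included.

```python
# Worth of the four codepoints that appear in Figure 2.
WORTH = {"Re-Echo": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


def downstream_metric(packets):
    """Octet-weighted net worth of a packet stream, as a fraction.

    `packets` is an iterable of (codepoint, size_in_octets) pairs as
    seen at one observation point (a hypothetical representation).
    """
    total_octets = 0
    net_worth_octets = 0
    for codepoint, size in packets:
        total_octets += size
        net_worth_octets += WORTH[codepoint] * size
    return net_worth_octets / total_octets if total_octets else 0.0


# 100 equal-sized packets in which the sender re-echoes 3 marks.
at_sender = [("Re-Echo", 1000)] * 3 + [("RECT", 1000)] * 97
# The same stream after the network has marked 3 of the neutral packets.
at_receiver = ([("Re-Echo", 1000)] * 3 + [("CE(-1)", 1000)] * 3
               + [("RECT", 1000)] * 94)

print(downstream_metric(at_sender))    # positive: congestion lies ahead
print(downstream_metric(at_receiver))  # balanced: none left downstream
```

Weighting by octets rather than counting packets is what lets a
single large negative packet be balanced by several small positive
ones.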

We will start with the sender in `flow established' state. Normally,
as acknowledgements of earlier packets arrive that don't feed back
any congestion, the congestion window can be opened, so the sender
goes round the smaller sub-loop, sending RECT packets (worth 0) and
returning to the flow established state to send another one. If a
router congestion marks one of the packets, it decrements the
packet's worth. The sender will have been continuing to traverse
round the smaller feedback loop every time acknowledgements arrive.
But when congestion feedback returns from this packet that was marked
with -1 worth (the largest loop in the figure) the sender jumps to
the congestion echoed state in order to re-echo the congestion,
incrementing the worth of the next packet to +1 by blanking its RE
flag. The sender then returns to the flow established state and
continues round the smaller loop, sending packets worth 0. Note that
the size of the loops is just an artefact of the figure; it is not
meant to imply that one loop is slower than the other - they are both
the same end to end feedback loop.

If a packet carrying re-echoed congestion happens to also be
congestion marked, the +1 worth added by the sender will be cancelled
out by the -1 network congestion marking. Although the two worth
values correctly cancel out, neither the congestion marking nor the
re-echoed congestion is lost, because the RE bit and the ECN field
are orthogonal. So, whenever this happens, the receiver will
correctly detect and re-echo the new congestion event as well (the
top sub-loop). When we need to distinguish, we will sometimes call a
packet marked RECT neutral (0 worth), while we will call the CE(0)
marking canceled (also 0 worth).
If a re-echoed packet isn't unlucky enough to be further congestion
marked, the sender will return to the flow established state and
continue to send RECT packets (worth 0).

The table below specifies unambiguously the worth of each extended
ECN codepoint. Note the order is different from the previous table
to better show how the worth increments and decrements. The FNE
codepoint is an exception. It is used in the flow bootstrap process
(explained later) and has the same positive (+1) worth as a packet
with the Re-Echo codepoint.

+-------+-----+----------------+-------+----------------------------+
| ECN   | RE  | Extended ECN   | Worth | Re-ECN meaning             |
| field | bit | codepoint      |       |                            |
+-------+-----+----------------+-------+----------------------------+
| 00    | 0   | Not-RECT       | ...   | Not re-ECN-capable         |
|       |     |                |       | transport                  |
| 01    | 0   | Re-Echo        | +1    | Re-echoed congestion and   |
|       |     |                |       | RECT                       |
| 10    | 0   | ---            | ...   | Legacy ECN use only        |
| 11    | 0   | CE(0)          | 0     | Re-Echo canceled by        |
|       |     |                |       | congestion experienced     |
| 00    | 1   | FNE            | +1    | Feedback not established   |
| 01    | 1   | RECT           | 0     | Re-ECN capable transport   |
| 10    | 1   | --CU--         | ...   | Currently unused           |
|       |     |                |       |                            |
| 11    | 1   | CE(-1)         | -1    | Congestion experienced     |
+-------+-----+----------------+-------+----------------------------+

Table 3: 'Worth' of Extended ECN Codepoints

4. Transport Layers

4.1. TCP

Re-ECN capability at the sender is essential. At the receiver it is
optional, as long as the receiver has a basic (`vanilla flavour')
RFC3168-compliant ECN-capable transport (ECT) [RFC3168]. Given
re-ECN is not the first attempt to define the semantics of the ECN
field, we give a table below summarising what happens for various
combinations of capabilities of the sender S and receiver R, as
indicated in the first four columns below.
The last column gives the mode a half-connection should be in after
the first two of the three TCP handshakes.

+--------+---------------+-----------+---------+--------------------+
| Re-ECT | ECT-Nonce     | ECT       | Not-ECT | S-R                |
|        | (RFC3540)     | (RFC3168) |         | Half-connection    |
|        |               |           |         | Mode               |
+--------+---------------+-----------+---------+--------------------+
| SR     |               |           |         | RECN               |
| S      | R             |           |         | RECN-Co            |
| S      |               | R         |         | RECN-Co            |
| S      |               |           | R       | Not-ECT            |
+--------+---------------+-----------+---------+--------------------+

Table 4: Modes of TCP Half-connection for Combinations of ECN
Capabilities of Sender S and Receiver R

We will describe what happens in each mode, then describe how they
are negotiated. The abbreviations for the modes in the above table
mean:

RECN: Full re-ECN capable transport

RECN-Co: Re-ECN sender in compatibility mode with a vanilla [RFC3168]
   ECN receiver or an [RFC3540] ECN nonce-capable receiver.
   Implementation of this mode is OPTIONAL.

Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when
   at least one of the transports does not understand even basic ECN
   marking.

Note that we use the term Re-ECT for a host transport that is
re-ECN-capable but RECN for the modes of the half connections between
hosts when they are both Re-ECT. If a host transport is Re-ECT, this
fact alone does NOT imply either of its half connections will
necessarily be in RECN mode, at least not until it has confirmed that
the other host is Re-ECT.

4.1.1. RECN mode: Full re-ECN capable transport

In full RECN mode, for each half connection, both the sender and the
receiver each maintain an unsigned integer counter we will call ECC
(echo congestion counter). The receiver maintains a count, modulo 8,
of how many times a CE marked packet has arrived during the
half-connection.
Once a RECN connection is established, the three TCP option flags
(ECE, CWR & NS) used for ECN-related functions in previous versions
of ECN are used as a 3-bit field for the receiver to repeatedly tell
the sender the current value of ECC whenever it sends a TCP ACK. We
will call this the echo congestion increment (ECI) field. This
overloaded use of these 3 option flags as one 3-bit ECI field is
shown in Figure 4. The actual definition of the TCP header,
including the addition of support for the ECN nonce, is shown for
comparison in Figure 3. This specification does not redefine the
names of these three TCP option flags; it merely overloads them with
another definition once a flow is established.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the
TCP Header

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 4: Definition of the ECI field within bytes 13 and 14 of the
TCP Header, overloading the current definitions above for established
RECN flows.

Receiver Action in RECN Mode

   Every time a CE marked packet arrives at a receiver in RECN mode,
   the receiver transport increments its local value of ECC modulo 8
   and MUST echo its value to the sender in the ECI field of the next
   ACK.
   It MUST repeat the same value of ECI in every subsequent ACK
   until the next CE event, when it increments ECI again.

   The increment of the local ECC values is modulo 8 so the field
   value simply wraps round back to zero when it overflows. The
   least significant bit is to the right (labelled bit 9).

   A receiver in RECN mode MAY delay the echo of a CE to the next
   delayed-ACK, which would be necessary if ACK-withholding were
   implemented.

Sender Action in RECN Mode

   On the arrival of every ACK, the sender compares the ECI field
   with its own ECC value, then replaces its local value with that
   from the ACK. The difference D is assumed to be the number of CE
   marked packets that arrived at the receiver since it sent the
   previously received ACK (but see below for the sender's safety
   strategy). Whenever the ECI field increments by D (or D drops are
   detected), the sender MUST clear the RE flag to "0" in the IP
   header of the next D data packets it sends, effectively re-echoing
   each single increment of ECI. Otherwise the data sender MUST send
   all data packets with RE set to "1".

   As a general rule, once a flow is established, as well as setting
   or clearing the RE flag as above, a data sender in RECN mode MUST
   always set the ECN field to ECT(1). However, the settings of the
   extended ECN field during flow start are defined in Section 4.1.4.

   As we have already emphasised, the re-ECN protocol makes no
   changes and has no effect on the TCP congestion control algorithm.
   So, each increment of ECI (or detection of a drop) also triggers
   the standard TCP congestion response, but with no more than one
   congestion response per round trip, as usual.

   A TCP sender also acts as the receiver for the other
   half-connection. The host will maintain two ECC values S.ECC and
   R.ECC as sender and receiver respectively.
   Every TCP header sent by a host in RECN mode will also repeat the
   prevailing value of R.ECC in its ECI field. If a sender in RECN
   mode has to retransmit a packet due to a suspected loss, the
   re-transmitted packet MUST carry the latest prevailing value of
   R.ECC when it is re-transmitted, which will not necessarily be the
   one it carried originally.

4.1.1.1. Safety against Long Pure ACK Loss Sequences

The ECI method was chosen for echoing congestion marking because a
re-ECN sender needs to know about every CE mark arriving at the
receiver, not just whether at least one arrives within a round trip
time (which is all the ECE/CWR mechanism supported). And, as pure
ACKs are not protected by TCP reliable delivery, we repeat the same
ECI value in every ACK until it changes. Even if many ACKs in a row
are lost, as soon as one gets through, the ECI field it repeats from
previous ACKs that didn't get through will update the sender on how
many CE marks arrived since the last ACK got through.

The sender will only lose a record of the arrival of a CE mark if all
the ACKs are lost (and all of them were pure ACKs) for a stream of
data long enough to contain 8 or more CE marks. So, if the marking
fraction was p, at least 8/p pure ACKs would have to be lost. For
example, if p was 5%, a sequence of 160 pure ACKs would all have to
be lost. To protect against such extremely unlikely events, if a
re-ECN sender detects a sequence of pure ACKs has been lost it SHOULD
assume the ECI field wrapped as many times as possible within the
sequence.
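
The modulo-8 counting in RECN mode, and the conservative assumption
that this subsection specifies as D' = L - ((L-D) mod 8), can be
sketched as follows. This is only an illustration: the class and
function names are hypothetical, and returning D unchanged when L is
smaller than D is a choice made for this sketch, since the memo does
not discuss that case.

```python
ECI_MODULUS = 8  # ECI is a 3-bit field, so ECC is counted modulo 8


class RecnReceiver:
    """Receiver half: counts CE marks and echoes the count in each ACK."""

    def __init__(self):
        self.ecc = 0

    def on_packet(self, ce_marked):
        if ce_marked:
            self.ecc = (self.ecc + 1) % ECI_MODULUS
        return self.ecc  # value to place in the ECI field of the ACK


def apparent_increment(sender_ecc, eci):
    """D: how far ECI has advanced past the sender's stored ECC."""
    return (eci - sender_ecc) % ECI_MODULUS


def assumed_increment(d, acked_segments):
    """D' = L - ((L - D) mod 8), with L = acked_segments.

    Assumes ECI wrapped as many times as L allows. Returning D when
    L < D is an assumption of this sketch only.
    """
    if acked_segments < d:
        return d
    return acked_segments - ((acked_segments - d) % ECI_MODULUS)


receiver = RecnReceiver()
for _ in range(10):              # ten CE marked packets arrive
    eci = receiver.on_packet(True)

d = apparent_increment(0, eci)   # the counter wrapped once, so D is 2
print(d, assumed_increment(d, 9), assumed_increment(d, 11))
```

The two calls to assumed_increment() replay the two worked examples
given in this subsection (L = 9 and L = 11 with an apparent increase
of 2).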

Specifically, if a re-ECN sender receives an ACK with an
acknowledgement number that acknowledges L segments since the
previous ACK but with a sequence number unchanged from the previously
received ACK, it SHOULD conservatively assume that the ECI field
incremented by D' = L - ((L-D) mod 8), where D is the apparent
increase in the ECI field. For example if the ACK arriving after 9
pure ACK losses apparently increased ECI by 2, the assumed increment
of ECI would still be 2. But if ECI apparently increased by 2 after
11 pure ACK losses, ECI should be assumed to have increased by 10.

A re-ECN sender MAY implement a heuristic algorithm to predict beyond
reasonable doubt that the ECI field probably did not wrap within a
sequence of lost pure ACKs. But such an algorithm is NOT REQUIRED.
Such an algorithm MUST NOT be used unless it is proven to work even
in the presence of correlation between high ACK loss rate on the back
channel and high CE marking rate on the forward channel.

Whatever assumption a re-ECN sender makes about potentially lost CE
marks, both its congestion control and its re-echoing behaviour
SHOULD be consistent with the assumption it makes.

4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver

If the half-connection is in RECN-Co mode, ECN feedback proceeds no
differently to that of vanilla ECN. In other words, the receiver
sets the ECE flag repeatedly in the TCP header and the sender
responds by setting the CWR flag. Although RECN-Co mode is used when
the receiver has not implemented the re-ECN protocol, the sender can
infer enough from its vanilla ECN feedback to set or clear the RE
flag reasonably well.
Specifically, every time the receiver toggles the ECE field from "0"
to "1" (or a loss is detected), as well as setting CWR in the TCP
flags, the re-ECN sender MUST blank the RE flag of the next packet to
"0" as it would do in full RECN mode. Otherwise, the data sender
SHOULD send all other packets with RE set to "1". Once a flow is
established, a re-ECN data sender in RECN-Co mode MUST always set the
ECN field to ECT(1).

If a CE marked packet arrives at the receiver within a round trip
time of a previous mark, the receiver will still be echoing ECE for
the last CE mark. Therefore, such a mark will be missed by the
sender. Of course, this isn't of concern for congestion control, but
it does mean that very occasionally the RE blanking fraction will be
understated. Therefore flows in RECN-Co mode may occasionally be
mistaken for very lightly cheating flows and consequently might
suffer a small number of packet drops through an egress dropper
(Section 6.1.4). We expect re-ECN would be deployed for some time
before policers and droppers start to enforce it. So, given there is
not much ECN deployment yet anyway, this minor problem may affect
only a very small proportion of flows, reducing to nothing over the
years as vanilla ECN hosts upgrade. The use of RECN-Co mode would
need to be reviewed in the light of experience at the time of re-ECN
deployment.

RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their
code simple MAY choose not to implement this mode. If they do not,
a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence
of an ECN-capable receiver. It MAY choose to fall back to the
ECT-Nonce mode, but if re-ECN implementers don't want to be bothered
with RECN-Co mode, they probably won't want to add an ECT-Nonce mode
either.

4.1.2.1. Re-ECN support for the ECN Nonce

A TCP half-connection in RECN-Co mode MUST NOT support the ECN
Nonce [RFC3540]. This means that the sending code of a re-ECN
implementation will never need to include ECN Nonce support. Re-ECN
is intended to provide wider protection than the ECN nonce against
congestion control misbehaviour, and re-ECN only requires support
from the sender, therefore it is preferable to specifically rule out
the need for dual sender implementations. As a consequence, a re-ECN
capable sender will never set ECT(0), so it will be easier for
network elements to discriminate re-ECN traffic flows from other ECN
traffic, which will always contain some ECT(0) packets.

However, a re-ECN implementation MAY OPTIONALLY include receiving
code that complies with the ECN Nonce protocol when interacting with
a sender that supports the ECN nonce (rather than re-ECN), but this
support is NOT REQUIRED.

RFC3540 allows an ECN nonce sender to choose whether to sanction a
receiver that does not ever set the nonce sum. Given re-ECN is
intended to provide wider protection than the ECN nonce against
congestion control misbehaviour, implementers of re-ECN receivers MAY
choose not to implement backwards compatibility with the ECN nonce
capability. This may be because they deem that the risk of sanctions
is low, perhaps because significant deployment of the ECN nonce seems
unlikely at implementation time.

4.1.3. Capability Negotiation

During the TCP hand-shake at the start of a connection, an originator
of the connection (host A) with a re-ECN-capable transport MUST
indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and
ECE=1 in the initial SYN.

A responding Re-ECT host (host B) MUST return a SYN ACK with flags
CWR=1 and ECE=0.
The responding host MUST NOT set this combination of flags unless the
preceding SYN has already indicated Re-ECT support as above. A
Re-ECT server (B) can use either setting of the NS flag combined with
this type of SYN ACK in response to a SYN from a Re-ECT client (A).
Normally a Re-ECT server will reply to a Re-ECT client with NS=0, but
in the special circumstance below it can return a SYN ACK with NS=1.

If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT
server B MUST increment its local value of ECC. But B cannot reflect
the value of ECC in the SYN ACK, because it is still using the 3 bits
to negotiate connection capabilities. So, server B MUST set the
alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0.

These handshakes are summarised in Table 5 below, with X meaning
`don't care'. The handshakes used for the other flavours of ECN are
also shown for comparison. To compress the width of the table, the
headings of the first four columns have been severely abbreviated, as
follows:

R: *R*e-ECT

N: ECT-*N*once (RFC3540)

E: *E*CT (RFC3168)

I: Not-ECT (*I*mplicit congestion notification).

These correspond with the same headings used in Table 4. Indeed, the
resulting modes in the last two columns of the table below are a more
comprehensive way of saying the same thing as Table 4.

+----+---+---+---+------------+-------------+-----------+-----------+
| R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
+----+---+---+---+------------+-------------+-----------+-----------+
|    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
| AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
| A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
| A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
| A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
| B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
| B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
| B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
+----+---+---+---+------------+-------------+-----------+-----------+

Table 5: TCP Capability Negotiation between Originator (A) and
Responder (B)

As soon as a re-ECN capable TCP server receives a SYN, it MUST set
its two half-connections into the modes given in Table 5. As soon as
a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
half-connections into the modes given in Table 5. The
half-connections will remain in these modes for the rest of the
connection, including for the third segment of TCP's three-way
hand-shake (the ACK).

{ToDo: Consider SYNs within a connection.}

Recall that, if the SYN ACK reflects the same flag settings as the
preceding SYN (because there is a broken legacy implementation that
behaves this way), RFC3168 specifies that the whole connection MUST
revert to Not-ECT.

Also note that, whenever the SYN flag of a TCP segment is set
(including when the ACK flag is also set), the NS, CWR and ECE flags
MUST NOT be interpreted as the 3-bit ECI value, which is only set as
a copy of the local ECC value in non-SYN packets.

4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after
       Idle Periods

If the originator (A) of a TCP connection supports re-ECN it MUST set
the extended ECN (EECN) field in the IP header of the initial SYN
packet to the feedback not established (FNE) codepoint.

FNE is a new extended ECN codepoint defined by this specification
(Section 3.2). The feedback not established (FNE) codepoint is used
when the transport does not have the benefit of ECN feedback so it
cannot decide whether to set or clear the RE flag.

If after receiving a SYN the server B has set its sending
half-connection into RECN mode or RECN-Co mode, it MUST set the
extended ECN field in the IP header of its SYN ACK to the feedback
not established (FNE) codepoint. Note the careful wording here,
which means that Re-ECT server B MUST set FNE on a SYN ACK whether it
is responding to a SYN from a Re-ECT client or from a client that is
merely ECN-capable.

The original ECN specification [RFC3168] required SYNs and SYN ACKs
to use the Not-ECT codepoint of the ECN field. The aim was to
prevent well-known DoS attacks such as SYN flooding being able to
gain from the advantage that ECN capability afforded over drop at
ECN-capable routers.

For a SYN ACK, Kuzmanovic [I-D.ietf-tsvwg-ecnsyn] has shown that this
caution was unnecessary, and proposes to allow a SYN ACK to be
ECN-capable to improve performance. We have gone further by
proposing to make the initial SYN ECN-capable too. By stipulating
the FNE codepoint for the initial SYN, we comply with RFC3168 in word
but not in spirit, because we have indeed set the ECN field to
Not-ECT, but we have extended the ECN field with another bit. And it
will be seen (Section 5.3) that we have defined one setting of that
bit to mean an ECN-capable transport.
Therefore, by proposing that the FNE codepoint MUST be used on the
initial SYN of a connection, we have (deliberately) made the initial
SYN ECN-capable. Section 5.4 justifies deciding to make the initial
SYN ECN-capable.

Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will
have already been set on the initial SYN and possibly the SYN ACK as
above. But each re-ECN sender will have to set FNE cautiously on a
few data packets as well, given a number of packets will usually have
to be sent before sufficient congestion feedback is received. The
behaviour will be different depending on the mode of the
half-connection:

RECN mode: Given the constraints on TCP's initial window [RFC3390]
   and its exponential window increase during slow start
   phase [RFC2581], it turns out that the sender SHOULD set FNE on
   the first and third data packets in its flow, assuming equal sized
   data packets once a flow is established. Appendix D presents the
   calculation that led to this conclusion. Below, after running
   through the start of an example TCP session, we give the intuition
   learned from that calculation.

RECN-Co mode: A re-ECT sender that switches into re-ECN compatibility
   mode or into Not-ECT mode (because it has detected the
   corresponding host is not re-ECN capable) MUST limit its initial
   window to 1 segment. The reasoning behind this constraint is
   given in Section 5.4. Having set this initial window, a re-ECN
   sender in RECN-Co mode SHOULD set FNE on the first and third data
   packets in a flow, as for RECN mode.
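
The flow-start rules above amount to a small decision, sketched here
under stated assumptions: the function name and the segment labels
are hypothetical, the SYN ACK case assumes the sending
half-connection is already in RECN or RECN-Co mode, and the sketch
ignores any other path knowledge a transport might hold.

```python
# 1-based positions of the data packets that SHOULD carry FNE.
FNE_DATA_PACKETS = frozenset({1, 3})


def sends_fne(segment, data_index=None):
    """Return True if this segment carries FNE during flow start.

    `segment` is one of "SYN", "SYN-ACK" or "DATA" (hypothetical
    labels); `data_index` counts data packets in the flow from 1.
    """
    if segment in ("SYN", "SYN-ACK"):
        return True  # MUST: no feedback can exist yet
    if segment == "DATA":
        return data_index in FNE_DATA_PACKETS  # SHOULD
    return False


# First seven data packets of a flow: only the 1st and 3rd carry FNE.
schedule = [sends_fne("DATA", i) for i in range(1, 8)]
print(schedule)
```

Every data packet for which this returns False follows the normal
rules for setting or clearing the RE flag in response to feedback.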

+----+------+----------------+-------+-------+---------------+------+
|    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
+----+------+----------------+-------+-------+---------------+------+
|    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
| -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
| 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
|    |      | CWR,ECE,NS     |       |       |               |      |
| 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
|    |      |                |       |       | SYN,ACK,CWR   |      |
| 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
| 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
| 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
| 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
| 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
| 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
| 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
| 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
| 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
| 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
| 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
|    |      | ...            |       |       |               |      |
+----+------+----------------+-------+-------+---------------+------+

Table 6: TCP Session Example #1

Table 6 shows an example TCP session, where the server B sets FNE on
its first and third data packets (lines 5 & 7) as well as on the
initial SYN ACK as previously described. The left hand half of the
table shows the relevant settings of headers sent by client A in
three layers: the TCP payload size; TCP settings; then IP settings.
The right hand half gives equivalent columns for server B. The only
TCP settings shown are the sequence number (SEQ), acknowledgement
number (ACK) and the relevant control (CTL) flags that A sets in the
TCP header. The IP columns show the setting of the extended ECN
(EECN) field.

Also shown on the receiving side of the table is the value of the
receiver's echo congestion counter (R.ECC) after processing the
incoming EECN header. Note that, once a host sets a half-connection
into RECN mode, it MUST initialise its local value of ECC to zero.

The intuition that Appendix D gives for why a sender should set FNE
on the first and third data packets is as follows. At line 13, a
packet sent by B is shown with an '*', which means it has been
congestion marked by an intermediate router from RECT to CE(-1). On
receiving this CE marked packet, client A increments its ECC counter
to 1 as shown. This was the 7th data packet B sent, but before
feedback about this event returns to B, it might well have sent many
more packets. Indeed, during exponential slow start, about as many
packets will be in flight (unacknowledged) as have been acknowledged.
So, when the feedback from the congestion event on B's 7th segment
returns, B will have sent about 7 further packets that will still be
in flight. At that stage, B's best estimate of the network's packet
marking fraction will be 1/7. So, as B will have sent about 14
packets, it should have already marked 2 of them as FNE in order to
have marked 1/7; hence the need to have set the first and third data
packets to FNE.

Client A's behaviour in Table 6 also shows FNE being set on the first
SYN and the first data packet (lines 1 & 4), but in this case it
sends no more data packets, so of course, it cannot, and does not
need to, set FNE again. Note that in the A-B direction there is no
need to set FNE on the third part of the three-way hand-shake (line
3, the ACK).
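
The arithmetic behind this intuition can be replayed in a few lines.
The sketch below only restates the rough estimate given above (as
many packets in flight as acknowledged); it is not the calculation of
Appendix D, and the function name and return layout are assumptions
made for illustration.

```python
from fractions import Fraction


def slow_start_estimate(first_marked, fne_positions=(1, 3)):
    """Replay the rough slow-start estimate for one congestion mark.

    `first_marked` is the 1-based index of the first data packet that
    was congestion marked. Returns (packets sent by the time feedback
    returns, estimated marking fraction, positive packets needed,
    FNE packets already sent).
    """
    sent = 2 * first_marked                 # as many again in flight
    marking_fraction = Fraction(1, first_marked)
    positive_needed = sent * marking_fraction
    positive_sent = sum(1 for p in fne_positions if p <= sent)
    return sent, marking_fraction, positive_needed, positive_sent


sent, fraction, needed, already = slow_start_estimate(7)
print(sent, fraction, needed, already)
```

For the 7th data packet in Table 6 this gives 14 packets sent, an
estimated marking fraction of 1/7, and 2 positive packets needed,
which the FNE settings on the first and third data packets supply.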

Note that in this section we have used the word SHOULD rather than
MUST when specifying how to set FNE on data segments before positive
congestion feedback arrives (but note that the word MUST was used for
FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first
and third data segments to entertain the possibility that the TCP
transport has the benefit of other knowledge of the path, which it
re-uses from one flow for the benefit of a newly starting flow. For
instance, one flow can re-use knowledge of other flows between the
same hosts if using a Congestion Manager [RFC3124] or when a proxy
host aggregates congestion information for large numbers of flows.

After an idle period of more than 1 second, a re-ECN sender transport
MUST set the EECN field of the packet that resumes the connection to
FNE. Note that this next packet may be sent a very long time later;
a packet does NOT have to be sent after 1 second of idling. In order
that the design of network policers can be deterministic, this
specification deliberately puts an absolute lower limit on how long a
connection can be idle before the packet that resumes the connection
must be set to FNE, rather than relating it to the connection round
trip time. We use the lower bound of the retransmission timeout
(RTO) [RFC2988], which is commonly used as the idle period before TCP
must reduce to the restart window [RFC2581]. Note our specification
of re-ECN's idle period is NOT intended to change the idle period for
TCP's restart, nor indeed for any other purposes.

{ToDo: Describe how the sender falls back to legacy modes if packets
don't appear to be getting through (to work round firewalls
discarding packets they consider unusual).}

4.1.5. Pure ACKs, Retransmissions, Window Probes and Partial ACKs

A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
to Not-ECT in pure ACKs, retransmissions and window probes, as
specified in [RFC3168]. Our eventual goal is for all packets to be
sent with re-ECN enabled, and we believe the semantics of the ECI
field go a long way towards being able to achieve this. However, we
have not completed a full security analysis for these cases,
therefore, currently we merely re-state current practice.

We must also reconcile the facts that congestion marking is applied
to packets but acknowledgements cover octet ranges and acknowledged
octet boundaries need not match the transmitted boundaries. The
general principle we work to is to remain compatible with TCP's
congestion control which is driven by congestion events at packet
granularity while at the same time aiming to blank the RE flag on at
least as many octets in a flow as have been marked CE.

Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
times as CE marked packets have been received. And that value MUST
be echoed to the sender in the first available ACK using the ECI
field. This ensures the TCP sender's congestion control receives
timely feedback on congestion events at the same packet granularity
that they were generated on congested routers.

Then, a re-ECN sender stores the difference D between its own ECC
value and the incoming ECI field by incrementing a counter R. Then, R
is decremented by 1 for each subsequent packet that is sent with the
RE flag blanked, until R is no longer positive. Using this
technique, whenever a re-ECN transport sends a not re-ECN capable
(NRECN) packet (e.g. a retransmission), the remaining packets
required to have the RE flag blanked will be automatically carried
over to subsequent packets, through the variable R.
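
A minimal sketch of the counter R follows. The class and method
names are hypothetical, and the codepoint names returned are those of
Table 3; the sketch counts in packets, as the text above does, not in
octets.

```python
class ReEchoSender:
    """Tracks how many packets still owe a blanked RE flag."""

    def __init__(self):
        self.ecc = 0  # last ECI value seen from the receiver
        self.r = 0    # packets still to be sent with RE blanked

    def on_ack(self, eci):
        d = (eci - self.ecc) % 8  # difference D, modulo the 3-bit field
        self.ecc = eci
        self.r += d

    def codepoint_for_next_packet(self, re_ecn_capable=True):
        if not re_ecn_capable:
            # Pure ACK, retransmission or window probe: sent Not-RECT,
            # so R is left untouched and carries over.
            return "Not-RECT"
        if self.r > 0:
            self.r -= 1
            return "Re-Echo"  # RE blanked
        return "RECT"         # RE set


sender = ReEchoSender()
sender.on_ack(2)  # the receiver has seen two CE marks
sequence = [
    sender.codepoint_for_next_packet(),
    sender.codepoint_for_next_packet(re_ecn_capable=False),
    sender.codepoint_for_next_packet(),
    sender.codepoint_for_next_packet(),
]
print(sequence)
```

The retransmission in the second position does not consume any of R,
so the second re-echo is simply deferred to the following data
packet.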
   This does not ensure precisely the same number of octets have RE
   blanked as were CE marked. But we believe positive errors will
   cancel negative over a long enough period. {ToDo: However, more
   research is needed to prove whether this is so. If it is not, it may
   be necessary to increment and decrement R in octets rather than
   packets, by incrementing R as the product of D and the size in octets
   of packets being sent (typically the MSS).}

4.2.  Other Transports

4.2.1.  Guidelines for Adding Re-ECN to Other Transports

   Re-ECT sender transports that have established the receiver transport
   is at least ECN-capable (not necessarily re-ECN capable) MUST blank
   the RE flag in packets carrying at least as many octets as
   arrive at the receiver with the CE codepoint set. Re-ECN-capable
   sender transports should always initialise the ECN field to the
   ECT(1) codepoint once a flow is established.

   If the sender transport does not have sufficient feedback to even
   estimate the path's CE rate, it SHOULD set FNE continuously. If the
   sender transport has some, perhaps stale, feedback to estimate that
   the path's CE rate is almost certainly less than E%, the transport
   MAY blank RE in packets for E% of sent octets, and set the RECT
   codepoint for the remainder.

   {ToDo: Give a brief outline of what would be expected for each of the
   following:

   o  UDP fire and forget (e.g. DNS)

   o  UDP streaming with no feedback

   o  UDP streaming with feedback

   o  DCCP [RFC4340] }

   o  RSVP and/or NSIS: A separate I-D has been submitted [Re-PCN]
      describing how re-ECN can be used in an edge-to-edge rather than
      end-to-end scenario. It can then be used by downstream networks
      to police whether upstream networks are blocking new flow
      reservations when downstream congestion is too high, even though
      the congestion is in other operators' downstream networks. This
      relates to current work in progress on Admission Control over
      Diffserv using Pre-Congestion Notification, being reported to the
      IETF TSVWG [CL-deploy].

5.  Network Layer

5.1.  Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168]. However, an extension to the ECN field we
   call the RE (re-ECN extension) flag (Section 3.2) is defined in this
   document. It doubles the extended ECN codepoint space, giving 8
   potential codepoints. The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7.1 collects together and summarises all the
   changes defined in this document).

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0). Alternatively, some would call this
   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
   header (Figure 5).

                              0   1   2
                            +---+---+---+
                            | R | D | M |
                            | E | F | F |
                            +---+---+---+

   Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at
                the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 3
   and specified fully in Section 4. The RE flag is always considered
   in conjunction with the 2-bit ECN field, as if they were concatenated
   together to form a 3-bit extended ECN field. If the ECN field is set
   to either the ECT(1) or CE codepoint, when the RE flag is blanked
   (cleared to "0") it represents a re-echo of congestion experienced by
   an earlier packet. If the ECN field is set to the Not-ECT codepoint,
   when the RE flag is set to "1" it represents the feedback not
   established (FNE) codepoint, which signals that the packet was sent
   without the benefit of congestion feedback.
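
   As an illustration of how the 3-bit extended ECN field could be read
   from an IPv4 header, the sketch below combines the RE flag proposed
   above (bit 48, i.e. the top bit of the octet at offset 6) with the
   2-bit ECN field of [RFC3168] (the two low-order bits of the octet at
   offset 1). The function name and the short labels "CU" and "Legacy"
   are assumptions made for the example; the other codepoint names
   follow Table 7.

```python
# Extended ECN codepoints keyed by (ECN field, RE flag); names follow
# Table 7, except "CU" and "Legacy", which are shorthand for this sketch.
EECN_NAMES = {
    (0b00, 0): "Not-RECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "Legacy",
    (0b10, 1): "CU",
    (0b11, 0): "CE(0)",
    (0b11, 1): "CE(-1)",
}


def eecn_codepoint(ipv4_header: bytes) -> str:
    """Return the extended ECN codepoint name of a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets long")
    ecn = ipv4_header[1] & 0x03           # low 2 bits of the former TOS octet
    re_flag = (ipv4_header[6] >> 7) & 1   # bit 48 of the header
    return EECN_NAMES[(ecn, re_flag)]


header = bytearray(20)
header[0] = 0x45      # version 4, header length 5 words
header[1] = 0b01      # ECN field = ECT(1)
header[6] = 0x80      # RE flag set
print(eecn_codepoint(bytes(header)))
```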
   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow. For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted. It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks. This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS]. The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e. one occurrence of the flag is sufficient but
   further occurrences achieve the same effect if previous ones were
   lost).

   There will probably be other claims pending on the use of bit 48.
   We know of at least two, [ARI05] and [RFC3514], but neither has
   been pursued in the IETF so far, although the present proposal would
   meet the needs of the former.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately. The present proposal is backward compatible with
   RFC 3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should. So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2.  Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 6).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  |  Option Len   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                   Reserved for future use                   |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option
         Header containing the Re-ECN Extension (RE) Control Flag

                            0 1 2 3 4 5 6 7
                           +-+-+-+-+-+-+-+-+
                           |AIU|C|Option ID|
                           +-+-+-+-+-+-+-+-+

          Figure 7: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes. For
   re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the
   Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header. But
   routers with these optional re-ECN features or a re-ECN policing
   function will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC2402]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.
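
   The layout in Figures 6 and 7 can be sketched as below. This is an
   illustration under stated assumptions rather than a definitive
   encoding: no Option ID has been registered, so the value used here
   (0x1E) is an arbitrary placeholder, both function names are
   hypothetical, and the header is assumed to carry only this one
   option, so it occupies exactly 8 octets (Hdr Ext Len of 0).

```python
def congestion_option_type(option_id: int) -> int:
    """Pack AIU = "00", C = "1" and a 5-bit Option ID into one octet."""
    if not 0 <= option_id < 32:
        raise ValueError("Option ID must fit in 5 bits")
    aiu = 0b00  # skip over the option if it is unrecognized
    c = 1       # Option Data may change en-route
    return (aiu << 6) | (c << 5) | option_id


def build_congestion_hbh(next_header: int, re_flag: bool,
                         option_id: int) -> bytes:
    """Build an 8-octet Hop-by-Hop header holding only this option."""
    option_data = bytes([0x80 if re_flag else 0x00, 0, 0, 0])  # RE is bit 0
    return bytes([
        next_header,
        0,                                  # Hdr Ext Len: 8 octets in total
        congestion_option_type(option_id),
        len(option_data),                   # Option Len = 4
    ]) + option_data


PLACEHOLDER_OPTION_ID = 0x1E  # hypothetical; not an IANA-assigned value
hbh = build_congestion_hbh(next_header=6, re_flag=True,
                           option_id=PLACEHOLDER_OPTION_ID)
print(hbh.hex())
```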
   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. As the RE flag does
   not need end-to-end authentication, we set the C flag to '1'.

   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
   with IANA.}

5.3.  Router Forwarding Behaviour

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined, which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168]. Specifications for PHBs MAY define
   different forwarding behaviours from this default, but they are not
   required to do so. [Re-PCN] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped. Note that the FNE
      codepoint has been intentionally chosen so that, to legacy routers
      (which do not inspect the RE flag), an FNE packet appears to be
      Not-ECT, so it will be dropped by legacy AQM algorithms.

      A network operator MUST NOT configure a router to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in Section 6.1.5 would count as rate limiters
      for this purpose.

   Preferential Drop:  If a re-ECN capable router experiences very high
      load so that it has to drop arriving packets (e.g. a DoS attack),
      it MAY preferentially drop packets within the same Diffserv PHB
      using the preference order for extended ECN codepoints given in
      Table 7. Preferential dropping can be difficult to implement on
      some hardware, but if feasible it would discriminate against
      attack traffic if done as part of the overall policing framework
      of Section 6.1.3. If nowhere else, routers at the egress of a
      network SHOULD implement preferential drop (stronger than the MAY
      above). For simplicity, preferences 4 & 5 MAY be merged into one
      preference level.

   +-------+-----+-----------+-------+------------+--------------------+
   | ECN   | RE  | Extended  | Worth | Drop Pref  | Re-ECN meaning     |
   | field | bit | ECN       |       | (1 = drop  |                    |
   |       |     | codepoint |       | 1st)       |                    |
   +-------+-----+-----------+-------+------------+--------------------+
   | 01    | 0   | Re-Echo   | +1    | 5/4        | Re-echoed          |
   |       |     |           |       |            | congestion and     |
   |       |     |           |       |            | RECT               |
   | 00    | 1   | FNE       | +1    | 4          | Feedback not       |
   |       |     |           |       |            | established        |
   | 11    | 0   | CE(0)     | 0     | 3          | Re-Echo canceled   |
   |       |     |           |       |            | by congestion      |
   |       |     |           |       |            | experienced        |
   | 01    | 1   | RECT      | 0     | 3          | Re-ECN capable     |
   |       |     |           |       |            | transport          |
   | 11    | 1   | CE(-1)    | -1    | 3          | Congestion         |
   |       |     |           |       |            | experienced        |
   | 10    | 1   | --CU--    | n/a   | 2          | Currently Unused   |
   | 10    | 0   | ---       | n/a   | 2          | Legacy ECN use     |
   |       |     |           |       |            | only               |
   | 00    | 0   | Not-RECT  | n/a   | 1          | Not re-ECN-capable |
   |       |     |           |       |            | transport          |
   +-------+-----+-----------+-------+------------+--------------------+

     Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

      The above drop preferences are arranged to preserve packets with
      more positive worth (Section 3.4), given senders of positive
      packets must have honestly declared downstream congestion.
      This is explained fully in Section 6 on applications, particularly
      where the application of re-ECN to protect against DDoS attacks is
      described.

5.4.  Justification for Setting the First SYN to FNE

   Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and
   the initial SYN MUST be set to FNE by Re-ECT client A
   (Section 4.1.4). So an initial SYN may be marked CE(-1) rather than
   dropped. This seems dangerous, because the sender has not yet
   established whether the receiver is a legacy one that does not
   understand congestion marking. It also seems to allow malicious
   senders to take advantage of ECN marking to avoid as much drop when
   launching SYN flooding attacks. Below we explain the features of the
   protocol design that remove both these dangers.

   ECN-capable initial SYN with a Not-ECT server:  If the TCP server B
      is re-ECN capable, provision is made for it to feed back a
      possible congestion marked SYN in the SYN ACK (Section 4.1.4).
      But if the TCP client A finds out from the SYN ACK that the server
      was not ECN-capable, the TCP client MUST consider the first SYN as
      congestion marked before setting itself into Not-ECT mode.
      Section 4.1.4 mandates that such a TCP client MUST also set its
      initial window to 1 segment. In this way we remove the need to
      cautiously avoid setting the first SYN to Not-RECT. This will
      give worse performance while deployment is patchy, but better
      performance once deployment is widespread.

   SYN flooding attacks can't exploit ECN-capability:  Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited. Introduction of the FNE codepoint was a deliberate move
      to enable transport-neutral handling of flow-start and flow state
      set-up in the IP layer where it belongs. It then becomes possible
      to protect against flooding attacks of all forms (not just SYN
      flooding) without transport-specific inspection for things like
      the SYN flag in TCP headers. Then, for instance, SYN flooding
      attacks using IPSec ESP encryption can also be rate limited at the
      IP layer.

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

5.5.  Control and Management

5.5.1.  Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops caused by the
   dropper from drops with other causes, such as congestion or
   transmission losses. The dropper would send the message to the
   sender of the flow, not the receiver. If we did define this message
   type, it would be REQUIRED for all re-ECT senders to parse and
   understand it. Note that a sender MUST only use this message to
   explain why losses are occurring.
   A sender MUST NOT take this message to mean that losses have occurred
   that it was not aware of. Otherwise, spoof messages could be sent by
   malicious sources to slow down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2.  Rate Response Control

   The incentive framework of Section 6.1.3 implies there may be a need
   for a sender to send a request to an ingress policer asking that it
   be allowed to apply a non-default response to congestion (where
   TCP-friendly is assumed to be the default). This would require the
   sender to know what message format(s) to use and to be able to
   discover how to address the policer. The required control
   protocol(s) are outside the scope of this document, but will require
   definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6.  IP in IP Tunnels

   For re-ECN to work correctly through IP in IP tunnels, it needs
   slightly different tunnel handling to regular ECN [RFC3168].
   Ideally, for re-ECN to work through a tunnel, the tunnel entry should
   copy both the RE flag and the ECN field from the inner to the outer
   IP header. Then at the tunnel exit, any congestion marking of the
   outer ECN field should overwrite the inner ECN field (unless the
   inner field is Not-ECT, in which case an alarm should be raised). The
   RE flag shouldn't change along a path, so the outer RE flag should be
   the same as the inner. If it isn't, a management alarm should be
   raised. This behaviour is the same as the full-functionality variant
   of [RFC3168] at tunnel exit, but different at tunnel entry.

   If tunnels are left as they are specified in [RFC3168], whether the
   limited or full-functionality variants are used, a problem arises
   with re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be legacy traffic, and therefore may be wrongly
   rate limited. In a full-functionality ECN tunnel, the result will
   depend on whether the tunnel entry copies the inner RE flag to the
   outer header or the RE flag in the outer header is always cleared.
   If the former, the flow will tend to be too positive when accounted
   for at borders. If the latter, it will be too negative.

   {ToDo: A future version of this draft will discuss the necessary
   changes to IP in IP tunnels in more depth.}

5.7.  Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o  Various link layers support explicit congestion notification, such
      as Frame Relay and ATM. Explicit congestion notification is
      proposed to be added to other link layers, such as Ethernet
      (802.3ar Ethernet congestion management) and MPLS [ECN-MPLS];

   o  Encryption and IPSec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases). Given the
   RE flag is not intended to change along the path, this means that
   downstream congestion will still be measurable at any point where IP
   is processed on the path by subtracting negative from positive
   markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to burrow
   into inner headers. Obfuscation of flow identifiers is not a problem
   for re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid; it only requires them to be unique. So if
   an IPSec encapsulating security payload (ESP [RFC2406]) or an
   authentication header (AH [RFC2402]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are. The alternative would be for the sender to make each packet
   appear to be a new flow, which would require them all to be marked
   FNE in order to avoid being treated with the bulk of malicious flows
   at the egress dropper. Given the FNE marking is worth +1 and
   networks are likely to rate limit FNE packets, endpoints are given an
   incentive not to set FNE on each packet. But if the sender really
   does want to hide the flow relationship between packets it can choose
   to pay the cost of multiple FNE packets, which in the long run will
   compensate for the extra memory required on network policing elements
   to process each flow.

6.  Applications

6.1.  Policing Congestion Response

6.1.1.  The Policing Problem

   The current Internet architecture trusts hosts to respond voluntarily
   to congestion. Limited evidence shows that the large majority of
   end-points on the Internet comply with a TCP-friendly response to
   congestion. But telephony (and increasingly video) services over the
   best efforts Internet are attracting the interest of major commercial
   operations. Most of these applications do not respond to congestion
   at all. Those that can switch to lower rate codecs still have a
   lower bound below which they must become unresponsive to congestion.

   Of course, the Internet is intended to support many different
   application behaviours. But the problem is that this freedom can be
   exercised irresponsibly. The greater problem is that we will never
   be able to agree on where the boundary is between responsible and
   irresponsible. Therefore re-ECN is designed to allow different
   networks to set their own view of the limit to irresponsibility, and
   to allow networks that choose a more conservative limit to push back
   against congestion caused in more liberal networks.

   As an example of the impossibility of setting a standard for
   fairness, mandating TCP-friendliness would set the bar too high for
   unresponsive streaming media, but still some would say the bar was
   too low. Even though all known peer-to-peer filesharing applications
   are TCP-compatible, they can cause a disproportionate amount of
   congestion, simply by using multiple flows and by transferring data
   continuously relative to other short-lived sessions. On the other
   hand, if we swung the other way and set the bar low enough to allow
   streaming media to be unresponsive, we would also allow denial of
   service attacks, which are typically unresponsive to congestion and
   consist of multiple continuous flows.

   Applications that need (or choose) to be unresponsive to congestion
   can effectively take (some would say steal) whatever share of
   bottleneck resources they want from responsive flows. Whether or not
   such free-riding is common, inability to prevent it increases the
   risk of poor returns for investors in network infrastructure, leading
   to under-investment. An increasing proportion of unresponsive or
   free-riding demand coupled with persistent under-supply is a broken
   economic cycle. Therefore, if the current, largely co-operative
   consensus continues to erode, congestion collapse could become more
   common in more areas of the Internet [RFC3714].

   While we have designed re-ECN so that networks can choose to deploy
   stringent policing, this does not imply we advocate that every
   network should introduce tight controls on those that cause
   congestion. Re-ECN has been specifically designed to allow different
   networks to choose how conservative or liberal they wish to be with
   respect to policing congestion. But those that choose to be
   conservative can protect themselves from the excesses that liberal
   networks allow their users.

6.1.2.  The Case Against Bottleneck Policing

   The state of the art in rate policing is the bottleneck policer,
   which is intended to be deployed at any forwarding resource that may
   become congested. Its aim is to detect flows that cause
   significantly more local congestion than others. Although operators
   might solve their immediate problems by deploying bottleneck
   policers, we are concerned that widespread deployment would make it
   extremely hard to evolve new application behaviours. We believe the
   IETF should offer re-ECN as the preferred protocol on which to base
   solutions to the policing problems of operators, because it would not
   harm evolvability and, frankly, it would be far more effective (see
   later for why).

   [XCHOKe] and [pBox] are good approaches to rate policing traffic
   without the benefit of whole path information (such as could be
   provided by re-ECN). But they must be deployed at
   bottlenecks in order to work. Unfortunately, a large proportion of
   traffic traverses at least two bottlenecks (in two access networks),
   particularly with the current traffic mix where peer-to-peer
   file-sharing is prevalent. If ECN were deployed, we believe it would
   be likely that these bottleneck policers would be adapted to combine
   ECN congestion marking from the upstream path with local congestion
   knowledge. But then the only useful placement for such policers
   would be close to the egress of the internetwork.

   Moreover, if these bottleneck policers were widely deployed (which
   would require them to be more effective than they are now), the
   Internet would find itself with one universal rate adaptation policy
   (probably TCP-friendliness) embedded throughout the network.
   Given TCP's congestion control algorithm is already known to be
   hitting its scalability limits and new algorithms are being developed
   for high-speed congestion control, embedding TCP policing into the
   Internet would make evolution to new algorithms extremely painful.
   If a source wanted to use a different algorithm, it would have to
   first discover then negotiate with all the policers on its path,
   particularly those in the far access network. The IETF has already
   traveled that path with the Intserv architecture and found it
   constrains scalability [RFC2208].

   In any case, if bottleneck policers were ever widely deployed, they
   would be likely to be bypassed by determined attackers. They
   inherently have to police fairness per flow or per source-destination
   pair. Therefore they can easily be circumvented either by opening
   multiple flows (by varying the end-point port number); or by spoofing
   the source address but arranging with the receiver to hide the true
   return address at a higher layer.

6.1.3.  Re-ECN Incentive Framework

   The aim is to create an incentive environment that ensures optimal
   sharing of capacity despite everyone acting selfishly (including
   lying and cheating). Of course, the mechanisms put in place for this
   can lie dormant wherever co-operation is the norm.

   Throughout this document we focus on path congestion. But some forms
   of fairness, particularly TCP's, also depend on round trip time. So,
   we also propose to measure downstream path delay using re-feedback.
   This proposal will be published in a simple future draft, but
   for now we give an outline in Appendix F.

   Figure 8 sketches the incentive framework that we will describe piece
   by piece throughout this section. We will do a first pass in
   overview, then return to each piece in detail. We re-use the earlier
   example of how downstream congestion is derived by subtracting
   upstream congestion from path congestion (Figure 1) but depict
   multiple trust boundaries to turn it into an internetwork. For
   clarity, only downstream congestion is shown (the difference between
   the two earlier plots). The graph displays downstream path
   congestion seen in a typical flow as it traverses an example path
   from sender S to receiver R, across networks N1, N2 & N4. Everyone
   is shown using re-ECN correctly, but we intend to show why everyone
   would /choose/ to use it correctly, and honestly.

   Three main types of self-interest can be identified:

   o  Users want to transmit data across the network as fast as
      possible, paying as little as possible for the privilege. In this
      respect, there is no distinction between senders and receivers,
      but we must be wary of potential malice by one on the other;

   o  Network operators want to maximise revenues from the resources
      they invest in. They compete amongst themselves for the custom of
      users;

   o  Attackers (whether users or networks) want to use any opportunity
      to subvert the new re-ECN system for their own gain or to damage
      the service of their victims, whether targeted or random.

     policer
        |
        |
       S <-----N1----> <---N2---> <---N4--> R     domain
        | :                                :
      A\|/:                                :
      | V :                                :
   3% |---------+                          :
      |   :     |                          :
   2% |   :     +-----------------------+  :
      |   :       downstream congestion |  :
   1% |   :                             |  :
      |   :                             |  :
   0% +---------------------------------+=====-->
      0                                 i  ^  resource index
                |                       | /|\
              1.00%                  2.00% |  marking fraction
                                           |
                                        dropper

   Figure 8: Incentive Framework, showing creation of opposing pressures
     to under-declare and over-declare downstream congestion, using a
                          policer and a dropper

   Source congestion control:  We want to ensure that the sender will
      throttle its rate as downstream congestion increases. Whatever
      the agreed congestion response (whether TCP-compatible or some
      enhanced QoS), to some extent it will always be against the
      sender's interest to comply.

   Ingress policing:  But it is in all the network operators' interests
      to encourage fair congestion response, so that their investments
      are employed to satisfy the most valuable demand. The re-ECN
      protocol ensures packets carry the necessary information about
      their own expected downstream congestion so that N1 can deploy a
      policer at its ingress to check that S is complying with whatever
      congestion control it should be using (Section 6.1.5). If N1 is
      extremely conservative it may police each flow, but it can choose
      to just police the bulk amount of congestion each customer causes
      without regard to flows, or if it is extremely liberal it need not
      police congestion control at all. In any case, it is always
      preferable to police traffic at the very first ingress into an
      internetwork, before non-compliant traffic can cause any damage.

   Edge egress dropper:  If the policer ensures the source has less
      right to a high rate the higher it declares downstream congestion,
      the source has a clear incentive to understate downstream
      congestion.
But, if flows of packets are understated when they enter the internetwork, they will have become negative by the time they leave. So, we introduce a dropper at the last network egress, which drops packets in flows that persistently declare negative downstream congestion (see Section 6.1.4 for details).

[Figure 9 (ASCII plot lost in extraction): the same plot of downstream congestion (0% to 3%) against resource index along the path S, N1, N2, N4, R, with marking fractions of 1.00% and 2.00%. Downward arrows labelled "penalties" and "competitive routing" are drawn at the inter-domain borders, and "sanctions" is marked beneath the resource axis.]

Figure 9: Incentives at Inter-domain Borders

Inter-domain traffic policing: But next we must ask, if congestion arises downstream (say in N4), what is the ingress network's (N1's) incentive to police its customers' response? If N1 turns a blind eye, its own customers benefit while other networks suffer. This is why all inter-domain QoS architectures (e.g. Intserv, Diffserv) police traffic each time it crosses a trust boundary. We have already shown that re-ECN gives a trustworthy measure of the expected downstream congestion that a flow will cause by subtracting negative volume from positive at any intermediate point on a path. N4 (say) can use this measure to police all the responses to congestion of all the sources beyond its upstream neighbour (N2), but in bulk with one very simple passive mechanism, rather than per flow, as we will now explain using Figure 9.

Emulating policing with inter-domain congestion penalties: Between high-speed networks, we would rather avoid per-flow policing, and we would rather avoid holding back traffic while it is policed.
Instead, once re-ECN has arranged headers to carry downstream congestion honestly, N2 can contract to pay N4 penalties in proportion to a single bulk count of the congestion metrics crossing their mutual trust boundary (Section 6.1.6). In this way, N4 puts pressure on N2 to suppress downstream congestion, for every flow passing through the border interface, even though they will all start and end in different places, and even though they may all be allowed different responses to congestion. The figure depicts this downward pressure on N2 by the solid downward arrow at the egress of N2. Then N2 has an incentive either to police the congestion response of its own ingress traffic (from N1) or to emulate policing by applying penalties to N1 in turn on the basis of congestion counted at their mutual boundary. In this recursive way, the incentives for each flow to respond correctly to congestion trace back with each flow precisely to each source, despite the mechanism not recognising flows (see Section 6.2.2).

Inter-domain congestion charging diversity: Any two networks are free to agree any of a range of penalty regimes between themselves within the following reasonable constraints. N2 should expect to have to pay penalties to N4 where penalties monotonically increase with the volume of congestion and negative penalties are not allowed. For instance, they may agree an SLA with tiered congestion thresholds, where higher penalties apply the higher the threshold that is broken. But the most obvious (and useful) form of penalty is where N4 levies a charge on N2 proportional to the volume of downstream congestion N2 dumps into N4. In the explanation that follows, we assume this specific variant of volume charging between networks - charging proportionate to the volume of congestion.
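The two penalty regimes just mentioned can be sketched as follows. This is only an illustration of the stated constraints (penalties never decrease as congestion volume grows, and are never negative); the function names, prices and thresholds are hypothetical, and a threshold is assumed to be "broken" once the congestion volume reaches or exceeds it.

```python
from bisect import bisect_right


def proportional_penalty(congestion_octets, price_per_octet):
    """Charge proportional to the volume of downstream congestion; never negative."""
    return price_per_octet * max(congestion_octets, 0)


def tiered_penalty(congestion_octets, thresholds, penalties):
    """Tiered SLA: the penalty is set by the highest threshold that is broken.

    thresholds must be ascending; penalties must be non-negative and
    non-decreasing, one per threshold (both lists are hypothetical inputs).
    """
    broken = bisect_right(thresholds, congestion_octets)  # thresholds reached or exceeded
    return penalties[broken - 1] if broken > 0 else 0.0


# Hypothetical monthly figures for the N2-to-N4 border
print(proportional_penalty(2000, 0.5))
print(tiered_penalty(7000, [1000, 5000], [10.0, 50.0]))
```

Either function could be applied to the single bulk count at a border; neither needs any per-flow information.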
We must make clear that we are not advocating that everyone should use this form of contract. We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. And we strongly share this desire to encourage diversity. But our aim is merely to show that border policing can at least work with this one model; then we can assume that operators might experiment with the metric in other models (see Section 6.1.6 for examples). Of course, operators are free to complement this usage element of their charges with traditional capacity charging, and we expect they will.

No congestion charging to users: Bulk congestion penalties at trust boundaries are passive and extremely simple, and lose none of their per-packet precision from one boundary to the next (unlike Diffserv all-address traffic conditioning agreements, which dissipate their effectiveness across long topologies). But at any trust boundary, there is no imperative to use congestion charging. Traditional traffic policing can be used, if its complexity and cost are preferred. In particular, at the boundary with end customers (e.g. between S and N1), traffic policing will most likely be more appropriate. Policer complexity is less of a concern at the edge of the network. And end-customers are known to be highly averse to the unpredictability of congestion charging.

NOTE WELL: This document neither advocates nor requires congestion charging for end customers and advocates but does not require inter-domain congestion charging.

Competitive discipline of inter-domain traffic engineering: With inter-domain congestion charging, a domain seems to have a perverse incentive to fake congestion; N2's profit depends on the difference between congestion at its ingress (its revenue) and at its egress (its cost).
So, overstating internal congestion seems to increase profit. However, smart border routing [Smart_rtg] by N1 will bias its multipath routing towards the least cost routes. So, N2 risks losing all its revenue to competitive routes if it overstates congestion (see Section 6.2.3). In other words, if N2 is the least congested route, its ability to raise excess profits is limited by the congestion on the next least congested route. This pressure on N2 to remain competitive is represented by the dotted downward arrow at the ingress to N2 in Figure 9.

Closing the loop: All the above elements conspire to trap everyone between two opposing pressures (the downward and upward arrows in Figure 8 & Figure 9), ensuring the downstream congestion metric arrives at the destination neither above nor below zero. So, we have arrived back where we started in our argument. The ingress edge network can rely on downstream congestion declared in the packet headers presented by the sender. So it can police the sender's congestion response accordingly.

Evolvability of congestion control: We have seen that re-ECN enables policing at the very first ingress. We have also seen that, as flows continue on their path through further networks downstream, re-ECN removes the need for further per-domain ingress policing of all the different congestion responses allowed to each different flow. This is why the evolvability of re-ECN policing is so superior to bottleneck policing or to any policing of different QoS for different flows. Even if all access networks choose to conservatively police congestion per flow, each will want to compete with the others to allow new responses to congestion for new types of application.
With re-ECN, each can introduce new controls independently, without coordinating with other networks and without having to standardise anything. But, as we have just seen, by making inter-domain penalties proportionate to bulk downstream congestion, downstream networks can be agnostic to the specific congestion response for each flow, but they can still apply more back-pressure the more liberal the ingress access network has been in the response to congestion it allowed for each flow.

6.1.3.1. The Case against Classic Feedback

A system that produces an optimal outcome as a result of everyone's selfish actions is extremely powerful, especially one that enables evolvability of congestion control. But why do we have to change to re-ECN to achieve it? Can't classic congestion feedback (as used already by standard ECN) be arranged to provide similar incentives and similar evolvability? Superficially it can. Kelly's seminal work showed how we can allow everyone the freedom to evolve whatever congestion control behaviour is in their application's best interest but still optimise the whole system of networks and users by placing a price on congestion to ensure responsible use of this freedom [Evol_cc]. Kelly used ECN with its classic congestion feedback model as the mechanism to convey congestion price information. The mechanism was nearly identical to volume charging; except only the volume of packets marked with congestion experienced (CE) was counted.

However, below we explain why relying on classic feedback /required/ congestion charging to be used, while re-ECN achieves the same powerful outcome (given it is built on Kelly's foundations), but does not /require/ congestion charging.
In brief, the problem with classic feedback is that the incentives have to trace the indirect path back to the sender---the long way round the feedback loop. For example, if classic feedback were used in Figure 8, N2 would have had to influence N1 via all of N4, R & S rather than directly.

Inability to agree what is happening downstream: In order to police its upstream neighbour's congestion response, the neighbours should be able to agree on the congestion to be responded to. Whatever the feedback regime, as packets change hands at each trust boundary, any path metrics they carry are verifiable by both neighbours. But, with a classic path metric, they can only agree on the /upstream/ path congestion.

Inaccessible back-channel: The network needs a whole-path congestion metric if it wants to control the source. Classically, whole path congestion emerges at the destination, to be fed back from receiver to sender in a back-channel. But, in any data network, back-channels need not be visible to relays, as they are essentially communications between the end-points. They may be encrypted, asymmetrically routed or simply omitted, so no network element can reliably intercept them. The congestion charging literature solves this problem by charging the receiver and assuming this will cause the receiver to refer the charges to the sender. But, of course, this creates unintended side-effects...

`Receiver pays' unacceptable: In connectionless datagram networks, receivers and receiving networks cannot prevent reception from malicious senders, so `receiver pays' opens them to `denial of funds' attacks.
End-user congestion charging unacceptable: Even if `denial of funds' were not a problem, we know that end-users are highly averse to the unpredictability of congestion charging and anyway, we want to avoid restricting network operators to just one retail tariff. But with classic feedback only an upstream metric is available, so we cannot avoid having to wrap the `receiver pays' money flow around the feedback loop, necessarily forcing end-users to be subjected to congestion charging.

To summarise so far, with classic feedback, policing congestion response without losing evolvability /requires/ congestion charging of end-users and a `receiver pays' model, whereas, with re-ECN, it is still possible to influence incentives using congestion charging but using the safer `sender pays' model. However, congestion charging is only likely to be appropriate between domains. So, without losing evolvability, re-ECN enables technical policing mechanisms that are more appropriate for end users than congestion pricing.

We now take a second pass over the incentive framework, filling in the detail.

6.1.4. Egress Dropper

As traffic leaves the last network before the receiver (domain N4 in Figure 8), the fraction of positive octets in a flow should match the fraction of negative octets introduced by congestion marking, leaving a balance of zero. If it is less (a negative flow), it implies that the source is understating path congestion (which will reduce the penalties that N2 owes N4).

If flows are positive, N4 need take no action---this simply means its upstream neighbour is paying more penalties than it needs to, and the source is going slower than it needs to. But, to protect itself against persistently negative flows, N4 will need to install a dropper at its egress. Appendix E gives a suggested algorithm for this dropper.
There is no intention that the dropper algorithm needs to be standardised; it is merely provided to show that an efficient, robust algorithm is possible. But whatever algorithm is used must meet the criteria below:

o  It SHOULD introduce minimal false positives for honest flows;

o  It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o  It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o  It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate expected losses.

Note that the dropper operates on flows but we would like it not to require per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a dropper is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to the sender of setting FNE (positive `worth'). Indeed, with the FNE codepoint, the rate at which a sender can generate new flows can be limited (Appendix G). In this respect, the FNE codepoint works like Handley's state set-up bit [Steps_DoS].

Appendix E also gives an example dropper implementation that aggregates flow state. Dropper algorithms will often maintain a moving average across flows of the fraction of RE blanked packets.
When maintaining an average across flows, a dropper SHOULD only allow flows into the average if they start with FNE, but it SHOULD NOT include packets with the FNE codepoint set in the average. A sender sets the FNE codepoint when it does not have the benefit of feedback from the receiver. So, counting packets with FNE cleared would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the dropper detects a persistently negative flow, it SHOULD drop sufficient negative and neutral packets to force the flow to not be negative. Drops SHOULD be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal extra harm.

6.1.5. Rate Policing

Access operators who wish to check that a sender is complying with a particular rate response to congestion can deploy rate policers at the very first ingress to the internetwork. Re-ECN has been designed to avoid the need for bottleneck policing so that we can avoid a future where a single rate adaptation policy is embedded throughout the network. Instead, re-ECN allows the particular rate adaptation policy to be solely agreed bilaterally between the sender and its ingress access provider (Section 5.5.2 discusses possible ways to signal between them), which allows congestion control to be policed, but maintains its evolvability, requiring only a single, local box to be updated.

If desired, the re-ECN protocol allows these ingress policers to perform per-flow policing according to the widely adopted TCP rate adaptation, perhaps as a default. But it also allows new rate adaptation policies beyond TCP to be enforced. Perhaps more usefully, it also allows the flexibility for networks to choose to police users as a whole, rather than flows.
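The option of policing a user as a whole can be pictured as a token bucket that is counted in congestion rather than in bytes sent. The following is only a minimal sketch under assumptions that are not taken from this document: the class name, the fill rate and the bucket depth are hypothetical, and the caller decides which octets count as congestion.

```python
class CongestionBucket:
    """Hypothetical per-user policer: tokens are octets of congestion allowance."""

    def __init__(self, fill_rate, depth):
        self.fill_rate = fill_rate  # congestion octets allowed per second (assumed)
        self.depth = depth          # maximum allowance that can be saved up (assumed)
        self.tokens = depth
        self.last = 0.0             # time of the previous packet, in seconds

    def admit(self, now, congestion_octets):
        """Return True if the packet fits within the user's congestion allowance."""
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if congestion_octets <= self.tokens:
            self.tokens -= congestion_octets
            return True
        return False


bucket = CongestionBucket(fill_rate=100.0, depth=1000.0)
print(bucket.admit(0.0, 600))  # within the initial allowance
print(bucket.admit(0.0, 600))  # exceeds what is left
```

A packet that declares no congestion passes `congestion_octets=0` and is always admitted, so the user's rate is only constrained to the extent that it causes congestion, whichever flows the packets belong to.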
Appendix G gives examples of per-user and per-flow policing algorithms. But there is no implication that these algorithms are to be standardised, or that they are ideal. The ingress rate policer is the part of the re-ECN incentive framework that is intended to be the most flexible. Once endpoint protocol handlers for re-ECN and egress droppers are in place, operators can choose exactly which congestion response they want to police, and whether they want to do it per user, per flow or not at all.

However, if a rate policer is used, it should use path (not downstream) congestion as the relevant metric, which is represented by the fraction of octets in packets with positive (Re-Echo and FNE) and canceled (CE(0)) markings. Of course, re-ECN provides all the information a policer needs directly in the packets being policed. So, even policing TCP's AIMD algorithm is relatively straightforward. Appendix G presents an example design, but the choice of preferred mechanism is up to the implementer.

Note that we have included canceled packets in the measure of path congestion. Canceled packets arise when the sender re-echoes earlier congestion, but then this Re-Echo packet just happens to be congestion marked itself. One would not normally expect many canceled packets at the first ingress because one would not normally expect much congestion marking to have been necessary that soon in the path. However, a home network or campus network may well sit between the sending endpoint and the ingress policer, so some congestion may occur upstream of the policer. And if congestion does occur upstream, some canceled packets should be visible, and should be taken into account in the measure of path congestion.
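The path congestion metric described above can be computed directly from the markings a policer observes. The sketch below is illustrative only; the function name and the representation of packets as (marking, size) pairs are assumptions, while the marking names are those used in this document.

```python
# Markings that count towards path congestion at an ingress policer:
# positive (Re-Echo and FNE) and canceled (CE(0)).
PATH_CONGESTION_MARKINGS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion_fraction(packets):
    """Fraction of octets carried in packets with positive or canceled markings.

    packets is an iterable of (marking, size_in_octets) pairs (assumed format).
    """
    total = 0
    counted = 0
    for marking, size in packets:
        total += size
        if marking in PATH_CONGESTION_MARKINGS:
            counted += size
    return counted / total if total else 0.0


# 100 equal-sized packets: 97 neutral, 2 Re-Echo, 1 canceled
sample = [("RECT", 1500)] * 97 + [("Re-Echo", 1500)] * 2 + [("CE(0)", 1500)]
print(path_congestion_fraction(sample))
```

In this sample the canceled packet raises the measured path congestion from 2% to 3% of octets.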
But a much more important reason for including canceled packets in the measure of path congestion at an ingress policer is that a sender might otherwise subvert the protocol by sending canceled packets instead of neutral (RECT) packets. Like neutral, canceled packets are worth zero, so the sender knows they won't be counted against any quota it might have been allowed. But unlike neutral packets, canceled packets are immune to congestion marking, because they have already been congestion marked. So, it is both correct and useful that canceled packets should be included in a policer's measure of path congestion, as this removes the incentive the sender would otherwise have to mark more packets as canceled than it should.

An ingress policer should also ensure that flows are not already negative when they enter the access network. As with canceled packets, the presence of negative packets will typically be unusual. Therefore it will be easy to detect negative flows at the ingress by just detecting negative packets then monitoring the flow they belong to.

Of course, even if the sender does operate its own network, it may arrange not to congestion mark traffic. Whether the sender does this or not is of no concern to anyone else except the sender. Such a sender will not be policed against its own network's contribution to congestion, but the only resulting problem would be overload in the sender's own network.

Finally, we must not forget that an easy way to circumvent re-ECN's defences is for the source to turn off re-ECN support, by setting the Not-RECT codepoint, implying legacy traffic. Therefore an ingress policer must put a general rate-limit on Not-RECT traffic, which SHOULD be lax during early, patchy deployment, but will have to become stricter as deployment widens.
Similarly, flows starting without an FNE packet can be confined by a strict rate-limit used for the remainder of flows that haven't proved they are well-behaved by starting correctly (therefore they need not consume any flow state---they are just confined to the `misbehaving' bin if they carry an unrecognised flow ID).

6.1.6. Inter-domain Policing

One of the main design goals of re-ECN is for border security mechanisms to be as simple as possible, otherwise they will become the pinch-points that limit scalability of the whole internetwork. We want to avoid per-flow processing at borders and to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding.

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing.

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: CE(-1). Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network. Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix H.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows.
But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 6.1.4 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion. But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o  A host can send traffic with no positive markings towards its intended destination, aiming to transmit as much traffic as any dropper will allow [Bauer06]. It may add forward error correction (FEC) to repair as much drop as it experiences.

o  A host can send dummy traffic into the network with no positive markings and with no intention of communicating with anyone, but merely to cause higher levels of congestion for others who do want to communicate (DoS). So, to ride over the extra congestion, everyone else has to spend more of whatever rights to cause congestion they have been allowed.

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 5.3 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour. It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.
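The basic border accounting described earlier in this section, and the way the last attack in the list above pollutes it, can be illustrated with a short sketch. It is not the pseudo-code of Appendix H.1; the function name, the packet representation and the traffic mix are assumptions made for illustration.

```python
# Worth of each marking as described in this section; all others are worth zero.
WORTH = {"Re-Echo": +1, "FNE": +1, "CE(-1)": -1}


def downstream_congestion_volume(packets):
    """Bulk count at a border: positive volume minus negative volume, in octets.

    packets is an iterable of (marking, size_in_octets) pairs (assumed format);
    flows are deliberately ignored.
    """
    return sum(WORTH.get(marking, 0) * size for marking, size in packets)


# Hypothetical honest aggregate crossing a border
honest = [("Re-Echo", 1500)] * 3 + [("CE(-1)", 1500)] + [("RECT", 1500)] * 96
print(downstream_congestion_volume(honest))

# The upstream network adds negative dummy traffic to shrink the count
dummy = [("CE(-1)", 1500)] * 2
print(downstream_congestion_volume(honest + dummy))
```

With these hypothetical numbers the honest aggregate counts 3000 octets of downstream congestion, and two injected negative packets reduce the count to zero, which is the distortion the following paragraphs set out to neutralise.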
The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix H.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows. Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders---they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to.
But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical---best endeavours will be sufficient.

For instance, let us consider the case where a source sends traffic with no positive markings at all, hoping to at least get as much traffic delivered as network-based droppers will allow. The flow is likely to go at least slightly negative in the first network on the path (N1 if we use the example network layout in Figure 9). If all networks use the algorithm in Appendix H.2 to inflate penalties at their border with an upstream network, they will remove the effect of negative flows. So, for instance, N2 will not be paying a penalty to N1 for this flow. Further, because the flow contributes no positive markings at all, a dropper at the egress will completely remove it.

The remaining problem is that every network is carrying a flow that is causing congestion to others but not being held to account for the congestion it is causing. Whenever the fail-safe border algorithm (Section 6.1.7) or the border algorithm to compensate for negative flows (Appendix H.2) detects a negative flow, it can instantiate a focused dropper for that flow locally. It may be some time before the flow is detected, but the more strongly negative the flow is, the more quickly it will be detected by the fail-safe algorithm. But, in the meantime, it will not be distorting border incentives. Until it is detected, if it contributes to drop anywhere, its packets will tend to be dropped before others if routers use the preferential drop rules in Section 5.3, which discriminate against non-positive packets.
All networks below the point where a flow goes negative (N1, N2 and N4 in this case) have an incentive to remove this flow, but the router where it first goes negative (in N1) can of course remove the problem for everyone downstream.

In the case of DDoS attacks, Section 6.2.1 describes how re-ECN mitigates their force.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but any attempt to discard such flows further upstream while they are still positive is NOT RECOMMENDED. Such over-zealous push-back is unnecessary and potentially dangerous. These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far. If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion. Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine. But not a message that claims that a positive flow will go negative later, so it should be dropped.

6.1.7. Inter-domain Fail-safes

The mechanisms described so far create incentives for rational network operators to behave.
That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

   Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

   Persistently negative flows: A small sample of congestion marked (negative) packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the balance of positive minus negative markings is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both mechanisms rely on the fact that, when the sample is drawn randomly from positive (or negative) packets only, highly positive (or negative) flows will appear in the sample more quickly.

6.1.8. Simulations

Simulations of policer and dropper performance done for the multi-bit version of re-feedback have been included in section 5 "Dropper Performance" of [Re-fb]. Simulations of policer and dropper for the re-ECN version described in this document are work in progress.

6.2. Other Applications

6.2.1. DDoS Mitigation

A flooding attack is inherently about congestion of a resource. Because re-ECN ensures the sources causing network congestion experience the cost of their own actions, it acts as a first line of defence against DDoS. As load focuses on a victim, upstream queues grow, requiring honest sources to pre-load packets with a higher fraction of positive packets. Once downstream routers are so congested that they are dropping traffic, they will be CE-marking 100% of the traffic they do forward. Honest sources will therefore be sending 100% Re-Echo (and will therefore be severely rate-limited at the ingress).

Senders under malicious control can either do the same as honest sources, and be rate-limited at ingress, or they can understate congestion by sending more neutral RECT packets than they should. If sources understate congestion (i.e. do not re-echo sufficient positive packets) and the preferential drop ranking is implemented on routers (Section 5.3), these routers will preserve positive traffic until last. So, the neutral traffic from malicious sources will all be automatically dropped first. Either way, the malicious sources cannot send more than honest sources.

Further, hosts under malicious control will tend to be re-used for many different attacks. They will therefore build up a long term history of causing congestion. Therefore, as long as the population of potentially compromisable hosts around the Internet is limited, the per-user policing algorithms in Appendix G.1 will gradually throttle down zombies and other launchpads for attacks. Therefore, widespread deployment of re-ECN could considerably dampen the force of DDoS. Certainly, zombie armies could hold their fire for long enough to be able to build up enough credit in the per-user policers to launch an attack.
But they would then still be limited to no more throughput than other, honest users.

Inter-domain traffic policing (see Section 6.1.6) ensures that any network that harbours compromised `zombie' hosts will have to bear the cost of the congestion caused by traffic from zombies in downstream networks. Such networks will be incentivised to deploy per-user policers that rate-limit hosts that are unresponsive to congestion so they can only send very slowly into congested paths. As well as protecting other networks, the extremely poor performance at any sign of congestion will incentivise the zombie's owner to clean it up. However, the host should behave normally when using uncongested paths.

Uniquely, re-ECN handles DDoS traffic without relying on the validity of identifiers in packets. Certainly the egress dropper relies on uniqueness of flow identifiers, but not their validity. So if a source spoofs another address, re-ECN works just as well, as long as the attacker cannot imitate all the flow identifiers of another active flow passing through the same dropper (see Section 6.3). Similarly, the ingress policer relies on uniqueness of flow IDs, not their validity. This is because a new flow will only be allowed any rate at all if it starts with FNE, and the more FNE packets there are starting new flows, the more they will be limited. Essentially a re-ECN policer limits the bulk of all congestion entering the network through a physical interface; limiting the congestion caused by each flow is merely an optional extra.

6.2.2. End-to-end QoS

{ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline of the text that will be added here).}

6.2.3. Traffic Engineering

{ToDo: }

6.2.4. Inter-Provider Service Monitoring

{ToDo: }

6.3. Limitations

The known limitations of the re-ECN approach are:

   o  We still cannot defend against the attack described in Section 10 where a malicious source sends negative traffic through the same egress dropper as another flow and imitates its flow identifiers, allowing a malicious source to cause an innocent flow to experience heavy drop.

   o  Re-feedback for TTL (re-TTL) would also be desirable at the same time as re-ECN. Unfortunately this requires a further standards action for the mechanisms briefly described in Appendix F.

   o  Traffic must be ECN-capable for re-ECN to be effective. The only defence against malicious users who turn off ECN capability is that networks are expected to rate limit Not-ECT traffic and to apply higher drop preference to it during congestion. Although these are blunt instruments, they at least represent a feasible scenario for the future Internet where Not-ECT traffic co-exists with re-ECN traffic, but as a severely hobbled under-class. We recommend (Section 7.1) that while accommodating a smooth initial transition to re-ECN, policing policies should gradually be tightened to rate limit Not-ECT traffic more strictly in the longer term.

   o  When checking whether a flow is balancing positive markings with congestion marking, re-ECN can only account for congestion marking, not drops. So, whenever a sender experiences drop, it does not have to re-echo the congestion event. Nonetheless, it is hardly any advantage to be able to send faster than other flows only if your traffic is dropped and the other traffic isn't.

   o  We are considering the issue of whether it would be useful to truncate rather than drop packets that appear to be malicious, so that the feedback loop is not broken but useful data can be removed.

7. Incremental Deployment

7.1. Incremental Deployment Features

The design of the re-ECN protocol started from the fact that the current ECN marking behaviour of routers was sufficient and that re-feedback could be introduced around these routers by changing the sender behaviour but not the routers. If re-ECN had required routers to be changed, the chance of encountering a path that had every router upgraded would be vanishingly small during early deployment, giving no incentive to start deployment. Also, as there is no new forwarding behaviour, routers and hosts do not have to signal or negotiate anything.

However, networks that choose to protect themselves using re-ECN do have to add new security functions at their trust boundaries with others. They distinguish legacy traffic by its ECN field. Traffic from Not-ECT transports is distinguishable by its Not-RECT marking. Traffic from legacy ECN transports is distinguished from re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1) for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on either 50% (the nonce) or 100% (the default) of packets, whereas re-ECN does not use ECT(0) at all. We can use this distinguishing feature of legacy ECN traffic to separate it out for different treatment at the various border security functions: egress dropping, ingress policing and border policing.

The general principle we adopt is that an egress dropper will not drop any legacy traffic, but ingress and border policers will limit the bulk rate of legacy traffic that can enter each network. Then, during early re-ECN deployment, operators can set very permissive (or non-existent) rate-limits on legacy traffic, but once re-ECN implementations are generally available, legacy traffic can be rate-limited increasingly harshly.
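
The separation of legacy traffic described above can be sketched as follows. This is an illustrative fragment, not a definitive implementation: the function names and the returned labels are hypothetical, and the boolean re_flag stands for the RE flag defined earlier in this document, whose extraction from the header is outside the sketch. The two-bit ECN field values are those of RFC3168. Treating every ECT(0) packet as legacy ECN, and leaving CE packets unattributed, are simplifying assumptions.

```python
# ECN field values from RFC 3168: the two least-significant bits of the
# IPv4 TOS octet / IPv6 Traffic Class octet.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def classify(tos_octet, re_flag):
    """Return an illustrative traffic class for one packet (hypothetical labels)."""
    ecn = tos_octet & 0b11
    if ecn == ECT_1:
        return "re-ecn"          # re-ECN senders set ECT(1)
    if ecn == ECT_0:
        return "legacy-ecn"      # RFC 3168 senders set ECT(0)
    if ecn == NOT_ECT:
        # FNE packets carry Not-ECT in the ECN field, so the RE flag is
        # needed to tell them apart from Not-RECT traffic.
        return "fne" if re_flag else "not-rect"
    return "ce"                  # origin cannot be told from the ECN field alone


def is_legacy(tos_octet, re_flag):
    """True for traffic a policer would treat under the legacy rate limit."""
    return classify(tos_octet, re_flag) in ("legacy-ecn", "not-rect")


examples = [
    classify(0b10111010, False),  # DSCP bits set, ECN field = ECT(0)
    classify(0b00000001, True),   # ECN field = ECT(1)
    classify(0b00000000, True),   # Not-ECT with RE set
    classify(0b00000000, False),  # Not-ECT with RE clear
    classify(0b00000011, False),  # CE
]
```

An ingress or border policer could then apply its bulk legacy rate limit only to packets for which is_legacy() is true, while an egress dropper would leave those packets alone.
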
Ultimately, an operator might choose to block all legacy traffic entering its network, or at least only allow through a trickle.

Then, the more strictly the limits are set, the more legacy ECN sources will gain by upgrading to re-ECN. Thus, towards the end of the voluntary incremental deployment period, legacy transports can be given progressively stronger encouragement to upgrade.

The following list of minor changes brings together all the points where re-ECN semantics for use of the two-bit ECN field differ from RFC3168:

   o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender sets ECT(0) by default (Section 3.3);

   o  No provision is necessary for a re-ECN capable source transport to use the ECN nonce (Section 4.1.2.1);

   o  Routers MAY preferentially drop different extended ECN codepoints (Section 5.3);

   o  Packets carrying the feedback not established (FNE) codepoint MAY be marked rather than dropped by routers, even though their ECN field is Not-ECT (with the important caveat in Section 5.3);

   o  Packets may be dropped by policing nodes because of apparent misbehaviour, not just because of congestion (Section 6);

   o  Tunnel entry behaviour is still to be defined, but may have to be different from RFC3168 (Section 5.6).

None of these changes requires any modifications to routers. Also none of these changes affect anything about end to end congestion control; they are all to do with allowing networks to police that end to end congestion control is well-behaved.

7.2. Incremental Deployment Incentives

It would only be worth standardising the re-ECN protocol if there existed a coherent story for how it might be incrementally deployed.
In order for it to have a chance of deployment, everyone who needs to act must have a strong incentive to act, and the incentives must arise in the order that deployment would have to happen. Re-ECN works around unmodified ECN routers, but we can't just discuss why and how re-ECN deployment might build on ECN deployment, because there is precious little to build on in the first place. Instead, we aim to show that re-ECN deployment could carry ECN with it. We focus on commercial deployment incentives, although some of the arguments apply equally to academic or government sectors.

ECN deployment:

   ECN is largely implemented in commercial routers, but generally not as a supported feature, and it has largely not been deployed by commercial network operators. It has been released in many Unix-based operating systems, but not in proprietary OSs like Windows or those in many mobile devices. For detailed deployment status, see [ECN-Deploy]. We believe the reason ECN deployment has not happened is twofold:

   *  ECN requires changes to both routers and hosts. If someone wanted to sell the improvement that ECN offers, they would have to co-ordinate deployment of their product with others. An ECN server only gives any improvement on an ECN network. An ECN network only gives any improvement if used by ECN devices. Deployment that requires co-ordination adds cost and delay and tends to dilute any competitive advantage that might be gained.

   *  ECN `only' gives a performance improvement. Making a product a bit faster (whether the product is a device or a network) isn't usually a sufficient selling point to be worth the cost of co-ordinating across the industry to deploy it. Network operators tend to avoid re-configuring a working network unless launching a new product.

ECN and re-ECN for Edge-to-edge Assured QoS:

   We believe the proposal to provide assured QoS sessions using a form of ECN called pre-congestion notification (PCN) [CL-deploy] is most likely to break the deadlock in ECN deployment first. It only requires edge-to-edge deployment so it does not require endpoint support. It can be deployed in a single network, then grow incrementally to interconnected networks. And it provides a different `product' (internetworked assured QoS), rather than merely making an existing product a bit faster.

   Not only could this assured QoS application kick-start ECN deployment, it could also carry re-ECN deployment with it; because re-ECN can enable the assured QoS region to expand to a large internetwork where neighbouring networks do not trust each other. [Re-PCN] argues that re-ECN security should be built in to the QoS system from the start, explaining why and how.

   If ECN and re-ECN were deployed edge-to-edge for assured QoS, operators would gain valuable experience. They would also clear away many technical obstacles such as firewall configurations that block all but the legacy settings of the ECN field and the RE flag.

ECN in Access Networks:

   The next obstacle to ECN deployment would be extension to access and backhaul networks, where considerable link layer differences make implementation non-trivial, particularly on congested wireless links. ECN and re-ECN work fine during partial deployment, but they will not be very useful if the most congested elements in networks are the last to support them. Access network support is one of the weakest parts of this deployment story. All we can hope is that, once the benefits of ECN are better understood by operators, they will push for the necessary link layer implementations as deployment proceeds.

Policing Unresponsive Flows:

   Re-ECN allows a network to offer differentiated quality of service as explained in Section 6.2.2. But we do not believe this will motivate initial deployment of re-ECN, because the industry is already set on alternative ways of doing QoS. Despite being much more complicated and expensive, the alternative approaches are here and now.

   But re-ECN is critical to QoS deployment in another respect. It can be used to prevent applications from taking whatever bandwidth they choose without asking.

   Currently, applications that remain resolute in their lack of response to congestion are rewarded by other TCP applications. In other words, TCP is naively friendly, in that it reduces its rate in response to congestion whether it is competing with friends (other TCPs) or with enemies (unresponsive applications).

   Therefore, those network owners that want to sell QoS will be keen to ensure that their users can't help themselves to QoS for free. Given the very large revenues at stake, we believe effective policing of congestion response will become highly sought after by network owners.

   But this does not necessarily argue for re-ECN deployment. Network owners might choose to deploy bottleneck policers rather than re-ECN-based policing. However, under Related Work (Section 9) we argue that bottleneck policers are inherently vulnerable to circumvention.

   Therefore we believe there will be a strong demand from network owners for re-ECN deployment so they can police flows that do not ask to be unresponsive to congestion, in order to protect their revenues from flows that do ask (QoS). In particular, we suspect that the operators of cellular networks will want to prevent VoIP and video applications being used freely on their networks as a more open market develops in GPRS and 3G devices.

   Initial deployments are likely to be isolated to single cellular networks. Cellular operators would first place requirements on device manufacturers to include re-ECN in the standards for mobile devices. In parallel, they would put out tenders for ingress and egress policers. Then, after a while they would start to tighten rate limits on Not-ECT traffic from non-standard devices and they would start policing whatever non-accredited applications people might install on mobile devices with re-ECN support in the operating system. This would force even independent mobile device manufacturers to provide re-ECN support. Early standardisation across the cellular operators is likely, including interconnection agreements with penalties for excess downstream congestion.

   We suspect some fixed broadband networks (whether cable or DSL) would follow a similar path. However, we also believe that larger parts of the fixed Internet would not choose to police on a per-flow basis. Some might choose to police congestion on a per-user basis in order to manage heavy peer-to-peer file-sharing, but it seems likely that a sizeable majority would not deploy any form of policing.

   This hybrid situation raises the question, "How does re-ECN work for networks that choose to use policing if they connect with others that don't?" Traffic from non-ECN capable sources will arrive from other networks and cause congestion within the policed, ECN-capable networks. So networks that chose to police congestion would rate-limit Not-ECT traffic throughout their network, particularly at their borders. They would probably also set higher usage prices in their interconnection contracts for incoming Not-ECT and Not-RECT traffic.
   We assume that interconnection contracts between networks in the same tier will include congestion penalties before contracts with provider backbones do.

   A hybrid situation could remain for all time. As was explained in the introduction, we believe in healthy competition between policing and not policing, with no imperative to convert the whole world to the religion of policing. Networks that chose not to deploy egress droppers would leave themselves open to being congested by senders in other networks. But that would be their choice.

   The important aspect of the egress dropper though is that it most protects the network that deploys it. If a network does not deploy an egress dropper, sources sending into it from other networks will be able to understate the congestion they are causing. Whereas, if a network deploys an egress dropper, it can know how much congestion other networks are dumping into it and apply penalties or charges accordingly. So, whether or not a network polices its own sources at ingress, it is in its interests to deploy an egress dropper.

Host support:

   In the above deployment scenario, host operating system support for re-ECN came about through the cellular operators demanding it in device standards (i.e. 3GPP). Of course, increasingly, mobile devices are being built to support multiple wireless technologies. So, if re-ECN were stipulated for cellular devices, it would automatically appear in those devices connected to the wireless fringes of fixed networks if they coupled cellular with WiFi or Bluetooth technology, for instance. Also, once implemented in the operating system of one mobile device, it would tend to be found in other devices using the same family of operating system.

   Therefore, whether or not a fixed network deployed ECN, or deployed re-ECN policers and droppers, many of its hosts might well be using re-ECN over it. Indeed, they would be at an advantage when communicating with hosts across Re-ECN policed networks that rate limited Not-RECT traffic.

Other possible scenarios:

   The above is thankfully not the only plausible scenario we can think of. One of the many clubs of operators that meet regularly around the world might decide to act together to persuade a major operating system manufacturer to implement re-ECN. And they may agree between them on an interconnection model that includes congestion penalties.

   Re-ECN provides an interesting opportunity for device manufacturers as well as network operators. Policers can be configured loosely when first deployed. Then as re-ECN take-up increases, they can be tightened up, so that a network with re-ECN deployed can gradually squeeze down the service provided to legacy devices that have not upgraded to re-ECN. Many device vendors rely on replacement sales. And operating system companies rely heavily on new release sales. Also support services would like to be able to force stragglers to upgrade. So, the ability to throttle service to legacy operating systems is quite valuable.

   Also, policing unresponsive sources may not be the only or even the first application that drives deployment. It may be policing causes of heavy congestion (e.g. peer-to-peer file-sharing). Or it may be mitigation of denial of service. Or we may be wrong in thinking simpler QoS will not be the initial motivation for re-ECN deployment. Indeed, the combined pressure for all these may be the motivator, but it seems optimistic to expect such a level of joined-up thinking from today's communications industry.
   We believe a single application alone must be a sufficient motivator.

   In short, everyone gains from adding accountability to TCP/IP, except the selfish or malicious. So, deployment incentives tend to be strong.

8. Architectural Rationale

In the Internet's technical community the danger of not responding to congestion is well-understood, with its attendant risk of congestion collapse [RFC3714]. However, many of the Internet's commercial community consider that the very essence of IP is to provide open access to the internetwork for all applications. Congestion is seen as a symptom of over-conservative investment. And the goal of application design is to find novel ways to continue working despite congestion. They argue that the Internet was never intended to be solely for TCP-friendly applications. Another side of the Internet's commercial community believes that it is no use providing a network for novel applications if it has insufficient capacity. And it will always have insufficient capacity unless a greater share of application revenues can be /assured/ for the infrastructure provider. Otherwise the major investments required will carry too much risk and won't happen.

The lesson articulated in [Tussle] is that we shouldn't embed our view on these arguments into the Internet at design time. Instead we should design the Internet so that the outcome of these arguments can get decided at run-time. Re-ECN is designed in that spirit. Once the protocol is available, different network operators can choose how liberal they want to be in holding people accountable for the congestion they cause. Some might boldly invest in capacity and not police its use at all, hoping that novel applications will result. Others might use re-ECN for fine-grained flow policing, expecting to make money selling vertically integrated services.
Yet others might sit somewhere half-way, perhaps doing coarse, per-user policing. All might change their minds later. But re-ECN always allows them to interconnect so that the careful ones can protect themselves from the liberal ones.

The incentive-based approach used for re-ECN is based on Gibbens and Kelly's arguments [Evol_cc] on allowing endpoints the freedom to evolve new congestion control algorithms for new applications. They ensured responsible behaviour despite everyone's self-interest by applying pricing to ECN marking, and Kelly had proved stability and optimality in an earlier paper.

Re-ECN keeps all the underlying economic incentives, but rearranges the feedback. The idea is to allow a network operator (if it chooses) to deploy engineering mechanisms like policers at the front of the network which can be designed to behave /as if/ they are responding to congestion prices. Rather than having to subject users to congestion pricing, networks can then use more traditional charging regimes (or novel ones). But the engineering can constrain the overall amount of congestion a user can cause. This provides a buffer against completely outrageous congestion control, but still makes it easy for novel applications to evolve if they need different congestion control to the norms. It also allows novel charging regimes to evolve.

Despite being achieved with a relatively minor protocol change, re-ECN is an architectural change. Previously, Internet congestion could only be controlled by the data sender, because it was the only one both in a position to control the load and in a position to see information on congestion. Re-ECN levels the playing field. It recognises that the network also has a role to play in moderating (policing) congestion control.
But policing is only truly effective at the first ingress into an internetwork, whereas path congestion was previously only visible at the last egress. So, re-ECN democratises congestion information. Then the choice over who actually controls congestion can be made at run-time, not design time---a bit like an aircraft with dual controls. And different operators can make different choices. We believe non-architectural approaches to this problem are unlikely to offer more than partial solutions (see Section 9).

Importantly, re-ECN does not require assumptions about specific congestion responses to be embedded in any network elements, except at the first ingress to the internetwork if that level of control is desired by the ingress operator. But such tight policing will be a matter of agreement between the source and its access network operator. The ingress operator need not police congestion response at flow granularity; it can simply hold a source responsible for the aggregate congestion it causes, perhaps keeping it within a monthly congestion quota. Or if the ingress network trusts the source, it can do nothing.

Therefore, the aim of the re-ECN protocol is NOT solely to police TCP-friendliness. Re-ECN preserves IP as a generic network layer for all sorts of responses to congestion, for all sorts of transports. Re-ECN merely ensures truthful downstream congestion information is available in the network layer for all sorts of accountability applications.

The end to end design principle does not say that all functions should be moved out of the lower layers---only those functions that are not generic to all higher layers. Re-ECN adds a function to the network layer that is generic, but was omitted: accountability for causing congestion. Accountability is not something that an end-user can provide to themselves.
We believe re-ECN adds no more than is sufficient to hold each flow accountable, even if it consists of a single datagram.

"Accountability" implies being able to identify who is responsible for causing congestion. However, at the network layer it would NOT be useful to identify the cause of congestion by adding individual or organisational identity information, NOR by using source IP addresses. Rather than bringing identity information to the point of congestion, we bring downstream congestion information to the point where the cause can be most easily identified and dealt with. That is, at any trust boundary congestion can be associated with the physically connected upstream neighbour that is directly responsible for causing it (whether intentionally or not). A trust boundary interface is exactly the place to police or throttle in order to directly mitigate congestion, rather than having to trace the (ir)responsible party in order to shut them down.

Some considered that ECN itself was a layering violation. The reasoning went that the interface to a layer should provide a service to the higher layer and hide how the lower layer does it. However, ECN reveals the state of the network layer and below to the transport layer. A more positive way to describe ECN is that it is like the return value of a function call to the network layer. It explicitly returns the status of the request to deliver a packet, by returning a value representing the current risk that a packet will not be served. Re-ECN has similar semantics, except the transport layer must try to guess the return value, then it can use the actual return value from the network layer to modify the next guess.

9. Related Work

{Due to lack of time, this section is incomplete.
The reader is referred to the Related Work section of [Re-fb] for a brief selection of related ideas.}

9.1. Policing Rate Response to Congestion

ATM network elements send congestion back-pressure messages [ITU-T.I.371] along each connection, duplicating any end to end feedback because they don't trust it. On the other hand, re-ECN ensures information in forwarded packets can be used for congestion management without requiring a connection-oriented architecture, and it re-uses the overhead of fields that are already set aside for end to end congestion control (and routing loop detection in the case of re-TTL in Appendix F).

We borrowed ideas from policers in the literature [pBox], [XCHOKe], AFD etc. for our rate equation policer. However, without the benefit of re-ECN they don't police the correct rate for the condition of their path. They detect unusually high /absolute/ rates, but only while the policer itself is congested, because they work by detecting prevalent flows in the discards from the local RED queue. These policers must sit at every potential bottleneck, whereas our policer need only be located at each ingress to the internetwork. As Floyd & Fall explain [pBox], the limitation of their approach is that a high sending rate might be perfectly legitimate, if the rest of the path is uncongested or the round trip time is short. Commercially available rate policers cap the rate of any one flow. Or they enforce monthly volume caps in an attempt to control high volume file-sharing. They limit the value a customer derives. They might also limit the congestion customers can cause, but only as an accidental side-effect. They actually punish traffic that fills troughs as much as traffic that causes peaks in utilisation.
   In practice network operators need to be able to allocate service by
   cost during congestion, and by value at other times.

9.2.  Congestion Notification Integrity

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.

   The ECN nonce is an elegant scheme that allows the sender to detect
   if someone in the feedback loop tries to claim no congestion was
   experienced when in fact it was (whether drop or ECN marking). The
   sender chooses between the two ECT codepoints in a pseudo-random
   sequence. Then, whenever the network marks a packet with CE, to deny
   that the congestion happened, the cheater would have to guess which
   ECT codepoint was overwritten, with only a 50:50 chance of being
   correct each time.

   The assumption behind the ECN nonce is that a sender will want to
   detect whether a receiver is suppressing congestion feedback. This
   is only true if the sender's interests are aligned with the
   network's, or with the community of users as a whole. This may be
   true for certain large senders, who are under close scrutiny and have
   a reputation to maintain. But we have to deal with a more hostile
   world, where traffic may be dominated by peer-to-peer transfers,
   rather than downloads from a few popular sites. Often the 'natural'
   self-interest of a sender is not aligned with the interests of other
   users. The sender often wishes to transfer data to the receiver
   quickly just as much as the receiver wants to receive it quickly.

   In contrast, the re-ECN protocol enables policing of an agreed
   rate-response to congestion (e.g. TCP-friendliness) at the sender's
   interface with the internetwork. It also ensures downstream networks
   can police their upstream neighbours, to encourage them to police
   their users in turn.
   But most importantly, it requires the sender to declare path
   congestion to the network and it can remove traffic at the egress if
   this declaration is dishonest. So it can police correctly,
   irrespective of whether the receiver tries to suppress congestion
   feedback or whether the sender ignores genuine congestion feedback.
   Therefore the re-ECN protocol addresses a much wider range of
   cheating problems, which includes the one addressed by the ECN nonce.

9.3.  Identifying Upstream and Downstream Congestion

   Purple [Purple] proposes that routers should use the CWR flag in the
   TCP header of ECN-capable flows to work out path congestion and
   therefore downstream congestion in a similar way to re-ECN. However,
   because CWR is in the transport layer, it is not always visible to
   network layer routers and policers. Purple's motivation was to
   improve AQM, not policing. But, of course, nodes trying to avoid a
   policer would not be expected to allow CWR to be visible.

10.  Security Considerations

   This whole memo concerns the deployment of a secure congestion
   control framework. However, below we list some specific security
   issues that we are still working on:

   o  Malicious users have the ability to launch dynamically changing
      attacks, exploiting the time it takes to detect an attack, given
      ECN marking is binary. We are concentrating on subtle
      interactions between the ingress policer and the egress dropper in
      an effort to make it impossible to game the system.

   o  There is an inherent need for at least some flow state at the
      egress dropper given the binary marking environment, which leads
      to an apparent vulnerability to state exhaustion attacks. An
      egress dropper design with bounded flow state is in write-up.

   o  A malicious source can spoof another user's address and send
      negative traffic to the same destination in order to fool the
      dropper into sanctioning the other user's flow. To prevent or
      mitigate these two different kinds of DoS attack, against the
      dropper and against given flows, we are considering various
      protection mechanisms. Section 5.5.1 discusses one of these.

   o  A malicious client can send requests using a spoofed source
      address to a server (such as a DNS server) that tends to respond
      with single packet responses. This server will then be tricked
      into having to set FNE on the first (and only) packet of all these
      wasted responses. Given packets marked FNE are worth +1, this
      will cause such servers to consume more of their allowance to
      cause congestion than they would wish to. In general, re-ECN is
      deliberately designed so that single packet flows have to bear the
      cost of not discovering the congestion state of their path. One
      of the reasons for introducing re-ECN is to encourage short flows
      to make use of previous path knowledge by moving the cost of this
      lack of knowledge to sources that create short flows. Therefore,
      in the long run we might expect services like DNS to aggregate
      single packet flows into connections where it brings benefits.
      However, this attack where DNS requests are made from spoofed
      addresses genuinely forces the server to waste its resources. The
      only mitigating feature is that the attacker has to set FNE on
      each of its requests if they are to get through an egress dropper
      to a DNS server. The attacker therefore has to consume as many
      resources as the victim, which at least implies re-ECN does not
      unwittingly amplify this attack.

   Having highlighted outstanding security issues, we now explain the
   design decisions that were taken based on a security-related
   rationale.
   It may seem that the six codepoints of the eight made available by
   extending the ECN field with the RE flag have been used rather
   wastefully to encode just five states. In effect the RE flag has
   been used as an orthogonal single bit, using up four codepoints to
   encode the three states of positive, neutral and negative worth.
   The mapping of the codepoints in an earlier version of this proposal
   used the codepoint space more efficiently, but the scheme became
   vulnerable to network operators bypassing congestion penalties by
   focusing congestion marking on positive packets. Appendix B explains
   why fixing that problem, while allowing for incremental deployment,
   would have used another codepoint anyway. So it was better to use
   this orthogonal encoding scheme, which greatly simplified the whole
   protocol and brought with it some subtle security benefits.

   With the scheme as now proposed, once the RE flag is set or cleared
   by the sender or its proxy, it should not be written by the network,
   only read. So the gateways can detect if any network maliciously
   alters the RE flag. IPSec AH integrity checking does not cover the
   IPv4 option flags (they were considered mutable, even the one we
   propose using for the RE flag that was 'currently unused' when IPSec
   was defined). But it would be sufficient for a pair of gateways to
   make random checks on whether the RE flag was the same when it
   reached the egress gateway as when it left the ingress. Indeed, if
   IPSec AH had covered the RE flag, any network intending to alter
   sufficient RE flags to make a gain would have focused its alterations
   on packets without authenticating headers (AHs).

   The security of re-ECN has been deliberately designed to not rely on
   cryptography.

11.  IANA Considerations

   This memo includes no request to IANA (yet).

   If this memo were to progress to standards track, it would list:

   o  The new RE flag in IPv4 (Section 5.1) and its extension with the
      ECN field to create a new set of extended ECN (EECN) codepoints;

   o  The definition of the EECN codepoints for default Diffserv PHBs
      (Section 3.2);

   o  The new extension header for IPv6 (Section 5.2);

   o  The new combinations of flags in the TCP header for capability
      negotiation (Section 4.1.3);

   o  The new ICMP message type (Section 5.5.1).

12.  Conclusions

   {ToDo:}

13.  Acknowledgements

   Sebastien Cazalet and Andrea Soppera contributed to the idea of
   re-feedback. All the following have given helpful comments: Andrea
   Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
   Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
   John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
   Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
   (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
   Handley (who developed the attack with canceled packets), Adam
   Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
   (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
   complemented our own dummy traffic attacks with others), Liz Maida
   (MIT), and comments from participants in the CRN/CFP Broadband and
   DoS-resistant Internet working groups.

14.  Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF Transport Area working group's mailing list,
   and/or to the authors.

15.  References

15.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
              S., Wroclawski, J., and L. Zhang, "Recommendations on
              Queue Management and Congestion Avoidance in the
              Internet", RFC 2309, April 1998.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
              Initial Window", RFC 3390, October 2002.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, June 2003.

15.2.  Informative References

   [ARI05]    Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the
              Internet to Support Real-Time Content Supply from a Large
              Fraction of Broadband Residential Users", BT Technology
              Journal (BTTJ) 23(2), April 2005.

   [Bauer06]  Bauer, S., Faratin, P., and R. Beverly, "Assessing the
              assumptions underlying mechanism design for the Internet",
              Proc. Workshop on the Economics of Networked Systems
              (NetEcon06), June 2006.

   [CL-deploy]
              Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Babiarz, J., Chan, K., Westberg, L., Bader,
              A., and G. Karagiannis, "A Deployment Model for Admission
              Control over DiffServ using Pre-Congestion Notification",
              draft-briscoe-tsvwg-cl-architecture-03 (work in progress),
              June 2006.

   [CLoop_pol]
              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
              Torino and Institut Eurecom Masters Thesis,
              September 2005.

   [ECN-Deploy]
              Floyd, S., "ECN (Explicit Congestion Notification) in
              TCP/IP; Implementation and Deployment of ECN", Web-page,
              May 2004.

   [ECN-MPLS]
              Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
              Marking in MPLS", draft-davie-ecn-mpls-00 (work in
              progress), June 2006.

   [Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the
              evolution of congestion control", Automatica
              35(12)1969--1985, December 1999.

   [I-D.ietf-tsvwg-ecnsyn]
              Kuzmanovic, A., "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets",
              draft-ietf-tsvwg-ecnsyn-00 (work in progress),
              November 2005.

   [ITU-T.I.371]
              ITU-T, "Traffic Control and Congestion Control in
              B-ISDN", ITU-T Rec. I.371 (03/04), March 2004.

   [Jiang02]  Jiang, H. and C. Dovrolis, "Passive Estimation of TCP
              Round-Trip Times", ACM SIGCOMM CCR 32(3)75--88,
              July 2002.

   [Mathis97]
              Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The
              Macroscopic Behavior of the TCP Congestion Avoidance
              Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997.

   [Purple]   Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE:
              Predictive Active Queue Management Utilizing Congestion
              Information", Proc. Local Computer Networks (LCN 2003),
              October 2003.

   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
              ReSerVation Protocol (RSVP) Version 1 Applicability
              Statement Some Guidelines on Deployment", RFC 2208,
              September 1997.

   [RFC2402]  Kent, S. and R. Atkinson, "IP Authentication Header",
              RFC 2402, November 1998.

   [RFC2406]  Kent, S. and R. Atkinson, "IP Encapsulating Security
              Payload (ESP)", RFC 2406, November 1998.

   [RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z.,
              and W. Weiss, "An Architecture for Differentiated
              Services", RFC 2475, December 1998.

   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
              Timer", RFC 2988, November 2000.

   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
              RFC 3124, June 2001.

   [RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header",
              RFC 3514, April 2003.

   [RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion
              Control for Voice Traffic in the Internet", RFC 3714,
              March 2004.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [Re-PCN]   Briscoe, B., "Emulating Border Flow Policing using Re-ECN
              on Bulk Data", draft-briscoe-tsvwg-re-ecn-border-cheat-01
              (work in progress), March 2006.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
              Congestion Response in an Internetwork Using Re-Feedback",
              ACM SIGCOMM CCR 35(4)277--288, August 2005.

   [Smart_rtg]
              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
              "Optimizing Cost and Performance for Multihoming", ACM
              SIGCOMM CCR 34(4)79--92, October 2004.

   [Steps_DoS]
              Handley, M. and A. Greenhalgh, "Steps towards a
              DoS-resistant Internet Architecture", Proc. ACM SIGCOMM
              workshop on Future directions in network architecture
              (FDNA'04) pp 49--56, August 2004.

   [Tussle]   Clark, D., Sollins, K., Wroclawski, J., and R. Braden,
              "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM
              SIGCOMM CCR 32(4)347--356, October 2002.

   [XCHOKe]   Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A.,
              Saran, H., and R. Shorey, "XCHOKe: Malicious Source
              Control for Congestion Avoidance at Internet Gateways",
              Proceedings of IEEE International Conference on Network
              Protocols (ICNP-02), November 2002.

   [pBox]     Floyd, S. and K. Fall, "Promoting the Use of End-to-End
              Congestion Control in the Internet", IEEE/ACM Transactions
              on Networking 7(4)458--472, August 1999.

Appendix A.  Precise Re-ECN Protocol Operation

   The protocol operation described in Section 3.3 was an approximation.
   In fact, standard ECN router marking combines 1% and 2% marking into
   slightly less than 3% whole-path marking, because a router
   deliberately marks a packet CE whether or not it has already been
   marked by another router upstream. So the combined marking fraction
   would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

   To generalise this we will need some notation.

   o  j represents the index of each resource (typically queues) along a
      path, ranging from 0 at the first router to n-1 at the last.

   o  m_j represents the fraction of octets *m*arked CE by a particular
      router (whether or not they are already marked) because of
      congestion of resource j.

   o  u_j represents congestion *u*pstream of resource j, being the
      fraction of CE marking in arriving packet headers (before
      marking).

   o  p_j represents *p*ath congestion, being the fraction of octets
      arriving at resource j in packets with the RE flag blanked
      (excluding Not-RECT packets).

   o  v_j denotes expected congestion downstream of resource j, which
      can be thought of as a *v*irtual marking fraction, being derived
      from two other marking fractions.

   Observed fractions of each particular codepoint (u, p and v) and
   router marking rate m are dimensionless fractions, being the ratio of
   two data volumes (marked and total) over a monitoring period. All
   measurements are in terms of octets, not packets, assuming that line
   resources are more congestible than packet processing.
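
   The combination rule in the 1% and 2% example can be checked
   numerically. The following is a minimal Python sketch, not part of
   the protocol: the function names whole_path_marking and
   downstream_congestion are hypothetical, and it assumes the steady
   state in which the sender's RE blanking fraction p equals the
   whole-path CE marking fraction. The second function simply inverts
   the same product rule, p = 1 - (1 - u_j)(1 - v_j), to recover the
   downstream fraction v_j.

```python
from math import prod


def whole_path_marking(m):
    """Combine per-resource CE marking fractions m_j into one fraction.

    Each router marks independently of earlier marks, so the unmarked
    fractions multiply: 1 - (1 - m_0)(1 - m_1)...
    """
    return 1.0 - prod(1.0 - m_j for m_j in m)


def downstream_congestion(p, u_j):
    """Invert p = 1 - (1 - u_j)(1 - v_j) to obtain v_j (hypothetical helper)."""
    return 1.0 - (1.0 - p) / (1.0 - u_j)


# Two routers marking 1% and 2% respectively (the example of Section 3.3).
m = [0.01, 0.02]
p = whole_path_marking(m)             # path congestion declared by the sender
u_1 = whole_path_marking(m[:1])       # CE fraction arriving at the second router
v_1 = downstream_congestion(p, u_1)   # exact downstream congestion
approx = p - u_1                      # approximation valid when u_1 << 100%

print(f"p = {p:.2%}, v_1 = {v_1:.2%}, p - u_1 = {approx:.2%}")
# prints "p = 2.98%, v_1 = 2.00%, p - u_1 = 1.98%"
```

   The difference between the exact 2.00% and the approximate 1.98%
   shows the size of the error introduced by simple subtraction at
   these marking levels.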

   The path congestion (RE blanking fraction) set by the sender should
   reflect the upstream congestion (CE marking fraction) fed back from
   the destination. Therefore in the steady state

      p_0 = u_n
          = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

   Similarly, at some point j in the middle of the network, if
   p = 1 - (1 - u_j)(1 - v_j), then

      v_j = 1 - (1 - p)/(1 - u_j)

         ~= p - u_j;                  if u_j << 100%

   So, between the two routers in the example in Section 3.3, congestion
   downstream is

      v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
          = 2.00%,

   or a useful approximation of downstream congestion is

      v_1 ~= 2.98% - 1.00%
          ~= 1.98%.

Appendix B.  Justification for Two Codepoints Signifying Zero Worth
             Packets

   It may seem a waste of a codepoint to set aside two codepoints of the
   Extended ECN field to signify zero worth (RECT and CE(0) are both
   worth zero). The justification is subtle, but worth recording.

   The original version of re-ECN ([Re-fb] and draft-00 of this memo)
   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
   negative (CE) packets. The sender set packets to neutral unless
   re-echoing congestion, when it set them positive, in much the same
   way that it blanks the RE flag in the current protocol. However,
   routers were meant to mark congestion by setting packets negative
   (CE) irrespective of whether they had previously been neutral or
   positive.

   However, we did not arrange for senders to remember which packet had
   been sent with which codepoint, or for feedback to say exactly which
   packets arrived with which codepoints. The transport was meant to
   inflate the number of positive packets it sent to allow for a few
   being wiped out by congestion marking.
   We (wrongly) assumed that routers would congestion mark packets
   indiscriminately, so the transport could infer how many positive
   packets had been marked and compensate accordingly by re-echoing.
   But this created a perverse incentive for routers to preferentially
   congestion mark positive packets rather than neutral ones.

   We could have removed this perverse incentive by requiring re-ECN
   senders to remember which packets they had sent with which codepoint,
   and for feedback from the receiver to identify which packets arrived
   as which. Then, if a positive packet was congestion marked to
   negative, the sender could have re-echoed twice to maintain the
   balance between positive and negative at the receiver.

   Instead, we chose to make re-echoing congestion (blanking RE)
   orthogonal to congestion notification (marking CE), which required a
   second neutral codepoint (the orthogonal scheme forms the main square
   of four codepoints in Figure 2). Then the receiver would be able to
   detect and echo a congestion event even if it arrived on a packet
   that had originally been positive.

   If we had added extra complexity to the sender and receiver
   transports to track changes to individual packets, we could have made
   it work, but then routers would have had an incentive to mark
   positive packets with half the probability of neutral packets. That
   in turn would have led router algorithms to become more complex.
   Then senders wouldn't know whether a mark had been introduced by a
   simple or a complex router algorithm. That in turn would have
   required another codepoint to distinguish between legacy ECN and new
   re-ECN router marking.

   Once the cost of IP header codepoint real-estate was the same for
   both schemes, there was no doubt that the simpler option for
   endpoints and for routers should be chosen.
   The resulting protocol also no longer needed the tricky
   inflation/deflation complexity of the original (broken) scheme. It
   was also much simpler to understand conceptually.

   A further advantage of the new orthogonal four-codepoint scheme was
   that senders owned sole rights to change the RE flag and routers
   owned sole rights to change the ECN field. Although we still arrange
   the incentives so neither party strays outside their dominion, these
   clear lines of authority simplify the matter.

   Finally, a little redundancy can be very powerful in a scheme such as
   this. In one flow, the proportion of packets changed to CE should be
   the same as the proportion of RECT packets changed to CE(-1) and the
   proportion of Re-Echo packets changed to CE(0). Double checking
   using such redundant relationships can improve the security of a
   scheme (cf. double-entry book-keeping or the ECN Nonce).
   Alternatively, it might be necessary to exploit the redundancy in the
   future to encode an extra information channel.

Appendix C.  ECN Compatibility

   The rationale for choosing the particular combinations of SYN and SYN
   ACK flags in Section 4.1.3 is as follows.

   Choice of SYN flags:  A re-ECN sender can work with vanilla ECN
      receivers so we wanted to use the same flags as would be used in
      an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same time,
      we wanted a server (host B) that is Re-ECT to be able to recognise
      that the client (A) is also Re-ECT. We believe also setting NS=1
      in the initial SYN achieves both these objectives, as it should be
      ignored by vanilla ECT receivers and by ECT-Nonce receivers. But
      senders that are not Re-ECT should not set NS=1. At the time ECN
      was defined, the NS flag was not defined, so setting NS=1 should
      be ignored by existing ECT receivers (but testing against
      implementations may yet prove otherwise).
      The ECN Nonce RFC [RFC3540] is silent on what the NS field might
      be set to in the TCP SYN, but we believe the intent was for a
      nonce client to set NS=0 in the initial SYN (again only testing
      will tell). Therefore we define a Re-ECN-setup SYN as one with
      NS=1, CWR=1 & ECE=1.

   Choice of SYN ACK flags:  The client (A) needs to be able to
      determine whether the server (B) is Re-ECT. The original ECN
      specification required an ECT server to respond to an ECN-setup
      SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There is no
      room to modify this by setting the NS flag, as that is already set
      in the SYN ACK of an ECT-Nonce server. So we used the only
      combination of CWR and ECE that would not be used by existing TCP
      receivers: CWR=1 and ECE=0. The original ECN specification
      defines this combination as a non-ECN-setup SYN ACK, which remains
      true for vanilla and Nonce ECTs. But for re-ECN we define it as a
      Re-ECN-setup SYN ACK. We didn't use a SYN ACK with both CWR and
      ECE cleared to 0 because that would be the likely response from
      most Not-ECT receivers. And we didn't use a SYN ACK with both CWR
      and ECE set to 1 either, as at least one broken receiver
      implementation echoes whatever flags were in the SYN into its SYN
      ACK. Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1
      & ECE=0.

   Choice of two alternative SYN ACKs:  The NS flag may take either
      value in a Re-ECN-setup SYN ACK. Section 5.4 specifies that a
      Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK
      to echo congestion experienced (CE) on the initial SYN. Otherwise
      a Re-ECN-setup SYN ACK MUST be returned with NS=0. The only
      current known use of the NS flag in a SYN ACK is to indicate
      support for the ECN nonce, which will be negotiated by setting
      CWR=0 & ECE=1.
      Given the ECN nonce MUST NOT be used for a RECN mode connection, a
      Re-ECN-setup SYN ACK can use either setting of the NS flag without
      any risk of confusion, because the CWR & ECE flags will be
      reversed relative to those used by an ECN nonce SYN ACK.

Appendix D.  Packet Marking During Flow Start

   {ToDo: Write up proof that sender should mark FNE on first and third
   data packets, even with the largest allowed initial window.}

Appendix E.  Example Egress Dropper Algorithm

   {ToDo: Write up the basic algorithm with flow state, then the
   aggregated one.}

Appendix F.  Re-TTL

   This Appendix gives an overview of a proposal to be able to overload
   the TTL field in the IP header to monitor downstream propagation
   delay. It is planned to fully write up this proposal in a future
   Internet Draft.

   Delay re-feedback can be achieved by overloading the TTL field,
   without changing IP or router TTL processing. A target value for TTL
   at the destination would need standardising, say 16. If the path hop
   count increased by more than 16 during a routing change, it would
   temporarily be mistaken for a routing loop, so this target would need
   to be chosen to exceed typical hop count increases. The TCP wire
   protocol and handlers would need modifying to feed back the
   destination TTL and initialise it. It would be necessary to
   standardise the unit of TTL in terms of real time (as was the
   original intent in the early days of the Internet).

   In the longer term, precision could be improved if routers
   decremented TTL to represent exact propagation delay to the next
   router. That is, for a router to decrement TTL by, say, 1.8 time
   units it would alternate the decrement of every packet between 1 & 2
   at a ratio of 1:4.
   Although this might sometimes require a seemingly dangerous null
   decrement, a packet in a loop would still decrement to zero after 255
   time units on average. As more routers were upgraded to this more
   accurate TTL decrement, path delay estimates would become
   increasingly accurate despite the presence of some legacy routers
   that continued to always decrement the TTL by 1.

Appendix G.  Policer Designs to ensure Congestion Responsiveness

G.1.  Per-user Policing

   User policing requires a policer on the ingress interface of the
   access router associated with the user. At that point, the traffic
   of the user hasn't diverged on different routes yet; nor has it mixed
   with traffic from other sources.

   In order to ensure that a user doesn't generate more congestion in
   the network than her due share, a modified bulk token-bucket is
   maintained with the following parameters:

   o  b_0 the initial token level

   o  r the filling rate

   o  b_max the bucket depth

   The same token bucket algorithm is used as in many areas of
   networking, but how it is used is very different:

   o  all traffic from a user over the lifetime of their subscription is
      policed in the same token bucket.

   o  only positive and canceled packets (Re-Echo, FNE and CE(0))
      consume tokens.

   Such a policer will allow network operators to throttle the
   contribution of their users to network congestion. This will require
   the appropriate contractual terms to be in place between operators
   and users. For instance: a condition for a user to subscribe to a
   given network service may be that she should not cause more than a
   volume C_user of congestion over a reference period T_user, although
   she may carry forward up to N_user times her allowance at the end of
   each period.
   These terms directly set the parameters of the user policer:

   o  b_0 = C_user

   o  r = C_user/T_user

   o  b_max = b_0 * (N_user + 1)

   Besides the congestion budget policer above, another user policer may
   be necessary to further rate-limit FNE packets, if they are to be
   marked rather than dropped (see discussion in Section 5.3).
   Rate-limiting FNE packets will prevent high bursts of new flow
   arrivals, which is a very useful feature in DoS prevention. A
   condition to subscribe to a given network service would have to be
   that a user should not generate more than C_FNE FNE packets, over a
   reference period T_FNE, with no option to carry forward any of the
   allowance at the end of each period. These terms directly set the
   parameters of the FNE policer:

   o  b_0 = C_FNE

   o  r = C_FNE/T_FNE

   o  b_max = b_0

   T_FNE should be a much shorter period than T_user: for instance T_FNE
   could be in the order of minutes while T_user could be in the order
   of weeks.

G.2.  Per-flow Rate Policing

   Per-flow policing aims to enforce congestion responsiveness on the
   shortest information timescale on a network path: packet roundtrips.

   This again requires that the appropriate terms be agreed between a
   network operator and its users, where a congestion responsiveness
   policy might be required for the use of a given network service
   (perhaps unless the user specifically requests otherwise).

   As an example, we describe below how a rate adaptation policer can be
   designed when the applicable rate adaptation policy is
   TCP-compliance.
   In that context, the average throughput of a flow will be expected
   to be bounded by the value of the TCP throughput during congestion
   avoidance, given by Mathis' formula [Mathis97]

      x_TCP = k * s / ( T * sqrt(m) )

   where:

   o  x_TCP is the throughput of the TCP flow in octets per second (so
      that x_TCP/s is its rate in packets per second),

   o  k is a constant upper-bounded by sqrt(3/2),

   o  s is the average packet size of the flow in octets,

   o  T is the roundtrip time of the flow,

   o  m is the congestion level experienced by the flow.

   We define the marking period N=1/m which represents the average
   number of packets between two positive or canceled packets. Mathis'
   formula can be re-written as:

      x_TCP = k*s*sqrt(N)/T

   We can then get the average inter-mark time in a compliant TCP flow,
   dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

      dt_TCP = sqrt(N)*T/k

   We rely on this equation for the design of a rate-adaptation policer
   as a variation of a token bucket. In that case a policer has to be
   set up for each policed flow. This may be triggered by FNE packets,
   with the remainder of flows being all rate limited together if they
   do not start with an FNE packet.

   Where maintaining per flow state is not a problem, for instance on
   some access routers, systematic per-flow policing may be considered.
   Should per-flow state be more constrained, rate adaptation policing
   could be limited to a random sample of flows exhibiting positive or
   canceled packets.

   As in the case of user policing, only positive or canceled packets
   will consume tokens, however the amount of tokens consumed will
   depend on the congestion signal.
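
   The two formulas above can be evaluated directly. The following is a
   minimal Python sketch under stated assumptions: the function names
   tcp_throughput and inter_mark_time are hypothetical, k is taken at
   its upper bound sqrt(3/2), and s is expressed in octets so that
   x_TCP comes out in octets per second and x_TCP/s is the packet rate
   used when solving for dt_TCP. The example values (1500 octet
   packets, 100 ms roundtrip, 1% congestion) are illustrative only.

```python
from math import sqrt

K_MAX = sqrt(3.0 / 2.0)  # upper bound on the constant k in Mathis' formula


def tcp_throughput(s, T, m, k=K_MAX):
    """x_TCP = k * s / (T * sqrt(m)), in units of s per second."""
    return k * s / (T * sqrt(m))


def inter_mark_time(T, m, k=K_MAX):
    """dt_TCP = sqrt(N) * T / k, with marking period N = 1/m."""
    N = 1.0 / m
    return sqrt(N) * T / k


# Illustrative values only (assumptions, not taken from the memo).
s, T, m = 1500.0, 0.1, 0.01
N = 1.0 / m
x_tcp = tcp_throughput(s, T, m)
dt_tcp = inter_mark_time(T, m)

# Consistency check: packets sent during one inter-mark time equals N.
packets_per_mark = (x_tcp / s) * dt_tcp
print(f"x_TCP = {x_tcp:.0f} octets/s, dt_TCP = {dt_tcp:.4f} s, "
      f"packets per mark = {packets_per_mark:.1f}")
```

   The final check recovers N packets per mark, which is the condition
   (x_TCP/s)*dt_TCP = N from which dt_TCP was derived.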

   When a new rate adaptation policer is set up for flow j, the
   following state is created:

   o  a token bucket b_j of depth b_max starting at level b_0

   o  a timestamp t_j = timenow()

   o  a counter N_j = 0

   o  a roundtrip estimate T_j

   o  a filling rate r

   When the policing node forwards a packet of flow j with no Re-Echo:

   o  the counter is incremented: N_j += 1

   When the policing node forwards a packet of flow j carrying a
   re-echoed congestion mark (a positive or canceled packet):

   o  the counter is incremented: N_j += 1

   o  the token level is adjusted:
      b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k

   o  the counter is reset: N_j = 0

   o  the timer is reset: t_j = timenow()

   An implementation example will be given in a later draft that avoids
   having to extract the square root.

   Analysis: For a TCP flow, for r = 1 token/sec, on average,

      r*(timenow()-t_j) - sqrt(N_j)*T_j/k = dt_TCP - sqrt(N)*T/k = 0

   This means that the token level will fluctuate around its initial
   level. The depth b_max of the bucket sets the timescale on which the
   rate adaptation policy is performed while the filling rate r sets the
   trade-off between responsiveness and robustness:

   o  the higher b_max, the longer it will take to catch greedy flows

   o  the higher r, the fewer false positives (greedy verdict on
      compliant flows) but the more false negatives (compliant verdict
      on greedy flows)

   This rate adaptation policer requires the availability of a roundtrip
   estimate which may be obtained for instance from the application of
   re-feedback to the downstream delay (Appendix F) or passive
   estimation [Jiang02].

   When the bucket of a policer located at the access router (whether it
   is a per-user policer or a per-flow policer) becomes empty, the
   access router SHOULD drop at least all packets causing the token
   level to become negative.
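
   The per-flow state and update rules above can be sketched as
   follows. This is a minimal Python illustration, not a definitive
   implementation. Its assumptions are: the class name FlowPolicer, the
   helper simulate and all default parameter values are hypothetical;
   "congestion mark" is taken to mean a positive or canceled packet, in
   line with the statement that only such packets consume tokens; the
   token level is capped at b_max when it is updated; and only the
   packet that drives the level negative is dropped.

```python
from math import sqrt

K = sqrt(3.0 / 2.0)  # assumed value of k (its upper bound)


class FlowPolicer:
    """Hypothetical per-flow rate adaptation policer; tokens are in seconds."""

    def __init__(self, rtt, now, r=1.0, b_0=5.0, b_max=10.0, k=K):
        self.rtt = rtt      # roundtrip estimate T_j
        self.r = r          # filling rate
        self.b_max = b_max  # bucket depth
        self.k = k
        self.b = b_0        # token level b_j
        self.t = now        # timestamp t_j
        self.n = 0          # counter N_j

    def on_packet(self, now, re_echo):
        """Update state for one packet; return False if it should be dropped."""
        self.n += 1
        if not re_echo:
            return True
        self.b += self.r * (now - self.t) - sqrt(self.n) * self.rtt / self.k
        self.b = min(self.b, self.b_max)  # assumption: cap at bucket depth
        self.n = 0
        self.t = now
        return self.b >= 0.0


def simulate(policer, inter_mark_time, packets_per_mark=100, marks=20):
    """Feed evenly spaced packets; every packets_per_mark-th one re-echoes."""
    now = policer.t
    verdicts = []
    for _ in range(marks):
        for i in range(packets_per_mark):
            now += inter_mark_time / packets_per_mark
            last = i == packets_per_mark - 1
            verdicts.append(policer.on_packet(now, re_echo=last))
    return verdicts


rtt = 0.1
dt_tcp = sqrt(100) * rtt / K            # compliant inter-mark time for N = 100
compliant = FlowPolicer(rtt, now=0.0)
greedy = FlowPolicer(rtt, now=0.0)
ok = simulate(compliant, dt_tcp)        # sends at the TCP-compliant rate
bad = simulate(greedy, dt_tcp / 4)      # sends four times faster
```

   In this sketch the compliant flow leaves the token level at its
   initial value, while the flow sending four times too fast drains
   the bucket and starts to have its re-echoing packets dropped.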
   The network operator MAY take further sanctions if the token level
   of the per-flow policers associated with a user becomes negative.

Appendix H.  Downstream Congestion Metering Algorithms

H.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream congestion in traffic crossing
   an inter-domain border, an algorithm is needed that accumulates the
   size of positive packets and subtracts the size of negative packets.
   We maintain two counters:

   V_b:  accumulated congestion volume

   B:  total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each re-ECN-capable packet {
       b = readLength(packet)        /* set b to packet size        */
       B += b                        /* accumulate total volume     */
       eecn = readEECN(packet)       /* read the EECN field once    */
       if eecn == Re-Echo || eecn == FNE {
           V_b += b                  /* increment...                */
       } elseif eecn == CE(-1) {
           V_b -= b                  /* ...or decrement V_b...      */
       }                             /* ...depending on EECN field  */
   }
   ====================================================================

   At the end of an accounting period, the counter V_b represents the
   congestion volume that penalties could be applied to, as described in
   Section 6.1.6.

   For instance, the accumulated volume of congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream congestion level
   of 1% on an accumulated total data volume of B = 500PB.

H.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple algorithm
   above in order to protect against the various attacks from
   persistently negative flows described in Section 6.1.6.
   As explained in that section, the first and most important step is to
   estimate the contribution of persistently negative flows to the bulk
   volume of downstream pre-congestion and to inflate this bulk volume
   as if these flows weren't there.  The process below has been designed
   to give an unbiased estimate, but it may be possible to define other
   processes that achieve similar ends.

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be able to
   realistically measure but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should be
   taken at different times during the accounting period, preferably
   covering the whole ID space.  During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative packets, maintaining a separate account for each flow in the
   sample.  Each sample should run a lot longer than the large majority
   of flows, to avoid a bias from missing the starts and ends of flows,
   which tend to be positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the sample,
   and the total of the accounts V_{fI} excluding flows with a negative
   account from the subset I.  Then the weighted mean of all these
   samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix H.1), it can be inflated by this factor
   a_S to get a good unbiased estimate of the volume of downstream
   congestion over the accounting period, a_S * V_b, without being
   polluted by the effect of persistently negative flows.

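   The bulk metering algorithm of Appendix H.1 and the inflation factor
   described above can be sketched together in Python as follows.  This
   is an illustration under stated assumptions: packets are represented
   as hypothetical (flow_id, size, eecn) tuples, the EECN codepoints are
   represented by string labels, "other" stands for any codepoint that
   is neither positive nor negative, and the function names are
   invented for the example.  How the sampled subsets I are selected is
   not shown.

```python
from collections import defaultdict

POSITIVE = {"Re-Echo", "FNE"}  # codepoints that add to congestion volume
NEGATIVE = {"CE(-1)"}          # codepoint that subtracts from it


def signed_size(size, eecn):
    """Return +size, -size or 0 depending on the EECN label."""
    if eecn in POSITIVE:
        return size
    if eecn in NEGATIVE:
        return -size
    return 0


def bulk_meter(packets):
    """Return (V_b, B) for an iterable of (flow_id, size, eecn) tuples."""
    v_b = 0
    b_total = 0
    for _flow_id, size, eecn in packets:
        b_total += size
        v_b += signed_size(size, eecn)
    return v_b, b_total


def inflation_factor(samples):
    """Return a_S = sum_I V_fI / sum_I V_bI over a list of samples.

    Each sample is the list of packets observed for one subset I.
    """
    sum_f = 0
    sum_b = 0
    for sample in samples:
        accounts = defaultdict(int)  # one account per flow in the sample
        for flow_id, size, eecn in sample:
            accounts[flow_id] += signed_size(size, eecn)
        sum_b += sum(accounts.values())                        # V_bI
        sum_f += sum(v for v in accounts.values() if v >= 0)   # V_fI
    if sum_b <= 0:
        raise ValueError("sampled congestion volume is not positive")
    return sum_f / sum_b


# One small sample: flow "A" is positive, flow "B" persistently negative.
sample = [
    ("A", 1500, "Re-Echo"),
    ("A", 1500, "other"),
    ("A", 1500, "Re-Echo"),
    ("B", 1500, "CE(-1)"),
    ("B", 1500, "other"),
]

v_b, b_total = bulk_meter(sample)
a_s = inflation_factor([sample])
print("V_b =", v_b, "B =", b_total, "a_S =", a_s, "a_S * V_b =", a_s * v_b)
```

   In this toy sample the negative flow halves the bulk count, and the
   inflation factor restores the congestion volume of the non-negative
   flow.  A real meter would take a_S from many samples and apply it to
   the V_b of the whole accounting period.
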
Authors' Addresses

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   Email: arnaud.jacquet@bt.com

   Alessandro Salvatori
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Email: sandr8@gmail.com

   Martin Koyabe
   BT
   B54/69, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 646923
   Email: martin.koyabe@bt.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.