Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Standards Track                              A. Jacquet
Expires: January 10, 2008                                   A. Salvatori
                                                               M. Koyabe
                                                            T. Moncaster
                                                                      BT
                                                           July 09, 2007


     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-tsvwg-re-ecn-tcp-04

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 10, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   This document introduces a new protocol for explicit congestion
   notification (ECN), termed re-ECN, which can be deployed
   incrementally around unmodified routers. The protocol arranges an
   extended ECN field in each packet so that, as it crosses any
   interface in an internetwork, it will carry a truthful prediction of
   congestion on the remainder of its path. Then the upstream party at
   any trust boundary in the internetwork can be held responsible for
   the congestion they cause, or allow to be caused. So, networks can
   introduce straightforward accountability and policing mechanisms for
   incoming traffic from end-customers or from neighbouring network
   domains.
   The purpose of this document is to specify the re-ECN
   protocol at the IP layer and to give guidelines on any consequent
   changes required to transport protocols. It includes the changes
   required to TCP both as an example and as a specification. It also
   gives examples of mechanisms that can use the protocol to ensure data
   sources respond correctly to congestion. And it describes example
   mechanisms that ensure the dominant selfish strategy of both network
   domains and end-points will be to set the extended ECN field
   honestly.

Authors' Statement: Status (to be removed by the RFC Editor)

   Although the re-ECN protocol is intended to make a simple but
   far-reaching change to the Internet architecture, the most immediate
   priority for the authors is to delay any move of the ECN nonce to
   Proposed Standard status. The argument for this position is
   developed in Appendix I.

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs created using the rfcdiff tool are available at

   From -03 to -04 (current version):

      Clarified reasons for holding back ECN nonce (Section 3.2 &
      Appendix I).

      Clarified Figure 1.

      Added Section 4.1.1.1 on equivalence of drops and ECN marks.

      Improved precision of Section 5.6 on IP in IP tunnels.

      Explained that RTT fairness is possible to enforce, but unlikely
      to be required (Section 6.1.3 & Appendix F).

      Explained that bulk per-user policing should be adequate but
      per-flow policing is also possible if desired, though it is not
      likely to be necessary (Section 6.1.5 & Appendix G).

      Reinforced need for passive policing at inter-domain borders to
      enable all-optical networking (Section 6.1.6).

      Minor editorial changes throughout.

   From -02 to -03:

      Started guidelines for re-ECN support in DCCP and SCTP.

      Added annex on limitations of nonce mechanism.

      Minor editorial changes throughout.

   From -01 to -02:

      Explanation on informal terminology in Section 3.4 clarified.

      IPv6 wire protocol encoding added (Section 5.2).

      Text on (non-)issues with tunnels, encryption and link layer
      congestion notification added (Section 5.6 & Section 5.7).

      Section added giving evolvability arguments against encouraging
      bottleneck policing (Section 6.1.2). And text on re-ECN's
      evolvability by design added to Section 6.1.3.

      Text on inter-domain policing (Section 6.1.6) and inter-domain
      fail-safes (Section 6.1.7) added.

   From -00 to -01:

      Encoding of re-ECN wire protocol changed for reasons given in
      Appendix B and consequently draft substantially re-written.

      Substantial text added in sections on applications, incremental
      deployment, architectural rationale and security considerations.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Protocol Overview
     3.1.  Background and Applicability
     3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     3.3.  Re-ECN Protocol Operation
     3.4.  Informal Terminology
   4.  Transport Layers
     4.1.  TCP
       4.1.1.  RECN mode: Full re-ECN capable transport
       4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT
               Receiver
       4.1.3.  Capability Negotiation
       4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or
               after Idle Periods
       4.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     4.2.  Other Transports
       4.2.1.  General Guidelines for Adding Re-ECN to Other Transports
       4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
       4.2.3.  Guidelines for adding Re-ECN to DCCP
       4.2.4.  Guidelines for adding Re-ECN to SCTP
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  IP in IP Tunnels
     5.7.  Non-Issues
   6.  Applications
     6.1.  Policing Congestion Response
       6.1.1.  The Policing Problem
       6.1.2.  The Case Against Bottleneck Policing
       6.1.3.  Re-ECN Incentive Framework
       6.1.4.  Egress Dropper
       6.1.5.  Policing
       6.1.6.  Inter-domain Policing
       6.1.7.  Inter-domain Fail-safes
       6.1.8.  Simulations
     6.2.  Other Applications
       6.2.1.  DDoS Mitigation
       6.2.2.  End-to-end QoS
       6.2.3.  Traffic Engineering
       6.2.4.  Inter-Provider Service Monitoring
     6.3.  Limitations
   7.  Incremental Deployment
     7.1.  Incremental Deployment Features
     7.2.  Incremental Deployment Incentives
   8.  Architectural Rationale
   9.  Related Work
     9.1.  Policing Rate Response to Congestion
     9.2.  Congestion Notification Integrity
     9.3.  Identifying Upstream and Downstream Congestion
   10. Security Considerations
   11. IANA Considerations
   12. Conclusions
   13. Acknowledgements
   14. Comments Solicited
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  Justification for Two Codepoints Signifying Zero Worth
                Packets
   Appendix C.  ECN Compatibility
   Appendix D.  Packet Marking During Flow Start
   Appendix E.  Example Egress Dropper Algorithm
   Appendix F.  Re-TTL
   Appendix G.  Policer Designs to ensure Congestion Responsiveness
     G.1.  Per-user Policing
     G.2.  Per-flow Rate Policing
   Appendix H.  Downstream Congestion Metering Algorithms
     H.1.  Bulk Downstream Congestion Metering Algorithm
     H.2.  Inflation Factor for Persistently Negative Flows
   Appendix I.  Argument for holding back the ECN nonce
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   This document aims:

   o  To provide a complete specification of the addition of the re-ECN
      protocol to IP and guidelines on how to add it to transport layer
      protocols, including a complete specification of re-ECN in TCP as
      an example;

   o  To show how a number of hard problems become much easier to solve
      once re-ECN is available in IP.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it. But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   Uniquely, re-ECN manages to enable solutions to these problems
   without unduly stifling innovative new ways to use the Internet.
   This was a hard balance to strike, given it could be argued that DDoS
   is an innovative way to use the Internet. The most valuable insight
   was to allow each network to choose the level of constraint it wishes
   to impose.
   Also re-ECN has been carefully designed so that networks
   that choose to use it conservatively can protect themselves against
   the congestion caused in their network by users on other networks
   with more liberal policies.

   For instance, some network owners want to block applications like
   voice and video unless their network is compensated for the extra
   share of bottleneck bandwidth taken. These real-time applications
   tend to be unresponsive when congestion arises. Whereas elastic
   TCP-based applications back away quickly, ending up taking a much
   smaller share of congested capacity for themselves. Other network
   owners want to invest in large amounts of capacity and make their
   gains from simplicity of operation and economies of scale.

   Re-ECN allows the more conservative networks to police out flows that
   have not asked to be unresponsive to congestion---not because they
   are voice or video---just because they don't respond to congestion.
   But it also allows other networks to choose not to police.
   Crucially, when flows from liberal networks cross into a conservative
   network, re-ECN enables the conservative network to apply penalties
   to its neighbouring networks for the congestion they allow to be
   caused. And these penalties can be applied to bulk data, without
   regard to flows.

   Then, if unresponsive applications become so dominant that some of
   the more liberal networks experience congestion collapse [RFC3714],
   they can change their minds and use re-ECN to apply tighter controls
   in order to bring congestion back under control.

   Re-ECN works by arranging that each packet arrives at each network
   element carrying a view of expected congestion on its own downstream
   path, albeit averaged over multiple packets. Most usefully,
   congestion on the remainder of the path becomes visible in the IP
   header at the first ingress.
   Many of the applications of re-ECN
   involve a policer at this ingress using the view of downstream
   congestion arriving in packets to police or control the packet rate.

   Importantly, the scheme is recursive: a whole network harbouring
   users causing congestion in downstream networks can be held
   responsible or policed by its downstream neighbour.

   This document is structured as follows. First an overview of the
   re-ECN protocol is given (Section 3), outlining its attributes and
   explaining conceptually how it works as a whole. The two main parts
   of the document follow, as described above. That is, the protocol
   specification divided into transport (Section 4) and network
   (Section 5) layers, then the applications it can be put to, such as
   policing DDoS, QoS and congestion control (Section 6). Although
   these applications do not require standardisation themselves, they
   are described in a fair degree of detail in order to explain how
   re-ECN can be used. Given re-ECN proposes to use the last undefined
   bit in the IPv4 header, we felt it necessary to outline the potential
   that re-ECN could release in return for being given that bit.

   Deployment issues discussed throughout the document are brought
   together in Section 7, which is followed by a brief section
   explaining the somewhat subtle rationale for the design from an
   architectural perspective (Section 8). We end by describing related
   work (Section 9), listing security considerations (Section 10) and
   finally drawing conclusions (Section 12).

2. Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   This document first specifies a protocol, then describes a framework
   that creates the right incentives to ensure compliance to the
   protocol.
   This could cause confusion because the second part of the
   document considers many cases where malicious nodes may not comply
   with the protocol. When such contingencies are described, if any of
   the above keywords are not capitalised, that is deliberate. So, for
   instance, the following two apparently contradictory sentences would
   be perfectly consistent: i) x MUST do this; ii) x may not do this.

3. Protocol Overview

3.1. Background and Applicability

   First we briefly recap the essentials of the ECN protocol [RFC3168].
   Two bits in the IP protocol (v4 or v6) are assigned to the ECN field.
   The sender clears the field to "00" (Not-ECT) if either end-point
   transport is not ECN-capable. Otherwise it indicates an ECN-capable
   transport (ECT) using either of the two code-points "10" or "01"
   (ECT(0) and ECT(1) resp.).

   ECN-capable routers probabilistically set "11" if congestion is
   experienced (CE), the marking probability increasing with the length
   of the queue at its egress link (typically using the RED
   algorithm [RFC2309]). However, they still drop rather than mark
   Not-ECT packets. With multiple ECN-capable routers on a path, a flow
   of packets accumulates the fraction of CE marking that each router
   adds. The combined effect of the packet marking of all the routers
   along the path signals congestion of the whole path to the receiver.
   So, for example, if one router early in a path is marking 1% of
   packets and another later in a path is marking 2%, flows that pass
   through both routers will experience approximately 3% marking (see
   Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback.
   But Section 9.2 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback. On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called `re-feedback'
   [Re-fb]. The word is short for either receiver-aligned,
   re-inserted or re-echoed feedback. But it actually works even when
   no feedback is available. In fact it has been carefully designed to
   work for single datagram flows. It also encourages aggregation of
   single packet flows by congestion control proxies. Then, even if the
   traffic mix of the Internet were to become dominated by short
   messages, it would still be possible to control congestion
   effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval. But re-ECN can be deployed incrementally at
   the transport layer around unmodified routers using existing fields
   in IP (v4 or v6). However it does also require the last undefined
   bit in the IPv4 header, which it uses in combination with the 2-bit
   ECN field to create four new codepoints. Nonetheless, changes to IP
   routers are RECOMMENDED in order to improve resilience against DoS
   attacks. Similarly, re-ECN works best if both the sender and
   receiver transports are re-ECN-capable, but it can work with just
   sender support. Section 7.1 summarises the incremental deployment
   strategy.

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion. Re-ECN is only concerned with enabling the ingress
   network to police that a source is complying with a congestion
   control algorithm, which is orthogonal to congestion control itself.

   Before re-ECN can be considered worthy of using up the last bit in
   the IP header, we must be sure that all our claims are robust. We
   have gradually been reducing the list of outstanding issues, but the
   few that still remain are listed in Section 6.3. We expect new
   attacks may still be found, but we offer the re-ECN protocol on the
   basis that it is built on fairly solid theoretical foundations and,
   so far, it has proved possible to keep it relatively robust.

3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7.1). This
   specification defines a new re-ECN extension (RE) flag. We will
   defer the definition of the actual position of the RE flag in the
   IPv4 & v6 headers until Section 5. Until then it will suffice to use
   an abstraction of the IPv4 and v6 wire protocols by just calling it
   the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and remain unchanged along the path, although it can be read by
   network elements that understand the re-ECN protocol. It is feasible
   that a network element MAY change the setting of the RE flag, perhaps
   acting as a proxy for an end-point, but such a protocol would have to
   be defined in another specification (e.g. [Re-PCN]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) making eight
   codepoints. We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.
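
   As an illustration only, reading the RE flag as an extension of the
   two-bit ECN field can be sketched as a simple lookup. The codepoint
   names are those of Table 1; the dictionary and function names are
   hypothetical and are not part of this specification, and the sketch
   says nothing about where the RE flag sits in the header (that is
   deferred to Section 5).

```python
# Illustrative sketch only; names below are assumptions, not spec text.
# Keys are (two-bit ECN field, RE flag); values are the extended ECN
# codepoint names listed in Table 1 of this memo.
EECN_CODEPOINTS = {
    (0b00, 0): "Not-RECT",  # not re-ECN-capable transport
    (0b00, 1): "FNE",       # feedback not established
    (0b01, 0): "Re-Echo",   # re-echoed congestion and RECT
    (0b01, 1): "RECT",      # re-ECN capable transport
    (0b10, 0): "---",       # legacy ECN use only (no re-ECN name)
    (0b10, 1): "--CU--",    # currently unused
    (0b11, 0): "CE(0)",     # Re-Echo canceled by congestion experienced
    (0b11, 1): "CE(-1)",    # congestion experienced
}

# RFC3168 names, used when the RE flag setting is "don't care".
RFC3168_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def eecn_codepoint(ecn_field: int, re_flag: int) -> str:
    """Return the extended ECN codepoint name for an ECN field and RE flag."""
    if ecn_field not in RFC3168_NAMES or re_flag not in (0, 1):
        raise ValueError("ECN field must be 0..3 and RE flag 0 or 1")
    return EECN_CODEPOINTS[(ecn_field, re_flag)]


def rfc3168_name(ecn_field: int) -> str:
    """Return the RFC3168 name of the two-bit ECN field alone."""
    return RFC3168_NAMES[ecn_field]
```

   For example, eecn_codepoint(0b01, 1) gives the RECT codepoint, while
   rfc3168_name(0b01) gives ECT(1) regardless of the RE flag.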

   RFC3168 ECN defines uses for all four codepoints of the two-bit ECN
   field. This memo widens the codepoint space to eight, and uses six
   codepoints. One of re-ECN's codepoints is an alternative use of the
   codepoint set aside in RFC3168 for the ECN nonce (ECT(1)).
   Transports not using re-ECN can still use the ECN nonce, while those
   using re-ECN do not need to as long as the sender is also checking
   for transport protocol compliance [I-D.moncaster-tcpm-rcv-cheat].
   The case for doing this is given in Appendix I. Two re-ECN
   codepoints are given compatible uses to those defined in RFC3168
   (Not-ECT and CE). The other codepoint used by RFC3168 (ECT(0)) isn't
   used for re-ECN. Altogether this leaves one codepoint of the eight
   unused and available for future use.

   +-------+------------+------+--------------+------------------------+
   | ECN   | RFC3168    | RE   | Extended ECN | Re-ECN meaning         |
   | field | codepoint  | flag | codepoint    |                        |
   +-------+------------+------+--------------+------------------------+
   | 00    | Not-ECT    | 0    | Not-RECT     | Not re-ECN-capable     |
   |       |            |      |              | transport              |
   | 00    | Not-ECT    | 1    | FNE          | Feedback not           |
   |       |            |      |              | established            |
   | 01    | ECT(1)     | 0    | Re-Echo      | Re-echoed congestion   |
   |       |            |      |              | and RECT               |
   | 01    | ECT(1)     | 1    | RECT         | Re-ECN capable         |
   |       |            |      |              | transport              |
   | 10    | ECT(0)     | 0    | ---          | Legacy ECN use only    |
   |       |            |      |              |                        |
   | 10    | ECT(0)     | 1    | --CU--       | Currently unused       |
   |       |            |      |              |                        |
   | 11    | CE         | 0    | CE(0)        | Re-Echo canceled by    |
   |       |            |      |              | congestion experienced |
   | 11    | CE         | 1    | CE(-1)       | Congestion experienced |
   +-------+------------+------+--------------+------------------------+

                     Table 1: Extended ECN Codepoints

3.3. Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections.
   Other transports will be discussed later.

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol. Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sets the feedback not established (FNE)
   codepoint. The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once a flow is established, a re-ECN sender always
   initialises the ECN field to ECT(1). And it usually sets the RE flag
   to "1". Whenever a router re-marks a packet to CE, the receiver
   feeds back this event to the sender. On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends.

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7.1). To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0". So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

          _      _                      _      _
        /   \  /   \                  /   \  /   \
        | S |--| 0 | - - - - - - - -  | i |--| D |
        \ _ /  \ _ /                  \ _ /  \ _ /
          .      .                      .      .
        ^ .      .                      .      .
        | .      .                      .      .
        | .      RE blanking fraction   .      .
     3% |-------------------------------+=======
        | .      .                      |      .
     2% | .      .                      |      .
        | .      . CE marking fraction  |      .
     1% | .      +----------------------+      .
        | .      |                      .      .
     0% +--------------------------------------->
             ^   0          ^           i   ^     resource index
             0   ^          1           ^   2     observation points
                 |                      |
               1.00%                  2.00%       marking fraction

                 Figure 1: A 2-Router Example (Imprecise)

   Figure 1 uses a simple network to illustrate how re-ECN allows
   routers to measure downstream congestion. The horizontal axis
   represents the index of each congestible resource (typically queues)
   along a path through the Internet. There may be many routers on the
   path, but we assume only two are currently congested (those with
   resource index 0 and i). The two superimposed plots show the
   fraction of each extended ECN codepoint in a flow observed along this
   path. Given about 3% of packets reaching the destination are marked
   CE, in response to feedback the sender will blank the RE flag in
   about 3% of packets it sends. Then approximate downstream congestion
   can be measured at the observation points shown along the path by
   subtracting the CE marking fraction from the RE blanking fraction, as
   shown in the table below (Appendix A derives these approximations
   from a precise analysis).

           +-------------------+------------------------------+
           | Observation point | Approx downstream congestion |
           +-------------------+------------------------------+
           |         0         |         3% - 0% = 3%         |
           |         1         |         3% - 1% = 2%         |
           |         2         |         3% - 3% = 0%         |
           +-------------------+------------------------------+

   Table 2: Downstream Congestion Measured at Example Observation Points

   All along the path, whole-path congestion remains unchanged so it can
   be used as a reference against which to compare upstream congestion.
   The difference predicts downstream congestion for the rest of the
   path. Therefore, measuring the fractions of each codepoint at any
   point in the Internet will reveal upstream, downstream and whole path
   congestion.
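
   The subtraction used in Table 2 can be sketched in a few lines of
   Python. This is a minimal illustration under stated assumptions, not
   a metering algorithm from this memo: all function names are
   hypothetical, fractions are given in whole percent, and whole-path
   congestion is approximated by simple addition as in the 2-router
   example. The second helper shows why the figures are only
   approximate, on the added assumption that each resource marks
   packets independently.

```python
# Illustrative sketch only; function names are assumptions.

def downstream_congestion(re_blanking, ce_marking):
    """Approximate downstream congestion at one observation point."""
    return re_blanking - ce_marking


def observe_path(marking_per_resource):
    """Approximate downstream congestion at every observation point.

    marking_per_resource lists the CE marking (in percent) added by each
    congested resource, in path order. Observation point k sits just
    before resource k; the final point sits after the last resource.
    """
    # The sender blanks the RE flag in about as many packets as arrive
    # CE-marked, so RE blanking tracks whole-path congestion.
    whole_path = sum(marking_per_resource)
    readings = []
    upstream = 0  # CE marking accumulated so far along the path
    for added in list(marking_per_resource) + [0]:
        readings.append(downstream_congestion(whole_path, upstream))
        upstream += added
    return readings


def combined_marking(fractions):
    """Combined CE fraction if each resource marks independently
    (an assumption of this sketch)."""
    unmarked = 1.0
    for p in fractions:
        unmarked *= 1.0 - p
    return 1.0 - unmarked


if __name__ == "__main__":
    print(observe_path([1, 2]))            # values for points 0, 1, 2
    print(combined_marking([0.01, 0.02]))  # slightly under 0.03
```

   With the 1% and 2% marking fractions of Figure 1, observe_path
   reproduces the three rows of Table 2, while combined_marking gives a
   value slightly below 3%, which is why the example is labelled
   imprecise.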

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration. To be absolutely clear, these
   fractions are averages that would result from the behaviour of a TCP
   protocol handler mechanically blanking outgoing packets in direct
   response to incoming feedback---we are not saying any protocol
   handler works with these average fractions directly.

3.4. Informal Terminology

   In the rest of this memo we will loosely talk of positive or negative
   flows, meaning flows where the moving average of the downstream
   congestion metric is persistently positive or negative. The notion
   of a negative metric arises because it is derived by subtracting one
   metric from another. Of course actual downstream congestion cannot
   be negative, only the metric can (whether due to time lags or
   deliberate malice).

   Just as we will loosely talk of positive and negative flows, we will
   also talk of positive or negative packets, meaning packets that
   contribute positively or negatively to the downstream congestion
   metric.

   Therefore we will talk of packets having `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   the downstream congestion metric.

   Figure 2 shows the main state transitions of the system once a flow
   is established, showing the worth of packets in each state. When the
   network congestion marks a packet it decrements its worth (moving
   from the left of the main square to the right). When the sender
   blanks the RE flag in order to re-echo congestion it increments the
   worth of a packet (moving from the bottom of the main square to the
   top).
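
   The worth arithmetic just described can be sketched as follows. This
   is a hedged illustration only: the names are hypothetical, only the
   four codepoints of an established flow are modelled (flow bootstrap
   is left out), and leaving an already CE-marked packet unchanged when
   it is marked again is an assumption of the sketch.

```python
# Illustrative sketch only; names are assumptions, not spec text.
# Worth of each codepoint seen once a flow is established.
WORTH = {"Re-Echo": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


def congestion_mark(codepoint):
    """Network congestion marking: decrements the packet's worth by one."""
    marked = {"Re-Echo": "CE(0)", "RECT": "CE(-1)"}
    # Assumption: a packet that is already CE-marked stays as it is.
    return marked.get(codepoint, codepoint)


def blank_re_flag(codepoint):
    """Sender blanks the RE flag of a RECT packet: worth goes from 0 to +1."""
    return "Re-Echo" if codepoint == "RECT" else codepoint


def downstream_metric(packets):
    """Sum of worth multiplied by size over (codepoint, octets) pairs."""
    return sum(WORTH[codepoint] * octets for codepoint, octets in packets)
```

   For instance, a stream of equal-sized packets in which one arrives
   as CE(-1) and a later one arrives as Re-Echo sums to zero, whereas
   the same stream without the Re-Echo packet sums to a negative value.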
572 Sender state Sent Worth Received Worth 573 packet packet 574 +----------------------------------------------------+ 575 | ^ 576 V | 577 Congestion echoed -->Re-Echo +1 --+---> CE(0) 0 --+ 578 (positive) | (canceled) | 579 V network | 580 | congestion | 581 | | 582 Flow established --> RECT 0 ----+-> CE(-1) -1 --+ 583 ^ (neutral) | | (negative) 584 | | | 585 | no V V 586 | congestion | | 587 +-----------<--------------+-+ 589 Figure 2: Re-ECN System State Diagram (bootstrap not shown) 591 The idea is that every time the network decrements the worth of a 592 packet, the sender increments the worth of a later packet. Then, 593 over time, as many positive octets should arrive at the receiver as 594 negative. Note we have said octets not packets, so if packets are of 595 different sizes, the worth should be incremented on enough octets to 596 balance the octets in negative packets arriving at the receiver. It 597 is this balance that will allow the network to hold the sender 598 accountable for the congestion it causes, as we shall see. The 599 informal outline below uses TCP as an example transport, but the idea 600 would be broadly similar for any transport that adapts its rate to 601 congestion. 603 We will start with the sender in `flow established' state. Normally, 604 as acknowledgements of earlier packets arrive that don't feedback any 605 congestion, the congestion window can be opened, so the sender goes 606 round the smaller sub-loop, sending RECT packets (worth 0) and 607 returning to the flow established state to send another one. If a 608 router congestion marks one of the packets, it decrements the 609 packet's worth. The sender will have been continuing to traverse 610 round the smaller feedback loop every time acknowledgements arrive. 
611 But when congestion feedback returns from this packet that was marked 612 with -1 worth (the largest loop in the figure) the sender jumps to 613 the congestion echoed state in order to re-echo the congestion, 614 incrementing the worth of the next packet to +1 by blanking its RE 615 flag. The sender then returns to the flow established state and 616 continues round the smaller loop, sending packets worth 0. Note that 617 the size of the loops is just an artefact of the figure; it is not 618 meant to imply that one loop is slower than the other - they are both 619 the same end to end feedback loop. 621 If a packet carrying re-echoed congestion happens to also be 622 congestion marked, the +1 worth added by the sender will be cancelled 623 out by the -1 network congestion marking. Although the two worth 624 values correctly cancel out, neither the congestion marking nor the 625 re-echoed congestion are lost, because the RE bit and the ECN field 626 are orthogonal. So, whenever this happens, the receiver will 627 correctly detect and re-echo the new congestion event as well (the 628 top sub-loop). When we need to distinguish, we will sometimes call a 629 packet marked RECT 'neutral' (0 worth), while we will call the CE(0) 630 marking 'canceled' (also 0 worth). If a re-echoed packet isn't 631 unlucky enough to be further congestion marked, the sender will 632 return to the flow established state and continue to send RECT 633 packets (worth 0). 635 The table below specifies unambiguously the worth of each extended 636 ECN codepoint. Note the order is different from the previous table 637 to better show how the worth increments and decrements. The FNE 638 codepoint is an exception. It is used in the flow bootstrap process 639 (explained later) and has the same positive (+1) worth as a packet 640 with the Re-Echo codepoint. 
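As a rough sketch, the worth values of Table 3 can be held in a
lookup keyed on the ECN field and the RE bit, and the downstream
congestion metric of a set of packets is then the sum of worth times
size. The names EECN and downstream_metric are hypothetical, chosen
only for this illustration; codepoints with no defined worth are
given None and skipped.

```python
# Keys are (ECN field, RE bit) as listed in Table 3;
# values are (extended ECN codepoint, worth).
EECN = {
    ("00", 0): ("Not-RECT", None),   # not re-ECN-capable transport
    ("01", 0): ("Re-Echo", +1),
    ("10", 0): ("---", None),        # legacy ECN use only
    ("11", 0): ("CE(0)", 0),         # Re-Echo canceled by congestion
    ("00", 1): ("FNE", +1),
    ("01", 1): ("RECT", 0),
    ("10", 1): ("--CU--", None),     # currently unused
    ("11", 1): ("CE(-1)", -1),
}


def downstream_metric(packets):
    """Sum worth * size over packets given as (ecn_field, re_bit, octets)."""
    total = 0
    for ecn_field, re_bit, octets in packets:
        _name, worth = EECN[(ecn_field, re_bit)]
        if worth is not None:
            total += worth * octets
    return total


# One CE(-1) packet later balanced by one Re-Echo packet of equal size.
balanced_flow = [("11", 1, 1500), ("01", 1, 1500), ("01", 0, 1500)]
```

Working in octets rather than packets reflects the point made above
that positive and negative octets, not packets, should balance.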
+--------+------+----------------+-------+--------------------------+
|  ECN   |  RE  | Extended ECN   | Worth | Re-ECN meaning           |
| field  | bit  | codepoint      |       |                          |
+--------+------+----------------+-------+--------------------------+
|   00   |  0   | Not-RECT       | ...   | Not re-ECN-capable       |
|        |      |                |       | transport                |
|   01   |  0   | Re-Echo        | +1    | Re-echoed congestion and |
|        |      |                |       | RECT                     |
|   10   |  0   | ---            | ...   | Legacy ECN use only      |
|   11   |  0   | CE(0)          | 0     | Re-Echo canceled by      |
|        |      |                |       | congestion experienced   |
|   00   |  1   | FNE            | +1    | Feedback not established |
|   01   |  1   | RECT           | 0     | Re-ECN capable transport |
|   10   |  1   | --CU--         | ...   | Currently unused         |
|   11   |  1   | CE(-1)         | -1    | Congestion experienced   |
+--------+------+----------------+-------+--------------------------+

Table 3: 'Worth' of Extended ECN Codepoints

4. Transport Layers

4.1. TCP

Re-ECN capability at the sender is essential. At the receiver it is
optional, as long as the receiver has a basic (`vanilla flavour')
RFC3168-compliant ECN-capable transport (ECT) [RFC3168]. Given re-ECN
is not the first attempt to define the semantics of the ECN field, we
give a table below summarising what happens for various combinations
of capabilities of the sender S and receiver R, as indicated in the
first four columns below. The last column gives the mode a
half-connection should be in after the first two of the three TCP
handshakes.
+--------+--------------+------------+---------+--------------------+
| Re-ECT | ECT-Nonce    | ECT        | Not-ECT | S-R                |
|        | (RFC3540)    | (RFC3168)  |         | Half-connection    |
|        |              |            |         | Mode               |
+--------+--------------+------------+---------+--------------------+
|   SR   |              |            |         | RECN               |
|   S    |      R       |            |         | RECN-Co            |
|   S    |              |     R      |         | RECN-Co            |
|   S    |              |            |    R    | Not-ECT            |
+--------+--------------+------------+---------+--------------------+

Table 4: Modes of TCP Half-connection for Combinations of ECN
Capabilities of Sender S and Receiver R

We will describe what happens in each mode, then describe how they
are negotiated. The abbreviations for the modes in the above table
mean:

RECN: Full re-ECN capable transport

RECN-Co: Re-ECN sender in compatibility mode with a
   vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
   receiver. Implementation of this mode is OPTIONAL.

Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when
   at least one of the transports does not understand even basic ECN
   marking.

Note that we use the term Re-ECT for a host transport that is
re-ECN-capable but RECN for the modes of the half connections between
hosts when they are both Re-ECT. If a host transport is Re-ECT, this
fact alone does NOT imply either of its half connections will
necessarily be in RECN mode, at least not until it has confirmed that
the other host is Re-ECT.

4.1.1. RECN mode: Full re-ECN capable transport

In full RECN mode, for each half connection, both the sender and the
receiver each maintain an unsigned integer counter we will call ECC
(echo congestion counter). The receiver maintains a count, modulo 8,
of how many times a CE marked packet has arrived during the
half-connection.
Once a RECN connection is established, the three TCP
option flags (ECE, CWR & NS) used for ECN-related functions in other
versions of ECN are used as a 3-bit field for the receiver to
repeatedly tell the sender the current value of ECC whenever it sends
a TCP ACK. We will call this the echo congestion increment (ECI)
field. This overloaded use of these 3 option flags as one 3-bit ECI
field is shown in Figure 4. The actual definition of the TCP header,
including the addition of support for the ECN nonce, is shown for
comparison in Figure 3. This specification does not redefine the
names of these three TCP option flags, it merely overloads them with
another definition once a flow is established.

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the
TCP Header

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 4: Definition of the ECI field within bytes 13 and 14 of the
TCP Header, overloading the current definitions above for established
RECN flows.

Receiver Action in RECN Mode

   Every time a CE marked packet arrives at a receiver in RECN mode,
   the receiver transport increments its local value of ECC modulo 8
   and MUST echo its value to the sender in the ECI field of the next
   ACK.
It MUST repeat the same value of ECI in every subsequent ACK 756 until the next CE event, when it increments ECI again. 758 The increment of the local ECC values is modulo 8 so the field 759 value simply wraps round back to zero when it overflows. The 760 least significant bit is to the right (labelled bit 9). 762 A receiver in RECN mode MAY delay the echo of a CE to the next 763 delayed-ACK, which would be necessary if ACK-withholding were 764 implemented. 766 Sender Action in RECN Mode 768 On the arrival of every ACK, the sender compares the ECI field 769 with its own ECC value, then replaces its local value with that 770 from the ACK. The difference D is assumed to be the number of CE 771 marked packets that arrived at the receiver since it sent the 772 previously received ACK (but see below for the sender's safety 773 strategy). Whenever the ECI field increments by D (and/or d drops 774 are detected), the sender MUST clear the RE flag to "0" in the IP 775 header of the next D' data packets it sends (where D' = D + d), 776 effectively re-echoing each single increment of ECI. Otherwise 777 the data sender MUST send all data packets with RE set to "1". 779 As a general rule, once a flow is established, as well as setting 780 or clearing the RE flag as above, a data sender in RECN mode MUST 781 always set the ECN field to ECT(1). However, the settings of the 782 extended ECN field during flow start are defined in Section 4.1.4. 784 As we have already emphasised, the re-ECN protocol makes no 785 changes and has no effect on the TCP congestion control algorithm. 786 So, each increment of ECI (or detection of a drop) also triggers 787 the standard TCP congestion response, but with no more than one 788 congestion response per round trip, as usual. 790 A TCP sender also acts as the receiver for the other half- 791 connection. The host will maintain two ECC values S.ECC and R.ECC 792 as sender and receiver respectively. 
   Every TCP header sent by a host in RECN mode will also repeat the
   prevailing value of R.ECC in its ECI field. If a sender in RECN
   mode has to retransmit a packet due to a suspected loss, the
   re-transmitted packet MUST carry the latest prevailing value of
   R.ECC when it is re-transmitted, which will not necessarily be the
   one it carried originally.

4.1.1.1. Drops and Marks

Re-ECN is based on the ECN protocol [RFC3168], which in turn is
typically based on the RED algorithm [RFC2309]. This algorithm marks
packets as CE with a probability that increases as the size of the
router queue increases. However, if the queue becomes too full then
it will revert to dropping packets. Because of this it is important
that re-ECN treats each packet drop it detects as if it were actually
a CE mark. This ensures that it can continue to correctly echo
congestion even through a highly congested path.

In order to ensure that drops are correctly echoed the sender needs
to add the number of drops detected per RTT to the difference in ECI
value waiting to be echoed. A drop is defined as set out in
[RFC2581]: if the connection is in slow start then a single
duplicate acknowledgement will be treated as an indication of a drop.
When the system is in the congestion avoidance stage then 3 duplicate
acknowledgements will be treated as a sign of a drop. In all cases,
if a re-transmission time-out occurs then that will be treated as a
drop.

4.1.1.2. Safety against Long Pure ACK Loss Sequences

The ECI method was chosen for echoing congestion marking because a
re-ECN sender needs to know about every CE mark arriving at the
receiver, not just whether at least one arrives within a round trip
time (which is all the ECE/CWR mechanism supported). And, as pure
ACKs are not protected by TCP reliable delivery, we repeat the same
ECI value in every ACK until it changes.
Even if many ACKs in a row
are lost, as soon as one gets through, the ECI field it repeats from
previous ACKs that didn't get through will update the sender on how
many CE marks arrived since the last ACK got through.

The sender will only lose a record of the arrival of a CE mark if all
the ACKs are lost (and all of them were pure ACKs) for a stream of
data long enough to contain 8 or more CE marks. So, if the marking
fraction was p, at least 8/p pure ACKs would have to be lost. For
example, if p was 5%, a sequence of 160 pure ACKs would all have to
be lost. To protect against such extremely unlikely events, if a
re-ECN sender detects a sequence of pure ACKs has been lost it SHOULD
assume the ECI field wrapped as many times as possible within the
sequence.

Specifically, if a re-ECN sender receives an ACK with an
acknowledgement number that acknowledges L segments since the
previous ACK but with a sequence number unchanged from the previously
received ACK, it SHOULD conservatively assume that the ECI field
incremented by D' = L - ((L-D) mod 8), where D is the apparent
increase in the ECI field. For example if the ACK arriving after 9
pure ACK losses apparently increased ECI by 2, the assumed increment
of ECI would still be 2. But if ECI apparently increased by 2 after
11 pure ACK losses, ECI should be assumed to have increased by 10.

A re-ECN sender MAY implement a heuristic algorithm to predict beyond
reasonable doubt that the ECI field did not wrap within a sequence of
lost pure ACKs. But such an algorithm is not required. Such an
algorithm MUST NOT be used unless it is proven to work even in the
presence of correlation between high ACK loss rate on the back
channel and high CE marking rate on the forward channel.
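The modulo-8 comparison and the conservative assumption above can be
sketched as follows. The function names are hypothetical and the
sketch assumes that L is at least D, that is, that no more CE marks
are reported than segments were newly acknowledged.

```python
def apparent_increase(eci, local_ecc):
    """D: apparent number of new CE marks, from the 3-bit ECI field."""
    return (eci - local_ecc) % 8


def safe_increase(newly_acked_segments, d):
    """D' = L - ((L - D) mod 8): assume ECI wrapped as often as possible.

    Assumes newly_acked_segments (L) >= d (D).
    """
    l = newly_acked_segments
    return l - ((l - d) % 8)


# The two worked examples from the text, both with an apparent increase of 2.
after_9_losses = safe_increase(9, 2)     # no room for a full wrap of 8
after_11_losses = safe_increase(11, 2)   # one wrap of 8 is possible
```

The result always equals D plus a whole number of wraps of 8, and it
never exceeds L, since at most L CE marks can have arrived on L
segments.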
860 Whatever assumption a re-ECN sender makes about potentially lost CE 861 marks, both its congestion control and its re-echoing behaviour 862 SHOULD be consistent with the assumption it makes. 864 4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver 866 If the half-connection is in RECN-Co mode, ECN feedback proceeds no 867 differently to that of vanilla ECN. In other words, the receiver 868 sets the ECE flag repeatedly in the TCP header and the sender 869 responds by setting the CWR flag. Although RECN-Co mode is used when 870 the receiver has not implemented the re-ECN protocol, the sender can 871 infer enough from its vanilla ECN feedback to set or clear the RE 872 flag reasonably well. Specifically, every time the receiver toggles 873 the ECE field from "0" to "1" (or a loss is detected), as well as 874 setting CWR in the TCP flags, the re-ECN sender MUST blank the RE 875 flag of the next packet to "0" as it would do in full RECN mode. 876 Otherwise, the data sender SHOULD send all other packets with RE set 877 to "1". Once a flow is established, a re-ECN data sender in RECN-Co 878 mode MUST always set the ECN field to ECT(1). 880 If a CE marked packet arrives at the receiver within a round trip 881 time of a previous mark, the receiver will still be echoing ECE for 882 the last CE mark. Therefore, such a mark will be missed by the 883 sender. Of course, this isn't of concern for congestion control, but 884 it does mean that very occasionally the RE blanking fraction will be 885 understated. Therefore flows in RECN-Co mode may occasionally be 886 mistaken for very lightly cheating flows and consequently might 887 suffer a small number of packet drops through an egress dropper 888 (Section 6.1.4). We expect re-ECN would be deployed for some time 889 before policers and droppers start to enforce it. 
So, given there is
not much ECN deployment yet anyway, this minor problem may affect
only a very small proportion of flows, reducing to nothing over the
years as vanilla ECN hosts upgrade. The use of RECN-Co mode would
need to be reviewed in the light of experience at the time of re-ECN
deployment.

RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their
code simple MAY choose not to implement this mode. If they do not,
a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence
of an ECN-capable receiver. It MAY choose to fall back to the
ECT-Nonce mode, but if re-ECN implementers don't want to be bothered
with RECN-Co mode, they probably won't want to add an ECT-Nonce mode
either.

4.1.2.1. Re-ECN support for the ECN Nonce

A TCP half-connection in RECN-Co mode MUST NOT support the ECN
Nonce [RFC3540]. This means that the sending code of a re-ECN
implementation will never need to include ECN Nonce support. Re-ECN
is intended to provide wider protection than the ECN nonce against
congestion control misbehaviour, and re-ECN only requires support
from the sender, therefore it is preferable to specifically rule out
the need for dual sender implementations. As a consequence, a re-ECN
capable sender will never set ECT(0), so it will be easier for
network elements to discriminate re-ECN traffic flows from other ECN
traffic, which will always contain some ECT(0) packets.

However, a re-ECN implementation MAY include receiving code that
complies with the ECN Nonce protocol when interacting with a sender
that supports the ECN nonce (rather than re-ECN), but this support is
not required.

RFC3540 allows an ECN nonce sender to choose whether to sanction a
receiver that does not ever set the nonce sum.
Given re-ECN is 924 intended to provide wider protection than the ECN nonce against 925 congestion control misbehaviour, implementers of re-ECN receivers MAY 926 choose not to implement backwards compatibility with the ECN nonce 927 capability. This may be because they deem that the risk of sanctions 928 is low, perhaps because significant deployment of the ECN nonce seems 929 unlikely at implementation time. 931 4.1.3. Capability Negotiation 933 During the TCP hand-shake at the start of a connection, an originator 934 of the connection (host A) with a re-ECN-capable transport MUST 935 indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and 936 ECE=1 in the initial SYN. 938 A responding Re-ECT host (host B) MUST return a SYN ACK with flags 939 CWR=1 and ECE=0. The responding host MUST NOT set this combination 940 of flags unless the preceding SYN has already indicated Re-ECT 941 support as above. A Re-ECT server (B) can use either setting of the 942 NS flag combined with this type of SYN ACK in response to a SYN from 943 a Re-ECT client (A). Normally a Re-ECT server will reply to a Re-ECT 944 client with NS=0, but in the special circumstance below it can return 945 a SYN ACK with NS=1. 947 If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT 948 server B MUST increment its local value of ECC. But B cannot reflect 949 the value of ECC in the SYN ACK, because it is still using the 3 bits 950 to negotiate connection capabilities. So, server B MUST set the 951 alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0. 953 These handshakes are summarised in Table 5 below, with X meaning 954 `don't care'. The handshakes used for the other flavours of ECN are 955 also shown for comparison. To compress the width of the table, the 956 headings of the first four columns have been severely abbreviated, as 957 follows: 959 R: *R*e-ECT 961 N: ECT-*N*once (RFC3540) 963 E: *E*CT (RFC3168) 965 I: Not-ECT (*I*mplicit congestion notification). 
These correspond with the same headings used in Table 4. Indeed, the
resulting modes in the last two columns of the table below are a more
comprehensive way of saying the same thing as Table 4.

+----+---+---+---+------------+-------------+-----------+-----------+
| R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
+----+---+---+---+------------+-------------+-----------+-----------+
|    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
| AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
| A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
| A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
| A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
| B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
| B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
| B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
+----+---+---+---+------------+-------------+-----------+-----------+

Table 5: TCP Capability Negotiation between Originator (A) and
Responder (B)

As soon as a re-ECN capable TCP server receives a SYN, it MUST set
its two half-connections into the modes given in Table 5. As soon as
a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
half-connections into the modes given in Table 5. The
half-connections will remain in these modes for the rest of the
connection, including for the third segment of TCP's three-way
hand-shake (the ACK).

{ToDo: Consider SYNs within a connection.}

Recall that, if the SYN ACK reflects the same flag settings as the
preceding SYN (because there is a broken legacy implementation that
behaves this way), RFC3168 specifies that the whole connection MUST
revert to Not-ECT.

Also note that, whenever the SYN flag of a TCP segment is set
(including when the ACK flag is also set), the NS, CWR and ECE flags
MUST NOT be interpreted as the 3-bit ECI value, which is only set as
a copy of the local ECC value in non-SYN packets.

4.1.4.
Extended ECN (EECN) Field Settings during Flow Start or after 1008 Idle Periods 1010 If the originator (A) of a TCP connection supports re-ECN it MUST set 1011 the extended ECN (EECN) field in the IP header of the initial SYN 1012 packet to the feedback not established (FNE) codepoint. 1014 FNE is a new extended ECN codepoint defined by this specification 1015 (Section 3.2). The feedback not established (FNE) codepoint is used 1016 when the transport does not have the benefit of ECN feedback so it 1017 cannot decide whether to set or clear the RE flag. 1019 If after receiving a SYN the server B has set its sending half- 1020 connection into RECN mode or RECN-Co mode, it MUST set the extended 1021 ECN field in the IP header of its SYN ACK to the feedback not 1022 established (FNE) codepoint. Note the careful wording here, which 1023 means that Re-ECT server B MUST set FNE on a SYN ACK whether it is 1024 responding to a SYN from a Re-ECT client or from a client that is 1025 merely ECN-capable. 1027 The original ECN specification [RFC3168] required SYNs and SYN ACKs 1028 to use the Not-ECT codepoint of the ECN field. The aim was to 1029 prevent well-known DoS attacks such as SYN flooding being able to 1030 gain from the advantage that ECN capability afforded over drop at 1031 ECN-capable routers. 1033 For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this 1034 caution was unnecessary, and proposes to allow a SYN ACK to be ECN- 1035 capable to improve performance. We have gone further by proposing to 1036 make the initial SYN ECN-capable too. By stipulating the FNE 1037 codepoint for the initial SYN, we comply with RFC3168 in word but not 1038 in spirit, because we have indeed set the ECN field to Not-ECT, but 1039 we have extended the ECN field with another bit. And it will be seen 1040 (Section 5.3) that we have defined one setting of that bit to mean an 1041 ECN-capable transport. 
Therefore, by proposing that the FNE 1042 codepoint MUST be used on the initial SYN of a connection, we have 1043 (deliberately) made the initial SYN ECN-capable. Section 5.4 1044 justifies deciding to make the initial SYN ECN-capable. 1046 Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will 1047 have already been set on the initial SYN and possibly the SYN ACK as 1048 above. But each re-ECN sender will have to set FNE cautiously on a 1049 few data packets as well, given a number of packets will usually have 1050 to be sent before sufficient congestion feedback is received. The 1051 behaviour will be different depending on the mode of the half- 1052 connection: 1054 RECN mode: Given the constraints on TCP's initial window [RFC3390] 1055 and its exponential window increase during slow start 1056 phase [RFC2581], it turns out that the sender SHOULD set FNE on 1057 the first and third data packets in its flow, assuming equal sized 1058 data packets once a flow is established. Appendix D presents the 1059 calculation that led to this conclusion. Below, after running 1060 through the start of an example TCP session, we give the intuition 1061 learned from that calculation. 1063 RECN-Co mode: A re-ECT sender that switches into re-ECN 1064 compatibility mode or into Not-ECT mode (because it has detected 1065 the corresponding host is not re-ECN capable) MUST limit its 1066 initial window to 1 segment. The reasoning behind this constraint 1067 is given in Section 5.4. Having set this initial window, a re-ECN 1068 sender in RECN-Co mode SHOULD set FNE on the first and third data 1069 packets in a flow, as for RECN mode. 
+----+------+----------------+-------+-------+---------------+------+
|    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
+----+------+----------------+-------+-------+---------------+------+
|    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
| -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
| 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
|    |      | CWR,ECE,NS     |       |       |               |      |
| 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
|    |      |                |       |       | SYN,ACK,CWR   |      |
| 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
| 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
| 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
| 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
| 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
| 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
| 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
| 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
| 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
| 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
| 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
|    |      | ...            |       |       |               |      |
+----+------+----------------+-------+-------+---------------+------+

Table 6: TCP Session Example #1

Table 6 shows an example TCP session, where the server B sets FNE on
its first and third data packets (lines 5 & 7) as well as on the
initial SYN ACK as previously described. The left hand half of the
table shows the relevant settings of headers sent by client A in
three layers: the TCP payload size; TCP settings; then IP settings.
The right hand half gives equivalent columns for server B. The only
TCP settings shown are the sequence number (SEQ), acknowledgement
number (ACK) and the relevant control (CTL) flags that A sets in the
TCP header. The IP columns show the setting of the extended ECN
(EECN) field.
1107 Also shown on the receiving side of the table is the value of the 1108 receiver's echo congestion counter (R.ECC) after processing the 1109 incoming EECN header. Note that, once a host sets a half-connection 1110 into RECN mode, it MUST initialise its local value of ECC to zero. 1112 The intuition that Appendix D gives for why a sender should set FNE 1113 on the first and third data packets is as follows. At line 13, a 1114 packet sent by B is shown with an '*', which means it has been 1115 congestion marked by an intermediate router from RECT to CE(-1). On 1116 receiving this CE marked packet, client A increments its ECC counter 1117 to 1 as shown. This was the 7th data packet B sent, but before 1118 feedback about this event returns to B, it might well have sent many 1119 more packets. Indeed, during exponential slow start, about as many 1120 packets will be in flight (unacknowledged) as have been acknowledged. 1121 So, when the feedback from the congestion event on B's 7th segment 1122 returns, B will have sent about 7 further packets that will still be 1123 in flight. At that stage, B's best estimate of the network's packet 1124 marking fraction will be 1/7. So, as B will have sent about 14 1125 packets, it should have already marked 2 of them as FNE in order to 1126 have marked 1/7; hence the need to have set the first and third data 1127 packets to FNE. 1129 Client A's behaviour in Table 6 also shows FNE being set on the first 1130 SYN and the first data packet (lines 1 & 4), but in this case it 1131 sends no more data packets, so of course, it cannot, and does not 1132 need to, set FNE again. Note that in the A-B direction there is no 1133 need to set FNE on the third part of the three-way hand-shake (line 1134 3---the ACK). 
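The arithmetic behind this intuition can be checked with a short
sketch. The names below are hypothetical, and the doubling of the
packet count is the same rough slow-start approximation used in the
text rather than a precise model of TCP.

```python
from fractions import Fraction

# 1-based positions of the data packets that carry FNE at flow start.
FNE_DATA_PACKETS = {1, 3}


def eecn_for_data_packet(n):
    """Codepoint for the n-th data packet sent before any CE feedback."""
    return "FNE" if n in FNE_DATA_PACKETS else "RECT"


marked_packet = 7                                 # B's 7th data packet is CE marked
sent_when_feedback_returns = 2 * marked_packet    # about as many again in flight
estimated_marking = Fraction(1, marked_packet)    # best estimate: 1/7

fne_sent = sum(
    eecn_for_data_packet(n) == "FNE"
    for n in range(1, sent_when_feedback_returns + 1)
)
fne_fraction = Fraction(fne_sent, sent_when_feedback_returns)
```

Under these assumptions the two FNE packets out of about 14 sent give
the same fraction as the estimated marking rate of 1/7.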
1136 Note that in this section we have used the word SHOULD rather than 1137 MUST when specifying how to set FNE on data segments before positive 1138 congestion feedback arrives (but note that the word MUST was used for 1139 FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first 1140 and third data segments to entertain the possibility that the TCP 1141 transport has the benefit of other knowledge of the path, which it 1142 re-uses from one flow for the benefit of a newly starting flow. For 1143 instance, one flow can re-use knowledge of other flows between the 1144 same hosts if using a Congestion Manager [RFC3124] or when a proxy 1145 host aggregates congestion information for large numbers of flows. 1147 After an idle period of more than 1 second, a re-ECN sender transport 1148 MUST set the EECN field of the packet that resumes the connection to 1149 FNE. Note that this next packet may be sent a very long time later, 1150 a packet does NOT have to be sent after 1 second of idling. In order 1151 that the design of network policers can be deterministic, this 1152 specification deliberately puts an absolute lower limit on how long a 1153 connection can be idle before the packet that resumes the connection 1154 must be set to FNE, rather than relating it to the connection round 1155 trip time. We use the lower bound of the retransmission timeout 1156 (RTO) [RFC2988], which is commonly used as the idle period before TCP 1157 must reduce to the restart window [RFC2581]. Note our specification 1158 of re-ECN's idle period is NOT intended to change the idle period for 1159 TCP's restart, nor indeed for any other purposes. 1161 {ToDo: Describe how the sender falls back to legacy modes if packets 1162 don't appear to be getting through (to work round firewalls 1163 discarding packets they consider unusual).} 1165 4.1.5. 
Pure ACKs, Retransmissions, Window Probes and Partial ACKs

A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
to Not-ECT in pure ACKs, retransmissions and window probes, as
specified in [RFC3168]. Our eventual goal is for all packets to be
sent with re-ECN enabled, and we believe the semantics of the ECI
field go a long way towards being able to achieve this. However, we
have not completed a full security analysis for these cases,
therefore, currently we merely re-state current practice.

We must also reconcile the facts that congestion marking is applied
to packets but acknowledgements cover octet ranges and acknowledged
octet boundaries need not match the transmitted boundaries. The
general principle we work to is to remain compatible with TCP's
congestion control which is driven by congestion events at packet
granularity while at the same time aiming to blank the RE flag on at
least as many octets in a flow as have been marked CE.

Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
times as CE marked packets have been received. And that value MUST
be echoed to the sender in the first available ACK using the ECI
field. This ensures the TCP sender's congestion control receives
timely feedback on congestion events at the same packet granularity
that they were generated on congested routers.

Then, a re-ECN sender stores the difference D between its own ECC
value and the incoming ECI field by adding D to a counter R. Then, R
is decremented by 1 for each subsequent packet that is sent with the
RE flag blanked, until R is no longer positive. Using this technique,
whenever a re-ECN transport sends a not re-ECN capable (NRECN) packet
(e.g. a retransmission), the remaining packets required to have the
RE flag blanked will be automatically carried over to subsequent
packets, through the variable R.
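The carry-over through R might be sketched as below. The class and
method names are hypothetical, the sketch counts in packets rather
than octets, and it leaves out flow start (FNE) and the safety
strategy against lost pure ACKs.

```python
class ReEcnReceiver:
    """Receiver half: counts CE marks modulo 8 and echoes the count."""

    def __init__(self):
        self.ecc = 0  # echo congestion counter

    def on_data(self, ce_marked):
        if ce_marked:
            self.ecc = (self.ecc + 1) % 8
        return self.ecc  # value to place in the ECI field of the next ACK


class ReEcnSender:
    """Sender half: tracks how many packets still owe a blanked RE flag."""

    def __init__(self):
        self.ecc = 0  # last ECI value seen
        self.r = 0    # packets still to be sent with RE blanked

    def on_ack(self, eci, drops=0):
        d = (eci - self.ecc) % 8  # apparent increase in ECI
        self.ecc = eci
        self.r += d + drops

    def next_packet(self, re_ecn_capable=True):
        """Return the extended ECN codepoint for the next packet."""
        if not re_ecn_capable:
            return "Not-RECT"     # e.g. a retransmission; R is carried over
        if self.r > 0:
            self.r -= 1
            return "Re-Echo"      # RE blanked, worth +1
        return "RECT"             # neutral, worth 0
```

For example, after two CE marks are echoed, a retransmission sent as
Not-RECT leaves R unchanged, so the next two re-ECN capable packets
still carry Re-Echo.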
This does not ensure precisely the same number of octets have RE blanked as were CE marked. But we believe positive errors will cancel negative over a long enough period. {ToDo: However, more research is needed to prove whether this is so. If it is not, it may be necessary to increment and decrement R in octets rather than packets, by incrementing R as the product of D and the size in octets of packets being sent (typically the MSS).}

4.2.  Other Transports

4.2.1.  General Guidelines for Adding Re-ECN to Other Transports

Re-ECT sender transports that have established that the receiver transport is at least ECN-capable (not necessarily re-ECN capable) MUST blank the RE codepoint in packets carrying at least as many octets as arrive at the receiver with the CE codepoint set. Re-ECN-capable sender transports should always initialise the ECN field to the ECT(1) codepoint once a flow is established.

If the sender transport does not have sufficient feedback to even estimate the path's CE rate, it SHOULD set FNE continuously. If the sender transport has some, perhaps stale, feedback to estimate that the path's CE rate is almost certainly less than E%, the transport MAY blank RE in packets for E% of sent octets, and set the RECT codepoint for the remainder.

The following sections give guidelines on how re-ECN support could be added to RSVP or NSIS, to DCCP, and to SCTP, although separate Internet drafts will be necessary to document the exact mechanics of re-ECN in each of these protocols.

{ToDo: Give a brief outline of what would be expected for each of the following:

o  UDP fire and forget (e.g. DNS)

o  UDP streaming with no feedback

o  UDP streaming with feedback

}

4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS

A separate I-D has been submitted [Re-PCN] describing how re-ECN can be used in an edge-to-edge rather than end-to-end scenario. It can then be used by downstream networks to police whether upstream networks are blocking new flow reservations when downstream congestion is too high, even though the congestion is in other operators' downstream networks. This relates to current IETF work on Admission Control over Diffserv using Pre-Congestion Notification (PCN) [PCN-arch].

4.2.3.  Guidelines for adding Re-ECN to DCCP

Besides adjusting the initial features negotiation sequence, operating re-ECN in DCCP [RFC4340] could be achieved by defining a new option to be added to acknowledgments, which would include a multibit field where the destination could copy its ECC.

4.2.4.  Guidelines for adding Re-ECN to SCTP

Annex 1 in [RFC2960] gives the specifications for SCTP to support ECN. Similar steps should be taken to support re-ECN. Besides adjusting the initial features negotiation sequence, operating re-ECN in SCTP could be achieved by defining a new control chunk, which would include a multibit field where the destination could copy its ECC.

5.  Network Layer

5.1.  Re-ECN IPv4 Wire Protocol

The wire protocol of the ECN field in the IP header remains largely unchanged from [RFC3168]. However, an extension to the ECN field we call the RE (re-ECN extension) flag (Section 3.2) is defined in this document. It doubles the extended ECN codepoint space, giving 8 potential codepoints. The semantics of the extra codepoints are backward compatible with the semantics of the 4 original codepoints [RFC3168] (Section 7.1 collects together and summarises all the changes defined in this document).

For IPv4, this document proposes that the new RE control flag will be positioned where the `reserved' control flag was at bit 48 of the IPv4 header (counting from 0). Alternatively, some would call this bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 header (Figure 5).

     0   1   2
   +---+---+---+
   | R | D | M |
   | E | F | F |
   +---+---+---+

Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at the Start of Byte 7 of the IPv4 Header

The semantics of the RE flag are described in outline in Section 3 and specified fully in Section 4. The RE flag is always considered in conjunction with the 2-bit ECN field, as if they were concatenated together to form a 3-bit extended ECN field. If the ECN field is set to either the ECT(1) or CE codepoint, when the RE flag is blanked (cleared to "0") it represents a re-echo of congestion experienced by an earlier packet. If the ECN field is set to the Not-ECT codepoint, when the RE flag is set to "1" it represents the feedback not established (FNE) codepoint, which signals that the packet was sent without the benefit of congestion feedback.

It is believed that the FNE codepoint can simultaneously serve other purposes, particularly where the start of a flow needs distinguishing from packets later in the flow. For instance it would have been useful to identify new flows for tag switching and might enable similar developments in the future if it were adopted. It is similar to the state set-up bit idea designed to protect against memory exhaustion attacks. This idea was proposed informally by David Clark and documented by Handley and Greenhalgh [Steps_DoS]. The FNE codepoint can be thought of as a `soft-state set-up flag', because it is idempotent (i.e. one occurrence of the flag is sufficient but further occurrences achieve the same effect if previous ones were lost).

There will probably be other claims pending on the use of bit 48. We know of at least two, [ARI05] and [RFC3514], but neither has been pursued in the IETF so far, although the present proposal would meet the needs of the former.

The security flag proposal (commonly known as the evil bit) was published on 1 April 2003 as Informational RFC 3514, but it was not adopted due to confusion over whether evil-doers might set it inappropriately. The present proposal is backward compatible with RFC3514 because if re-ECN compliant senders were benign they would correctly clear the evil bit to honestly declare that they had just received congestion feedback, whereas evil-doers would hide congestion feedback by setting the evil bit continuously, or at least more often than they should. So, evil senders can be identified, because they declare that they are good less often than they should.

5.2.  Re-ECN IPv6 Wire Protocol

For IPv6, this document proposes that the new RE control flag will be positioned as the first bit of the option field of a new Congestion hop by hop option header (Figure 6).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                   Reserved for future use                   |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option Header containing the Re-ECN Extension (RE) Control Flag

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |AIU|C|Option ID|
   +-+-+-+-+-+-+-+-+

Figure 7: Congestion Hop by Hop Option Type Encoding

The Hop-by-Hop Options header enables packets to carry information to be examined and processed by routers or nodes along the packet's delivery path, including the source and destination nodes. For re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the Congestion extension header MUST be set to "00", meaning if unrecognized `skip over option and continue processing the header'. Then, any routers or receivers not upgraded with the optional re-ECN features described in this memo will simply ignore this header. But routers with these optional re-ECN features, or with a re-ECN policing function, will process this Congestion extension header.

The `C' flag MUST be set to "1" to specify that the Option Data (currently only the RE control flag) can change en route to the packet's final destination. This ensures that, when an Authentication header (AH [RFC2402]) is present in the packet, for any option whose data may change en route, its entire Option Data field will be treated as zero-valued octets when computing or verifying the packet's authenticating value.

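As an illustration of the Option Type encoding in Figure 7 and the option layout in Figure 6, the following sketch packs the option in Python. It assumes the usual network bit order (bit 0 is the most significant bit). The function names and the Option ID value used in the example are hypothetical, since the Option ID is still to be registered.

```python
def option_type(option_id, aiu=0b00, c=1):
    """Pack the Option Type octet of Figure 7: AIU (2 bits), C (1 bit), Option ID (5 bits)."""
    if not 0 <= option_id < 32:
        raise ValueError("Option ID must fit in 5 bits")
    if not 0 <= aiu < 4 or c not in (0, 1):
        raise ValueError("AIU is 2 bits and C is 1 bit")
    return (aiu << 6) | (c << 5) | option_id


def congestion_option(option_id, re_flag):
    """Build Option Type, Opt Length (4) and the 4 octets of Option Data.

    The RE flag is the first (most significant) bit of the Option Data;
    the remaining 31 bits are reserved and sent as zero in this sketch.
    """
    data = (re_flag & 1) << 31
    return bytes([option_type(option_id), 4]) + data.to_bytes(4, "big")


# Example with a placeholder Option ID of 1 (not a registered value).
example = congestion_option(option_id=1, re_flag=1)
```

With AIU set to "00" and C set to "1" as required above, the top three bits of the Option Type octet are always 001, whatever Option ID is eventually assigned.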
Although the RE control flag should not be changed along the path, we expect that the rest of this option field that is currently `Reserved for future use' could be used for a multi-bit congestion notification field which we would expect to change en route. As the RE flag does not need end-to-end authentication, we set the C flag to "1".

{ToDo: A Congestion Hop by Hop Option ID will need to be registered with IANA.}

5.3.  Router Forwarding Behaviour

Re-ECN works well without modifying the forwarding behaviour of any routers. However, two OPTIONAL changes to forwarding behaviour are defined below, which respectively enhance performance and improve a router's discrimination against flooding attacks. They are both OPTIONAL additions that we propose MAY apply by default to all Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN marking behaviours [RFC3168]. Specifications for PHBs MAY define different forwarding behaviours from this default, but this is not required. [Re-PCN] is one example.

FNE indicates ECT:

   The FNE codepoint tells a router to assume that the packet was sent by an ECN-capable transport (see Section 5.4). Therefore an FNE packet MAY be marked rather than dropped. Note that the FNE codepoint has been intentionally chosen so that, to legacy routers (which do not inspect the RE flag), an FNE packet appears to be Not-ECT, so it will be dropped by legacy AQM algorithms.

   A network operator MUST NOT configure a router to ECN mark rather than drop FNE packets unless it can guarantee that FNE packets will be rate limited, either locally or upstream. The ingress policers discussed in Section 6.1.5 would count as rate limiters for this purpose.

Preferential Drop:

   If a re-ECN capable router experiences such high load that it has to drop arriving packets (e.g. during a DoS attack), it MAY preferentially drop packets within the same Diffserv PHB using the preference order for extended ECN codepoints given in Table 7. Preferential dropping can be difficult to implement on some hardware, but if feasible it would discriminate against attack traffic if done as part of the overall policing framework of Section 6.1.3. If nowhere else, routers at the egress of a network SHOULD implement preferential drop (stronger than the MAY above). For simplicity, preferences 4 & 5 MAY be merged into one preference level.

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo canceled  |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | Legacy ECN use    |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not               |
   |       |     |            |       |            | re-ECN-capable    |
   |       |     |            |       |            | transport         |
   +-------+-----+------------+-------+------------+-------------------+

   Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

   The above drop preferences are arranged to preserve packets with more positive worth (Section 3.4), given senders of positive packets must have honestly declared downstream congestion. This is explained fully in Section 6 on applications, particularly when the application of re-ECN to protect against DDoS attacks is described.

5.4.  Justification for Setting the First SYN to FNE

Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and the initial SYN MUST be set to FNE by Re-ECT client A (Section 4.1.4). So an initial SYN may be marked CE(-1) rather than dropped. This seems dangerous, because the sender has not yet established whether the receiver is a legacy one that does not understand congestion marking. It also seems to allow malicious senders to take advantage of ECN marking to avoid so much drop when launching SYN flooding attacks. Below we explain the features of the protocol design that remove both these dangers.

ECN-capable initial SYN with a Not-ECT server:

   If the TCP server B is re-ECN capable, provision is made for it to feed back a possible congestion marked SYN in the SYN ACK (Section 4.1.4). But if the TCP client A finds out from the SYN ACK that the server was not ECN-capable, the TCP client MUST consider the first SYN as congestion marked before setting itself into Not-ECT mode. Section 4.1.4 mandates that such a TCP client MUST also set its initial window to 1 segment. In this way we remove the need to cautiously avoid setting the first SYN to Not-RECT. This will give worse performance while deployment is patchy, but better performance once deployment is widespread.

SYN flooding attacks can't exploit ECN-capability:

   Malicious hosts may think they can use the advantage that ECN-marking gives over drop in launching classic SYN-flood attacks. But Section 5.3 mandates that a router MUST only be configured to treat packets with the FNE codepoint as ECN-capable if FNE packets are rate limited.
Introduction of the FNE codepoint was a deliberate move to enable transport-neutral handling of flow-start and flow state set-up in the IP layer where it belongs. It then becomes possible to protect against flooding attacks of all forms (not just SYN flooding) without transport-specific inspection for things like the SYN flag in TCP headers. Then, for instance, SYN flooding attacks using IPsec ESP encryption can also be rate limited at the IP layer.

It might seem pedantic to go to all this trouble to enable ECN on the initial packet of a flow, but it is motivated by a much wider concern to ensure safe congestion control will still be possible even if the application mix evolves to the point where the majority of flows consist of a single window or even a single packet. It also allows denial of service attacks to be more easily isolated and prevented.

5.5.  Control and Management

5.5.1.  Negative Balance Warning

A new ICMP message type is being considered so that a dropper can warn the apparent sender of a flow that it has started to sanction the flow. The message would have similar semantics to the `Time exceeded' ICMP message type. To ensure the sender has to invest some work before the network will generate such a message, a dropper SHOULD only send such a message for flows that have demonstrated that they have started correctly by establishing a positive record, but have later gone negative. The threshold is up to the implementation. The purpose of the message is to distinguish drops due to sanctions from drops due to other causes, such as congestion or transmission losses. The dropper would send the message to the sender of the flow, not the receiver.

If we did define this message type, it would be REQUIRED for all re-ECT senders to parse and understand it. Note that a sender MUST only use this message to explain why losses are occurring. A sender MUST NOT take this message to mean that losses have occurred that it was not aware of. Otherwise, spoof messages could be sent by malicious sources to slow down a sender (cf. ICMP source quench).

However, the need for this message type is not yet confirmed, as we are considering how to prevent it being used by malicious senders to scan for droppers and to test their threshold settings. {ToDo: Complete this section.}

5.5.2.  Rate Response Control

As discussed in Section 6.1.5, the sender's access operator will be expected to use bulk per-user policing, but they might choose to introduce a per-flow policer. In cases where operators do introduce per-flow policing, there may be a need for a sender to send a request to the ingress policer asking for permission to apply a non-default response to congestion (where TCP-friendly is assumed to be the default). This would require the sender to know what message format(s) to use and to be able to discover how to address the policer. The required control protocol(s) are outside the scope of this document, but will require definition elsewhere.

The policer is likely to be local to the sender and inline, probably at the ingress interface to the internetwork. So, discovery should not be hard. A variety of control protocols already exist for some widely used rate-responses to congestion. For instance DCCP congestion control identifiers (CCIDs [RFC4340]) fulfil this role and so does QoS signalling (e.g. an RSVP request for controlled load service is equivalent to a request for no rate response to congestion, but with admission control).

5.6.  IP in IP Tunnels

For re-ECN to work correctly through IP in IP tunnels, it needs slightly different tunnel handling to regular ECN [RFC3168].
Currently there is some inconsistency between how the handling of IP in IP tunnels is defined in [RFC3168] and how it is defined in [RFC4301], but re-ECN would work fine with the IPsec behaviour. This inconsistency is addressed in a new Internet Draft [ECN-tunnel] that proposes to update RFC3168 tunnel behaviour to bring it into line with IPsec. Ideally, for re-ECN to work through a tunnel, the tunnel entry should copy both the RE flag and the ECN field from the inner to the outer IP header. Then at the tunnel exit, any congestion marking of the outer ECN field should overwrite the inner ECN field (unless the inner field is Not-ECT, in which case an alarm should be raised). The RE flag shouldn't change along a path, so the outer RE flag should be the same as the inner. If it isn't, a management alarm should be raised. This behaviour is the same as the full-functionality variant of [RFC3168] at tunnel exit, but different at tunnel entry.

If tunnels are left as they are specified in [RFC3168], whether the limited or full-functionality variants are used, a problem arises with re-ECN if a tunnel crosses an inter-domain boundary, because the difference between positive and negative markings will not be correctly accounted for. In a limited functionality ECN tunnel, the flow will appear to be legacy traffic, and therefore may be wrongly rate limited. In a full-functionality ECN tunnel, the result will depend on whether the tunnel entry copies the inner RE flag to the outer header or the RE flag in the outer header is always cleared. If the former, the flow will tend to be too positive when accounted for at borders. If the latter, it will be too negative. If the rules set out in [ECN-tunnel] are followed then this will not be an issue.

5.7.  Non-Issues

The following issues might seem to cause unfavourable interactions with re-ECN, but we will explain why they don't:

o  Various link layers support explicit congestion notification, such as Frame Relay and ATM. Explicit congestion notification is proposed to be added to other link layers, such as Ethernet (802.3ar Ethernet congestion management) and MPLS [ECN-MPLS];

o  Encryption and IPsec.

In the case of congestion notification at the link layer, each particular link layer scheme either manages congestion on the link with its own link-level feedback (the usual arrangement in the cases of ATM and Frame Relay), or congestion notification from the link layer is merged into congestion notification at the IP level when the frame headers are decapsulated at the end of the link (the recommended arrangement in the Ethernet and MPLS cases). Given the RE flag is not intended to change along the path, this means that downstream congestion will still be measurable at any point where IP is processed on the path by subtracting negative from positive markings.

In the case of encryption, as long as the tunnel issues described in Section 5.6 are dealt with, payload encryption itself will not be a problem. The design goal of re-ECN is to include downstream congestion in the IP header so that it is not necessary to burrow into inner headers. Obfuscation of flow identifiers is not a problem for re-ECN policing elements. Re-ECN doesn't ever require flow identifiers to be valid; it only requires them to be unique. So if an IPsec encapsulating security payload (ESP [RFC2406]) or an authentication header (AH [RFC2402]) is used, the security parameters index (SPI) will be a sufficient flow identifier, as it is intended to be unique to a flow without revealing actual port numbers.

In general, even if endpoints use some locally agreed scheme to hide port numbers, re-ECN policing elements can just consider the pair of source and destination IP addresses as the flow identifier. Re-ECN encourages endpoints to at least tell the network layer that a sequence of packets are all part of the same flow, if indeed they are. The alternative would be for the sender to make each packet appear to be a new flow, which would require them all to be marked FNE in order to avoid being treated with the bulk of malicious flows at the egress dropper. Given the FNE marking is worth +1 and networks are likely to rate limit FNE packets, endpoints are given an incentive not to set FNE on each packet. But if the sender really does want to hide the flow relationship between packets it can choose to pay the cost of multiple FNE packets, which in the long run will compensate for the extra memory required on network policing elements to process each flow.

6.  Applications

6.1.  Policing Congestion Response

6.1.1.  The Policing Problem

The current Internet architecture trusts hosts to respond voluntarily to congestion. Limited evidence shows that the large majority of end-points on the Internet comply with a TCP-friendly response to congestion. But telephony (and increasingly video) services over the best effort Internet are attracting the interest of major commercial operations. Most of these applications do not respond to congestion at all. Those that can switch to lower rate codecs still have a lower bound below which they must become unresponsive to congestion.

Of course, the Internet is intended to support many different application behaviours. But the problem is that this freedom can be exercised irresponsibly. The greater problem is that we will never be able to agree on where the boundary is between responsible and irresponsible. Therefore re-ECN is designed to allow different networks to set their own view of the limit to irresponsibility, and to allow networks that choose a more conservative limit to push back against congestion caused in more liberal networks.

As an example of the impossibility of setting a standard for fairness, mandating TCP-friendliness would set the bar too high for unresponsive streaming media, but still some would say the bar was too low. Even though all known peer-to-peer filesharing applications are TCP-compatible, they can cause a disproportionate amount of congestion, simply by using multiple flows and by transferring data continuously relative to other short-lived sessions. On the other hand, if we swung the other way and set the bar low enough to allow streaming media to be unresponsive, we would also allow denial of service attacks, which are typically unresponsive to congestion and consist of multiple continuous flows.

Applications that need (or choose) to be unresponsive to congestion can effectively take (some would say steal) whatever share of bottleneck resources they want from responsive flows. Whether or not such free-riding is common, inability to prevent it increases the risk of poor returns for investors in network infrastructure, leading to under-investment. An increasing proportion of unresponsive or free-riding demand coupled with persistent under-supply is a broken economic cycle. Therefore, if the current, largely co-operative consensus continues to erode, congestion collapse could become more common in more areas of the Internet [RFC3714].

While we have designed re-ECN so that networks can choose to deploy stringent policing, this does not imply we advocate that every network should introduce tight controls on those that cause congestion. Re-ECN has been specifically designed to allow different networks to choose how conservative or liberal they wish to be with respect to policing congestion. But those that choose to be conservative can protect themselves from the excesses that liberal networks allow their users.

6.1.2.  The Case Against Bottleneck Policing

The state of the art in rate policing is the bottleneck policer, which is intended to be deployed at any forwarding resource that may become congested. Its aim is to detect flows that cause significantly more local congestion than others. Although operators might solve their immediate problems by deploying bottleneck policers, we are concerned that widespread deployment would make it extremely hard to evolve new application behaviours. We believe the IETF should offer re-ECN as the preferred protocol on which to base solutions to the policing problems of operators, because it would not harm evolvability and, frankly, it would be far more effective (see later for why).

Schemes like [XCHOKe] and [pBox] are good approaches to rate policing traffic without the benefit of whole path information (such as could be provided by re-ECN). But they must be deployed at bottlenecks in order to work. Unfortunately, a large proportion of traffic traverses at least two bottlenecks (in two access networks), particularly with the current traffic mix where peer-to-peer file-sharing is prevalent. If ECN were deployed, we believe it would be likely that these bottleneck policers would be adapted to combine ECN congestion marking from the upstream path with local congestion knowledge. But then the only useful placement for such policers would be close to the egress of the internetwork.

But then, if these bottleneck policers were widely deployed (which would require them to be more effective than they are now), the Internet would find itself with one universal rate adaptation policy (probably TCP-friendliness) embedded throughout the network. Given TCP's congestion control algorithm is already known to be hitting its scalability limits and new algorithms are being developed for high-speed congestion control, embedding TCP policing into the Internet would make evolution to new algorithms extremely painful. If a source wanted to use a different algorithm, it would have to first discover then negotiate with all the policers on its path, particularly those in the far access network. The IETF has already traveled that path with the Intserv architecture and found it constrains scalability [RFC2208].

In any case, if bottleneck policers were ever widely deployed, they would be likely to be bypassed by determined attackers. They inherently have to police fairness per flow or per source-destination pair. Therefore they can easily be circumvented either by opening multiple flows (by varying the end-point port number), or by spoofing the source address but arranging with the receiver to hide the true return address at a higher layer.

6.1.3.  Re-ECN Incentive Framework

The aim is to create an incentive environment that ensures optimal sharing of capacity despite everyone acting selfishly (including lying and cheating). Of course, the mechanisms put in place for this can lie dormant wherever co-operation is the norm.

Throughout this document we focus on path congestion. But some forms of fairness, particularly TCP's, also depend on round trip time. If TCP-fairness is required, we also propose to measure downstream path delay using re-feedback. We give a simple outline of how this could work in Appendix F. However, we do not expect this to be necessary, as researchers tend to agree that only congestion control dynamics need to depend on RTT, not the rate that the algorithm would converge on after a period of stability.

Figure 8 sketches the incentive framework that we will describe piece by piece throughout this section. We will do a first pass in overview, then return to each piece in detail. We re-use the earlier example of how downstream congestion is derived by subtracting upstream congestion from path congestion (Figure 1) but depict multiple trust boundaries to turn it into an internetwork. For clarity, only downstream congestion is shown (the difference between the two earlier plots). The graph displays downstream path congestion seen in a typical flow as it traverses an example path from sender S to receiver R, across networks N1, N2 & N4. Everyone is shown using re-ECN correctly, but we intend to show why everyone would /choose/ to use it correctly, and honestly.

Three main types of self-interest can be identified:

o  Users want to transmit data across the network as fast as possible, paying as little as possible for the privilege. In this respect, there is no distinction between senders and receivers, but we must be wary of potential malice by one on the other;

o  Network operators want to maximise revenues from the resources they invest in. They compete amongst themselves for the custom of users;

o  Attackers (whether users or networks) want to use any opportunity to subvert the new re-ECN system for their own gain or to damage the service of their victims, whether targeted or random.

[Figure 8 plot: downstream congestion (vertical axis, 0% to 3%) against resource index (horizontal axis) along the path S, N1, N2, N4, R. Downstream congestion steps down from 3% to 2% and then to 0%; the marking fractions labelled at resources 0 and i are 1.00% and 2.00% respectively. A policer is shown at the ingress of N1 and a dropper at the last network egress.]

Figure 8: Incentive Framework, showing creation of opposing pressures to under-declare and over-declare downstream congestion, using a policer and a dropper

Source congestion control: We want to ensure that the sender will throttle its rate as downstream congestion increases. Whatever the agreed congestion response (whether TCP-compatible or some enhanced QoS), to some extent it will always be against the sender's interest to comply.

Ingress policing: But it is in all the network operators' interests to encourage fair congestion response, so that their investments are employed to satisfy the most valuable demand. The re-ECN protocol ensures packets carry the necessary information about their own expected downstream congestion so that N1 can deploy a policer at its ingress to check that S is complying with whatever congestion control it should be using (Section 6.1.5). If N1 is extremely conservative it could police each flow, but it is likely to just police the bulk amount of congestion each customer causes without regard to flows, or if it is extremely liberal it need not police congestion control at all. In any case, it is always preferable to police traffic at the very first ingress into an internetwork, before non-compliant traffic can cause any damage.

Edge egress dropper:  If the policer ensures the source has less right to a high rate the higher it declares downstream congestion, the source has a clear incentive to understate downstream congestion. But, if flows of packets are understated when they enter the internetwork, they will have become negative by the time they leave. So, we introduce a dropper at the last network egress, which drops packets in flows that persistently declare negative downstream congestion (see Section 6.1.4 for details).

[Figure 9 (diagram not reproduced): the downstream congestion plot of Figure 8 for the path S, N1, N2, N4, R, again stepping from 3% to 2% to 0% at marking fractions of 1.00% and 2.00%. Arrows labelled "penalties" and "competitive routing" point down at the inter-domain borders, and "sanctions" is marked below the plot.]

Figure 9: Incentives at Inter-domain Borders

Inter-domain traffic policing:  But next we must ask, if congestion arises downstream (say in N4), what is the ingress network's (N1's) incentive to police its customers' response? If N1 turns a blind eye, its own customers benefit while other networks suffer. This is why all inter-domain QoS architectures (e.g. Intserv, Diffserv) police traffic each time it crosses a trust boundary. We have already shown that re-ECN gives a trustworthy measure of the expected downstream congestion that a flow will cause by subtracting negative volume from positive at any intermediate point on a path. N4 (say) can use this measure to police all the responses to congestion of all the sources beyond its upstream neighbour (N2), but in bulk with one very simple passive mechanism, rather than per flow, as we will now explain using Figure 9.
Emulating policing with inter-domain congestion penalties:  Between high-speed networks, we would rather avoid per-flow policing, and we would rather avoid holding back traffic while it is policed. Instead, once re-ECN has arranged headers to carry downstream congestion honestly, N2 can contract to pay N4 penalties in proportion to a single bulk count of the congestion metrics crossing their mutual trust boundary (Section 6.1.6). In this way, N4 puts pressure on N2 to suppress downstream congestion, for every flow passing through the border interface, even though they will all start and end in different places, and even though they may all be allowed different responses to congestion. The figure depicts this downward pressure on N2 by the solid downward arrow at the egress of N2. Then N2 has an incentive either to police the congestion response of its own ingress traffic (from N1) or to emulate policing by applying penalties to N1 in turn on the basis of congestion counted at their mutual boundary. In this recursive way, the incentives for each flow to respond correctly to congestion trace back with each flow precisely to each source, despite the mechanism not recognising flows (see Section 6.2.2).

Inter-domain congestion charging diversity:  Any two networks are free to agree any of a range of penalty regimes between themselves but they would only provide the right incentives if they were within the following reasonable constraints. N2 should expect to have to pay penalties to N4 where penalties monotonically increase with the volume of congestion and negative penalties are not allowed. For instance, they may agree an SLA with tiered congestion thresholds, where higher penalties apply the higher the threshold that is broken.
But the most obvious (and useful) form of penalty is where N4 levies a charge on N2 proportional to the volume of downstream congestion N2 dumps into N4. In the explanation that follows, we assume this specific variant of volume charging between networks - charging proportionate to the volume of congestion.

We must make clear that we are not advocating that everyone should use this form of contract. We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. And we strongly share this desire to encourage diversity. But our aim is merely to show that border policing can at least work with this one model; we can then assume that operators might experiment with the metric in other models (see Section 6.1.6 for examples). Of course, operators are free to complement this usage element of their charges with traditional capacity charging, and we expect they will as predicted by economics.

No congestion charging to users:  Bulk congestion penalties at trust boundaries are passive and extremely simple, and lose none of their per-packet precision from one boundary to the next (unlike Diffserv all-address traffic conditioning agreements, which dissipate their effectiveness across long topologies). But at any trust boundary, there is no imperative to use congestion charging.

Traditional traffic policing can be used, if the complexity and cost is preferred. In particular, at the boundary with end customers (e.g. between S and N1), traffic policing will most likely be more appropriate. Policer complexity is less of a concern at the edge of the network. And end-customers are known to be highly averse to the unpredictability of congestion charging.
NOTE WELL: This document neither advocates nor requires congestion charging for end customers and advocates but does not require inter-domain congestion charging.

Competitive discipline of inter-domain traffic engineering:  With inter-domain congestion charging, a domain seems to have a perverse incentive to fake congestion; N2's profit depends on the difference between congestion at its ingress (its revenue) and at its egress (its cost). So, overstating internal congestion seems to increase profit. However, smart border routing [Smart_rtg] by N1 will bias its routing towards the least cost routes. So, N2 risks losing all its revenue to competitive routes if it overstates congestion (see Section 6.2.3). In other words, if N2 is the least congested route, its ability to raise excess profits is limited by the congestion on the next least congested route. This pressure on N2 to remain competitive is represented by the dotted downward arrow at the ingress to N2 in Figure 9.

Closing the loop:  All the above elements conspire to trap everyone between two opposing pressures (the downward and upward arrows in Figure 8 & Figure 9), ensuring the downstream congestion metric arrives at the destination neither above nor below zero. So, we have arrived back where we started in our argument. The ingress edge network can rely on downstream congestion declared in the packet headers presented by the sender. So it can police the sender's congestion response accordingly.

Evolvability of congestion control:  We have seen that re-ECN enables policing at the very first ingress. We have also seen that, as flows continue on their path through further networks downstream, re-ECN removes the need for further per-domain ingress policing of all the different congestion responses allowed to each different flow.
This is why the evolvability of re-ECN policing is so superior to bottleneck policing or to any policing of different QoS for different flows. Even if all access networks choose to conservatively police congestion per flow, each will want to compete with the others to allow new responses to congestion for new types of application. With re-ECN, each can introduce new controls independently, without coordinating with other networks and without having to standardise anything. But, as we have just seen, by making inter-domain penalties proportionate to bulk downstream congestion, downstream networks can be agnostic to the specific congestion response for each flow, but they can still apply more penalty the more liberal the ingress access network has been in the response to congestion it allowed for each flow.

6.1.3.1.  The Case against Classic Feedback

A system that produces an optimal outcome as a result of everyone's selfish actions is extremely powerful. Especially one that enables evolvability of congestion control. But why do we have to change to re-ECN to achieve it? Can't classic congestion feedback (as used already by standard ECN) be arranged to provide similar incentives and similar evolvability? Superficially it can. Kelly's seminal work showed how we can allow everyone the freedom to evolve whatever congestion control behaviour is in their application's best interest but still optimise the whole system of networks and users by placing a price on congestion to ensure responsible use of this freedom [Evol_cc]. Kelly used ECN with its classic congestion feedback model as the mechanism to convey congestion price information. The mechanism could be thought of as volume charging; except only the volume of packets marked with congestion experienced (CE) was counted.
However, below we explain why relying on classic feedback /required/ congestion charging to be used, while re-ECN achieves the same powerful outcome (given it is built on Kelly's foundations), but does not /require/ congestion charging. In brief, the problem with classic feedback is that the incentives have to trace the indirect path back to the sender---the long way round the feedback loop. For example, if classic feedback were used in Figure 8, N2 would have had to influence N1 via all of N4, R & S rather than directly.

Inability to agree what is happening downstream:  In order to police its upstream neighbour's congestion response, the neighbours should be able to agree on the congestion to be responded to. Whatever the feedback regime, as packets change hands at each trust boundary, any path metrics they carry are verifiable by both neighbours. But, with a classic path metric, they can only agree on the /upstream/ path congestion.

Inaccessible back-channel:  The network needs a whole-path congestion metric if it wants to control the source. Classically, whole path congestion emerges at the destination, to be fed back from receiver to sender in a back-channel. But, in any data network, back-channels need not be visible to relays, as they are essentially communications between the end-points. They may be encrypted, asymmetrically routed or simply omitted, so no network element can reliably intercept them. The congestion charging literature solves this problem by charging the receiver and assuming this will cause the receiver to refer the charges to the sender. But, of course, this creates unintended side-effects...
`Receiver pays' unacceptable:  In connectionless datagram networks, receivers and receiving networks cannot prevent reception from malicious senders, so `receiver pays' opens them to `denial of funds' attacks.

End-user congestion charging unacceptable:  Even if `denial of funds' were not a problem, we know that end-users are highly averse to the unpredictability of congestion charging and anyway, we want to avoid restricting network operators to just one retail tariff. But with classic feedback only an upstream metric is available, so we cannot avoid having to wrap the `receiver pays' money flow around the feedback loop, necessarily forcing end-users to be subjected to congestion charging.

To summarise so far, with classic feedback, policing congestion response without losing evolvability /requires/ congestion charging of end-users and a `receiver pays' model, whereas, with re-ECN, it is still possible to influence incentives using congestion charging but using the safer `sender pays' model. However, congestion charging is only likely to be appropriate between domains. So, without losing evolvability, re-ECN enables technical policing mechanisms that are more appropriate for end users than congestion pricing.

We now take a second pass over the incentive framework, filling in the detail.

6.1.4.  Egress Dropper

As traffic leaves the last network before the receiver (domain N4 in Figure 8), the fraction of positive octets in a flow should match the fraction of negative octets introduced by congestion marking, leaving a balance of zero. If it is less (a negative flow), it implies that the source is understating path congestion (which will reduce the penalties that N2 owes N4).
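The balance described above can be illustrated with a minimal sketch. It is not the dropper algorithm of Appendix E; it only computes the balance a dropper would examine. The `Packet` type, the `flow_balance` helper and the example flows are hypothetical names and data introduced for illustration, and giving legacy Not-RECT packets a worth of zero is an assumption.

```python
from dataclasses import dataclass

# Worth per octet of each marking, following the text: Re-Echo and FNE are
# positive, CE(-1) is negative, RECT and CE(0) (canceled) are neutral.
# Giving legacy Not-RECT packets a worth of zero is an assumption of this sketch.
WORTH = {
    "Re-Echo": +1,
    "FNE": +1,
    "RECT": 0,
    "CE(0)": 0,
    "CE(-1)": -1,
    "Not-RECT": 0,
}


@dataclass
class Packet:
    marking: str  # one of the keys of WORTH
    octets: int   # packet size in octets


def flow_balance(packets):
    """Return the positive minus the negative fraction of octets in a flow.

    Zero means positive and negative octets match, below zero means the
    flow is negative (path congestion understated), above zero positive.
    """
    total = sum(p.octets for p in packets)
    if total == 0:
        return 0.0
    signed = sum(WORTH[p.marking] * p.octets for p in packets)
    return signed / total


# Hypothetical flows of 100 equal-sized packets seen at the egress of N4.
honest = ([Packet("Re-Echo", 1000)] * 3 + [Packet("CE(-1)", 1000)] * 3
          + [Packet("RECT", 1000)] * 94)
understating = ([Packet("Re-Echo", 1000)] * 1 + [Packet("CE(-1)", 1000)] * 3
                + [Packet("RECT", 1000)] * 96)

print(flow_balance(honest))        # balance of zero
print(flow_balance(understating))  # below zero: a negative flow
```

A real dropper would act only on flows that stay negative persistently, so a single snapshot like this one would not by itself justify dropping packets.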
If flows are positive, N4 need take no action---this simply means its upstream neighbour is paying more penalties than it needs to, and the source is going slower than it needs to. But, to protect itself against persistently negative flows, N4 will need to install a dropper at its egress. Appendix E gives a suggested algorithm for this dropper. There is no intention that the dropper algorithm needs to be standardised; it is merely provided to show that an efficient, robust algorithm is possible. But whatever algorithm is used must meet the criteria below:

o  It SHOULD introduce minimal false positives for honest flows;

o  It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o  It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o  It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate expected losses.

Note that the dropper operates on flows but we would like it not to require per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a dropper is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to the sender of setting FNE (positive `worth').
Indeed, with the FNE codepoint, the rate at which a sender can generate new flows can be limited (Appendix G). In this respect, the FNE codepoint works like Handley's state set-up bit [Steps_DoS].

Appendix E also gives an example dropper implementation that aggregates flow state. Dropper algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a dropper SHOULD only allow flows into the average if they start with FNE, but it SHOULD NOT include packets with the FNE codepoint set in the average. A sender sets the FNE codepoint when it does not have the benefit of feedback from the receiver. So, counting packets with FNE cleared would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the dropper detects a persistently negative flow, it SHOULD drop sufficient negative and neutral packets to force the flow to not be negative. Drops SHOULD be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal extra harm.

6.1.5.  Policing

Access operators who wish to limit the congestion that a sender is able to cause can deploy policers at the very first ingress to the internetwork. Re-ECN has been designed to avoid the need for bottleneck policing so that we can avoid a future where a single rate adaptation policy is embedded throughout the network. Instead, re-ECN allows the particular rate adaptation policy to be solely agreed bilaterally between the sender and its ingress access provider (Section 5.5.2 discusses possible ways to signal between them), which allows congestion control to be policed, but maintains its evolvability, requiring only a single, local box to be updated.
Appendix G gives examples of per-user policing algorithms. But there is no implication that these algorithms are to be standardised, or that they are ideal. The ingress rate policer is the part of the re-ECN incentive framework that is intended to be the most flexible. Once endpoint protocol handlers for re-ECN and egress droppers are in place, operators can choose exactly which congestion response they want to police, and whether they want to do it per user, per flow or not at all.

The re-ECN protocol allows these ingress policers to easily perform bulk per-user policing (Appendix G.1). This is likely to provide sufficient incentive to the user to correctly respond to congestion without needing the policing function to be overly complex. If an access operator chose, it could use per-flow policing according to the widely adopted TCP rate adaptation (Appendix G.2) or other alternatives; however, this would introduce extra complexity to the system.

If a per-flow rate policer is used, it should use path (not downstream) congestion as the relevant metric, which is represented by the fraction of octets in packets with positive (Re-Echo and FNE) and canceled (CE(0)) markings. Of course, re-ECN provides all the information a policer needs directly in the packets being policed. So, even policing TCP's AIMD algorithm is relatively straightforward (Appendix G.2).

Note that we have included canceled packets in the measure of path congestion. Canceled packets arise when the sender re-echoes earlier congestion, but then this Re-Echo packet just happens to be congestion marked itself. One would not normally expect many canceled packets at the first ingress because one would not normally expect much congestion marking to have been necessary that soon in the path.
However, a home network or campus network may well sit between the sending endpoint and the ingress policer, so some congestion may occur upstream of the policer. And if congestion does occur upstream, some canceled packets should be visible, and should be taken into account in the measure of path congestion.

But a much more important reason for including canceled packets in the measure of path congestion at an ingress policer is that a sender might otherwise subvert the protocol by sending canceled packets instead of neutral (RECT) packets. Like neutral, canceled packets are worth zero, so the sender knows they won't be counted against any quota it might have been allowed. But unlike neutral packets, canceled packets are immune to congestion marking, because they have already been congestion marked. So, it is both correct and useful that canceled packets should be included in a policer's measure of path congestion, as this removes the incentive the sender would otherwise have to mark more packets as canceled than it should.

An ingress policer should also ensure that flows are not already negative when they enter the access network. As with canceled packets, the presence of negative packets will typically be unusual. Therefore it will be easy to detect negative flows at the ingress by just detecting negative packets then monitoring the flow they belong to.

Of course, even if the sender does operate its own network, it may arrange not to congestion mark traffic. Whether the sender does this or not is of no concern to anyone else except the sender. Such a sender will not be policed against its own network's contribution to congestion, but the only resulting problem would be overload in the sender's own network.
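The path congestion metric described earlier in this section for a per-flow policer, and the reason for counting canceled packets in it, can be illustrated with a minimal sketch. It is not one of the policing algorithms of Appendix G. The `path_congestion` helper, the (marking, octets) tuple layout and the example flows are assumptions made for illustration.

```python
# Markings a per-flow policer counts towards path congestion, per the text:
# positive (Re-Echo and FNE) and canceled (CE(0)).
PATH_CONGESTION_MARKINGS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion(packets):
    """Fraction of octets carried in positive or canceled packets.

    `packets` is an iterable of (marking, octets) pairs; this tuple layout
    is an assumption of the sketch, not something the protocol defines.
    """
    packets = list(packets)
    total = sum(octets for _, octets in packets)
    if total == 0:
        return 0.0
    counted = sum(octets for marking, octets in packets
                  if marking in PATH_CONGESTION_MARKINGS)
    return counted / total


# Hypothetical flow: 2 Re-Echo packets in every 100.
declared = [("Re-Echo", 1500)] * 2 + [("RECT", 1500)] * 98
# Same flow, but the sender replaces 10 neutral packets with canceled ones.
padded = ([("Re-Echo", 1500)] * 2 + [("CE(0)", 1500)] * 10
          + [("RECT", 1500)] * 88)

print(path_congestion(declared))  # 2 of 100 packets counted
print(path_congestion(padded))    # 12 of 100 packets counted
```

Because canceled packets raise the measured path congestion, a sender that substitutes them for neutral packets is policed as if the path were more congested, so the substitution gains it nothing.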
Finally, we must not forget that an easy way to circumvent re-ECN's defences is for the source to turn off re-ECN support, by setting the Not-RECT codepoint, implying legacy traffic. Therefore an ingress policer should put a general rate-limit on Not-RECT traffic, which SHOULD be lax during early, patchy deployment, but will have to become stricter as deployment widens. Similarly, flows starting without an FNE packet can be confined by a strict rate-limit used for the remainder of flows that haven't proved they are well-behaved by starting correctly (therefore they need not consume any flow state---they are just confined to the `misbehaving' bin if they carry an unrecognised flow ID).

6.1.6.  Inter-domain Policing

One of the main design goals of re-ECN is for border security mechanisms to be as simple as possible, otherwise they will become the pinch-points that limit scalability of the whole internetwork. We want to avoid per-flow processing at borders and to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding. Such passive, off-line mechanisms are essential for future high-speed all-optical border interconnection where packets cannot be buffered while they are checked for policy compliance.

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing.

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: CE(-1).
Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network. Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix H.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 6.1.4 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion.
But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o  A host can send traffic with no positive markings towards its intended destination, aiming to transmit as much traffic as any dropper will allow [Bauer06]. It may add forward error correction (FEC) to repair as much drop as it experiences.

o  A host can send dummy traffic into the network with no positive markings and with no intention of communicating with anyone, but merely to cause higher levels of congestion for others who do want to communicate (DoS). So, to ride over the extra congestion, everyone else has to spend more of whatever rights to cause congestion they have been allowed.

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 5.3 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour.
It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix H.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows.
Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders---they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical---best endeavours will be sufficient.

For instance, let us consider the case where a source sends traffic with no positive markings at all, hoping to at least get as much traffic delivered as network-based droppers will allow. The flow is likely to go at least slightly negative in the first network on the path (N1 if we use the example network layout in Figure 9). If all networks use the algorithm in Appendix H.2 to inflate penalties at their border with an upstream network, they will remove the effect of negative flows. So, for instance, N2 will not be paying a penalty to N1 for this flow. Further, because the flow contributes no positive markings at all, a dropper at the egress will completely remove it.

The remaining problem is that every network is carrying a flow that is causing congestion to others but not being held to account for the congestion it is causing.
Whenever the fail-safe border algorithm (Section 6.1.7) or the border algorithm to compensate for negative flows (Appendix H.2) detects a negative flow, it can instantiate a focused dropper for that flow locally. It may be some time before the flow is detected, but the more strongly negative the flow is, the more quickly it will be detected by the fail-safe algorithm. But, in the meantime, it will not be distorting border incentives. Until it is detected, if it contributes to drop anywhere, its packets will tend to be dropped before others if routers use the preferential drop rules in Section 5.3, which discriminate against non-positive packets. All networks below the point where a flow goes negative (N1, N2 and N4 in this case) have an incentive to remove this flow, but the router where it first goes negative (in N1) can of course remove the problem for everyone downstream.

In the case of DDoS attacks, Section 6.2.1 describes how re-ECN mitigates their force.

6.1.7.  Inter-domain Fail-safes

The mechanisms described so far create incentives for rational network operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

Highly positive flows:  A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored.
      If the fraction of positive marking is well above a threshold (to
      be determined by operational practice), a management alarm SHOULD
      be raised, and the flow MAY be automatically subject to focused
      drop.

   Persistently negative flows: A small sample of congestion-marked
      (negative) packets should be picked randomly as they cross a
      border interface. Then subsequent packets matching the same
      source and destination address and DSCP should be monitored. If
      the balance of positive minus negative markings is persistently
      negative, a management alarm SHOULD be raised, and the flow MAY be
      automatically subject to focused drop.

   Both mechanisms rely on the fact that, because the sample is drawn
   randomly from positive (or negative) packets only, highly positive
   (or negative) flows will appear in the sample more quickly.

6.1.8. Simulations

   Simulations of policer and dropper performance done for the multi-bit
   version of re-feedback have been included in Section 5 "Dropper
   Performance" of [Re-fb]. Simulations of policer and dropper for the
   re-ECN version described in this document are work in progress.

6.2. Other Applications

6.2.1. DDoS Mitigation

   A flooding attack is inherently about congestion of a resource.
   Because re-ECN ensures the sources causing network congestion
   experience the cost of their own actions, it acts as a first line of
   defence against DDoS. As load focuses on a victim, upstream queues
   grow, requiring honest sources to pre-load packets with a higher
   fraction of positive packets. Once downstream routers are so
   congested that they are dropping traffic, they will be CE marking
   100% of the traffic they do forward. Honest sources will therefore
   be sending 100% Re-Echo (and therefore being severely rate-limited at
   the ingress).

   Senders under malicious control can either do the same as honest
   sources, and be rate-limited at ingress, or they can understate
   congestion by sending more neutral RECT packets than they should. If
   sources understate congestion (i.e. do not re-echo sufficient
   positive packets) and the preferential drop ranking is implemented on
   routers (Section 5.3), these routers will preserve positive traffic
   until last. So, the neutral traffic from malicious sources will all
   be automatically dropped first. Either way, the malicious sources
   cannot send more than honest sources.

   Further, hosts under malicious control will tend to be re-used for
   many different attacks. They will therefore build up a long term
   history of causing congestion. Therefore, as long as the population
   of potentially compromisable hosts around the Internet is limited,
   the per-user policing algorithms in Appendix G.1 will gradually
   throttle down zombies and other launchpads for attacks. Thus,
   widespread deployment of re-ECN could considerably dampen the force
   of DDoS. Certainly, zombie armies could hold their fire for long
   enough to be able to build up enough credit in the per-user policers
   to launch an attack. But they would then still be limited to no more
   throughput than other, honest users.

   Inter-domain traffic policing (see Section 6.1.6) ensures that any
   network that harbours compromised `zombie' hosts will have to bear
   the cost of the congestion caused by traffic from zombies in
   downstream networks. Such networks will be incentivised to deploy
   per-user policers that rate-limit hosts that are unresponsive to
   congestion so they can only send very slowly into congested paths.
   As well as protecting other networks, the extremely poor performance
   at any sign of congestion will incentivise the zombie's owner to
   clean it up.
   However, the host should behave normally when using uncongested
   paths.

   Uniquely, re-ECN handles DDoS traffic without relying on the validity
   of identifiers in packets. Certainly the egress dropper relies on
   uniqueness of flow identifiers, but not their validity. So if a
   source spoofs another address, re-ECN works just as well, as long as
   the attacker cannot imitate all the flow identifiers of another
   active flow passing through the same dropper (see Section 6.3).
   Similarly, the ingress policer relies on uniqueness of flow IDs, not
   their validity. This is because a new flow will only be allowed any
   rate at all if it starts with FNE, and the more FNE packets there are
   starting new flows, the more they will be limited. Essentially a
   re-ECN policer limits the bulk of all congestion entering the network
   through a physical interface; limiting the congestion caused by each
   flow is merely an optional extra.

6.2.2. End-to-end QoS

   {ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline
   of the text that will be added here).}

6.2.3. Traffic Engineering

   {ToDo: }

6.2.4. Inter-Provider Service Monitoring

   {ToDo: }

6.3. Limitations

   The known limitations of the re-ECN approach are:

   o  We still cannot defend against the attack described in Section 10
      where a malicious source sends negative traffic through the same
      egress dropper as another flow and imitates its flow identifiers,
      allowing a malicious source to cause an innocent flow to
      experience heavy drop.

   o  Re-feedback for TTL (re-TTL) would also be desirable at the same
      time as re-ECN. Unfortunately this requires a further standards
      action for the mechanisms briefly described in Appendix F.

   o  Traffic must be ECN-capable for re-ECN to be effective.
      The only defence against malicious users who turn off ECN
      capability is that networks are expected to rate limit Not-ECT
      traffic and to apply higher drop preference to it during
      congestion. Although these are blunt instruments, they at least
      represent a feasible scenario for the future Internet where
      Not-ECT traffic co-exists with re-ECN traffic, but as a severely
      hobbled under-class. We recommend (Section 7.1) that while
      accommodating a smooth initial transition to re-ECN, policing
      policies should gradually be tightened to rate limit Not-ECT
      traffic more strictly in the longer term.

   o  When checking whether a flow is balancing positive markings with
      congestion marking, re-ECN can only account for congestion
      marking, not drops. So, whenever a sender experiences drop, it
      does not have to re-echo the congestion event. Nonetheless, it is
      hardly any advantage to be able to send faster than other flows
      only if your traffic is dropped and the other traffic isn't.

   o  We are considering the issue of whether it would be useful to
      truncate rather than drop packets that appear to be malicious, so
      that the feedback loop is not broken but useful data can be
      removed.

7. Incremental Deployment

7.1. Incremental Deployment Features

   The design of the re-ECN protocol started from the fact that the
   current ECN marking behaviour of routers was sufficient and that
   re-feedback could be introduced around these routers by changing the
   sender behaviour but not the routers. Otherwise, if we had required
   routers to be changed, the chance of encountering a path that had
   every router upgraded would be vanishingly small during early
   deployment, giving no incentive to start deployment. Also, as there
   is no new forwarding behaviour, routers and hosts do not have to
   signal or negotiate anything.
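
   As a rough illustration of why a design that required router changes
   would stall, suppose (purely as an assumption for this example, not
   a measurement) that each of the n routers on a path had been upgraded
   independently with probability p. The symbols p and n are introduced
   here only for illustration:

```latex
P(\text{all } n \text{ routers upgraded}) = p^{n},
\qquad
p = 0.2,\; n = 10 \;\Rightarrow\; P = 0.2^{10} \approx 1.0 \times 10^{-7}
```

   Under these assumed numbers, roughly one path in ten million would
   be fully upgraded, which is why re-ECN was designed to work around
   unmodified ECN routers.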

   However, networks that choose to protect themselves using re-ECN do
   have to add new security functions at their trust boundaries with
   others. They distinguish legacy traffic by its ECN field. Traffic
   from Not-ECT transports is distinguishable by its Not-RECT marking.
   Traffic from legacy ECN transports is distinguished from re-ECN by
   which of ECT(0) or ECT(1) is used. We chose to use ECT(1) for re-ECN
   traffic deliberately. Existing ECN sources set ECT(0) on either 50%
   (the nonce) or 100% (the default) of packets, whereas re-ECN does not
   use ECT(0) at all. We can use this distinguishing feature of legacy
   ECN traffic to separate it out for different treatment at the various
   border security functions: egress dropping, ingress policing and
   border policing.

   The general principle we adopt is that an egress dropper will not
   drop any legacy traffic, but ingress and border policers will limit
   the bulk rate of legacy traffic that can enter each network. Then,
   during early re-ECN deployment, operators can set very permissive (or
   non-existent) rate-limits on legacy traffic, but once re-ECN
   implementations are generally available, legacy traffic can be
   rate-limited increasingly harshly. Ultimately, an operator might
   choose to block all legacy traffic entering its network, or at least
   only allow through a trickle.

   Then, the more strictly the limits are set, the more legacy ECN
   sources will gain by upgrading to re-ECN. Thus, towards the end of
   the voluntary incremental deployment period, legacy transports can be
   given progressively stronger encouragement to upgrade.
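
   The separation of legacy traffic described above can be sketched as
   follows. This is a minimal illustration, not an algorithm specified
   by this document. The ECN field values are those of RFC3168; the
   names classify, TokenBucket and border_admits, the fne argument
   (standing in for FNE detection via the RE flag, which is defined
   elsewhere in this document) and the choice of a token bucket as the
   bulk rate-limiter are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum

# Two-bit ECN field codepoints as defined in RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


class Treatment(Enum):
    RE_ECN = "re-ecn"
    LEGACY = "legacy"
    AMBIGUOUS = "ambiguous"


def classify(ecn_field: int, fne: bool = False) -> Treatment:
    """Separate legacy traffic from re-ECN traffic by its ECN field.

    `fne` is a hypothetical pre-decoded flag: FNE packets carry a
    Not-ECT ECN field but belong to re-ECN transports.
    """
    if ecn_field == ECT1 or (ecn_field == NOT_ECT and fne):
        return Treatment.RE_ECN
    if ecn_field in (NOT_ECT, ECT0):
        return Treatment.LEGACY      # Not-ECT transport or legacy ECN
    # CE has overwritten the original ECT codepoint, so the ECN field
    # alone no longer reveals which kind of sender set it.
    return Treatment.AMBIGUOUS


@dataclass
class TokenBucket:
    """Bulk rate-limiter for legacy traffic (illustrative choice)."""
    bytes_per_s: float               # operator-chosen; tightened over time
    depth_bytes: float
    tokens: float = field(init=False)
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.depth_bytes   # start with a full bucket

    def allow(self, size: int, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.depth_bytes,
                          self.tokens + elapsed * self.bytes_per_s)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


def border_admits(ecn_field: int, size: int, now: float,
                  legacy_bucket: TokenBucket, fne: bool = False) -> bool:
    """Ingress/border check: only legacy traffic is bulk rate-limited here.

    Policing of re-ECN traffic itself is a separate function that this
    sketch does not attempt to show.
    """
    if classify(ecn_field, fne) is Treatment.LEGACY:
        return legacy_bucket.allow(size, now)
    return True
```

   In this sketch an operator would start with a very large bytes_per_s
   (or no bucket at all) and lower it as re-ECN implementations become
   available, which is the tightening described above.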

   The following list of minor changes brings together all the points
   where re-ECN semantics for use of the two-bit ECN field differ from
   RFC3168:

   o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender
      sets ECT(0) by default (Section 3.3);

   o  No provision is necessary for a re-ECN capable source transport to
      use the ECN nonce (Section 4.1.2.1);

   o  Routers MAY preferentially drop different extended ECN codepoints
      (Section 5.3);

   o  Packets carrying the feedback not established (FNE) codepoint MAY
      be marked rather than dropped by routers, even though their ECN
      field is Not-ECT (with the important caveat in Section 5.3);

   o  Packets may be dropped by policing nodes because of apparent
      misbehaviour, not just because of congestion (Section 6);

   o  Tunnel entry behaviour is still to be defined, but may have to be
      different from RFC3168 (Section 5.6).

   None of these changes require any modifications to routers. Also
   none of these changes affect anything about end to end congestion
   control; they are all to do with allowing networks to police that end
   to end congestion control is well-behaved.

7.2. Incremental Deployment Incentives

   It would only be worth standardising the re-ECN protocol if there
   existed a coherent story for how it might be incrementally deployed.
   In order for it to have a chance of deployment, everyone who needs to
   act must have a strong incentive to act, and the incentives must
   arise in the order that deployment would have to happen. Re-ECN
   works around unmodified ECN routers, but we can't just discuss why
   and how re-ECN deployment might build on ECN deployment, because
   there is precious little to build on in the first place. Instead, we
   aim to show that re-ECN deployment could carry ECN with it.
   We focus on commercial deployment incentives, although some of the
   arguments apply equally to academic or government sectors.

   ECN deployment:

      ECN is largely implemented in commercial routers, but generally
      not as a supported feature, and it has largely not been deployed
      by commercial network operators. It has been released in many
      Unix-based operating systems, but not in proprietary OSs like
      Windows or those in many mobile devices. For detailed deployment
      status, see [ECN-Deploy]. We believe the reason ECN deployment
      has not happened is twofold:

      *  ECN requires changes to both routers and hosts. If someone
         wanted to sell the improvement that ECN offers, they would have
         to co-ordinate deployment of their product with others. An ECN
         server only gives any improvement on an ECN network. An ECN
         network only gives any improvement if used by ECN devices.
         Deployment that requires co-ordination adds cost and delay and
         tends to dilute any competitive advantage that might be gained.

      *  ECN `only' gives a performance improvement. Making a product a
         bit faster (whether the product is a device or a network),
         isn't usually a sufficient selling point to be worth the cost
         of co-ordinating across the industry to deploy it. Network
         operators tend to avoid re-configuring a working network unless
         launching a new product.

   ECN and re-ECN for Edge-to-edge Assured QoS:

      We believe the proposal to provide assured QoS sessions using a
      form of ECN called pre-congestion notification (PCN) [PCN-arch] is
      most likely to break the deadlock in ECN deployment first. It
      only requires edge-to-edge deployment so it does not require
      endpoint support. It can be deployed in a single network, then
      grow incrementally to interconnected networks.
      And it provides a different `product' (internetworked assured
      QoS), rather than merely making an existing product a bit faster.

      Not only could this assured QoS application kick-start ECN
      deployment, it could also carry re-ECN deployment with it; because
      re-ECN can enable the assured QoS region to expand to a large
      internetwork where neighbouring networks do not trust each other.
      [Re-PCN] argues that re-ECN security should be built in to the QoS
      system from the start, explaining why and how.

      If ECN and re-ECN were deployed edge-to-edge for assured QoS,
      operators would gain valuable experience. They would also clear
      away many technical obstacles such as firewall configurations that
      block all but the legacy settings of the ECN field and the RE
      flag.

   ECN in Access Networks:

      The next obstacle to ECN deployment would be extension to access
      and backhaul networks, where considerable link layer differences
      make implementation non-trivial, particularly on congested
      wireless links. ECN and re-ECN work fine during partial
      deployment, but they will not be very useful if the most congested
      elements in networks are the last to support them. Access network
      support is one of the weakest parts of this deployment story. All
      we can hope is that, once the benefits of ECN are better
      understood by operators, they will push for the necessary link
      layer implementations as deployment proceeds.

   Policing Unresponsive Flows:

      Re-ECN allows a network to offer differentiated quality of service
      as explained in Section 6.2.2. But we do not believe this will
      motivate initial deployment of re-ECN, because the industry is
      already set on alternative ways of doing QoS. Despite being much
      more complicated and expensive, the alternative approaches are
      here and now.

      But re-ECN is critical to QoS deployment in another respect.
      It can be used to prevent applications from taking whatever
      bandwidth they choose without asking.

      Currently, applications that remain resolute in their lack of
      response to congestion are rewarded by other TCP applications. In
      other words, TCP is naively friendly, in that it reduces its rate
      in response to congestion whether it is competing with friends
      (other TCPs) or with enemies (unresponsive applications).

      Therefore, those network owners that want to sell QoS will be keen
      to ensure that their users can't help themselves to QoS for free.
      Given the very large revenues at stake, we believe effective
      policing of congestion response will become highly sought after by
      network owners.

      But this does not necessarily argue for re-ECN deployment.
      Network owners might choose to deploy bottleneck policers rather
      than re-ECN-based policing. However, under Related Work
      (Section 9) we argue that bottleneck policers are inherently
      vulnerable to circumvention.

      Therefore we believe there will be a strong demand from network
      owners for re-ECN deployment so they can police flows that do not
      ask to be unresponsive to congestion, in order to protect their
      revenues from flows that do ask (QoS). In particular, we suspect
      that the operators of cellular networks will want to prevent VoIP
      and video applications being used freely on their networks as a
      more open market develops in GPRS and 3G devices.

      Initial deployments are likely to be isolated to single cellular
      networks. Cellular operators would first place requirements on
      device manufacturers to include re-ECN in the standards for mobile
      devices. In parallel, they would put out tenders for ingress and
      egress policers.
      Then, after a while they would start to tighten rate limits on
      Not-ECT traffic from non-standard devices and they would start
      policing whatever non-accredited applications people might install
      on mobile devices with re-ECN support in the operating system.
      This would force even independent mobile device manufacturers to
      provide re-ECN support. Early standardisation across the cellular
      operators is likely, including interconnection agreements with
      penalties for excess downstream congestion.

      We suspect some fixed broadband networks (whether cable or DSL)
      would follow a similar path. However, we also believe that larger
      parts of the fixed Internet would not choose to police on a
      per-flow basis. Some might choose to police congestion on a
      per-user basis in order to manage heavy peer-to-peer file-sharing,
      but it seems likely that a sizeable majority would not deploy any
      form of policing.

      This hybrid situation raises the question, "How does re-ECN work
      for networks that choose to use policing if they connect with
      others that don't?" Traffic from non-ECN capable sources will
      arrive from other networks and cause congestion within the
      policed, ECN-capable networks. So networks that chose to police
      congestion would rate-limit Not-ECT traffic throughout their
      network, particularly at their borders. They would probably also
      set higher usage prices in their interconnection contracts for
      incoming Not-ECT and Not-RECT traffic. We assume that
      interconnection contracts between networks in the same tier will
      include congestion penalties before contracts with provider
      backbones do.

      A hybrid situation could remain for all time. As was explained in
      the introduction, we believe in healthy competition between
      policing and not policing, with no imperative to convert the whole
      world to the religion of policing.
      Networks that chose not to deploy egress droppers would leave
      themselves open to being congested by senders in other networks.
      But that would be their choice.

      The important aspect of the egress dropper though is that it most
      protects the network that deploys it. If a network does not
      deploy an egress dropper, sources sending into it from other
      networks will be able to understate the congestion they are
      causing. Whereas, if a network deploys an egress dropper, it can
      know how much congestion other networks are dumping into it, and
      apply penalties or charges accordingly. So, whether or not a
      network polices its own sources at ingress, it is in its interests
      to deploy an egress dropper.

   Host support:

      In the above deployment scenario, host operating system support
      for re-ECN came about through the cellular operators demanding it
      in device standards (i.e. 3GPP). Of course, increasingly, mobile
      devices are being built to support multiple wireless technologies.
      So, if re-ECN were stipulated for cellular devices, it would
      automatically appear in those devices connected to the wireless
      fringes of fixed networks if they coupled cellular with WiFi or
      Bluetooth technology, for instance. Also, once implemented in the
      operating system of one mobile device, it would tend to be found
      in other devices using the same family of operating system.

      Therefore, whether or not a fixed network deployed ECN, or
      deployed re-ECN policers and droppers, many of its hosts might
      well be using re-ECN over it. Indeed, they would be at an
      advantage when communicating with hosts across Re-ECN policed
      networks that rate limited Not-RECT traffic.

   Other possible scenarios:

      The above is thankfully not the only plausible scenario we can
      think of.
      One of the many clubs of operators that meet regularly around the
      world might decide to act together to persuade a major operating
      system manufacturer to implement re-ECN. And they may agree
      between them on an interconnection model that includes congestion
      penalties.

      Re-ECN provides an interesting opportunity for device
      manufacturers as well as network operators. Policers can be
      configured loosely when first deployed. Then as re-ECN take-up
      increases, they can be tightened up, so that a network with re-ECN
      deployed can gradually squeeze down the service provided to legacy
      devices that have not upgraded to re-ECN. Many device vendors
      rely on replacement sales. And operating system companies rely
      heavily on new release sales. Also support services would like to
      be able to force stragglers to upgrade. So, the ability to
      throttle service to legacy operating systems is quite valuable.

      Also, policing unresponsive sources may not be the only or even
      the first application that drives deployment. It may be policing
      causes of heavy congestion (e.g. peer-to-peer file-sharing). Or
      it may be mitigation of denial of service. Or we may be wrong in
      thinking simpler QoS will not be the initial motivation for re-ECN
      deployment. Indeed, the combined pressure for all these may be
      the motivator, but it seems optimistic to expect such a level of
      joined-up thinking from today's communications industry. We
      believe a single application alone must be a sufficient motivator.

      In short, everyone gains from adding accountability to TCP/IP,
      except the selfish or malicious. So, deployment incentives tend
      to be strong.

8. Architectural Rationale

   In the Internet's technical community, the danger of not responding
   to congestion is well-understood, as well as its attendant risk of
   congestion collapse [RFC3714].
   However, one side of the Internet's commercial community considers
   that the very essence of IP is to provide open access to the
   internetwork for all applications. They see congestion as a symptom
   of over-conservative investment, and rely on revising application
   designs to find novel ways to keep applications working despite
   congestion. They argue that the Internet was never intended to be
   solely for TCP-friendly applications. Meanwhile, another side of the
   Internet's commercial community believes that it is worthwhile
   providing a network for novel applications only if it has sufficient
   capacity, which can happen only if a greater share of application
   revenues can be /assured/ for the infrastructure provider. Otherwise
   the major investments required would carry too much risk and wouldn't
   happen.

   The lesson articulated in [Tussle] is that we shouldn't embed our
   view on these arguments into the Internet at design time. Instead we
   should design the Internet so that the outcome of these arguments can
   get decided at run-time. Re-ECN is designed in that spirit. Once
   the protocol is available, different network operators can choose how
   liberal they want to be in holding people accountable for the
   congestion they cause. Some might boldly invest in capacity and not
   police its use at all, hoping that novel applications will result.
   Others might use re-ECN for fine-grained flow policing, expecting to
   make money selling vertically integrated services. Yet others might
   sit somewhere half-way, perhaps doing coarse, per-user policing. All
   might change their minds later. But re-ECN always allows them to
   interconnect so that the careful ones can protect themselves from the
   liberal ones.

   The incentive-based approach used for re-ECN is based on Gibbens and
   Kelly's arguments [Evol_cc] on allowing endpoints the freedom to
   evolve new congestion control algorithms for new applications. They
   ensured responsible behaviour despite everyone's self-interest by
   applying pricing to ECN marking, and Kelly had proved stability and
   optimality in an earlier paper.

   Re-ECN keeps all the underlying economic incentives, but rearranges
   the feedback. The idea is to allow a network operator (if it
   chooses) to deploy engineering mechanisms like policers at the front
   of the network which can be designed to behave /as if/ they are
   responding to congestion prices. Rather than having to subject users
   to congestion pricing, networks can then use more traditional
   charging regimes (or novel ones). But the engineering can constrain
   the overall amount of congestion a user can cause. This provides a
   buffer against completely outrageous congestion control, but still
   makes it easy for novel applications to evolve if they need different
   congestion control to the norms. It also allows novel charging
   regimes to evolve.

   Despite being achieved with a relatively minor protocol change,
   re-ECN is an architectural change. Previously, Internet congestion
   could only be controlled by the data sender, because it was the only
   one both in a position to control the load and in a position to see
   information on congestion. Re-ECN levels the playing field. It
   recognises that the network also has a role to play in moderating
   (policing) congestion control. But policing is only truly effective
   at the first ingress into an internetwork, whereas path congestion
   was previously only visible at the last egress. So, re-ECN
   democratises congestion information.
   Then the choice over who actually controls congestion can be made at
   run-time, not design time---a bit like an aircraft with dual
   controls. And different operators can make different choices. We
   believe non-architectural approaches to this problem are unlikely to
   offer more than partial solutions (see Section 9).

   Importantly, re-ECN does NOT require assumptions about specific
   congestion responses to be embedded in any network elements, except
   at the first ingress to the internetwork if that level of control is
   desired by the ingress operator. But such tight policing will be a
   matter of agreement between the source and its access network
   operator. The ingress operator need not police congestion response
   at flow granularity; it can simply hold a source responsible for the
   aggregate congestion it causes, perhaps keeping it within a monthly
   congestion quota. Or if the ingress network trusts the source, it
   can do nothing.

   Therefore, the aim of the re-ECN protocol is NOT solely to police
   TCP-friendliness. Re-ECN preserves IP as a generic network layer for
   all sorts of responses to congestion, for all sorts of transports.
   Re-ECN merely ensures truthful downstream congestion information is
   available in the network layer for all sorts of accountability
   applications.

   The end to end design principle does not say that all functions
   should be moved out of the lower layers---only those functions that
   are not generic to all higher layers. Re-ECN adds a function to the
   network layer that is generic, but was omitted: accountability for
   causing congestion. Accountability is not something that an end-user
   can provide to themselves. We believe re-ECN adds no more than is
   sufficient to hold each flow accountable, even if it consists of a
   single datagram.

   "Accountability" implies being able to identify who is responsible
   for causing congestion. However, at the network layer it would NOT
   be useful to identify the cause of congestion by adding individual or
   organisational identity information, NOR by using source IP
   addresses. Rather than bringing identity information to the point of
   congestion, we bring downstream congestion information to the point
   where the cause can be most easily identified and dealt with. That
   is, at any trust boundary congestion can be associated with the
   physically connected upstream neighbour that is directly responsible
   for causing it (whether intentionally or not). A trust boundary
   interface is exactly the place to police or throttle in order to
   directly mitigate congestion, rather than having to trace the
   (ir)responsible party in order to shut them down.

   Some considered that ECN itself was a layering violation. The
   reasoning went that the interface to a layer should provide a service
   to the higher layer and hide how the lower layer does it. However,
   ECN reveals the state of the network layer and below to the transport
   layer. A more positive way to describe ECN is that it is like the
   return value of a function call to the network layer. It explicitly
   returns the status of the request to deliver a packet, by returning a
   value representing the current risk that a packet will not be served.
   Re-ECN has similar semantics, except the transport layer must try to
   guess the return value, then it can use the actual return value from
   the network layer to modify the next guess.

   The guiding principle behind all the discussion in Section 6.1.6 on
   Policing is that any gain from subverting the protocol should be
   precisely neutralised, rather than punished.
   If a gain is punished to a greater extent than is sufficient to
   neutralise it, it will most likely open up a new vulnerability, where
   the amplifying effect of the punishment mechanism can be turned on
   others.

   For instance, if possible, flows should be removed as soon as they go
   negative, but we do NOT RECOMMEND any attempts to discard such flows
   further upstream while they are still positive. Such over-zealous
   push-back is unnecessary and potentially dangerous. These flows have
   paid their `fare' up to the point they go negative, so there is no
   harm in delivering them that far. If someone downstream asks for a
   flow to be dropped as near to the source as possible, because they
   say it is going to become negative later, an upstream node cannot
   test the truth of this assertion. Rather than have to authenticate
   such messages, re-ECN has been designed so that flows can be dropped
   solely based on locally measurable evidence. A message hinting that
   a flow should be watched closely to test for negativity is fine. But
   a message claiming that a positive flow will go negative later, and
   so should be dropped, is not.

9. Related Work

   {Due to lack of time, this section is incomplete. The reader is
   referred to the Related Work section of [Re-fb] for a brief selection
   of related ideas.}

9.1. Policing Rate Response to Congestion

   ATM network elements send congestion back-pressure
   messages [ITU-T.I.371] along each connection, duplicating any end to
   end feedback because they don't trust it. On the other hand, re-ECN
   ensures information in forwarded packets can be used for congestion
   management without requiring a connection-oriented architecture and
   re-using the overhead of fields that are already set aside for end to
   end congestion control (and routing loop detection in the case of
   re-TTL in Appendix F).
2985 We borrowed ideas from policers in the literature [pBox],[XCHOKe], 2986 AFD etc. for our rate equation policer. However, without the benefit 2987 of re-ECN they don't police the correct rate for the condition of 2988 their path. They detect unusually high /absolute/ rates, but only 2989 while the policer itself is congested, because they work by detecting 2990 prevalent flows in the discards from the local RED queue. These 2991 policers must sit at every potential bottleneck, whereas our policer 2992 need only be located at each ingress to the internetwork. As Floyd & 2993 Fall explain [pBox], the limitation of their approach is that a high 2994 sending rate might be perfectly legitimate, if the rest of the path 2995 is uncongested or the round trip time is short. Commercially 2996 available rate policers cap the rate of any one flow. Or they 2997 enforce monthly volume caps in an attempt to control high volume 2998 file-sharing. They limit the value a customer derives. They might 2999 also limit the congestion customers can cause, but only as an 3000 accidental side-effect. They actually punish traffic that fills 3001 troughs as much as traffic that causes peaks in utilisation. In 3002 practice network operators need to be able to allocate service by 3003 cost during congestion, and by value at other times. 3005 9.2. Congestion Notification Integrity 3007 The choice of two ECT code-points in the ECN field [RFC3168] 3008 permitted future flexibility, optionally allowing the sender to 3009 encode the experimental ECN nonce [RFC3540] in the packet stream. 3010 This mechanism has since been included in the specifications of DCCP 3011 [RFC4340]. 3013 The ECN nonce is an elegant scheme that allows the sender to detect 3014 if someone in the feedback loop - the receiver especially - tries to 3015 claim no congestion was experienced when in fact congestion led to 3016 packet drops or ECN marks. 
For each packet it sends, the sender 3017 chooses between the two ECT codepoints in a pseudo-random sequence. 3018 Then, whenever the network marks a packet with CE, if the receiver 3019 wants to deny congestion happened, she has to guess which ECT 3020 codepoint was overwritten. She has only a 50:50 chance of being 3021 correct each time she denies a congestion mark or a drop, which 3022 ultimately will give her away. 3024 The purpose of a network-layer nonce should primarily be protection 3025 of the network, while a transport-layer nonce would be better used to 3026 protect the sender from cheating receivers. Now, the assumption 3027 behind the ECN nonce is that a sender will want to detect whether a 3028 receiver is suppressing congestion feedback. This is only true if 3029 the sender's interests are aligned with the network's, or with the 3030 community of users as a whole. This may be true for certain large 3031 senders, who are under close scrutiny and have a reputation to 3032 maintain. But we have to deal with a more hostile world, where 3033 traffic may be dominated by peer-to-peer transfers, rather than 3034 downloads from a few popular sites. Often the `natural' self- 3035 interest of a sender is not aligned with the interests of other 3036 users. It often wishes to transfer data quickly to the receiver as 3037 much as the receiver wants the data quickly. 3039 In contrast, the re-ECN protocol enables policing of an agreed rate- 3040 response to congestion (e.g. TCP-friendliness) at the sender's 3041 interface with the internetwork. It also ensures downstream networks 3042 can police their upstream neighbours, to encourage them to police 3043 their users in turn. But most importantly, it requires the sender to 3044 declare path congestion to the network and it can remove traffic at 3045 the egress if this declaration is dishonest. 
So it can police correctly, irrespective of whether the receiver
tries to suppress congestion feedback or whether the sender ignores
genuine congestion feedback. Therefore the re-ECN protocol addresses
a much wider range of cheating problems, which includes the one
addressed by the ECN nonce.

9.3. Identifying Upstream and Downstream Congestion

Purple [Purple] proposes that routers should use the CWR flag in the
TCP header of ECN-capable flows to work out path congestion and
therefore downstream congestion in a similar way to re-ECN. However,
because CWR is in the transport layer, it is not always visible to
network layer routers and policers. Purple's motivation was to
improve AQM, not policing. But, of course, nodes trying to avoid a
policer would not be expected to allow CWR to be visible.

10. Security Considerations

This whole memo concerns the deployment of a secure congestion
control framework. However, below we list some specific security
issues that we are still working on:

o  Malicious users are able to launch dynamically changing attacks,
   exploiting the time it takes to detect an attack, given ECN
   marking is binary. We are concentrating on subtle interactions
   between the ingress policer and the egress dropper in an effort to
   make it impossible to game the system.

o  There is an inherent need for at least some flow state at the
   egress dropper given the binary marking environment, which leads
   to an apparent vulnerability to state exhaustion attacks. An
   egress dropper design with bounded flow state is in write-up.

o  A malicious source can spoof another user's address and send
   negative traffic to the same destination in order to fool the
   dropper into sanctioning the other user's flow. To prevent or
   mitigate these two different kinds of DoS attack, against the
   dropper and against given flows, we are considering various
   protection mechanisms. Section 5.5.1 discusses one of these.

o  A malicious client can send requests using a spoofed source
   address to a server (such as a DNS server) that tends to respond
   with single packet responses. This server will then be tricked
   into having to set FNE on the first (and only) packet of all these
   wasted responses. Given packets marked FNE are worth +1, this
   will cause such servers to consume more of their allowance to
   cause congestion than they would wish to. In general, re-ECN is
   deliberately designed so that single packet flows have to bear the
   cost of not discovering the congestion state of their path. One
   of the reasons for introducing re-ECN is to encourage short flows
   to make use of previous path knowledge by moving the cost of this
   lack of knowledge to sources that create short flows. Therefore,
   in the long run we might expect services like DNS to aggregate
   single packet flows into connections where it brings benefits.
   However, this attack where DNS requests are made from spoofed
   addresses genuinely forces the server to waste its resources. The
   only mitigating feature is that the attacker has to set FNE on
   each of its requests if they are to get through an egress dropper
   to a DNS server. The attacker therefore has to consume as many
   resources as the victim, which at least implies re-ECN does not
   unwittingly amplify this attack.

Having highlighted outstanding security issues, we now explain the
design decisions that were taken based on a security-related
rationale. It may seem that the six codepoints of the eight made
available by extending the ECN field with the RE flag have been used
rather wastefully to encode just five states.
In effect the RE flag 3113 has been used as an orthogonal single bit, using up four codepoints 3114 to encode the three states of positive, neutral and negative worth. 3115 The mapping of the codepoints in an earlier version of this proposal 3116 used the codepoint space more efficiently, but the scheme became 3117 vulnerable to network operators bypassing congestion penalties by 3118 focusing congestion marking on positive packets. Appendix B explains 3119 why fixing that problem while allowing for incremental deployment, 3120 would have used another codepoint anyway. So it was better to use 3121 this orthogonal encoding scheme, which greatly simplified the whole 3122 protocol and brought with it some subtle security benefits (see the 3123 last paragraph of Appendix B). 3125 With the scheme as now proposed, once the RE flag is set or cleared 3126 by the sender or its proxy, it should not be written by the network, 3127 only read. So the endpoints can detect if any network maliciously 3128 alters the RE flag. IPSec AH integrity checking does not cover the 3129 IPv4 option flags (they were considered mutable---even the one we 3130 propose using for the RE flag that was `currently unused' when IPSec 3131 was defined). But it would be sufficient for a pair of endpoints to 3132 make random checks on whether the RE flag was the same when it 3133 reached the egress as when it left the ingress. Indeed, if IPSec AH 3134 had covered the RE flag, any network intending to alter sufficient RE 3135 flags to make a gain would have focused its alterations on packets 3136 without authenticating headers (AHs). 3138 The security of re-ECN has been deliberately designed to not rely on 3139 cryptography. 3141 11. IANA Considerations 3143 This memo includes no request to IANA (yet). 

If this memo were to progress to standards track, it would list:

o  The new RE flag in IPv4 (Section 5.1) and its extension with the
   ECN field to create a new set of extended ECN (EECN) codepoints;

o  The definition of the EECN codepoints for default Diffserv PHBs
   (Section 3.2);

o  The new extension header for IPv6 (Section 5.2);

o  The new combinations of flags in the TCP header for capability
   negotiation (Section 4.1.3);

o  The new ICMP message type (Section 5.5.1).

12. Conclusions

{ToDo:}

13. Acknowledgements

Sebastien Cazalet and Andrea Soppera contributed to the idea of
re-feedback. All the following have given helpful comments: Andrea
Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
(ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
Handley (who developed the attack with canceled packets), Adam
Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
(Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
complemented our own dummy traffic attacks with others), Liz Maida
(MIT), and comments from participants in the CRN/CFP Broadband and
DoS-resistant Internet working groups.

14. Comments Solicited

Comments and questions are encouraged and very welcome. They can be
addressed to the IETF Transport Area working group's mailing list,
and/or to the authors.

15. References

15.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.
3193 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 3194 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 3195 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 3196 S., Wroclawski, J., and L. Zhang, "Recommendations on 3197 Queue Management and Congestion Avoidance in the 3198 Internet", RFC 2309, April 1998. 3200 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 3201 Control", RFC 2581, April 1999. 3203 [RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., 3204 Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., 3205 Zhang, L., and V. Paxson, "Stream Control Transmission 3206 Protocol", RFC 2960, October 2000. 3208 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 3209 of Explicit Congestion Notification (ECN) to IP", 3210 RFC 3168, September 2001. 3212 [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's 3213 Initial Window", RFC 3390, October 2002. 3215 [RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram 3216 Congestion Control Protocol (DCCP)", RFC 4340, March 2006. 3218 [RFC4341] Floyd, S. and E. Kohler, "Profile for Datagram Congestion 3219 Control Protocol (DCCP) Congestion Control ID 2: TCP-like 3220 Congestion Control", RFC 4341, March 2006. 3222 [RFC4342] Floyd, S., Kohler, E., and J. Padhye, "Profile for 3223 Datagram Congestion Control Protocol (DCCP) Congestion 3224 Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342, 3225 March 2006. 3227 15.2. Informative References 3229 [ARI05] Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the 3230 Internet to Support Real-Time Content Supply from a Large 3231 Fraction of Broadband Residential Users", BT Technology 3232 Journal (BTTJ) 23(2), April 2005. 3234 [Bauer06] Bauer, S., Faratin, P., and R. Beverly, "Assessing the 3235 assumptions underlying mechanism design for the Internet", 3236 Proc. Workshop on the Economics of Networked Systems 3237 (NetEcon06) , June 2006, . 
3240 [CLoop_pol] 3241 Salvatori, A., "Closed Loop Traffic Policing", Politecnico 3242 Torino and Institut Eurecom Masters Thesis , 3243 September 2005. 3245 [ECN-Deploy] 3246 Floyd, S., "ECN (Explicit Congestion Notification) in 3247 TCP/IP; Implementation and Deployment of ECN", Web-page , 3248 May 2004, 3249 . 3251 [ECN-MPLS] 3252 Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion 3253 Marking in MPLS", draft-ietf-tsvwg-ecn-mpls-01 (work in 3254 progress), June 2007. 3256 [ECN-tunnel] 3257 Briscoe, B., "Layered Encapsulation of Congestion 3258 Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in 3259 progress), July 2007. 3261 [Evol_cc] Gibbens, R. and F. Kelly, "Resource pricing and the 3262 evolution of congestion control", Automatica 35(12)1969-- 3263 1985, December 1999, 3264 . 3266 [I-D.ietf-tcpm-ecnsyn] 3267 Kuzmanovic, A., "Adding Explicit Congestion Notification 3268 (ECN) Capability to TCP's SYN/ACK Packets", 3269 draft-ietf-tcpm-ecnsyn-01 (work in progress), 3270 October 2006. 3272 [I-D.moncaster-tcpm-rcv-cheat] 3273 Moncaster, T., "A TCP Test to Allow Senders to Identify 3274 Receiver Non-Compliance", 3275 draft-moncaster-tcpm-rcv-cheat-01 (work in progress), 3276 June 2007. 3278 [ITU-T.I.371] 3279 ITU-T, "Traffic Control and Congestion Control in 3280 {B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004. 3282 [Jiang02] Jiang, H. and D. Dovrolis, "The Macroscopic Behavior of 3283 the TCP Congestion Avoidance Algorithm", ACM SIGCOMM 3284 CCR 32(3)75-88, July 2002, 3285 . 3287 [Mathis97] 3288 Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The 3289 Macroscopic Behavior of the TCP Congestion Avoidance 3290 Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997, 3291 . 3293 [PCN-arch] 3294 Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., 3295 Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion 3296 Notification Architecture", 3297 draft-eardley-pcn-architecture-00 (work in progress), 3298 June 2007. 
3300 [Purple] Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: 3301 Predictive Active Queue Management Utilizing Congestion 3302 Information", Proc. Local Computer Networks (LCN 2003) , 3303 October 2003. 3305 [RFC2208] Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, 3306 M., Romanow, A., Weinrib, A., and L. Zhang, "Resource 3307 ReSerVation Protocol (RSVP) Version 1 Applicability 3308 Statement Some Guidelines on Deployment", RFC 2208, 3309 September 1997. 3311 [RFC2402] Kent, S. and R. Atkinson, "IP Authentication Header", 3312 RFC 2402, November 1998. 3314 [RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security 3315 Payload (ESP)", RFC 2406, November 1998. 3317 [RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., 3318 and W. Weiss, "An Architecture for Differentiated 3319 Services", RFC 2475, December 1998. 3321 [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission 3322 Timer", RFC 2988, November 2000. 3324 [RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", 3325 RFC 3124, June 2001. 3327 [RFC3514] Bellovin, S., "The Security Flag in the IPv4 Header", 3328 RFC 3514, April 2003. 3330 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit 3331 Congestion Notification (ECN) Signaling with Nonces", 3332 RFC 3540, June 2003. 3334 [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion 3335 Control for Voice Traffic in the Internet", RFC 3714, 3336 March 2004. 3338 [RFC4301] Kent, S. and K. Seo, "Security Architecture for the 3339 Internet Protocol", RFC 4301, December 2005. 3341 [Re-PCN] Briscoe, B., "Emulating Border Flow Policing using Re-ECN 3342 on Bulk Data", draft-briscoe-tsvwg-re-ecn-border-cheat-01 3343 (work in progress), March 2006. 3345 [Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., 3346 Salvatori, A., Soppera, A., and M. Koyabe, "Policing 3347 Congestion Response in an Internetwork Using Re-Feedback", 3348 ACM SIGCOMM CCR 35(4)277--288, August 2005, . 
3352 [Savage99] 3353 Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 3354 "TCP congestion control with a misbehaving receiver", ACM 3355 SIGCOMM CCR 29(5), October 1999, 3356 . 3358 [Smart_rtg] 3359 Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang, 3360 "Optimizing Cost and Performance for Multihoming", ACM 3361 SIGCOMM CCR 34(4)79--92, October 2004, 3362 . 3364 [Steps_DoS] 3365 Handley, M. and A. Greenhalgh, "Steps towards a DoS- 3366 resistant Internet Architecture", Proc. ACM SIGCOMM 3367 workshop on Future directions in network architecture 3368 (FDNA'04) pp 49--56, August 2004. 3370 [Tussle] Clark, D., Sollins, K., Wroclawski, J., and R. Braden, 3371 "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM 3372 SIGCOMM CCR 32(4)347--356, October 2002, 3373 . 3376 [XCHOKe] Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A., 3377 Saran, H., and R. Shorey, "XCHOKe: Malicious Source 3378 Control for Congestion Avoidance at Internet Gateways", 3379 Proceedings of IEEE International Conference on Network 3380 Protocols (ICNP-02) , November 2002, 3381 . 3383 [pBox] Floyd, S. and K. Fall, "Promoting the Use of End-to-End 3384 Congestion Control in the Internet", IEEE/ACM Transactions 3385 on Networking 7(4) 458--472, August 1999, 3386 . 3388 Appendix A. Precise Re-ECN Protocol Operation 3390 {ToDo: fix this} 3392 The protocol operation described in Section 3.3 was an approximation. 3393 In fact, standard ECN router marking combines 1% and 2% marking into 3394 slightly less than 3% whole-path marking, because routers 3395 deliberately mark CE whether or not it has already been marked by 3396 another router upstream. So the combined marking fraction would 3397 actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. 3399 To generalise this we will need some notation. 3401 o j represents the index of each resource (typically queues) along a 3402 path, ranging from 0 at the first router to n-1 at the last. 

o  m_j represents the fraction of octets *m*arked CE by a particular
   router (whether or not they are already marked) because of
   congestion of resource j.

o  u_j represents congestion *u*pstream of resource j, being the
   fraction of CE marking in arriving packet headers (before
   marking).

o  p_j represents *p*ath congestion, being the fraction of packets
   arriving at resource j with the RE flag blanked (excluding
   Not-RECT packets).

o  v_j denotes expected congestion downstream of resource j, which
   can be thought of as a *v*irtual marking fraction, being derived
   from two other marking fractions.

Observed fractions of each particular codepoint (u, p and v) and
router marking rate m are dimensionless fractions, being the ratio of
two data volumes (marked and total) over a monitoring period. All
measurements are in terms of octets, not packets, assuming that line
resources are more congestible than packet processing.

The path congestion (RE blanking fraction) set by the sender should
reflect the upstream congestion (CE marking fraction) fed back from
the destination. Therefore in the steady state

   p_0 = u_n
       = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

Similarly, at some point j in the middle of the network, if
p = 1 - (1 - u_j)(1 - v_j), then

   v_j  = 1 - (1 - p)/(1 - u_j)
       ~= p - u_j;   if u_j << 100%

So, between the two routers in the example in Section 3.3, congestion
downstream is

   v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
       = 2.00%,

or a useful approximation of downstream congestion is

   v_1 ~= 2.98% - 1.00%
       ~= 1.98%.

Appendix B. Justification for Two Codepoints Signifying Zero Worth
Packets

It may seem a waste of a codepoint to set aside two codepoints of the
Extended ECN field to signify zero worth (RECT and CE(0) are both
worth zero). The justification is subtle, but worth recording.
3458 The original version of re-ECN ([Re-fb] and draft-00 of this memo) 3459 used three codepoints for neutral (ECT(1)), positive (ECT(0)) and 3460 negative (CE) packets. The sender set packets to neutral unless re- 3461 echoing congestion, when it set them positive, in much the same way 3462 that it blanks the RE flag in the current protocol. However, routers 3463 were meant to mark congestion by setting packets negative (CE) 3464 irrespective of whether they had previously been neutral or positive. 3466 However, we did not arrange for senders to remember which packet had 3467 been sent with which codepoint, or for feedback to say exactly which 3468 packets arrived with which codepoints. The transport was meant to 3469 inflate the number of positive packets it sent to allow for a few 3470 being wiped out by congestion marking. We (wrongly) assumed that 3471 routers would congestion mark packets indiscriminately, so the 3472 transport could infer how many positive packets had been marked and 3473 compensate accordingly by re-echoing. But this created a perverse 3474 incentive for routers to preferentially congestion mark positive 3475 packets rather than neutral ones. 3477 We could have removed this perverse incentive by requiring re-ECN 3478 senders to remember which packets they had sent with which codepoint. 3479 And for feedback from the receiver to identify which packets arrived 3480 as which. Then, if a positive packet was congestion marked to 3481 negative, the sender could have re-echoed twice to maintain the 3482 balance between positive and negative at the receiver. 3484 Instead, we chose to make re-echoing congestion (blanking RE) 3485 orthogonal to congestion notification (marking CE), which required a 3486 second neutral codepoint (the orthogonal scheme forms the main square 3487 of four codepoints in Figure 2). 
Then the receiver would be able to 3488 detect and echo a congestion event even if it arrived on a packet 3489 that had originally been positive. 3491 If we had added extra complexity to the sender and receiver 3492 transports to track changes to individual packets, we could have made 3493 it work, but then routers would have had an incentive to mark 3494 positive packets with half the probability of neutral packets. That 3495 in turn would have led router algorithms to become more complex. 3496 Then senders wouldn't know whether a mark had been introduced by a 3497 simple or a complex router algorithm. That in turn would have 3498 required another codepoint to distinguish between legacy ECN and new 3499 re-ECN router marking. 3501 Once the cost of IP header codepoint real-estate was the same for 3502 both schemes, there was no doubt that the simpler option for 3503 endpoints and for routers should be chosen. The resulting protocol 3504 also no longer needed the tricky inflation/deflation complexity of 3505 the original (broken) scheme. It was also much simpler to understand 3506 conceptually. 3508 A further advantage of the new orthogonal four-codepoint scheme was 3509 that senders owned sole rights to change the RE flag and routers 3510 owned sole rights to change the ECN field. Although we still arrange 3511 the incentives so neither party strays outside their dominion, these 3512 clear lines of authority simplify the matter. 3514 Finally, a little redundancy can be very powerful in a scheme such as 3515 this. In one flow, the proportion of packets changed to CE should be 3516 the same as the proportion of RECT packets changed to CE(-1) and the 3517 proportion of Re-Echo packets changed to CE(0). Double checking 3518 using such redundant relationships can improve the security of a 3519 scheme (cf. double-entry book-keeping or the ECN Nonce). 
Alternatively, it might be necessary to exploit the redundancy in the
future to encode an extra information channel.

Appendix C. ECN Compatibility

The rationale for choosing the particular combinations of SYN and SYN
ACK flags in Section 4.1.3 is as follows.

Choice of SYN flags:  A re-ECN sender can work with vanilla ECN
   receivers so we wanted to use the same flags as would be used in
   an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same time,
   we wanted a server (host B) that is Re-ECT to be able to recognise
   that the client (A) is also Re-ECT. We believe also setting NS=1
   in the initial SYN achieves both these objectives, as it should be
   ignored by vanilla ECT receivers and by ECT-Nonce receivers. But
   senders that are not Re-ECT should not set NS=1. At the time ECN
   was defined, the NS flag was not defined, so setting NS=1 should
   be ignored by existing ECT receivers (but testing against
   implementations may yet prove otherwise). The ECN Nonce RFC
   [RFC3540] is silent on what the NS field might be set to in the
   TCP SYN, but we believe the intent was for a nonce client to set
   NS=0 in the initial SYN (again only testing will tell). Therefore
   we define a Re-ECN-setup SYN as one with NS=1, CWR=1 & ECE=1.

Choice of SYN ACK flags:  The client (A) needs to be able to
   determine whether the server (B) is Re-ECT. The original ECN
   specification required an ECT server to respond to an ECN-setup
   SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There is no
   room to modify this by setting the NS flag, as that is already set
   in the SYN ACK of an ECT-Nonce server. So we used the only
   combination of CWR and ECE that would not be used by existing TCP
   receivers: CWR=1 and ECE=0. The original ECN specification
   defines this combination as a non-ECN-setup SYN ACK, which remains
   true for vanilla and Nonce ECTs. But for re-ECN we define it as a
   Re-ECN-setup SYN ACK. We didn't use a SYN ACK with both CWR and
   ECE cleared to 0 because that would be the likely response from
   most Not-ECT receivers. And we didn't use a SYN ACK with both CWR
   and ECE set to 1 either, as at least one broken receiver
   implementation echoes whatever flags were in the SYN into its SYN
   ACK. Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1
   & ECE=0.

Choice of two alternative SYN ACKs:  The NS flag may take either
   value in a Re-ECN-setup SYN ACK. Section 5.4 requires a Re-ECT
   server to set the NS flag to 1 in a Re-ECN-setup SYN ACK to echo
   congestion experienced (CE) on the initial SYN. Otherwise a
   Re-ECN-setup SYN ACK MUST be returned with NS=0. The only current
   known use of the NS flag in a SYN ACK is to indicate support for
   the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1.
   Given the ECN nonce MUST NOT be used for a RECN mode connection, a
   Re-ECN-setup SYN ACK can use either setting of the NS flag without
   any risk of confusion, because the CWR & ECE flags will be
   reversed relative to those used by an ECN nonce SYN ACK.

Appendix D. Packet Marking During Flow Start

{ToDo: Write up proof that sender should mark FNE on first and third
data packets, even with the largest allowed initial window.}

Appendix E. Example Egress Dropper Algorithm

{ToDo: Write up the basic algorithm with flow state, then the
aggregated one.}

Appendix F. Re-TTL

This Appendix gives an overview of a proposal to overload the TTL
field in the IP header to monitor downstream propagation delay. This
is included to show that it would be possible to take account of RTT
if it were deemed desirable.

Delay re-feedback can be achieved by overloading the TTL field,
without changing IP or router TTL processing.
A target value for TTL 3594 at the destination would need standardising, say 16. If the path hop 3595 count increased by more than 16 during a routing change, it would 3596 temporarily be mistaken for a routing loop, so this target would need 3597 to be chosen to exceed typical hop count increases. The TCP wire 3598 protocol and handlers would need modifying to feed back the 3599 destination TTL and initialise it. It would be necessary to 3600 standardise the unit of TTL in terms of real time (as was the 3601 original intent in the early days of the Internet). 3603 In the longer term, precision could be improved if routers 3604 decremented TTL to represent exact propagation delay to the next 3605 router. That is, for a router to decrement TTL by, say, 1.8 time 3606 units it would alternate the decrement of every packet between 1 & 2 3607 at a ratio of 1:4. Although this might sometimes require a seemingly 3608 dangerous null decrement, a packet in a loop would still decrement to 3609 zero after 255 time units on average. As more routers were upgraded 3610 to this more accurate TTL decrement, path delay estimates would 3611 become increasingly accurate despite the presence of some legacy 3612 routers that continued to always decrement the TTL by 1. 3614 Appendix G. Policer Designs to ensure Congestion Responsiveness 3616 G.1. Per-user Policing 3618 User policing requires a policer on the ingress interface of the 3619 access router associated with the user. At that point, the traffic 3620 of the user hasn't diverged on different routes yet; nor has it mixed 3621 with traffic from other sources. 
3623 In order to ensure that a user doesn't generate more congestion in 3624 the network than her due share, a modified bulk token-bucket is 3625 maintained with the following parameter: 3627 o b_0 the initial token level 3629 o r the filling rate 3631 o b_max the bucket depth 3633 The same token bucket algorithm is used as in many areas of 3634 networking, but how it is used is very different: 3636 o all traffic from a user over the lifetime of their subscription is 3637 policed in the same token bucket. 3639 o only positive and canceled packets (Re-Echo, FNE and CE(0)) 3640 consume tokens 3642 Such a policer will allow network operators to throttle the 3643 contribution of their users to network congestion. This will require 3644 the appropriate contractual terms to be in place between operators 3645 and users. For instance: a condition for a user to subscribe to a 3646 given network service may be that she should not cause more than a 3647 volume C_user of congestion over a reference period T_user, although 3648 she may carry forward up to N_user times her allowance at the end of 3649 each period. These terms directly set the parameter of the user 3650 policer: 3652 o b_0 = C_user 3654 o r = C_user/T_user 3656 o b_max = b_0 * (N_user +1) 3658 Besides the congestion budget policer above, another user policer may 3659 be necessary to further rate-limit FNE packets, if they are to be 3660 marked rather than dropped (see discussion in Section 5.3.). Rate- 3661 limiting FNE packets will prevent high bursts of new flow arrivals, 3662 which is a very useful feature in DoS prevention. A condition to 3663 subscribe to a given network service would have to be that a user 3664 should not generate more than C_FNE FNE packets, over a reference 3665 period T_FNE, with no option to carry forward any of the allowance at 3666 the end of each period. 
   These terms directly set the parameters of the FNE policer:

   o  b_0 = C_FNE

   o  r = C_FNE/T_FNE

   o  b_max = b_0

   T_FNE should be a much shorter period than T_user: for instance
   T_FNE could be of the order of minutes while T_user could be of the
   order of weeks.

G.2.  Per-flow Rate Policing

   Whilst we believe that simple per-user policing would be sufficient
   to ensure senders comply with congestion control, some operators may
   wish to police the rate response of each flow to congestion as well.
   Although we do not believe this will be necessary, we include this
   section to show how one could perform per-flow policing, using
   enforcement of TCP-fairness as an example.  Per-flow policing aims
   to enforce congestion responsiveness on the shortest information
   timescale on a network path: packet roundtrips.

   This again requires that the appropriate terms be agreed between a
   network operator and its users, where a congestion responsiveness
   policy might be required for the use of a given network service
   (perhaps unless the user specifically requests otherwise).

   As an example, we describe below how a rate adaptation policer can
   be designed when the applicable rate adaptation policy is
   TCP-compliance.  In that context, the average throughput of a flow
   will be expected to be bounded by the value of the TCP throughput
   during congestion avoidance, given in Mathis' formula [Mathis97]

      x_TCP = k * s / ( T * sqrt(m) )

   where:

   o  x_TCP is the throughput of the TCP flow in bytes per second (so
      that x_TCP/s is its rate in packets per second),

   o  k is a constant upper-bounded by sqrt(3/2),

   o  s is the average packet size of the flow in bytes,

   o  T is the roundtrip time of the flow,

   o  m is the congestion level experienced by the flow.

   We define the marking period N=1/m which represents the average
   number of packets between two positive or canceled packets.
   Mathis' formula can be re-written as:

      x_TCP = k*s*sqrt(N)/T

   We can then get the average inter-mark time in a compliant TCP flow,
   dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

      dt_TCP = sqrt(N)*T/k

   We rely on this equation for the design of a rate-adaptation policer
   as a variation of a token bucket.  In that case a policer has to be
   set up for each policed flow.  This may be triggered by FNE packets,
   with the remainder of flows being all rate limited together if they
   do not start with an FNE packet.

   Where maintaining per flow state is not a problem, for instance on
   some access routers, systematic per-flow policing may be considered.
   Should per-flow state be more constrained, rate adaptation policing
   could be limited to a random sample of flows exhibiting positive or
   canceled packets.

   As in the case of user policing, only positive or canceled packets
   will consume tokens; however, the amount of tokens consumed will
   depend on the congestion signal.

   When a new rate adaptation policer is set up for flow j, the
   following state is created:

   o  a token bucket b_j of depth b_max starting at level b_0

   o  a timestamp t_j = timenow()

   o  a counter N_j = 0

   o  a roundtrip estimate T_j

   o  a filling rate r

   When the policing node forwards a packet of flow j with no Re-Echo:

   o  the counter is incremented: N_j += 1

   When the policing node forwards a packet of flow j carrying a
   congestion mark (CE):

   o  the counter is incremented: N_j += 1

   o  the token level is adjusted:
      b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k

   o  the counter is reset: N_j = 0

   o  the timer is reset: t_j = timenow()

   An implementation example will be given in a later draft that avoids
   having to extract the square root.
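
   The state and update rules above can be sketched in Python as
   follows.  This is only an illustration under stated assumptions, not
   the implementation example deferred to a later draft: it extracts
   the square root directly, it takes the current time as an explicit
   argument instead of calling timenow(), and it caps the token level
   at b_max, which the text implies by naming a bucket depth but does
   not spell out.  The names FlowPolicer, on_unmarked, on_marked and
   run_flow, and all numeric values, are hypothetical.  "Marked" stands
   for whichever packets the policer treats as carrying the congestion
   signal.

```python
import math


class FlowPolicer:
    """Per-flow rate adaptation policer state (illustrative sketch)."""

    def __init__(self, now, rtt, b_0, b_max, r=1.0, k=math.sqrt(1.5)):
        self.b = b_0          # token level b_j, starting at b_0
        self.b_max = b_max    # bucket depth
        self.t = now          # timestamp t_j of the last marked packet
        self.n = 0            # counter N_j of packets since the last mark
        self.rtt = rtt        # roundtrip estimate T_j, in seconds
        self.r = r            # filling rate r, in tokens per second
        self.k = k            # constant k from Mathis' formula

    def on_unmarked(self):
        """Forward a packet that carries no congestion signal."""
        self.n += 1

    def on_marked(self, now):
        """Forward a packet that carries the congestion signal.

        Returns True while the token level is non-negative.
        """
        self.n += 1
        fill = self.r * (now - self.t)
        cost = math.sqrt(self.n) * self.rtt / self.k
        # Assumption: the level is capped at the bucket depth b_max.
        self.b = min(self.b_max, self.b + fill - cost)
        self.n = 0
        self.t = now
        return self.b >= 0.0


def run_flow(policer, marks, n_per_mark, dt):
    """Feed `marks` marking periods of n_per_mark packets, dt seconds each."""
    now = policer.t
    for _ in range(marks):
        for _ in range(n_per_mark - 1):
            policer.on_unmarked()
        now += dt
        policer.on_marked(now)
    return policer.b


K = math.sqrt(1.5)
RTT, N = 0.1, 100
DT_TCP = math.sqrt(N) * RTT / K    # inter-mark time of a compliant flow

compliant = FlowPolicer(now=0.0, rtt=RTT, b_0=5.0, b_max=10.0)
greedy = FlowPolicer(now=0.0, rtt=RTT, b_0=5.0, b_max=10.0)
compliant_level = run_flow(compliant, 20, N, DT_TCP)
greedy_level = run_flow(greedy, 20, N, DT_TCP / 2)   # twice the TCP rate
```

   With r = 1 token/sec, the compliant flow leaves the token level at
   its initial value, whereas the flow sending at twice the compliant
   rate for the same marking period drains the bucket and ends up with
   a negative level.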

   Analysis: For a TCP flow, for r = 1 token/sec, on average,

      r*(timenow()-t_j) - sqrt(N_j)*T_j/k = dt_TCP - sqrt(N)*T/k = 0

   This means that the token level will fluctuate around its initial
   level.  The depth b_max of the bucket sets the timescale on which
   the rate adaptation policy is performed, while the filling rate r
   sets the trade-off between responsiveness and robustness:

   o  the higher b_max, the longer it will take to catch greedy flows

   o  the higher r, the fewer false positives (greedy verdict on
      compliant flows) but the more false negatives (compliant verdict
      on greedy flows)

   This rate adaptation policer requires the availability of a
   roundtrip estimate, which may be obtained for instance from the
   application of re-feedback to the downstream delay (Appendix F) or
   from passive estimation [Jiang02].

   When the bucket of a policer located at the access router (whether
   it is a per-user policer or a per-flow policer) becomes empty, the
   access router SHOULD drop at least all packets causing the token
   level to become negative.  The network operator MAY take further
   sanctions if the token level of the per-flow policers associated
   with a user becomes negative.

Appendix H.  Downstream Congestion Metering Algorithms

H.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream congestion in traffic
   crossing an inter-domain border, an algorithm is needed that
   accumulates the size of positive packets and subtracts the size of
   negative packets.
   We maintain two counters:

      V_b: accumulated congestion volume

      B: total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each re-ECN-capable packet {
       b = readLength(packet)        /* set b to packet size          */
       B += b                        /* accumulate total volume       */
       eecn = readEECN(packet)       /* read the EECN field once      */
       if eecn == Re-Echo || eecn == FNE {
           V_b += b                  /* increment for positive...     */
       } elseif eecn == CE(-1) {
           V_b -= b                  /* ...or decrement for negative  */
       }                             /* ...depending on EECN field    */
   }
   ====================================================================

   At the end of an accounting period this counter V_b represents the
   congestion volume that penalties could be applied to, as described
   in Section 6.1.6.

   For instance, the accumulated volume of congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream congestion level
   of 1% on an accumulated total data volume of B = 500PB.

H.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple
   algorithm above in order to protect against the various attacks from
   persistently negative flows described in Section 6.1.6.  As
   explained in that section, the most important and first step is to
   estimate the contribution of persistently negative flows to the bulk
   volume of downstream pre-congestion and to inflate this bulk volume
   as if these flows weren't there.  The process below has been
   designed to give an unbiased estimate, but it may be possible to
   define other processes that achieve similar ends.
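
   For reference, the simple bulk algorithm of Appendix H.1 that this
   process complements can be sketched in runnable form as follows.
   This is a minimal sketch under assumptions: the EECN enumeration,
   the meter function and the packet sizes are hypothetical, and
   packets are modelled as (size, marking) pairs rather than read from
   an interface.

```python
from enum import Enum


class EECN(Enum):
    """Hypothetical labels for the extended ECN states named in the text."""
    RE_ECHO = "Re-Echo"       # positive
    FNE = "FNE"               # positive
    CE_MINUS_1 = "CE(-1)"     # negative
    OTHER = "other"           # any other re-ECN-capable marking


def meter(packets):
    """Return (V_b, B) for an iterable of (size_in_bytes, EECN) pairs."""
    v_b = 0        # V_b: accumulated congestion volume
    b_total = 0    # B: total data volume
    for size, eecn in packets:
        b_total += size
        if eecn in (EECN.RE_ECHO, EECN.FNE):
            v_b += size     # positive packets add their size
        elif eecn is EECN.CE_MINUS_1:
            v_b -= size     # negative packets subtract their size
    return v_b, b_total


packets = [
    (1500, EECN.OTHER),
    (1500, EECN.RE_ECHO),
    (40, EECN.FNE),
    (1500, EECN.CE_MINUS_1),
    (1500, EECN.OTHER),
]
v_b, b_total = meter(packets)
```

   In this toy trace the Re-Echo packet and the CE(-1) packet cancel
   out, so V_b holds only the 40 bytes of the FNE packet while B holds
   the 6040 bytes of all five packets.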

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be able to
   realistically measure but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should
   be taken at different times during the accounting period, preferably
   covering the whole ID space.  During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative packets, maintaining a separate account for each flow in
   the sample.  Each sample should run a lot longer than the large
   majority of flows, to avoid a bias from missing the starts and ends
   of flows, which tend to be positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the
   sample, and the total of the accounts V_{fI} excluding flows with a
   negative account from the subset I.  Then the weighted mean of all
   these samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix H.1), it can be inflated by this factor
   a_S to get a good unbiased estimate of the volume of downstream
   congestion over the accounting period, a_S * V_b, without being
   polluted by the effect of persistently negative flows.

Appendix I.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop (another router
   or the receiver).

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in
   [RFC3540] (currently experimental).  The specification of DCCP
   [RFC4340] (currently proposed standard) requires that "Each DCCP
   sender SHOULD set ECN Nonces on its packets...".  It also mandates
   as a requirement for all CCID profiles that "Any newly defined
   acknowledgement mechanism MUST include a way to transmit ECN Nonce
   Echoes back to the sender.", therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces
      in its Ack Vectors."

   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The primary function of the ECN nonce is to protect the integrity of
   the information about congestion: ECN marks and packet drops.
   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.
   Similarly, a transport layer nonce would protect against a receiver
   sending early acknowledgements [Savage99].

   If the ECN nonce reveals integrity problems with the information
   about congestion, the sending transport can use that knowledge for
   two functions:

   o  to protect its own resources, by allocating them in proportion to
      the rates that each network path can sustain, based on congestion
      control,

   o  and to protect congested routers in the network, by drastically
      slowing down its connection to the destination with corrupt
      congestion information.

   If the sending transport chooses to act in the interests of
   congested routers, it can reduce its rate if it detects that some
   malicious party in the feedback loop may be suppressing ECN
   feedback.  But this would only be useful to congested routers when
   /all/ senders using them are trusted to act in the interest of the
   congested routers.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.  In that case, the nonce
   allows senders to be assured that they aren't being duped into
   giving more of their own resources to a particular flow.  And if
   congestion suppression is detected, the sending transport can rate
   limit the offending connection to protect its own resources.
   Certainly, this is a useful function, but the IETF should carefully
   decide whether such a single, very specific case warrants IP header
   space.

   In contrast, re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone: senders,
   receivers or neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.  It therefore seems prudent to give time for an
   alternative way to be found to do the one function the nonce is
   essential for.

   Moreover, while we have re-designed the re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.  If the ECN nonce started to see some
   deployment (perhaps because it was blessed with proposed standard
   status), incremental deployment of re-ECN would effectively be
   impossible, because re-ECN marking fractions at inter-domain borders
   would be polluted by unknown levels of nonce traffic.

   The authors are aware that re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of re-ECN so
   that its potential can be assessed.  We therefore seek the opinion
   of the Internet community on whether the re-ECN protocol is
   sufficiently useful to warrant standards action.

Authors' Addresses

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   Email: arnaud.jacquet@bt.com
   URI:

   Alessandro Salvatori
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Email: alessandro.salvatori@gmail.com

   Martin Koyabe
   BT
   PP2a Rigel House, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 646923
   Email: martin.koyabe@bt.com
   URI:

   Toby Moncaster
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 648734
   Email: toby.moncaster@bt.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgments

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).  This document was produced
   using xml2rfc v1.32 (of http://xml.resource.org/) from a source in
   RFC-2629 XML format.