Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Standards Track                              A. Jacquet
Expires: July 13, 2008                                      T. Moncaster
                                                                A. Smith
                                                                      BT
                                                        January 10, 2008

     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-tsvwg-re-ecn-tcp-05

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 13, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   This document introduces a new protocol for explicit congestion
   notification (ECN), termed re-ECN, which can be deployed
   incrementally around unmodified routers. The protocol arranges an
   extended ECN field in each packet so that, as it crosses any
   interface in an internetwork, it will carry a truthful prediction of
   congestion on the remainder of its path. Then the upstream party at
   any trust boundary in the internetwork can be held responsible for
   the congestion they cause, or allow to be caused. So, networks can
   introduce straightforward accountability and policing mechanisms for
   incoming traffic from end-customers or from neighbouring network
   domains. The purpose of this document is to specify the re-ECN
   protocol at the IP layer and to give guidelines on any consequent
   changes required to transport protocols. It includes the changes
   required to TCP both as an example and as a specification. It also
   gives examples of mechanisms that can use the protocol to ensure data
   sources respond correctly to congestion. And it describes example
   mechanisms that ensure the dominant selfish strategy of both network
   domains and end-points will be to set the extended ECN field
   honestly.

Authors' Statement: Status (to be removed by the RFC Editor)

   Although the re-ECN protocol is intended to make a simple but
   far-reaching change to the Internet architecture, the most immediate
   priority for the authors is to delay any move of the ECN nonce to
   Proposed Standard status. The argument for this position is
   developed in Appendix I.

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs created using the rfcdiff tool are available at

   From -04 to -05 (current version):

      Completed justification for packet marking with FNE during
      slow-start (Appendix D).

      Minor editorial changes throughout.

   From -03 to -04:

      Clarified reasons for holding back ECN nonce (Section 3.2 &
      Appendix I).

      Clarified Figure 1.

      Added Section 4.1.1.1 on equivalence of drops and ECN marks.

      Improved precision of Section 5.6 on IP in IP tunnels.

      Explained that RTT fairness is possible to enforce, but unlikely
      to be required (Section 6.1.3 & Appendix F).

      Explained that bulk per-user policing should be adequate but
      per-flow policing is also possible if desired, though it is not
      likely to be necessary (Section 6.1.5 & Appendix G).

      Reinforced need for passive policing at inter-domain borders to
      enable all-optical networking (Section 6.1.6).

      Minor editorial changes throughout.

   From -02 to -03:

      Started guidelines for re-ECN support in DCCP and SCTP.

      Added annex on limitations of nonce mechanism.

      Minor editorial changes throughout.

   From -01 to -02:

      Explanation on informal terminology in Section 3.4 clarified.

      IPv6 wire protocol encoding added (Section 5.2).

      Text on (non-)issues with tunnels, encryption and link layer
      congestion notification added (Section 5.6 & Section 5.7).

      Section added giving evolvability arguments against encouraging
      bottleneck policing (Section 6.1.2). And text on re-ECN's
      evolvability by design added to Section 6.1.3.

      Text on inter-domain policing (Section 6.1.6) and inter-domain
      fail-safes (Section 6.1.7) added.

   From -00 to -01:

      Encoding of re-ECN wire protocol changed for reasons given in
      Appendix B and consequently draft substantially re-written.

      Substantial text added in sections on applications, incremental
      deployment, architectural rationale and security considerations.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Protocol Overview
     3.1.  Background and Applicability
     3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     3.3.  Re-ECN Protocol Operation
     3.4.  Informal Terminology
   4.  Transport Layers
     4.1.  TCP
       4.1.1.  RECN mode: Full re-ECN capable transport
       4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT
               Receiver
       4.1.3.  Capability Negotiation
       4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or
               after Idle Periods
       4.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     4.2.  Other Transports
       4.2.1.  General Guidelines for Adding Re-ECN to Other Transports
       4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
       4.2.3.  Guidelines for adding Re-ECN to DCCP
       4.2.4.  Guidelines for adding Re-ECN to SCTP
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  IP in IP Tunnels
     5.7.  Non-Issues
   6.  Applications
     6.1.  Policing Congestion Response
       6.1.1.  The Policing Problem
       6.1.2.  The Case Against Bottleneck Policing
       6.1.3.  Re-ECN Incentive Framework
       6.1.4.  Egress Dropper
       6.1.5.  Policing
       6.1.6.  Inter-domain Policing
       6.1.7.  Inter-domain Fail-safes
       6.1.8.  Simulations
     6.2.  Other Applications
       6.2.1.  DDoS Mitigation
       6.2.2.  End-to-end QoS
       6.2.3.  Traffic Engineering
       6.2.4.  Inter-Provider Service Monitoring
     6.3.  Limitations
   7.  Incremental Deployment
     7.1.  Incremental Deployment Features
     7.2.  Incremental Deployment Incentives
   8.  Architectural Rationale
   9.  Related Work
     9.1.  Policing Rate Response to Congestion
     9.2.  Congestion Notification Integrity
     9.3.  Identifying Upstream and Downstream Congestion
   10. Security Considerations
   11. IANA Considerations
   12. Conclusions
   13. Acknowledgements
   14. Comments Solicited
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  Justification for Two Codepoints Signifying Zero Worth
                Packets
   Appendix C.  ECN Compatibility
   Appendix D.  Packet Marking with FNE During Flow Start
   Appendix E.  Example Egress Dropper Algorithm
   Appendix F.  Re-TTL
   Appendix G.  Policer Designs to ensure Congestion Responsiveness
     G.1.  Per-user Policing
     G.2.  Per-flow Rate Policing
   Appendix H.  Downstream Congestion Metering Algorithms
     H.1.  Bulk Downstream Congestion Metering Algorithm
     H.2.  Inflation Factor for Persistently Negative Flows
   Appendix I.  Argument for holding back the ECN nonce
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   This document aims:

   o  To provide a complete specification of the addition of the re-ECN
      protocol to IP and guidelines on how to add it to transport layer
      protocols, including a complete specification of re-ECN in TCP as
      an example;

   o  To show how a number of hard problems become much easier to solve
      once re-ECN is available in IP.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it. But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   Uniquely, re-ECN manages to enable solutions to these problems
   without unduly stifling innovative new ways to use the Internet.
   This was a hard balance to strike, given it could be argued that DDoS
   is an innovative way to use the Internet. The most valuable insight
   was to allow each network to choose the level of constraint it wishes
   to impose.
   Also, re-ECN has been carefully designed so that networks
   that choose to use it conservatively can protect themselves against
   the congestion caused in their network by users on other networks
   with more liberal policies.

   For instance, some network owners want to block applications like
   voice and video unless their network is compensated for the extra
   share of bottleneck bandwidth taken. These real-time applications
   tend to be unresponsive when congestion arises, whereas elastic
   TCP-based applications back away quickly and end up taking a much
   smaller share of congested capacity for themselves. Other network
   owners want to invest in large amounts of capacity and make their
   gains from simplicity of operation and economies of scale.

   Re-ECN allows the more conservative networks to police out flows that
   have not asked to be unresponsive to congestion, not because they
   are voice or video, but simply because they do not respond to
   congestion. But it also allows other networks to choose not to
   police. Crucially, when flows from liberal networks cross into a
   conservative network, re-ECN enables the conservative network to
   apply penalties to its neighbouring networks for the congestion they
   allow to be caused. And these penalties can be applied to bulk data,
   without regard to flows.

   Then, if unresponsive applications become so dominant that some of
   the more liberal networks experience congestion collapse [RFC3714],
   they can change their minds and use re-ECN to apply tighter controls
   in order to bring congestion back under control.

   Re-ECN works by arranging that each packet arrives at each network
   element carrying a view of expected congestion on its own downstream
   path, albeit averaged over multiple packets. Most usefully,
   congestion on the remainder of the path becomes visible in the IP
   header at the first ingress. Many of the applications of re-ECN
   involve a policer at this ingress using the view of downstream
   congestion arriving in packets to police or control the packet rate.

   Importantly, the scheme is recursive: a whole network harbouring
   users causing congestion in downstream networks can be held
   responsible or policed by its downstream neighbour.

   This document is structured as follows. First an overview of the
   re-ECN protocol is given (Section 3), outlining its attributes and
   explaining conceptually how it works as a whole. The two main parts
   of the document follow, as described above. That is, the protocol
   specification divided into transport (Section 4) and network
   (Section 5) layers, then the applications it can be put to, such as
   policing DDoS, QoS and congestion control (Section 6). Although
   these applications do not require standardisation themselves, they
   are described in a fair degree of detail in order to explain how
   re-ECN can be used. Given re-ECN proposes to use the last undefined
   bit in the IPv4 header, we felt it necessary to outline the potential
   that re-ECN could release in return for being given that bit.

   Deployment issues discussed throughout the document are brought
   together in Section 7, which is followed by a brief section
   explaining the somewhat subtle rationale for the design from an
   architectural perspective (Section 8). We end by describing related
   work (Section 9), listing security considerations (Section 10) and
   finally drawing conclusions (Section 12).

2. Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   This document first specifies a protocol, then describes a framework
   that creates the right incentives to ensure compliance to the
   protocol.
   This could cause confusion because the second part of the
   document considers many cases where malicious nodes may not comply
   with the protocol. When such contingencies are described, if any of
   the above keywords are not capitalised, that is deliberate. So, for
   instance, the following two apparently contradictory sentences would
   be perfectly consistent: i) x MUST do this; ii) x may not do this.

3. Protocol Overview

3.1. Background and Applicability

   First we briefly recap the essentials of the ECN protocol [RFC3168].
   Two bits in the IP header (v4 or v6) are assigned to the ECN field.
   The sender clears the field to "00" (Not-ECT) if either end-point
   transport is not ECN-capable. Otherwise it indicates an ECN-capable
   transport (ECT) using either of the two code-points "10" or "01"
   (ECT(0) and ECT(1) resp.).

   ECN-capable routers probabilistically set "11" if congestion is
   experienced (CE), the marking probability increasing with the length
   of the queue at the router's egress link (typically using the RED
   algorithm [RFC2309]). However, they still drop rather than mark
   Not-ECT packets. With multiple ECN-capable routers on a path, a flow
   of packets accumulates the fraction of CE marking that each router
   adds. The combined effect of the packet marking of all the routers
   along the path signals congestion of the whole path to the receiver.
   So, for example, if one router early in a path is marking 1% of
   packets and another later in a path is marking 2%, flows that pass
   through both routers will experience approximately 3% marking (see
   Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback. But Section 9.2 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback. On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called `re-feedback'
   [Re-fb]. The word is short for either receiver-aligned, re-inserted
   or re-echoed feedback. But it actually works even when no feedback
   is available. In fact it has been carefully designed to work for
   single datagram flows. It also encourages aggregation of single
   packet flows by congestion control proxies. Then, even if the
   traffic mix of the Internet were to become dominated by short
   messages, it would still be possible to control congestion
   effectively and efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval. But re-ECN can be deployed incrementally at
   the transport layer around unmodified routers using existing fields
   in IP (v4 or v6). However, it does also require the last undefined
   bit in the IPv4 header, which it uses in combination with the 2-bit
   ECN field to create four new codepoints. Nonetheless, changes to IP
   routers are RECOMMENDED in order to improve resilience against DoS
   attacks. Similarly, re-ECN works best if both the sender and
   receiver transports are re-ECN-capable, but it can work with just
   sender support. Section 7.1 summarises the incremental deployment
   strategy.

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion. Re-ECN is only concerned with enabling the ingress
   network to police that a source is complying with a congestion
   control algorithm, which is orthogonal to congestion control itself.
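
   As a side note on the two-router ECN example earlier in this
   section, the following minimal Python sketch suggests why 1% and 2%
   marking combine to approximately, rather than exactly, 3%. It
   assumes that each router marks independently of the others and that
   a packet already marked CE stays marked. The function name
   combined_marking_fraction is hypothetical and is not part of this
   specification.

```python
def combined_marking_fraction(per_router_fractions):
    """Return the whole-path CE fraction for a list of per-router fractions.

    Assumption (not taken from this document): routers mark independently,
    and a packet that is already CE stays CE.
    """
    unmarked = 1.0
    for fraction in per_router_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("marking fraction must lie in [0, 1]")
        # A packet stays unmarked only if every router leaves it unmarked.
        unmarked *= 1.0 - fraction
    return 1.0 - unmarked


if __name__ == "__main__":
    path = [0.01, 0.02]  # the two routers in the example above
    print(f"sum approximation:   {sum(path):.4f}")
    print(f"independent marking: {combined_marking_fraction(path):.4f}")
```

   Under these assumptions the whole-path fraction comes out at roughly
   2.98%, so simply adding the per-router fractions is a good
   approximation while marking fractions remain small.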

   Before re-ECN can be considered worthy of using up the last bit in
   the IP header, we must be sure that all our claims are robust. We
   have gradually been reducing the list of outstanding issues, but the
   few that still remain are listed in Section 6.3. We expect new
   attacks may still be found, but we offer the re-ECN protocol on the
   basis that it is built on fairly solid theoretical foundations and,
   so far, it has proved possible to keep it relatively robust.

3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two bit ECN field broadly as in
   RFC3168 [RFC3168] as described above, but with five differences of
   detail (brought together in a list in Section 7.1). This
   specification defines a new re-ECN extension (RE) flag. We will
   defer the definition of the actual position of the RE flag in the
   IPv4 & v6 headers until Section 5. Until then it will suffice to use
   an abstraction of the IPv4 and v6 wire protocols by just calling it
   the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and remain unchanged along the path, although it can be read by
   network elements that understand the re-ECN protocol. It is feasible
   that a network element MAY change the setting of the RE flag, perhaps
   acting as a proxy for an end-point, but such a protocol would have to
   be defined in another specification (e.g. [Re-PCN]).

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) making eight
   codepoints. We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   RFC3168 ECN defines uses for all four codepoints of the two-bit ECN
   field. This memo widens the codepoint space to eight, and uses six
   codepoints. One of re-ECN's codepoints is an alternative use of the
   codepoint set aside in RFC3168 for the ECN nonce (ECT(1)).
   Transports not using re-ECN can still use the ECN nonce, while those
   using re-ECN do not need to as long as the sender is also checking
   for transport protocol compliance [I-D.moncaster-tcpm-rcv-cheat].
   The case for doing this is given in Appendix I. Two re-ECN
   codepoints are given uses compatible with those defined in RFC3168
   (Not-ECT and CE). The other codepoint used by RFC3168 (ECT(0)) isn't
   used for re-ECN. Altogether this leaves one codepoint of the eight
   unused and available for future use.

   +-------+------------+------+--------------+------------------------+
   | ECN   | RFC3168    | RE   | Extended ECN | Re-ECN meaning         |
   | field | codepoint  | flag | codepoint    |                        |
   +-------+------------+------+--------------+------------------------+
   | 00    | Not-ECT    | 0    | Not-RECT     | Not re-ECN-capable     |
   |       |            |      |              | transport              |
   | 00    | Not-ECT    | 1    | FNE          | Feedback not           |
   |       |            |      |              | established            |
   | 01    | ECT(1)     | 0    | Re-Echo      | Re-echoed congestion   |
   |       |            |      |              | and RECT               |
   | 01    | ECT(1)     | 1    | RECT         | Re-ECN capable         |
   |       |            |      |              | transport              |
   | 10    | ECT(0)     | 0    | ---          | Legacy ECN use only    |
   | 10    | ECT(0)     | 1    | --CU--       | Currently unused       |
   | 11    | CE         | 0    | CE(0)        | Re-Echo canceled by    |
   |       |            |      |              | congestion experienced |
   | 11    | CE         | 1    | CE(-1)       | Congestion experienced |
   +-------+------------+------+--------------+------------------------+

                    Table 1: Extended ECN Codepoints

3.3. Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections. Other transports will be discussed later.

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol. Whenever the network adds CE congestion
   signalling to the IP header on the forward data path, the receiver
   feeds it back to the ingress using TCP, then the sender re-echoes it
   into the forward data path using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sets the feedback not established (FNE)
   codepoint. The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once a flow is established, a re-ECN sender always
   initialises the ECN field to ECT(1). And it usually sets the RE flag
   to "1". Whenever a router re-marks a packet to CE, the receiver
   feeds back this event to the sender. On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends.

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7.1). To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0". So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

        _      _                      _      _
      /   \  /   \                  /   \  /   \
      | S |--| 0 | - - - - - - - -  | i |--| D |
      \ _ /  \ _ /                  \ _ /  \ _ /
        .      .                      .      .
      ^ .      .                      .      .
      | .      .                      .      .
      | .    RE blanking fraction     .      .
   3% |-------------------------------+=======
      | .      .                      |      .
   2% | .      .                      |      .
      | .      .  CE marking fraction |      .
   1% | .      +----------------------+      .
      | .      |                      .      .
   0% +--------------------------------------->
           ^   0          ^           i   ^   resource index
           0   ^          1           ^   2   observation points
               |                      |
             1.00%                  2.00%     marking fraction

                Figure 1: A 2-Router Example (Imprecise)

   Figure 1 uses a simple network to illustrate how re-ECN allows
   routers to measure downstream congestion. The horizontal axis
   represents the index of each congestible resource (typically queues)
   along a path through the Internet. There may be many routers on the
   path, but we assume only two are currently congested (those with
   resource index 0 and i). The two superimposed plots show the
   fraction of each extended ECN codepoint in a flow observed along this
   path. Given about 3% of packets reaching the destination are marked
   CE, in response to feedback the sender will blank the RE flag in
   about 3% of packets it sends. Then approximate downstream congestion
   can be measured at the observation points shown along the path by
   subtracting the CE marking fraction from the RE blanking fraction, as
   shown in the table below (Appendix A derives these approximations
   from a precise analysis).

           +-------------------+------------------------------+
           | Observation point | Approx downstream congestion |
           +-------------------+------------------------------+
           | 0                 | 3% - 0% = 3%                 |
           | 1                 | 3% - 1% = 2%                 |
           | 2                 | 3% - 3% = 0%                 |
           +-------------------+------------------------------+

   Table 2: Downstream Congestion Measured at Example Observation Points

   All along the path, whole-path congestion remains unchanged so it can
   be used as a reference against which to compare upstream congestion.
   The difference predicts downstream congestion for the rest of the
   path. Therefore, measuring the fractions of each codepoint at any
   point in the Internet will reveal upstream, downstream and whole path
   congestion.
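
   The subtraction used for Table 2 can be sketched in a few lines of
   Python. This is only an illustration under stated assumptions: the
   Packet type, the function names and the synthetic trace are all
   hypothetical, the trace contains only packets of an established
   re-ECN flow (no FNE or legacy ECN packets), and fractions are
   weighted by octets as in the definition of the RE blanking fraction
   above.

```python
from dataclasses import dataclass

ECN_CE = 0b11    # congestion experienced (RFC 3168)
ECN_ECT1 = 0b01  # ECT(1), the ECN setting used by re-ECN senders


@dataclass(frozen=True)
class Packet:
    size: int     # octets
    ecn: int      # two-bit ECN field
    re_flag: int  # RE flag: 1 = set, 0 = blanked


def downstream_congestion(packets):
    """RE blanking fraction minus CE marking fraction, weighted by octets."""
    total = sum(p.size for p in packets)
    if total == 0:
        return 0.0
    blanked = sum(p.size for p in packets if p.re_flag == 0)
    marked = sum(p.size for p in packets if p.ecn == ECN_CE)
    return (blanked - marked) / total


def example_flow(n_packets, n_blanked, n_marked, size=1500):
    """Build a hypothetical trace: blanked packets first, then CE-marked ones."""
    flow = []
    for i in range(n_packets):
        re_flag = 0 if i < n_blanked else 1
        ecn = ECN_CE if n_blanked <= i < n_blanked + n_marked else ECN_ECT1
        flow.append(Packet(size, ecn, re_flag))
    return flow


if __name__ == "__main__":
    # CE marking accumulated so far at observation points 0, 1 and 2.
    for point, n_marked in enumerate([0, 100, 300]):
        estimate = downstream_congestion(example_flow(10000, 300, n_marked))
        print(f"observation point {point}: {estimate:.2%}")
```

   Run over 10000 equal-sized packets with 3% of them blanked, the
   sketch reproduces the three rows of Table 2. A real meter would work
   on a moving average rather than a stored trace; the trace is used
   here only to keep the example short.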
544 Note that we have introduced discussion of marking and blanking 545 fractions solely for illustration. To be absolutely clear, these 546 fractions are averages that would result from the behaviour of a TCP 547 protocol handler mechanically blanking outgoing packets in direct 548 response to incoming feedback---we are not saying any protocol 549 handler works with these average fractions directly. 551 3.4. Informal Terminology 553 In the rest of this memo we will loosely talk of positive or negative 554 flows, meaning flows where the moving average of the downstream 555 congestion metric is persistently positive or negative. The notion 556 of a negative metric arises because it is derived by subtracting one 557 metric from another. Of course actual downstream congestion cannot 558 be negative, only the metric can (whether due to time lags or 559 deliberate malice). 561 Just as we will loosely talk of positive and negative flows, we will 562 also talk of positive or negative packets, meaning packets that 563 contribute positively or negatively to the downstream congestion 564 metric. 566 Therefore we will talk of packets having `worth' of +1, 0 or -1, 567 which, when multiplied by their size, indicates their contribution to 568 the downstream congestion metric. 570 Figure 2 shows the main state transitions of the system once a flow 571 is established, showing the worth of packets in each state. When the 572 network congestion marks a packet it decrements its worth (moving 573 from the left of the main square to the right). When the sender 574 blanks the RE flag in order to re-echo congestion it increments the 575 worth of a packet (moving from the bottom of the main square to the 576 top). 
578 Sender state Sent Worth Received Worth 579 packet packet 580 +----------------------------------------------------+ 581 | ^ 582 V | 583 Congestion echoed -->Re-Echo +1 --+---> CE(0) 0 --+ 584 (positive) | (canceled) | 585 V network | 586 | congestion | 587 | | 588 Flow established --> RECT 0 ----+-> CE(-1) -1 --+ 589 ^ (neutral) | | (negative) 590 | | | 591 | no V V 592 | congestion | | 593 +-----------<--------------+-+ 595 Figure 2: Re-ECN System State Diagram (bootstrap not shown) 597 The idea is that every time the network decrements the worth of a 598 packet, the sender increments the worth of a later packet. Then, 599 over time, as many positive octets should arrive at the receiver as 600 negative. Note we have said octets not packets, so if packets are of 601 different sizes, the worth should be incremented on enough octets to 602 balance the octets in negative packets arriving at the receiver. It 603 is this balance that will allow the network to hold the sender 604 accountable for the congestion it causes, as we shall see. The 605 informal outline below uses TCP as an example transport, but the idea 606 would be broadly similar for any transport that adapts its rate to 607 congestion. 609 We will start with the sender in `flow established' state. Normally, 610 as acknowledgements of earlier packets arrive that don't feedback any 611 congestion, the congestion window can be opened, so the sender goes 612 round the smaller sub-loop, sending RECT packets (worth 0) and 613 returning to the flow established state to send another one. If a 614 router congestion marks one of the packets, it decrements the 615 packet's worth. The sender will have been continuing to traverse 616 round the smaller feedback loop every time acknowledgements arrive. 
617 But when congestion feedback returns from this packet that was marked 618 with -1 worth (the largest loop in the figure) the sender jumps to 619 the congestion echoed state in order to re-echo the congestion, 620 incrementing the worth of the next packet to +1 by blanking its RE 621 flag. The sender then returns to the flow established state and 622 continues round the smaller loop, sending packets worth 0. Note that 623 the size of the loops is just an artefact of the figure; it is not 624 meant to imply that one loop is slower than the other - they are both 625 the same end to end feedback loop. 627 If a packet carrying re-echoed congestion happens to also be 628 congestion marked, the +1 worth added by the sender will be cancelled 629 out by the -1 network congestion marking. Although the two worth 630 values correctly cancel out, neither the congestion marking nor the 631 re-echoed congestion are lost, because the RE bit and the ECN field 632 are orthogonal. So, whenever this happens, the receiver will 633 correctly detect and re-echo the new congestion event as well (the 634 top sub-loop). When we need to distinguish, we will sometimes call a 635 packet marked RECT 'neutral' (0 worth), while we will call the CE(0) 636 marking 'canceled' (also 0 worth). If a re-echoed packet isn't 637 unlucky enough to be further congestion marked, the sender will 638 return to the flow established state and continue to send RECT 639 packets (worth 0). 641 The table below specifies unambiguously the worth of each extended 642 ECN codepoint. Note the order is different from the previous table 643 to better show how the worth increments and decrements. The FNE 644 codepoint is an exception. It is used in the flow bootstrap process 645 (explained later) and has the same positive (+1) worth as a packet 646 with the Re-Echo codepoint. 
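   As an informal illustration of the worth values specified in
   Table 3, the following Python sketch sums worth multiplied by packet
   size over a sequence of packets.  The dictionary layout, the function
   name and the example packet sizes are assumptions made for this
   sketch; only the worth values themselves come from this memo.

```python
# Worth of each extended ECN codepoint, keyed by (ECN field, RE bit).
# Codepoints with no defined worth map to None.
WORTH = {
    ("00", 0): None,  # Not-RECT
    ("01", 0): +1,    # Re-Echo
    ("10", 0): None,  # legacy ECN use only
    ("11", 0): 0,     # CE(0), Re-Echo canceled by congestion experienced
    ("00", 1): +1,    # FNE
    ("01", 1): 0,     # RECT
    ("10", 1): None,  # currently unused
    ("11", 1): -1,    # CE(-1)
}


def flow_balance(packets):
    """Sum worth * size over (ecn_field, re_bit, size_in_octets) tuples."""
    total = 0
    for ecn_field, re_bit, size in packets:
        worth = WORTH[(ecn_field, re_bit)]
        if worth is not None:
            total += worth * size
    return total


# One neutral packet, one congestion marked packet and one later Re-Echo
# packet of the same size: the positive octets balance the negative ones.
example = [("01", 1, 1460), ("11", 1, 1460), ("01", 0, 1460)]
print(flow_balance(example))
```

   A persistently negative balance corresponds to what this memo loosely
   calls a negative flow.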
   +--------+------+----------------+-------+--------------------------+
   | ECN    | RE   | Extended ECN   | Worth | Re-ECN meaning           |
   | field  | bit  | codepoint      |       |                          |
   +--------+------+----------------+-------+--------------------------+
   | 00     | 0    | Not-RECT       | ...   | Not re-ECN-capable       |
   |        |      |                |       | transport                |
   | 01     | 0    | Re-Echo        | +1    | Re-echoed congestion and |
   |        |      |                |       | RECT                     |
   | 10     | 0    | ---            | ...   | Legacy ECN use only      |
   | 11     | 0    | CE(0)          | 0     | Re-Echo canceled by      |
   |        |      |                |       | congestion experienced   |
   | 00     | 1    | FNE            | +1    | Feedback not established |
   | 01     | 1    | RECT           | 0     | Re-ECN capable transport |
   | 10     | 1    | --CU--         | ...   | Currently unused         |
   |        |      |                |       |                          |
   | 11     | 1    | CE(-1)         | -1    | Congestion experienced   |
   +--------+------+----------------+-------+--------------------------+

                Table 3: 'Worth' of Extended ECN Codepoints

4.  Transport Layers

4.1.  TCP

   Re-ECN capability at the sender is essential.  At the receiver it is
   optional, as long as the receiver has a basic (`vanilla flavour')
   RFC3168-compliant ECN-capable transport (ECT) [RFC3168].  Given
   re-ECN is not the first attempt to define the semantics of the ECN
   field, we give a table below summarising what happens for various
   combinations of capabilities of the sender S and receiver R, as
   indicated in the first four columns below.  The last column gives the
   mode a half-connection should be in after the first two segments of
   TCP's three-way handshake.
   +--------+--------------+------------+---------+--------------------+
   | Re-ECT | ECT-Nonce    | ECT        | Not-ECT | S-R                |
   |        | (RFC3540)    | (RFC3168)  |         | Half-connection    |
   |        |              |            |         | Mode               |
   +--------+--------------+------------+---------+--------------------+
   | SR     |              |            |         | RECN               |
   | S      | R            |            |         | RECN-Co            |
   | S      |              | R          |         | RECN-Co            |
   | S      |              |            | R       | Not-ECT            |
   +--------+--------------+------------+---------+--------------------+

       Table 4: Modes of TCP Half-connection for Combinations of ECN
                 Capabilities of Sender S and Receiver R

   We will describe what happens in each mode, then describe how they
   are negotiated.  The abbreviations for the modes in the above table
   mean:

   RECN:  Full re-ECN capable transport

   RECN-Co:  Re-ECN sender in compatibility mode with a vanilla
      [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
      receiver.  Implementation of this mode is OPTIONAL.

   Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when
      at least one of the transports does not understand even basic ECN
      marking.

   Note that we use the term Re-ECT for a host transport that is
   re-ECN-capable but RECN for the modes of the half-connections between
   hosts when they are both Re-ECT.  If a host transport is Re-ECT, this
   fact alone does NOT imply either of its half-connections will
   necessarily be in RECN mode, at least not until it has confirmed that
   the other host is Re-ECT.

4.1.1.  RECN mode: Full re-ECN capable transport

   In full RECN mode, for each half-connection, the sender and the
   receiver each maintain an unsigned integer counter we will call ECC
   (echo congestion counter).  The receiver maintains a count, modulo 8,
   of how many times a CE marked packet has arrived during the
   half-connection.
   Once a RECN connection is established, the three TCP
   option flags (ECE, CWR & NS) used for ECN-related functions in other
   versions of ECN are used as a 3-bit field for the receiver to
   repeatedly tell the sender the current value of ECC whenever it sends
   a TCP ACK.  We will call this the echo congestion increment (ECI)
   field.  This overloaded use of these 3 option flags as one 3-bit ECI
   field is shown in Figure 4.  The actual definition of the TCP header,
   including the addition of support for the ECN nonce, is shown for
   comparison in Figure 3.  This specification does not redefine the
   names of these three TCP option flags; it merely overloads them with
   another definition once a flow is established.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the
                                TCP Header

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

     Figure 4: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header, overloading the current definitions above for established
                                RECN flows.

   Receiver Action in RECN Mode

      Every time a CE marked packet arrives at a receiver in RECN mode,
      the receiver transport increments its local value of ECC modulo 8
      and MUST echo its value to the sender in the ECI field of the next
      ACK.
It MUST repeat the same value of ECI in every subsequent ACK 762 until the next CE event, when it increments ECI again. 764 The increment of the local ECC values is modulo 8 so the field 765 value simply wraps round back to zero when it overflows. The 766 least significant bit is to the right (labelled bit 9). 768 A receiver in RECN mode MAY delay the echo of a CE to the next 769 delayed-ACK, which would be necessary if ACK-withholding were 770 implemented. 772 Sender Action in RECN Mode 774 On the arrival of every ACK, the sender compares the ECI field 775 with its own ECC value, then replaces its local value with that 776 from the ACK. The difference D is assumed to be the number of CE 777 marked packets that arrived at the receiver since it sent the 778 previously received ACK (but see below for the sender's safety 779 strategy). Whenever the ECI field increments by D (and/or d drops 780 are detected), the sender MUST clear the RE flag to "0" in the IP 781 header of the next D' data packets it sends (where D' = D + d), 782 effectively re-echoing each single increment of ECI. Otherwise 783 the data sender MUST send all data packets with RE set to "1". 785 As a general rule, once a flow is established, as well as setting 786 or clearing the RE flag as above, a data sender in RECN mode MUST 787 always set the ECN field to ECT(1). However, the settings of the 788 extended ECN field during flow start are defined in Section 4.1.4. 790 As we have already emphasised, the re-ECN protocol makes no 791 changes and has no effect on the TCP congestion control algorithm. 792 So, each increment of ECI (or detection of a drop) also triggers 793 the standard TCP congestion response, but with no more than one 794 congestion response per round trip, as usual. 796 A TCP sender also acts as the receiver for the other half- 797 connection. The host will maintain two ECC values S.ECC and R.ECC 798 as sender and receiver respectively. 
      Every TCP header sent by a
      host in RECN mode will also repeat the prevailing value of R.ECC
      in its ECI field.  If a sender in RECN mode has to retransmit a
      packet due to a suspected loss, the re-transmitted packet MUST
      carry the latest prevailing value of R.ECC when it is
      re-transmitted, which will not necessarily be the one it carried
      originally.

4.1.1.1.  Drops and Marks

   Re-ECN is based on the ECN protocol [RFC3168] which in turn is
   typically based on the RED algorithm [RFC2309].  This algorithm marks
   packets as CE with a probability that increases as the size of the
   router queue increases.  However, if the queue becomes too full then
   it will revert to dropping packets.  Because of this it is important
   that re-ECN treats each packet drop it detects as if it were actually
   a CE mark.  This ensures that it can continue to correctly echo
   congestion even through a highly congested path.

   In order to ensure that drops are correctly echoed the sender needs
   to add the number of drops detected per RTT to the difference in ECI
   value waiting to be echoed.  A drop is defined as set out in
   [RFC2581]: if the connection is in slow start then a single
   duplicate acknowledgement will be treated as an indication of a drop.
   When the system is in the congestion avoidance stage then 3 duplicate
   acknowledgements will be treated as a sign of a drop.  In all cases,
   if a re-transmission time-out occurs then that will be treated as a
   drop.

4.1.1.2.  Safety against Long Pure ACK Loss Sequences

   The ECI method was chosen for echoing congestion marking because a
   re-ECN sender needs to know about every CE mark arriving at the
   receiver, not just whether at least one arrives within a round trip
   time (which is all the ECE/CWR mechanism supported).  And, as pure
   ACKs are not protected by TCP reliable delivery, we repeat the same
   ECI value in every ACK until it changes.
Even if many ACKs in a row 835 are lost, as soon as one gets through, the ECI field it repeats from 836 previous ACKs that didn't get through will update the sender on how 837 many CE marks arrived since the last ACK got through. 839 The sender will only lose a record of the arrival of a CE mark if all 840 the ACKS are lost (and all of them were pure ACKs) for a stream of 841 data long enough to contain 8 or more CE marks. So, if the marking 842 fraction was p, at least 8/p pure ACKs would have to be lost. For 843 example, if p was 5%, a sequence of 160 pure ACKs would all have to 844 be lost. To protect against such extremely unlikely events, if a re- 845 ECN sender detects a sequence of pure ACKs has been lost it SHOULD 846 assume the ECI field wrapped as many times as possible within the 847 sequence. 849 Specifically, if a re-ECN sender receives an ACK with an 850 acknowledgement number that acknowledges L segments since the 851 previous ACK but with a sequence number unchanged from the previously 852 received ACK, it SHOULD conservatively assume that the ECI field 853 incremented by D' = L - ((L-D) mod 8), where D is the apparent 854 increase in the ECI field. For example if the ACK arriving after 9 855 pure ACK losses apparently increased ECI by 2, the assumed increment 856 of ECI would still be 2. But if ECI apparently increased by 2 after 857 11 pure ACK losses, ECI should be assumed to have increased by 10. 859 A re-ECN sender MAY implement a heuristic algorithm to predict beyond 860 reasonable doubt that the ECI field probably did not wrap within a 861 sequence of lost pure ACKs. But such an algorithm is NOT REQUIRED. 862 Such an algorithm MUST NOT be used unless it is proven to work even 863 in the presence of correlation between high ACK loss rate on the back 864 channel and high CE marking rate on the forward channel. 
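   The conservative assumption described in this section can be
   sketched in Python as follows.  The function names are hypothetical,
   and the fallback used when L is smaller than D is an assumption of
   this sketch, since the text above does not cover that case.

```python
def apparent_increment(sender_ecc, eci):
    """Apparent increase D of the 3-bit ECI field, modulo 8."""
    return (eci - sender_ecc) % 8


def assumed_increment(acked_segments, apparent):
    """Conservative increment D' = L - ((L - D) mod 8).

    acked_segments is L, the number of segments newly acknowledged by an
    ACK that arrives after a run of lost pure ACKs; apparent is D.
    """
    if acked_segments < apparent:
        # Assumption of this sketch: never assume fewer marks than the
        # ECI field itself reports.
        return apparent
    return acked_segments - ((acked_segments - apparent) % 8)


# The two worked examples from the text.
print(assumed_increment(9, 2))   # ECI cannot have wrapped within 9 segments
print(assumed_increment(11, 2))  # ECI may have wrapped once within 11 segments
```

   The result is the largest value not exceeding L that is congruent to
   D modulo 8, which is what "wrapped as many times as possible" means.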
866 Whatever assumption a re-ECN sender makes about potentially lost CE 867 marks, both its congestion control and its re-echoing behaviour 868 SHOULD be consistent with the assumption it makes. 870 4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver 872 If the half-connection is in RECN-Co mode, ECN feedback proceeds no 873 differently to that of vanilla ECN. In other words, the receiver 874 sets the ECE flag repeatedly in the TCP header and the sender 875 responds by setting the CWR flag. Although RECN-Co mode is used when 876 the receiver has not implemented the re-ECN protocol, the sender can 877 infer enough from its vanilla ECN feedback to set or clear the RE 878 flag reasonably well. Specifically, every time the receiver toggles 879 the ECE field from "0" to "1" (or a loss is detected), as well as 880 setting CWR in the TCP flags, the re-ECN sender MUST blank the RE 881 flag of the next packet to "0" as it would do in full RECN mode. 882 Otherwise, the data sender SHOULD send all other packets with RE set 883 to "1". Once a flow is established, a re-ECN data sender in RECN-Co 884 mode MUST always set the ECN field to ECT(1). 886 If a CE marked packet arrives at the receiver within a round trip 887 time of a previous mark, the receiver will still be echoing ECE for 888 the last CE mark. Therefore, such a mark will be missed by the 889 sender. Of course, this isn't of concern for congestion control, but 890 it does mean that very occasionally the RE blanking fraction will be 891 understated. Therefore flows in RECN-Co mode may occasionally be 892 mistaken for very lightly cheating flows and consequently might 893 suffer a small number of packet drops through an egress dropper 894 (Section 6.1.4). We expect re-ECN would be deployed for some time 895 before policers and droppers start to enforce it. 
So, given there is 896 not much ECN deployment yet anyway, this minor problem may affect 897 only a very small proportion of flows, reducing to nothing over the 898 years as vanilla ECN hosts upgrade. The use of RECN-Co mode would 899 need to be reviewed in the light of experience at the time of re-ECN 900 deployment. 902 RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their 903 code simple, MAY choose not to implement this mode. If they do not, 904 a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence 905 of an ECN-capable receiver. It MAY choose to fall back to the ECT- 906 Nonce mode, but if re-ECN implementers don't want to be bothered with 907 RECN-Co mode, they probably won't want to add an ECT-Nonce mode 908 either. 910 4.1.2.1. Re-ECN support for the ECN Nonce 912 A TCP half-connection in RECN-Co mode MUST NOT support the ECN 913 Nonce [RFC3540]. This means that the sending code of a re-ECN 914 implementation will never need to include ECN Nonce support. Re-ECN 915 is intended to provide wider protection than the ECN nonce against 916 congestion control misbehaviour, and re-ECN only requires support 917 from the sender, therefore it is preferable to specifically rule out 918 the need for dual sender implementations. As a consequence, a re-ECN 919 capable sender will never set ECT(0), so it will be easier for 920 network elements to discriminate re-ECN traffic flows from other ECN 921 traffic, which will always contain some ECT(0) packets. 923 However, a re-ECN implementation MAY OPTIONALLY include receiving 924 code that complies with the ECN Nonce protocol when interacting with 925 a sender that supports the ECN nonce (rather than re-ECN), but this 926 support is NOT REQUIRED. 928 RFC3540 allows an ECN nonce sender to choose whether to sanction a 929 receiver that does not ever set the nonce sum. 
Given re-ECN is 930 intended to provide wider protection than the ECN nonce against 931 congestion control misbehaviour, implementers of re-ECN receivers MAY 932 choose not to implement backwards compatibility with the ECN nonce 933 capability. This may be because they deem that the risk of sanctions 934 is low, perhaps because significant deployment of the ECN nonce seems 935 unlikely at implementation time. 937 4.1.3. Capability Negotiation 939 During the TCP hand-shake at the start of a connection, an originator 940 of the connection (host A) with a re-ECN-capable transport MUST 941 indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and 942 ECE=1 in the initial SYN. 944 A responding Re-ECT host (host B) MUST return a SYN ACK with flags 945 CWR=1 and ECE=0. The responding host MUST NOT set this combination 946 of flags unless the preceding SYN has already indicated Re-ECT 947 support as above. A Re-ECT server (B) can use either setting of the 948 NS flag combined with this type of SYN ACK in response to a SYN from 949 a Re-ECT client (A). Normally a Re-ECT server will reply to a Re-ECT 950 client with NS=0, but in the special circumstance below it can return 951 a SYN ACK with NS=1. 953 If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT 954 server B MUST increment its local value of ECC. But B cannot reflect 955 the value of ECC in the SYN ACK, because it is still using the 3 bits 956 to negotiate connection capabilities. So, server B MUST set the 957 alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0. 959 These handshakes are summarised in Table 5 below, with X meaning 960 `don't care'. The handshakes used for the other flavours of ECN are 961 also shown for comparison. To compress the width of the table, the 962 headings of the first four columns have been severely abbreviated, as 963 follows: 965 R: *R*e-ECT 967 N: ECT-*N*once (RFC3540) 969 E: *E*CT (RFC3168) 971 I: Not-ECT (*I*mplicit congestion notification). 
   These correspond with the same headings used in Table 4.  Indeed, the
   resulting modes in the last two columns of the table below are a more
   comprehensive way of saying the same thing as Table 4.

   +----+---+---+---+------------+-------------+-----------+-----------+
   | R  | N | E | I | SYN A-B    | SYN ACK B-A | A-B Mode  | B-A Mode  |
   +----+---+---+---+------------+-------------+-----------+-----------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE  |           |           |
   | AB |   |   |   | 1   1   1  | X   1   0   | RECN      | RECN      |
   | A  | B |   |   | 1   1   1  | 1   0   1   | RECN-Co   | ECT-Nonce |
   | A  |   | B |   | 1   1   1  | 0   0   1   | RECN-Co   | ECT       |
   | A  |   |   | B | 1   1   1  | 0   0   0   | Not-ECT   | Not-ECT   |
   | B  | A |   |   | 0   1   1  | 0   0   1   | ECT-Nonce | RECN-Co   |
   | B  |   | A |   | 0   1   1  | 0   0   1   | ECT       | RECN-Co   |
   | B  |   |   | A | 0   0   0  | 0   0   0   | Not-ECT   | Not-ECT   |
   +----+---+---+---+------------+-------------+-----------+-----------+

      Table 5: TCP Capability Negotiation between Originator (A) and
                              Responder (B)

   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
   its two half-connections into the modes given in Table 5.  As soon as
   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
   half-connections into the modes given in Table 5.  The
   half-connections will remain in these modes for the rest of the
   connection, including for the third segment of TCP's three-way
   hand-shake (the ACK).

   {ToDo: Consider SYNs within a connection.}

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken legacy implementation that
   behaves this way), RFC3168 specifies that the whole connection MUST
   revert to Not-ECT.

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   MUST NOT be interpreted as the 3-bit ECI value, which is only set as
   a copy of the local ECC value in non-SYN packets.
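   The client side of the negotiation in Table 5 can be sketched in
   Python as below, for a Re-ECT client A that has already sent a SYN
   with NS=1, CWR=1 and ECE=1.  The function name is hypothetical, and
   treating every flag combination not listed in Table 5 as Not-ECT is
   an assumption of this sketch rather than a rule stated in this memo.

```python
def client_modes(ns, cwr, ece):
    """Return (A-B mode, B-A mode) from the flags of the received SYN ACK."""
    if cwr == 1 and ece == 0:
        # NS is "don't care" here; NS=1 signals the SYN arrived marked CE(-1).
        return ("RECN", "RECN")
    if (ns, cwr, ece) == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")
    if (ns, cwr, ece) == (0, 0, 1):
        return ("RECN-Co", "ECT")
    # Covers 0 0 0 (responder is Not-ECT) and a SYN ACK that reflects the
    # SYN's 1 1 1 (broken legacy responder, see RFC3168).  Mapping any
    # other combination to Not-ECT is an assumption of this sketch.
    return ("Not-ECT", "Not-ECT")


print(client_modes(0, 1, 0))
print(client_modes(0, 0, 1))
print(client_modes(1, 1, 1))
```

   Once the modes are chosen they stay fixed for the rest of the
   connection, so a function like this would only be called once per
   SYN ACK.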
1013 4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after 1014 Idle Periods 1016 If the originator (A) of a TCP connection supports re-ECN it MUST set 1017 the extended ECN (EECN) field in the IP header of the initial SYN 1018 packet to the feedback not established (FNE) codepoint. 1020 FNE is a new extended ECN codepoint defined by this specification 1021 (Section 3.2). The feedback not established (FNE) codepoint is used 1022 when the transport does not have the benefit of ECN feedback so it 1023 cannot decide whether to set or clear the RE flag. 1025 If after receiving a SYN the server B has set its sending half- 1026 connection into RECN mode or RECN-Co mode, it MUST set the extended 1027 ECN field in the IP header of its SYN ACK to the feedback not 1028 established (FNE) codepoint. Note the careful wording here, which 1029 means that Re-ECT server B MUST set FNE on a SYN ACK whether it is 1030 responding to a SYN from a Re-ECT client or from a client that is 1031 merely ECN-capable. 1033 The original ECN specification [RFC3168] required SYNs and SYN ACKs 1034 to use the Not-ECT codepoint of the ECN field. The aim was to 1035 prevent well-known DoS attacks such as SYN flooding being able to 1036 gain from the advantage that ECN capability afforded over drop at 1037 ECN-capable routers. 1039 For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this 1040 caution was unnecessary, and proposes to allow a SYN ACK to be ECN- 1041 capable to improve performance. We have gone further by proposing to 1042 make the initial SYN ECN-capable too. By stipulating the FNE 1043 codepoint for the initial SYN, we comply with RFC3168 in word but not 1044 in spirit, because we have indeed set the ECN field to Not-ECT, but 1045 we have extended the ECN field with another bit. And it will be seen 1046 (Section 5.3) that we have defined one setting of that bit to mean an 1047 ECN-capable transport. 
Therefore, by proposing that the FNE 1048 codepoint MUST be used on the initial SYN of a connection, we have 1049 (deliberately) made the initial SYN ECN-capable. Section 5.4 1050 justifies deciding to make the initial SYN ECN-capable. 1052 Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will 1053 have already been set on the initial SYN and possibly the SYN ACK as 1054 above. But each re-ECN sender will have to set FNE cautiously on a 1055 few data packets as well, given a number of packets will usually have 1056 to be sent before sufficient congestion feedback is received. The 1057 behaviour will be different depending on the mode of the half- 1058 connection: 1060 RECN mode: Given the constraints on TCP's initial window [RFC3390] 1061 and its exponential window increase during slow start 1062 phase [RFC2581], it turns out that the sender SHOULD set FNE on 1063 the first and third data packets in its flow, assuming equal sized 1064 data packets once a flow is established. Appendix D presents the 1065 calculation that led to this conclusion. Below, after running 1066 through the start of an example TCP session, we give the intuition 1067 learned from that calculation. 1069 RECN-Co mode: A re-ECT sender that switches into re-ECN 1070 compatibility mode or into Not-ECT mode (because it has detected 1071 the corresponding host is not re-ECN capable) MUST limit its 1072 initial window to 1 segment. The reasoning behind this constraint 1073 is given in Section 5.4. Having set this initial window, a re-ECN 1074 sender in RECN-Co mode SHOULD set FNE on the first and third data 1075 packets in a flow, as for RECN mode. 
   +----+------+----------------+-------+-------+---------------+------+
   |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
   +----+------+----------------+-------+-------+---------------+------+
   |    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
   | -- | ---- | -------------- | ----- | ----- | ------------- | ---- |
   | 1  |      | 0100      SYN  | FNE   | -->   |    R.ECC=0    |      |
   |    |      |     CWR,ECE,NS |       |       |               |      |
   | 2  |      |    R.ECC=0     | <--   | FNE   | 0300 0101     |      |
   |    |      |                |       |       | SYN,ACK,CWR   |      |
   | 3  |      | 0101 0301 ACK  | RECT  | -->   |    R.ECC=0    |      |
   | 4  | 1000 | 0101 0301 ACK  | FNE   | -->   |    R.ECC=0    |      |
   | 5  |      |    R.ECC=0     | <--   | FNE   | 0301 1102 ACK | 1460 |
   | 6  |      |    R.ECC=0     | <--   | RECT  | 1762 1102 ACK | 1460 |
   | 7  |      |    R.ECC=0     | <--   | FNE   | 3222 1102 ACK | 1460 |
   | 8  |      | 1102 1762 ACK  | RECT  | -->   |    R.ECC=0    |      |
   | 9  |      |    R.ECC=0     | <--   | RECT  | 4682 1102 ACK | 1460 |
   | 10 |      |    R.ECC=0     | <--   | RECT  | 6142 1102 ACK | 1460 |
   | 11 |      | 1102 3222 ACK  | RECT  | -->   |    R.ECC=0    |      |
   | 12 |      |    R.ECC=0     | <--   | RECT  | 7602 1102 ACK | 1460 |
   | 13 |      |    R.ECC=1     | <*-   | RECT  | 9062 1102 ACK | 1460 |
   |    |      |      ...       |       |       |               |      |
   +----+------+----------------+-------+-------+---------------+------+

                     Table 6: TCP Session Example #1

   Table 6 shows an example TCP session, where the server B sets FNE on
   its first and third data packets (lines 5 & 7) as well as on the
   initial SYN ACK as previously described.  The left hand half of the
   table shows the relevant settings of headers sent by client A in
   three layers: the TCP payload size; TCP settings; then IP settings.
   The right hand half gives equivalent columns for server B.  The only
   TCP settings shown are the sequence number (SEQ), acknowledgement
   number (ACK) and the relevant control (CTL) flags that A sets in the
   TCP header.  The IP columns show the setting of the extended ECN
   (EECN) field.
1113 Also shown on the receiving side of the table is the value of the 1114 receiver's echo congestion counter (R.ECC) after processing the 1115 incoming EECN header. Note that, once a host sets a half-connection 1116 into RECN mode, it MUST initialise its local value of ECC to zero. 1118 The intuition that Appendix D gives for why a sender should set FNE 1119 on the first and third data packets is as follows. At line 13, a 1120 packet sent by B is shown with an '*', which means it has been 1121 congestion marked by an intermediate router from RECT to CE(-1). On 1122 receiving this CE marked packet, client A increments its ECC counter 1123 to 1 as shown. This was the 7th data packet B sent, but before 1124 feedback about this event returns to B, it might well have sent many 1125 more packets. Indeed, during exponential slow start, about as many 1126 packets will be in flight (unacknowledged) as have been acknowledged. 1127 So, when the feedback from the congestion event on B's 7th segment 1128 returns, B will have sent about 7 further packets that will still be 1129 in flight. At that stage, B's best estimate of the network's packet 1130 marking fraction will be 1/7. So, as B will have sent about 14 1131 packets, it should have already marked 2 of them as FNE in order to 1132 have marked 1/7; hence the need to have set the first and third data 1133 packets to FNE. 1135 Client A's behaviour in Table 6 also shows FNE being set on the first 1136 SYN and the first data packet (lines 1 & 4), but in this case it 1137 sends no more data packets, so of course, it cannot, and does not 1138 need to, set FNE again. Note that in the A-B direction there is no 1139 need to set FNE on the third part of the three-way hand-shake (line 1140 3---the ACK). 
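   The arithmetic behind that intuition can be written out as a short
   Python calculation.  The function name is hypothetical, and the
   assumption that exactly as many packets are in flight as have been
   acknowledged is the rough slow-start approximation used above, not a
   precise model (Appendix D gives the actual calculation).

```python
from fractions import Fraction


def fne_packets_needed(first_marked_packet):
    """Estimate how many FNE packets balance one mark during slow start.

    first_marked_packet is the position (counting from 1) of the first
    data packet that was congestion marked.
    """
    in_flight = first_marked_packet            # about as many again in flight
    sent_so_far = first_marked_packet + in_flight
    estimated_marking_fraction = Fraction(1, first_marked_packet)
    return estimated_marking_fraction * sent_so_far


# The example above: the 7th data packet is marked, about 14 packets have
# been sent by the time the feedback returns, so 2 of them need to be FNE.
print(fne_packets_needed(7))
```

   Under this approximation the result is 2 wherever the first mark
   falls, which is consistent with setting FNE on two early data
   packets.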
   Note that in this section we have used the word SHOULD rather than
   MUST when specifying how to set FNE on data segments before positive
   congestion feedback arrives (but note that the word MUST was used for
   FNE on the SYN and SYN ACK).  FNE is only RECOMMENDED for the first
   and third data segments to entertain the possibility that the TCP
   transport has the benefit of other knowledge of the path, which it
   re-uses from one flow for the benefit of a newly starting flow.  For
   instance, one flow can re-use knowledge of other flows between the
   same hosts if using a Congestion Manager [RFC3124] or when a proxy
   host aggregates congestion information for large numbers of flows.

   After an idle period of more than 1 second, a re-ECN sender transport
   MUST set the EECN field of the packet that resumes the connection to
   FNE.  Note that this next packet may be sent a very long time later;
   a packet does NOT have to be sent after 1 second of idling.  In order
   that the design of network policers can be deterministic, this
   specification deliberately puts an absolute lower limit on how long a
   connection can be idle before the packet that resumes the connection
   must be set to FNE, rather than relating it to the connection round
   trip time.  We use the lower bound of the retransmission timeout
   (RTO) [RFC2988], which is commonly used as the idle period before TCP
   must reduce to the restart window [RFC2581].  Note our specification
   of re-ECN's idle period is NOT intended to change the idle period for
   TCP's restart, nor indeed for any other purposes.

   {ToDo: Describe how the sender falls back to legacy modes if packets
   don't appear to be getting through (to work round firewalls
   discarding packets they consider unusual).}

4.1.5.  Pure ACKs, Retransmissions, Window Probes and Partial ACKs

   A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
   to Not-ECT in pure ACKs, retransmissions and window probes, as
   specified in [RFC3168].  Our eventual goal is for all packets to be
   sent with re-ECN enabled, and we believe the semantics of the ECI
   field go a long way towards being able to achieve this.  However, we
   have not completed a full security analysis for these cases, so for
   now we merely re-state current practice.

   We must also reconcile the facts that congestion marking is applied
   to packets but acknowledgements cover octet ranges and acknowledged
   octet boundaries need not match the transmitted boundaries.  The
   general principle we work to is to remain compatible with TCP's
   congestion control, which is driven by congestion events at packet
   granularity, while at the same time aiming to blank the RE flag on at
   least as many octets in a flow as have been marked CE.

   Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
   times as CE marked packets have been received.  And that value MUST
   be echoed to the sender in the first available ACK using the ECI
   field.  This ensures the TCP sender's congestion control receives
   timely feedback on congestion events at the same packet granularity
   that they were generated on congested routers.

   Then, a re-ECN sender stores the difference D between its own ECC
   value and the incoming ECI field by incrementing a counter R by D.
   R is then decremented by 1 for each subsequent packet that is sent
   with the RE flag blanked, until R is no longer positive.  Using this
   technique, whenever a re-ECN transport sends a not re-ECN capable
   (NRECN) packet (e.g. a retransmission), the remaining packets
   required to have the RE flag blanked will be automatically carried
   over to subsequent packets, through the variable R.
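   The carry-over behaviour of the counter R can be sketched in Python
   as follows.  The class and method names are hypothetical, the sketch
   counts in packets rather than octets, and it omits the safety
   strategy of Section 4.1.1.2, so it is an outline of the bookkeeping
   rather than a complete sender.

```python
class ReEchoSender:
    """Tracks how many packets still need their RE flag blanked."""

    def __init__(self):
        self.ecc = 0  # sender's copy of the echo congestion counter (mod 8)
        self.r = 0    # packets still owed a blanked RE flag

    def on_ack(self, eci, drops=0):
        """Add the ECI increment D (plus any detected drops) to R."""
        d = (eci - self.ecc) % 8
        self.ecc = eci
        self.r += d + drops

    def next_packet(self, re_ecn_capable=True):
        """Return (ECN field, RE flag) for the next packet to be sent."""
        if not re_ecn_capable:
            # Pure ACK, retransmission or window probe: R is left
            # untouched, so the owed re-echoes carry over.
            return ("Not-ECT", 0)
        if self.r > 0:
            self.r -= 1
            return ("ECT(1)", 0)  # Re-Echo
        return ("ECT(1)", 1)      # RECT


sender = ReEchoSender()
sender.on_ack(eci=2)                              # two new CE marks echoed
print(sender.next_packet(re_ecn_capable=False))   # retransmission
print(sender.next_packet())                       # first re-echo
print(sender.next_packet())                       # second re-echo
print(sender.next_packet())                       # back to RECT
```

   Counting R in octets instead of packets would only change the
   increment in on_ack and the decrement in next_packet.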

   This does not ensure precisely the same number of octets have RE
   blanked as were CE marked. But we believe positive errors will
   cancel negative over a long enough period. {ToDo: However, more
   research is needed to prove whether this is so. If it is not, it may
   be necessary to increment and decrement R in octets rather than
   packets, by incrementing R as the product of D and the size in octets
   of packets being sent (typically the MSS).}

4.2.  Other Transports

4.2.1.  General Guidelines for Adding Re-ECN to Other Transports

   Re-ECT sender transports that have established the receiver transport
   is at least ECN-capable (not necessarily re-ECN capable) MUST blank
   the RE flag in packets carrying at least as many octets as arrive at
   the receiver with the CE codepoint set. Re-ECN-capable sender
   transports should always initialise the ECN field to the ECT(1)
   codepoint once a flow is established.

   If the sender transport does not have sufficient feedback to even
   estimate the path's CE rate, it SHOULD set FNE continuously. If the
   sender transport has some, perhaps stale, feedback to estimate that
   the path's CE rate is almost certainly less than E%, the transport
   MAY blank RE in packets for E% of sent octets, and set the RECT
   codepoint for the remainder.

   The following sections give guidelines on how re-ECN support could be
   added to RSVP or NSIS, to DCCP, and to SCTP - although separate
   Internet drafts will be necessary to document the exact mechanics of
   re-ECN in each of these protocols.

   {ToDo: Give a brief outline of what would be expected for each of the
   following:

   o  UDP fire and forget (e.g. DNS)

   o  UDP streaming with no feedback

   o  UDP streaming with feedback

   }

4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS

   A separate I-D has been submitted [Re-PCN] describing how re-ECN can
   be used in an edge-to-edge rather than end-to-end scenario. It can
   then be used by downstream networks to police whether upstream
   networks are blocking new flow reservations when downstream
   congestion is too high, even though the congestion is in other
   operators' downstream networks. This relates to current IETF work on
   Admission Control over Diffserv using Pre-Congestion Notification
   (PCN) [PCN-arch].

4.2.3.  Guidelines for adding Re-ECN to DCCP

   Besides adjusting the initial features negotiation sequence,
   operating re-ECN in DCCP [RFC4340] could be achieved by defining a
   new option to be added to acknowledgments, which would include a
   multibit field where the destination could copy its ECC.

4.2.4.  Guidelines for adding Re-ECN to SCTP

   Annex 1 in [RFC2960] gives the specifications for SCTP to support
   ECN. Similar steps should be taken to support re-ECN. Besides
   adjusting the initial features negotiation sequence, operating re-ECN
   in SCTP could be achieved by defining a new control chunk, which
   would include a multibit field where the destination could copy its
   ECC.

5.  Network Layer

5.1.  Re-ECN IPv4 Wire Protocol

   The wire protocol of the ECN field in the IP header remains largely
   unchanged from [RFC3168]. However, an extension to the ECN field we
   call the RE (re-ECN extension) flag (Section 3.2) is defined in this
   document. It doubles the extended ECN codepoint space, giving 8
   potential codepoints. The semantics of the extra codepoints are
   backward compatible with the semantics of the 4 original codepoints
   [RFC3168] (Section 7.1 collects together and summarises all the
   changes defined in this document).
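
   As an illustration of how the 2-bit ECN field and the RE flag combine
   into a 3-bit extended ECN field, the sketch below decodes both from a
   raw IPv4 header. It assumes the RE flag position proposed for IPv4 in
   this section (bit 48, counting from 0) and takes the codepoint names
   and worth values from Table 7 (Section 5.3). The function and
   dictionary names are hypothetical and not part of the specification.

```python
# (ECN field, RE flag) -> (extended ECN codepoint, worth); worth is None
# where Table 7 lists it as n/a.
EECN_CODEPOINTS = {
    (0b00, 0): ("Not-RECT", None),
    (0b00, 1): ("FNE", +1),
    (0b01, 0): ("Re-Echo", +1),
    (0b01, 1): ("RECT", 0),
    (0b10, 0): ("Legacy ECN use only", None),
    (0b10, 1): ("CU", None),
    (0b11, 0): ("CE(0)", 0),
    (0b11, 1): ("CE(-1)", -1),
}


def extended_ecn(ipv4_header: bytes):
    """Return (ecn_field, re_flag) read from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("IPv4 header must be at least 20 octets")
    ecn_field = ipv4_header[1] & 0x03       # two low-order bits of octet 1
    re_flag = (ipv4_header[6] >> 7) & 0x01  # bit 48 of the header
    return ecn_field, re_flag


def classify(ipv4_header: bytes):
    """Return (codepoint name, worth) for a raw IPv4 header."""
    return EECN_CODEPOINTS[extended_ecn(ipv4_header)]
```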

   For IPv4, this document proposes that the new RE control flag will be
   positioned where the `reserved' control flag was at bit 48 of the
   IPv4 header (counting from 0). Alternatively, some would call this
   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
   header (Figure 5).

                            0   1   2
                          +---+---+---+
                          | R | D | M |
                          | E | F | F |
                          +---+---+---+

   Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at
                 the Start of Byte 7 of the IPv4 Header

   The semantics of the RE flag are described in outline in Section 3
   and specified fully in Section 4. The RE flag is always considered
   in conjunction with the 2-bit ECN field, as if they were concatenated
   together to form a 3-bit extended ECN field. If the ECN field is set
   to either the ECT(1) or CE codepoint, when the RE flag is blanked
   (cleared to "0") it represents a re-echo of congestion experienced by
   an earlier packet. If the ECN field is set to the Not-ECT codepoint,
   when the RE flag is set to "1" it represents the feedback not
   established (FNE) codepoint, which signals that the packet was sent
   without the benefit of congestion feedback.

   It is believed that the FNE codepoint can simultaneously serve other
   purposes, particularly where the start of a flow needs distinguishing
   from packets later in the flow. For instance it would have been
   useful to identify new flows for tag switching and might enable
   similar developments in the future if it were adopted. It is similar
   to the state set-up bit idea designed to protect against memory
   exhaustion attacks. This idea was proposed informally by David Clark
   and documented by Handley and Greenhalgh [Steps_DoS]. The FNE
   codepoint can be thought of as a `soft-state set-up flag', because it
   is idempotent (i.e.
   one occurrence of the flag is sufficient but further occurrences
   achieve the same effect if previous ones were lost).

   There will probably be other claims pending on the use of bit 48. We
   know of at least two, [ARI05] and [RFC3514], but neither has been
   pursued in the IETF so far, although the present proposal would meet
   the needs of the former.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately. The present proposal is backward compatible with
   RFC3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should. So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2.  Re-ECN IPv6 Wire Protocol

   For IPv6, this document proposes that the new RE control flag will be
   positioned as the first bit of the option field of a new Congestion
   hop by hop option header (Figure 6).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|                   Reserved for future use                   |
   |E|                                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option
        Header containing the Re-ECN Extension (RE) Control Flag

                           0 1 2 3 4 5 6 7 8
                          +-+-+-+-+-+-+-+-+-
                          |AIU|C|Option ID|
                          +-+-+-+-+-+-+-+-+-

          Figure 7: Congestion Hop by Hop Option Type Encoding

   The Hop-by-Hop Options header enables packets to carry information to
   be examined and processed by routers or nodes along the packet's
   delivery path, including the source and destination nodes. For
   re-ECN, the two bits of the Action If Unrecognized (AIU) flag of the
   Congestion extension header MUST be set to "00" meaning if
   unrecognized `skip over option and continue processing the header'.
   Then, any routers or a receiver not upgraded with the optional re-ECN
   features described in this memo will simply ignore this header. But
   routers with these optional re-ECN features, or a re-ECN policing
   function, will process this Congestion extension header.

   The `C' flag MUST be set to "1" to specify that the Option Data
   (currently only the RE control flag) can change en-route to the
   packet's final destination. This ensures that, when an
   Authentication header (AH [RFC2402]) is present in the packet, for
   any option whose data may change en-route, its entire Option Data
   field will be treated as zero-valued octets when computing or
   verifying the packet's authenticating value.
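
   The encoding in Figures 6 and 7 can be sketched as follows. This is
   an illustrative sketch only. The function names are hypothetical,
   and because no Option ID has been registered the 5-bit Option ID is
   passed in by the caller. The Hdr Ext Len value of 0 follows the base
   IPv6 rule that the length is counted in 8-octet units excluding the
   first 8 octets.

```python
def congestion_option_type(option_id: int) -> int:
    """Build the Option Type octet: AIU = 00, C = 1, 5-bit Option ID."""
    if not 0 <= option_id < 32:
        raise ValueError("Option ID must fit in 5 bits")
    aiu = 0b00  # skip over the option if unrecognized
    c = 1       # Option Data may change en route
    return (aiu << 6) | (c << 5) | option_id


def build_congestion_hbh_header(next_header: int, option_id: int,
                                re_flag: bool) -> bytes:
    """Build the 8-octet hop-by-hop header of Figure 6."""
    # RE is the first (most significant) bit of the 4-octet Option Data;
    # the remaining 31 bits are reserved and sent as zero here.
    option_data = bytes([0x80 if re_flag else 0x00, 0, 0, 0])
    hdr_ext_len = 0  # header is exactly 8 octets long
    return bytes([next_header, hdr_ext_len,
                  congestion_option_type(option_id),
                  len(option_data)]) + option_data
```

   With a placeholder Option ID of 0x1E (not a registered value), the
   Option Type octet is 0x3E and the header is 8 octets long.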

   Although the RE control flag should not be changed along the path, we
   expect that the rest of this option field that is currently `Reserved
   for future use' could be used for a multi-bit congestion notification
   field which we would expect to change en route. As the RE flag does
   not need end-to-end authentication, we set the C flag to '1'.

   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
   with IANA.}

5.3.  Router Forwarding Behaviour

   Re-ECN works well without modifying the forwarding behaviour of any
   routers. However, below, two OPTIONAL changes to forwarding
   behaviour are defined which respectively enhance performance and
   improve a router's discrimination against flooding attacks. They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168]. Specifications for PHBs MAY define
   different forwarding behaviours from this default, but they are not
   required to do so. [Re-PCN] is one example.

   FNE indicates ECT:

      The FNE codepoint tells a router to assume that the packet was
      sent by an ECN-capable transport (see Section 5.4). Therefore an
      FNE packet MAY be marked rather than dropped. Note that the FNE
      codepoint has been intentionally chosen so that, to legacy routers
      (which do not inspect the RE flag) an FNE packet appears to be
      Not-ECT so it will be dropped by legacy AQM algorithms.

      A network operator MUST NOT configure a router to ECN mark rather
      than drop FNE packets unless it can guarantee that FNE packets
      will be rate limited, either locally or upstream. The ingress
      policers discussed in Section 6.1.5 would count as rate limiters
      for this purpose.

   Preferential Drop:  If a re-ECN capable router experiences very high
      load so that it has to drop arriving packets (e.g.
      a DoS attack), it MAY preferentially drop packets within the same
      Diffserv PHB using the preference order for extended ECN
      codepoints given in Table 7. Preferential dropping can be
      difficult to implement on some hardware, but if feasible it would
      discriminate against attack traffic if done as part of the overall
      policing framework of Section 6.1.3. If nowhere else, routers at
      the egress of a network SHOULD implement preferential drop
      (stronger than the MAY above). For simplicity, preferences 4 & 5
      MAY be merged into one preference level.

   +-------+-----+------------+-------+------------+-------------------+
   | ECN   | RE  | Extended   | Worth | Drop Pref  | Re-ECN meaning    |
   | field | bit | ECN        |       | (1 = drop  |                   |
   |       |     | codepoint  |       | 1st)       |                   |
   +-------+-----+------------+-------+------------+-------------------+
   | 01    | 0   | Re-Echo    | +1    | 5/4        | Re-echoed         |
   |       |     |            |       |            | congestion and    |
   |       |     |            |       |            | RECT              |
   | 00    | 1   | FNE        | +1    | 4          | Feedback not      |
   |       |     |            |       |            | established       |
   | 11    | 0   | CE(0)      | 0     | 3          | Re-Echo canceled  |
   |       |     |            |       |            | by congestion     |
   |       |     |            |       |            | experienced       |
   | 01    | 1   | RECT       | 0     | 3          | Re-ECN capable    |
   |       |     |            |       |            | transport         |
   | 11    | 1   | CE(-1)     | -1    | 3          | Congestion        |
   |       |     |            |       |            | experienced       |
   | 10    | 1   | --CU--     | n/a   | 2          | Currently Unused  |
   | 10    | 0   | ---        | n/a   | 2          | Legacy ECN use    |
   |       |     |            |       |            | only              |
   | 00    | 0   | Not-RECT   | n/a   | 1          | Not               |
   |       |     |            |       |            | re-ECN-capable    |
   |       |     |            |       |            | transport         |
   +-------+-----+------------+-------+------------+-------------------+

     Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

      The above drop preferences are arranged to preserve packets with
      more positive worth (Section 3.4), given senders of positive
      packets must have honestly declared downstream congestion.
      This is explained fully in Section 6 on applications, particularly
      when the application of re-ECN to protect against DDoS attacks is
      described.

5.4.  Justification for Setting the First SYN to FNE

   Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and
   the initial SYN MUST be set to FNE by Re-ECT client A
   (Section 4.1.4). So an initial SYN may be marked CE(-1) rather than
   dropped. This seems dangerous, because the sender has not yet
   established whether the receiver is a legacy one that does not
   understand congestion marking. It also seems to allow malicious
   senders to take advantage of ECN marking to suffer fewer drops when
   launching SYN flooding attacks. Below we explain the features of the
   protocol design that remove both these dangers.

   ECN-capable initial SYN with a Not-ECT server:  If the TCP server B
      is re-ECN capable, provision is made for it to feed back a
      possible congestion marked SYN in the SYN ACK (Section 4.1.4).
      But if the TCP client A finds out from the SYN ACK that the server
      was not ECN-capable, the TCP client MUST consider the first SYN as
      congestion marked before setting itself into Not-ECT mode.
      Section 4.1.4 mandates that such a TCP client MUST also set its
      initial window to 1 segment. In this way we remove the need to
      cautiously avoid setting the first SYN to Not-RECT. This will
      give worse performance while deployment is patchy, but better
      performance once deployment is widespread.

   SYN flooding attacks can't exploit ECN-capability:  Malicious hosts
      may think they can use the advantage that ECN-marking gives over
      drop in launching classic SYN-flood attacks. But Section 5.3
      mandates that a router MUST only be configured to treat packets
      with the FNE codepoint as ECN-capable if FNE packets are rate
      limited.
      Introduction of the FNE codepoint was a deliberate move to enable
      transport-neutral handling of flow-start and flow state set-up in
      the IP layer where it belongs. It then becomes possible to
      protect against flooding attacks of all forms (not just SYN
      flooding) without transport-specific inspection for things like
      the SYN flag in TCP headers. Then, for instance, SYN flooding
      attacks using IPsec ESP encryption can also be rate limited at the
      IP layer.

   It might seem pedantic going to all this trouble to enable ECN on the
   initial packet of a flow, but it is motivated by a much wider concern
   to ensure safe congestion control will still be possible even if the
   application mix evolves to the point where the majority of flows
   consist of a single window or even a single packet. It also allows
   denial of service attacks to be more easily isolated and prevented.

5.5.  Control and Management

5.5.1.  Negative Balance Warning

   A new ICMP message type is being considered so that a dropper can
   warn the apparent sender of a flow that it has started to sanction
   the flow. The message would have similar semantics to the `Time
   exceeded' ICMP message type. To ensure the sender has to invest some
   work before the network will generate such a message, a dropper
   SHOULD only send such a message for flows that have demonstrated that
   they have started correctly by establishing a positive record, but
   have later gone negative. The threshold is up to the implementation.
   The purpose of the message is to distinguish drops caused by
   sanctions from drops due to other causes, such as congestion or
   transmission losses. The dropper would send the message to the
   sender of the flow, not the receiver.

   If we did define this message type, it would be REQUIRED for all
   re-ECT senders to parse and understand it. Note that a sender MUST
   only use this message to explain why losses are occurring.
   A sender MUST NOT take this message to mean that losses have occurred
   that it was not aware of. Otherwise, spoof messages could be sent by
   malicious sources to slow down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings. {ToDo:
   Complete this section.}

5.5.2.  Rate Response Control

   As discussed in Section 6.1.5, the sender's access operator will be
   expected to use bulk per-user policing, but they might choose to
   introduce a per-flow policer. In cases where operators do introduce
   per-flow policing, there may be a need for a sender to send a request
   to the ingress policer asking for permission to apply a non-default
   response to congestion (where TCP-friendly is assumed to be the
   default). This would require the sender to know what message
   format(s) to use and to be able to discover how to address the
   policer. The required control protocol(s) are outside the scope of
   this document, but will require definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork. So, discovery should
   not be hard. A variety of control protocols already exist for some
   widely used rate-responses to congestion. For instance DCCP
   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
   so does QoS signalling (e.g. an RSVP request for controlled load
   service is equivalent to a request for no rate response to
   congestion, but with admission control).

5.6.  IP in IP Tunnels

   For re-ECN to work correctly through IP in IP tunnels, it needs
   slightly different tunnel handling to regular ECN [RFC3168].
   Currently there is some inconsistency between how the handling of IP
   in IP tunnels is defined in [RFC3168] and how it is defined in
   [RFC4301], but re-ECN would work fine with the IPsec behaviour. This
   inconsistency is addressed in a new Internet Draft [ECN-tunnel] that
   proposes to update RFC3168 tunnel behaviour to bring it into line
   with IPsec. Ideally, for re-ECN to work through a tunnel, the tunnel
   entry should copy both the RE flag and the ECN field from the inner
   to the outer IP header. Then at the tunnel exit, any congestion
   marking of the outer ECN field should overwrite the inner ECN field
   (unless the inner field is Not-ECT in which case an alarm should be
   raised). The RE flag shouldn't change along a path, so the outer RE
   flag should be the same as the inner. If it isn't, a management
   alarm should be raised. This behaviour is the same as the
   full-functionality variant of [RFC3168] at tunnel exit, but different
   at tunnel entry.

   If tunnels are left as they are specified in [RFC3168], whether the
   limited or full-functionality variants are used, a problem arises
   with re-ECN if a tunnel crosses an inter-domain boundary, because the
   difference between positive and negative markings will not be
   correctly accounted for. In a limited functionality ECN tunnel, the
   flow will appear to be legacy traffic, and therefore may be wrongly
   rate limited. In a full-functionality ECN tunnel, the result will
   depend on whether the tunnel entry copies the inner RE flag to the
   outer header or the RE flag in the outer header is always cleared.
   If the former, the flow will tend to be too positive when accounted
   for at borders. If the latter, it will be too negative. If the
   rules set out in [ECN-tunnel] are followed then this will not be an
   issue.

5.7.  Non-Issues

   The following issues might seem to cause unfavourable interactions
   with re-ECN, but we will explain why they don't:

   o  Various link layers support explicit congestion notification, such
      as Frame Relay and ATM. Explicit congestion notification is
      proposed to be added to other link layers, such as Ethernet
      (802.3ar Ethernet congestion management) and MPLS [ECN-MPLS];

   o  Encryption and IPsec.

   In the case of congestion notification at the link layer, each
   particular link layer scheme either manages congestion on the link
   with its own link-level feedback (the usual arrangement in the cases
   of ATM and Frame Relay), or congestion notification from the link
   layer is merged into congestion notification at the IP level when the
   frame headers are decapsulated at the end of the link (the
   recommended arrangement in the Ethernet and MPLS cases). Given the
   RE flag is not intended to change along the path, this means that
   downstream congestion will still be measurable at any point where IP
   is processed on the path by subtracting negative from positive
   markings.

   In the case of encryption, as long as the tunnel issues described in
   Section 5.6 are dealt with, payload encryption itself will not be a
   problem. The design goal of re-ECN is to include downstream
   congestion in the IP header so that it is not necessary to burrow
   into inner headers. Obfuscation of flow identifiers is not a problem
   for re-ECN policing elements. Re-ECN doesn't ever require flow
   identifiers to be valid, it only requires them to be unique. So if
   an IPsec encapsulating security payload (ESP [RFC2406]) or an
   authentication header (AH [RFC2402]) is used, the security parameters
   index (SPI) will be a sufficient flow identifier, as it is intended
   to be unique to a flow without revealing actual port numbers.

   In general, even if endpoints use some locally agreed scheme to hide
   port numbers, re-ECN policing elements can just consider the pair of
   source and destination IP addresses as the flow identifier. Re-ECN
   encourages endpoints to at least tell the network layer that a
   sequence of packets are all part of the same flow, if indeed they
   are. The alternative would be for the sender to make each packet
   appear to be a new flow, which would require them all to be marked
   FNE in order to avoid being treated with the bulk of malicious flows
   at the egress dropper. Given the FNE marking is worth +1 and
   networks are likely to rate limit FNE packets, endpoints are given an
   incentive not to set FNE on each packet. But if the sender really
   does want to hide the flow relationship between packets it can choose
   to pay the cost of multiple FNE packets, which in the long run will
   compensate for the extra memory required on network policing elements
   to process each flow.

6.  Applications

6.1.  Policing Congestion Response

6.1.1.  The Policing Problem

   The current Internet architecture trusts hosts to respond voluntarily
   to congestion. Limited evidence shows that the large majority of
   end-points on the Internet comply with a TCP-friendly response to
   congestion. But telephony (and increasingly video) services over the
   best effort Internet are attracting the interest of major commercial
   operations. Most of these applications do not respond to congestion
   at all. Those that can switch to lower rate codecs still have a
   lower bound below which they must become unresponsive to congestion.

   Of course, the Internet is intended to support many different
   application behaviours. But the problem is that this freedom can be
   exercised irresponsibly.
   The greater problem is that we will never be able to agree on where
   the boundary is between responsible and irresponsible. Therefore
   re-ECN is designed to allow different networks to set their own view
   of the limit to irresponsibility, and to allow networks that choose a
   more conservative limit to push back against congestion caused in
   more liberal networks.

   As an example of the impossibility of setting a standard for
   fairness, mandating TCP-friendliness would set the bar too high for
   unresponsive streaming media, but still some would say the bar was
   too low. Even though all known peer-to-peer filesharing applications
   are TCP-compatible, they can cause a disproportionate amount of
   congestion, simply by using multiple flows and by transferring data
   continuously relative to other short-lived sessions. On the other
   hand, if we swung the other way and set the bar low enough to allow
   streaming media to be unresponsive, we would also allow denial of
   service attacks, which are typically unresponsive to congestion and
   consist of multiple continuous flows.

   Applications that need (or choose) to be unresponsive to congestion
   can effectively take (some would say steal) whatever share of
   bottleneck resources they want from responsive flows. Whether or not
   such free-riding is common, inability to prevent it increases the
   risk of poor returns for investors in network infrastructure, leading
   to under-investment. An increasing proportion of unresponsive or
   free-riding demand coupled with persistent under-supply is a broken
   economic cycle. Therefore, if the current, largely co-operative
   consensus continues to erode, congestion collapse could become more
   common in more areas of the Internet [RFC3714].

   While we have designed re-ECN so that networks can choose to deploy
   stringent policing, this does not imply we advocate that every
   network should introduce tight controls on those that cause
   congestion. Re-ECN has been specifically designed to allow different
   networks to choose how conservative or liberal they wish to be with
   respect to policing congestion. But those that choose to be
   conservative can protect themselves from the excesses that liberal
   networks allow their users.

6.1.2.  The Case Against Bottleneck Policing

   The state of the art in rate policing is the bottleneck policer,
   which is intended to be deployed at any forwarding resource that may
   become congested. Its aim is to detect flows that cause
   significantly more local congestion than others. Although operators
   might solve their immediate problems by deploying bottleneck
   policers, we are concerned that widespread deployment would make it
   extremely hard to evolve new application behaviours. We believe the
   IETF should offer re-ECN as the preferred protocol on which to base
   solutions to the policing problems of operators, because it would not
   harm evolvability and, frankly, it would be far more effective (see
   later for why).

   Schemes like [XCHOKe] and [pBox] are good approaches to rate policing
   traffic without the benefit of whole path information (such as could
   be provided by re-ECN). But they must be deployed at bottlenecks in
   order to work. Unfortunately, a large proportion of traffic
   traverses at least two bottlenecks (in two access networks),
   particularly with the current traffic mix where peer-to-peer
   file-sharing is prevalent. If ECN were deployed, we believe it would
   be likely that these bottleneck policers would be adapted to combine
   ECN congestion marking from the upstream path with local congestion
   knowledge.
   But then the only useful placement for such policers would be close
   to the egress of the internetwork.

   But then, if these bottleneck policers were widely deployed (which
   would require them to be more effective than they are now), the
   Internet would find itself with one universal rate adaptation policy
   (probably TCP-friendliness) embedded throughout the network. Given
   TCP's congestion control algorithm is already known to be hitting its
   scalability limits and new algorithms are being developed for
   high-speed congestion control, embedding TCP policing into the
   Internet would make evolution to new algorithms extremely painful.
   If a source wanted to use a different algorithm, it would have to
   first discover then negotiate with all the policers on its path,
   particularly those in the far access network. The IETF has already
   traveled that path with the Intserv architecture and found it
   constrains scalability [RFC2208].

   In any case, if bottleneck policers were ever widely deployed, they
   would be likely to be bypassed by determined attackers. They
   inherently have to police fairness per flow or per source-destination
   pair. Therefore they can easily be circumvented either by opening
   multiple flows (by varying the end-point port number); or by spoofing
   the source address but arranging with the receiver to hide the true
   return address at a higher layer.

6.1.3.  Re-ECN Incentive Framework

   The aim is to create an incentive environment that ensures optimal
   sharing of capacity despite everyone acting selfishly (including
   lying and cheating). Of course, the mechanisms put in place for this
   can lie dormant wherever co-operation is the norm.

   Throughout this document we focus on path congestion. But some forms
   of fairness, particularly TCP's, also depend on round trip time.
   If TCP-fairness is required, we also propose to measure downstream
   path delay using re-feedback. We give a simple outline of how this
   could work in Appendix F. However, we do not expect this to be
   necessary, as researchers tend to agree that only congestion control
   dynamics need to depend on RTT, not the rate that the algorithm would
   converge on after a period of stability.

   Figure 8 sketches the incentive framework that we will describe piece
   by piece throughout this section. We will do a first pass in
   overview, then return to each piece in detail. We re-use the earlier
   example of how downstream congestion is derived by subtracting
   upstream congestion from path congestion (Figure 1) but depict
   multiple trust boundaries to turn it into an internetwork. For
   clarity, only downstream congestion is shown (the difference between
   the two earlier plots). The graph displays downstream path
   congestion seen in a typical flow as it traverses an example path
   from sender S to receiver R, across networks N1, N2 & N4. Everyone
   is shown using re-ECN correctly, but we intend to show why everyone
   would /choose/ to use it correctly, and honestly.

   Three main types of self-interest can be identified:

   o  Users want to transmit data across the network as fast as
      possible, paying as little as possible for the privilege. In this
      respect, there is no distinction between senders and receivers,
      but we must be wary of potential malice by one on the other;

   o  Network operators want to maximise revenues from the resources
      they invest in. They compete amongst themselves for the custom of
      users;

   o  Attackers (whether users or networks) want to use any opportunity
      to subvert the new re-ECN system for their own gain or to damage
      the service of their victims, whether targeted or random.
1795 policer 1796 | 1797 | 1798 S <-----N1----> <---N2---> <---N4--> R domain 1799 | : : 1800 A\|/: : 1801 | V : : 1802 3% |---------+ : 1803 | : | : 1804 2% | : +-----------------------+ : 1805 | : downstream congestion | : 1806 1% | : | : 1807 | : | : 1808 0% +---------------------------------+=====--> 1809 0 i ^ resource index 1810 | | /|\ 1811 1.00% 2.00% | marking fraction 1812 | 1813 dropper 1815 Figure 8: Incentive Framework, showing creation of opposing pressures 1816 to under-declare and over-declare downstream congestion, using a 1817 policer and a dropper 1819 Source congestion control: We want to ensure that the sender will 1820 throttle its rate as downstream congestion increases. Whatever 1821 the agreed congestion response (whether TCP-compatible or some 1822 enhanced QoS), to some extent it will always be against the 1823 sender's interest to comply. 1825 Ingress policing: But it is in all the network operators' interests 1826 to encourage fair congestion response, so that their investments 1827 are employed to satisfy the most valuable demand. The re-ECN 1828 protocol ensures packets carry the necessary information about 1829 their own expected downstream congestion so that N1 can deploy a 1830 policer at its ingress to check that S1 is complying with whatever 1831 congestion control it should be using (Section 6.1.5). If N1 is 1832 extremely conservative it could police each flow, but it is likely 1833 to just police the bulk amount of congestion each customer causes 1834 without regard to flows, or if it is extremely liberal it need not 1835 police congestion control at all. Whatever, it is always 1836 preferable to police traffic at the very first ingress into an 1837 internetwork, before non-compliant traffic can cause any damage. 
Edge egress dropper: If the policer ensures the source has less right to a high rate the higher it declares downstream congestion, the source has a clear incentive to understate downstream congestion. But, if flows of packets are understated when they enter the internetwork, they will have become negative by the time they leave. So, we introduce a dropper at the last network egress, which drops packets in flows that persistently declare negative downstream congestion (see Section 6.1.4 for details).

[Figure 9 (plot): the same plot of downstream congestion (marking fraction, 0% to 3%) against resource index along the path S, N1, N2, N4, R as in Figure 8, annotated with competitive routing (dotted) and penalties acting at the inter-domain borders, and with sanctions marked below the resource index axis.]

Figure 9: Incentives at Inter-domain Borders

Inter-domain traffic policing: But next we must ask, if congestion arises downstream (say in N4), what is the ingress network's (N1's) incentive to police its customers' response? If N1 turns a blind eye, its own customers benefit while other networks suffer. This is why all inter-domain QoS architectures (e.g. Intserv, Diffserv) police traffic each time it crosses a trust boundary. We have already shown that re-ECN gives a trustworthy measure of the expected downstream congestion that a flow will cause by subtracting negative volume from positive at any intermediate point on a path. N4 (say) can use this measure to police all the responses to congestion of all the sources beyond its upstream neighbour (N2), but in bulk with one very simple passive mechanism, rather than per flow, as we will now explain using Figure 9.
Emulating policing with inter-domain congestion penalties: Between high-speed networks, we would rather avoid per-flow policing, and we would rather avoid holding back traffic while it is policed. Instead, once re-ECN has arranged headers to carry downstream congestion honestly, N2 can contract to pay N4 penalties in proportion to a single bulk count of the congestion metrics crossing their mutual trust boundary (Section 6.1.6). In this way, N4 puts pressure on N2 to suppress downstream congestion, for every flow passing through the border interface, even though they will all start and end in different places, and even though they may all be allowed different responses to congestion. The figure depicts this downward pressure on N2 by the solid downward arrow at the egress of N2. Then N2 has an incentive either to police the congestion response of its own ingress traffic (from N1) or to emulate policing by applying penalties to N1 in turn on the basis of congestion counted at their mutual boundary. In this recursive way, the incentives for each flow to respond correctly to congestion trace back with each flow precisely to each source, despite the mechanism not recognising flows (see Section 6.2.2).

Inter-domain congestion charging diversity: Any two networks are free to agree any of a range of penalty regimes between themselves but they would only provide the right incentives if they were within the following reasonable constraints. N2 should expect to have to pay penalties to N4 where penalties monotonically increase with the volume of congestion and negative penalties are not allowed. For instance, they may agree an SLA with tiered congestion thresholds, where higher penalties apply the higher the threshold that is broken.
But the most obvious (and useful) form of penalty is where N4 levies a charge on N2 proportional to the volume of downstream congestion N2 dumps into N4. In the explanation that follows, we assume this specific variant of volume charging between networks: charging proportionate to the volume of congestion.

   We must make clear that we are not advocating that everyone should use this form of contract. We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. And we strongly share this desire to encourage diversity. But our aim is merely to show that border policing can at least work with this one model; we can then assume that operators might experiment with the metric in other models (see Section 6.1.6 for examples). Of course, operators are free to complement this usage element of their charges with traditional capacity charging, and we expect they will, as predicted by economics.

No congestion charging to users: Bulk congestion penalties at trust boundaries are passive and extremely simple, and lose none of their per-packet precision from one boundary to the next (unlike Diffserv all-address traffic conditioning agreements, which dissipate their effectiveness across long topologies). But at any trust boundary, there is no imperative to use congestion charging.

   Traditional traffic policing can be used, if its complexity and cost are preferred. In particular, at the boundary with end customers (e.g. between S and N1), traffic policing will most likely be more appropriate. Policer complexity is less of a concern at the edge of the network. And end-customers are known to be highly averse to the unpredictability of congestion charging.
   NOTE WELL: This document neither advocates nor requires congestion charging for end customers, and it advocates but does not require inter-domain congestion charging.

Competitive discipline of inter-domain traffic engineering: With inter-domain congestion charging, a domain seems to have a perverse incentive to fake congestion; N2's profit depends on the difference between congestion at its ingress (its revenue) and at its egress (its cost). So, overstating internal congestion seems to increase profit. However, smart border routing [Smart_rtg] by N1 will bias its routing towards the least cost routes. So, N2 risks losing all its revenue to competitive routes if it overstates congestion (see Section 6.2.3). In other words, if N2 is the least congested route, its ability to raise excess profits is limited by the congestion on the next least congested route. This pressure on N2 to remain competitive is represented by the dotted downward arrow at the ingress to N2 in Figure 9.

Closing the loop: All the above elements conspire to trap everyone between two opposing pressures (the downward and upward arrows in Figure 8 and Figure 9), ensuring the downstream congestion metric arrives at the destination neither above nor below zero. So, we have arrived back where we started in our argument. The ingress edge network can rely on downstream congestion declared in the packet headers presented by the sender. So it can police the sender's congestion response accordingly.

Evolvability of congestion control: We have seen that re-ECN enables policing at the very first ingress. We have also seen that, as flows continue on their path through further networks downstream, re-ECN removes the need for further per-domain ingress policing of all the different congestion responses allowed to each different flow.
This is why the evolvability of re-ECN policing is so superior to bottleneck policing or to any policing of different QoS for different flows. Even if all access networks choose to conservatively police congestion per flow, each will want to compete with the others to allow new responses to congestion for new types of application. With re-ECN, each can introduce new controls independently, without coordinating with other networks and without having to standardise anything. But, as we have just seen, by making inter-domain penalties proportionate to bulk downstream congestion, downstream networks can be agnostic to the specific congestion response for each flow, but they can still apply more penalty the more liberal the ingress access network has been in the response to congestion it allowed for each flow.

6.1.3.1. The Case against Classic Feedback

A system that produces an optimal outcome as a result of everyone's selfish actions is extremely powerful, especially one that enables evolvability of congestion control. But why do we have to change to re-ECN to achieve it? Can't classic congestion feedback (as used already by standard ECN) be arranged to provide similar incentives and similar evolvability? Superficially it can. Kelly's seminal work showed how we can allow everyone the freedom to evolve whatever congestion control behaviour is in their application's best interest but still optimise the whole system of networks and users by placing a price on congestion to ensure responsible use of this freedom [Evol_cc]. Kelly used ECN with its classic congestion feedback model as the mechanism to convey congestion price information. The mechanism could be thought of as volume charging; except only the volume of packets marked with congestion experienced (CE) was counted.
However, below we explain why relying on classic feedback /requires/ congestion charging to be used, while re-ECN achieves the same powerful outcome (given it is built on Kelly's foundations), but does not /require/ congestion charging. In brief, the problem with classic feedback is that the incentives have to trace the indirect path back to the sender---the long way round the feedback loop. For example, if classic feedback were used in Figure 8, N2 would have to influence N1 via all of N4, R and S rather than directly.

Inability to agree what is happening downstream: In order to police its upstream neighbour's congestion response, the neighbours should be able to agree on the congestion to be responded to. Whatever the feedback regime, as packets change hands at each trust boundary, any path metrics they carry are verifiable by both neighbours. But, with a classic path metric, they can only agree on the /upstream/ path congestion.

Inaccessible back-channel: The network needs a whole-path congestion metric if it wants to control the source. Classically, whole path congestion emerges at the destination, to be fed back from receiver to sender in a back-channel. But, in any data network, back-channels need not be visible to relays, as they are essentially communications between the end-points. They may be encrypted, asymmetrically routed or simply omitted, so no network element can reliably intercept them. The congestion charging literature solves this problem by charging the receiver and assuming this will cause the receiver to refer the charges to the sender. But, of course, this creates unintended side-effects...
`Receiver pays' unacceptable: In connectionless datagram networks, receivers and receiving networks cannot prevent reception from malicious senders, so `receiver pays' opens them to `denial of funds' attacks.

End-user congestion charging unacceptable: Even if `denial of funds' were not a problem, we know that end-users are highly averse to the unpredictability of congestion charging and anyway, we want to avoid restricting network operators to just one retail tariff. But with classic feedback only an upstream metric is available, so we cannot avoid having to wrap the `receiver pays' money flow around the feedback loop, necessarily forcing end-users to be subjected to congestion charging.

To summarise so far, with classic feedback, policing congestion response without losing evolvability /requires/ congestion charging of end-users and a `receiver pays' model, whereas, with re-ECN, it is still possible to influence incentives using congestion charging but using the safer `sender pays' model. However, congestion charging is only likely to be appropriate between domains. So, without losing evolvability, re-ECN enables technical policing mechanisms that are more appropriate for end users than congestion pricing.

We now take a second pass over the incentive framework, filling in the detail.

6.1.4. Egress Dropper

As traffic leaves the last network before the receiver (domain N4 in Figure 8), the fraction of positive octets in a flow should match the fraction of negative octets introduced by congestion marking, leaving a balance of zero. If it is less (a negative flow), it implies that the source is understating path congestion (which will reduce the penalties that N2 owes N4).
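As a rough illustration of the balance just described, and not the dropper algorithm of Appendix E, the following Python sketch computes the balance of a flow from the worth of each marking as used in this section (Re-Echo and FNE positive, CE(-1) negative, RECT and CE(0) zero). The names `Packet`, `flow_balance` and `classify`, the string labels for the markings, and the `tolerance` parameter are all assumptions introduced for the example. A real dropper would also have to judge persistence over time rather than a single batch of packets.

```python
from dataclasses import dataclass

# Worth of each re-ECN marking as described in this section.
# The string labels are illustrative names, not wire-format codepoints.
WORTH = {
    "Re-Echo": +1,  # positive
    "FNE": +1,      # positive (feedback not established)
    "RECT": 0,      # neutral
    "CE(0)": 0,     # canceled
    "CE(-1)": -1,   # negative (congestion marked)
}


@dataclass
class Packet:
    marking: str
    octets: int


def flow_balance(packets):
    """Return (positive octets - negative octets) / total octets."""
    total = sum(p.octets for p in packets)
    if total == 0:
        return 0.0
    signed = sum(WORTH[p.marking] * p.octets for p in packets)
    return signed / total


def classify(packets, tolerance=0.0):
    """Label a batch of packets from one flow by the sign of its balance.

    `tolerance` is a hypothetical dead band around zero.
    """
    balance = flow_balance(packets)
    if balance < -tolerance:
        return "negative"
    if balance > tolerance:
        return "positive"
    return "balanced"
```

For instance, a flow whose packets are 3% congestion marked but only 2% Re-Echo has a balance of -1% and would be classed as negative, whereas one that re-echoes all 3% is balanced.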
If flows are positive, N4 need take no action---this simply means its upstream neighbour is paying more penalties than it needs to, and the source is going slower than it needs to. But, to protect itself against persistently negative flows, N4 will need to install a dropper at its egress. Appendix E gives a suggested algorithm for this dropper. There is no intention that the dropper algorithm needs to be standardised; it is merely provided to show that an efficient, robust algorithm is possible. But whatever algorithm is used must meet the criteria below:

o  It SHOULD introduce minimal false positives for honest flows;

o  It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o  It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o  It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate expected losses.

Note that the dropper operates on flows but we would like it not to require per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a dropper is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to the sender of setting FNE (positive `worth').
Indeed, with the FNE codepoint, the rate at which a sender can generate new flows can be limited (Appendix G). In this respect, the FNE codepoint works like Handley's state set-up bit [Steps_DoS].

Appendix E also gives an example dropper implementation that aggregates flow state. Dropper algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a dropper SHOULD only allow flows into the average if they start with FNE, but it SHOULD NOT include packets with the FNE codepoint set in the average. A sender sets the FNE codepoint when it does not have the benefit of feedback from the receiver. So, counting packets with FNE set would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the dropper detects a persistently negative flow, it SHOULD drop sufficient negative and neutral packets to force the flow to not be negative. Drops SHOULD be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal extra harm.

6.1.5. Policing

Access operators who wish to limit the congestion that a sender is able to cause can deploy policers at the very first ingress to the internetwork. Re-ECN has been designed to avoid the need for bottleneck policing so that we can avoid a future where a single rate adaptation policy is embedded throughout the network. Instead, re-ECN allows the particular rate adaptation policy to be solely agreed bilaterally between the sender and its ingress access provider (Section 5.5.2 discusses possible ways to signal between them), which allows congestion control to be policed, but maintains its evolvability, requiring only a single, local box to be updated.
Appendix G gives examples of per-user policing algorithms. But there is no implication that these algorithms are to be standardised, or that they are ideal. The ingress rate policer is the part of the re-ECN incentive framework that is intended to be the most flexible. Once endpoint protocol handlers for re-ECN and egress droppers are in place, operators can choose exactly which congestion response they want to police, and whether they want to do it per user, per flow or not at all.

The re-ECN protocol allows these ingress policers to easily perform bulk per-user policing (Appendix G.1). This is likely to provide sufficient incentive to the user to correctly respond to congestion without needing the policing function to be overly complex. If an access operator chose, it could use per-flow policing according to the widely adopted TCP rate adaptation (Appendix G.2) or other alternatives; however, this would introduce extra complexity to the system.

If a per-flow rate policer is used, it should use path (not downstream) congestion as the relevant metric, which is represented by the fraction of octets in packets with positive (Re-Echo and FNE) and canceled (CE(0)) markings. Of course, re-ECN provides all the information a policer needs directly in the packets being policed. So, even policing TCP's AIMD algorithm is relatively straightforward (Appendix G.2).

Note that we have included canceled packets in the measure of path congestion. Canceled packets arise when the sender re-echoes earlier congestion, but then this Re-Echo packet just happens to be congestion marked itself. One would not normally expect many canceled packets at the first ingress because one would not normally expect much congestion marking to have been necessary that soon in the path.
However, a home network or campus network may well sit between the sending endpoint and the ingress policer, so some congestion may occur upstream of the policer. And if congestion does occur upstream, some canceled packets should be visible, and should be taken into account in the measure of path congestion.

But a much more important reason for including canceled packets in the measure of path congestion at an ingress policer is that a sender might otherwise subvert the protocol by sending canceled packets instead of neutral (RECT) packets. Like neutral, canceled packets are worth zero, so the sender knows they won't be counted against any quota it might have been allowed. But unlike neutral packets, canceled packets are immune to congestion marking, because they have already been congestion marked. So, it is both correct and useful that canceled packets should be included in a policer's measure of path congestion, as this removes the incentive the sender would otherwise have to mark more packets as canceled than it should.

An ingress policer should also ensure that flows are not already negative when they enter the access network. As with canceled packets, the presence of negative packets will typically be unusual. Therefore it will be easy to detect negative flows at the ingress by just detecting negative packets then monitoring the flow they belong to.

Of course, even if the sender does operate its own network, it may arrange not to congestion mark traffic. Whether the sender does this or not is of no concern to anyone else except the sender. Such a sender will not be policed against its own network's contribution to congestion, but the only resulting problem would be overload in the sender's own network.
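The per-flow policer's measure of path congestion described above can be sketched as follows. This is an illustrative Python fragment only, not the policer of Appendix G.2. The function names and the representation of packets as (marking, octets) pairs are assumptions made for the example.

```python
# Markings that count towards path congestion at a per-flow ingress policer:
# positive (Re-Echo and FNE) plus canceled (CE(0)).
PATH_CONGESTION_MARKINGS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion(packets):
    """Fraction of octets carried in positive or canceled packets.

    `packets` is an iterable of (marking, octets) pairs; this
    representation is hypothetical.
    """
    total = 0
    marked = 0
    for marking, octets in packets:
        total += octets
        if marking in PATH_CONGESTION_MARKINGS:
            marked += octets
    return marked / total if total else 0.0


def has_negative_packets(packets):
    """True if any packet is already negative, i.e. CE(-1), at the ingress."""
    return any(marking == "CE(-1)" for marking, _ in packets)
```

Because CE(0) is in the counted set, a sender that swaps neutral packets for canceled ones raises its own measured path congestion, which is the deterrent effect the text describes. The second function only flags a flow for further monitoring; it does not decide that the flow is persistently negative.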
Finally, we must not forget that an easy way to circumvent re-ECN's defences is for the source to turn off re-ECN support, by setting the Not-RECT codepoint, implying legacy traffic. Therefore an ingress policer should put a general rate-limit on Not-RECT traffic, which SHOULD be lax during early, patchy deployment, but will have to become stricter as deployment widens. Similarly, flows starting without an FNE packet can be confined by a strict rate-limit used for the remainder of flows that haven't proved they are well-behaved by starting correctly (therefore they need not consume any flow state---they are just confined to the `misbehaving' bin if they carry an unrecognised flow ID).

6.1.6. Inter-domain Policing

One of the main design goals of re-ECN is for border security mechanisms to be as simple as possible, otherwise they will become the pinch-points that limit scalability of the whole internetwork. We want to avoid per-flow processing at borders and to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding. Such passive, off-line mechanisms are essential for future high-speed all-optical border interconnection where packets cannot be buffered while they are checked for policy compliance.

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing.

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: CE(-1).
Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network. Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix H.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 6.1.4 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion.
But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o  A host can send traffic with no positive markings towards its intended destination, aiming to transmit as much traffic as any dropper will allow [Bauer06]. It may add forward error correction (FEC) to repair as much drop as it experiences.

o  A host can send dummy traffic into the network with no positive markings and with no intention of communicating with anyone, but merely to cause higher levels of congestion for others who do want to communicate (DoS). So, to ride over the extra congestion, everyone else has to spend more of whatever rights to cause congestion they have been allowed.

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 5.3 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour.
   It could even initialise the TTL so that it expires shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix H.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows.
Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders---they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical---best endeavours will be sufficient.

For instance, let us consider the case where a source sends traffic with no positive markings at all, hoping to at least get as much traffic delivered as network-based droppers will allow. The flow is likely to go at least slightly negative in the first network on the path (N1 if we use the example network layout in Figure 9). If all networks use the algorithm in Appendix H.2 to inflate penalties at their border with an upstream network, they will remove the effect of negative flows. So, for instance, N2 will not be paying a penalty to N1 for this flow. Further, because the flow contributes no positive markings at all, a dropper at the egress will completely remove it.

The remaining problem is that every network is carrying a flow that is causing congestion to others but not being held to account for the congestion it is causing.
Whenever the fail-safe border algorithm (Section 6.1.7) or the border algorithm to compensate for negative flows (Appendix H.2) detects a negative flow, it can instantiate a focused dropper for that flow locally. It may be some time before the flow is detected, but the more strongly negative the flow is, the more quickly it will be detected by the fail-safe algorithm. But, in the meantime, it will not be distorting border incentives. Until it is detected, if it contributes to drop anywhere, its packets will tend to be dropped before others if routers use the preferential drop rules in Section 5.3, which discriminate against non-positive packets. All networks below the point where a flow goes negative (N1, N2 and N4 in this case) have an incentive to remove this flow, but the router where it first goes negative (in N1) can of course remove the problem for everyone downstream.

In the case of DDoS attacks, Section 6.2.1 describes how re-ECN mitigates their force.

6.1.7. Inter-domain Fail-safes

The mechanisms described so far create incentives for rational network operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored.
If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows:  A small sample of congestion marked (negative) packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the balance of positive minus negative markings is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear more quickly in the sample by selecting randomly solely from positive (or negative) packets.

6.1.8.  Simulations

Simulations of policer and dropper performance done for the multi-bit version of re-feedback have been included in section 5 "Dropper Performance" of [Re-fb]. Simulations of policer and dropper for the re-ECN version described in this document are work in progress.

6.2.  Other Applications

6.2.1.  DDoS Mitigation

A flooding attack is inherently about congestion of a resource. Because re-ECN ensures the sources causing network congestion experience the cost of their own actions, it acts as a first line of defence against DDoS. As load focuses on a victim, upstream queues grow, requiring honest sources to pre-load packets with a higher fraction of positive packets. Once downstream routers are so congested that they are dropping traffic, they will be CE marking 100% of the traffic they do forward. Honest sources will therefore be sending 100% Re-Echo (and therefore being severely rate-limited at the ingress).

Senders under malicious control can either do the same as honest sources, and be rate-limited at ingress, or they can understate congestion by sending more neutral RECT packets than they should. If sources understate congestion (i.e. do not re-echo sufficient positive packets) and the preferential drop ranking is implemented on routers (Section 5.3), these routers will preserve positive traffic until last. So, the neutral traffic from malicious sources will all be automatically dropped first. Either way, the malicious sources cannot send more than honest sources.

Further, hosts under malicious control will tend to be re-used for many different attacks. They will therefore build up a long term history of causing congestion. So, as long as the population of potentially compromisable hosts around the Internet is limited, the per-user policing algorithms in Appendix G.1 will gradually throttle down zombies and other launchpads for attacks. Widespread deployment of re-ECN could therefore considerably dampen the force of DDoS. Certainly, zombie armies could hold their fire for long enough to be able to build up enough credit in the per-user policers to launch an attack. But they would then still be limited to no more throughput than other, honest users.

Inter-domain traffic policing (see Section 6.1.6) ensures that any network that harbours compromised `zombie' hosts will have to bear the cost of the congestion caused by traffic from zombies in downstream networks. Such networks will be incentivised to deploy per-user policers that rate-limit hosts that are unresponsive to congestion so they can only send very slowly into congested paths. As well as protecting other networks, the extremely poor performance at any sign of congestion will incentivise the zombie's owner to clean it up.
However, the host should behave normally when using uncongested paths.

Uniquely, re-ECN handles DDoS traffic without relying on the validity of identifiers in packets. Certainly the egress dropper relies on uniqueness of flow identifiers, but not their validity. So if a source spoofs another address, re-ECN works just as well, as long as the attacker cannot imitate all the flow identifiers of another active flow passing through the same dropper (see Section 6.3). Similarly, the ingress policer relies on uniqueness of flow IDs, not their validity, because a new flow will only be allowed any rate at all if it starts with FNE, and the more FNE packets there are starting new flows, the more they will be limited. Essentially a re-ECN policer limits the bulk of all congestion entering the network through a physical interface; limiting the congestion caused by each flow is merely an optional extra.

6.2.2.  End-to-end QoS

{ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline of the text that will be added here).}

6.2.3.  Traffic Engineering

{ToDo: }

6.2.4.  Inter-Provider Service Monitoring

{ToDo: }

6.3.  Limitations

The known limitations of the re-ECN approach are:

o  We still cannot defend against the attack described in Section 10 where a malicious source sends negative traffic through the same egress dropper as another flow and imitates its flow identifiers, allowing a malicious source to cause an innocent flow to experience heavy drop.

o  Re-feedback for TTL (re-TTL) would also be desirable at the same time as re-ECN. Unfortunately this requires a further standards action for the mechanisms briefly described in Appendix F.

o  Traffic must be ECN-capable for re-ECN to be effective.
The only defence against malicious users who turn off ECN capability is that networks are expected to rate limit Not-ECT traffic and to apply higher drop preference to it during congestion. Although these are blunt instruments, they at least represent a feasible scenario for the future Internet where Not-ECT traffic co-exists with re-ECN traffic, but as a severely hobbled under-class. We recommend (Section 7.1) that while accommodating a smooth initial transition to re-ECN, policing policies should gradually be tightened to rate limit Not-ECT traffic more strictly in the longer term.

o  When checking whether a flow is balancing positive markings with congestion marking, re-ECN can only account for congestion marking, not drops. So, whenever a sender experiences drop, it does not have to re-echo the congestion event. Nonetheless, it is hardly any advantage to be able to send faster than other flows only if your traffic is dropped and the other traffic isn't.

o  We are considering the issue of whether it would be useful to truncate rather than drop packets that appear to be malicious, so that the feedback loop is not broken but useful data can be removed.

7.  Incremental Deployment

7.1.  Incremental Deployment Features

The design of the re-ECN protocol started from the fact that the current ECN marking behaviour of routers was sufficient and that re-feedback could be introduced around these routers by changing the sender behaviour but not the routers. Otherwise, if we had required routers to be changed, the chance of encountering a path that had every router upgraded would be vanishingly small during early deployment, giving no incentive to start deployment. Also, as there is no new forwarding behaviour, routers and hosts do not have to signal or negotiate anything.

However, networks that choose to protect themselves using re-ECN do have to add new security functions at their trust boundaries with others. They distinguish legacy traffic by its ECN field. Traffic from Not-ECT transports is distinguishable by its Not-RECT marking. Traffic from legacy ECN transports is distinguished from re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1) for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on either 50% (the nonce) or 100% (the default) of packets, whereas re-ECN does not use ECT(0) at all. We can use this distinguishing feature of legacy ECN traffic to separate it out for different treatment at the various border security functions: egress dropping, ingress policing and border policing.

The general principle we adopt is that an egress dropper will not drop any legacy traffic, but ingress and border policers will limit the bulk rate of legacy traffic that can enter each network. Then, during early re-ECN deployment, operators can set very permissive (or non-existent) rate-limits on legacy traffic, but once re-ECN implementations are generally available, legacy traffic can be rate-limited increasingly harshly. Ultimately, an operator might choose to block all legacy traffic entering its network, or at least only allow through a trickle.

Then, as the limits are set more strictly, the more legacy ECN sources will gain by upgrading to re-ECN. Thus, towards the end of the voluntary incremental deployment period, legacy transports can be given progressively stronger encouragement to upgrade.
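
As an illustration only, the separation of legacy traffic by its ECN field described earlier in this section could be sketched as follows. The ECN field values are those of RFC3168. The function name, the class labels and the use of a boolean for the RE flag are hypothetical, and the split of Not-ECT packets into Not-RECT and FNE by the RE flag is assumed from the extended codepoint definitions elsewhere in this document. Packets already marked CE are reported separately, because the text above does not say how to attribute them.

```python
# ECN field codepoints from RFC3168 (two low-order bits of the
# IPv4 TOS octet or IPv6 Traffic Class octet).
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def classify_ecn(tos_octet: int, re_flag: bool) -> str:
    """Return a hypothetical traffic-class label for a border function.

    Assumption of this sketch: a Not-ECT packet with the RE flag set is
    FNE (re-ECN), and with the RE flag clear is Not-RECT (legacy).
    """
    ecn = tos_octet & 0b11
    if ecn == NOT_ECT:
        return "fne" if re_flag else "not-rect"
    if ecn == ECT_0:
        return "legacy-ecn"   # re-ECN does not use ECT(0) at all
    if ecn == ECT_1:
        return "re-ecn"       # re-ECN senders set ECT(1) by default
    return "ce-marked"        # attribution not decided by this sketch
```

A policer could then apply a bulk rate limit to the "not-rect" and "legacy-ecn" classes and tighten that limit over time, as the general principle above suggests.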

The following list of minor changes brings together all the points where Re-ECN semantics for use of the two-bit ECN field are different compared to RFC3168:

o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender sets ECT(0) by default (Section 3.3);

o  No provision is necessary for a re-ECN capable source transport to use the ECN nonce (Section 4.1.2.1);

o  Routers MAY preferentially drop different extended ECN codepoints (Section 5.3);

o  Packets carrying the feedback not established (FNE) codepoint MAY be marked rather than dropped by routers, even though their ECN field is Not-ECT (with the important caveat in Section 5.3);

o  Packets may be dropped by policing nodes because of apparent misbehaviour, not just because of congestion (Section 6);

o  Tunnel entry behaviour is still to be defined, but may have to be different from RFC3168 (Section 5.6).

None of these changes requires any modification to routers. Also none of these changes affect anything about end to end congestion control; they are all to do with allowing networks to police that end to end congestion control is well-behaved.

7.2.  Incremental Deployment Incentives

It would only be worth standardising the re-ECN protocol if there existed a coherent story for how it might be incrementally deployed. In order for it to have a chance of deployment, everyone who needs to act must have a strong incentive to act, and the incentives must arise in the order that deployment would have to happen. Re-ECN works around unmodified ECN routers, but we can't just discuss why and how re-ECN deployment might build on ECN deployment, because there is precious little to build on in the first place. Instead, we aim to show that re-ECN deployment could carry ECN with it.
We focus on commercial deployment incentives, although some of the arguments apply equally to academic or government sectors.

ECN deployment:

ECN is largely implemented in commercial routers, but generally not as a supported feature, and it has largely not been deployed by commercial network operators. It has been released in many Unix-based operating systems, but not in proprietary OSs like Windows or those in many mobile devices. For detailed deployment status, see [ECN-Deploy]. We believe the reason ECN deployment has not happened is twofold:

*  ECN requires changes to both routers and hosts. If someone wanted to sell the improvement that ECN offers, they would have to co-ordinate deployment of their product with others. An ECN server only gives any improvement on an ECN network. An ECN network only gives any improvement if used by ECN devices. Deployment that requires co-ordination adds cost and delay and tends to dilute any competitive advantage that might be gained.

*  ECN `only' gives a performance improvement. Making a product a bit faster (whether the product is a device or a network) isn't usually a sufficient selling point to be worth the cost of co-ordinating across the industry to deploy it. Network operators tend to avoid re-configuring a working network unless launching a new product.

ECN and re-ECN for Edge-to-edge Assured QoS:

We believe the proposal to provide assured QoS sessions using a form of ECN called pre-congestion notification (PCN) [PCN-arch] is most likely to break the deadlock in ECN deployment first. It only requires edge-to-edge deployment so it does not require endpoint support. It can be deployed in a single network, then grow incrementally to interconnected networks.
And it provides a different `product' (internetworked assured QoS), rather than merely making an existing product a bit faster.

Not only could this assured QoS application kick-start ECN deployment, it could also carry re-ECN deployment with it, because re-ECN can enable the assured QoS region to expand to a large internetwork where neighbouring networks do not trust each other. [Re-PCN] argues that re-ECN security should be built in to the QoS system from the start, explaining why and how.

If ECN and re-ECN were deployed edge-to-edge for assured QoS, operators would gain valuable experience. They would also clear away many technical obstacles such as firewall configurations that block all but the legacy settings of the ECN field and the RE flag.

ECN in Access Networks:

The next obstacle to ECN deployment would be extension to access and backhaul networks, where considerable link layer differences make implementation non-trivial, particularly on congested wireless links. ECN and re-ECN work fine during partial deployment, but they will not be very useful if the most congested elements in networks are the last to support them. Access network support is one of the weakest parts of this deployment story. All we can hope is that, once the benefits of ECN are better understood by operators, they will push for the necessary link layer implementations as deployment proceeds.

Policing Unresponsive Flows:

Re-ECN allows a network to offer differentiated quality of service as explained in Section 6.2.2. But we do not believe this will motivate initial deployment of re-ECN, because the industry is already set on alternative ways of doing QoS. Despite being much more complicated and expensive, the alternative approaches are here and now.

But re-ECN is critical to QoS deployment in another respect.
It can be used to prevent applications from taking whatever bandwidth they choose without asking.

Currently, applications that remain resolute in their lack of response to congestion are rewarded by other TCP applications. In other words, TCP is naively friendly, in that it reduces its rate in response to congestion whether it is competing with friends (other TCPs) or with enemies (unresponsive applications).

Therefore, those network owners that want to sell QoS will be keen to ensure that their users can't help themselves to QoS for free. Given the very large revenues at stake, we believe effective policing of congestion response will become highly sought after by network owners.

But this does not necessarily argue for re-ECN deployment. Network owners might choose to deploy bottleneck policers rather than re-ECN-based policing. However, under Related Work (Section 9) we argue that bottleneck policers are inherently vulnerable to circumvention.

Therefore we believe there will be a strong demand from network owners for re-ECN deployment so they can police flows that do not ask to be unresponsive to congestion, in order to protect their revenues from flows that do ask (QoS). In particular, we suspect that the operators of cellular networks will want to prevent VoIP and video applications being used freely on their networks as a more open market develops in GPRS and 3G devices.

Initial deployments are likely to be isolated to single cellular networks. Cellular operators would first place requirements on device manufacturers to include re-ECN in the standards for mobile devices. In parallel, they would put out tenders for ingress and egress policers.
Then, after a while they would start to tighten rate limits on Not-ECT traffic from non-standard devices and they would start policing whatever non-accredited applications people might install on mobile devices with re-ECN support in the operating system. This would force even independent mobile device manufacturers to provide re-ECN support. Early standardisation across the cellular operators is likely, including interconnection agreements with penalties for excess downstream congestion.

We suspect some fixed broadband networks (whether cable or DSL) would follow a similar path. However, we also believe that larger parts of the fixed Internet would not choose to police on a per-flow basis. Some might choose to police congestion on a per-user basis in order to manage heavy peer-to-peer file-sharing, but it seems likely that a sizeable majority would not deploy any form of policing.

This hybrid situation raises the question, "How does re-ECN work for networks that choose to use policing if they connect with others that don't?" Traffic from non-ECN capable sources will arrive from other networks and cause congestion within the policed, ECN-capable networks. So networks that chose to police congestion would rate-limit Not-ECT traffic throughout their network, particularly at their borders. They would probably also set higher usage prices in their interconnection contracts for incoming Not-ECT and Not-RECT traffic. We assume that interconnection contracts between networks in the same tier will include congestion penalties before contracts with provider backbones do.

A hybrid situation could remain for all time. As was explained in the introduction, we believe in healthy competition between policing and not policing, with no imperative to convert the whole world to the religion of policing.
Networks that chose not to deploy egress droppers would leave themselves open to being congested by senders in other networks. But that would be their choice.

The important aspect of the egress dropper though is that it most protects the network that deploys it. If a network does not deploy an egress dropper, sources sending into it from other networks will be able to understate the congestion they are causing. Whereas, if a network deploys an egress dropper, it can know how much congestion other networks are dumping into it, and apply penalties or charges accordingly. So, whether or not a network polices its own sources at ingress, it is in its interests to deploy an egress dropper.

Host support:

In the above deployment scenario, host operating system support for re-ECN came about through the cellular operators demanding it in device standards (i.e. 3GPP). Of course, increasingly, mobile devices are being built to support multiple wireless technologies. So, if re-ECN were stipulated for cellular devices, it would automatically appear in those devices connected to the wireless fringes of fixed networks if they coupled cellular with WiFi or Bluetooth technology, for instance. Also, once implemented in the operating system of one mobile device, it would tend to be found in other devices using the same family of operating system.

Therefore, whether or not a fixed network deployed ECN, or deployed re-ECN policers and droppers, many of its hosts might well be using re-ECN over it. Indeed, they would be at an advantage when communicating with hosts across Re-ECN policed networks that rate limited Not-RECT traffic.

Other possible scenarios:

The above is thankfully not the only plausible scenario we can think of.
One of the many clubs of operators that meet regularly around the world might decide to act together to persuade a major operating system manufacturer to implement re-ECN. And they may agree between them on an interconnection model that includes congestion penalties.

Re-ECN provides an interesting opportunity for device manufacturers as well as network operators. Policers can be configured loosely when first deployed. Then as re-ECN take-up increases, they can be tightened up, so that a network with re-ECN deployed can gradually squeeze down the service provided to legacy devices that have not upgraded to re-ECN. Many device vendors rely on replacement sales. And operating system companies rely heavily on new release sales. Also support services would like to be able to force stragglers to upgrade. So, the ability to throttle service to legacy operating systems is quite valuable.

Also, policing unresponsive sources may not be the only or even the first application that drives deployment. It may be policing causes of heavy congestion (e.g. peer-to-peer file-sharing). Or it may be mitigation of denial of service. Or we may be wrong in thinking simpler QoS will not be the initial motivation for re-ECN deployment. Indeed, the combined pressure for all these may be the motivator, but it seems optimistic to expect such a level of joined-up thinking from today's communications industry. We believe a single application alone must be a sufficient motivator.

In short, everyone gains from adding accountability to TCP/IP, except the selfish or malicious. So, deployment incentives tend to be strong.

8.  Architectural Rationale

In the Internet's technical community, the danger of not responding to congestion is well-understood, as well as its attendant risk of congestion collapse [RFC3714].
However, one side of the Internet's commercial community considers that the very essence of IP is to provide open access to the internetwork for all applications. They see congestion as a symptom of over-conservative investment, and rely on revising application designs to find novel ways to keep applications working despite congestion. They argue that the Internet was never intended to be solely for TCP-friendly applications. Meanwhile, another side of the Internet's commercial community believes that it is worthwhile providing a network for novel applications only if it has sufficient capacity, which can happen only if a greater share of application revenues can be /assured/ for the infrastructure provider. Otherwise the major investments required would carry too much risk and wouldn't happen.

The lesson articulated in [Tussle] is that we shouldn't embed our view on these arguments into the Internet at design time. Instead we should design the Internet so that the outcome of these arguments can get decided at run-time. Re-ECN is designed in that spirit. Once the protocol is available, different network operators can choose how liberal they want to be in holding people accountable for the congestion they cause. Some might boldly invest in capacity and not police its use at all, hoping that novel applications will result. Others might use re-ECN for fine-grained flow policing, expecting to make money selling vertically integrated services. Yet others might sit somewhere half-way, perhaps doing coarse, per-user policing. All might change their minds later. But re-ECN always allows them to interconnect so that the careful ones can protect themselves from the liberal ones.

The incentive-based approach used for re-ECN is based on Gibbens and Kelly's arguments [Evol_cc] on allowing endpoints the freedom to evolve new congestion control algorithms for new applications. They ensured responsible behaviour despite everyone's self-interest by applying pricing to ECN marking, and Kelly had proved stability and optimality in an earlier paper.

Re-ECN keeps all the underlying economic incentives, but rearranges the feedback. The idea is to allow a network operator (if it chooses) to deploy engineering mechanisms like policers at the front of the network which can be designed to behave /as if/ they are responding to congestion prices. Rather than having to subject users to congestion pricing, networks can then use more traditional charging regimes (or novel ones). But the engineering can constrain the overall amount of congestion a user can cause. This provides a buffer against completely outrageous congestion control, but still makes it easy for novel applications to evolve if they need different congestion control to the norms. It also allows novel charging regimes to evolve.

Despite being achieved with a relatively minor protocol change, re-ECN is an architectural change. Previously, Internet congestion could only be controlled by the data sender, because it was the only one both in a position to control the load and in a position to see information on congestion. Re-ECN levels the playing field. It recognises that the network also has a role to play in moderating (policing) congestion control. But policing is only truly effective at the first ingress into an internetwork, whereas path congestion was previously only visible at the last egress. So, re-ECN democratises congestion information.
Then the choice over who actually controls congestion can be made at run-time, not design time---a bit like an aircraft with dual controls. And different operators can make different choices. We believe non-architectural approaches to this problem are unlikely to offer more than partial solutions (see Section 9).

Importantly, re-ECN does not require assumptions about specific congestion responses to be embedded in any network elements, except at the first ingress to the internetwork if that level of control is desired by the ingress operator. But such tight policing will be a matter of agreement between the source and its access network operator. The ingress operator need not police congestion response at flow granularity; it can simply hold a source responsible for the aggregate congestion it causes, perhaps keeping it within a monthly congestion quota. Or if the ingress network trusts the source, it can do nothing.

Therefore, the aim of the re-ECN protocol is NOT solely to police TCP-friendliness. Re-ECN preserves IP as a generic network layer for all sorts of responses to congestion, for all sorts of transports. Re-ECN merely ensures truthful downstream congestion information is available in the network layer for all sorts of accountability applications.

The end to end design principle does not say that all functions should be moved out of the lower layers---only those functions that are not generic to all higher layers. Re-ECN adds a function to the network layer that is generic, but was omitted: accountability for causing congestion. Accountability is not something that an end-user can provide to themselves. We believe re-ECN adds no more than is sufficient to hold each flow accountable, even if it consists of a single datagram.
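
The aggregate option mentioned earlier in this section, where an ingress operator holds a source to a monthly congestion quota instead of policing each flow, might look roughly like the sketch below. It is an illustration under stated assumptions, not a mechanism defined by this document: the class and method names are hypothetical, and congestion is simply taken to be the bytes carried in positively marked packets. The per-user policing algorithms this document actually proposes are in Appendix G.1.

```python
from dataclasses import dataclass


@dataclass
class CongestionQuota:
    """Hypothetical per-source aggregate congestion allowance."""

    monthly_quota_bytes: int
    used_bytes: int = 0

    def record(self, packet_bytes: int, positive: bool) -> None:
        # Assumption: only positively marked packets count as
        # declared congestion; other packets are free.
        if positive:
            self.used_bytes += packet_bytes

    def within_quota(self) -> bool:
        # The operator decides what happens once this turns False
        # (for example rate-limiting the source).
        return self.used_bytes <= self.monthly_quota_bytes

    def reset(self) -> None:
        # Called at the start of each accounting month.
        self.used_bytes = 0
```

Note that no per-flow state is kept: all flows from the source draw on the same allowance, which is what distinguishes this from flow-granularity policing.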

"Accountability" implies being able to identify who is responsible for causing congestion. However, at the network layer it would NOT be useful to identify the cause of congestion by adding individual or organisational identity information, NOR by using source IP addresses. Rather than bringing identity information to the point of congestion, we bring downstream congestion information to the point where the cause can be most easily identified and dealt with. That is, at any trust boundary congestion can be associated with the physically connected upstream neighbour that is directly responsible for causing it (whether intentionally or not). A trust boundary interface is exactly the place to police or throttle in order to directly mitigate congestion, rather than having to trace the (ir)responsible party in order to shut them down.

Some considered that ECN itself was a layering violation. The reasoning went that the interface to a layer should provide a service to the higher layer and hide how the lower layer does it. However, ECN reveals the state of the network layer and below to the transport layer. A more positive way to describe ECN is that it is like the return value of a function call to the network layer. It explicitly returns the status of the request to deliver a packet, by returning a value representing the current risk that a packet will not be served. Re-ECN has similar semantics, except the transport layer must try to guess the return value, then it can use the actual return value from the network layer to modify the next guess.

The guiding principle behind all the discussion in Section 6.1.6 on Policing is that any gain from subverting the protocol should be precisely neutralised, rather than punished.
If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive. Such over-zealous push-back is unnecessary and potentially dangerous. These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far. If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion. Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine, but not a message that claims that a positive flow will go negative later, so it should be dropped.

9.  Related Work

{Due to lack of time, this section is incomplete. The reader is referred to the Related Work section of [Re-fb] for a brief selection of related ideas.}

9.1.  Policing Rate Response to Congestion

ATM network elements send congestion back-pressure messages [ITU-T.I.371] along each connection, duplicating any end to end feedback because they don't trust it. On the other hand, re-ECN ensures information in forwarded packets can be used for congestion management without requiring a connection-oriented architecture and re-using the overhead of fields that are already set aside for end to end congestion control (and routing loop detection in the case of re-TTL in Appendix F).

We borrowed ideas from policers in the literature [pBox], [XCHOKe], AFD etc. for our rate equation policer. However, without the benefit of re-ECN they don't police the correct rate for the condition of their path. They detect unusually high /absolute/ rates, but only while the policer itself is congested, because they work by detecting prevalent flows in the discards from the local RED queue. These policers must sit at every potential bottleneck, whereas our policer need only be located at each ingress to the internetwork. As Floyd & Fall explain [pBox], the limitation of their approach is that a high sending rate might be perfectly legitimate, if the rest of the path is uncongested or the round trip time is short. Commercially available rate policers cap the rate of any one flow. Or they enforce monthly volume caps in an attempt to control high volume file-sharing. They limit the value a customer derives. They might also limit the congestion customers can cause, but only as an accidental side-effect. They actually punish traffic that fills troughs as much as traffic that causes peaks in utilisation. In practice network operators need to be able to allocate service by cost during congestion, and by value at other times.

9.2. Congestion Notification Integrity

The choice of two ECT code-points in the ECN field [RFC3168] permitted future flexibility, optionally allowing the sender to encode the experimental ECN nonce [RFC3540] in the packet stream. This mechanism has since been included in the specifications of DCCP [RFC4340].

The ECN nonce is an elegant scheme that allows the sender to detect if someone in the feedback loop - the receiver especially - tries to claim no congestion was experienced when in fact congestion led to packet drops or ECN marks.
For each packet it sends, the sender chooses between the two ECT codepoints in a pseudo-random sequence. Then, whenever the network marks a packet with CE, if the receiver wants to deny congestion happened, she has to guess which ECT codepoint was overwritten. She has only a 50:50 chance of being correct each time she denies a congestion mark or a drop, which ultimately will give her away.

The purpose of a network-layer nonce should primarily be protection of the network, while a transport-layer nonce would be better used to protect the sender from cheating receivers. Now, the assumption behind the ECN nonce is that a sender will want to detect whether a receiver is suppressing congestion feedback. This is only true if the sender's interests are aligned with the network's, or with the community of users as a whole. This may be true for certain large senders, who are under close scrutiny and have a reputation to maintain. But we have to deal with a more hostile world, where traffic may be dominated by peer-to-peer transfers, rather than downloads from a few popular sites. Often the `natural' self-interest of a sender is not aligned with the interests of other users. The sender often wants to transfer data quickly just as much as the receiver wants to receive it quickly.

In contrast, the re-ECN protocol enables policing of an agreed rate-response to congestion (e.g. TCP-friendliness) at the sender's interface with the internetwork. It also ensures downstream networks can police their upstream neighbours, to encourage them to police their users in turn. But most importantly, it requires the sender to declare path congestion to the network and it can remove traffic at the egress if this declaration is dishonest.
So it can police correctly, irrespective of whether the receiver tries to suppress congestion feedback or whether the sender ignores genuine congestion feedback. Therefore the re-ECN protocol addresses a much wider range of cheating problems, which includes the one addressed by the ECN nonce.

9.3. Identifying Upstream and Downstream Congestion

Purple [Purple] proposes that routers should use the CWR flag in the TCP header of ECN-capable flows to work out path congestion and therefore downstream congestion in a similar way to re-ECN. However, because CWR is in the transport layer, it is not always visible to network layer routers and policers. Purple's motivation was to improve AQM, not policing. But, of course, nodes trying to avoid a policer would not be expected to allow CWR to be visible.

10. Security Considerations

This whole memo concerns the deployment of a secure congestion control framework. However, below we list some specific security issues that we are still working on:

o  Malicious users have the ability to launch dynamically changing attacks, exploiting the time it takes to detect an attack, given ECN marking is binary. We are concentrating on subtle interactions between the ingress policer and the egress dropper in an effort to make it impossible to game the system.

o  There is an inherent need for at least some flow state at the egress dropper given the binary marking environment, which leads to an apparent vulnerability to state exhaustion attacks. An egress dropper design with bounded flow state is in write-up.

o  A malicious source can spoof another user's address and send negative traffic to the same destination in order to fool the dropper into sanctioning the other user's flow.
   To prevent or mitigate these two different kinds of DoS attack, against the dropper and against given flows, we are considering various protection mechanisms. Section 5.5.1 discusses one of these.

o  A malicious client can send requests using a spoofed source address to a server (such as a DNS server) that tends to respond with single packet responses. This server will then be tricked into having to set FNE on the first (and only) packet of all these wasted responses. Given packets marked FNE are worth +1, this will cause such servers to consume more of their allowance to cause congestion than they would wish to. In general, re-ECN is deliberately designed so that single packet flows have to bear the cost of not discovering the congestion state of their path. One of the reasons for introducing re-ECN is to encourage short flows to make use of previous path knowledge by moving the cost of this lack of knowledge to sources that create short flows. Therefore, in the long run we might expect services like DNS to aggregate single packet flows into connections where it brings benefits. However, this attack where DNS requests are made from spoofed addresses genuinely forces the server to waste its resources. The only mitigating feature is that the attacker has to set FNE on each of its requests if they are to get through an egress dropper to a DNS server. The attacker therefore has to consume as many resources as the victim, which at least implies re-ECN does not unwittingly amplify this attack.

Having highlighted outstanding security issues, we now explain the design decisions that were taken based on a security-related rationale. It may seem that the six codepoints of the eight made available by extending the ECN field with the RE flag have been used rather wastefully to encode just five states.
In effect the RE flag has been used as an orthogonal single bit, using up four codepoints to encode the three states of positive, neutral and negative worth. The mapping of the codepoints in an earlier version of this proposal used the codepoint space more efficiently, but the scheme became vulnerable to network operators bypassing congestion penalties by focusing congestion marking on positive packets. Appendix B explains why fixing that problem, while allowing for incremental deployment, would have used another codepoint anyway. So it was better to use this orthogonal encoding scheme, which greatly simplified the whole protocol and brought with it some subtle security benefits (see the last paragraph of Appendix B).

With the scheme as now proposed, once the RE flag is set or cleared by the sender or its proxy, it should not be written by the network, only read. So the endpoints can detect if any network maliciously alters the RE flag. IPsec AH integrity checking does not cover the IPv4 option flags (they were considered mutable, even the one we propose using for the RE flag that was `currently unused' when IPsec was defined). But it would be sufficient for a pair of endpoints to make random checks on whether the RE flag was the same when it reached the egress as when it left the ingress. Indeed, if IPsec AH had covered the RE flag, any network intending to alter sufficient RE flags to make a gain would have focused its alterations on packets without authenticating headers (AHs).

The security of re-ECN has been deliberately designed to not rely on cryptography.

11. IANA Considerations

This memo includes no request to IANA (yet).

If this memo were to progress to standards track, it would list:

o  The new RE flag in IPv4 (Section 5.1) and its extension with the ECN field to create a new set of extended ECN (EECN) codepoints;

o  The definition of the EECN codepoints for default Diffserv PHBs (Section 3.2);

o  The new extension header for IPv6 (Section 5.2);

o  The new combinations of flags in the TCP header for capability negotiation (Section 4.1.3);

o  The new ICMP message type (Section 5.5.1).

12. Conclusions

{ToDo:}

13. Acknowledgements

Sebastien Cazalet and Andrea Soppera contributed to the idea of re-feedback. All the following have given helpful comments: Andrea Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark Handley (who developed the attack with canceled packets), Adam Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who complemented our own dummy traffic attacks with others), Liz Maida (MIT), and comments from participants in the CRN/CFP Broadband and DoS-resistant Internet working groups.

14. Comments Solicited

Comments and questions are encouraged and very welcome. They can be addressed to the IETF Transport Area working group's mailing list, and/or to the authors.

15. References

15.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's Initial Window", RFC 3390, October 2002.

[RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

[RFC4341]  Floyd, S. and E. Kohler, "Profile for Datagram Congestion Control Protocol (DCCP) Congestion Control ID 2: TCP-like Congestion Control", RFC 4341, March 2006.

[RFC4342]  Floyd, S., Kohler, E., and J. Padhye, "Profile for Datagram Congestion Control Protocol (DCCP) Congestion Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342, March 2006.

15.2. Informative References

[ARI05]  Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the Internet to Support Real-Time Content Supply from a Large Fraction of Broadband Residential Users", BT Technology Journal (BTTJ) 23(2), April 2005.

[Bauer06]  Bauer, S., Faratin, P., and R. Beverly, "Assessing the assumptions underlying mechanism design for the Internet", Proc. Workshop on the Economics of Networked Systems (NetEcon06), June 2006.

[CLoop_pol]  Salvatori, A., "Closed Loop Traffic Policing", Politecnico Torino and Institut Eurecom Masters Thesis, September 2005.

[ECN-Deploy]  Floyd, S., "ECN (Explicit Congestion Notification) in TCP/IP; Implementation and Deployment of ECN", Web-page, May 2004.

[ECN-MPLS]  Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion Marking in MPLS", draft-ietf-tsvwg-ecn-mpls-01 (work in progress), June 2007.

[ECN-tunnel]  Briscoe, B., "Layered Encapsulation of Congestion Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in progress), June 2007.

[Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the evolution of congestion control", Automatica 35(12)1969--1985, December 1999.

[I-D.ietf-tcpm-ecnsyn]  Kuzmanovic, A., "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", draft-ietf-tcpm-ecnsyn-03 (work in progress), November 2007.

[I-D.moncaster-tcpm-rcv-cheat]  Moncaster, T., "A TCP Test to Allow Senders to Identify Receiver Non-Compliance", draft-moncaster-tcpm-rcv-cheat-02 (work in progress), November 2007.

[ITU-T.I.371]  ITU-T, "Traffic Control and Congestion Control in B-ISDN", ITU-T Rec. I.371 (03/04), March 2004.

[Jiang02]  Jiang, H. and D. Dovrolis, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", ACM SIGCOMM CCR 32(3)75-88, July 2002.

[Mathis97]  Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997.

[PCN-arch]  Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion Notification Architecture", draft-eardley-pcn-architecture-00 (work in progress), June 2007.

[Purple]  Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: Predictive Active Queue Management Utilizing Congestion Information", Proc. Local Computer Networks (LCN 2003), October 2003.

[RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, M., Romanow, A., Weinrib, A., and L. Zhang, "Resource ReSerVation Protocol (RSVP) Version 1 Applicability Statement Some Guidelines on Deployment", RFC 2208, September 1997.

[RFC2402]  Kent, S. and R. Atkinson, "IP Authentication Header", RFC 2402, November 1998.

[RFC2406]  Kent, S. and R. Atkinson, "IP Encapsulating Security Payload (ESP)", RFC 2406, November 1998.

[RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, December 1998.

[RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000.

[RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager", RFC 3124, June 2001.

[RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header", RFC 3514, April 2003.

[RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003.

[RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

[RFC4301]  Kent, S. and K. Seo, "Security Architecture for the Internet Protocol", RFC 4301, December 2005.

[Re-PCN]  Briscoe, B., "Emulating Border Flow Policing using Re-ECN on Bulk Data", draft-briscoe-re-pcn-border-cheat-00 (work in progress), July 2007.

[Re-fb]  Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., Salvatori, A., Soppera, A., and M. Koyabe, "Policing Congestion Response in an Internetwork Using Re-Feedback", ACM SIGCOMM CCR 35(4)277--288, August 2005.

[Savage99]  Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, "TCP congestion control with a misbehaving receiver", ACM SIGCOMM CCR 29(5), October 1999.

[Smart_rtg]  Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang, "Optimizing Cost and Performance for Multihoming", ACM SIGCOMM CCR 34(4)79--92, October 2004.

[Steps_DoS]  Handley, M. and A. Greenhalgh, "Steps towards a DoS-resistant Internet Architecture", Proc. ACM SIGCOMM workshop on Future directions in network architecture (FDNA'04) pp 49--56, August 2004.

[Tussle]  Clark, D., Sollins, K., Wroclawski, J., and R. Braden, "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM SIGCOMM CCR 32(4)347--356, October 2002.

[XCHOKe]  Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A., Saran, H., and R. Shorey, "XCHOKe: Malicious Source Control for Congestion Avoidance at Internet Gateways", Proceedings of IEEE International Conference on Network Protocols (ICNP-02), November 2002.

[pBox]  Floyd, S. and K. Fall, "Promoting the Use of End-to-End Congestion Control in the Internet", IEEE/ACM Transactions on Networking 7(4) 458--472, August 1999.

Appendix A. Precise Re-ECN Protocol Operation

{ToDo: fix this}

The protocol operation in the middle described in Section 3.3 was an approximation. In fact, standard ECN router marking combines 1% and 2% marking into slightly less than 3% whole-path marking, because routers deliberately mark CE whether or not it has already been marked by another router upstream. So the combined marking fraction would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

To generalise this we will need some notation.

o  j represents the index of each resource (typically queues) along a path, ranging from 0 at the first router to n-1 at the last.

o  m_j represents the fraction of octets *m*arked CE by a particular router (whether or not they are already marked) because of congestion of resource j.

o  u_j represents congestion *u*pstream of resource j, being the fraction of CE marking in arriving packet headers (before marking).

o  p_j represents *p*ath congestion, being the fraction of packets arriving at resource j with the RE flag blanked (excluding Not-RECT packets).

o  v_j denotes expected congestion downstream of resource j, which can be thought of as a *v*irtual marking fraction, being derived from two other marking fractions.

Observed fractions of each particular codepoint (u, p and v) and router marking rate m are dimensionless fractions, being the ratio of two data volumes (marked and total) over a monitoring period. All measurements are in terms of octets, not packets, assuming that line resources are more congestible than packet processing.

The path congestion (RE blanking fraction) set by the sender should reflect the upstream congestion (CE marking fraction) fed back from the destination. Therefore in the steady state

      p_0 = u_n
          = 1 - (1 - m_0)(1 - m_1)...

Similarly, at some point j in the middle of the network, if p = 1 - (1 - u_j)(1 - v_j), then

      v_j = 1 - (1 - p)/(1 - u_j)
         ~= p - u_j;          if u_j << 100%

So, between the two routers in the example in Section 3.3, congestion downstream is

      v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
          = 2.00%,

or a useful approximation of downstream congestion is

      v_1 ~= 2.98% - 1.00%
          ~= 1.98%.

Appendix B. Justification for Two Codepoints Signifying Zero Worth Packets

It may seem a waste of a codepoint to set aside two codepoints of the Extended ECN field to signify zero worth (RECT and CE(0) are both worth zero). The justification is subtle, but worth recording.
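
As an aside on the arithmetic of Appendix A, the worked figures there can be checked with a few lines of Python. This is only a sketch: the function names are ours, not part of the protocol, and it assumes each resource marks independently of whether a packet is already marked, as Appendix A states.

```python
from math import prod


def combined_marking(fractions):
    """Whole-path CE fraction when every resource marks independently.

    `fractions` holds the per-resource marking fractions m_j (0..1).
    """
    return 1.0 - prod(1.0 - m for m in fractions)


def downstream_congestion(p, u):
    """Exact v_j, given path congestion p and upstream congestion u_j."""
    return 1.0 - (1.0 - p) / (1.0 - u)


# Example of Section 3.3: two routers marking 1% and 2%.
p = combined_marking([0.01, 0.02])         # whole-path congestion
v_exact = downstream_congestion(p, 0.01)   # seen between the two routers
v_approx = p - 0.01                        # approximation for small u_j

print(f"p={p:.2%} v_exact={v_exact:.2%} v_approx={v_approx:.2%}")
```

The gap between the exact and approximate values of v_1 is the product term m_0 * m_1, which is why the approximation is only useful while upstream congestion is small.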

The original version of re-ECN ([Re-fb] and draft-00 of this memo) used three codepoints for neutral (ECT(1)), positive (ECT(0)) and negative (CE) packets. The sender set packets to neutral unless re-echoing congestion, when it set them positive, in much the same way that it blanks the RE flag in the current protocol. However, routers were meant to mark congestion by setting packets negative (CE) irrespective of whether they had previously been neutral or positive.

However, we did not arrange for senders to remember which packet had been sent with which codepoint, or for feedback to say exactly which packets arrived with which codepoints. The transport was meant to inflate the number of positive packets it sent to allow for a few being wiped out by congestion marking. We (wrongly) assumed that routers would congestion mark packets indiscriminately, so the transport could infer how many positive packets had been marked and compensate accordingly by re-echoing. But this created a perverse incentive for routers to preferentially congestion mark positive packets rather than neutral ones.

We could have removed this perverse incentive by requiring re-ECN senders to remember which packets they had sent with which codepoint, and by requiring feedback from the receiver to identify which packets arrived as which. Then, if a positive packet was congestion marked to negative, the sender could have re-echoed twice to maintain the balance between positive and negative at the receiver.

Instead, we chose to make re-echoing congestion (blanking RE) orthogonal to congestion notification (marking CE), which required a second neutral codepoint (the orthogonal scheme forms the main square of four codepoints in Figure 2).
Then the receiver would be able to detect and echo a congestion event even if it arrived on a packet that had originally been positive.

If we had added extra complexity to the sender and receiver transports to track changes to individual packets, we could have made it work, but then routers would have had an incentive to mark positive packets with half the probability of neutral packets. That in turn would have led router algorithms to become more complex. Then senders wouldn't know whether a mark had been introduced by a simple or a complex router algorithm. That in turn would have required another codepoint to distinguish between legacy ECN and new re-ECN router marking.

Once the cost of IP header codepoint real-estate was the same for both schemes, there was no doubt that the simpler option for endpoints and for routers should be chosen. The resulting protocol also no longer needed the tricky inflation/deflation complexity of the original (broken) scheme. It was also much simpler to understand conceptually.

A further advantage of the new orthogonal four-codepoint scheme was that senders owned sole rights to change the RE flag and routers owned sole rights to change the ECN field. Although we still arrange the incentives so neither party strays outside their dominion, these clear lines of authority simplify the matter.

Finally, a little redundancy can be very powerful in a scheme such as this. In one flow, the proportion of packets changed to CE should be the same as the proportion of RECT packets changed to CE(-1) and the proportion of Re-Echo packets changed to CE(0). Double checking using such redundant relationships can improve the security of a scheme (cf. double-entry book-keeping or the ECN Nonce).
Alternatively, it might be necessary to exploit the redundancy in the future to encode an extra information channel.

Appendix C. ECN Compatibility

The rationale for choosing the particular combinations of SYN and SYN ACK flags in Section 4.1.3 is as follows.

Choice of SYN flags:  A re-ECN sender can work with vanilla ECN receivers so we wanted to use the same flags as would be used in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same time, we wanted a server (host B) that is Re-ECT to be able to recognise that the client (A) is also Re-ECT. We believe also setting NS=1 in the initial SYN achieves both these objectives, as it should be ignored by vanilla ECT receivers and by ECT-Nonce receivers. But senders that are not Re-ECT should not set NS=1. At the time ECN was defined, the NS flag was not defined, so setting NS=1 should be ignored by existing ECT receivers (but testing against implementations may yet prove otherwise). The ECN Nonce RFC [RFC3540] is silent on what the NS field might be set to in the TCP SYN, but we believe the intent was for a nonce client to set NS=0 in the initial SYN (again only testing will tell). Therefore we define a Re-ECN-setup SYN as one with NS=1, CWR=1 & ECE=1.

Choice of SYN ACK flags:  The client (A) needs to be able to determine whether the server (B) is Re-ECT. The original ECN specification required an ECT server to respond to an ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There is no room to modify this by setting the NS flag, as that is already set in the SYN ACK of an ECT-Nonce server. So we used the only combination of CWR and ECE that would not be used by existing TCP receivers: CWR=1 and ECE=0. The original ECN specification defines this combination as a non-ECN-setup SYN ACK, which remains true for vanilla and Nonce ECTs.
   But for re-ECN we define it as a Re-ECN-setup SYN ACK. We didn't use a SYN ACK with both CWR and ECE cleared to 0 because that would be the likely response from most Not-ECT receivers. And we didn't use a SYN ACK with both CWR and ECE set to 1 either, as at least one broken receiver implementation echoes whatever flags were in the SYN into its SYN ACK. Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1 & ECE=0.

Choice of two alternative SYN ACKs:  The NS flag may take either value in a Re-ECN-setup SYN ACK. Section 5.4 specifies that a Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to echo congestion experienced (CE) on the initial SYN. Otherwise a Re-ECN-setup SYN ACK MUST be returned with NS=0. The only current known use of the NS flag in a SYN ACK is to indicate support for the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1. Given the ECN nonce MUST NOT be used for a RECN mode connection, a Re-ECN-setup SYN ACK can use either setting of the NS flag without any risk of confusion, because the CWR & ECE flags will be reversed relative to those used by an ECN nonce SYN ACK.

Appendix D. Packet Marking with FNE During Flow Start

FNE (feedback not established) packets have two functions. Their main role is to announce the start of a new flow when feedback has not yet been established. However they also have the role of balancing the expected feedback and can be used where there are sudden changes in the rate of transmission. Whilst this should not happen under TCP, their use for speculative marking underlies the following argument as to why the first and third packets should be set to FNE.

The proportion of FNE packets in each roundtrip should be a high estimate of the potential error in the balance of number of congestion marked packets versus number of re-echo packets already issued.

Let's call:

   S: the number of the TCP segments sent so far

   F: the number of FNE packets sent so far

   R: the number of Re-Echo packets sent so far

   A: the number of acknowledgments received so far

   C: the number of acknowledgments echoing a CE packet

In normal operation, when we want to send packet S+1, we first need to check that enough Re-Echo packets have been issued:

   If R [...] 1 FNE

o  if the acknowledgment doesn't echo a mark

   *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

   *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

o  if no acknowledgement for these two packets echoes a congestion mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source could send another 4 RECT packets. ==> 4 RECT

o  if no acknowledgement for these four packets echoes a congestion mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source could send another 8 RECT packets. ==> 8 RECT

This behaviour happens to match TCP's congestion window control in slow start, which is why for TCP sources, only the first and third packet need be FNE packets.

A source that would open the congestion window any quicker would have to insert more FNE packets. As another example a UDP source sending VBR traffic might need to send several FNE packets ahead of the traffic peaks it generates.

Appendix E. Example Egress Dropper Algorithm

{ToDo: Write up the basic algorithm with flow state, then the aggregated one.}

Appendix F. Re-TTL

This Appendix gives an overview of a proposal to be able to overload the TTL field in the IP header to monitor downstream propagation delay. This is included to show that it would be possible to take account of RTT if it was deemed desirable.

Delay re-feedback can be achieved by overloading the TTL field, without changing IP or router TTL processing. A target value for TTL at the destination would need standardising, say 16. If the path hop count increased by more than 16 during a routing change, it would temporarily be mistaken for a routing loop, so this target would need to be chosen to exceed typical hop count increases. The TCP wire protocol and handlers would need modifying to feed back the destination TTL and initialise it. It would be necessary to standardise the unit of TTL in terms of real time (as was the original intent in the early days of the Internet).

In the longer term, precision could be improved if routers decremented TTL to represent exact propagation delay to the next router. That is, for a router to decrement TTL by, say, 1.8 time units it would alternate the decrement of every packet between 1 & 2 at a ratio of 1:4. Although this might sometimes require a seemingly dangerous null decrement, a packet in a loop would still decrement to zero after 255 time units on average. As more routers were upgraded to this more accurate TTL decrement, path delay estimates would become increasingly accurate despite the presence of some legacy routers that continued to always decrement the TTL by 1.

Appendix G. Policer Designs to ensure Congestion Responsiveness

G.1. Per-user Policing

User policing requires a policer on the ingress interface of the access router associated with the user. At that point, the traffic of the user hasn't diverged on different routes yet; nor has it mixed with traffic from other sources.
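
The congestion budget policer that the remainder of this section specifies can be sketched in Python as follows. This is a minimal illustration, not a reference design: the class and method names are ours, token and packet sizes are assumed to be in octets, and what the policer does once the bucket is empty (here the packet is simply reported as out of budget) is an assumption, since the sanction is left open. The parameters b_0, r and b_max are derived from C_user, T_user and N_user exactly as listed later in this section.

```python
from dataclasses import dataclass, field

# Codepoints that consume tokens: positive and canceled packets.
CHARGED = {"Re-Echo", "FNE", "CE(0)"}


@dataclass
class CongestionPolicer:
    c_user: float          # congestion allowance per period (octets)
    t_user: float          # reference period (seconds)
    n_user: float          # allowances that may be carried forward
    start: float = 0.0     # time at which the subscription begins
    tokens: float = field(init=False)

    def __post_init__(self):
        self.rate = self.c_user / self.t_user         # r
        self.b_max = self.c_user * (self.n_user + 1)  # bucket depth
        self.tokens = self.c_user                     # b_0
        self._last = self.start

    def within_budget(self, now, size, codepoint):
        """Refill the bucket, then charge the packet if it is chargeable."""
        elapsed = now - self._last
        self.tokens = min(self.b_max, self.tokens + self.rate * elapsed)
        self._last = now
        if codepoint not in CHARGED:
            return True            # neutral and negative packets are free
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False               # assumption: caller decides the sanction


policer = CongestionPolicer(c_user=1000.0, t_user=100.0, n_user=1.0)
first = policer.within_budget(0.0, 600, "Re-Echo")   # spends 600 of 1000
second = policer.within_budget(0.0, 600, "FNE")      # only 400 left
neutral = policer.within_budget(0.0, 600, "RECT")    # not charged
later = policer.within_budget(20.0, 600, "CE(0)")    # refilled by 200
```

A single bucket per user, charged only by positive and canceled packets, is what distinguishes this use of the token bucket from a conventional rate limiter: a user sending at any rate over an uncongested path never spends a token.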

   In order to ensure that a user doesn't generate more congestion in
   the network than her due share, a modified bulk token-bucket is
   maintained with the following parameters:

   o  b_0 the initial token level

   o  r the filling rate

   o  b_max the bucket depth

   The same token bucket algorithm is used as in many areas of
   networking, but how it is used is very different:

   o  all traffic from a user over the lifetime of their subscription is
      policed in the same token bucket.

   o  only positive and canceled packets (Re-Echo, FNE and CE(0))
      consume tokens

   Such a policer will allow network operators to throttle the
   contribution of their users to network congestion.  This will require
   the appropriate contractual terms to be in place between operators
   and users.  For instance: a condition for a user to subscribe to a
   given network service may be that she should not cause more than a
   volume C_user of congestion over a reference period T_user, although
   she may carry forward up to N_user times her allowance at the end of
   each period.  These terms directly set the parameters of the user
   policer:

   o  b_0 = C_user

   o  r = C_user/T_user

   o  b_max = b_0 * (N_user + 1)

   Besides the congestion budget policer above, another user policer may
   be necessary to further rate-limit FNE packets, if they are to be
   marked rather than dropped (see discussion in Section 5.3).
   Rate-limiting FNE packets will prevent high bursts of new flow
   arrivals, which is a very useful feature in DoS prevention.  A
   condition to subscribe to a given network service would have to be
   that a user should not generate more than C_FNE FNE packets, over a
   reference period T_FNE, with no option to carry forward any of the
   allowance at the end of each period.
   These terms directly set the parameters of the FNE policer:

   o  b_0 = C_FNE

   o  r = C_FNE/T_FNE

   o  b_max = b_0

   T_FNE should be a much shorter period than T_user: for instance T_FNE
   could be in the order of minutes while T_user could be in the order
   of weeks.

G.2.  Per-flow Rate Policing

   Whilst we believe that simple per-user policing would be sufficient
   to ensure senders comply with congestion control, some operators may
   wish to police the rate response of each flow to congestion as well.
   Although we do not believe this will be necessary, we include this
   section to show how one could perform per-flow policing using
   enforcement of TCP-fairness as an example.  Per-flow policing aims to
   enforce congestion responsiveness on the shortest information
   timescale on a network path: packet roundtrips.

   This again requires that the appropriate terms be agreed between a
   network operator and its users, where a congestion responsiveness
   policy might be required for the use of a given network service
   (perhaps unless the user specifically requests otherwise).

   As an example, we describe below how a rate adaptation policer can be
   designed when the applicable rate adaptation policy is
   TCP-compliance.  In that context, the average throughput of a flow
   will be expected to be bounded by the value of the TCP throughput
   during congestion avoidance, given in Mathis' formula [Mathis97]

      x_TCP = k * s / ( T * sqrt(m) )

   where:

   o  x_TCP is the throughput of the TCP flow, in the same units of data
      as s per second (so that x_TCP/s is in packets per second),

   o  k is a constant upper-bounded by sqrt(3/2),

   o  s is the average packet size of the flow,

   o  T is the roundtrip time of the flow,

   o  m is the congestion level experienced by the flow.

   We define the marking period N=1/m which represents the average
   number of packets between two positive or canceled packets.
   Mathis' formula can be re-written as:

      x_TCP = k*s*sqrt(N)/T

   We can then get the average inter-mark time in a compliant TCP flow,
   dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

      dt_TCP = sqrt(N)*T/k

   We rely on this equation for the design of a rate-adaptation policer
   as a variation of a token bucket.  In that case a policer has to be
   set up for each policed flow.  This may be triggered by FNE packets,
   with all remaining flows (those that do not start with an FNE packet)
   being rate limited together.

   Where maintaining per-flow state is not a problem, for instance on
   some access routers, systematic per-flow policing may be considered.
   Should per-flow state be more constrained, rate adaptation policing
   could be limited to a random sample of flows exhibiting positive or
   canceled packets.

   As in the case of user policing, only positive or canceled packets
   will consume tokens; however, the number of tokens consumed will
   depend on the congestion signal.

   When a new rate adaptation policer is set up for flow j, the
   following state is created:

   o  a token bucket b_j of depth b_max starting at level b_0

   o  a timestamp t_j = timenow()

   o  a counter N_j = 0

   o  a roundtrip estimate T_j

   o  a filling rate r

   When the policing node forwards a packet of flow j with no Re-Echo:

   o  the counter is incremented: N_j += 1

   When the policing node forwards a packet of flow j carrying a
   congestion mark (CE):

   o  the counter is incremented: N_j += 1

   o  the token level is adjusted:
      b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k

   o  the counter is reset: N_j = 0

   o  the timestamp is reset: t_j = timenow()

   An implementation example will be given in a later draft that avoids
   having to extract the square root.
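
   As a non-normative illustration, the per-flow state and the two
   update rules above might be sketched in Python as follows.  The names
   FlowPolicer and on_packet, the default values chosen for b_0 and
   b_max, the capping of the token level at b_max, and the boolean
   return value are assumptions of this sketch and are not defined by
   this document.  The constant k is set to its upper bound sqrt(3/2),
   and the square root is computed directly rather than avoided.

```python
import math

# Mathis' constant k, set here to its upper bound (an assumption of this sketch).
K = math.sqrt(3 / 2)


class FlowPolicer:
    """Rate adaptation policer state for one flow j (hypothetical class name)."""

    def __init__(self, rtt, now, r=1.0, b_0=1.0, b_max=2.0):
        self.rtt = rtt        # T_j: roundtrip estimate, in seconds
        self.t = now          # t_j: timestamp of set-up or of the last mark
        self.r = r            # filling rate, in tokens per second
        self.b_max = b_max    # bucket depth (illustrative default)
        self.level = b_0      # b_j: current token level (illustrative default)
        self.n = 0            # N_j: packets counted since the last mark

    def on_packet(self, marked, now):
        """Account for one forwarded packet.

        Returns True while the token level is non-negative.
        """
        self.n += 1                                   # N_j += 1 for every packet
        if marked:
            # b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k
            self.level += self.r * (now - self.t) - math.sqrt(self.n) * self.rtt / K
            # Assumption: a bucket of depth b_max never holds more than b_max.
            self.level = min(self.level, self.b_max)
            self.n = 0                                # reset the counter
            self.t = now                              # reset the timestamp
        return self.level >= 0
```

   With r = 1 token/sec, a flow whose marks arrive every sqrt(N)*T/k
   seconds leaves the token level unchanged in this sketch, whereas a
   flow that sends the same N packets between marks in half that time
   loses tokens at every mark.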

   Analysis: For a TCP flow, for r = 1 token/sec, on average,

      r*(timenow()-t_j) - sqrt(N_j)*T_j/k = dt_TCP - sqrt(N)*T/k = 0

   This means that the token level will fluctuate around its initial
   level.  The depth b_max of the bucket sets the timescale on which the
   rate adaptation policy is performed while the filling rate r sets the
   trade-off between responsiveness and robustness:

   o  the higher b_max, the longer it will take to catch greedy flows

   o  the higher r, the fewer false positives (greedy verdict on
      compliant flows) but the more false negatives (compliant verdict
      on greedy flows)

   This rate adaptation policer requires the availability of a roundtrip
   estimate, which may be obtained for instance from the application of
   re-feedback to the downstream delay (Appendix F) or from passive
   estimation [Jiang02].

   When the bucket of a policer located at the access router (whether it
   is a per-user policer or a per-flow policer) becomes empty, the
   access router SHOULD drop at least all packets causing the token
   level to become negative.  The network operator MAY take further
   sanctions if the token level of the per-flow policers associated with
   a user becomes negative.

Appendix H.  Downstream Congestion Metering Algorithms

H.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream congestion in traffic crossing
   an inter-domain border, an algorithm is needed that accumulates the
   size of positive packets and subtracts the size of negative packets.
   We maintain two counters:

      V_b: accumulated congestion volume

      B: total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each re-ECN-capable packet {
       b = readLength(packet)      /* set b to packet size         */
       B += b                      /* accumulate total volume      */
       eecn = readEECN(packet)     /* read the EECN field once     */
       if eecn == Re-Echo || eecn == FNE {
           V_b += b                /* increment...                 */
       } else if eecn == CE(-1) {
           V_b -= b                /* ...or decrement V_b...       */
       }                           /* ...depending on EECN field   */
   }
   ====================================================================

   At the end of an accounting period this counter V_b represents the
   congestion volume that penalties could be applied to, as described in
   Section 6.1.6.

   For instance, the accumulated volume of congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 bytes).
   This might have resulted from an average downstream congestion level
   of 1% on an accumulated total data volume of B = 500PB.

H.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple algorithm
   above in order to protect against the various attacks from
   persistently negative flows described in Section 6.1.6.  As explained
   in that section, the first and most important step is to estimate the
   contribution of persistently negative flows to the bulk volume of
   downstream pre-congestion and to inflate this bulk volume as if these
   flows weren't there.  The process below has been designed to give an
   unbiased estimate, but it may be possible to define other processes
   that achieve similar ends.

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be
   realistically measurable but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should be
   taken at different times during the accounting period, preferably
   covering the whole ID space.  During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative packets, maintaining a separate account for each flow in the
   sample.  Each sample should run much longer than the large majority
   of flows, to avoid a bias from missing the starts and ends of flows,
   which tend to be positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the sample,
   and the total of the accounts V_{fI} excluding flows with a negative
   account from the subset I.  Then the weighted mean of all these
   samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix H.1), it can be inflated by this factor
   a_S to get a good unbiased estimate of the volume of downstream
   congestion over the accounting period, a_S * V_b, without being
   polluted by the effect of persistently negative flows.

Appendix I.  Argument for holding back the ECN nonce

   The ECN nonce is a mechanism that allows a /sending/ transport to
   detect if drop or ECN marking at a congested router has been
   suppressed by a node somewhere in the feedback loop (another router
   or the receiver).

   Space for the ECN nonce was set aside in [RFC3168] (currently
   proposed standard) while the full nonce mechanism is specified in
   [RFC3540] (currently experimental).  The DCCP specification [RFC4340]
   (currently proposed standard) requires that "Each DCCP sender SHOULD
   set ECN Nonces on its packets...".  It also mandates as a requirement
   for all CCID profiles that "Any newly defined acknowledgement
   mechanism MUST include a way to transmit ECN Nonce Echoes back to the
   sender.", therefore:

   o  The CCID profile for TCP-like Congestion Control [RFC4341]
      (currently proposed standard) says "The sender will use the ECN
      Nonce for data packets, and the receiver will echo those nonces in
      its Ack Vectors."

   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
      recommends that "The sender [use] Loss Intervals options' ECN
      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
      probabilistically verify that the receiver is correctly reporting
      all dropped or marked packets."

   The primary function of the ECN nonce is to protect the integrity of
   the information about congestion: ECN marks and packet drops.
   However, when the nonce is used to protect the integrity of
   information about packet drops, rather than ECN marks, a transport
   layer nonce will always be sufficient (because a drop loses the
   transport header as well as the ECN field in the network header),
   which would avoid using scarce IP header codepoint space.  Similarly,
   a transport layer nonce would protect against a receiver sending
   early acknowledgements [Savage99].

   If the ECN nonce reveals integrity problems with the information
   about congestion, the sending transport can use that knowledge for
   two functions:

   o  to protect its own resources, by allocating them in proportion to
      the rates that each network path can sustain, based on congestion
      control,

   o  and to protect congested routers in the network, by drastically
      slowing down its connection to the destination with corrupt
      congestion information.

   If the sending transport chooses to act in the interests of congested
   routers, it can reduce its rate if it detects that some malicious
   party in the feedback loop may be suppressing ECN feedback.  But this
   would only be useful to congested routers when /all/ senders using
   them are trusted to act in the interest of the congested routers.

   In the end, the only essential use of a network layer nonce is when
   sending transports (e.g. large servers) want to allocate their /own/
   resources in proportion to the rates that each network path can
   sustain, based on congestion control.  In that case, the nonce allows
   senders to be assured that they aren't being duped into giving more
   of their own resources to a particular flow.  And if congestion
   suppression is detected, the sending transport can rate limit the
   offending connection to protect its own resources.  Certainly, this
   is a useful function, but the IETF should carefully decide whether
   such a single, very specific case warrants IP header space.

   In contrast, re-ECN allows all routers to fully protect themselves
   from such attacks, without having to trust anyone - senders,
   receivers, neighbouring networks.  Re-ECN is therefore proposed in
   preference to the ECN nonce on the basis that it addresses the
   generic problem of accountability for congestion of a network's
   resources at the IP layer.

   Delaying the ECN nonce is justified because the applicability of the
   ECN nonce seems too limited for it to consume a two-bit codepoint in
   the IP header.  It therefore seems prudent to give time for an
   alternative way to be found to do the one function the nonce is
   essential for.

   Moreover, while we have re-designed the re-ECN codepoints so that
   they do not prevent the ECN nonce progressing, the same is not true
   the other way round.  If the ECN nonce started to see some deployment
   (perhaps because it was blessed with proposed standard status),
   incremental deployment of re-ECN would effectively be impossible,
   because re-ECN marking fractions at inter-domain borders would be
   polluted by unknown levels of nonce traffic.

   The authors are aware that re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the re-ECN protocol is sufficiently
   useful to warrant standards action.

Authors' Addresses

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

   Arnaud Jacquet
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 647284
   Email: arnaud.jacquet@bt.com
   URI:

   Toby Moncaster
   BT
   B54/70, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 648734
   Email: toby.moncaster@bt.com

   Alan Smith
   BT
   B54/76, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 640404
   Email: alan.p.smith@bt.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC documents
   can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgments

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).  This document was produced
   using xml2rfc v1.32 (of http://xml.resource.org/) from a source in
   RFC-2629 XML format.