Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Expires: September 7, 2006                                    A. Jacquet
                                                            A. Salvatori
                                                                      BT
                                                          March 06, 2006

     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
                   draft-briscoe-tsvwg-re-ecn-tcp-01

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 7, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document introduces a new protocol for explicit congestion
   notification (ECN), termed re-ECN, which can be deployed
   incrementally around unmodified routers.  The protocol arranges an
   extended ECN field in each packet so that, as it crosses any
   interface in an internetwork, it will carry a truthful prediction of
   congestion on the remainder of its path.  Then the upstream party at
   any trust boundary in the internetwork can be held responsible for
   the congestion they cause, or allow to be caused.  So, networks can
   introduce straightforward accountability and policing mechanisms for
   incoming traffic from end-customers or from neighbouring network
   domains.  The purpose of this document is to specify the re-ECN
   protocol at the IP layer and to give guidelines on any consequent
   changes required to transport protocols.  It includes the changes
   required to TCP both as an example and as a specification.  It also
   gives examples of mechanisms that can use the protocol to ensure data
   sources respond correctly to congestion.  And it describes example
   mechanisms that ensure the dominant selfish strategy of both network
   domains and end-points will be to set the extended ECN field
   honestly.

Authors' Statement: Status (to be removed by the RFC Editor)

   This document is posted as an Internet-Draft with the intent (at
   least that of the authors) to eventually progress to standards track.

   Although the re-ECN protocol is intended to make a simple but
   far-reaching change to the Internet architecture, the most immediate
   priority for the authors is to delay any move of the ECN nonce to
   Proposed Standard status.

   The ECN nonce is an experimental RFC that allows /senders/ to check
   the integrity of congestion feedback from /networks/.  Therefore the
   nonce only helps in scenarios where the sender is trusted to control
   network congestion.  On the other hand, the re-ECN protocol aims to
   allow networks themselves to be able to police cheating senders and
   receivers and to police neighbouring networks.  Re-ECN is therefore
   proposed in preference to the ECN nonce on the basis that it
   addresses the generic problem of accountability for congestion of a
   network's resources at the IP layer.

   Delaying the ECN nonce is justified by two factors:

   o  The ECN nonce would permanently consume a two-bit codepoint in
      the IP header for a purpose specific to a limited trust model.
      Although the nonce is a neat idea, its applicability seems too
      limited to warrant space in the IP header;

   o  Although we have re-designed the re-ECN codepoints so that they do
      not prevent the ECN nonce progressing, the same is not true the
      other way round.  If the ECN nonce started to see some deployment
      (perhaps because it was blessed with Proposed Standard status),
      incremental deployment of re-ECN would effectively be impossible,
      because re-ECN marking fractions at inter-domain borders would be
      polluted by unknown levels of nonce traffic.

   The authors are aware that re-ECN must prove it has the potential it
   claims if it is to displace the nonce.  Therefore, every effort has
   been made to complete a comprehensive specification of re-ECN so that
   its potential can be assessed.  We therefore seek the opinion of the
   Internet community on whether the re-ECN protocol is sufficiently
   useful to warrant standards action.

Table of Contents

   1.  Introduction
   2.  Requirements notation
   3.  Protocol Overview
     3.1.  Background and Applicability
     3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
     3.3.  Re-ECN Protocol Operation
     3.4.  Informal Terminology
   4.  Transport Layers
     4.1.  TCP
       4.1.1.  RECN mode: Full re-ECN capable transport
       4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT
               Receiver
       4.1.3.  Capability Negotiation
       4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or
               after Idle Periods
       4.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial
               ACKs
     4.2.  Other Transports
       4.2.1.  Guidelines for Adding Re-ECN to Other Transports
   5.  Network Layer
     5.1.  Re-ECN IPv4 Wire Protocol
     5.2.  Re-ECN IPv6 Wire Protocol
     5.3.  Router Forwarding Behaviour
     5.4.  Justification for Setting the First SYN to FNE
     5.5.  Control and Management
       5.5.1.  Negative Balance Warning
       5.5.2.  Rate Response Control
     5.6.  Tunnels
     5.7.  Non-Issues
   6.  Applications
     6.1.  Policing Congestion Response
       6.1.1.  The Policing Problem
       6.1.2.  Incentive Framework
       6.1.3.  Egress Dropper
       6.1.4.  Rate Policing
       6.1.5.  Inter-domain Policing
       6.1.6.  Simulations
     6.2.  Other Applications
       6.2.1.  DDoS Mitigation
       6.2.2.  End-to-end QoS
       6.2.3.  Traffic Engineering
       6.2.4.  Inter-Provider Service Monitoring
     6.3.  Limitations
   7.  Incremental Deployment
     7.1.  Incremental Deployment Features
     7.2.  Incremental Deployment Incentives
   8.  Architectural Rationale
   9.  Related Work
     9.1.  Policing Rate Response to Congestion
     9.2.  Congestion Notification Integrity
     9.3.  Identifying Upstream and Downstream Congestion
   10. Security Considerations
   11. IANA Considerations
   12. Conclusions
   13. Acknowledgements
   14. Comments Solicited
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Appendix A.  Precise Re-ECN Protocol Operation
   Appendix B.  ECN Compatibility
   Appendix C.  Packet Marking During Flow Start
   Appendix D.  Example Egress Dropper Algorithm
   Appendix E.  Re-TTL
   Appendix F.  Policer Designs to ensure Congestion Responsiveness
     F.1.  Per-user Policing
     F.2.  Per-flow Rate Policing
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   This document aims:

   o  To provide a complete specification of the addition of the re-ECN
      protocol to IP and guidelines on how to add it to transport layer
      protocols, including a complete specification of re-ECN in TCP as
      an example;

   o  To show how a number of hard problems become much easier to solve
      once re-ECN is available in IP.

   A general statement of the problem solved by re-ECN is to provide
   sufficient information in each IP datagram to be able to hold senders
   and whole networks accountable for the congestion they cause
   downstream, before they cause it.  But the every-day problems that
   re-ECN can solve are much more recognisable than this rather generic
   statement: mitigating distributed denial of service (DDoS);
   simplifying differentiation of quality of service (QoS); policing
   compliance to congestion control; and so on.

   Uniquely, re-ECN manages to enable solutions to these problems
   without unduly stifling innovative new ways to use the Internet.
   This was a hard balance to strike, given it could be argued that DDoS
   is an innovative way to use the Internet.  The most valuable insight
   was to allow each network to choose the level of constraint it wishes
   to impose.  Also, re-ECN has been carefully designed so that networks
   that choose to use it conservatively can protect themselves against
   the congestion caused in their network by users on other networks
   with more liberal policies.

   For instance, some network owners want to block applications like
   voice and video unless their network is compensated for the extra
   share of bottleneck bandwidth taken.  These real-time applications
   tend to be unresponsive when congestion arises, whereas elastic
   TCP-based applications back away quickly, ending up taking a much
   smaller share of congested capacity for themselves.  Other network
   owners want to invest in large amounts of capacity and make their
   gains from simplicity of operation and economies of scale.

   Re-ECN allows the more conservative networks to police out flows that
   have not asked to be unresponsive to congestion, not because they
   are voice or video, but simply because they do not respond to
   congestion.  But it also allows other networks to choose not to
   police.  Crucially, when flows from liberal networks cross into a
   conservative network, re-ECN enables the conservative network to
   apply penalties to its neighbouring networks for the congestion they
   cause.  And these penalties can be applied to bulk data, without
   regard to flows.

   Then, if unresponsive applications become so dominant that some of
   the more liberal networks experience congestion collapse [RFC3714],
   they can change their minds and use re-ECN to apply tighter controls
   in order to bring congestion back under control.

   Re-ECN works by arranging that each packet arrives at each network
   element carrying a view of expected congestion on its own downstream
   path, albeit averaged over multiple packets.  Most usefully,
   congestion on the remainder of the path becomes visible in the IP
   header at the first ingress.  Many of the applications of re-ECN
   involve a policer at this ingress using the view of downstream
   congestion arriving in packets to police or control the packet rate.

   Importantly, the scheme is recursive: a whole network harbouring
   users causing congestion in downstream networks can be held
   responsible or policed by its downstream neighbour.

   This document is structured as follows.  First an overview of the
   re-ECN protocol is given (Section 3), outlining its attributes and
   explaining conceptually how it works as a whole.  The two main parts
   of the document follow, as described above.  That is, the protocol
   specification divided into transport (Section 4) and network
   (Section 5) layers, then the applications it can be put to, such as
   policing DDoS, QoS and congestion control (Section 6).  Although
   these applications do not require standardisation themselves, they
   are described in a fair degree of detail in order to explain how
   re-ECN can be used.  Given that re-ECN proposes to use the last
   undefined bit in the IPv4 header, we felt it necessary to outline the
   potential that re-ECN could release in return for being given that
   bit.

   Deployment issues discussed throughout the document are brought
   together in Section 7, which is followed by a brief section
   explaining the somewhat subtle rationale for the design, from an
   architectural perspective (Section 8).  We end by describing related
   work (Section 9), listing security considerations (Section 10) and
   finally drawing conclusions (Section 12).

2.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   This document first specifies a protocol, then describes a framework
   that creates the right incentives to ensure compliance with the
   protocol.  This could cause confusion because the second part of the
   document considers many cases where malicious nodes may not comply
   with the protocol.  When such contingencies are described, if any of
   the above keywords are not capitalised, that is deliberate.  So, for
   instance, the following two apparently contradictory sentences would
   be perfectly consistent: i) x MUST do this; ii) x may not do this.

3.  Protocol Overview

3.1.  Background and Applicability

   First we briefly recap the essentials of the ECN protocol [RFC3168].
   Two bits in the IP header (v4 or v6) are assigned to the ECN field.
   The sender clears the field to "00" (Not-ECT) if either end-point
   transport is not ECN-capable.  Otherwise it indicates an ECN-capable
   transport (ECT) using either of the two code-points "10" or "01"
   (ECT(0) and ECT(1) respectively).

   ECN-capable routers probabilistically set "11" if congestion is
   experienced (CE), the marking probability increasing with the length
   of the queue at its egress link (the RED algorithm [RFC2309]).
   However, they still drop rather than mark Not-ECT packets.  With
   multiple ECN-capable routers on a path, a flow of packets accumulates
   the fraction of CE marking that each router adds.  The combined
   effect of the packet marking of all the routers along the path
   signals congestion of the whole path to the receiver.
   So, for example, if one router early in a path is marking 1% of
   packets and another later in a path is marking 2%, flows that pass
   through both routers will experience approximately 3% marking (see
   Appendix A for a precise treatment).

   The choice of two ECT code-points in the ECN field [RFC3168]
   permitted future flexibility, optionally allowing the sender to
   encode the experimental ECN nonce [RFC3540] in the packet stream.
   The nonce is designed to allow a sender to check the integrity of
   congestion feedback.  But Section 9.2 explains that it still gives no
   control over how fast the sender transmits as a result of the
   feedback.  On the other hand, re-ECN is designed both to ensure that
   congestion is declared honestly and that the sender's rate responds
   appropriately.

   Re-ECN is based on a feedback arrangement called
   `re-feedback' [Re-fb].  The word is short for either
   receiver-aligned, re-inserted or re-echoed feedback.  But it actually
   works even when no feedback is available.  In fact it has been
   carefully designed to work for single datagram flows.  Indeed, it
   even encourages aggregation of single packet flows by congestion
   control proxies.  Then, even if the traffic mix of the Internet were
   to become dominated by short messages, it would still be possible to
   control congestion efficiently.

   Changing the Internet's feedback architecture seems to imply
   considerable upheaval.  But re-ECN can be deployed incrementally at
   the transport layer around unmodified routers using existing fields
   in IP (v4 or v6).  However, it does also require the last undefined
   bit in the IPv4 header, which it uses in combination with the 2-bit
   ECN field to create four new codepoints.  Changes to IP routers are
   RECOMMENDED in order to improve resilience against DoS attacks.
   Similarly, re-ECN works best if both the sender and receiver
   transports are re-ECN-capable, but it can work with just sender
   support.  Section 7 summarises the incremental deployment strategy.

   The re-ECN protocol makes no changes to, and has no effect on, the
   TCP congestion control algorithm or other rate responses to
   congestion.  Re-ECN is only concerned with enabling the ingress
   network to police that a source is complying with a congestion
   control algorithm, which is orthogonal to congestion control itself.

   Before re-ECN can be considered worthy of using up the last bit in
   the IP header, we must be sure that all our claims are robust.  We
   have gradually been reducing the list of outstanding issues, but the
   few that still remain are listed in Section 6.3.  We expect others
   may find new attacks, but we offer the re-ECN protocol on the basis
   that it is built on fairly solid theoretical foundations and, so far,
   it has proved possible to keep it relatively robust.

3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   The re-ECN wire protocol uses the two-bit ECN field broadly as in
   [RFC3168] as described above, but with three differences of detail
   (see Section 5.3).  This specification defines a new re-ECN
   extension (RE) flag.  We will defer the definition of the actual
   position of the RE flag in the IPv4 and v6 headers until Section 5.
   Until then it will suffice to use an abstraction of the IPv4 and v6
   wire protocols by just calling it the RE flag.

   Unlike the ECN field, the RE flag is intended to be set by the sender
   and remain unchanged along the path, although it can be read by
   network elements that understand the re-ECN protocol.  It is feasible
   that a network element MAY change the setting of the RE flag, perhaps
   acting as a proxy for an end-point, but such a protocol would have to
   be defined in another specification (e.g. [Re-PCN]).
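
   As an informal illustration of the abstraction used so far (a two-bit
   ECN field that routers may re-mark, plus a one-bit RE flag that only
   the sender sets), the following Python sketch may help.  It is not
   part of the protocol specification.  The names AbstractHeader and
   legacy_router_congestion_mark are hypothetical, and the router shown
   is a plain RFC3168 router: the re-ECN-specific forwarding changes of
   Section 5.3 are deliberately not modelled.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Two-bit ECN field values as recapped in Section 3.1 (RFC3168).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass(frozen=True)
class AbstractHeader:
    """Hypothetical abstraction of the fields discussed in Section 3.2."""
    ecn: int  # two-bit ECN field, may be re-marked by routers
    re: int   # one-bit RE flag, set by the sender only


def legacy_router_congestion_mark(header: AbstractHeader) -> Optional[AbstractHeader]:
    """Model an RFC3168 router that has decided to signal congestion.

    Not-ECT packets are dropped (None is returned); ECN-capable packets
    are re-marked to CE.  The RE flag is never touched.
    """
    if header.ecn == NOT_ECT:
        return None
    return replace(header, ecn=CE)


# Assumed example packet: ECN field ECT(1) with the RE flag set.
sent = AbstractHeader(ecn=ECT_1, re=1)
marked = legacy_router_congestion_mark(sent)
```

   In this sketch the marked packet carries CE in its ECN field while
   its RE flag is still the value the sender chose, which is the
   property the text above relies on.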

   Although the RE flag is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) making eight
   codepoints.  We will use the RFC3168 names of the ECN codepoints to
   describe settings of the ECN field when the RE flag setting is "don't
   care", but we also define the following six extended ECN codepoint
   names for when we need to be more specific.

   +-------+------------+------+---------------+-----------------------+
   | ECN   | RFC3168    | RE   | Extended ECN  | Re-ECN meaning        |
   | field | codepoint  | flag | codepoint     |                       |
   +-------+------------+------+---------------+-----------------------+
   | 00    | Not-ECT    | 0    | Not-RECT      | Not re-ECN-capable    |
   |       |            |      |               | transport             |
   | 00    | Not-ECT    | 1    | FNE           | Feedback not          |
   |       |            |      |               | established           |
   | 01    | ECT(1)     | 0    | Re-Echo       | Re-echoed congestion  |
   |       |            |      |               | and RECT              |
   | 01    | ECT(1)     | 1    | RECT          | Re-ECN capable        |
   |       |            |      |               | transport             |
   | 10    | ECT(0)     | 0    | ---           | Legacy ECN use only   |
   |       |            |      |               |                       |
   | 10    | ECT(0)     | 1    | --CU--        | Currently unused      |
   |       |            |      |               |                       |
   | 11    | CE         | 0    | CE(0)         | Congestion            |
   |       |            |      |               | experienced with      |
   |       |            |      |               | Re-Echo               |
   | 11    | CE         | 1    | CE(-1)        | Congestion            |
   |       |            |      |               | experienced           |
   +-------+------------+------+---------------+-----------------------+

                    Table 1: Extended ECN Codepoints

3.3.  Re-ECN Protocol Operation

   In this section we will give an overview of the operation of the
   re-ECN protocol for TCP/IP, leaving a detailed specification to the
   following sections.  Other transports will be discussed later.

   In summary, the protocol adds a third `re-echo' stage to the existing
   TCP/IP ECN protocol.
   Whenever the network adds CE congestion signalling to the IP header
   on the forward data path, the receiver feeds it back to the ingress
   using TCP, then the sender re-echoes it into the forward data path
   using the RE flag in the next packet.

   Prior to receiving any feedback a sender will not know which setting
   of the RE flag to use, so it sets the feedback not established (FNE)
   codepoint.  The network reads the FNE codepoint conservatively as
   equivalent to re-echoed congestion.

   Specifically, once a flow is established, a re-ECN sender always
   initialises the ECN field to ECT(1).  And it usually sets the RE flag
   to "1".  Whenever a router re-marks a packet to CE, the receiver
   feeds back this event to the sender.  On receiving this feedback, the
   re-ECN sender will clear the RE flag to "0" in the next packet it
   sends.

   We chose to set and clear the RE flag this way round to ease
   incremental deployment (see Section 7).  To avoid confusion we will
   use the term `blanking' (rather than marking) when the RE flag is
   cleared to "0".  So, over a stream of packets, we will talk of the
   `RE blanking fraction' as the fraction of octets in packets with the
   RE flag cleared to "0".

       ^
       |
       |     RE blanking fraction
    3% |--------------------------------+=====
       |                                |
    2% |                                |
       |          CE marking fraction   |
    1% |        +-----------------------+
       |        |
    0% +---------------------------------------->
       ^        0           ^           i   ^  resource index
       |        ^           |           ^   |
       0        |           1           |   2  observation points
              1.00%                   2.00%    marking fraction

                Figure 1: A 2-Router Example (Imprecise)

   Figure 1 uses the two router example introduced earlier to illustrate
   why re-ECN allows routers to measure downstream congestion.  The
   horizontal axis represents the index of each congestible resource
   (typically queues) along a path through the Internet.
   There may be many routers on the path, but we assume only two are
   currently congested (those with resource index 0 and i).  The two
   superimposed plots show the fraction of each extended ECN codepoint
   in a flow observed along this path.  Given about 3% of packets
   reaching the destination are marked CE, in response to feedback the
   sender will blank the RE flag in about 3% of packets it sends.  Then
   approximate downstream congestion can be measured at the observation
   points shown along the path by subtracting the CE marking fraction
   from the RE blanking fraction, as shown in the table below
   (Appendix A derives these approximations from a precise analysis).

           +-------------------+------------------------------+
           | Observation point | Approx downstream congestion |
           +-------------------+------------------------------+
           | 0                 | 3% - 0% = 3%                 |
           | 1                 | 3% - 1% = 2%                 |
           | 2                 | 3% - 3% = 0%                 |
           +-------------------+------------------------------+

   Table 2: Downstream Congestion Measured at Example Observation Points

   All along the path, whole-path congestion remains unchanged so it can
   be used as a reference against which to compare upstream congestion.
   The difference predicts downstream congestion for the rest of the
   path.  Therefore, measuring the fractions of each codepoint at any
   point in the Internet will reveal upstream, downstream and whole path
   congestion.

   Note that we have introduced discussion of marking and blanking
   fractions solely for illustration.  To be absolutely clear, these
   fractions are averages that would result from the behaviour of a TCP
   protocol handler mechanically blanking outgoing packets in direct
   response to incoming feedback; we are not saying any protocol
   handler works with these average fractions directly.

3.4.  Informal Terminology

   In the rest of this memo we will loosely talk of positive or negative
   flows, meaning flows where the moving average of the downstream
   congestion metric is persistently positive or negative.  The notion
   of a negative metric arises because it is derived by subtracting one
   metric from another.  Of course actual downstream congestion cannot
   be negative, only the metric can (whether due to time lags or
   deliberate malice).

   Just as we will loosely talk of positive and negative flows, we will
   also talk of positive or negative packets, meaning packets that
   contribute positively or negatively to downstream congestion.

   Therefore packets can be considered to have a `worth' of +1, 0 or -1,
   which, when multiplied by their size, indicates their contribution to
   downstream congestion.  Figure 2 shows the main state transitions of
   the system once a flow is established, showing the worth of packets
   in each state.  When the network congestion marks a packet it
   decrements its worth.  When the sender blanks the RE flag in order to
   re-echo congestion it increments the worth of a packet.

   Sender state        Sent    Worth    Network     Received  Worth
                       packet           Congestion  packet
     +---------------------------------------------------------+
     |                                                         ^
     V                                                         |
   Congestion echoed -->Re-Echo  +1       -->       CE(0)    0 --+
                   /                                           |
   No congestion___/                                           |
               /   \                                           |
              V     \                                          |
   Flow established --> RECT      0       -->       CE(-1)  -1 --+

      Figure 2: Re-ECN System State Diagram (bootstrap not shown)

   The idea is that every time the network decrements the worth of a
   packet, the sender increments the worth of a later packet.  Then,
   over time, as many positive packets should arrive at the receiver as
   negative.  It is this balance that will allow the network to hold the
   sender accountable for the congestion it causes, as we shall see.
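
   The worth bookkeeping described above can be illustrated with a
   short Python sketch.  It is only an illustration under stated
   assumptions, not an algorithm defined by this document: the helper
   names (mark, downstream_metric) are hypothetical, the flow is assumed
   to be in steady state with 3 of every 100 equal-sized packets sent as
   Re-Echo, and the packets that each router marks are chosen by hand
   rather than probabilistically so that the result is reproducible.

```python
# Extended ECN codepoints that appear in Figure 2.
RE_ECHO, RECT, CE_0, CE_MINUS_1 = "Re-Echo", "RECT", "CE(0)", "CE(-1)"

# Worth of each codepoint, as shown in Figure 2.
WORTH = {RE_ECHO: +1, RECT: 0, CE_0: 0, CE_MINUS_1: -1}

# Network congestion marking decrements the worth of a packet by one.
CONGESTION_MARK = {RE_ECHO: CE_0, RECT: CE_MINUS_1}


def mark(packets, indices):
    """Return a copy of `packets` with those at `indices` congestion marked.

    Packets already carrying a CE codepoint are left unchanged.
    """
    return [
        (CONGESTION_MARK.get(codepoint, codepoint), size) if i in indices
        else (codepoint, size)
        for i, (codepoint, size) in enumerate(packets)
    ]


def downstream_metric(packets):
    """Size-weighted average worth of a list of (codepoint, octets) pairs."""
    total_octets = sum(size for _, size in packets)
    return sum(WORTH[codepoint] * size for codepoint, size in packets) / total_octets


# Assumed steady state: 3 Re-Echo packets and 97 RECT packets of 1500 octets.
sent = [(RE_ECHO, 1500)] * 3 + [(RECT, 1500)] * 97
after_router_0 = mark(sent, {10})                 # first router marks 1 packet
after_router_i = mark(after_router_0, {20, 30})   # second router marks 2 more

# Metric at observation points 0, 1 and 2 of Figure 1.
metrics = [downstream_metric(p) for p in (sent, after_router_0, after_router_i)]
```

   Under these assumptions the metric falls from 3% at observation
   point 0, to 2% at point 1, to zero at point 2, matching Table 2: the
   positive worth introduced by the sender is exactly balanced by the
   negative worth introduced by the network by the time the packets
   reach the receiver.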
If we start with the sender in `flow established' state, normally it goes round the tight sub-loop, sending RECT packets (worth nothing) and returning to the flow established state to send another one. But if one of the packets is congestion marked, its worth is decremented. The sender will have been continuing round its tight sending loop. But when congestion feedback returns from one of the packets in flight (the largest loop in the figure) the sender jumps to the congestion echoed state in order to re-echo the congestion, incrementing the worth of the next packet by blanking its RE bit. The sender then returns to the flow established state and continues in the tight loop sending zero worth.

If a packet carrying re-echoed congestion happens to also be congestion marked, the worth added by the sender will be cancelled out by the network congestion marking. Although the two worth values correctly cancel out, neither the congestion marking nor the re-echoed congestion is lost, because the RE bit and the ECN field are orthogonal. So, whenever this happens, the receiver will correctly detect and re-echo the new congestion event as well (the top sub-loop).

The table below specifies unambiguously the worth of each extended ECN codepoint. Note the order is different from the previous table to better show how the worth increments and decrements. The FNE codepoint is an exception. It is used in the bootstrap process (explained later) and has the same positive worth as a packet with the Re-Echo codepoint.

+--------+------+----------------+-------+--------------------------+
|  ECN   |  RE  |  Extended ECN  | Worth | Re-ECN meaning           |
| field  | bit  |   codepoint    |       |                          |
+--------+------+----------------+-------+--------------------------+
|   00   |  0   |    Not-RECT    |  ...  | Not re-ECN-capable       |
|        |      |                |       | transport                |
|   01   |  0   |    Re-Echo     |  +1   | Re-echoed congestion and |
|        |      |                |       | RECT                     |
|   10   |  0   |      ---       |  ...  | Legacy ECN use only      |
|   11   |  0   |     CE(0)      |   0   | Congestion experienced   |
|        |      |                |       | with Re-Echo             |
|   00   |  1   |      FNE       |  +1   | Feedback not established |
|   01   |  1   |      RECT      |   0   | Re-ECN capable transport |
|   10   |  1   |     --CU--     |  ...  | Currently unused         |
|        |      |                |       |                          |
|   11   |  1   |     CE(-1)     |  -1   | Congestion experienced   |
+--------+------+----------------+-------+--------------------------+

Table 3: 'Worth' of Extended ECN Codepoints

4.  Transport Layers

4.1.  TCP

Re-ECN capability at the sender is essential. At the receiver it is optional, as long as the receiver has a basic (`vanilla flavour') RFC3168-compliant ECN-capable transport (ECT) [RFC3168]. Given re-ECN is not the first attempt to define the semantics of the ECN field, we give a table below summarising what happens for various combinations of capabilities of the sender S and receiver R, as indicated in the first four columns below. The last column gives the mode a half-connection should be in after the first two segments of TCP's three-way handshake.

+--------+---------------+-----------+---------+--------------------+
| Re-ECT |   ECT-Nonce   |    ECT    | Not-ECT |        S-R         |
|        |   (RFC3540)   | (RFC3168) |         |  Half-connection   |
|        |               |           |         |        Mode        |
+--------+---------------+-----------+---------+--------------------+
|   SR   |               |           |         |        RECN        |
|   S    |       R       |           |         |      RECN-Co       |
|   S    |               |     R     |         |      RECN-Co       |
|   S    |               |           |    R    |      Not-ECT       |
+--------+---------------+-----------+---------+--------------------+

Table 4: Modes of TCP Half-connection for Combinations of ECN Capabilities of Sender S and Receiver R

We will describe what happens in each mode, then describe how they are negotiated. The abbreviations for the modes in the above table mean:

RECN:  Full re-ECN capable transport

RECN-Co:  Re-ECN sender in compatibility mode with a vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable receiver.

Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when at least one of the transports does not understand even basic ECN marking.

Note that we use the term Re-ECT for a host transport that is re-ECN-capable but RECN for the modes of the half connections between hosts when they are both Re-ECT. If a host transport is Re-ECT, this fact alone does NOT imply either of its half connections will necessarily be in RECN mode, at least not until it has confirmed that the other host is Re-ECT.

4.1.1.  RECN mode: Full re-ECN capable transport

In full RECN mode, for each half connection, the sender and the receiver each maintain an unsigned integer counter we will call ECC (echo congestion counter). The receiver maintains a count, modulo 8, of how many times a CE marked packet has arrived during the half-connection. Once a RECN connection is established, the three TCP header flags (ECE, CWR & NS) used for ECN-related functions in previous versions of ECN are used as a 3-bit field for the receiver to repeatedly tell the sender the current value of ECC whenever it sends a TCP ACK. We will call this the echo congestion increment (ECI) field. This overloaded use of these 3 flags as one 3-bit ECI field is shown in Figure 4. The current definition of the TCP header, including the addition of support for the ECN nonce, is shown for comparison in Figure 3. This specification does not redefine the names of these three TCP header flags, it merely overloads them with another definition once a flow is established.
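The overloading of the NS, CWR and ECE flags as one 3-bit ECI field can be sketched as simple bit manipulation. This is an illustration only. It assumes the 16 bits of Figures 3 and 4 are held as one big-endian integer (bit 0 most significant, bit 15 least significant), and the function names are hypothetical. The sketch does not check the SYN flag, although the field carries ECI only in non-SYN segments.

```python
# Bits 7, 8 and 9 of the 16-bit word (NS, CWR, ECE) form the ECI field.
# With bit 15 (FIN) as the least significant bit, ECE sits 6 bits up.
ECI_SHIFT = 6
ECI_MASK = 0x7 << ECI_SHIFT  # 0x01C0 covers NS, CWR and ECE


def get_eci(flags_word):
    """Extract the 3-bit ECI value from the 16-bit header word."""
    return (flags_word & ECI_MASK) >> ECI_SHIFT


def set_eci(flags_word, ecc):
    """Return the header word with ECI set to the local ECC modulo 8."""
    cleared = flags_word & ~ECI_MASK & 0xFFFF
    return cleared | ((ecc % 8) << ECI_SHIFT)
```

For example, a word of 0x5010 (header length 5, ACK set) with ECC = 5 becomes 0x5150, and an ECC of 9 wraps to an ECI of 1.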
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           | N | C | E | U | A | P | R | S | F |
| Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
|               |           |   | R | E | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the TCP Header

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|               |           |           | U | A | P | R | S | F |
| Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
|               |           |           | G | K | H | T | N | N |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Figure 4: Definition of the ECI field within bytes 13 and 14 of the TCP Header, overloading the current definitions above for established RECN flows.

Receiver Action in RECN Mode

   Every time a CE marked packet arrives at a receiver in RECN mode, the receiver transport increments its local value of ECC modulo 8 and MUST echo its value to the sender in the ECI field of the next ACK. It MUST repeat the same value of ECI in every subsequent ACK until the next CE event, when it increments ECI again.

   The increment of the local ECC values is modulo 8 so the field value simply wraps round back to zero when it overflows. The least significant bit is to the right (labelled bit 9).

   A receiver in RECN mode MAY delay the echo of a CE to the next delayed-ACK, which would be necessary if ACK-withholding were implemented.

Sender Action in RECN Mode

   On the arrival of every ACK, the sender compares the ECI field with its own ECC value, then replaces its local value with that from the ACK. The difference D is assumed to be the number of CE marked packets that arrived at the receiver since it sent the previously received ACK (but see below for the sender's safety strategy). Whenever the ECI field increments by D (or D drops are detected), the sender MUST clear the RE flag to "0" in the IP header of the next D data packets it sends, effectively re-echoing each single increment of ECI. Otherwise the data sender MUST send all data packets with RE set to "1".

   As a general rule, once a flow is established, as well as setting or clearing the RE flag as above, a data sender in RECN mode MUST always set the ECN field to ECT(1). However, the settings of the extended ECN field during flow start are defined in Section 4.1.4.

   As we have already emphasised, the re-ECN protocol makes no changes to, and has no effect on, the TCP congestion control algorithm. So, each increment of ECI (or detection of a drop) also triggers the standard TCP congestion response, but with no more than one congestion response per round trip, as usual.

   A TCP sender also acts as the receiver for the other half-connection. The host will maintain two ECC values S.ECC and R.ECC as sender and receiver respectively. Every data packet sent by a host in RECN mode will also repeat the prevailing value of R.ECC in its ECI field. If a sender in RECN mode has to retransmit a packet due to a suspected loss, the re-transmitted packet MUST carry the latest prevailing value of R.ECC when it is re-transmitted, which will not necessarily be the one it carried originally.

4.1.1.1.  Safety against Long Pure ACK Loss Sequences

The ECI method was chosen for echoing congestion marking because a re-ECN sender needs to know about every CE mark arriving at the receiver, not just whether at least one arrives within a round trip time (which is all the ECE/CWR mechanism supported). But pure ACKs are not protected by TCP reliable delivery, so we repeat the same ECI value in every ACK until it changes. Even if many ACKs in a row are lost, as soon as one gets through, the ECI field it repeats from previous ACKs that didn't get through will update the sender on how many CE marks arrived since the last ACK got through.

The sender will only lose a record of the arrival of a CE mark if all the ACKs are lost (and all of them were pure ACKs) for a stream of data long enough to contain 8 or more CE marks. So, if the marking fraction was p, at least 8/p pure ACKs would have to be lost. For example, if p was 5%, a sequence of 160 pure ACKs would all have to be lost. To protect against such extremely unlikely events, if a re-ECN sender detects a sequence of pure ACKs has been lost it SHOULD assume the ECI field wrapped as many times as possible within the sequence.

Specifically, if a re-ECN sender receives an ACK with an acknowledgement number that acknowledges L segments since the previous ACK but with a sequence number unchanged from the previously received ACK, it SHOULD conservatively assume that the ECI field incremented by D' = L - ((L-D) mod 8), where D is the apparent increase in the ECI field. For example if the ACK arriving after 9 pure ACK losses apparently increased ECI by 2, the assumed increment of ECI would still be 2. But if ECI apparently increased by 2 after 11 pure ACK losses, ECI should be assumed to have increased by 10.

A re-ECN sender MAY implement a heuristic algorithm to determine beyond reasonable doubt that the ECI field did not wrap within a sequence of lost pure ACKs. But such an algorithm is OPTIONAL. Such an algorithm MUST NOT be used unless it is proven to work even in the presence of correlation between high ACK loss rate on the back channel and high CE marking rate on the forward channel.
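The conservative rule D' = L - ((L-D) mod 8) can be checked against the two worked examples in the text with a short sketch. The function names are illustrative. The guard for L smaller than D is an assumption of this sketch, since the text does not cover that case.

```python
def apparent_increment(sender_ecc, eci):
    """Apparent increase D of the ECI field, allowing for modulo-8 wrap."""
    return (eci - sender_ecc) % 8


def conservative_increment(d, l):
    """Assumed ECI increment D' after a run of lost pure ACKs.

    d is the apparent increase in ECI (0..7) and l is the number of
    segments newly acknowledged, as in D' = L - ((L - D) mod 8).
    The result is the largest value not exceeding l that is congruent
    to d modulo 8, i.e. ECI is assumed to have wrapped as many times
    as possible.
    """
    if l < d:
        # Assumption of this sketch: never report fewer marks than
        # the ECI field itself shows.
        return d
    return l - ((l - d) % 8)
```

With D = 2, the sketch gives D' = 2 for L = 9 and D' = 10 for L = 11, matching the examples above.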
Whatever assumption a re-ECN sender makes about potentially lost CE marks, both its congestion control and its re-echoing behaviour SHOULD be consistent with the assumption it makes.

4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver

If the half-connection is in RECN-Co mode, ECN feedback proceeds no differently to that of vanilla ECN. In other words, the receiver sets the ECE flag repeatedly in the TCP header and the sender responds by setting the CWR flag. Although RECN-Co mode is used when the receiver has not implemented the re-ECN protocol, the sender can infer enough from its vanilla ECN feedback to set or clear the RE flag reasonably well. Essentially, every time the receiver toggles the ECE flag from "0" to "1" (or a loss is detected), as well as setting CWR in the TCP flags, the re-ECN sender sets the IP header the same as it would do in full RECN mode. Specifically, the re-ECN sender MUST clear the RE flag to "0" in the next packet. Otherwise the data sender SHOULD send all other packets with RE set to "1". Once a flow is established, a re-ECN data sender in RECN-Co mode MUST always set the ECN field to ECT(1).

If a CE marked packet arrives at the receiver within a round trip time of a previous mark, the receiver will still be echoing ECE for the last CE mark. Therefore, such a mark will be missed by the sender. Of course, this isn't of concern for congestion control, but it does mean that very occasionally the RE blanking fraction will be understated. Therefore flows in RECN-Co mode may occasionally be mistaken for very lightly cheating flows and consequently might suffer a small number of packet drops through an egress dropper (Section 6.1.3). We expect re-ECN would be deployed for some time before policers and droppers start to enforce it. So, given there is not much ECN deployment yet anyway, this minor problem may affect only a very small proportion of flows, reducing to nothing over the years as vanilla ECN hosts upgrade. The use of RECN-Co mode would need to be reviewed in the light of experience at the time of re-ECN deployment.

RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their code simple MAY choose not to implement this mode. If they do not, a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence of an ECN-capable receiver. It MAY choose to fall back to the ECT-Nonce mode, but if re-ECN implementers don't want to be bothered with RECN-Co mode, they probably won't want to add an ECT-Nonce mode either.

4.1.2.1.  Re-ECN support for the ECN Nonce

A TCP half-connection in RECN-Co mode MUST NOT support the ECN Nonce [RFC3540]. This means that the sending code of a re-ECN implementation will never need to include ECN Nonce support. Re-ECN is intended to provide wider protection than the ECN nonce against congestion control misbehaviour, and re-ECN only requires support from the sender, therefore it is preferable to specifically rule out the need for dual sender implementations. As a consequence, a re-ECN capable sender will never set ECT(0), so it will be easier for network elements to discriminate re-ECN traffic flows from other ECN traffic, which will always contain some ECT(0) packets.

However, a re-ECN implementation MAY include receiving code that complies with the ECN Nonce protocol when interacting with a sender that supports the ECN nonce (rather than re-ECN), but this support is OPTIONAL.

RFC3540 allows an ECN nonce sender to choose whether to sanction a receiver that does not ever set the nonce sum. Given re-ECN is intended to provide wider protection than the ECN nonce against congestion control misbehaviour, implementers of re-ECN receivers MAY choose not to implement backwards compatibility with the ECN nonce capability. This may be because they deem that the risk of sanctions is low, perhaps because significant deployment of the ECN nonce seems unlikely at implementation time.

4.1.3.  Capability Negotiation

During the TCP hand-shake at the start of a connection, an originator of the connection (host A) with a re-ECN-capable transport MUST indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1 in the initial SYN.

A responding Re-ECT host (host B) MUST return a SYN ACK with flags CWR=1 and ECE=0. The responding host MUST NOT set this combination of flags unless the preceding SYN has already indicated Re-ECT support as above. A Re-ECT server (B) can use either setting of the NS flag combined with this type of SYN ACK in response to a SYN from a Re-ECT client (A). Normally a Re-ECT server will reply to a Re-ECT client with NS=0, but under special circumstances described in Section 4.1.4 it can return a SYN ACK with NS=1.

These handshakes are summarised in Table 5 below, with X meaning `don't care'. The handshakes used for the other flavours of ECN are also shown for comparison. To compress the width of the table, the headings of the first four columns have been severely abbreviated, as follows:

R:  *R*e-ECT

N:  ECT-*N*once (RFC3540)

E:  *E*CT (RFC3168)

I:  Not-ECT (*I*mplicit congestion notification).

These correspond with the same headings used in Table 4. Indeed, the resulting modes in the last two columns of the table below are a more comprehensive way of saying the same thing as Table 4.
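As an illustration of the originator's side of this negotiation, the sketch below maps the flags of a received SYN ACK to the modes of the two half-connections, for a Re-ECT host A that sent NS=1, CWR=1, ECE=1 in its SYN. The function name and the mode strings are assumptions of this sketch. The fallback to Not-ECT for flag combinations not listed in Table 5 is also an assumption, chosen as the conservative option. The treatment of a reflected SYN follows the RFC3168 rule recalled in this section.

```python
def modes_for_reect_originator(ns, cwr, ece):
    """Return (A-B mode, B-A mode) from the flags of B's SYN ACK.

    Assumes host A is Re-ECT and its SYN carried NS=1, CWR=1, ECE=1.
    """
    flags = (ns, cwr, ece)
    if flags == (1, 1, 1):
        # SYN ACK reflects the SYN flags: broken legacy responder,
        # so the whole connection reverts to Not-ECT.
        return ("Not-ECT", "Not-ECT")
    if (cwr, ece) == (1, 0):
        # Responder is Re-ECT; NS is "don't care" here.
        return ("RECN", "RECN")
    if flags == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")
    if flags == (0, 0, 1):
        return ("RECN-Co", "ECT")
    # (0, 0, 0) is a Not-ECT responder. Any other combination is
    # unlisted and treated the same way (assumption of this sketch).
    return ("Not-ECT", "Not-ECT")
```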
+----+---+---+---+------------+-------------+-----------+-----------+
| R  | N | E | I |  SYN A-B   | SYN ACK B-A | A-B Mode  | B-A Mode  |
+----+---+---+---+------------+-------------+-----------+-----------+
|    |   |   |   | NS CWR ECE |  NS CWR ECE |           |           |
| AB |   |   |   |  1   1   1 |   X   1   0 | RECN      | RECN      |
| A  | B |   |   |  1   1   1 |   1   0   1 | RECN-Co   | ECT-Nonce |
| A  |   | B |   |  1   1   1 |   0   0   1 | RECN-Co   | ECT       |
| A  |   |   | B |  1   1   1 |   0   0   0 | Not-ECT   | Not-ECT   |
| B  | A |   |   |  0   1   1 |   0   0   1 | ECT-Nonce | RECN-Co   |
| B  |   | A |   |  0   1   1 |   0   0   1 | ECT       | RECN-Co   |
| B  |   |   | A |  0   0   0 |   0   0   0 | Not-ECT   | Not-ECT   |
+----+---+---+---+------------+-------------+-----------+-----------+

Table 5: TCP Capability Negotiation between Originator (A) and Responder (B)

As soon as a re-ECN capable TCP server receives a SYN, it MUST set its two half-connections into the modes given in Table 5. As soon as a re-ECN capable TCP client receives a SYN ACK, it MUST set its two half-connections into the modes given in Table 5. The half-connections will remain in these modes for the rest of the connection, including for the third segment of TCP's three-way hand-shake (the ACK).

{ToDo: Consider SYNs within a connection.}

Recall that, if the SYN ACK reflects the same flag settings as the preceding SYN (because there is a broken legacy implementation that behaves this way), RFC3168 specifies that the whole connection MUST revert to Not-ECT.

Also note that, whenever the SYN flag of a TCP segment is set (including when the ACK flag is also set), the NS, CWR and ECE flags MUST NOT be interpreted as the 3-bit ECI value, which is only set as a copy of the local ECC value in non-SYN packets.

4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or after Idle Periods

If the originator (A) of a TCP connection supports re-ECN it MUST set the extended ECN (EECN) field in the IP header of the initial SYN packet to the feedback not established (FNE) codepoint.

FNE is a new extended ECN codepoint defined by this specification (Section 3.2). The feedback not established (FNE) codepoint is used when the transport does not have the benefit of ECN feedback so it cannot decide whether to set or clear the RE flag.

If after receiving a SYN the server B has set its sending half-connection into RECN mode or RECN-Co mode, it MUST set the extended ECN field in the IP header of its SYN ACK to the feedback not established (FNE) codepoint. Note the careful wording here, which means that Re-ECT server B must set FNE on a SYN ACK whether it is responding to a SYN from a Re-ECT client or from a client that is merely ECN-capable.

The original ECN specification [RFC3168] required SYNs and SYN ACKs to use the Not-ECT codepoint of the ECN field. The aim was to prevent well-known DoS attacks such as SYN flooding being able to gain from the advantage that ECN capability afforded over drop at ECN-capable routers. For a SYN ACK, [I-D.ietf-tsvwg-ecnsyn] has shown this caution was unnecessary, and proposes to allow a SYN ACK to be ECN-capable to improve performance. However, our use of FNE on the initial SYN seems to comply with this aim in word but not in spirit, so a justification for choosing to set RE to 1 for a SYN is given in Section 5.4.

Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will have already been set on the initial SYN and possibly the SYN ACK as above. But each re-ECN sender will have to set FNE cautiously on a few data packets as well, given a number of packets will usually have to be sent before sufficient congestion feedback is received. The behaviour will be different depending on the mode of the half-connection:

RECN mode:  Given the constraints on TCP's initial window [RFC3390] and its exponential window increase during slow start phase [RFC2581], it turns out that the sender SHOULD set FNE on the first and third data packets in its flow, assuming equal sized data packets once a flow is established. Appendix C presents the calculation that led to this conclusion. Below, after running through the start of an example TCP session, we give the intuition learned from that calculation.

RECN-Co mode:  A re-ECT sender that switches into re-ECN compatibility mode (because it has detected the corresponding host is ECN-capable but not re-ECN capable) MUST limit its initial window to 1 segment. The reasoning behind this constraint is given in Section 5.4. Having set this initial window, a re-ECN sender in RECN-Co mode SHOULD set FNE on the first and third data packets in a flow, as for RECN mode.
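The choice of extended ECN codepoint for each data packet of an established half-connection might be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the class and method names are hypothetical, and giving FNE precedence over a pending re-echo on the first and third data packets is a choice of this sketch that the text does not specify. Retransmissions, pure ACKs and idle periods are not modelled.

```python
class EecnChooser:
    """Pick the EECN codepoint for successive data packets of one flow."""

    def __init__(self):
        self.data_packets_sent = 0   # data packets sent so far
        self.pending_reechoes = 0    # increments of ECI not yet re-echoed

    def on_congestion_feedback(self, d):
        """Record d newly reported CE marks (or detected drops)."""
        self.pending_reechoes += d

    def next_codepoint(self):
        """Return the codepoint name for the next data packet."""
        self.data_packets_sent += 1
        if self.data_packets_sent in (1, 3):
            # FNE is RECOMMENDED on the first and third data packets.
            return "FNE"
        if self.pending_reechoes > 0:
            # Blank RE on one packet per reported congestion mark.
            self.pending_reechoes -= 1
            return "Re-Echo"
        return "RECT"
```

With no feedback, the first five data packets are sent as FNE, RECT, FNE, RECT, RECT. After feedback reporting two marks, the next two packets are sent as Re-Echo before the sender returns to RECT.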
+----+------+----------------+-------+-------+---------------+------+
|    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
+----+------+----------------+-------+-------+---------------+------+
|    | Byte | SEQ  ACK  CTL  | EECN  | EECN  | SEQ  ACK  CTL | Byte |
| -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
| 1  |      | 0100      SYN  | FNE   | -->   | R.ECC=0       |      |
|    |      | CWR,ECE,NS     |       |       |               |      |
| 2  |      | R.ECC=0        | <--   | FNE   | 0300 0101     |      |
|    |      |                |       |       | SYN,ACK,CWR   |      |
| 3  |      | 0101 0301 ACK  | RECT  | -->   | R.ECC=0       |      |
| 4  | 1000 | 0101 0301 ACK  | FNE   | -->   | R.ECC=0       |      |
| 5  |      | R.ECC=0        | <--   | FNE   | 0301 1102 ACK | 1460 |
| 6  |      | R.ECC=0        | <--   | RECT  | 1762 1102 ACK | 1460 |
| 7  |      | R.ECC=0        | <--   | FNE   | 3222 1102 ACK | 1460 |
| 8  |      | 1102 1762 ACK  | RECT  | -->   | R.ECC=0       |      |
| 9  |      | R.ECC=0        | <--   | RECT  | 4682 1102 ACK | 1460 |
| 10 |      | R.ECC=0        | <--   | RECT  | 6142 1102 ACK | 1460 |
| 11 |      | 1102 3222 ACK  | RECT  | -->   | R.ECC=0       |      |
| 12 |      | R.ECC=0        | <--   | RECT  | 7602 1102 ACK | 1460 |
| 13 |      | R.ECC=1        | <*-   | RECT  | 9062 1102 ACK | 1460 |
|    |      | ...            |       |       |               |      |
+----+------+----------------+-------+-------+---------------+------+

Table 6: TCP Session Example #1

Table 6 shows an example TCP session, where the server B sets FNE on its first and third data packets (lines 5 & 7) as well as on the initial SYN ACK as previously described. The left hand half of the table shows the relevant settings of headers sent by client A in three layers: the TCP payload size; TCP settings; then IP settings. The right hand half gives equivalent columns for server B. The only TCP settings shown are the sequence number (SEQ), acknowledgement number (ACK) and the relevant control (CTL) flags that A sets in the TCP header. The IP columns show the setting of the extended ECN (EECN) field.
Also shown on the receiving side of the table is the value of the receiver's echo congestion counter (R.ECC) after processing the incoming EECN header. Note that, once a host sets a half-connection into RECN mode, it MUST initialise its local value of ECC to zero.

The intuition that Appendix C gives for why a sender should set FNE on the first and third data packets is as follows. At line 13, a packet sent by B is shown with an '*', which means it has been congestion marked by an intermediate router from RECT to CE(-1). On receiving this CE marked packet, client A increments its ECC counter to 1 as shown. This was the 7th data packet B sent, but before feedback about this event returns to B, it might well have sent many more packets. Indeed, during exponential slow start, about as many packets will be in flight (unacknowledged) as have been acknowledged. So, when the feedback from the congestion event on B's 7th segment returns, B will have sent about 7 further packets that will still be in flight. At that stage, B's best estimate of the network's packet marking fraction will be 1/7. So, as B will have sent about 14 packets, it should have already marked 2 of them as FNE in order to have marked 1/7; hence the need to have set the first and third data packets to FNE.

Client A's behaviour in Table 6 also shows FNE being set on the first SYN and the first data packet (lines 1 & 4), but in this case it sends no more data packets, so of course, it cannot, and does not need to, set FNE again. Note that in the A-B direction there is no need to set FNE on the third part of the three-way hand-shake (line 3---the ACK).
Note that in this section we have used the word SHOULD rather than MUST when specifying how to set FNE on data segments before positive congestion feedback arrives (but note that the word MUST was used for FNE on the SYN and SYN ACK). FNE is only RECOMMENDED for the first and third data segments to entertain the possibility that the TCP transport has the benefit of other knowledge of the path, which it re-uses from one flow for the benefit of a newly starting flow. For instance, one flow can re-use knowledge of other flows between the same hosts if using a Congestion Manager [RFC3124] or when a proxy host aggregates congestion information for large numbers of flows.

After an idle period of more than 1 second, a re-ECN sender MUST set the EECN field of the next packet it sends to FNE. In order that the design of network policers can be deterministic, this specification deliberately puts an absolute lower limit on how long a connection can be idle before the next packet must be FNE, rather than relating it to the connection round trip time. We use the lower bound of the retransmission timeout (RTO) [RFC2988], which is commonly used as the idle period before TCP must reduce to the restart window [RFC2581].

Note our specification of re-ECN's idle period is NOT intended to change the idle period for TCP's restart, nor indeed for any other purposes.

{ToDo: Describe how the sender falls back to legacy modes if packets don't appear to be getting through (to work round firewalls discarding packets they consider unusual).}

4.1.5.  Pure ACKs, Retransmissions, Window Probes and Partial ACKs

A re-ECN sender MUST clear the RE flag to "0" and set the ECN field to Not-ECT in pure ACKs, retransmissions and window probes, as specified in [RFC3168]. Our eventual goal is for all packets to be sent with re-ECN enabled, and we believe the semantics of the ECI field go a long way towards being able to achieve this. However, we have not completed a full security analysis for these cases, therefore, currently we merely re-state current practice.

We must also reconcile the facts that congestion marking is applied to packets but acknowledgements cover octet ranges and acknowledged octet boundaries need not match the transmitted boundaries. The general principle we work to is to remain compatible with TCP's congestion control which is driven by congestion events at packet granularity while at the same time aiming to blank the RE flag on at least as many octets in a flow as have been marked CE.

Therefore, a re-ECN TCP receiver MUST increment its ECC value as many times as CE marked packets have been received. And that value MUST be echoed to the sender in the first available ACK using the ECI field. This ensures the TCP sender's congestion control receives timely feedback on congestion events at the same packet granularity that they were generated on congested routers.

A re-ECN sender then records the difference D between its own ECC value and the incoming ECI field by adding D to a counter R. R is then decremented by 1 for each subsequent packet that is sent with the RE flag blanked, until R is no longer positive. Using this technique, whenever a re-ECN transport sends a not re-ECN capable (NRECN) packet (e.g. a retransmission), the remaining packets required to have the RE flag blanked will be automatically carried over to subsequent packets, through the variable R.

This does not ensure precisely the same number of octets have RE blanked as were CE marked. But we believe positive errors will cancel negative over a long enough period.
{ToDo: However, more 1080 research is needed to prove whether this is so. If it is not, it may 1081 be necessary to increment and decrement R in octets rather than 1082 packets, by incrementing R as the product of D and the size in octets 1083 of packets being sent (typically the MSS).} 1085 4.2. Other Transports 1087 4.2.1. Guidelines for Adding Re-ECN to Other Transports 1089 Re-ECT sender transports that have established the receiver transport 1090 is at least ECN-capable (not necessarily re-ECN capable) MUST blank 1091 the RE codepoint in packets carrying at least as many octets as 1092 arrive at receiver with the CE codepoint set. Re-ECN-capable sender 1093 transports should always initialise the ECN field to the ECT(1) 1094 codepoint once a flow is established. 1096 If the sender transport does not have sufficient feedback to even 1097 estimate the path's CE rate, it SHOULD set FNE continuously. If the 1098 sender transport has some, perhaps stale, feedback to estimate that 1099 the path's CE rate is nearly definitely less than E%, the transport 1100 MAY blank RE in packets for E% of sent octets, and set the RECT 1101 codepoint for the remainder. 1103 {ToDo: Give a brief outline of what would be expected for each of the 1104 following: 1106 o UDP fire and forget (e.g. DNS) 1108 o UDP streaming with no feedback 1110 o UDP streaming with feedback 1112 o DCCP} 1114 o RSVP and/or NSIS: A separate I-D has been submitted [Re-PCN] 1115 describing how re-ECN can be used in an edge-to-edge rather than 1116 end-to-end scenario. It can then be used by downstream networks 1117 to police whether upstream networks are blocking new flow 1118 reservations when downstream congestion is too high, even though 1119 the congestion is in other operators' downstream networks. This 1120 relates to current work in progress on Admission Control over 1121 Diffserv using Pre-Congestion Notification, being reported to the 1122 IETF TSVWG [CL-arch]. 1124 5. Network Layer 1126 5.1. 
Re-ECN IPv4 Wire Protocol 1128 The wire protocol of the ECN field in the IP header remains largely 1129 unchanged from [RFC3168]. However, an extension to the ECN field we 1130 call the RE (re-ECN extension) flag (Section 3.2) is defined in this 1131 document. It doubles the extended ECN codepoint space, giving 8 1132 potential codepoints. The semantics of the extra codepoints are 1133 backward compatible with the semantics of the 4 original codepoints 1134 [RFC3168] (Section 7 collects together and summarises all the changes 1135 defined in this document). 1137 For IPv4, this document proposes that the new RE control flag will be 1138 positioned where the `reserved' control flag was at bit 48 of the 1139 IPv4 header (counting from 0). Alternatively, some would call this 1140 bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 1141 header (Figure 5). 1143 0 1 2 1144 +---+---+---+ 1145 | R | D | M | 1146 | E | F | F | 1147 +---+---+---+ 1149 Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at 1150 the Start of Byte 7 of the IPv4 Header 1152 It is believed that the RE flag can simultaneously serve other 1153 purposes, particularly where the start of a flow needs distinguishing 1154 from packets later in the flow. For instance it would have been 1155 useful to identify new flows for tag switching and might enable 1156 similar developments in the future if it were adopted. It is similar 1157 to the state set-up bit idea designed to protect against memory 1158 exhaustion attacks. This idea was proposed by David Clark and 1159 documented by Handley and Greenhalgh [Steps_DoS]. The RE flag can be 1160 thought of as a `soft-state set-up flag', because it is idempotent 1161 (i.e. one occurrence of the flag is sufficient but further 1162 occurrences achieve the same effect if previous ones were lost). 1164 We are sure there will probably be other claims pending on the use of 1165 bit 48. 
   We know of at least two, [ARI05] and [RFC3514], but neither has been
   pursued in the IETF so far, although the present proposal would meet
   the needs of the former.

   The security flag proposal (commonly known as the evil bit) was
   published on 1 April 2003 as Informational RFC 3514, but it was not
   adopted due to confusion over whether evil-doers might set it
   inappropriately.  The present proposal is backward compatible with
   RFC 3514 because if re-ECN compliant senders were benign they would
   correctly clear the evil bit to honestly declare that they had just
   received congestion feedback, whereas evil-doers would hide
   congestion feedback by setting the evil bit continuously, or at least
   more often than they should.  So, evil senders can be identified,
   because they declare that they are good less often than they should.

5.2.  Re-ECN IPv6 Wire Protocol

   {ToDo: Include the IPv6 extension header design, including support
   for the FNE flag.  Also its integrated support for a future multi-bit
   congestion notification field, with a TTL hop count scheme to check
   that all routers on the path support it (similar to Quick-Start).
   So, if the whole path of routers doesn't support the extension, the
   end-points can fall back to re-ECN (or drop).}

5.3.  Router Forwarding Behaviour

   Re-ECN works well without modifying the forwarding behaviour of any
   routers.  However, below, two OPTIONAL changes to forwarding
   behaviour are defined, which respectively enhance performance and
   improve a router's discrimination against flooding attacks.  They are
   both OPTIONAL additions that we propose MAY apply by default to all
   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
   marking behaviours [RFC3168].  Specifications for PHBs MAY define
   different forwarding behaviours from this default, but are not
   required to.  [Re-PCN] is one example.
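
   Either optional behaviour first requires a router to recover the
   extended ECN codepoint from the IPv4 header.  The following is a
   minimal sketch of that step, not part of the specification.  It
   assumes the bit position proposed in Section 5.1 and the codepoint
   names of Table 7; the function name, the dictionary and the labels
   "CU" and "Legacy ECN" are illustrative choices made here.

```python
# Illustrative sketch only: names below are hypothetical, not from the draft.
# Keys are (ECN field, RE flag); values follow the codepoint names in Table 7.
EECN_CODEPOINTS = {
    (0b01, 0): "Re-Echo",
    (0b00, 1): "FNE",
    (0b11, 0): "CE(0)",
    (0b01, 1): "RECT",
    (0b11, 1): "CE(-1)",
    (0b10, 1): "CU",          # currently unused
    (0b10, 0): "Legacy ECN",  # legacy ECN use only
    (0b00, 0): "Not-RECT",
}


def extended_ecn(ipv4_header: bytes) -> str:
    """Return the extended ECN codepoint name for a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets long")
    # ECN field: the two least significant bits of the second octet [RFC3168].
    ecn = ipv4_header[1] & 0b11
    # RE flag: bit 48 of the header, i.e. the most significant bit of the
    # seventh octet (index 6), where the reserved flag used to be.
    re_flag = ipv4_header[6] >> 7
    return EECN_CODEPOINTS[(ecn, re_flag)]
```

   A router that does not inspect the RE flag sees only the two ECN
   bits, which is why an FNE packet looks like Not-ECT to it.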
1201 FNE indicates ECT: 1203 The FNE codepoint indicates to a router that the packet was sent 1204 and will be received by an ECN-capable transport. Therefore an 1205 FNE packet MAY be marked rather than dropped. Note that the FNE 1206 codepoint has been intentionally chosen so that, to legacy routers 1207 (which do not inspect the RE flag), an FNE packet appears to be 1208 Not-ECT, so will be dropped by legacy AQM algorithms. 1210 A network operator MUST NOT configure a router to ECN mark rather 1211 than drop FNE packets unless it can guarantee that FNE packets 1212 will be rate limited, either locally or upstream. The ingress 1213 policers discussed in Section 6.1.4 would count as rate limiters 1214 for this purpose. 1216 Preferential Drop: If a re-ECN capable router experiences very high 1217 load so that it has to drop arriving packets (e.g. a DoS attack), 1218 it MAY preferentially drop packets within the same Diffserv PHB 1219 using the preference order for extended ECN codepoints given in 1220 Table 7. Preferential dropping is difficult to implement, but if 1221 feasible it would discriminate against attack traffic, if done as 1222 part of the overall policing framework of Section 6.1.2. If 1223 nowhere else, routers at the egress of a network SHOULD implement 1224 preferential drop (stronger than the MAY above). For simplicity, 1225 preferences 3,4 & 5 MAY be merged into one preference level. 
   +-------+-----+------------+-------+-------------+------------------+
   |  ECN  |  RE |  Extended  | Worth |  Drop Pref  | Re-ECN meaning   |
   | field | bit |    ECN     |       |  (1 = drop  |                  |
   |       |     | codepoint  |       |     1st)    |                  |
   +-------+-----+------------+-------+-------------+------------------+
   |   01  |  0  |  Re-Echo   |   +1  |      7      | Re-echoed        |
   |       |     |            |       |             | congestion and   |
   |       |     |            |       |             | RECT             |
   |   00  |  1  |    FNE     |   +1  |      6      | Feedback not     |
   |       |     |            |       |             | established      |
   |   11  |  0  |   CE(0)    |   0   |      5      | Congestion       |
   |       |     |            |       |             | experienced with |
   |       |     |            |       |             | Re-Echo          |
   |   01  |  1  |    RECT    |   0   |      4      | Re-ECN capable   |
   |       |     |            |       |             | transport        |
   |   11  |  1  |   CE(-1)   |   -1  |      3      | Congestion       |
   |       |     |            |       |             | experienced      |
   |   10  |  1  |   --CU--   |  n/a  |      2      | Currently Unused |
   |   10  |  0  |    ---     |  n/a  |      2      | Legacy ECN use   |
   |       |     |            |       |             | only             |
   |   00  |  0  |  Not-RECT  |  n/a  |      1      | Not              |
   |       |     |            |       |             | re-ECN-capable   |
   |       |     |            |       |             | transport        |
   +-------+-----+------------+-------+-------------+------------------+

     Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

      The above drop preferences are arranged to preserve packets with
      more positive worth (Section 3.4), given senders of positive
      packets must have honestly declared downstream congestion.  This
      is explained fully in Section 6 on applications.

5.4.  Justification for Setting the First SYN to FNE

   We require clients to consider the first SYN as congestion marked if
   they find out at the end of the handshake that the server was not
   Re-ECT capable.  This way we remove the need to cautiously avoid
   setting the first SYN to Not-RECT.  This will give worse performance
   while deployment is patchy, but better performance once deployment is
   widespread.  Malicious clients may think they can use the advantage
   that ECN-marking gives over drop in launching classic SYN-flood
   attacks.
But the rate limit on FNE codepoints performed by the 1269 ingress policer should be a sufficient countermeasure. 1271 If the server is re-ECN capable, provision is made for it to echo a 1272 possible congestion marking. Congested routers may mark an FNE 1273 packet to CE (see Section 5.3), in which case the packet will arrive 1274 at B with an extended ECN codepoint of CE(-1). So, if the initial 1275 SYN from Re-ECT client A is marked CE(-1), a Re-ECT server B MUST 1276 increment its local value of ECC. But B cannot reflect the value of 1277 ECC in the SYN ACK, because it is still using the 3 bits to negotiate 1278 connection capabilities. So, server B MUST set the alternative TCP 1279 header flags in its SYN ACK: NS=1, CWR=1 and ECE=0 (see Table 5). 1281 It might seem pedantic worrying about these single packets, but this 1282 behaviour ensures the system is safe, even if the application mix on 1283 the Internet evolves to the point where the majority of flows consist 1284 of a single window or even a single packet. It also allows denial of 1285 service attacks to be more easily isolated and prevented. 1287 5.5. Control and Management 1289 5.5.1. Negative Balance Warning 1291 A new ICMP message type is being considered so that a dropper can 1292 warn the apparent sender of a flow that it has started to sanction 1293 the flow. The message would have similar semantics to the `Time 1294 exceeded' ICMP message type. To ensure the sender has to invest some 1295 work before the network will generate such a message, a dropper 1296 SHOULD only send such a message for flows that have demonstrated that 1297 they have started correctly by establishing a positive record, but 1298 have later gone negative. The threshold is up to the implementation. 1299 The purpose of the message is to deconfuse the cause of drops from 1300 other causes, such as congestion or transmission losses. The dropper 1301 would send the message to the sender of the flow, not the receiver. 
   If we did define this message type, it would be REQUIRED for all
   re-ECT senders to parse and understand it.  Note that a sender MUST
   only use this message to explain why losses are occurring.  A sender
   MUST NOT take this message to mean that losses have occurred that it
   was not aware of.  Otherwise, spoofed messages could be sent by
   malicious sources to slow down a sender (cf. ICMP source quench).

   However, the need for this message type is not yet confirmed, as we
   are considering how to prevent it being used by malicious senders to
   scan for droppers and to test their threshold settings.  {ToDo:
   Complete this section.}

5.5.2.  Rate Response Control

   The framework of Section 6.1.2 implies the need for a sender to send
   a request to an ingress policer asking that it be allowed to apply a
   non-default response to congestion (where TCP-friendly is assumed to
   be the default).  This would require the sender to be able to
   discover how to address the policer, and message format(s) would
   have to be defined.  The required control protocol(s) are outside the
   scope of this document, but will require definition elsewhere.

   The policer is likely to be local to the sender and inline, probably
   at the ingress interface to the internetwork.  So, discovery should
   not be hard.  A variety of control protocols already exist for some
   widely used rate-responses to congestion.  For instance DCCP
   congestion control identifiers (CCIDs) fulfil this role and so does
   QoS signalling (e.g. an RSVP request for controlled load service is
   equivalent to a request for no rate response to congestion, but with
   admission control).

5.6.  Tunnels

   For tunnels to work correctly, re-ECN largely requires no more than
   the tunnel handling of regular ECN [RFC3168].
   The RE flag raises an extra issue, but it is more straightforward
   than the ECN field because it is not intended to change along the
   path.  Therefore a tunnel entry point only needs to copy the RE flag
   into the encapsulating header, without any need to negotiate whether
   the tunnel exit supports RE flag handling.

   {ToDo: However, there are some issues to discuss concerning tunnels,
   which will be included in a future version of this draft}

5.7.  Non-Issues

   {ToDo: This section will explain why the addition of re-ECN does not
   interact with any of the following:

   o  Integration with congestion notification in various link layers
      (Ethernet, ATM, and MPLS if it had a congestion notification
      capability added, which is not precluded for the EXP field
      [RFC3270])

   o  Tunnels, and Overlays that wish to support congestion notification
      (see also the brief discussion of edge-to-edge support for re-ECN
      in RSVP or NSIS transports earlier)

   o  Encryption and IPsec

   }

6.  Applications

6.1.  Policing Congestion Response

6.1.1.  The Policing Problem

   The current Internet architecture trusts hosts to respond voluntarily
   to congestion.  Limited evidence shows that the large majority of
   end-points on the Internet comply with a TCP-friendly response to
   congestion.  But telephony (and increasingly video) services over the
   best efforts Internet are attracting the interest of major commercial
   operations.  Most of these applications do not respond to congestion
   at all.  Those that can switch to lower rate codecs still have a
   lower bound below which they must become unresponsive to congestion.

   Even TCP-friendly applications can cause a disproportionate amount of
   congestion, simply by using multiple flows or by transferring data
   continuously.
Also the Internet Architecture has few defences 1382 against distributed denial of service attacks that combine both 1383 problems: unresponsiveness to congestion and flooding with multiple 1384 flows. 1386 Applications that need (or choose) to be unresponsive to congestion 1387 can effectively steal whatever share of bottleneck resources they 1388 want from responsive flows. Whether or not such free-riding is 1389 common, inability to prevent it increases the risk of poor returns 1390 for investors in network infrastructure, leading to under-investment. 1391 An increasing proportion of unresponsive, free-riding demand coupled 1392 with persistent under-supply is a broken economic cycle. Therefore, 1393 if the current, largely co-operative consensus continues to erode, 1394 congestion collapse could become more common in more areas of the 1395 Internet [RFC3714]. 1397 However, while we have designed re-ECN to provide a way to solve 1398 these problems, this does not imply we advocate that every network 1399 should introduce tight controls on those that cause congestion. Re- 1400 ECN has been specifically designed to allow different networks to 1401 choose how conservative or liberal they wish to be with respect to 1402 policing congestion. But those that choose to be conservative can 1403 protect themselves from the excesses that liberal networks allow 1404 their users. 1406 6.1.2. Incentive Framework 1408 The aim is to create an incentive environment that ensures optimal 1409 sharing of capacity despite everyone acting selfishly (including 1410 lying and cheating). Of course, the mechanisms put in place for this 1411 can lie dormant wherever co-operation is the norm. 1413 Throughout this document we focus on path congestion. But most forms 1414 of fairness, including TCP's, also depend on round trip time. So, we 1415 also propose to measure downstream path delay using re-feedback. 
   This proposal will be published in a very simple future draft, but
   for now we give an outline in Appendix E.

   Figure 6 sketches the incentive framework that we will describe piece
   by piece throughout this section.  We will do a first pass in
   overview, then return to each piece in detail.  An internetwork with
   multiple trust boundaries is depicted.  The difference between the
   two plots in the example we used earlier (Figure 1) is plotted below.
   The graph displays downstream path congestion seen in a typical flow
   as it traverses an example path from sender S to receiver R, across
   networks N1, N2 & N4.  Everyone is shown using re-ECN, but we intend
   to show why everyone would /choose/ to use it, correctly and
   honestly.

   Two main types of self-interest can be identified:

   o  Users want to transmit data across the network as fast as
      possible, paying as little as possible for the privilege.  In this
      respect, there is no distinction between senders and receivers,
      but we must be wary of potential malice by one on the other;

   o  Network operators want to maximise revenues from the resources
      they invest in.  They compete amongst themselves for the custom of
      users.

       policer
       A  |
       |  |
       |S <-----N1----> <---N2---> <---N4--> R  domain
       |: :                                  :
       |V :                                  :
   3%  |--------+                            :
       |  :     |                            :
   2%  |  :     +-----------------------+    :
       |  :       downstream congestion |    :
   1%  |  :                             |    :
       |  :                             |    :
   0%  +--------------------------------+=====-->
                0                       i    ^  resource index
                |                       |   /|\
              1.00%                   2.00%  |  marking fraction
                                             |
                                          dropper

   Figure 6: Incentive Framework, showing creation of opposing pressures
     to under-declare and over-declare downstream congestion, using a
                          policer and a dropper

   Source congestion control:  We want to ensure that the sender will
      throttle its rate as downstream congestion increases.
Whatever 1465 the agreed congestion response (whether TCP-compatible or some 1466 enhanced QoS), to some extent it will always be against the 1467 sender's interest to comply. 1469 Ingress policing: But it is in all the network operators' interests 1470 to encourage fair congestion response, so that their investments 1471 are employed to satisfy the most valuable demand. N1 is in the 1472 best position to deploy a policer at its ingress to check that S1 1473 is complying with congestion control (Section 6.1.4). But ingress 1474 policing is not the only possible arrangement. Re-ECN provides 1475 the necessary information for dual control of congestion either by 1476 the sender or by the network ingress. So, in some scenarios (e.g. 1477 sensing devices with minimal capabilities) the network ingress 1478 might do the congestion control as a proxy for the sender. 1480 Edge egress dropper: If the policer ensures the source has less right 1481 to a high rate the higher it declares downstream congestion, the 1482 source has a clear incentive to understate downstream congestion. 1483 But, if packets are understated when they enter the internetwork, 1484 they will be negative when they leave. So, we introduce a dropper 1485 at the last network egress, which drops packets in flows that 1486 persistently declare negative downstream congestion (see 1487 Section 6.1.3 for details). Incidentally, a network can trivially 1488 prevent negative traffic from being sent in the first place by not 1489 permitting a sender to send any CE packets, which would clearly 1490 contravene the ECN protocol. 1492 ..competitive routing 1493 .' : '. 1494 .' p e n a l:t i e s '. 
1495 : | : \ : 1496 A : | : | : 1497 |S <-----N1----> <---N2---> <---N4--> R domain 1498 | : | : | : 1499 | V | : | : 1500 3% |--------+ | : | : 1501 | | V V V V 1502 2% | +-----------------------+ 1503 | downstream congestion | 1504 1% | : | 1505 | : | 1506 0% +--------------------------------+=====--> 1507 0 ^ i resource index 1508 | /|\ | 1509 1.00% | 2.00% marking fraction 1510 | 1511 sanctions 1513 Figure 7: Incentives at Inter-domain Borders 1515 Inter-domain traffic policing: But next we must ask, if congestion 1516 arises downstream (say in N4), what is the ingress network's (N1's) 1517 incentive to police its customers' response? If N1 turns a blind 1518 eye, its own customers benefit while other networks suffer. This is 1519 why all inter-domain QoS architectures (e.g. Intserv, Diffserv) 1520 police traffic each time it crosses a trust boundary. Re-ECN gives 1521 trustworthy information at each trust boundary, which N4 (say) can 1522 use in bulk to police all the responses to congestion of all the 1523 sources beyond its upstream neighbour (N2) with one very simple 1524 passive mechanism, as we will now explain using Figure 7. 1526 But before we do, we need to make a very important point. In the 1527 explanation that follows, we assume a very specific variant of volume 1528 charging between networks. We must make clear that we are not 1529 advocating that everyone should use this form of contract. We are 1530 well aware that the IETF tries to avoid standardising technology that 1531 depends on a particular business model. And we strongly share this 1532 desire to encourage diversity. But our aim is merely to show that 1533 border policing can at least work with this one model, then we can 1534 assume that operators might experiment with the metric in other 1535 models (see Section 6.1.5 for examples). Of course, operators are 1536 free to complement this usage element of their charges with 1537 traditional capacity charging, and we expect they will. 
1539 Emulating policing with inter-domain congestion charging: Between 1540 high-speed networks, we would rather avoid holding back traffic 1541 while it is policed. Instead, once re-ECN has arranged headers to 1542 carry downstream congestion honestly, N2 can contract to pay N4 1543 penalties in proportion to a single bulk count of the congestion 1544 metrics crossing their mutual trust boundary (Section 6.1.5). In 1545 this way, N4 puts pressure on N2 to suppress downstream 1546 congestion, as shown by the solid downward arrow at the egress of 1547 N2. Then N2 has an incentive either to police the congestion 1548 response of its own ingress traffic (from N1) or to charge N1 in 1549 turn on the basis of congestion counted at their mutual boundary. 1550 In this recursive way, the incentives for each flow to respond 1551 correctly to congestion trace back with each flow precisely to 1552 each source, despite the mechanism not recognising flows (see 1553 Section 6.2.2). If N1 turns a blind eye to its own upstream 1554 customers' congestion response, it will still have to pay its 1555 downstream neighbours. 1557 No congestion charging to users: Bulk congestion charging at trust 1558 boundaries is passive and extremely simple, and loses none of its 1559 per-packet precision from one boundary to the next (unlike 1560 Diffserv all-address traffic conditioning agreements, which 1561 dissipate their effectiveness across long topologies). But at any 1562 trust boundary, there is no imperative to use congestion charging. 1563 Traditional traffic policing can be used, if the complexity and 1564 cost is preferred. In particular, at the boundary with end 1565 customers (e.g. between S and N1), traffic policing will most 1566 likely be far more appropriate. Policer complexity is less of a 1567 concern at the edge of the network. And end-customers are known 1568 to be highly averse to the unpredictability of congestion 1569 charging. 
1571 So, NOTE WELL: this document neither advocates nor requires 1572 congestion charging for end customers and advocates but does not 1573 require inter-domain congestion charging. 1575 Competitive discipline of inter-domain traffic engineering: With 1576 inter-domain congestion charging, a domain seems to have a 1577 perverse incentive to fake congestion; N2's profit depends on the 1578 difference between congestion at its ingress (its revenue) and at 1579 its egress (its cost). So, overstating internal congestion seems 1580 to increase profit. However, smart border routing [Smart_rtg] by 1581 N1 will bias its multipath routing towards the least cost routes. 1582 So, N2 risks losing all its revenue to competitive routes if it 1583 overstates congestion (see Section 6.2.3). In other words, if N2 1584 is the least congested route, its ability to raise excess profits 1585 is limited by the congestion on the next least congested route. 1586 This pressure on N2 to remain competitive is represented by the 1587 dotted downward arrow at the ingress to N2 in Figure 7. 1589 Closing the loop: All the above elements conspire to trap everyone 1590 between two opposing pressures (upper half of Figure 6), ensuring 1591 the downstream congestion metric arrives at the destination 1592 neither above nor below zero. So, we have arrived back where we 1593 started in our argument. The ingress edge network can rely on 1594 downstream congestion declared in the packet headers presented by 1595 the sender. So it can police the sender's congestion response 1596 accordingly. 1598 6.1.2.1. The Case against Classic Feedback 1600 A system that produces an optimal outcome as a result of everyone's 1601 selfish actions is extremely powerful. But why do we have to change 1602 to re-ECN to achieve it? Can't classic congestion feedback (as used 1603 already by standard ECN) be arranged to provide similar incentives? 1604 Superficially it can. 
Given ECN already existed, this was the 1605 deployment path Kelly proposed for his seminal work that used self- 1606 interest to optimise a system of networks and users (summarised in 1607 [Evol_cc]). The mechanism was nearly identical to volume charging; 1608 except only the volume of packets marked with congestion experienced 1609 (CE) was counted. 1611 However, below we explain why relying on classic feedback /required/ 1612 congestion charging to be used, while re-ECN achieves the same 1613 powerful outcome, but does not /require/ congestion charging. In 1614 brief, the problem with classic feedback is that the incentives have 1615 to trace the indirect path back to the sender---the long way round 1616 the feedback loop. For example, if classic feedback were used in 1617 Figure 6, N2 would have had to influence N1 via N4, R & S rather than 1618 directly. 1620 Inability to agree what is happening downstream: In order to police 1621 its upstream neighbour's congestion response, the neighbours 1622 should be able to agree on the congestion to be responded to. 1623 Whatever the feedback regime, as packets change hands at each 1624 trust boundary, any path metrics they carry are verifiable by both 1625 neighbours. But, with a classic path metric, they can only agree 1626 on the /upstream/ path congestion. 1628 Inaccessible back-channel: The network needs a whole-path congestion 1629 metric to control the source. Classically, whole path congestion 1630 emerges at the destination, to be fed back from receiver to sender 1631 in a back-channel. But, in any data network, back-channels need 1632 not be visible to relays, as they are essentially communications 1633 between the end-points. They may be encrypted, asymmetrically 1634 routed or simply omitted, so no network element can reliably 1635 intercept them. 
The congestion charging literature solves this 1636 problem by charging the receiver and assuming this will cause the 1637 receiver to refer the charges to the sender. But, of course, this 1638 creates unintended side-effects... 1640 `Receiver pays' unacceptable: In connectionless datagram networks, 1641 receivers and receiving networks cannot prevent reception from 1642 malicious senders, so `receiver pays' opens them to `denial of 1643 funds' attacks. 1645 End-user congestion charging unacceptable: Even if 'denial of funds' 1646 were not a problem, we know that end-users are highly averse to 1647 the unpredictability of congestion charging and anyway, we want to 1648 avoid restricting network operators to just one retail tariff. 1649 But with classic feedback only an upstream metric is available, so 1650 we cannot avoid having to wrap the `receiver pays' money flow 1651 around the feedback loop, necessarily forcing end-users to be 1652 subjected to congestion charging. 1654 To summarise so far, with classic feedback, policing congestion 1655 response /requires/ congestion charging of end-users and a `receiver 1656 pays' model, whereas, with re-ECN, incentives can be fashioned either 1657 by technical policing mechanisms (more appropriate for end users) or 1658 by congestion charging using the safer `sender pays' model (more 1659 appropriate inter-domain). 1661 We now take a second pass over the incentive framework, filling in 1662 the detail. 1664 6.1.3. Egress Dropper 1666 As traffic leaves the last network before the receiver (domain N4 in 1667 Figure 6), the RE blanking fraction in a flow should match the CE 1668 congestion marking fraction. If it is less (a negative flow), it 1669 implies that the source is understating path congestion (which will 1670 reduce the penalties that N2 owes N4). 
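
   The comparison between the RE blanking fraction and the CE marking
   fraction can be illustrated with a small sketch.  It is not the
   Appendix D algorithm; it simply sums the `worth' of each packet from
   Table 7, weighted by packet size, over a set of observed packets.
   The function names and the choice to count in octets are assumptions
   made here for illustration.

```python
# Illustrative sketch only: function names are hypothetical.
# Worth of each extended ECN codepoint, taken from Table 7.
WORTH = {"Re-Echo": +1, "FNE": +1, "CE(0)": 0, "RECT": 0, "CE(-1)": -1}


def downstream_balance(packets):
    """Sum worth * size over (codepoint, size_in_octets) pairs.

    Returns (balance, total), where total counts only re-ECN packets.
    Packets whose codepoint has no worth in Table 7 are ignored.
    """
    balance = 0
    total = 0
    for codepoint, size in packets:
        worth = WORTH.get(codepoint)
        if worth is None:
            continue  # not a re-ECN codepoint with a defined worth
        balance += worth * size
        total += size
    return balance, total


def classify(packets):
    """Label the observed traffic as positive, neutral or negative."""
    balance, total = downstream_balance(packets)
    if total == 0:
        return "not re-ECN"
    if balance < 0:
        return "negative"  # congestion understated
    if balance > 0:
        return "positive"  # congestion overstated
    return "neutral"       # blanked volume matches CE-marked volume
```

   For example, a flow of 100 equal-sized packets with 3 Re-Echo and 3
   CE(-1) packets is neutral, whereas the same flow with no Re-Echo
   packets is negative.  A real dropper would work from a moving
   average rather than a complete packet list.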
   If flows are positive, N4 need take no action---this simply means its
   upstream neighbour is paying more penalties than it needs to, and the
   source is going slower than it needs to.  But, to protect itself
   against persistently negative flows, N4 should install a dropper at
   its egress.  Appendix D gives a suggested algorithm for the dropper,
   meeting the criteria below.

   o  It SHOULD introduce minimal false positives for honest flows;

   o  It SHOULD quickly detect and sanction dishonest flows (minimal
      false negatives);

   o  It MUST be invulnerable to state exhaustion attacks from malicious
      sources.  For instance, if the dropper uses flow-state, it should
      not be possible for a source to send numerous packets, each with a
      different flow ID, to force the dropper to exhaust its memory
      capacity;

   o  It MUST introduce sufficient loss in goodput so that malicious
      sources cannot play off losses in the egress dropper against
      higher allowed throughput.  Salvatori [CLoop_pol] describes this
      attack, which involves the source understating path congestion
      then inserting forward error correction (FEC) packets to
      compensate for expected losses.

   Note that the dropper operates on flows but we would like it not to
   require per-flow state.  This is why we have been careful to ensure
   that all flows MUST start with a packet marked with the FNE
   codepoint.  If a flow does not start with the FNE codepoint, a
   dropper is likely to treat it unfavourably.  This risk makes it worth
   setting the FNE codepoint at the start of a flow, even though there
   is a cost to the sender of setting FNE (positive `worth').  Indeed,
   with the FNE codepoint, the rate at which a sender can generate new
   flows can be limited (Appendix F).  In this respect, the FNE
   codepoint works like Clark's state set-up bit [Steps_DoS].
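
   One way to picture a limit on the rate of new flows is a token bucket
   applied to FNE packets.  The sketch below is only an illustration of
   that general technique; it is not the design in Appendix F, and the
   class name, parameters and the choice of a token bucket are all
   assumptions made here.

```python
# Illustrative sketch only: a generic token bucket, not the Appendix F design.
class FneRateLimiter:
    """Admit FNE packets at a bounded average rate with a bounded burst."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s  # tokens (FNE packets) added per second
        self.burst = burst      # maximum number of stored tokens
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous update, in seconds

    def admit(self, now: float) -> bool:
        """Return True if an FNE packet arriving at time `now` conforms."""
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

   Because the state is a few numbers per rate-limited aggregate rather
   than per flow, a source cannot exhaust it by inventing flow IDs.
   What happens to non-conforming FNE packets (drop, or treat as
   Not-RECT) is left open here.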
   Appendix F also gives an example dropper implementation that
   aggregates flow state.  Dropper algorithms will often maintain a
   moving average across flows of the fraction of RE blanked packets.
   When maintaining an average across flows, a dropper SHOULD only allow
   flows into the average if they start with FNE, but it SHOULD NOT
   include packets with the FNE codepoint set in the average.  An
   ingress gateway sets the FNE codepoint when it does not have the
   benefit of feedback from the ingress.  So, counting packets with FNE
   cleared would be likely to make the average unnecessarily positive,
   providing headroom (or should we say footroom?) for dishonest
   (negative) traffic.

   If the dropper detects a persistently negative flow, it SHOULD drop
   sufficient negative and neutral packets to stop the flow being
   negative.  Drops SHOULD be focused on just sufficient packets in
   misbehaving flows to remove the negative bias while doing minimal
   harm.

6.1.4.  Rate Policing

   [XCHOKe] & [pBox] are good approaches to rate policing traffic
   without the benefit of whole path information, such as could be
   provided by re-ECN.  But they must be deployed at bottlenecks in
   order to work.  Unfortunately, a large proportion of traffic
   traverses at least two bottlenecks (in the two access networks),
   particularly with the current traffic mix where peer-to-peer
   file-sharing is prevalent.  These `bottleneck policers' could be
   adapted to combine ECN congestion marking from the upstream path with
   local congestion knowledge.  But then the only useful placement for
   them would be close to the egress of the network.

   Moreover, if these bottleneck policers were widely deployed, the
   Internet would find itself with one universal rate adaptation policy
   (TCP-friendliness) embedded throughout the network.
Given TCP's 1742 congestion control algorithm is already known to be hitting its 1743 scalability limits and new algorithms are being developed for high- 1744 speed congestion control, embedding TCP policing into the Internet 1745 would make evolution to new algorithms extremely painful. If a 1746 source wanted to use a different algorithm, it would have to both 1747 discover and negotiate with a policer in some remote access network, 1748 as well as possibly others on its path. 1750 Therefore, re-ECN has been designed to avoid the need for bottleneck 1751 policing so that we can avoid the threat of a single rate adaptation 1752 policy throughout the network. Instead, re-ECN allows the access 1753 network operator at the ingress to choose which rate adaptation to 1754 enforce. If desired, the re-ECN wire protocol allows these ingress 1755 policers to perform per-flow policing according to the widely adopted 1756 TCP rate adaptation, but it also allows new rate adaptation policies 1757 beyond TCP to be enforced. Further, it also allows the flexibility 1758 for networks to choose to police users as a whole, rather than flows 1759 (see Appendix F for example designs). 1761 o The particular rate adaptation may be agreed bilaterally between 1762 the sender and its ingress provider (Section 5.5.2), which would 1763 greatly improve the evolvability of congestion control, requiring 1764 only a single, local box to be updated upon changes. Of course, 1765 one would currently expect TCP to be the default of choice. 1767 o Bottleneck policing can easily be circumvented, opening multiple 1768 flows by varying the active end-point port number; or by spoofing 1769 the source address but arranging with the receiver to hide the 1770 true return address at a higher layer. 1772 A useful feature of re-ECN is that it provides all the information a 1773 policer needs directly in the packets being policed. 
Re-Echo packets 1774 represent congestion echoes as far as an ingress policer is 1775 concerned. So, even policing TCP's AIMD algorithm is relatively 1776 straightforward. Appendix F presents an example design, but the 1777 choice of the preferred mechanism is up to the implementer. 1779 Finally, we must not forget that an easy way to circumvent re-ECN's 1780 defences is for the source to turn off re-ECN support, by setting the 1781 Not-RECT codepoint, implying legacy traffic. Therefore an ingress 1782 policer must put a general rate-limit on Not-RECT traffic, which 1783 SHOULD be lax during early, patchy deployment, but will have to 1784 become stricter as deployment widens. Similarly, flows starting 1785 without an FNE packet can be confined by a strict rate-limit used for 1786 the remainder of flows that haven't proved they are well-behaved by 1787 starting correctly (therefore they need not consume any flow state--- 1788 they are just confined to the `misbehaving' bin if they carry an 1789 unrecognised flow ID). Also, as already pointed out, an ingress rate 1790 policer MUST block both CE codepoints, as traffic that is already 1791 negative as soon as it is sent must be invalid. 1793 6.1.5. Inter-domain Policing 1795 Section 6.1.2, outlining the whole Incentive Framework above, has 1796 already explained how neighbouring domains can arrange their contract 1797 with each other so that a network can penalise its upstream 1798 neighbour in proportion to the total downstream congestion that 1799 crosses the interface between them over an accounting period. That 1800 is, a simple count of the volume of data in packets with RE blanked 1801 minus the volume with CE marked over, say, a month. 1803 Full details of how this can be done, why it works and a security 1804 analysis are available in a sister Internet Draft entitled `Emulating 1805 Border Flow Policing using Re-ECN on Bulk Data' [Re-PCN].
That I-D 1806 gives examples of how downstream networks can police the aggregate 1807 congestion response of their upstream neighbours, against different 1808 contractual arrangements. The goal is to ensure an upstream network 1809 in turn polices its upstream networks, eventually ensuring upstream 1810 networks will suffer if they do not police the rate response to 1811 congestion of their users. 1813 The scenario used in [Re-PCN] is one where re-ECN is used edge-to- 1814 edge rather than end-to-end as in the present document. However, the 1815 position at inter-domain borders is nearly identical. {ToDo: A 1816 summary of the relevant aspects of that I-D will be included here, 1817 but due to lack of time this has had to be deferred for the next 1818 version.} 1820 6.1.6. Simulations 1822 Simulations of policer and dropper performance done for the multi-bit 1823 version of re-feedback have been included in section 5 "Dropper 1824 Performance" of [Re-fb]. Simulations of policer and dropper for the 1825 re-ECN version described in this document are work in progress. 1827 6.2. Other Applications 1829 {ToDo: Other applications of re-ECN will be briefly outlined here 1830 (largely drawing from section 3 of [Re-fb]), such as: } 1832 6.2.1. DDoS Mitigation 1834 A flooding attack is inherently about congestion of a resource. 1835 Because re-ECN ensures the sources causing network congestion 1836 experience the cost of their own actions, it acts as a first line of 1837 defence against DDoS. As load focuses on a victim, upstream queues 1838 grow, requiring honest sources to pre-load packets with a higher 1839 fraction of positive packets. Once downstream routers are so 1840 congested that they are dropping traffic, they will be CE marking the 1841 traffic they do forward 100%. Honest sources will therefore be 1842 sending Re-Echo 100% (and therefore being severely rate-limited at 1843 the ingress). 
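As an illustration only, the severe rate-limiting at the ingress mentioned above could be pictured as a per-user bucket of congestion credit that positive (Re-Echo) packets drain. The following Python sketch is not the Appendix F design; the class name CongestionBucket, its parameters fill_rate and depth, and the choice to drop only positive packets once credit runs out are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CongestionBucket:
    """Hypothetical per-user ingress policer state (illustrative only)."""
    fill_rate: float        # assumed allowance: bytes of positive traffic per second
    depth: float            # assumed cap on credit that can be saved up, in bytes
    credit: float = 0.0     # credit currently available, in bytes
    last_time: float = 0.0  # time of the previous packet, in seconds

    def admit(self, now: float, size: int, positive: bool) -> bool:
        """Return True to forward the packet, False to drop it."""
        # Credit accrues with elapsed time but never exceeds the bucket depth.
        elapsed = now - self.last_time
        self.credit = min(self.depth, self.credit + elapsed * self.fill_rate)
        self.last_time = now
        if not positive:
            # Neutral packets declare no congestion, so they cost no credit.
            return True
        if self.credit >= size:
            # A positive packet spends credit equal to its size.
            self.credit -= size
            return True
        # Out of credit: the user cannot declare any more congestion for now.
        return False


if __name__ == "__main__":
    bucket = CongestionBucket(fill_rate=1000.0, depth=3000.0, credit=3000.0)
    # A fully congested path: every 1500-byte packet must be positive.
    sent = [bucket.admit(float(t), 1500, True) for t in range(1, 11)]
    print(sum(sent), "of", len(sent), "positive packets forwarded")
```

Under these assumptions a source on a path marking 100% is held close to its congestion allowance once its saved credit is spent, while neutral packets sent over an uncongested path pass untouched.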
1845 Malicious sources can either do the same as honest sources, and be 1846 rate-limited at ingress, or they can understate congestion by sending 1847 more neutral RECT packets than they should. If sources understate 1848 congestion (i.e. do not re-echo sufficient positive packets) and the 1849 preferential drop ranking is implemented on routers (Section 5.3), 1850 these routers will preserve positive traffic until last. So, the 1851 neutral traffic from malicious sources will all be automatically 1852 dropped first. Either way, the malicious sources cannot send more 1853 than honest sources. 1855 Further, DDoS sources will tend to be re-used by different 1856 controllers for different attacks. They will therefore build up a 1857 long term history of causing congestion. So, as long as the 1858 population of potentially compromisable hosts around the Internet is 1859 limited, the per-user policing algorithms in Appendix F.1 will 1860 gradually throttle down the zombies. Widespread 1861 deployment of re-ECN could therefore considerably dampen the force of DDoS. 1862 Zombie armies could hold back from attacking for long enough to be 1863 able to build up enough credit in the per-user policers to launch an 1864 attack. But they would then still be limited to no more throughput 1865 than other, honest users. 1867 Inter-domain traffic policing (see Section 6.1.5) ensures that any 1868 network that harbours compromised `zombie' hosts will have to bear 1869 the cost of the congestion caused by the packets of the zombies in 1870 downstream networks. Such networks will be incentivised to deploy 1871 per-user policers that rate-limit hosts unresponsive to congestion so 1872 they can only send very slowly into congested paths. As well as 1873 protecting other networks, the extremely poor performance at any sign 1874 of congestion will incentivise the zombie's owner to clean it up. 1876 However, the host should behave normally when using uncongested 1877 paths.
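The border accounting that underlies this incentive (Section 6.1.5) amounts to a running sum over the packets crossing an interface: the volume with RE blanked minus the volume with CE marked. A minimal sketch follows, assuming a hypothetical BorderPacket record with pre-parsed flags and a hypothetical price_per_byte; clamping the penalty at zero is also an assumption of the example rather than something stated in this document.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BorderPacket:
    """Hypothetical summary of one packet seen at an inter-domain border."""
    size: int         # packet size in bytes
    re_blanked: bool  # True if the packet arrived with RE blanked
    ce_marked: bool   # True if the packet arrived with CE marked


def downstream_congestion_volume(packets: Iterable[BorderPacket]) -> int:
    """Bytes with RE blanked minus bytes with CE marked."""
    total = 0
    for p in packets:
        if p.re_blanked:
            total += p.size
        if p.ce_marked:
            total -= p.size
    return total


def penalty(packets: Iterable[BorderPacket], price_per_byte: float) -> float:
    """Charge owed by the upstream neighbour for one accounting period.

    Assumption: a negative balance is not refunded, so it is clamped to zero.
    """
    return max(0, downstream_congestion_volume(packets)) * price_per_byte


if __name__ == "__main__":
    period = [
        BorderPacket(1500, re_blanked=True, ce_marked=False),
        BorderPacket(1500, re_blanked=False, ce_marked=False),
        BorderPacket(1500, re_blanked=False, ce_marked=True),
        BorderPacket(1000, re_blanked=True, ce_marked=False),
    ]
    print(downstream_congestion_volume(period), penalty(period, 0.5))
```

Because the count needs only two byte counters per interface, it requires no per-flow state at the border.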
1879 6.2.2. End-to-end QoS 1881 {ToDo: } 1883 6.2.3. Traffic Engineering 1885 {ToDo: } 1887 6.2.4. Inter-Provider Service Monitoring 1889 {ToDo: } 1891 6.3. Limitations 1893 This section will discuss the limitations of the re-ECN approach, 1894 particularly: 1896 o Malicious users have the ability to turn off ECT. Given Not-ECT 1897 traffic cannot be efficiently policed, users would be able to get 1898 a considerable advantage that would not be simply compensated by 1899 their being the preferential candidates for drops in case of 1900 sustained congestion. For this reason, we recommend that, while 1901 accommodating a smooth initial transition to re-ECN, policers 1902 should gradually be tuned to rate limit Not-ECT traffic in the 1903 long term. 1905 o Re-feedback for TTL (re-TTL) would also be desirable at the same 1906 time as re-ECN. Unfortunately this requires a further agreement 1907 to standardise the mechanisms briefly described in Appendix E. 1909 o We are considering the issue of whether it would be useful to 1910 truncate rather than drop packets that appear to be malicious, so 1911 that the feedback loop is not broken but useful data can be 1912 removed. 1914 o The inability to police excessive congestion when it causes an 1915 ECN-capable router to drop ECT traffic rather than marking it. 1916 Re-ECN allows policing of downstream explicit congestion 1917 notifications, not drops. 1919 7. Incremental Deployment 1920 7.1. Incremental Deployment Features 1922 We chose to use ECT(1) for Re-ECN traffic deliberately. Existing ECN 1923 sources set ECT(0) at either 50% (the nonce) or 100% (the default). 1924 So they will appear to a re-ECN policer as very highly congested 1925 paths. When policers are first deployed they can be configured 1926 permissively, allowing through both `legacy' ECN and misbehaving re- 1927 ECN flows. Then, the more strictly the threshold is set, the more 1928 legacy ECN sources will gain by upgrading to re-ECN.
Thus, towards 1929 the end of the voluntary incremental deployment period, legacy 1930 transports can be given progressively stronger encouragement to 1931 upgrade. 1933 {ToDo: As well as introducing the new information above, this section 1934 is intended to collect together all the snippets of information 1935 throughout the draft about incremental deployment. Through lack of 1936 time, this rationalisation will have to wait until the next version, 1937 except for the brief list below. However, a long section describing 1938 possible deployment scenarios is available in the section following.} 1940 Re-ECN semantics for use of the two-bit ECN field are different in 1941 the following minor respects compared to RFC3168: 1943 o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender 1944 sets ECT(0) by default; 1946 o No provision is necessary for a re-ECN capable source transport to 1947 use the ECN nonce; 1949 o Routers MAY preferentially drop different extended ECN codepoints; 1951 o Packets carrying the feedback not established (FNE) codepoint MAY 1952 optionally be marked rather than dropped by routers, even though 1953 their ECN field is Not-ECT (with the important caveat in 1954 "retcp_Router_Forwarding_Behaviour"); 1956 o Packets may be dropped by policing nodes because of apparent 1957 misbehaviour, not just because of congestion. 1959 None of these changes REQUIRE any modifications to routers. 1961 7.2. Incremental Deployment Incentives 1963 It would only be worth standardising the re-ECN protocol if there 1964 existed a coherent story for how it might be incrementally deployed. 1965 In order for it to have a chance of deployment, everyone who needs to 1966 act, must have a strong incentive to act, and the incentives must 1967 arise in the order that deployment would have to happen. 
Re-ECN 1968 works around unmodified ECN routers, but we can't just discuss why 1969 and how re-ECN deployment might build on ECN deployment, because 1970 there is precious little to build on in the first place. Instead, we 1971 aim to show that re-ECN deployment could carry ECN with it. We focus 1972 on commercial deployment incentives, although some of the arguments 1973 apply equally to academic or government sectors. 1975 ECN deployment: 1977 ECN is largely implemented in commercial routers, but generally 1978 not as a supported feature, and it has largely not been deployed 1979 by commercial network operators. It has been released in many 1980 Unix-based operating systems, but not in proprietary OSs like 1981 Windows or those in many mobile devices. For detailed deployment 1982 status, see [ECN-Deploy]. We believe the reason ECN deployment 1983 has not happened is twofold: 1985 * ECN requires changes to both routers and hosts. If someone 1986 wanted to sell the improvement that ECN offers, they would have 1987 to co-ordinate deployment of their product with others. An ECN 1988 server only gives any improvement on an ECN network. An ECN 1989 network only gives any improvement if used by ECN devices. 1990 Deployment that requires co-ordination adds cost and delay and 1991 tends to dilute any competitive advantage that might be gained. 1993 * ECN `only' gives a performance improvement. Making a product a 1994 bit faster (whether the product is a device or a network), 1995 isn't usually a sufficient selling point to be worth the cost 1996 of co-ordinating across the industry to deploy it. Network 1997 operators tend to avoid re-configuring a working network unless 1998 launching a new product. 2000 ECN and re-ECN for Edge-to-edge Assured QoS: 2002 We believe the proposal to provide assured QoS sessions using a 2003 form of ECN called pre-congestion notification (PCN) [CL-arch] is 2004 most likely to break the deadlock in ECN deployment first. 
It 2005 only requires edge-to-edge deployment so it does not require 2006 endpoint support. It can be deployed in a single network, then 2007 grow incrementally to interconnected networks. And it provides a 2008 different `product' (internetworked assured QoS), rather than 2009 merely making an existing product a bit faster. 2011 Not only could this assured QoS application kick-start ECN 2012 deployment, it could also carry re-ECN deployment with it; because 2013 re-ECN can enable the assured QoS region to expand to a large 2014 internetwork where neighbouring networks do not trust each other. 2015 [Re-PCN] argues that re-ECN security should be built in to the QoS 2016 system from the start, explaining why and how. 2018 If ECN and re-ECN were deployed edge-to-edge for assured QoS, 2019 operators would gain valuable experience. They would also clear 2020 away many technical obstacles such as firewall configurations that 2021 block all but the legacy settings of the ECN field and the RE 2022 flag. 2024 ECN in Access Networks: 2026 The next obstacle to ECN deployment would be extension to access 2027 and backhaul networks, where considerable link layer differences 2028 make implementation non-trivial, particularly on congested 2029 wireless links. ECN and re-ECN work fine during partial 2030 deployment, but they will not be very useful if the most congested 2031 elements in networks are the last to support them. Access network 2032 support is one of the weakest parts of this deployment story. All 2033 we can hope is that, once the benefits of ECN are better 2034 understood by operators, they will push for the necessary link 2035 layer implementations as deployment proceeds. 2037 Policing Unresponsive Flows: 2039 Re-ECN allows a network to offer differentiated quality of service 2040 as explained in Section 6.2.2. But we do not believe this will 2041 motivate initial deployment of re-ECN, because the industry is 2042 already set on alternative ways of doing QoS.
Despite being much 2043 more complicated and expensive, the alternative approaches are 2044 here and now. 2046 But re-ECN is critical to QoS deployment in another respect. It 2047 can be used to prevent applications from taking whatever bandwidth 2048 they choose without asking. 2050 Currently, applications that remain resolute in their lack of 2051 response to congestion are rewarded by other TCP applications. In 2052 other words, TCP is naively friendly, in that it reduces its rate 2053 in response to congestion whether it is competing with friends 2054 (other TCPs) or with enemies (unresponsive applications). 2056 Therefore, those network owners that want to sell QoS will be keen 2057 to ensure that their users can't help themselves to QoS for free. 2058 Given the very large revenues at stake, we believe effective 2059 policing of congestion response will become highly sought after by 2060 network owners. 2062 But this does not necessarily argue for re-ECN deployment. 2063 Network owners might choose to deploy bottleneck policers rather 2064 than re-ECN-based policing. However, under Related Work 2065 (Section 9) we argue that bottleneck policers are inherently 2066 vulnerable to circumvention. 2068 Therefore we believe there will be a strong demand from network 2069 owners for re-ECN deployment so they can police flows that do not 2070 ask to be unresponsive to congestion, in order to protect their 2071 revenues from flows that do ask (QoS). In particular, we suspect 2072 that the operators of cellular networks will want to prevent VoIP 2073 and video applications being used freely on their networks as a 2074 more open market develops in GPRS and 3G devices. 2076 Initial deployments are likely to be isolated to single cellular 2077 networks. Cellular operators would first place requirements on 2078 device manufacturers to include re-ECN in the standards for mobile 2079 devices. In parallel, they would put out tenders for ingress and 2080 egress policers. 
Then, after a while they would start to tighten 2081 rate limits on Not-ECT traffic from non-standard devices and they 2082 would start policing whatever non-accredited applications people 2083 might install on mobile devices with re-ECN support in the 2084 operating system. This would force even independent mobile device 2085 manufacturers to provide re-ECN support. Early standardisation 2086 across the cellular operators is likely, including interconnection 2087 agreements with penalties for excess downstream congestion. 2089 We suspect some fixed broadband networks (whether cable or DSL) 2090 would follow a similar path. However, we also believe that larger 2091 parts of the fixed Internet would not choose to police on a per- 2092 flow basis. Some might choose to police congestion on a per-user 2093 basis in order to manage heavy peer-to-peer file-sharing, but it 2094 seems likely that a sizeable majority would not deploy any form of 2095 policing. 2097 This hybrid situation raises the question, "How does re-ECN work for 2098 networks that choose to use policing if they connect with others 2099 that don't?" Traffic from non-ECN capable sources will arrive 2100 from other networks and cause congestion within the policed, ECN- 2101 capable networks. So networks that chose to police congestion 2102 would rate-limit Not-ECT traffic throughout their network, 2103 particularly at their borders. They would probably also set 2104 higher usage prices in their interconnection contracts for 2105 incoming Not-ECT and Not-RECT traffic. We assume that 2106 interconnection contracts between networks in the same tier will 2107 include congestion penalties before contracts with provider 2108 backbones do. 2110 A hybrid situation could remain for all time. As was explained in 2111 the introduction, we believe in healthy competition between 2112 policing and not policing, with no imperative to convert the whole 2113 world to the religion of policing.
Networks that chose not to 2114 deploy egress droppers would leave themselves open to being 2115 congested by senders in other networks. But that would be their 2116 choice. 2118 The important aspect of the egress dropper though is that it most 2119 protects the network that deploys it. If a network does not 2120 deploy an egress dropper, sources sending into it from other 2121 networks will be able to understate the congestion they are 2122 causing. Whereas, if a network deploys an egress dropper, it can 2123 know how much congestion other networks are dumping into it. And 2124 apply penalties or charges accordingly. So, whether or not a 2125 network polices its own sources at ingress, it is in its interests 2126 to deploy an egress dropper. 2128 Host support: 2130 In the above deployment scenario, host operating system support 2131 for re-ECN came about through the cellular operators demanding it 2132 in device standards (i.e. 3GPP). Of course, increasingly, mobile 2133 devices are being built to support multiple wireless technologies. 2134 So, if re-ECN were stipulated for cellular devices, it would 2135 automatically appear in those devices connected to the wireless 2136 fringes of fixed networks if they coupled cellular with WiFi or 2137 Bluetooth technology, for instance. Also, once implemented in the 2138 operating system of one mobile device, it would tend to be found 2139 in other devices using the same family of operating system. 2141 Therefore, whether or not a fixed network deployed ECN, or 2142 deployed re-ECN policers and droppers, many of its hosts might 2143 well be using re-ECN over it. Indeed, they would be at an 2144 advantage when communicating with hosts across Re-ECN policed 2145 networks that rate limited Not-RECT traffic. 2147 Other possible scenarios: 2149 The above is thankfully not the only plausible scenario we can 2150 think of. 
One of the many clubs of operators that meet regularly 2151 around the world might decide to act together to persuade a major 2152 operating system manufacturer to implement re-ECN. And they may 2153 agree between them on an interconnection model that includes 2154 congestion penalties. 2156 Re-ECN provides an interesting opportunity for device 2157 manufacturers as well as network operators. Policers can be 2158 configured loosely when first deployed. Then as re-ECN take-up 2159 increases, they can be tightened up, so that a network with re-ECN 2160 deployed can gradually squeeze down the service provided to legacy 2161 devices that have not upgraded to re-ECN. Many device vendors 2162 rely on replacement sales. And operating system companies rely 2163 heavily on new release sales. Also support services would like to 2164 be able to force stragglers to upgrade. So, the ability to 2165 throttle service to legacy operating systems is quite valuable. 2167 Also, policing unresponsive sources may not be the only or even 2168 the first application that drives deployment. It may be policing 2169 causes of heavy congestion (e.g. peer-to-peer file-sharing). Or 2170 it may be mitigation of denial of service. Or we may be wrong in 2171 thinking simpler QoS will not be the initial motivation for re-ECN 2172 deployment. Indeed, the combined pressure for all these may be 2173 the motivator, but it seems optimistic to expect such a level of 2174 joined-up thinking from today's communications industry. We 2175 believe a single application alone must be a sufficient motivator. 2177 In short, everyone gains from adding accountability to TCP/IP, 2178 except the selfish or malicious. So, deployment incentives tend 2179 to be strong. 2181 8. Architectural Rationale 2183 In the Internet's technical community the danger of not responding to 2184 congestion is well-understood, with its attendant risk of congestion 2185 collapse [RFC3714]. 
However, many of the Internet's commercial 2186 community consider that the very essence of IP is to provide open 2187 access to the internetwork for all applications. Congestion is seen 2188 as a symptom of over-conservative investment. And the goal of 2189 application design is to find novel ways to continue working despite 2190 congestion. They argue that the Internet was never intended to be 2191 solely for TCP-friendly applications. Another side of the Internet's 2192 commercial community believe that it is no use providing a network 2193 for novel applications if it has insufficient capacity. And it will 2194 always have insufficient capacity unless a greater share of 2195 application revenues can be /assured/ for the infrastructure 2196 provider. Otherwise the major investments required will carry too 2197 much risk and won't happen. 2199 The lesson articulated in [Tussle] is that we shouldn't embed our 2200 view on these arguments into the Internet at design time. Instead we 2201 should design the Internet so that the outcome of these arguments can 2202 get decided at run-time. Re-ECN is designed in that spirit. Once 2203 the protocol is available, different network operators can choose how 2204 liberal they want to be in holding people accountable for the 2205 congestion they cause. Some might boldly invest in capacity and not 2206 police its use at all, hoping that novel applications will result. 2207 Others might use re-ECN for fine-grained flow policing, expecting to 2208 make money selling vertically integrated services. Yet others might 2209 sit somewhere half-way, perhaps doing coarse, per-user policing. All 2210 might change their minds later. But re-ECN always allows them to 2211 interconnect so that the careful ones can protect themselves from the 2212 liberal ones. 
2214 The incentive-based approach used for re-ECN is based on Gibbens and 2215 Kelly's arguments [Evol_cc] on allowing endpoints the freedom to 2216 evolve new congestion control algorithms for new applications. They 2217 ensured responsible behaviour despite everyone's self-interest by 2218 applying pricing to ECN marking, and Kelly had proved stability and 2219 optimality in an earlier paper. 2221 Re-ECN keeps all the underlying economic incentives, but rearranges 2222 the feedback. The idea is to allow a network operator (if it 2223 chooses) to deploy engineering mechanisms like policers at the front 2224 of the network which can be designed to behave /as if/ they are 2225 responding to congestion prices. Rather than having to subject users 2226 to congestion pricing, networks can then use more traditional 2227 charging regimes (or novel ones). But the engineering can constrain 2228 the overall amount of congestion a user can cause. This provides a 2229 buffer against completely outrageous congestion control, but still 2230 makes it easy for novel applications to evolve if they need different 2231 congestion control to the norms. It also allows novel charging 2232 regimes to evolve. 2234 Despite being achieved with a relatively minor protocol change, re- 2235 ECN is an architectural change. Previously, Internet congestion 2236 could only be controlled by the data sender, because it was the only 2237 one both in a position to control the load and in a position to see 2238 information on congestion. Re-ECN levels the playing field. It 2239 recognises that the network also has a role to play in moderating 2240 (policing) congestion control. But policing is only truly effective 2241 at the first ingress into an internetwork, whereas path congestion 2242 was previously only visible at the last egress. So, re-ECN 2243 democratises congestion information. 
Then the choice over who 2244 actually controls congestion can be made at run-time, not design 2245 time---a bit like an aircraft with dual controls. And different 2246 operators can make different choices. We believe non-architectural 2247 approaches to this problem are unlikely to offer more than partial 2248 solutions (see Section 9). 2250 Importantly, re-ECN does NOT REQUIRE assumptions about specific 2251 congestion responses to be embedded in any network elements, except 2252 at the first ingress to the internetwork if that level of control is 2253 desired by the ingress operator. But such tight policing will be a 2254 matter of agreement between the source and its access network 2255 operator. The ingress operator need not police congestion response 2256 at flow granularity; it can simply hold a source responsible for the 2257 aggregate congestion it causes, perhaps keeping it within a monthly 2258 congestion quota. Or if the ingress network trusts the source, it 2259 can do nothing. 2261 Therefore, the aim of the re-ECN protocol is NOT solely to police 2262 TCP-friendliness. Re-ECN preserves IP as a generic network layer for 2263 all sorts of responses to congestion, for all sorts of transports. 2264 Re-ECN merely ensures truthful downstream congestion information is 2265 available in the network layer for all sorts of accountability 2266 applications. 2268 The end to end design principle does not say that all functions 2269 should be moved out of the lower layers---only those functions that 2270 are not generic to all higher layers. Re-ECN adds a function to the 2271 network layer that is generic, but was omitted: accountability for 2272 causing congestion. Accountability is not something that an end-user 2273 can provide to themselves. We believe re-ECN adds no more than is 2274 sufficient to hold each flow accountable, even if it consists of a 2275 single datagram. 
2277 "Accountability" implies being able to identify who is responsible 2278 for causing congestion. However, at the network layer it would NOT 2279 be useful to identify the cause of congestion by adding individual or 2280 organisational identity information, NOR by using source IP 2281 addresses. Rather than bringing identity information to the point of 2282 congestion, we bring downstream congestion information to the point 2283 where the cause can be most easily identified and dealt with. That 2284 is, at any trust boundary, congestion can be associated with the 2285 physically connected upstream neighbour that is directly responsible 2286 for causing it (whether intentionally or not). A trust boundary 2287 interface is exactly the place to police or throttle in order to 2288 directly mitigate congestion, rather than having to trace the 2289 (ir)responsible party in order to shut them down. 2291 Some considered that ECN itself was a layering violation. The 2292 reasoning went that the interface to a layer should provide a service 2293 to the higher layer and hide how the lower layer does it. However, 2294 ECN reveals the state of the network layer and below to the transport 2295 layer. A more positive way to describe ECN is that it is like the 2296 return value of a function call to the network layer. It explicitly 2297 returns the status of the request to deliver a packet, by returning a 2298 value representing the current risk that a packet will not be served. 2300 Re-ECN has similar semantics, except the transport layer must try to 2301 guess the return value, then it can use the actual return value from 2302 the network layer to modify the next guess. 2304 9. Related Work 2306 {Due to lack of time, this section is incomplete. The reader is 2307 referred to the Related Work section of [Re-fb] for a brief selection 2308 of related ideas.} 2310 9.1. 
Policing Rate Response to Congestion 2312 ATM network elements send congestion back-pressure messages [ITU- 2313 T.I.371] along each connection, duplicating any end to end feedback 2314 because they don't trust it. On the other hand, re-ECN ensures 2315 information in forwarded packets can be used for congestion 2316 management without requiring a connection-oriented architecture and 2317 re-using the overhead of fields that are already set aside for end to 2318 end congestion control (and routing loop detection in the case of re- 2319 TTL in Appendix E). 2321 We borrowed ideas from policers in the literature [pBox],[XCHOKe], 2322 AFD etc. for our rate equation policer. However, without the benefit 2323 of re-ECN they don't police the correct rate for the condition of 2324 their path. They detect unusually high /absolute/ rates, but only 2325 while the policer itself is congested, because they work by detecting 2326 prevalent flows in the discards from the local RED queue. These 2327 policers must sit at every potential bottleneck, whereas our policer 2328 need only be located at each ingress to the internetwork. As Floyd & 2329 Fall explain [pBox], the limitation of their approach is that a high 2330 sending rate might be perfectly legitimate, if the rest of the path 2331 is uncongested or the round trip time is short. Commercially 2332 available rate policers cap the rate of any one flow. Or they 2333 enforce monthly volume caps in an attempt to control high volume 2334 file-sharing. They limit the value a customer derives. They might 2335 also limit the congestion customers can cause, but only as an 2336 accidental side-effect. They actually punish traffic that fills 2337 troughs as much as traffic that causes peaks in utilisation. In 2338 practice network operators need to be able to allocate service by 2339 cost during congestion, and by value at other times. 2341 9.2. 
Congestion Notification Integrity 2343 The choice of two ECT code-points in the ECN field [RFC3168] 2344 permitted future flexibility, optionally allowing the sender to 2345 encode the experimental ECN nonce [RFC3540] in the packet stream. 2347 The ECN nonce is an elegant scheme that allows the sender to detect 2348 if someone in the feedback loop tries to claim no congestion was 2349 experienced when in fact it was (whether drop or ECN marking). The 2350 sender chooses between the two ECT codepoints in a pseudo-random 2351 sequence. Then, whenever the network marks a packet with CE, to deny 2352 that the congestion happened, the cheater would have to guess which ECT 2353 codepoint was overwritten, with only a 50:50 chance of being correct 2354 each time. 2356 The assumption behind the ECN nonce is that a sender will want to 2357 detect whether a receiver is suppressing congestion feedback. This 2358 is only true if the sender's interests are aligned with the 2359 network's, or with the community of users as a whole. This may be 2360 true for certain large senders, who are under close scrutiny and have 2361 a reputation to maintain. But we have to deal with a more hostile 2362 world, where traffic may be dominated by peer-to-peer transfers, 2363 rather than downloads from a few popular sites. Often the `natural' 2364 self-interest of a sender is not aligned with the interests of other 2365 users. It often wishes to transfer data quickly to the receiver as 2366 much as the receiver wants the data quickly. 2368 In contrast, the re-ECN protocol enables policing of an agreed rate- 2369 response to congestion (e.g. TCP-friendliness) at the sender's 2370 interface with the internetwork. It also ensures downstream networks 2371 can police their upstream neighbours, to encourage them to police 2372 their users in turn.
But most importantly, it requires the sender to declare path congestion to the network and it can remove traffic at the egress if this declaration is dishonest. So it can police correctly, irrespective of whether the receiver tries to suppress congestion feedback or whether the sender ignores genuine congestion feedback. Therefore the re-ECN protocol addresses a much wider range of cheating problems, which includes the one addressed by the ECN nonce. {ToDo: Ensure we address the early ACK problem.}

9.3.  Identifying Upstream and Downstream Congestion

Purple [Purple] proposes that routers should use the CWR flag in the TCP header of ECN-capable flows to work out path congestion and therefore downstream congestion in a similar way to re-ECN. However, because CWR is in the transport layer, it is not always visible to network layer routers and policers. Purple's motivation was to improve AQM, not policing. But, of course, nodes trying to avoid a policer would not be expected to allow CWR to be visible.

10.  Security Considerations

This whole memo concerns the deployment of a secure congestion control framework. There are some specific security issues that we are still working on.

Malicious users have the ability to launch dynamically changing attacks, exploiting the time it takes to detect an attack, given ECN marking is binary. We are concentrating on subtle interactions between the ingress policer and the egress dropper in an effort to make it impossible to game the system.

There is an inherent need for at least some flow state at the egress dropper given the binary marking environment, and the consequent vulnerability to state exhaustion attacks. An egress dropper design with bounded flow state is in write-up.

A malicious source can spoof another user's address and send negative traffic to the same destination in order to fool the dropper into sanctioning the other user's flow. To prevent or mitigate these two different kinds of DoS attack, against the dropper and against given flows, we are considering various protection mechanisms. Section 5.5.1 discusses one of these.

The security of re-ECN has been deliberately designed to not rely on cryptography.

11.  IANA Considerations

This memo includes no request to IANA (yet).

If this memo were to progress to standards track, it would list:

o  The new RE flag in IPv4 (Section 5.1) and its extension with the ECN field to create a new set of extended ECN (EECN) codepoints;

o  The definition of the EECN codepoints for default Diffserv PHBs (Section 3.2);

o  The new extension header for IPv6 (Section 5.2);

o  The new combinations of flags in the TCP header for capability negotiation (Section 4.1.3);

o  The new ICMP message type (Section 5.5.1).

12.  Conclusions

{ToDo:}

13.  Acknowledgements

Sebastien Cazalet and Andrea Soppera contributed to the idea of re-feedback. All the following have given helpful comments: Andrea Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru Murgu, Nigel Geffen, Pete Willis (BT), Sally Floyd (ICIR), Stephen Hailes, Mark Handley, Adam Greenhalgh (UCL), Jon Crowcroft (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer, Liz Maida (MIT), and comments from participants in the CRN/CFP Broadband and DoS-resistant Internet working groups.

14.  Comments Solicited

Comments and questions are encouraged and very welcome.
They can be addressed to the IETF Transport Area working group's mailing list, and/or to the authors.

15.  References

15.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's Initial Window", RFC 3390, October 2002.

[RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003.

15.2.  Informative References

[ARI05]  Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the Internet to Support Real-Time Content Supply from a Large Fraction of Broadband Residential Users", BT Technology Journal (BTTJ) 23(2), April 2005.

[CL-arch]  Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F., Charny, A., Babiarz, J., and K. Chan, "A Framework for Admission Control over DiffServ using Pre-Congestion Notification", draft-briscoe-tsvwg-cl-architecture-02 (work in progress), March 2006.

[CLoop_pol]  Salvatori, A., "Closed Loop Traffic Policing", Politecnico Torino and Institut Eurecom Masters Thesis, September 2005.

[ECN-Deploy]  Floyd, S., "ECN (Explicit Congestion Notification) in TCP/IP; Implementation and Deployment of ECN", Web-page, May 2004.

[Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the evolution of congestion control", Automatica 35(12) 1969--1985, December 1999.

[I-D.ietf-tsvwg-ecnsyn]  Kuzmanovic, A., "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", draft-ietf-tsvwg-ecnsyn-00 (work in progress), November 2005.

[ITU-T.I.371]  ITU-T, "Traffic Control and Congestion Control in B-ISDN", ITU-T Rec. I.371 (03/04), March 2004.

[Jiang02]  Jiang, H. and C. Dovrolis, "Passive Estimation of TCP Round-Trip Times", ACM SIGCOMM CCR 32(3) 75--88, July 2002.

[Mathis97]  Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", ACM SIGCOMM CCR 27(3) 67--82, July 1997.

[Purple]  Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: Predictive Active Queue Management Utilizing Congestion Information", Proc. Local Computer Networks (LCN 2003), October 2003.

[RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, December 1998.

[RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000.

[RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager", RFC 3124, June 2001.

[RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen, P., Krishnan, R., Cheval, P., and J. Heinanen, "Multi-Protocol Label Switching (MPLS) Support of Differentiated Services", RFC 3270, May 2002.

[RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header", RFC 3514, April 2003.

[RFC3714]  Floyd, S. and J.
Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

[Re-PCN]  Briscoe, B., "Emulating Border Flow Policing using Re-ECN on Bulk Data", draft-briscoe-tsvwg-re-ecn-border-cheat-01 (work in progress), March 2006.

[Re-fb]  Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., Salvatori, A., Soppera, A., and M. Koyabe, "Policing Congestion Response in an Internetwork Using Re-Feedback", ACM SIGCOMM CCR 35(4) 277--288, August 2005.

[Smart_rtg]  Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang, "Optimizing Cost and Performance for Multihoming", ACM SIGCOMM CCR 34(4) 79--92, October 2004.

[Steps_DoS]  Handley, M. and A. Greenhalgh, "Steps towards a DoS-resistant Internet Architecture", Proc. ACM SIGCOMM workshop on Future directions in network architecture (FDNA'04) pp 49--56, August 2004.

[Tussle]  Clark, D., Sollins, K., Wroclawski, J., and R. Braden, "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM SIGCOMM CCR 32(4) 347--356, October 2002.

[XCHOKe]  Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A., Saran, H., and R. Shorey, "XCHOKe: Malicious Source Control for Congestion Avoidance at Internet Gateways", Proceedings of IEEE International Conference on Network Protocols (ICNP-02), November 2002.

[pBox]  Floyd, S. and K. Fall, "Promoting the Use of End-to-End Congestion Control in the Internet", IEEE/ACM Transactions on Networking 7(4) 458--472, August 1999.

Appendix A.  Precise Re-ECN Protocol Operation

The protocol operation described in Section 3.3 was an approximation. In fact, standard ECN router marking combines 1% and 2% marking into slightly less than 3% whole-path marking, because routers deliberately mark CE whether or not it has already been marked by another router upstream.
So the combined marking fraction would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

To generalise this we will need some notation.

o  j represents the index of each resource (typically queues) along a path, ranging from 0 at the first router to n-1 at the last.

o  m_j represents the fraction of octets *m*arked CE by a particular router (whether or not they are already marked) because of congestion of resource j.

o  u_j represents congestion *u*pstream of resource j, being the fraction of CE marking in arriving packet headers (before marking).

o  p_j represents *p*ath congestion, being the fraction of packets arriving at resource j with the RE flag blanked (excluding Not-RECT packets).

o  v_j denotes expected congestion downstream of resource j, which can be thought of as a *v*irtual marking fraction, being derived from two other marking fractions.

Observed fractions of each particular codepoint (u, p and v) and router marking rate m are dimensionless fractions, being the ratio of two data volumes (marked and total) over a monitoring period. All measurements are in terms of octets, not packets, assuming that line resources are more congestible than packet processing.

The path congestion (RE blanking fraction) set by the sender should reflect the upstream congestion (CE marking fraction) fed back from the destination. Therefore in the steady state

   p_0 = u_n
       = 1 - (1 - m_0)(1 - m_1)...(1 - m_(n-1))

Similarly, at some point j in the middle of the network, if p = 1 - (1 - u_j)(1 - v_j), then

   v_j = 1 - (1 - p)/(1 - u_j)
      ~= p - u_j;  if u_j << 100%

So, between the two routers in the example in Section 3.3, congestion downstream is

   v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%)
       = 2.00%,

or a useful approximation of downstream congestion is

   v_1 ~= 2.98% - 1.00%
       ~= 1.98%.

Appendix B.  ECN Compatibility

The rationale for choosing the particular combinations of SYN and SYN ACK flags in Section 4.1.3 is as follows.

Choice of SYN flags: A re-ECN sender can work with vanilla ECN receivers so we wanted to use the same flags as would be used in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same time, we wanted a server (host B) that is Re-ECT to be able to recognise that the client (A) is also Re-ECT. We believe also setting NS=1 in the initial SYN achieves both these objectives, as it should be ignored by vanilla ECT receivers and by ECT-Nonce receivers. But senders that are not Re-ECT should not set NS=1. At the time ECN was defined, the NS flag was not defined, so setting NS=1 should be ignored by existing ECT receivers (but testing against implementations may yet prove otherwise). The ECN Nonce RFC [RFC3540] is silent on what the NS field might be set to in the TCP SYN, but we believe the intent was for a nonce client to set NS=0 in the initial SYN (again only testing will tell). Therefore we define a Re-ECN-setup SYN as one with NS=1, CWR=1 & ECE=1.

Choice of SYN ACK flags: The client (A) needs to be able to determine whether the server (B) is Re-ECT. The original ECN specification required an ECT server to respond to an ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There is no room to modify this by setting the NS flag, as that is already set in the SYN ACK of an ECT-Nonce server. So we used the only combination of CWR and ECE that would not be used by existing TCP receivers: CWR=1 and ECE=0. The original ECN specification defines this combination as a non-ECN-setup SYN ACK, which remains true for vanilla and Nonce ECTs. But for re-ECN we define it as a Re-ECN-setup SYN ACK.
We didn't use a SYN ACK with both CWR and ECE cleared to 0 because that would be the likely response from most Not-ECT receivers. And we didn't use a SYN ACK with both CWR and ECE set to 1 either, as at least one broken receiver implementation echoes whatever flags were in the SYN into its SYN ACK. Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1 & ECE=0.

Choice of two alternative SYN ACKs: The NS flag may take either value in a Re-ECN-setup SYN ACK. Section 5.4 specifies that a Re-ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to echo congestion experienced (CE) on the initial SYN. Otherwise a Re-ECN-setup SYN ACK MUST be returned with NS=0. The only current known use of the NS flag in a SYN ACK is to indicate support for the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1. Given the ECN nonce MUST NOT be used for a RECN mode connection, a Re-ECN-setup SYN ACK can use either setting of the NS flag without any risk of confusion, because the CWR & ECE flags will be reversed relative to those used by an ECN nonce SYN ACK.

Appendix C.  Packet Marking During Flow Start

{ToDo: Write up proof that sender should mark FNE on first and third data packets, even with the largest allowed initial window.}

Appendix D.  Example Egress Dropper Algorithm

{ToDo: Write up the basic algorithm with flow state, then the aggregated one.}

Appendix E.  Re-TTL

This Appendix gives an overview of a proposal to overload the TTL field in the IP header to monitor downstream propagation delay. It is planned to fully write up this proposal in a future Internet Draft.

Delay re-feedback can be achieved by overloading the TTL field, without changing IP or router TTL processing. A target value for TTL at the destination would need standardising, say 16.
If the path hop count increased by more than 16 during a routing change, it would temporarily be mistaken for a routing loop, so this target would need to be chosen to exceed typical hop count increases. The TCP wire protocol and handlers would need modifying to feed back the destination TTL and initialise it. It would be necessary to standardise the unit of TTL in terms of real time (as was the original intent in the early days of the Internet).

In the longer term, precision could be improved if routers decremented TTL to represent exact propagation delay to the next router. That is, for a router to decrement TTL by, say, 1.8 time units it would alternate the decrement of every packet between 1 & 2 at a ratio of 1:4. Although this might sometimes require a seemingly dangerous null decrement, a packet in a loop would still decrement to zero after 255 time units on average. As more routers were upgraded to this more accurate TTL decrement, path delay estimates would become increasingly accurate despite the presence of some legacy routers that continued to always decrement the TTL by 1.

Appendix F.  Policer Designs to ensure Congestion Responsiveness

F.1.  Per-user Policing

User policing requires a policer on the ingress interface of the access router associated with the user. At that point, the traffic of the user hasn't diverged on different routes yet; nor has it mixed with traffic from other sources.

In order to ensure that a user doesn't generate more congestion in the network than her due share, a modified bulk token-bucket is maintained with the following parameters:

o  b_0 the initial token level

o  r the filling rate

o  b_max the bucket depth

The same token bucket algorithm is used as in many areas of networking, but how it is used is very different:

o  all traffic from a user over the lifetime of their subscription is policed in the same token bucket.

o  only Re-Echo packets consume tokens

Such a policer will allow network operators to throttle the contribution of their users to network congestion. This will require the appropriate contractual terms to be in place between operators and users. For instance: a condition for a user to subscribe to a given network service may be that she should not cause more than a volume C_user of congestion over a reference period T_user, although she may carry forward up to N_user times her allowance at the end of each period. These terms directly set the parameters of the user policer:

o  b_0 = C_user

o  r = C_user/T_user

o  b_max = b_0 * (N_user + 1)

Besides the congestion budget policer above, another user policer will be necessary to rate-limit FNE packets, if they are to be marked rather than dropped (see discussion in Section 5.3). Rate-limiting FNE packets will prevent high bursts of new flow arrivals, which is a very useful feature in DoS prevention. A condition to subscribe to a given network service would have to be that a user should not generate more than C_FNE FNE packets, over a reference period T_FNE, with no option to carry forward any of the allowance at the end of each period.
These terms directly set the parameters of the FNE policer:

o  b_0 = C_FNE

o  r = C_FNE/T_FNE

o  b_max = b_0

T_FNE should be a much shorter period than T_user: for instance T_FNE could be in the order of minutes while T_user could be in the order of weeks.

F.2.  Per-flow Rate Policing

Per-flow policing aims to enforce congestion responsiveness on the shortest information timescale on a network path: packet roundtrips.

This again requires that the appropriate terms be agreed between a network operator and its users, where a congestion responsiveness policy might be required for the use of a given network service (perhaps unless the user specifically requests otherwise).

As an example, we describe below how a rate adaptation policer can be designed when the applicable rate adaptation policy is TCP-compliance. In that context, the average throughput of a flow will be expected to be bounded by the value of the TCP throughput during congestion avoidance, given by Mathis' formula [Mathis97]

   x_TCP = k * s / ( T * sqrt(m) )

where:

o  x_TCP is the throughput of the TCP flow in octets per second (so that x_TCP/s is in packets per second),

o  k is a constant upper-bounded by sqrt(3/2),

o  s is the average packet size of the flow,

o  T is the roundtrip time of the flow,

o  m is the congestion level experienced by the flow.

We define the marking period N=1/m which represents the average number of packets between two re-echoes. Mathis' formula can be rewritten as:

   x_TCP = k*s*sqrt(N)/T

We can then get the average inter-mark time in a compliant TCP flow, dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

   dt_TCP = sqrt(N)*T/k

We rely on this equation for the design of a rate-adaptation policer as a variation of a token bucket. In that case a policer has to be set up for each policed flow.
This may be triggered by FNE packets, with the remainder of flows being all rate limited together if they do not start with an FNE packet.

Where maintaining per flow state is not a problem, for instance on some access routers, systematic per-flow policing may be considered. Should per-flow state be more constrained, rate adaptation policing could be limited to a random sample of flows exhibiting Re-Echoes.

As in the case of user policing, only re-echo packets will consume tokens; however, the amount of tokens consumed will depend on the congestion signal.

When a new rate adaptation policer is set up for flow j, the following state is created:

o  a token bucket b_j of depth b_max starting at level b_0

o  a timestamp t_j = timenow()

o  a counter N_j = 0

o  a roundtrip estimate T_j

o  a filling rate r

When the policing node forwards a packet of flow j with no Re-Echo:

o  the counter is incremented: N_j += 1

When the policing node forwards a packet of flow j carrying a congestion mark (CE):

o  the counter is incremented: N_j += 1

o  the token level is adjusted: b_j += r*(timenow()-t_j) - sqrt(N_j)*T_j/k

o  the counter is reset: N_j = 0

o  the timer is reset: t_j = timenow()

An implementation example will be given in a later draft that avoids having to extract the square root.

Analysis: For a TCP flow, for r = 1 token/sec, on average,

   r*(timenow()-t_j) - sqrt(N_j)*T_j/k = dt_TCP - sqrt(N)*T/k = 0

This means that the token level will fluctuate around its initial level.
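
The per-flow state and update rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name FlowPolicer, the helper run(), the default values of b_0 and b_max, and the capping of the token level at b_max are all choices made here for the example. The boolean `marked` stands for whichever congestion signal triggers the token adjustment.

```python
import math
from dataclasses import dataclass

K = math.sqrt(3.0 / 2.0)  # k: the upper bound quoted for Mathis' constant


@dataclass
class FlowPolicer:
    """State for one policed flow j (field names follow the text)."""
    rtt: float            # T_j: round-trip estimate, in seconds
    t: float              # t_j: time of the last marked packet
    b: float = 10.0       # b_j: token level, starts at b_0 (assumed value)
    b_max: float = 20.0   # bucket depth (assumed value)
    r: float = 1.0        # filling rate, tokens per second
    n: int = 0            # N_j: packets counted since the last mark

    def on_packet(self, now: float, marked: bool) -> bool:
        """Update state for one forwarded packet.

        Returns False once the token level is negative, i.e. the flow
        is judged to be responding less than a compliant TCP would.
        """
        self.n += 1
        if marked:
            # Credit the elapsed time, debit the inter-mark time
            # dt_TCP = sqrt(N) * T / k expected of a compliant TCP flow.
            self.b += self.r * (now - self.t) - math.sqrt(self.n) * self.rtt / K
            self.b = min(self.b, self.b_max)  # cap at depth (assumption)
            self.n = 0
            self.t = now
        return self.b >= 0


def run(policer: FlowPolicer, inter_mark_time: float,
        packets_per_mark: int, marks: int) -> bool:
    """Feed evenly spaced packets; every packets_per_mark-th is marked."""
    now, ok = policer.t, True
    for _ in range(marks):
        for i in range(packets_per_mark):
            now += inter_mark_time / packets_per_mark
            ok = policer.on_packet(now, marked=(i == packets_per_mark - 1))
    return ok
```

With r = 1 token/sec, a flow whose marks arrive exactly dt_TCP apart leaves the token level unchanged, while a flow sending the same number of packets between marks in a quarter of that time loses about 0.75 * dt_TCP tokens per mark and eventually drives the level negative.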
The depth b_max of the bucket sets the timescale on which the rate adaptation policy is performed while the filling rate r sets the trade-off between responsiveness and robustness:

o  the higher b_max, the longer it will take to catch greedy flows

o  the higher r, the fewer false positives (greedy verdict on compliant flows) but the more false negatives (compliant verdict on greedy flows)

This rate adaptation policer requires the availability of a roundtrip estimate, which may be obtained for instance from the application of re-feedback to the downstream delay (Appendix E) or from passive estimation [Jiang02].

When the bucket of a policer located at the access router (whether it is a per-user policer or a per-flow policer) becomes empty, the access router SHOULD drop at least all packets causing the token level to become negative. The network operator MAY take further sanctions if the token level of the per-flow policers associated with a user becomes negative.
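
As an illustration of the per-user congestion budget policer of Appendix F.1 combined with the drop rule just stated, the following Python sketch may help. It is written under assumptions that are not in the text: the class name CongestionBudget and its method names are hypothetical, a Re-Echo packet is assumed to consume as many tokens as its size in octets, and time is passed in explicitly as seconds.

```python
from dataclasses import dataclass, field


@dataclass
class CongestionBudget:
    """Per-user token bucket set from the contractual terms in F.1."""
    c_user: float   # C_user: congestion volume allowed per period (octets)
    t_user: float   # T_user: reference period (seconds)
    n_user: float   # N_user: number of allowances that may be carried forward
    level: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.b_0 = self.c_user                     # b_0 = C_user
        self.r = self.c_user / self.t_user         # r = C_user/T_user
        self.b_max = self.b_0 * (self.n_user + 1)  # b_max = b_0*(N_user+1)
        self.level = self.b_0

    def forward(self, now: float, size: int, re_echo: bool) -> bool:
        """Return True to forward the packet, False to drop it."""
        # Refill at rate r since the last packet, up to the bucket depth.
        self.level = min(self.b_max, self.level + self.r * (now - self.last))
        self.last = now
        if not re_echo:
            return True        # only Re-Echo packets consume tokens
        if self.level - size < 0:
            return False       # would make the level negative: drop
        self.level -= size     # assumed cost: one token per octet
        return True
```

A user with C_user = 3000 octets can therefore send two 1500-octet Re-Echo packets back to back, has a third dropped, and regains the ability to send one more after 1500 seconds at r = 1 token/sec. The FNE policer described in F.1 has the same shape with b_max = b_0 (no carry forward), presumably charging one token per FNE packet rather than per octet.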

Authors' Addresses

Bob Briscoe
BT & UCL
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 645196
Email: bob.briscoe@bt.com
URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Arnaud Jacquet
BT
B54/70, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 647284
Email: arnaud.jacquet@bt.com

Alessandro Salvatori
BT
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Email: sandr8@gmail.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.