Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Expires: December 28, 2006                                 June 26, 2006


        Emulating Border Flow Policing using Re-ECN on Bulk Data
               draft-briscoe-tsvwg-re-ecn-border-cheat-01

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 28, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   Scaling per flow admission control to the Internet is a hard problem.
   A recently proposed approach combines Diffserv and pre-congestion
   notification (PCN) to provide a service slightly better than Intserv
   controlled load. It scales to networks of any size, but only if
   domains trust each other to comply with admission control and rate
   policing. This memo claims to solve this trust problem without
   losing scalability. It describes bulk border policing that provides
   a sufficient emulation of per-flow policing with the help of another
   recently proposed extension to ECN, involving re-echoing ECN feedback
   (re-ECN). With only passive bulk measurements at borders, sanctions
   can be applied against cheating networks.

Status (to be removed by the RFC Editor)

   This memo is posted as an Internet-Draft with the intent to
   eventually progress to informational status. It is envisaged that
   the necessary standards actions to realise the system described would
   sit in three other documents currently being discussed (but not on
   the standards track) in the IETF Transport Area [Re-TCP], [RSVP-ECN]
   & [PCN]. The authors seek comments from the Internet community on
   whether combining PCN and re-ECN is a sufficient solution to the
   admission control problem.

Changes from previous drafts (to be removed by the RFC Editor)

   From -00 to -01:

      Added subsection on Border Accounting Mechanisms (Section 5.6.1)

      Section 4.2 on the re-ECN wire protocol clarified and re-organised
      to separately discuss re-ECN for default ECN marking and for pre-
      congestion marking (PCN).

      Router Forwarding Behaviour subsection added to re-organised
      section on Protocol Operation (Section 4.3). Extensions section
      moved within Protocol Operations.

      Emulating Border Policing (Section 5) reorganised, starting with a
      new Terminology subsection heading, and a simplified overview
      section.
      Added a large new subsection on Border Accounting
      Mechanisms within a new section bringing together other
      subsections on Border Mechanisms generally (Section 5.6). Some
      text moved from old subsections into these new ones.

      Added section on Incremental Deployment (Section 7), drawing
      together relevant points about deployment made throughout.

      Sections on Design Rationale (Section 8) and Security
      Considerations (Section 9) expanded with some new material,
      including new attacks and their defences.

      Suggested Border Metering Algorithms improved (Appendix A.2) for
      resilience to newly identified attacks.

Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol for an RSVP (or similar) Transport
     4.1.  Protocol Overview
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
       4.2.1.  Re-ECN Recap
       4.2.2.  Re-ECN Combined with Pre-Congestion Notification (re-PCN)
     4.3.  Protocol Operation
       4.3.1.  Protocol Operation for an Established Flow
       4.3.2.  Aggregate Bootstrap
       4.3.3.  Flow Bootstrap
       4.3.4.  Router Forwarding Behaviour
       4.3.5.  Extensions
   5.  Emulating Border Policing with Re-ECN
     5.1.  Informal Terminology
     5.2.  Policing Overview
     5.3.  Pre-requisite Contractual Arrangements
     5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits
     5.5.  Sanctioning Dishonest Marking
     5.6.  Border Mechanisms
       5.6.1.  Border Accounting Mechanisms
       5.6.2.  Competitive Routing
       5.6.3.  Fail-safes
   6.  Analysis
   7.  Incremental Deployment
   8.  Design Choices and Rationale
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1. Normative References
     14.2. Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
     A.2.  Downstream Congestion Metering Algorithms
       A.2.1.  Bulk Downstream Congestion Metering Algorithm
       A.2.2.  Inflation Factor for Persistently Negative Flows
     A.3.  Algorithm for Sanctioning Negative Traffic
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

   The Internet community largely lost interest in the Intserv
   architecture after it was clarified that it would be unlikely to
   scale to the whole Internet [RFC2208]. Although Intserv mechanisms
   proved impractical, the bandwidth reservation service it aimed to
   offer is still very much required.

   A recently proposed approach [CL-deploy] combines Diffserv and pre-
   congestion notification (PCN) to provide a service slightly better
   than Intserv controlled load [RFC2211]. It scales to any size
   network, but only if domains trust their neighbours to have checked
   that upstream customers aren't taking more bandwidth than they
   reserved, either accidentally or deliberately. This memo describes
   border policing measures so that one network can protect its
   interests, even if networks around it are deliberately trying to
   cheat. The approach provides a sufficient emulation of flow rate
   policing at trust boundaries but without per-flow processing. The
   emulation is not perfect, but it is sufficient to ensure that the
   punishment is at least proportionate to the severity of the cheat.

   The aim is to be able to scale controlled load service to any number
   of endpoints, even though such scaling must take account of the
   increasing numbers of networks and users who may all have conflicting
   interests. To achieve such scaling, this memo combines two recent
   proposals, both of which it briefly recaps:

   o  A deployment model for admission control over Diffserv using pre-
      congestion notification [CL-deploy] describes how bulk pre-
      congestion notification on routers within an edge-to-edge Diffserv
      region can emulate the precision of per-flow admission control to
      provide controlled load service without unscalable per-flow
      processing;

   o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP].
      The trick that addresses cheating at borders is to recognise that
      border policing is mainly necessary because cheating upstream
      networks will admit traffic when they shouldn't only as long as
      they don't directly experience the downstream congestion their
      misbehaviour can cause. The re-ECN protocol requires upstream
      nodes to declare expected downstream congestion in all forwarded
      packets and it makes it in their interests to declare it honestly.
      Operators can then monitor downstream congestion in bulk at
      borders to emulate policing.

   Rather than the end-to-end arrangement used when re-ECN was specified
   for the TCP transport [Re-TCP], this memo specifies re-ECN in an
   edge-to-edge arrangement, making it applicable to the above
   deployment model for admission control over Diffserv. Also, rather
   than using a TCP transport for regular congestion feedback, this memo
   specifies re-ECN using RSVP as the transport for feedback [RSVP-ECN].
   A similar deployment model, but with a different transport for
   signalling congestion feedback, could be used (e.g. RMD [NSIS-RMD]
   uses NSIS).

   This memo aims to do two things: i) define how to apply the re-ECN
   protocol to the admission control over Diffserv scenario; and ii)
   explain why re-ECN sufficiently emulates border policing in that
   scenario. Most of the memo is taken up with the second aim:
   explaining why it works. Applying re-ECN to the scenario actually
   involves quite a trivial modification to the ingress gateway. Our
   immediate goal is to convince everyone to build that modification
   into ingress gateways from the start, whether first deployments
   require policing or not. Otherwise, when we want to add policing, we
   will have built ourselves a legacy problem. In other words, we aim to
   convince people to "Build in security from the start."

   The body of this memo is structured as follows:

      Section 3 describes the border policing problem. We recap the
      traditional, unscalable view of how to solve the problem, and we
      recap the admission control solution which has the scalability we
      do not want to lose when we add border policing;

      Section 4 specifies the re-ECN protocol solution in detail;

      Section 5 explains how to use the protocol to emulate border
      policing, and why it works;

      Section 6 analyses the security of the proposed solution;

      Section 8 explains the sometimes subtle rationale behind our
      design decisions;

      Section 9 comments on the overall robustness of the security
      assumptions and lists specific security issues.

   It must be emphasised that we are not evangelical about removing per-
   flow processing from borders. Network operators may choose to do
   per-flow processing at their borders for their own reasons, such as
   to support business models that require per-flow accounting. Our aim
   is to show that per-flow processing at borders is no longer
   /necessary/ in order to provide end-to-end QoS using flow admission
   control. Indeed, we are absolutely opposed to standardisation of
   technology that embeds particular business models into the Internet.
   Our aim is merely to provide a new useful metric (downstream
   congestion) at trust boundaries. Given the well-known significance
   of congestion in economics, operators can then use this new metric in
   their interconnection contracts if they choose. This will enable
   competitive evolution of new business models (for examples
   see [IXQoS]), alongside more traditional models that depend on more
   costly per-flow processing at borders.

2.  Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  The Problem

3.1.  The Traditional Per-flow Policing Problem

   If we claim to be able to emulate per-flow policing with bulk
   policing at trust boundaries, we need to know exactly what we are
   emulating. So, even though we expect it to become a historic
   practice, we will start from the traditional scenario with per-flow
   policing at trust boundaries to explain why it has always been
   considered necessary.

   To be able to take advantage of a reservation-based service such as
   controlled load, a source must reserve resources using a signalling
   protocol such as RSVP [RFC2205]. An RSVP signalling request refers
   to a flow of packets by its flow ID tuple (filter spec [RFC2205]) (or
   its security parameter index (SPI) [RFC2207] if port numbers are
   hidden by IPsec encryption). Other signalling protocols use similar
   flow identifiers. But it is insufficient to merely authorise and
   admit a flow based on its identifiers, for instance merely opening a
   pin-hole for packets with identifiers that match an admitted flow ID.
   Once a flow is admitted, it cannot necessarily be trusted to send
   packets within the rate profile it requested.

   The packet rate must also be policed to keep the flow within the
   requested flow spec [RFC2205]. For instance, without data rate
   policing, a source could reserve resources for an 8 kbps audio flow
   but transmit a 6 Mbps video (theft of service). More subtly, the
   sender could generate bursts that were outside the profile it had
   requested.

   In traditional architectures, per-flow packet rate-policing is
   expensive and unscalable but, without it, a network is vulnerable to
   such theft of service (whether malicious or accidental). Perhaps
   more importantly, if flows are allowed to send more data than they
   were permitted, the ability of admission control to give assurances
   to other flows will break.
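
   To make concrete what a traditional policer has to do for every
   admitted flow, the following is a minimal sketch of a token-bucket
   check in Python. It is illustrative only and is not taken from RSVP
   or from this memo: the names TokenBucketPolicer and
   offer_constant_rate, the 3000-byte burst allowance and the packet
   sizes are all assumptions, chosen to mirror the audio and video
   rates mentioned above.

```python
class TokenBucketPolicer:
    """Illustrative per-flow policer (hypothetical names, not from RSVP)."""

    def __init__(self, rate_bps, bucket_bytes):
        self.rate_bytes_per_s = rate_bps / 8.0   # reserved rate
        self.bucket_bytes = float(bucket_bytes)  # permitted burst size
        self.tokens = float(bucket_bytes)        # start with a full bucket
        self.last_time = 0.0

    def conforms(self, now, packet_bytes):
        """Refill for the elapsed time, then admit the packet if it fits."""
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.bucket_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


def offer_constant_rate(policer, rate_bps, packet_bytes, n_packets):
    """Send n_packets at a constant bit rate; return how many conform."""
    interval = packet_bytes * 8.0 / rate_bps
    conforming = 0
    for i in range(1, n_packets + 1):
        if policer.conforms(i * interval, packet_bytes):
            conforming += 1
    return conforming


# Reservation assumed for illustration: 8 kbps with a 3000-byte burst.
honest = offer_constant_rate(TokenBucketPolicer(8_000, 3000),
                             8_000, 100, 100)
cheat = offer_constant_rate(TokenBucketPolicer(8_000, 3000),
                            6_000_000, 1500, 3750)
print(f"8 kbps audio: {honest} of 100 packets conform")
print(f"6 Mbps video: {cheat} of 3750 packets conform")
```

   Under these assumptions the honest audio source stays within
   profile, while almost all of the video packets are out of profile.
   The scaling cost is that one such state record has to be kept and
   updated per packet for every admitted flow.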

   Just as sources need not be trusted to keep within their requested
   flow spec, whole networks might also try to cheat. We will now set
   up a concrete scenario to illustrate such cheats. Imagine
   reservations for unidirectional flows from senders, through at least
   two networks, an edge network and its downstream transit provider.
   Imagine the edge network charges its retail customers per reservation
   but also has to pay its transit provider a charge per reservation.
   Typically, both its selling and buying charges might depend on the
   duration and rate of each reservation. The levels of the actual
   selling and buying prices are irrelevant to our discussion (most
   likely the network will sell at a higher price than it buys, of
   course).

   A cheating ingress network could systematically reduce the size of
   its retail customers' reservation signalling requests before
   forwarding them to its transit provider (and systematically reinstate
   the responses on the way back). It would then receive an honest
   income from its upstream retail customer but only pay for
   fraudulently smaller reservations downstream. Equivalently, a
   cheating ingress network may feed the traffic from a number of flows
   into an aggregate reservation over the transit that is smaller than
   the total of all the flows. Because of these fraud possibilities, in
   traditional QoS reservation architectures the downstream network
   polices at each border. The policer checks that the actual sent data
   rate of each flow is within the signalled reservation.

   Reservation signalling could be authenticated end to end, but this
   wouldn't prevent the aggregation cheat just described. For this
   reason, and to avoid the need for a global PKI, signalling integrity
   is typically only protected on a hop-by-hop basis [RFC2747].

   A variant of the above cheat is where a router in an honest
   downstream network denies admission to a new reservation, but a
   cheating upstream network still admits the flow. For instance, the
   networks may be using Diffserv internally, but Intserv admission
   control at their borders [RFC2998]. The cheat would only work if
   they were using bulk Diffserv traffic policing at their borders,
   perhaps to avoid the cost/complexity of Intserv border policing. As
   far as the cheating upstream network is concerned, it gets the
   revenue from the reservation, but it doesn't have to pay any
   downstream wholesale charges and the congestion is in someone else's
   network. The cheating network may calculate that most of the flows
   affected by congestion in the downstream network aren't likely to be
   its own. It may also calculate that the downstream router has been
   configured to deny admission to new flows in order to protect
   bandwidth assigned to other network services (e.g. enterprise VPNs).
   So the cheating network can steal capacity from the downstream
   operator's VPNs that are probably not actually congested.

   To summarise, in traditional reservation signalling architectures, if
   a network cannot trust a neighbouring upstream network to rate-police
   each reservation, it has to check for itself that the data rate fits
   within each of the reservations it has admitted.

3.2.  Generic Scenario

   We will now describe a generic internetworking scenario that we will
   use to describe and to test our bulk policing proposal. It consists
   of a number of networks and endpoints that do not fully trust each
   other to behave. In Section 6 we will tie down exactly what we mean
   by partial trust, and we will consider the various combinations where
   some networks do not trust each other and others are colluding
   together.

    _    ___      _____________________________________       ___    _
   | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
   | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
   | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
   | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |\ | |
   | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |=>| |
   | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |/ | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
   |_|  |___|    |_____________________________________|     |___|  |_|

   Sx  Ingress               Diffserv region                Egress  Rx
   End Access                                               Access  End
   Host Network                                            Network Host
               <-------- edge-to-edge signalling ------->
                         (for admission control)

   <-------------------end-to-end QoS signalling protocol------------->

      Figure 1: Generic Scenario (see text for explanation of terms)

   An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1)
   connect the interior Diffserv region to the edge access networks
   where routers (not shown) use per-flow reservation processing.
   Within the Diffserv region are three interior domains, A, B and C, as
   well as the inward facing interfaces of the ingress and egress
   gateways. An ingress and egress border router (BR) is shown
   interconnecting each interior domain with the next. There may be
   other interior routers (not shown) within each interior domain.

   In two paragraphs we now briefly recap how pre-congestion
   notification is intended to be used to control flow admission to a
   large Diffserv region. The first paragraph describes data plane
   functions and the second describes signalling in the control plane.
   We omit many details from [CL-deploy] including behaviour during
   routing changes. For brevity here we assume other flows are already
   in progress across a path through the Diffserv region before a new
   one arrives, but how bootstrap works is described in Section 4.3.2.

   Figure 1 shows a single simplex reserved flow from the sending (Sx)
   end host to the receiving (Rx) end host. The ingress gateway polices
   incoming traffic within its admitted reservation and remarks it to
   turn on an ECN-capable codepoint [RFC3168] and the controlled load
   (CL) Diffserv codepoint. Together, these codepoints define which
   traffic is entitled to the enhanced scheduling of the CL behaviour
   aggregate on routers within the Diffserv region. The CL PHB of
   interior routers consists of a scheduling behaviour and a new ECN
   marking behaviour that we call `pre-congestion notification' [PCN].
   The CL PHB simply re-uses the definition of expedited forwarding
   (EF) [RFC3246] for its scheduling behaviour. But it incorporates a
   new ECN marking behaviour, which sets the ECN field of an increasing
   number of CL packets to the admission marked (AM) codepoint as they
   approach a threshold rate that is lower than the line rate. The use
   of virtual queues ensures real queues have hardly built up any
   congestion delay. The level of marking detected at the egress of the
   Diffserv region is then used by the signalling system in order to
   determine admission control as follows.

   The end-to-end QoS signalling (e.g. RSVP) for a new reservation
   takes one giant hop from ingress to egress gateway, because interior
   routers within the Diffserv region are configured to ignore RSVP.
   The egress gateway holds flow state because it takes part in the end-
   to-end reservation. So it can classify all packets by flow and it
   can identify all flows that have the same previous RSVP hop (a CL-
   region-aggregate).
   For each CL-region-aggregate of flows in progress, the egress gateway
   maintains a per-packet moving average of the fraction of
   pre-congestion-marked traffic. Once an RSVP PATH message for a new
   reservation has hopped across the Diffserv region and reached the
   destination, an RSVP RESV message is returned. As the RESV message
   passes, the egress gateway piggy-backs the relevant pre-congestion
   level onto it [RSVP-ECN]. Again, interior routers ignore the RSVP
   message, but the ingress gateway strips off the pre-congestion level.
   If the pre-congestion level is above a threshold, the ingress gateway
   denies admission to the new reservation, otherwise it returns the
   original RESV signal back towards the data sender.

   Once a reservation is admitted, its traffic will always receive low
   delay service for the duration of the reservation. This is because
   ingress gateways ensure that traffic not under a reservation cannot
   pass into the Diffserv region with the CL DSCP set. So non-reserved
   traffic will always be treated with a lower priority PHB at each
   interior router. And even if some disaster re-routes traffic after
   it has been admitted, if the traffic through any resource tips over a
   fail-safe threshold, pre-congestion notification will trigger flow-
   pre-emption to very quickly bring every router within the whole
   Diffserv region back below its operating point.

   The whole admission control system just described deliberately
   confines per-flow processing to the access edges of the network,
   where it will not limit the system's scalability. But ideally we
   want to extend this approach to multiple networks, to take even more
   advantage of its scaling potential. We would still need per-flow
   processing at the access edges of each network, but not at the high
   speed interfaces where they interconnect.
   Even though such an admission control system would work technically,
   it would gain us no scaling advantage if each network also wanted to
   police the rate of each admitted flow for itself---border routers
   would still have to do complex packet operations per-flow anyway,
   given they don't trust upstream networks to do their policing for
   them.

   This memo describes how to emulate per-flow rate policing using bulk
   mechanisms at border routers, so the full scalability potential of
   pre-congestion notification is not limited by the need for per-flow
   policing mechanisms at borders, which would make borders the most
   cost-critical pinch-points. Then we can achieve the long sought-for
   vision of secure Internet-wide bandwidth reservations without needing
   per-flow processing at all in core and border routers---where
   scalability is most critical.

4.  Re-ECN Protocol for an RSVP (or similar) Transport

4.1.  Protocol Overview

   First we need to recap the way routers accumulate congestion marking
   along a path. Each ECN-capable router marks some packets with CE,
   the marking probability increasing with the length of the queue at
   its egress link. The only difference with pre-congestion
   marking [PCN] is that marking is based on the length of a virtual
   queue, so that the real queue occupancy can remain very low. We will
   use the terms congestion and pre-congestion interchangeably in the
   following unless it is important to distinguish between them.

   With multiple ECN-capable routers on a path, the ECN field
   accumulates the fraction of CE marking that each router adds. The
   combined effect of the packet marking of all the routers along the
   path signals congestion of the whole path to the receiver. So, for
So, for 489 example, if one router early in a path is marking 1% of packets and 490 another later in a path is marking 2%, flows that pass through both 491 routers will experience approximately 3% marking. 493 The packets crossing an inter-domain trust boundary within the 494 Diffserv region will all have come from different ingress gateways 495 and will all be destined for different egress gateways. We will show 496 that the key to policing against theft of service is for a border 497 router to be able to directly measure the congestion that is about to 498 be caused by the traffic it forwards. That is, it can measure 499 locally the congestion on each of the downstream paths between itself 500 and the egress gateways that its traffic is destined for. 502 With the original ECN protocol, if CE markings crossing the border 503 had been counted over a period, they would have represented the 504 accumulated upstream congestion that had already been experienced by 505 those packets. The general idea of re-ECN is for the ingress gateway 506 to continuously encode path congestion into the IP header where, in 507 this case, `path' means from ingress to egress gateway. Then at any 508 point on that path (e.g. between domains A & B in Figure 2 below), IP 509 headers can be monitored to subtract upstream congestion from 510 expected path congestion in order to give the expected downstream 511 congestion still to be experienced until the egress gateway. 513 Importantly, it turns out that there is no need to monitor downstream 514 congestion on a per-flow basis. We will show that accounting for it 515 in bulk across all flows will be sufficient. 
517 _____________________________________ 518 _|__ ______ ______ ______ _|__ 519 | | | A | | B | | C | | | 520 +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ 521 | | |B| |B| |B| |B| |B| |B| | | 522 |Ingr|==|R| |R|==|R| |R|==|R| |R|==|Egr | 523 |G/W | | | | |: | | | | | | | | |G/W | 524 +----+ +-+ +-+: +-+ +-+ +-+ +-+ +----+ 525 | | | |: | | | | | | 526 |____| |______|: |______| |______| |____| 527 |_____________:_______________________| 528 : 529 | : | 530 |<-upstream-->:<-expected downstream->| 531 | congestion : congestion | 532 | u v ~= p - u | 533 | | 534 |<--- expected path congestion, p --->| 536 Figure 2: Re-ECN concept 538 4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) 540 In this section we define the names of the various codepoints of the 541 re-ECN protocol when used with pre-congestion notification, deferring 542 description of their semantics to the following sections. But first 543 we recap the re-ECN wire protocol proposed in [Re-TCP]. 545 4.2.1. Re-ECN Recap 547 Re-ECN uses the two bit ECN field broadly as in RFC3168 [RFC3168]. 548 It also uses a new re-ECN extension (RE) flag. The actual position 549 of the RE flag is different between IPv4 & v6 headers so we will use 550 an abstraction of the IPv4 and v6 wire protocols by just calling it 551 the RE flag. [Re-TCP] proposes using bit 48 (currently unused) in 552 the IPv4 header for the RE flag, while for IPv6 it proposes an ECN 553 extension header. 555 Unlike the ECN field, the RE flag is intended to be set by the sender 556 and remain unchanged along the path, although it can be read by 557 network elements that understand the re-ECN protocol. In the 558 scenario used in this memo, the ingress gateway acts as a proxy for 559 the sender, setting the RE flag as permitted in the specification of 560 re-ECN. 562 Note that general-purpose routers do not have to read the RE flag, 563 only special policing elements at borders do. 
And no general-purpose 564 routers have to change the RE flag, although the ingress and egress 565 gateways do because in the edge-to-edge deployment model we are 566 using, they act as proxies for the endpoints. Therefore the RE flag 567 does not even have to be visible to interior routers. So the RE flag 568 has no implications on protocols like MPLS. Congested label 569 switching routers (LSRs) would have to be able to notify their 570 congestion with an ECN/PCN codepoint in the MPLS shim [ECN-MPLS], but 571 like any interior IP router, they can be oblivious to the RE flag, 572 which need only be read by border policing functions. 574 Although the RE flag is a separate, single bit field, it can be read 575 as an extension to the two-bit ECN field; the three concatenated bits 576 in what we will call the extended ECN field (EECN) make eight 577 codepoints available. When the RE flag setting is "don't care", we 578 use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes 579 the following six codepoint names for when there is a need to be more 580 specific. 
582    +-------+------------+------+---------------+-----------------------+
583    | ECN   | RFC3168    | RE   | Extended ECN  | Re-ECN meaning        |
584    | field | codepoint  | flag | codepoint     |                       |
585    +-------+------------+------+---------------+-----------------------+
586    | 00    | Not-ECT    | 0    | Not-RECT      | Not re-ECN-capable    |
587    |       |            |      |               | transport             |
588    | 00    | Not-ECT    | 1    | FNE           | Feedback not          |
589    |       |            |      |               | established           |
590    | 01    | ECT(1)     | 0    | Re-Echo       | Re-echoed congestion  |
591    |       |            |      |               | and RECT              |
592    | 01    | ECT(1)     | 1    | RECT          | Re-ECN capable        |
593    |       |            |      |               | transport             |
594    | 10    | ECT(0)     | 0    | ---           | Legacy ECN use        |
595    |       |            |      |               | only                  |
596    | 10    | ECT(0)     | 1    | --CU--        | Currently unused      |
597    |       |            |      |               |                       |
598    | 11    | CE         | 0    | CE(0)         | Congestion            |
599    |       |            |      |               | experienced with      |
600    |       |            |      |               | Re-Echo               |
601    | 11    | CE         | 1    | CE(-1)        | Congestion            |
602    |       |            |      |               | experienced           |
603    +-------+------------+------+---------------+-----------------------+

605 Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re- 606 ECN

608 4.2.2. Re-ECN Combined with Pre-Congestion Notification (re-PCN) 610 As permitted by the ECN specification [RFC3168], a proposal is 611 currently being advanced in the IETF to define different semantics 612 for how routers might mark the ECN field of certain packets. The 613 idea is to be able to notify congestion when the router's load 614 approaches a logical limit, rather than the physical limit of the 615 line. This new marking is called pre-congestion notification [PCN] 616 and we will use the term PCN-enabled router for a router that can 617 apply pre-congestion notification marking to the ECN fields of 618 packets. 620 [RFC3168] recommends that a packet's Diffserv codepoint should 621 determine which type of ECN marking it receives. A Diffserv per-hop 622 behaviour (PHB) can specify that routers should apply pre-congestion 623 notification marking to PCN-capable packets. We will call this a 624 PCN-enhanced PHB.
A PCN-capable packet must meet two conditions: it 625 must carry a DSCP that maps to a PCN-enhanced PHB and it must carry 626 an ECN field that turns on PCN marking. 628 As an example, the controlled load (CL) PHB might specify expedited 629 forwarding as its scheduling behaviour and PCN marking as its 630 congestion marking behaviour. Then we would say the CL PHB is a PCN- 631 enhanced PHB, and that packets with a DSCP that maps to the CL PHB 632 and with ECN turned on are PCN-capable packets. 634 [PCN] actually proposes that two logical limits should be used for 635 pre-congestion notification, with the higher limit as a back-stop for 636 dealing with anomalous events. It envisages PCN will be used for 637 admission control of inelastic real-time traffic, so marking at the 638 lower limit will trigger admission control, while at the higher limit 639 it will trigger flow pre-emption. 641 Because it needs two types of congestion marking, PCN seems to need 642 five states: Not-ECT, ECT (ECN-capable transport), the ECN Nonce, 643 Admission Marking (AM) and Flow Pre-emption Marking (PM). [PCN] 644 proposes various alternative encodings of the ECN field, attempting 645 various compromises to fit these five states into the four available 646 ECN codepoints. 648 One of the five states to make room for is the ECN Nonce [RFC3540], 649 but the capability we describe in this memo supersedes any need for 650 the Nonce. The ECN Nonce is an elegant scheme, but it only allows a 651 sending node (or its proxy) to detect suppression of congestion 652 marking in the feedback loop. Thus the Nonce requires the sender or 653 its proxy to be trusted to respond correctly to congestion. But 654 failure to respond correctly to congestion is precisely the main cheat we want to protect against (as well as 655 many others). 657 One of the compromise protocol encodings that [PCN] explores 658 ("Alternative 5") leaves out support for the ECN Nonce. Therefore we 659 use that one.
This encoding of PCN markings is shown on the left of 660 Table 2. Note that these codepoints of the ECN field only take on 661 the semantics of pre-congestion notification if they are combined with 662 a Diffserv codepoint that the operator has configured to cause PCN 663 marking, by mapping it to a PCN-enhanced PHB. 665 For the rest of this memo, we will not distinguish between Admission 666 Marking and Pre-emption Marking unless we need to be specific. We 667 will call both "congestion marking". With the above encoding, 668 congestion marking can be read to mean any packet with the left-most 669 bit of the ECN field set. 671 The re-ECN protocol can be used to control misbehaving sources 672 whether congestion is with respect to a logical threshold (PCN) or 673 the physical line rate (ECN). In either case the RE flag can be used 674 to create an extended ECN field. For PCN-capable packets, the 8 675 possible encodings of this 3-bit extended ECN (EECN) field are 676 defined on the right of Table 2 below. The purposes of these 677 different codepoints will be introduced in subsequent sections.
679    +-------+-----------------+------+-------------+--------------------+
680    | ECN   | PCN codepoint   | RE   | Extended    | Re-ECN meaning     |
681    | field | (Alternative 5) | flag | ECN         |                    |
682    |       |                 |      | codepoint   |                    |
683    +-------+-----------------+------+-------------+--------------------+
684    | 00    | Not-ECT         | 0    | Not-RECT    | Not re-ECN-capable |
685    |       |                 |      |             | transport          |
686    | 00    | Not-ECT         | 1    | FNE         | Feedback not       |
687    |       |                 |      |             | established        |
688    | 01    | ECT(1)          | 0    | Re-Echo     | Re-echoed          |
689    |       |                 |      |             | congestion and     |
690    |       |                 |      |             | RECT               |
691    | 01    | ECT(1)          | 1    | RECT        | Re-ECN capable     |
692    |       |                 |      |             | transport          |
693    | 10    | AM              | 0    | AM(0)       | Admission Marking  |
694    |       |                 |      |             | with Re-Echo       |
695    | 10    | AM              | 1    | AM(-1)      | Admission Marking  |
696    |       |                 |      |             |                    |
697    | 11    | PM              | 0    | PM(0)       | Pre-emption        |
698    |       |                 |      |             | Marking with       |
699    |       |                 |      |             | Re-Echo            |
700    | 11    | PM              | 1    | PM(-1)      | Pre-emption        |
701    |       |                 |      |             | Marking            |
702    +-------+-----------------+------+-------------+--------------------+

704 Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre- 705 congestion Notification (PCN)

707 4.3. Protocol Operation 709 4.3.1. Protocol Operation for an Established Flow 711 The re-ECN protocol involves a simple tweak to the action of the 712 gateway at the ingress edge of the CL region. In the deployment 713 model just described [CL-deploy], for each active traffic aggregate 714 across the CL region (CL-region-aggregate) the ingress gateway will 715 hold a fairly recent Congestion-Level-Estimate that the egress 716 gateway will have fed back to it, piggybacked on the signalling that 717 sets up each flow. For instance, one aggregate might have been 718 experiencing 3% pre-congestion (that is, congestion marked octets 719 whether Admission Marked or Pre-emption Marked). In this case, the 720 ingress gateway MUST clear the RE flag to "0" for the same percentage 721 of octets of CL-packets (3%) and set it to "1" in the rest (97%).
722 Appendix A.1 gives a simple pseudo-code algorithm that the ingress 723 gateway may use to do this. 725 The RE flag is set and cleared this way round for incremental 726 deployment reasons (see [Re-TCP]). To avoid confusion we will use 727 the term `blanking' (rather than marking) when the RE flag is cleared 728 to "0", so we will talk of the `RE blanking fraction' as the fraction 729 of octets with the RE flag cleared to "0". 731 ^ 732 | 733 | RE blanking fraction 734 3% | +----------------------------+====+ 735 | | | | 736 2% | | | | 737 | | congestion marking fraction| | 738 1% | | +----------------------+ | 739 | | | | 740 0% +----+=====+---------------------------+------> 741 ^ <--A---> <---B---> <---C---> ^ domain 742 | ^ ^ | 743 ingress | | egress 744 1.00% 2.00% marking fraction 746 Figure 3: Example Extended ECN codepoint Marking fractions 747 (Imprecise) 749 Figure 3 illustrates our example. The horizontal axis represents the 750 index of each congestible resource (typically queues) along a path 751 through the Internet. The two superimposed plots show the fraction 752 of each ECN codepoint observed along this path, assuming there are 753 two congested routers somewhere within domains A and C. And Table 3 754 below shows the downstream pre-congestion measured at various border 755 observation points along the path. Figure 4 (later) shows the same 756 results of these subtractions, but in graphical form like the above 757 figure. The tabulated figures are actually reasonable approximations 758 derived from more precise formulae given in Appendix A of [Re-TCP]. 759 The RE flag is not changed by interior routers, so it can be seen 760 that it acts as a reference against which the congestion marking 761 fraction can be compared along the path. 
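The blanking rule of this section (clear the RE flag on the same fraction of octets as the Congestion-Level-Estimate) can be sketched in Python as follows. This is not the Appendix A.1 algorithm, which is not reproduced here. It is a hypothetical credit-based variant, and the class and method names are invented for illustration.

```python
class ReBlanker:
    """Blank the RE flag on roughly `fraction` of the octets sent.

    Hypothetical helper: keeps a running credit of octets that are
    still owed to blanking and blanks a packet once the credit covers it.
    """

    def __init__(self, fraction):
        self.fraction = fraction  # Congestion-Level-Estimate, e.g. 0.03
        self.credit = 0.0         # octets owed to blanking so far

    def re_flag(self, size):
        """Return the RE flag for a packet of `size` octets (0 = blanked)."""
        self.credit += self.fraction * size
        if self.credit >= size:
            self.credit -= size
            return 0  # blank: this packet re-echoes congestion
        return 1      # leave the RE flag set


blanker = ReBlanker(0.03)
flags = [blanker.re_flag(1000) for _ in range(1000)]
print(f"RE blanking fraction: {flags.count(0) / len(flags):.3f}")
```

A credit counter spreads blanked packets evenly through the aggregate. A randomised choice per packet would give the same long-run fraction with more short-term variation.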
763    +--------------------------+---------------------------------------+
764    | Border observation point | Approximate Downstream pre-congestion |
765    +--------------------------+---------------------------------------+
766    | ingress -- A             | 3% - 0% = 3%                          |
767    | A -- B                   | 3% - 1% = 2%                          |
768    | B -- C                   | 3% - 1% = 2%                          |
769    | C -- egress              | 3% - 3% = 0%                          |
770    +--------------------------+---------------------------------------+

772 Table 3: Downstream Congestion Measured at Example Observation Points

774 Note that the ingress determines the RE blanking fraction for each 775 aggregate using the most recent feedback from the relevant egress, 776 arriving with each new reservation, or each refresh. These updates 777 arrive relatively infrequently compared to the speed with which 778 congestion changes. Although this feedback will always be out of 779 date, on average positive errors should cancel out negative over a 780 sufficiently long duration. 782 In summary, the network adds pre-congestion marking in the forward 783 data path, the egress feeds its level back to the ingress in RSVP (or 784 similar signalling), then the ingress gateway re-echoes it into the 785 forward data path by blanking the RE flag. Hence the name re-ECN. 786 Then at any border within the Diffserv region, the pre-congestion 787 marking that every passing packet will be expected to experience 788 downstream can be measured to be the RE blanking fraction minus the 789 congestion marking fraction. 791 4.3.2. Aggregate Bootstrap 793 When a new reservation PATH message arrives at the egress, if there 794 are currently no flows in progress from the same ingress, there will 795 be no state maintaining the current level of pre-congestion marking 796 for the aggregate. While the reservation signalling continues onward 797 towards the receiving host, the egress gateway returns an RSVP 798 message to the ingress with a flag [RSVP-ECN] asking the ingress to 799 send a specified number of data probes between them.
This bootstrap 800 behaviour is all described in the deployment model [CL-deploy]. 802 However, with our new re-ECN scheme, the ingress does not know what 803 proportion of the data probes should have the RE flag blanked, 804 because it has no estimate yet of pre-congestion for the path across 805 the Diffserv region. 807 To be conservative, following the guidance for specifying other re- 808 ECN transports in [Re-TCP], the ingress SHOULD set the FNE codepoint 809 of the extended ECN header in all probe packets (Table 2). As per 810 the deployment model, the egress gateway measures the fraction of 811 congestion-marked probe octets and feeds back the resulting pre- 812 congestion level to the ingress, piggy-backed on the returning 813 reservation response (RESV) for the new flow. Probe packets are 814 identifiable by the egress because they have the ingress as the 815 source and the egress as the destination in the IP header. 817 It may seem inadvisable to expect the FNE codepoint to be set on 818 probes, given legacy firewalls etc. might discard such packets 819 (because this flag had no previous legitimate use). However, in the 820 deployment scenarios envisaged, each domain in the Diffserv region 821 has to be explicitly configured to support the controlled load 822 service. So, before deploying the service, the operator MUST 823 reconfigure such a misbehaving middlebox to allow through packets 824 with the RE flag set. 826 Note that we have said SHOULD rather than MUST for the FNE setting 827 behaviour of the ingress for probe packets. This entertains the 828 possibility of an ingress implementation having the benefit of other 829 knowledge of the path, which it re-uses for a newly starting 830 aggregate. For instance, it may hold cached information from a 831 recent use of the aggregate that is still sufficiently current to be 832 useful. 
834 It might seem pedantic worrying about these few probe packets, but 835 this behaviour ensures the system is safe, even if the proportion of 836 probe packets becomes large. 838 4.3.3. Flow Bootstrap 840 It might be expected that a new flow within an active aggregate would 841 need no special bootstrap behaviour. If there was an aggregate 842 already in progress between the gateways the new flow was about to 843 use, it would inherit the prevailing RE blanking fraction. And if 844 there were no active aggregate, the bootstrap behaviour for an 845 aggregate would be appropriate and sufficient for the new flow. 847 However, for a number of reasons, at least the first packet of each 848 new flow SHOULD be set to the FNE codepoint, irrespective of whether 849 it is joining an active aggregate or not. If the first packet is 850 unlikely to be reliably delivered, a number of FNE packets MAY be 851 sent to increase the probability that at least one is delivered to 852 the egress gateway. 854 If each flow does not start with an FNE packet, it will be seen later 855 that sanctions may be too strict at the interface before the egress 856 gateway. It will often be possible to apply sanctions at the 857 granularity of aggregates rather than flows, but in an internetworked 858 environment it cannot be guaranteed that aggregates will be 859 identifiable in remote networks. So setting FNE at the start of each 860 flow is a safe strategy. For instance, a remote network may have 861 equal cost multi-path (ECMP) routing enabled, causing different flows 862 between the same gateways to traverse different paths. 864 After an idle period of more than 1 second, the ingress gateway 865 SHOULD set the EECN field of the next packet it sends to FNE. This 866 allows the design of network policers to be deterministic (see [Re- 867 TCP]). 
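The two FNE rules of this section (the first packet of each new flow, and the first packet after an idle period of more than 1 second) can be sketched as a small per-flow state machine. The class below is a hypothetical illustration in Python. The names and the use of a caller-supplied timestamp are assumptions, not part of the memo.

```python
IDLE_LIMIT = 1.0  # seconds, from "an idle period of more than 1 second"


class FlowFneState:
    """Decide when the ingress gateway should send FNE for one flow."""

    def __init__(self):
        self.last_sent = None  # time of the previous packet; None for a new flow

    def needs_fne(self, now):
        """Return True if the packet sent at time `now` should carry FNE."""
        first_packet = self.last_sent is None
        idle_too_long = (not first_packet) and (now - self.last_sent > IDLE_LIMIT)
        self.last_sent = now
        return first_packet or idle_too_long


flow = FlowFneState()
for t in (0.0, 0.5, 2.0, 3.0):
    print(f"t={t:.1f}s  FNE={flow.needs_fne(t)}")
```

In this sketch a gap of exactly 1 second does not trigger FNE, matching the wording "more than 1 second".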
869 However, if the ingress gateway can guarantee that the network(s) 870 that will carry the flow to its egress gateway all use a common 871 identifier for the aggregate (e.g. a single MPLS network without ECMP 872 routing), it MAY omit FNE when it adds a new flow to an active 873 aggregate. In that case, an FNE packet need only be sent if a whole aggregate 874 has been idle for more than 1 second. 876 4.3.4. Router Forwarding Behaviour 878 Adding re-ECN works well without modifying the forwarding behaviour 879 of any routers. However, below, two changes are proposed when 880 forwarding packets with a per-hop-behaviour that requires pre- 881 congestion notification: 883 Preferential drop: When a router cannot avoid dropping ECN-capable 884 packets, preferential dropping of packets with different extended 885 ECN codepoints SHOULD be implemented between packets within a PHB 886 that uses PCN marking. The drop preference order to use is 887 defined in Table 4. Note that to reduce configuration complexity, 888 Re-Echo and FNE MAY be given the same drop preference, but if 889 feasible, FNE should be dropped in preference to Re-Echo.
891    +--------+------+----------------+---------+------------------------+
892    | ECN    | RE   | Extended ECN   | Drop    | Re-ECN meaning         |
893    | field  | flag | codepoint      | Pref    |                        |
894    +--------+------+----------------+---------+------------------------+
895    | 01     | 0    | Re-Echo        | 5/4     | Re-echoed congestion   |
896    |        |      |                |         | and RECT               |
897    | 00     | 1    | FNE            | 4       | Feedback not           |
898    |        |      |                |         | established            |
899    | 01     | 1    | RECT           | 3       | Re-ECN capable         |
900    |        |      |                |         | transport              |
901    | 10     | 0    | AM(0)          | 3       | Admission Marking with |
902    |        |      |                |         | Re-Echo                |
903    | 10     | 1    | AM(-1)         | 3       | Admission Marking      |
904    |        |      |                |         |                        |
905    | 11     | 0    | PM(0)          | 2       | Pre-emption Marking    |
906    |        |      |                |         | with Re-Echo           |
907    | 11     | 1    | PM(-1)         | 2       | Pre-emption Marking    |
908    |        |      |                |         |                        |
909    | 00     | 0    | Not-RECT       | 1       | Not re-ECN-capable     |
910    |        |      |                |         | transport              |
911    +--------+------+----------------+---------+------------------------+

913 Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

915 Given this proposal is being advanced at the same time as PCN 916 itself, we strongly RECOMMEND that preferential drop based on 917 extended ECN codepoint is added to router forwarding at the same 918 time as PCN marking. Preferential dropping can be difficult to 919 implement, but we strongly RECOMMEND this security-related re-ECN 920 improvement where feasible as it is an effective defence against 921 flooding attacks. 923 Marking vs. Drop: We propose that PCN-routers SHOULD inspect the RE 924 flag as well as the ECN field to decide whether to drop or mark 925 PCN DSCPs. They MUST choose drop if the codepoint of this 926 extended ECN field is Not-RECT. Otherwise they SHOULD mark 927 (unless, of course, buffer space is exhausted). 929 A PCN-capable router MUST NOT ever congestion mark a packet 930 carrying the Not-RECT codepoint because the transport will only 931 understand drop, not congestion marking.
But a PCN-capable router 932 can mark rather than drop an FNE packet, even though its ECN field 933 when looked at in isolation is '00' which appears to be a legacy 934 Not-ECT packet. Therefore, if a packet's RE flag is '1', even if 935 its ECN field is '00', a PCN-enabled router SHOULD use congestion 936 marking. This allows the `feedback not established' (FNE) 937 codepoint to be used for probe packets, in order to pick up PCN 938 marking when bootstrapping an aggregate. 940 ECN marking rather than dropping of FNE packets MUST only be 941 deployed in controlled environments, such as that in [CL-deploy], 942 where the presence of an egress node that understands ECN marking 943 is assured. Congestion events might otherwise be ignored if the 944 receiver only understands drop, rather than ECN marking. This is 945 because there is no guarantee that ECN capability has been 946 negotiated if feedback is not established (FNE). Also, [Re-TCP] 947 places the strong condition that a router MUST apply drop rather 948 than marking to FNE packets unless it can guarantee that FNE 949 packets are rate limited either locally or upstream. 951 4.3.5. Extensions 953 If a different signalling system, such as NSIS, were used, but it 954 provided admission control in a similar way, using pre-congestion 955 notification (e.g. with RMD [NSIS-RMD]) we believe re-ECN could be 956 used to protect against misbehaving networks in the same way as 957 proposed above. 959 5. Emulating Border Policing with Re-ECN 961 5.1. Informal Terminology 963 In the rest of this memo, where the context makes it clear, we will 964 sometimes loosely use the term `congestion' rather than using the 965 stricter `downstream pre-congestion'. Also we will loosely talk of 966 positive or negative flows, meaning flows where the moving average of 967 the downstream pre-congestion metric is persistently positive or 968 negative. 
The notion of a negative metric arises because it is 969 derived by subtracting one metric from another. Of course actual 970 downstream congestion cannot be negative, only the metric can 971 (whether due to time lags or deliberate malice). 973 Just as we will loosely talk of positive and negative flows, we will 974 also talk of positive or negative packets, meaning packets that 975 contribute positively or negatively to downstream pre-congestion. 977 Therefore packets can be considered to have a `worth' of +1, 0 or -1, 978 which, when multiplied by their size, indicates their contribution to 979 downstream congestion. Packets will usually be sent with a worth of 980 0. Blanking the RE flag increments the worth of a packet to +1. 981 Congestion marking a packet decrements its worth (whether admission 982 marking or pre-emption marking). Congestion marking a previously 983 blanked packet cancels out the positive worth of blanking against the negative worth of 984 marking (a worth of 0). The FNE codepoint is an exception. It has 985 the same positive worth as a packet with the Re-Echo codepoint. The 986 table below specifies unambiguously the worth of each extended ECN 987 codepoint. Note the order is different from the previous table to 988 emphasise how congestion marking processes decrement the worth.
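The worth arithmetic just described can be written as a small Python function. It is a sketch only, and the table that follows remains the authoritative definition. The function names are hypothetical. Treating a Not-RECT packet as contributing nothing is an assumption of the sketch, since the memo lists its worth as not applicable.

```python
def worth(ecn, re_flag):
    """Worth of an extended ECN codepoint: +1, 0, -1, or None if not applicable.

    `ecn` is the two-bit ECN field (0b00..0b11) and `re_flag` is 0 or 1.
    """
    if ecn == 0b00:
        # FNE (RE = 1) is the exception with worth +1; Not-RECT has no worth.
        return 1 if re_flag == 1 else None
    blanked = 1 if re_flag == 0 else 0  # blanking the RE flag adds +1
    marked = 1 if ecn & 0b10 else 0     # AM or PM: left-most ECN bit set
    return blanked - marked


def contribution(ecn, re_flag, size):
    """Octets this packet contributes to downstream pre-congestion."""
    w = worth(ecn, re_flag)
    return 0 if w is None else w * size  # assumption: Not-RECT counts as 0


for ecn in (0b00, 0b01, 0b10, 0b11):
    for re_flag in (0, 1):
        print(f"ECN={ecn:02b} RE={re_flag} worth={worth(ecn, re_flag)}")
```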
990    +--------+------+------------------+-------+------------------------+
991    | ECN    | RE   | Extended ECN     | Worth | Re-ECN meaning         |
992    | field  | flag | codepoint        |       |                        |
993    +--------+------+------------------+-------+------------------------+
994    | 00     | 0    | Not-RECT         | n/a   | Not re-ECN-capable     |
995    |        |      |                  |       | transport              |
996    | 01     | 0    | Re-Echo          | +1    | Re-echoed congestion   |
997    |        |      |                  |       | and RECT               |
998    | 10     | 0    | AM(0)            | 0     | Admission Marking with |
999    |        |      |                  |       | Re-Echo                |
1000   | 11     | 0    | PM(0)            | 0     | Pre-emption Marking    |
1001   |        |      |                  |       | with Re-Echo           |
1002   | 00     | 1    | FNE              | +1    | Feedback not           |
1003   |        |      |                  |       | established            |
1004   | 01     | 1    | RECT             | 0     | Re-ECN capable         |
1005   |        |      |                  |       | transport              |
1006   | 10     | 1    | AM(-1)           | -1    | Admission Marking      |
1007   |        |      |                  |       |                        |
1008   | 11     | 1    | PM(-1)           | -1    | Pre-emption Marking    |
1009   +--------+------+------------------+-------+------------------------+

1011 Table 5: 'Worth' of Extended ECN Codepoints

1013 5.2. Policing Overview 1015 It will be recalled that downstream congestion can be found by 1016 subtracting upstream congestion from path congestion. Figure 4 1017 displays the difference between the two plots in Figure 3 to show 1018 downstream pre-congestion across the same path through the Internet. 1020 To emulate border policing, the general idea is for each domain to 1021 apply penalties to its upstream neighbour in proportion to the amount 1022 of downstream pre-congestion that the upstream network sends across 1023 the border. That is, the penalties should be in proportion to the 1024 height of the plot. Downward arrows in the figure show the resulting 1025 pressure for each domain to under-declare downstream pre-congestion 1026 in traffic they pass to the next domain, because of the penalties.
1028 p e n a l t i e s 1029 / | \ 1030 A : : : 1031 | | <--A---> <---B---> <---C---> domain 1032 | V : : : 1033 3% | +-----+ | | : 1034 | | | V V : 1035 2% | | +----------------------+ : 1036 | | downstream pre-congestion | : 1037 1% | | : | : 1038 | | : | : 1039 0% +----+----------------------------+====+------> 1040 : : : A : 1041 : : : | : 1042 ingress : : : egress 1043 1.00% 2.00%: pre-congestion 1044 | 1045 sanctions 1047 Figure 4: Policing Framework, showing creation of opposing pressures 1048 to under-declare and over-declare downstream pre-congestion, using 1049 penalties and sanctions 1051 These penalties seem to encourage everyone to understate downstream 1052 congestion in order to reduce the penalties they incur. But a 1053 balancing pressure is introduced by the last domain, which applies 1054 sanctions to flows if downstream congestion goes negative before the 1055 egress gateway. The upward arrow at Domain C's border with the 1056 egress gateway represents the incentive the sanctions would create to 1057 prevent negative traffic. The same upward pressure can be applied at 1058 any domain border (arrows not shown). 1060 Any flow that persistently goes negative by the time it leaves a 1061 domain cannot have been marked correctly in the first place. A 1062 domain that discovers such a flow can adopt a range of strategies to 1063 protect itself. Which strategy it uses will depend on policy, 1064 because it cannot immediately assume malice---there may be an 1065 innocent configuration error somewhere in the system. 1067 This memo does not propose to standardise any particular mechanism to 1068 detect persistently negative flows, but Section 5.5 does give 1069 examples. Note that we have used the term flow, but there will be no 1070 need to delve into the transport layer for port numbers; identifiers 1071 visible in the network layer will be sufficient (IP address pair, 1072 DSCP, protocol ID).
The appendix also gives a mechanism to bound the 1073 required flow state, preventing state exhaustion attacks. 1075 Of course, some domains may trust other domains to comply with 1076 admission control without applying sanctions or penalties. In these 1077 cases, the protocol should still be used but no penalties need be 1078 applied. The re-ECN protocol ensures downstream pre-congestion 1079 marking is passed on correctly whether or not penalties are applied 1080 to it, so the system works just as well with a mixture of some 1081 domains trusting each other and others not. 1083 Providers should be free to agree the contractual terms they wish 1084 between themselves, so this memo does not propose to standardise how 1085 these penalties would be applied. It is sufficient to standardise 1086 the re-ECN protocol so the downstream pre-congestion metric is 1087 available if providers choose to use it. However, the next section 1088 (Section 5.3) gives some examples of how these penalties might be 1089 implemented. 1091 5.3. Pre-requisite Contractual Arrangements 1093 The re-ECN protocol has been chosen to solve the policing problem 1094 because it embeds a downstream pre-congestion metric in passing CL 1095 traffic that is difficult to lie about and can be measured in bulk. 1096 The ability to emulate border policing depends on network operators 1097 choosing to use this metric as one of the elements in their contracts 1098 with each other. 1100 Already many inter-domain agreements involve a capacity and a usage 1101 element. The usage element may be based on volume or various 1102 measures of peak demand. We expect that those network operators who 1103 choose to use pre-congestion notification for admission control would 1104 also be willing to consider using this downstream pre-congestion 1105 metric as a usage element in their interconnection contracts for 1106 admission controlled (CL) traffic. 
1108 Congestion (or pre-congestion) has the dimension of [octet], being 1109 the product of volume transferred [octet] and the congestion fraction 1110 [dimensionless], which is the fraction of the offered load that the 1111 network isn't able to serve (or would rather not serve in the case of 1112 pre-congestion). Measuring downstream congestion gives a measure of 1113 the volume transferred but modulated by congestion expected 1114 downstream. So volume transferred during off-peak periods counts as 1115 nearly nothing, while volume transferred at peak times counts very 1116 highly. The re-ECN protocol allows one network to measure how much 1117 pre-congestion has been `dumped' into it by another network. And 1118 then in turn how much of that pre-congestion it dumped into the next 1119 downstream network. 1121 Section 5.6 describes mechanisms for calculating border penalties 1122 referring to Appendix A.2 for suggested metering algorithms for 1123 downstream congestion at a border router. Conceptually, it could 1124 hardly be simpler. It broadly involves accumulating the volume of 1125 packets with the RE flag blanked and the volume of those with 1126 congestion marking then subtracting the two. 1128 Once this downstream pre-congestion metric is available, operators 1129 are free to choose how they incorporate it into their interconnection 1130 contracts [IXQoS]. Some may include a threshold volume of pre- 1131 congestion as a quality measure in their service level agreement, 1132 perhaps with a penalty clause if the upstream network exceeds this 1133 threshold over, say, a month. Others may agree a set of tiered 1134 monthly thresholds, with increasing penalties as each threshold is 1135 exceeded. But, it would be just as easy, and more resistant to 1136 gaming, to do away with discrete thresholds, and instead make the 1137 penalty rise smoothly with the volume of pre-congestion by applying a 1138 price to pre-congestion itself. 
Then the usage element of the 1139 interconnection contract would directly relate to the volume of pre- 1140 congestion caused by the upstream network. 1142 The direction of penalties and charges relative to the direction of 1143 traffic flow is a constant source of confusion. Typically, where 1144 capacity charges are concerned, lower tier customer networks pay 1145 higher tier provider networks. So money flows from the edges to the 1146 middle of the internetwork, towards greater connectivity, 1147 irrespective of the flow of data. But we advise that penalties or 1148 charges for usage should follow the same direction as the data 1149 flow---the direction of control at the network layer. Otherwise a 1150 network lays itself open to `denial of funds' attacks. So, where a 1151 tier 2 provider sends data into a tier 3 customer network, we would 1152 expect the penalty clauses for sending too much pre-congestion to be 1153 against the tier 2 network, even though it is the provider. 1155 It may help to remember that data will be flowing in the other 1156 direction too. So the provider network has as much opportunity to 1157 levy usage penalties as its customer, and it can set the price or 1158 strength of its own penalties higher if it chooses. Usage charges in 1159 both directions tend to cancel each other out, which confirms that 1160 usage-charging is less to do with revenue raising and more to do with 1161 encouraging load control discipline in order to smooth peaks and 1162 troughs, improving utilisation and quality. 1164 Further, when operators agree penalties in their interconnection 1165 contracts for sending downstream congestion, they should make sure 1166 that any level of negative marking only equates to zero penalty. In 1167 other words, penalties are always paid in the same direction as the 1168 data, and never against the data flow, even if downstream congestion 1169 seems to be negative. 
This is consistent with the definition of physical congestion; when a resource is underutilised, it is not negatively congested. Its congestion is just zero. So, although short periods of negative marking can be tolerated to correct temporary over-declarations due to lags in the feedback system, persistent downstream negative congestion can have no physical meaning and therefore must signify a problem. The incentive for domains not to tolerate persistently negative traffic depends on this principle that penalties must never be paid against the data flow.

Also note that at the last egress of the Diffserv region, domain C should not agree to pay any penalties to the egress gateway for pre-congestion passed to the egress gateway. Downstream pre-congestion to the egress gateway should have reached zero here. If domain C were to agree to pay for any remaining downstream pre-congestion, it would give the egress gateway an incentive to over-declare pre-congestion feedback and take the resulting profit from domain C.

To focus the discussion, from now on, unless otherwise stated, we will assume a downstream network charges its upstream neighbour in proportion to the pre-congestion it sends (V_b in the notation of Appendix A.2). Effectively, tiered thresholds would be just more coarse-grained approximations of the fine-grained case we choose to examine. If these neighbours had previously agreed that the (fixed) price per octet of pre-congestion would be L, then the bill at the end of the month would simply be the product L*V_b, plus any fixed charges they may also have agreed.

We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. Indeed, this principle is at the heart of all our own work.
Our aim here is to make a new metric available that we believe is superior to all existing metrics. Then, our aim is to show that border policing can at least work with the one model we have just outlined. We assume that operators might then experiment with the metric in other models. Of course, operators are free to complement this pre-congestion-based usage element of their charges with traditional capacity charging, and we expect they will.

Also note well that everything we discuss in this memo only concerns interconnection within the Diffserv region. ISPs are free to sell or give away reservations however they want on the retail market. But of course, interconnection charges will have a bearing on that. Indeed, in the present scenario, the ingress gateway effectively sells reservations on one side and buys congestion penalties on the other. As congestion rises, one can imagine the gateway discovering that congestion penalties have risen higher than the (probably fixed) revenue it will earn from selling the next flow reservation. This encourages the gateway to cut its losses by blocking new calls, which is why we believe downstream congestion penalties can emulate per-flow rate policing at borders, as the next section explains.

5.4. Emulation of Per-Flow Rate Policing: Rationale and Limits

The important feature of charging in proportion to congestion volume is that the penalty aggregates and disaggregates correctly along with packet flows. This is because the penalty rises linearly with bit rate (unless congestion is absolutely zero) and linearly with congestion, because it is the product of them both.
So if the packets crossing a border belong to a thousand flows, and one of those flows doubles its rate, the ingress gateway forwarding that flow will have to put twice as much congestion marking into the packets of that flow. And this extra congestion marking will add proportionately to the penalties levied at every border the flow crosses in proportion to the amount of pre-congestion remaining on the path.

Effectively, usage charges will continuously flow from ingress gateways to the places generating pre-congestion marking, in proportion to the pre-congestion marking introduced and to the data rates from those gateways.

As importantly, pre-congestion itself rises super-linearly with utilisation of a particular resource. So if someone tries to push another flow into a path that is already signalling enough pre-congestion to warrant admission control, the penalty will be a lot greater than it would have been to add the same flow to a less congested path. This makes the incentive system fairly insensitive to the actual level of pre-congestion for triggering admission control that each ingress chooses. The deterrent against exceeding whatever threshold is chosen rises very quickly with a small amount of cheating.

These are the properties that allow re-ECN to emulate per-flow border policing of both rate and admission control. It is not a perfect emulation of per-flow border policing, but we claim it is sufficient to at least ensure the cost to others of a cheat is borne by the cheater, because the penalties are at least proportionate to the level of the cheat. If an edge network operator is selling reservations at a large profit over the congestion cost, these pre-congestion penalties will not be sufficient to ensure networks in the middle get a share of those profits, but at least they can cover their costs.
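
The way the penalty aggregates across flows can be illustrated with a minimal Python sketch. It is illustrative only and not part of the protocol: the names Flow, congestion_volume and usage_charge, and the traffic figures, are assumptions made for this example. The charge is the price L times the pre-congestion volume V_b, floored at zero so that a penalty is never paid against the data flow (Section 5.3).

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """One flow crossing a border in an accounting period (hypothetical model)."""
    octets: int                  # volume transferred across the border
    downstream_fraction: float   # pre-congestion fraction remaining on the path


def congestion_volume(flows):
    """V_b: sum over flows of volume times downstream pre-congestion fraction."""
    return sum(f.octets * f.downstream_fraction for f in flows)


def usage_charge(flows, price_per_octet):
    """Usage element of the bill, L * V_b, never less than zero."""
    return price_per_octet * max(0.0, congestion_volume(flows))


# Assumed figures: a thousand flows of 1,000,000 octets each, over a path
# with a downstream pre-congestion fraction of 0.001.
flows = [Flow(octets=1_000_000, downstream_fraction=0.001) for _ in range(1000)]
base = usage_charge(flows, price_per_octet=1.0)

# One flow doubles its rate. The aggregate charge grows by that flow's
# original contribution, although the border never looks at individual flows.
flows[0] = Flow(octets=2_000_000, downstream_fraction=0.001)
doubled = usage_charge(flows, price_per_octet=1.0)
```

In this sketch the difference between doubled and base equals the contribution the first flow made before it doubled its rate, which is the sense in which the penalty aggregates and disaggregates along with packet flows.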

We will now explain with an example. When a whole internetwork is operating at normal (typically very low) congestion, the pre-congestion marking from virtual queues will be a little higher than if the real queues had been used---still low, but more noticeable. But low congestion levels do not imply that usage /charges/ must also be low. Usage charges will depend on the /price/ L as well.

If the metric of the usage element of an interconnection agreement were changed from pure volume to pre-congested volume, one would expect the price of pre-congestion to be arranged so that the total usage charge remained about the same. So, if an average pre-congestion fraction turned out to be 1/1000, one would expect that the price L (per octet) of pre-congestion would be about 1000 times the previously used (per octet) price for volume. We should add that a switch to pre-congestion is unlikely to exactly maintain the same overall level of usage charges, but this argument will be approximately true, because usage charge will rise to at least the level the market finds necessary to push back against usage.

From the above example it can be seen why a 1000x higher price will make operators become acutely sensitive to the congestion they cause in other networks, which is of course the desired effect: to encourage networks to /control/ the congestion they allow their users to cause to others.

If any network sends even one flow at a higher rate, it will immediately have to pay proportionately more usage charges. Because there is no knowledge of reservations within the Diffserv region, no interior router can police whether the rate of each flow is greater than its reservation. So the system doesn't truly emulate rate-policing of each flow.
But there is no incentive to pack a higher rate into a reservation, because the charges are directly proportional to rate, irrespective of the reservations.

However, if virtual queues start to fill on any path, even though real queues will still be able to provide low latency service, pre-congestion marking will rise fairly quickly. It may eventually reach the threshold where the ingress gateway would deny admission to new flows. If the ingress gateway cheats and continues to admit new flows, the affected virtual queues will rapidly fill, even though the real queues will still be little worse than they were when admission control should have been invoked. The ingress gateway will have to pay the penalty for such an extremely high pre-congestion level, so the pressure to invoke admission control should become unbearable.

The above mechanisms protect against rational operators. In Section 5.6.3 we discuss how networks can protect themselves from accidental or deliberate misconfiguration in neighbouring networks.

5.5. Sanctioning Dishonest Marking

As CL traffic leaves the last network before the egress gateway (domain C) the RE blanking fraction should match the congestion marking fraction, when averaged over a sufficiently long duration (perhaps ~10s to allow a few rounds of feedback through regular signalling of new and refreshed reservations).

To protect itself, domain C should install a monitor at its egress. It aims to detect flows of CL packets that are persistently negative. If flows are positive, domain C need take no action---this simply means an upstream network must be paying more penalties than it needs to. Appendix A.3 gives a suggested algorithm for the monitor, meeting the criteria below.

o  It SHOULD introduce minimal false positives for honest flows;

o  It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o  It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o  It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate for expected losses.

Note that the monitor operates on flows but with careful design we can avoid per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a monitor is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to setting FNE (positive `worth').

Starting flows with an FNE packet also means that a monitor will be resistant to state exhaustion attacks from other networks, as the monitor can then be designed to never create state unless an FNE packet arrives. And an FNE packet counts positive, so it will cost a lot for a network to send many of them.

Monitor algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a monitor MUST ignore packets with the FNE codepoint set. An ingress gateway sets the FNE codepoint when it does not have the benefit of feedback from the egress.
So counting packets with FNE set would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the monitor detects a persistently negative flow, it could drop sufficient negative and neutral packets to force the flow to not be negative. This is the approach taken for the `egress dropper' in [Re-TCP], but for the scenario in this memo, where everyone would expect everyone else to keep to the protocol, a management alarm SHOULD be raised on detecting persistently negative traffic and any automatic sanctions taken SHOULD be logged. Even if the chosen policy is to take no automatic action, the cause can then be investigated manually.

Then no ingress can understate downstream pre-congestion without its action being logged. So network operators can deal with offending networks at the human level, out of band. As a last resort, perhaps where the ingress gateway address seems to have been spoofed in the signalling, packets can be dropped. Drops could be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal harm.

A future version of this memo may define a control message that could be used to notify an offending ingress gateway (possibly via the egress gateway) that it is sending persistently negative flows. However, we are aware that such messages could be used to test the sensitivity of the detection system, so currently we prefer silent sanctions.

An extreme scenario would be where an ingress gateway (or set of gateways) mounted a DoS attack against another network.
If their traffic caused sufficient congestion to lead to drop but they understated path congestion to avoid penalties for causing high congestion, the preferential drop recommendations in Section 4.3.4 would at least ensure that these flows would always be dropped before honest flows.

5.6. Border Mechanisms

5.6.1. Border Accounting Mechanisms

One of the main design goals of re-ECN was for border security mechanisms to be as simple as possible, otherwise they would become the pinch-points that limit scalability of the whole internetwork. As the title of this memo suggests, we want to avoid per-flow processing at borders. We also want to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding. As data rates continue to rise, we suspect that all-optical interconnection between networks will soon be a requirement. So we want to avoid any new need for buffering (even though border filtering is current practice for other reasons, we don't want to make it even less likely that we will ever get rid of it).

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing, although we do use filtering as a fail-safe to temporarily shield against extreme events in other networks, such as accidental misconfigurations (Section 5.6.3).

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: AM(-1) and PM(-1).
Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network (see Section 5.3 for examples). Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix A.2.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), totalling all positive volume and subtracting all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 5.5 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.
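
The basic accounting mechanism, and the way a persistently negative flow pollutes it, can be sketched in a few lines of Python. This is an illustration only and is not the pseudo-code of Appendix A.2.1: the class name BorderMeter, the WORTH table, the packet sizes and the treatment of every other marking as neutral are assumptions made for this example.

```python
# Sign of the worth of each marking named in the text. Treating any other
# marking as neutral (worth 0) is an assumption of this sketch.
WORTH = {"Re-Echo": +1, "FNE": +1, "AM(-1)": -1, "PM(-1)": -1}


class BorderMeter:
    """Bulk meter for one border interface; it keeps no per-flow state."""

    def __init__(self):
        self.positive_octets = 0
        self.negative_octets = 0

    def count(self, marking, size_octets):
        """Account for one packet crossing the border."""
        worth = WORTH.get(marking, 0)
        if worth > 0:
            self.positive_octets += size_octets
        elif worth < 0:
            self.negative_octets += size_octets

    def downstream_congestion(self):
        """Positive volume minus negative volume, in octets."""
        return self.positive_octets - self.negative_octets


def meter(packets):
    """Run a list of (marking, size_octets) pairs through a fresh meter."""
    m = BorderMeter()
    for marking, size in packets:
        m.count(marking, size)
    return m


# An honest flow: more positive than negative volume, plus neutral packets.
honest_flow = [("Re-Echo", 1500)] * 10 + [("AM(-1)", 1500)] * 4 + [("Neutral", 1500)] * 50
# A persistently negative flow: congestion marked but never re-echoed.
negative_flow = [("AM(-1)", 1500)] * 5 + [("Neutral", 1500)] * 50

honest = meter(honest_flow)
polluted = meter(honest_flow + negative_flow)
```

With these assumed figures the honest flow alone measures 9000 octets of downstream congestion, but the aggregate falls to 1500 octets once the negative flow shares the interface, even though the negative flow cannot physically have caused negative congestion.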

We rely on sanctions to deter dishonest understatement of congestion. But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 4.3.4 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour. It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

Note that we have not included DoS by Internet hosts in the above list of attacks, because we have restricted ourselves to a scenario with edge-to-edge admission control across a Diffserv region. In this case, the edge ingress gateways insulate the Diffserv region from DoS by Internet hosts. Re-ECN resists more general DoS attacks, but this is discussed in [Re-TCP].

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix A.2.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where a flow goes negative, as well as all the networks downstream, lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows.
Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders---they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical---best endeavours will be sufficient.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive. Such over-zealous push-back is unnecessary and potentially dangerous. These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far. If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion.
Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine, but not a message claiming that a positive flow will go negative later and so should be dropped.

5.6.2. Competitive Routing

With the above penalty system, each domain seems to have a perverse incentive to fake pre-congestion. For instance domain B profits from the difference between penalties it receives at its ingress (its revenue) and those it pays at its egress (its cost). So if B overstates internal pre-congestion it seems to increase its profit. However, we can assume that domain A could bypass B, routing through other domains to reach the egress. So the competitive discipline of least-cost routing can ensure that any domain tempted to fake pre-congestion for profit risks losing /all/ its incoming traffic. The least congested route would eventually be able to win this competitive game, only as long as it didn't declare more fake pre-congestion than the next most competitive route.

This memo does not need to standardise any particular mechanism for routing based on re-ECN. Goldenberg et al [Smart_rtg] refer to various commercial products and present their own algorithms for moving traffic between multi-homed routes based on usage charges. None of these systems require any changes to standard protocols because the choice between the available border gateway protocol (BGP) routes is based on a combination of local knowledge of the charging regime and local measurement of traffic levels.
If, as we propose, charges or penalties were based on the level of re-ECN measured in passing traffic, a similar optimisation could be achieved without requiring any changes to standard routing protocols.

We must be clear that applying pre-congestion-based routing to this admission control system remains an open research issue. Traffic engineering based on congestion requires careful damping to avoid oscillations, and should not be attempted without adult supervision :) Mortier & Pratt [ECN-BGP] have analysed traffic engineering based on congestion. But without the benefit of re-ECN, they had to add a path attribute to BGP to advertise a route's downstream congestion (actually they proposed that BGP should advertise the charge for congestion, which we believe wrongly embeds an assumption into BGP that the only thing to do with congestion is charge for it).

5.6.3. Fail-safes

The mechanisms described so far create incentives for rational operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored.
If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows: A small sample of congestion marked packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the RE blanking fraction minus the congestion marking fraction is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear more quickly in the sample by selecting randomly solely from positive (or negative) packets.

Note that there is no assumption that /users/ behave rationally. The system is protected from the vagaries of irrational user behaviour by the ingress gateways, which transform internal penalties into a deterministic admission control mechanism that prevents users from misbehaving, by directly engineered means.

6. Analysis

The domains in Figure 1 are not expected to be completely malicious towards each other. After all, we can assume that they are all co-operating to provide an internetworking service to the benefit of each of them and their customers. Otherwise their routing policies would not interconnect them in the first place. However, we assume that they are also competitors of each other. So a network may try to contravene our proposed protocol if it would gain or make a competitor lose, or both, but only if it can do so without being caught.
Therefore we do not have to consider every possible random attack one network could launch on the traffic of another, given that one network can always drop or corrupt packets that it forwards on behalf of another anyway.

Therefore, we only consider new opportunities for /gainful/ attack that our proposal introduces. But to a certain extent we can also rely on the in-depth defences we have described (Section 5.6.3), which are intended to mitigate the potential impact of one network accidentally misconfiguring the workings of this protocol.

The ingress and egress gateways are shown in the most generic arrangement possible in Figure 1, without any surrounding network. This allows us to consider more specific cases where these gateways and a neighbouring network are operated by the same player. As well as cases where the same player operates neighbouring networks, we will also consider cases where the two gateways collude as one player and where the sender and receiver collude as one. Collusion of other sets of domains is less likely, but we will consider such cases. In the general case, we will assume none of the nine trust domains across the figure fully trusts any of the others.

As we only propose to change routers within the Diffserv region, we assume the operators of networks outside the region will be doing per-flow policing. That is, we assume the networks outside the Diffserv region and the gateways around its edges can protect themselves. So given we are proposing to remove flow policing from some networks, our primary concern must be to protect networks that don't do per-flow policing (the potential `victims') from those that do (the `enemy').
The ingress and egress gateways are the only way the outer enemy can get at the middle victim, so we can consider the gateways as the representatives of the enemy as far as domains A, B and C are concerned. We will call this trust scenario `edges against middles'.

Earlier in this memo, we outlined the classic border rate policing problem (Section 3). It will now be useful to reiterate the motivations that are the root cause of the problem. The more reservations a gateway can allow, the more revenue it receives. The middle networks want the edges to comply with the admission control protocol when they become so congested that their service to others might suffer. The middle networks also want to ensure the edges cannot steal more service from them than they are entitled to.

In the context of this `edges against middles' scenario, the re-ECN protocol has two main effects:

o  The more pre-congestion there is on a path across the Diffserv region, the higher the ingress gateway must declare downstream pre-congestion.

o  If the ingress gateway does not declare downstream pre-congestion high enough on average, it will `hit the ground before the runway', going negative and triggering sanctions, either directly against the traffic or against the ingress gateway at a management level.

An executive summary of our security analysis can be stated in three parts, distinguished by the type of collusion considered.

Neighbour-only Middle-Middle Collusion: Here there is no collusion or collusion is limited to neighbours in the feedback loop. In other words, two neighbouring networks can be assumed to act as one. Or the egress gateway might collude with domain C. Or the ingress gateway might collude with domain A. Or ingress and egress gateways might collude with each other.

In these cases where only neighbours in the feedback loop collude, we conclude that all parties have a positive incentive to declare downstream pre-congestion truthfully, and the ingress gateway has a positive incentive to invoke admission control when congestion rises above the admission threshold in any network in the region (including its own). No party has an incentive to send more traffic than declared in reservation signalling (even though only the gateways read this signalling). In short, no party can gain at the expense of another.

Non-neighbour Middle-Middle Collusion: In the case of other forms of collusion between middle networks (e.g. between domain A and C) it would be possible for, say, A & C to create a tunnel between themselves so that A would gain at the expense of B. But C would then lose the gain that A had made. Therefore the value to A & C of colluding to mount this attack seems questionable. It is made more questionable because the attack can be statistically detected by B using the second `defence in depth' mechanism mentioned already. Note that C can defend itself from being attacked through a tunnel by treating the tunnel end point as a direct link to a neighbouring network (e.g. as if A were a neighbour of C, via the tunnel), which falls back to the safety of the neighbour-only scenario.

Middle-Edge Collusion: Collusion between networks or gateways within the Diffserv region and networks or users outside the region has not yet been fully analysed. The presence of full per-flow policing at the ingress gateway seems to make this a less likely source of a successful attack.

   {ToDo: Due to lack of time, the full write up of the security
   analysis is deferred to the next version of this memo.}

   Finally, it is well known that the best person to analyse the
   security of a system is not the designer.  Therefore, our confident
   claims must be hedged with doubt until others with perhaps a greater
   incentive to break it have mounted a full analysis.

7.  Incremental Deployment

   We believe ECN has so far not been widely deployed because it
   requires widespread end system and network deployment just to achieve
   a marginal improvement in performance.  The ability to offer a new
   service (admission control) would be a much stronger driver for ECN
   deployment.

   As stated in the introduction, the aim of this memo is to "build in
   security from the start" when admission control is based on
   pre-congestion notification.  However, the proposal has been designed
   so that security can be added some time after first deployment.
   Given admission control based on pre-congestion notification requires
   few changes to standards, it should be deployable fairly soon.
   However, re-ECN requires a change to IP, which may take a little
   longer.

   We expect that initial deployments of PCN-based admission control
   will be confined to single networks, or to clubs of networks that
   trust each other.  The proposal in this memo will only become
   relevant once networks with conflicting interests wish to
   interconnect their admission controlled services, but without the
   scalability constraints of per-flow border policing.  It will not be
   possible to use re-ECN, even in a controlled environment between
   consenting operators, unless it is standardised into IP.
   Given the IPv4 header has limited space for further changes, current
   IESG policy [{ToDo: ref?}] is not to allow experimental use of
   codepoints in the IPv4 header, as whenever an experiment isn't taken
   up, the space it used tends to be impossible to reclaim.

   If PCN-based admission control is deployed before re-ECN is
   standardised into IP, wherever a network (or club of networks)
   connects to another network (or club of networks) with conflicting
   interests, they will place a gateway between the two regions that
   does per-flow rate policing and admission control.  If re-ECN is
   eventually standardised into IP, it will be possible for these
   separate regions to upgrade all their gateways to use re-ECN before
   removing the per-flow policing gateways between them.  Given the
   edge-to-edge deployment model of PCN-based admission control, it is
   reasonable to imagine this incremental deployment model without
   needing to cater for partial deployment of re-ECN in just some of the
   gateways around one Diffserv region.

   Only the edge gateways around a Diffserv region have to be upgraded
   to add re-ECN support, not interior routers.  It is also necessary to
   add the mechanisms that use re-ECN to secure a network against
   misbehaving gateways and networks.  Specifically, these are the
   border mechanisms (Section 5.6) and the mechanisms to sanction
   dishonest marking (Section 5.5).

   We also RECOMMEND adding improvements to forwarding on interior
   routers (Section 4.3.4).  But the system works whether all, some or
   none are upgraded, so interior routers may be upgraded in a piecemeal
   fashion at any time.

8.  Design Choices and Rationale

   The primary insight of this work is that downstream congestion is the
   metric that would be most useful to control an internetwork, and
   particularly to police how one network responds to the congestion it
   causes in a remote network.  This is the problem that has previously
   made it so hard to provide scalable admission control.

   The case for using re-feedback (a generalisation of re-ECN) to police
   congestion response and provide QoS is made in [Re-fb].  Essentially,
   the insight is that congestion is a factor that crosses layers from
   the physical upwards.  Therefore re-feedback polices congestion where
   it emerges from a physical interface between networks.  This is
   achieved by bringing the congestion information to the interface,
   rather than examining packet addressing where there is congestion.
   Then congestion crossing the physical interface at a border can be
   policed at the interface, rather than policing the congestion on
   packets that claim to come from an address (which may be spoofed).
   Also, re-feedback works in the network layer independently of other
   layers; despite its name, re-feedback does not actually require
   feedback.  It requires a source to act conservatively before it gets
   feedback.

   On the subject of lack of feedback, the feedback not established
   (FNE) codepoint is motivated by arguments for a state set-up bit in
   IP to prevent state exhaustion attacks.  This idea was first put
   forward informally by David Clark and documented by Handley and
   Greenhalgh in [Steps_DoS].  The idea is that network layer datagrams
   should signal explicitly when they require state to be created in the
   network layer or the layer above (e.g. at flow start).  Then a node
   can refuse to create any state unless a datagram declares this
   intent.
   We believe the proposed FNE codepoint serves the same
   purpose as the proposed state-set-up bit, but it has been overloaded
   with a more specific purpose, using it on more packets than just the
   first in a flow, but never fewer (i.e. it is idempotent).  In effect
   the FNE codepoint serves the purpose of a `soft-state set-up
   codepoint'.

   The re-feedback paper [Re-fb] also makes the case for converting the
   economic interpretation of congestion into hard engineering
   mechanism, which is the basis of the approach used in this memo.  The
   admission control gateways around the Diffserv region use hard
   engineering, not incentives, to prevent end users from sending more
   traffic than they have reserved.  Incentive-based mechanisms are only
   used between networks, because they are expected to respond to
   incentives more rationally than end-users can be expected to.
   However, even then, a network can use fail-safes to protect itself
   from excessively unusual behaviour by neighbouring networks, whether
   due to an accidental misconfiguration or malicious intent.

   The guiding principle behind the incentive-based approach used
   between networks is that any gain from subverting the protocol should
   be precisely neutralised, rather than punished.  If a gain is
   punished to a greater extent than is sufficient to neutralise it, it
   will most likely open up a new vulnerability, where the amplifying
   effect of the punishment mechanism can be turned on others.

   The re-feedback paper also makes the case against the use of
   congestion charging to police congestion if it is based on classic
   feedback (where only upstream congestion is visible to network
   elements).  It argues this would open up receiving networks to
   `denial of funds' attacks and would require end users to accept
   dynamic pricing (which few would).

   Re-ECN has been deliberately designed to simplify policing at the
   borders between networks.  These trust boundaries are the critical
   pinch-points that will limit the scalability of the whole
   internetwork unless the overall design minimises the complexity of
   security functions at these borders.  The border mechanisms described
   in this memo run passively in parallel to data forwarding and they do
   not require per-flow processing.

9.  Security Considerations

   This whole memo, and in particular the analysis section, concerns the
   security of a scalable admission control system.  Below some specific
   security issues are mentioned that did not belong elsewhere or which
   comment on the overall robustness of the security provided by the
   design.

   Firstly, we must repeat the statement of applicability in the
   analysis: that we only consider new opportunities for /gainful/
   attack that our proposal introduces, particularly if the attacker can
   avoid being identified.  Despite only involving a few bits, there is
   sufficient complexity in the whole system that there are probably
   numerous possibilities for other attacks.  However, as far as we are
   aware, none yields any benefit to the attacker.  For instance, it
   would be possible for a downstream network to remove the congestion
   markings introduced by an upstream network, but it would only lose
   out on the penalties it could apply to a downstream network.

   When one network forwards a neighbouring network's traffic it will
   always be possible to cause damage by dropping or corrupting it.
   Therefore we do not believe networks would set their routing policies
   to interconnect in the first place if they didn't trust the other
   networks not to arbitrarily damage their traffic.

   Having said this, we do want to highlight some of the weaker parts of
   our argument.
   We have argued that networks will be dissuaded from
   faking congestion marking by the possibility that upstream networks
   will route round them.  As we have said, these arguments are based on
   fairly delicate assumptions and will remain fairly tenuous until
   proved in practice, particularly close to the egress where less
   competitive routing is likely.

   We should also point out that the approach in this memo was only
   designed to be robust for admission control.  We do not claim the
   incentives will always be strong enough to force correct flow
   pre-emption behaviour.  This is because a user will tend to perceive
   much greater loss in value if a flow is pre-empted than if admission
   is denied at the start.  However, in general the incentives for
   correct flow pre-emption are similar to those for admission control.

   Finally, it may seem that the 8 codepoints that have been made
   available by extending the ECN field with the RE flag have been used
   rather wastefully.  In effect the RE flag has been used as an
   orthogonal single bit in nearly all cases, the only exception being
   when the ECN field is cleared to "00".  The mapping of the codepoints
   in an earlier version of this proposal used the codepoint space more
   efficiently, but the scheme became vulnerable to a network operator
   focusing its congestion marking to mark more positive than neutral
   packets in order to reduce its penalties.

   With the scheme as now proposed, once the RE flag is set or cleared
   by the sender or its proxy, it should not be written by the network,
   only read.  So the gateways can detect if any network maliciously
   alters the RE flag.  IPSec AH integrity checking does not cover the
   IPv4 option flags (they were considered mutable, even the one we
   propose using for the RE flag that was `currently unused' when IPSec
   was defined).
   But it would be sufficient for a pair of gateways to
   make random checks on whether the RE flag was the same when it
   reached the egress gateway as when it left the ingress.  Indeed, if
   IPSec AH had covered the RE flag, any network intending to alter
   sufficient RE flags to make a gain would have focused its alterations
   on packets without authenticating headers (AHs).

   No cryptographic algorithms have been harmed in the making of this
   proposal.

10.  IANA Considerations

   This memo includes no request to IANA.

11.  Conclusions

   This memo builds on a promising technique to solve the classic
   problem of making flow admission control scale to any size network.
   It involves the use of Diffserv in a deployment model that uses
   pre-congestion notification feedback to control admission into a
   network path [CL-deploy].  However as it stands, that deployment
   model depends on all network domains trusting each other to comply
   with the protocols, invoking admission control and flow pre-emption
   when requested.

   We propose that the congestion feedback used in that deployment model
   should be re-echoed into the forward data path, by making a trivial
   modification to the ingress gateway.  We then explain how the
   resulting downstream pre-congestion metric in packets can be
   monitored in bulk at borders to sufficiently emulate flow rate
   policing.

   We claim the result of combining these two approaches is an admission
   control system that scales to any size network /and/ any number of
   interconnected networks, even if they all act in their own interests.

   This proposal aims to convince its readers to "Design in Security
   from the start," by building modified ingress gateways from day one,
   even if border policing is not needed at first.  This way, we will
   not build ourselves tomorrow's legacy problem.

   Re-echoing congestion feedback is based on a principled technique
   called Re-ECN [Re-TCP], designed to add accountability for causing
   congestion to the general-purpose IP datagram service.  Re-ECN
   proposes to consume the last completely unused bit in the basic IPv4
   header.

12.  Acknowledgements

   All the following have given helpful comments and some may become
   co-authors of later drafts: Arnaud Jacquet, Alessandro Salvatori,
   Steve Rudkin, David Songhurst, John Davey, Ian Self, Anthony
   Sheppard, Carla Di Cairano-Gilfedder (BT), Mark Handley (who
   identified the excess canceled packets attack), Stephen Hailes, Adam
   Greenhalgh (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef
   Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill
   Lehr, Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy
   traffic attacks), Sally Floyd (ICIR) and comments from participants
   in the CFP/CRN inter-provider QoS and broadband working groups.

13.  Comments Solicited

   Comments and questions are encouraged and very welcome.  They can be
   addressed to the IETF Transport Area working group's mailing list,
   and/or to the authors.

14.  References

14.1.  Normative References

   [PCN]      Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Liatsos, V., Babiarz, J., Chan, K., Dudley,
              S., Westberg, L., Bader, A., and G. Karagiannis,
              "Pre-Congestion Notification Marking",
              draft-briscoe-tsvwg-cl-phb-02 (work in progress),
              June 2006.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
              Network Element Service", RFC 2211, September 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
              J., Courtney, W., Davari, S., Firoiu, V., and D.
              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
              Behavior)", RFC 3246, March 2002.

   [RSVP-ECN]
              Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
              Babiarz, J., and K. Chan, "RSVP Extensions for Admission
              Control over Diffserv using Pre-congestion Notification",
              draft-lefaucheur-rsvp-ecn-01 (work in progress),
              June 2006.

   [Re-TCP]   Briscoe, B., Jacquet, A., and A. Salvatori, "Re-ECN:
              Adding Accountability for Causing Congestion to TCP/IP",
              draft-briscoe-tsvwg-re-ecn-tcp-02 (work in progress),
              June 2006.

14.2.  Informative References

   [CL-deploy]
              Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Babiarz, J., Chan, K., Westberg, L., Bader,
              A., and G. Karagiannis, "A Deployment Model for Admission
              Control over DiffServ using Pre-Congestion Notification",
              draft-briscoe-tsvwg-cl-architecture-03 (work in progress),
              June 2006.

   [CLoop_pol]
              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
              Torino and Institut Eurecom Masters Thesis,
              September 2005.

   [ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
              Routeing", Proc Internet Charging and QoS Technology
              Workshop (ICQT'03) pp308--317, September 2003.

   [ECN-MPLS]
              Bruce, B., Briscoe, B., and J. Tay, "Explicit Congestion
              Marking in MPLS", draft-davie-ecn-mpls-00 (work in
              progress), June 2006.

   [IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
              Quality of Service Interconnect", BT Technology Journal
              (BTTJ) 23(2)171--195, April 2005.

   [NSIS-RMD]
              Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and
              T. Phelan, "RMD-QOSM - The Resource Management in Diffserv
              QOS Model", draft-ietf-nsis-rmd-06 (work in progress),
              February 2006.

   [RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
              Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
              Functional Specification", RFC 2205, September 1997.

   [RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
              Data Flows", RFC 2207, September 1997.

   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
              ReSerVation Protocol (RSVP) Version 1 Applicability
              Statement Some Guidelines on Deployment", RFC 2208,
              September 1997.

   [RFC2747]  Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic
              Authentication", RFC 2747, January 2000.

   [RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
              Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
              Felstaine, "A Framework for Integrated Services Operation
              over Diffserv Networks", RFC 2998, November 2000.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, June 2003.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
              Congestion Response in an Internetwork Using Re-Feedback",
              ACM SIGCOMM CCR 35(4)277--288, August 2005.

   [Smart_rtg]
              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
              "Optimizing Cost and Performance for Multihoming", ACM
              SIGCOMM CCR 34(4)79--92, October 2004.

   [Steps_DoS]
              Handley, M. and A. Greenhalgh, "Steps towards a
              DoS-resistant Internet Architecture", Proc. ACM SIGCOMM
              workshop on Future directions in network architecture
              (FDNA'04) pp 49--56, August 2004.

Appendix A.  Implementation

A.1.  Ingress Gateway Algorithm for Blanking the RE flag

   The ingress gateway receives regular feedback reporting the fraction
   of congestion marked octets for each aggregate arriving at the
   egress.  So for each aggregate it should blank the RE flag on the
   same fraction of octets.  It is more efficient to calculate the
   reciprocal of this fraction when the signalling arrives, Z_0 = (1 /
   Congestion-Level-Estimate).  Z_0 will be the number of octets of
   packets the ingress should send with the RE flag set between those it
   sends with the RE flag blanked.  Z_0 will also take account of the
   sustainable rate reported during the flow pre-emption process, if
   necessary.

   A suitable pseudo-code algorithm for the ingress gateway is as
   follows:

   ====================================================================
   B_i = 0                      /* interblank volume                  */
   for each PCN-capable packet {
       b = readLength()         /* set b to packet size               */
       B_i += b                 /* accumulate interblank volume       */
       if B_i < b * Z_0 {       /* test whether interblank volume...  */
           writeRE(1)
       } else {                 /* ...exceeds blank RE spacing * pkt size */
           writeRE(0)           /* ...and if so, clear RE             */
           B_i = 0              /* ...and re-set interblank volume    */
       }
   }
   ====================================================================

A.2.  Downstream Congestion Metering Algorithms

A.2.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream pre-congestion in traffic
   crossing an inter-domain border, an algorithm is needed that
   accumulates the size of positive packets and subtracts the size of
   negative packets.
   We maintain two counters:

   V_b:  accumulated pre-congestion volume

   B:  total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each PCN-capable packet {
       b = readLength(packet)      /* set b to packet size         */
       B += b                      /* accumulate total volume      */
       if readEECN(packet) == (Re-Echo || FNE) {
           V_b += b                /* increment...                 */
       } elseif readEECN(packet) == ( AM(-1) || PM(-1) ) {
           V_b -= b                /* ...or decrement V_b...       */
       }                           /* ...depending on EECN field   */
   }
   ====================================================================

   At the end of an accounting period this counter V_b represents the
   pre-congestion volume that penalties could be applied to, as
   described in Section 5.3.

   For instance, accumulated volume of pre-congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream pre-congestion
   level of 1% on an accumulated total data volume of B = 500PB.

A.2.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple algorithm
   above in order to protect against the various attacks from
   persistently negative flows described in Section 5.6.1.  As explained
   in that section, the most important and first step is to estimate the
   contribution of persistently negative flows to the bulk volume of
   downstream pre-congestion and to inflate this bulk volume as if these
   flows weren't there.  The process below has been designed to give an
   unbiased estimate, but it may be possible to define other processes
   that achieve similar ends.
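
   As an illustration only, the pseudo-code of Appendices A.1 and A.2.1
   can be sketched in runnable form.  This is a minimal Python sketch,
   not a reference implementation.  The names blank_re_flags and
   BulkMeter, and the use of text labels for the extended ECN states,
   are assumptions made for the example; the wire encoding of the
   codepoints is not modelled.

```python
from dataclasses import dataclass

# Hypothetical text labels for the extended ECN (EECN) states named in
# the pseudo-code; the actual bit encoding is deliberately not modelled.
POSITIVE = {"Re-Echo", "FNE"}      # packet size is added to V_b
NEGATIVE = {"AM(-1)", "PM(-1)"}    # packet size is subtracted from V_b


def blank_re_flags(sizes, z0):
    """Appendix A.1 sketch: choose the RE flag for each packet.

    sizes -- packet sizes in octets, in sending order
    z0    -- Z_0 = 1 / Congestion-Level-Estimate
    Returns a list with 1 where RE stays set and 0 where it is blanked.
    """
    flags = []
    interblank = 0                 # B_i: octets since the last blanked packet
    for b in sizes:
        interblank += b
        if interblank < b * z0:
            flags.append(1)        # keep RE set
        else:
            flags.append(0)        # blank RE and restart the count
            interblank = 0
    return flags


@dataclass
class BulkMeter:
    """Appendix A.2.1 sketch: the two counters kept at a border."""
    v_b: int = 0                   # V_b: accumulated pre-congestion volume
    total: int = 0                 # B: total data volume

    def observe(self, size, eecn):
        """Account for one PCN-capable packet of `size` octets."""
        self.total += size
        if eecn in POSITIVE:
            self.v_b += size
        elif eecn in NEGATIVE:
            self.v_b -= size
        # any other state is treated as neutral and leaves V_b unchanged
```

   With ten 100-octet packets and Z_0 = 5, this sketch blanks the fifth
   and tenth packets, i.e. one fifth of the octets.  Keeping the meter
   as two plain counters reflects the point made in Section 8 that the
   border mechanism needs no per-flow state.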

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be able to
   realistically measure but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should be
   taken at different times during the accounting period, preferably
   covering the whole ID space.  During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative, maintaining a separate account for each flow in the sample.
   It should run a lot longer than the large majority of flows, to avoid
   a bias from missing the starts and ends of flows, which tend to be
   positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the sample,
   and the total of the accounts V_{fI} excluding flows with a negative
   account from the subset I.  Then the weighted mean of all these
   samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}.

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix A.2.1) it can be inflated by this factor
   a_S to get a good unbiased estimate of the volume of downstream
   congestion over the accounting period, a_S.V_b, without being
   polluted by the effect of persistently negative flows.

A.3.  Algorithm for Sanctioning Negative Traffic

   {ToDo: Write up algorithms similar to Appendix D of [Re-TCP] for the
   negative flow monitor with flow management algorithm and the variant
   with bounded flow state.}

Author's Address

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.