PCN Working Group                                             B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Informational                             June 30, 2007
Expires: January 1, 2008

        Emulating Border Flow Policing using Re-ECN on Bulk Data
                  draft-briscoe-re-pcn-border-cheat-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 1, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   Scaling per-flow admission control to the Internet is a hard problem.
   A recently proposed approach combines Diffserv and pre-congestion
   notification (PCN) to provide a service slightly better than Intserv
   controlled load. It scales to networks of any size, but only if
   domains trust each other to comply with admission control and rate
   policing. This memo claims to solve this trust problem without
   losing scalability. It describes bulk border policing that provides
   a sufficient emulation of per-flow policing with the help of another
   recently proposed extension to ECN, involving re-echoing ECN feedback
   (re-ECN).
   With only passive bulk measurements at borders, sanctions
   can be applied against cheating networks.

Status (to be removed by the RFC Editor)

   This memo is posted as an Internet-Draft with the intent that it
   will eventually be split into two documents: one for the standards
   track and one for informational status. But until it becomes an item
   of IETF working group business the whole proposal has been kept
   together to aid understanding. Only the text of Section 4 of this
   document requires standardisation. The rest of the sections describe
   how a system might be built from these protocols by the operators of
   an internetwork. Note in particular that the policing and monitoring
   functions proposed for the trust boundaries between operators would
   not need standardisation by the IETF. They simply represent one way
   that the proposed protocols could be used to extend the PCN
   architecture [PCN-arch] to span multiple domains without mutual trust
   between the operators.

   To realise the system described, this document also depends on
   standardisation of three other documents currently being discussed
   (but not on the standards track) in the IETF Transport Area:
   pre-congestion notification (PCN) marking on interior nodes [PCN];
   feedback of aggregate PCN measurements by suitably extending the
   admission control signalling protocol (e.g. RSVP) [RSVP-ECN]; and
   re-insertion of the feedback into the forward stream of IP packets by
   the PCN ingress gateway in a similar way to that proposed for a TCP
   source [Re-TCP].

   The authors seek comments from the Internet community on whether
   combining PCN and re-ECN in this way is a sufficient solution to the
   problem of scaling microflow admission control to the Internet as a
   whole, even though such scaling must take account of the increasing
   numbers of networks and users who may all have conflicting interests.

Changes from previous drafts (to be removed by the RFC Editor)

   Changes in this version
   relative to the last :

      Changed filename to associate it with the new IETF PCN working
      group, rather than the TSVWG working group.

      Introduction: Clarified that bulk policing only replaces per-flow
      policing at interior inter-domain borders, while per-flow policing
      is still needed at the access interface to the internetwork. Also
      clarified that the aim is to neutralise any gains from cheating
      using local bilateral contracts between neighbouring networks,
      rather than merely identifying remote cheaters.

      Section 3.1: Described the traditional per-flow policing problem
      with inter-domain reservations more precisely, particularly with
      respect to direction of reservations and of traffic flows.

      Clarified status of Section 5 onwards, in particular that policers
      and monitors would not need standardisation, but that the protocol
      in Section 4 would require standardisation.

      Section 5.6.2 on competitive routing: Added discussion of direct
      incentives for a receiver to switch to a different provider even
      if the provider has a termination monopoly.

      Clarified that "Designing in security from the start" merely means
      allowing codepoint space in the PCN protocol encoding. There is
      no need to actually implement inter-domain security mechanisms for
      solutions confined to a single domain.

      Updated some references and added a ref to the Security
      Considerations, as well as other minor corrections and
      improvements.

   Changes from :

      Added subsection on Border Accounting Mechanisms (Section 5.6.1)

      Section 4.2 on the re-ECN wire protocol clarified and re-organised
      to separately discuss re-ECN for default ECN marking and for
      pre-congestion marking (PCN).

      Router Forwarding Behaviour subsection added to re-organised
      section on Protocol Operation (Section 4.3).
      Extensions section moved within Protocol Operations.

      Emulating Border Policing (Section 5) reorganised, starting with a
      new Terminology subsection heading, and a simplified overview
      section. Added a large new subsection on Border Accounting
      Mechanisms within a new section bringing together other
      subsections on Border Mechanisms generally (Section 5.6). Some
      text moved from old subsections into these new ones.

      Added section on Incremental Deployment (Section 7), drawing
      together relevant points about deployment made throughout.

      Sections on Design Rationale (Section 8) and Security
      Considerations (Section 9) expanded with some new material,
      including new attacks and their defences.

      Suggested Border Metering Algorithms improved (Appendix A.2) for
      resilience to newly identified attacks.

Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol for an RSVP (or similar) Transport
     4.1.  Protocol Overview
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
       4.2.1.  Re-ECN Recap
       4.2.2.  Re-ECN Combined with Pre-Congestion Notification
               (re-PCN)
     4.3.  Protocol Operation
       4.3.1.  Protocol Operation for an Established Flow
       4.3.2.  Aggregate Bootstrap
       4.3.3.  Flow Bootstrap
       4.3.4.  Router Forwarding Behaviour
       4.3.5.  Extensions
   5.  Emulating Border Policing with Re-ECN
     5.1.  Informal Terminology
     5.2.  Policing Overview
     5.3.  Pre-requisite Contractual Arrangements
     5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits
     5.5.  Sanctioning Dishonest Marking
     5.6.  Border Mechanisms
       5.6.1.  Border Accounting Mechanisms
       5.6.2.  Competitive Routing
       5.6.3.  Fail-safes
   6.  Analysis
   7.  Incremental Deployment
   8.  Design Choices and Rationale
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
     A.2.  Downstream Congestion Metering Algorithms
       A.2.1.  Bulk Downstream Congestion Metering Algorithm
       A.2.2.  Inflation Factor for Persistently Negative Flows
     A.3.  Algorithm for Sanctioning Negative Traffic
   Author's Address
   Intellectual Property and Copyright Statements

1. Introduction

   The Internet community largely lost interest in the Intserv
   architecture after it was clarified that it would be unlikely to
   scale to the whole Internet [RFC2208]. Although Intserv mechanisms
   proved impractical, the bandwidth reservation service it aimed to
   offer is still very much required.

   A recently proposed approach [PCN-arch] combines Diffserv and
   pre-congestion notification (PCN) to provide a service slightly
   better than Intserv controlled load [RFC2211]. It scales to any size
   network, but only if domains trust their neighbours to have checked
   that upstream customers aren't taking more bandwidth than they
   reserved, either accidentally or deliberately. This memo describes
   border policing measures so that one network can protect its
   interests, even if networks around it are deliberately trying to
   cheat. The approach provides a sufficient emulation of flow rate
   policing at trust boundaries but without per-flow processing. The
   emulation is not perfect, but it is sufficient to ensure that the
   punishment is at least proportionate to the severity of the cheat.
   Per-flow rate policing for each reservation is still expected to be
   used at the access edge of the internetwork, but at the borders
   between networks bulk policing can be used to emulate per-flow
   policing.

   The aim is to be able to scale controlled load service to any number
   of endpoints, even though such scaling must take account of the
   increasing numbers of networks and users who may all have conflicting
   interests.
   To achieve such scaling, this memo combines two recent
   proposals, both of which it briefly recaps:

   o  A deployment model for admission control over Diffserv using
      pre-congestion notification [PCN-arch] describes how bulk
      pre-congestion notification on routers within an edge-to-edge
      Diffserv region can emulate the precision of per-flow admission
      control to provide controlled load service without unscalable
      per-flow processing;

   o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP]. The trick that
      addresses cheating at borders is to recognise that border policing
      is mainly necessary because cheating upstream networks will admit
      traffic when they shouldn't only as long as they don't directly
      experience the downstream congestion their misbehaviour can cause.
      The re-ECN protocol requires upstream nodes to declare expected
      downstream congestion in all forwarded packets and it makes it in
      their interests to declare it honestly. Operators can then
      monitor downstream congestion in bulk at borders to emulate
      policing.

   The aim is not to enable a network to _identify_ some remote cheating
   party, which would rarely be useful given the victim network would be
   unlikely to be able to seek redress from a cheater in some remote
   part of the world with whom no direct contractual relationship
   exists. Rather the aim is to ensure that any gain from cheating will
   be cancelled out by penalties applied to the cheating party by its
   local network. Further, the solution ensures each of the chain of
   networks between the cheater and the victim will lose out if it
   doesn't apply penalties to its neighbour. Thus the solution builds
   on the local bilateral contractual relationships that already exist
   between neighbouring networks.

   Rather than the end-to-end arrangement used when re-ECN was specified
   for the TCP transport [Re-TCP], this memo specifies re-ECN in an
   edge-to-edge arrangement, making it applicable to the above
   deployment model for admission control over Diffserv. Also, rather
   than using a TCP transport for regular congestion feedback, this memo
   specifies re-ECN using RSVP as the transport for feedback [RSVP-ECN].
   A similar deployment model, but with a different transport for
   signalling congestion feedback, could be used (e.g. RMD [NSIS-RMD]
   uses NSIS).

   This memo aims to do two things: i) define how to apply the re-ECN
   protocol to the admission control over Diffserv scenario; and ii)
   explain why re-ECN sufficiently emulates border policing in that
   scenario. Most of the memo is taken up with the second aim:
   explaining why it works. Applying re-ECN to the scenario actually
   involves quite a trivial modification to the ingress gateway. That
   modification can be added to gateways later, so our immediate goal is
   to convince everyone to have the foresight to define the PCN wire
   protocol encoding to accommodate the extended codepoints defined in
   this document, whether first deployments require border policing or
   not. Otherwise, when we want to add policing, we will have built
   ourselves a legacy problem. In other words, we aim to convince
   people to "Design in security from the start."

   The body of this memo is structured as follows:

      Section 3 describes the border policing problem. We recap the
      traditional, unscalable view of how to solve the problem, and we
      recap the admission control solution which has the scalability we
      do not want to lose when we add border policing;

      Section 4 specifies the re-ECN protocol solution in detail;

      Section 5 explains how to use the protocol to emulate border
      policing, and why it works;

      Section 6 analyses the security of the proposed solution;

      Section 8 explains the sometimes subtle rationale behind our
      design decisions;

      Section 9 comments on the overall robustness of the security
      assumptions and lists specific security issues.

   It must be emphasised that we are not evangelical about removing
   per-flow processing from borders. Network operators may choose to do
   per-flow processing at their borders for their own reasons, such as
   to support business models that require per-flow accounting. Our aim
   is to show that per-flow processing at borders is no longer
   _necessary_ in order to provide end-to-end QoS using flow admission
   control. Indeed, we are absolutely opposed to standardisation of
   technology that embeds particular business models into the Internet.
   Our aim is merely to provide a new useful metric (downstream
   congestion) at trust boundaries. Given the well-known significance
   of congestion in economics, operators can then use this new metric in
   their interconnection contracts if they choose. This will enable
   competitive evolution of new business models (for examples
   see [IXQoS]), even for sets of flows running alongside another set
   across the same border but using the more traditional model that
   depends on more costly per-flow processing at each border.

2. Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. The Problem

3.1. The Traditional Per-flow Policing Problem

   If we claim to be able to emulate per-flow policing with bulk
   policing at trust boundaries, we need to know exactly what we are
   emulating. So, we will start from the traditional scenario with
   per-flow policing at trust boundaries to explain why it has always
   been considered necessary.

   To be able to take advantage of a reservation-based service such as
   controlled load, a source-destination pair must reserve resources
   using a signalling protocol such as RSVP [RFC2205]. An RSVP
   signalling request refers to a flow of packets by its flow ID tuple
   (filter spec [RFC2205]) (or its security parameter index
   (SPI) [RFC2207] if port numbers are hidden by IPsec encryption).
   Other signalling protocols use similar flow identifiers. However, it
   is insufficient to merely authorise and admit a flow based on its
   identifiers, for instance merely opening a pin-hole for packets with
   identifiers that match an admitted flow ID, because an admitted flow
   cannot necessarily be trusted to send packets within the rate
   profile it requested.

   The packet rate must also be policed to keep the flow within the
   requested flow spec [RFC2205]. For instance, without data rate
   policing, a source-destination pair could reserve resources for an
   8 kbps audio flow but the source could transmit a 6 Mbps video (theft
   of service). More subtly, the sender could generate bursts that were
   outside the profile requested.

   In traditional architectures, per-flow packet rate-policing is
   expensive and unscalable but, without it, a network is vulnerable to
   such theft of service (whether malicious or accidental). Perhaps
   more importantly, if flows are allowed to send more data than they
   were permitted, the ability of admission control to give assurances
   to other flows will break.
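
   The per-flow rate policing described above can be illustrated with a
   minimal token-bucket sketch. It is not taken from this memo or from
   [RFC2205]; the class name, the burst size and the constant-rate
   traffic model are all assumptions chosen only to show why a flow
   reserved at 8 kbps cannot usefully carry 6 Mbps once it is policed.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucketPolicer:
    """Hypothetical per-flow policer; one instance is needed per reservation."""
    rate_bps: float                      # reserved rate, bits per second
    burst_bits: float                    # permitted burst, bits (assumed value)
    tokens: float = field(init=False)    # current bucket fill, bits
    last_time: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst_bits    # start with a full bucket

    def conforms(self, now: float, packet_bits: int) -> bool:
        """Return True if a packet arriving at time `now` fits the profile."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill at the reserved rate, never beyond the burst allowance.
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False


def fraction_conforming(policer, send_rate_bps, packet_bits, n_packets):
    """Offer a constant-rate packet stream and return the fraction accepted."""
    interval = packet_bits / send_rate_bps
    accepted = sum(
        policer.conforms((i + 1) * interval, packet_bits) for i in range(n_packets)
    )
    return accepted / n_packets


# An honest 8 kbps audio flow versus a 6 Mbps video sent under the same
# 8 kbps reservation.
audio = fraction_conforming(TokenBucketPolicer(8_000, 16_000), 8_000, 1_600, 50)
video = fraction_conforming(TokenBucketPolicer(8_000, 16_000), 6_000_000, 12_000, 5_000)
```

   The scaling problem is visible in the sketch itself: the policer
   holds state per reservation and touches it on every packet, which is
   exactly the cost this memo aims to avoid at borders.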

   Just as sources need not be trusted to keep within the requested flow
   spec, whole networks might also try to cheat. We will now set up a
   concrete scenario to illustrate such cheats. Imagine reservations
   for unidirectional flows, through at least two networks, an edge
   network and its downstream transit provider. Imagine the edge
   network charges its retail customers per reservation but also has to
   pay its transit provider a charge per reservation. Typically, both
   its selling and buying charges might depend on the duration and rate
   of each reservation. The actual levels of the selling and buying
   prices are irrelevant to our discussion (most likely the network will
   sell at a higher price than it buys, of course).

   A cheating ingress network could systematically reduce the size of
   its retail customers' reservation signalling requests (e.g. the
   SENDER_TSPEC object in RSVP's PATH message) before forwarding them to
   its transit provider and systematically reinstate the responses on
   the way back (e.g. the FLOWSPEC object in RSVP's RESV message). It
   would then receive an honest income from its upstream retail customer
   but only pay for fraudulently smaller reservations downstream. A
   similar but opposite trick (increasing the TSPEC and decreasing the
   FLOWSPEC) could be perpetrated by the receiver's access network if
   the reservation was paid for by the receiver.

   Equivalently, a cheating ingress network may feed the traffic from a
   number of flows into an aggregate reservation over the transit that
   is smaller than the total of all the flows. Because of these fraud
   possibilities, in traditional QoS reservation architectures the
   downstream network polices at each border. The policer checks that
   the actual sent data rate of each flow is within the signalled
   reservation.

   Reservation signalling could be authenticated end to end, but this
   wouldn't prevent the aggregation cheat just described. For this
   reason, and to avoid the need for a global PKI, signalling integrity
   is typically only protected on a hop-by-hop basis [RFC2747].

   A variant of the above cheat is where a router in an honest
   downstream network denies admission to a new reservation, but a
   cheating upstream network still admits the flow. For instance, the
   networks may be using Diffserv internally, but Intserv admission
   control at their borders [RFC2998]. The cheat would only work if
   they were using bulk Diffserv traffic policing at their borders,
   perhaps to avoid the cost/complexity of Intserv border policing. As
   far as the cheating upstream network is concerned, it gets the
   revenue from the reservation, but it doesn't have to pay any
   downstream wholesale charges and the congestion is in someone else's
   network. The cheating network may calculate that most of the flows
   affected by congestion in the downstream network aren't likely to be
   its own. It may also calculate that the downstream router has been
   configured to deny admission to new flows in order to protect
   bandwidth assigned to other network services (e.g. enterprise VPNs).
   So the cheating network can steal capacity from the downstream
   operator's VPNs that are probably not actually congested.

   All the above cheats are framed in the context of RSVP's receiver
   confirmed reservation model, but similar cheats are possible with
   sender-initiated and other models.

   To summarise, in traditional reservation signalling architectures, if
   a network cannot trust a neighbouring upstream network to rate-police
   each reservation, it has to check for itself that the data rate fits
   within each of the reservations it has admitted.

3.2. Generic Scenario

   We will now describe a generic internetworking scenario that we will
   use to describe and to test our bulk policing proposal. It consists
   of a number of networks and endpoints that do not fully trust each
   other to behave. In Section 6 we will tie down exactly what we mean
   by partial trust, and we will consider the various combinations where
   some networks do not trust each other and others are colluding
   together.

    _    ___      _____________________________________       ___    _
   | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
   | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
   | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
   | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |\ | |
   | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |=>| |
   | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |/ | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
   |_|  |___|    |_____________________________________|     |___|  |_|

   Sx  Ingress               Diffserv region                Egress  Rx
   End Access                                               Access  End
   Host Network                                            Network Host
                <-------- edge-to-edge signalling ------->
                         (for admission control)

   <-------------------end-to-end QoS signalling protocol------------->

      Figure 1: Generic Scenario (see text for explanation of terms)

   An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1)
   connect the interior Diffserv region to the edge access networks
   where routers (not shown) use per-flow reservation processing.
   Within the Diffserv region are three interior domains, A, B and C, as
   well as the inward facing interfaces of the ingress and egress
   gateways.
   An ingress and egress border router (BR) is shown
   interconnecting each interior domain with the next. There may be
   other interior routers (not shown) within each interior domain.

   In two paragraphs we now briefly recap how pre-congestion
   notification is intended to be used to control flow admission to a
   large Diffserv region. The first paragraph describes data plane
   functions and the second describes signalling in the control plane.
   We omit many details from [PCN-arch] including behaviour during
   routing changes. For brevity here we assume other flows are already
   in progress across a path through the Diffserv region before a new
   one arrives, but how bootstrap works is described in Section 4.3.2.

   Figure 1 shows a single simplex reserved flow from the sending (Sx)
   end host to the receiving (Rx) end host. The ingress gateway polices
   incoming traffic within its admitted reservation and remarks it to
   turn on an ECN-capable codepoint [RFC3168] and the controlled load
   (CL) Diffserv codepoint. Together, these codepoints define which
   traffic is entitled to the enhanced scheduling of the CL behaviour
   aggregate on routers within the Diffserv region. The CL PHB of
   interior routers consists of a scheduling behaviour and a new ECN
   marking behaviour that we call `pre-congestion notification' [PCN].
   The CL PHB simply re-uses the definition of expedited forwarding
   (EF) [RFC3246] for its scheduling behaviour. But it incorporates a
   new ECN marking behaviour, which sets the ECN field of an increasing
   number of CL packets to the admission marked (AM) codepoint as they
   approach a threshold rate that is lower than the line rate. The use
   of virtual queues ensures real queues have hardly built up any
   congestion delay. The level of marking detected at the egress of the
   Diffserv region is then used by the signalling system in order to
   determine admission control as follows.

   The end-to-end QoS signalling (e.g. RSVP) for a new reservation
   takes one giant hop from ingress to egress gateway, because interior
   routers within the Diffserv region are configured to ignore RSVP.
   The egress gateway holds flow state because it takes part in the
   end-to-end reservation. So it can classify all packets by flow and
   it can identify all flows that have the same previous RSVP hop (a
   CL-region-aggregate). For each CL-region-aggregate of flows in
   progress, the egress gateway maintains a per-packet moving average of
   the fraction of pre-congestion-marked traffic. Once an RSVP PATH
   message for a new reservation has hopped across the Diffserv region
   and reached the destination, an RSVP RESV message is returned. As
   the RESV message passes, the egress gateway piggy-backs the relevant
   pre-congestion level onto it [RSVP-ECN]. Again, interior routers
   ignore the RSVP message, but the ingress gateway strips off the
   pre-congestion level. If the pre-congestion level is above a
   threshold, the ingress gateway denies admission to the new
   reservation, otherwise it returns the original RESV signal back
   towards the data sender.

   Once a reservation is admitted, its traffic will always receive low
   delay service for the duration of the reservation. This is because
   ingress gateways ensure that traffic not under a reservation cannot
   pass into the Diffserv region with the CL DSCP set. So non-reserved
   traffic will always be treated with a lower priority PHB at each
   interior router. And even if some disaster re-routes traffic after
   it has been admitted, if the traffic through any resource tips over a
   fail-safe threshold, pre-congestion notification will trigger flow
   pre-emption to very quickly bring every router within the whole
   Diffserv region back below its operating point.
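
   The egress measurement and the ingress admission decision recapped
   above can be sketched as follows. This is only an illustration: the
   memo says the egress gateway keeps a per-packet moving average, and
   we assume here that it is an exponentially weighted one. The names
   EgressMeter and admit, the smoothing weight and the 5% threshold are
   hypothetical and are not defined by [PCN-arch] or [RSVP-ECN].

```python
class EgressMeter:
    """Per-aggregate meter at the egress gateway (assumed EWMA form)."""

    def __init__(self, weight: float = 0.01) -> None:
        self.weight = weight   # hypothetical smoothing weight per packet
        self.level = 0.0       # pre-congestion level, between 0.0 and 1.0

    def on_packet(self, marked: bool) -> None:
        """Update the moving average with one packet of the aggregate."""
        sample = 1.0 if marked else 0.0
        self.level += self.weight * (sample - self.level)


def admit(pre_congestion_level: float, threshold: float) -> bool:
    """Ingress decision: deny the new reservation if the level carried
    back on the RESV message is above the threshold."""
    return pre_congestion_level <= threshold


# An aggregate in which every tenth packet arrives admission marked.
meter = EgressMeter()
for i in range(5000):
    meter.on_packet(marked=(i % 10 == 0))

idle = EgressMeter()
for _ in range(5000):
    idle.on_packet(marked=False)

decision_busy = admit(meter.level, threshold=0.05)   # level is close to 0.1
decision_idle = admit(idle.level, threshold=0.05)    # level stays at 0.0
```

   Note that the meter is keyed by aggregate, not by flow, so its state
   grows with the number of ingress gateways rather than with the
   number of reservations.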

The whole admission control system just described deliberately
confines per-flow processing to the access edges of the network,
where it will not limit the system's scalability. But ideally we
want to extend this approach to multiple networks, to take even more
advantage of its scaling potential. We would still need per-flow
processing at the access edges of each network, but not at the high
speed interfaces where they interconnect. Even though such an
admission control system would work technically, it would gain us no
scaling advantage if each network also wanted to police the rate of
each admitted flow for itself--border routers would still have to do
complex packet operations per-flow anyway, given they don't trust
upstream networks to do their policing for them.

This memo describes how to emulate per-flow rate policing using bulk
mechanisms at border routers, so the full scalability potential of
pre-congestion notification is not limited by the need for per-flow
policing mechanisms at borders, which would make borders the most
cost-critical pinch-points. Then we can achieve the long sought-for
vision of secure Internet-wide bandwidth reservations without needing
per-flow processing at all in core and border routers--where
scalability is most critical.

4. Re-ECN Protocol for an RSVP (or similar) Transport

4.1. Protocol Overview

First we need to recap the way routers accumulate congestion marking
along a path. Each ECN-capable router marks some packets with CE,
the marking probability increasing with the length of the queue at
its egress link. The only difference with pre-congestion
marking [PCN] is that marking is based on the length of a virtual
queue, so that the real queue occupancy can remain very low. We will
use the terms congestion and pre-congestion interchangeably in the
following unless it is important to distinguish between them.

With multiple ECN-capable routers on a path, the ECN field
accumulates the fraction of CE marking that each router adds. The
combined effect of the packet marking of all the routers along the
path signals congestion of the whole path to the receiver. So, for
example, if one router early in a path is marking 1% of packets and
another later in the path is marking 2%, flows that pass through both
routers will experience approximately 3% marking.

The packets crossing an inter-domain trust boundary within the
Diffserv region will all have come from different ingress gateways
and will all be destined for different egress gateways. We will show
that the key to policing against theft of service is for a border
router to be able to directly measure the congestion that is about to
be caused by the traffic it forwards. That is, it can measure
locally the congestion on each of the downstream paths between itself
and the egress gateways that its traffic is destined for.

With the original ECN protocol, if CE markings crossing the border
had been counted over a period, they would have represented the
accumulated upstream congestion that had already been experienced by
those packets. The general idea of re-ECN is for the ingress gateway
to continuously encode path congestion into the IP header where, in
this case, `path' means from ingress to egress gateway. Then at any
point on that path (e.g. between domains A & B in Figure 2 below), IP
headers can be monitored to subtract upstream congestion from
expected path congestion in order to give the expected downstream
congestion still to be experienced until the egress gateway.

Importantly, it turns out that there is no need to monitor downstream
congestion on a per-flow basis. We will show that accounting for it
in bulk across all flows will be sufficient.
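
The arithmetic in the paragraphs above can be sketched as follows.
This is a minimal illustration, assuming that each router marks
independently and that a packet, once marked, stays marked; the
function names are hypothetical. It shows why 1% and 2% combine to
approximately, rather than exactly, 3%, and how a border subtracts
upstream congestion from expected path congestion.

```python
def path_marking(router_fractions):
    """Combined marking fraction seen after a sequence of routers.

    Assumes independent marking at each router: a packet arrives
    unmarked only if every router on the path left it unmarked.
    """
    unmarked = 1.0
    for fraction in router_fractions:
        unmarked *= 1.0 - fraction
    return 1.0 - unmarked


def expected_downstream(path_congestion, upstream_congestion):
    """Expected downstream congestion, v ~= p - u."""
    return path_congestion - upstream_congestion


if __name__ == "__main__":
    p = path_marking([0.01, 0.02])   # close to, but slightly under, 3%
    print(f"path marking: {p:.4%}")
    print(f"downstream after the first router: "
          f"{expected_downstream(p, 0.01):.4%}")
```

For the small fractions typical of pre-congestion, the product term
is negligible, which is why simple addition and subtraction of
fractions is a reasonable approximation.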

      _____________________________________
 _|__    ______    ______    ______    _|__
|    |  |  A   |  |  B   |  |  C   |  |    |
+----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+
|    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |
|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |
|G/W |  | |  | |: | |  | |  | |  | |  |G/W |
+----+  +-+  +-+: +-+  +-+  +-+  +-+  +----+
|    |  |      |: |      |  |      |  |    |
|____|  |______|: |______|  |______|  |____|
  |_____________:_______________________|
                :
  |             :                       |
  |<-upstream-->:<-expected downstream->|
  | congestion  :      congestion       |
  |      u             v ~= p - u       |
  |                                     |
  |<--- expected path congestion, p --->|

                 Figure 2: Re-ECN concept

4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

In this section we define the names of the various codepoints of the
re-ECN protocol when used with pre-congestion notification, deferring
description of their semantics to the following sections. But first
we recap the re-ECN wire protocol proposed in [Re-TCP].

4.2.1. Re-ECN Recap

Re-ECN uses the two bit ECN field broadly as in RFC3168 [RFC3168].
It also uses a new re-ECN extension (RE) flag. The actual position
of the RE flag is different between IPv4 & v6 headers so we will use
an abstraction of the IPv4 and v6 wire protocols by just calling it
the RE flag. [Re-TCP] proposes using bit 48 (currently unused) in
the IPv4 header for the RE flag, while for IPv6 it proposes an ECN
extension header.

Unlike the ECN field, the RE flag is intended to be set by the sender
and remain unchanged along the path, although it can be read by
network elements that understand the re-ECN protocol. In the
scenario used in this memo, the ingress gateway acts as a proxy for
the sender, setting the RE flag as permitted in the specification of
re-ECN.

Note that general-purpose routers do not have to read the RE flag,
only special policing elements at borders do. And no general-purpose
routers have to change the RE flag, although the ingress and egress
gateways do because in the edge-to-edge deployment model we are
using, they act as proxies for the endpoints. Therefore the RE flag
does not even have to be visible to interior routers. So the RE flag
has no implications on protocols like MPLS. Congested label
switching routers (LSRs) would have to be able to notify their
congestion with an ECN/PCN codepoint in the MPLS shim [ECN-MPLS], but
like any interior IP router, they can be oblivious to the RE flag,
which need only be read by border policing functions.

Although the RE flag is a separate, single bit field, it can be read
as an extension to the two-bit ECN field; the three concatenated bits
in what we will call the extended ECN field (EECN) make eight
codepoints available. When the RE flag setting is "don't care", we
use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes
the following six codepoint names for when there is a need to be more
specific.

+--------+-------------+-------+-------------+----------------------+
| ECN    | RFC3168     | RE    | Extended    | Re-ECN meaning       |
| field  | codepoint   | flag  | ECN         |                      |
|        |             |       | codepoint   |                      |
+--------+-------------+-------+-------------+----------------------+
| 00     | Not-ECT     | 0     | Not-RECT    | Not re-ECN-capable   |
|        |             |       |             | transport            |
| 00     | Not-ECT     | 1     | FNE         | Feedback not         |
|        |             |       |             | established          |
| 01     | ECT(1)      | 0     | Re-Echo     | Re-echoed congestion |
|        |             |       |             | and RECT             |
| 01     | ECT(1)      | 1     | RECT        | Re-ECN capable       |
|        |             |       |             | transport            |
| 10     | ECT(0)      | 0     | ---         | Legacy ECN use only  |
|        |             |       |             |                      |
| 10     | ECT(0)      | 1     | --CU--      | Currently unused     |
|        |             |       |             |                      |
| 11     | CE          | 0     | CE(0)       | Congestion           |
|        |             |       |             | experienced with     |
|        |             |       |             | Re-Echo              |
| 11     | CE          | 1     | CE(-1)      | Congestion           |
|        |             |       |             | experienced          |
+--------+-------------+-------+-------------+----------------------+

 Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re-ECN

4.2.2. Re-ECN Combined with Pre-Congestion Notification (re-PCN)

As permitted by the ECN specification [RFC3168], a proposal is
currently being advanced in the IETF to define different semantics
for how routers might mark the ECN field of certain packets. The
idea is to be able to notify congestion when the router's load
approaches a logical limit, rather than the physical limit of the
line. This new marking is called pre-congestion notification [PCN]
and we will use the term PCN-enabled router for a router that can
apply pre-congestion notification marking to the ECN fields of
packets.

[RFC3168] recommends that a packet's Diffserv codepoint should
determine which type of ECN marking it receives. A Diffserv per-hop
behaviour (PHB) can specify that routers should apply pre-congestion
notification marking to PCN-capable packets. We will call this a
PCN-enhanced PHB. A PCN-capable packet must meet two conditions: it
must carry a DSCP that maps to a PCN-enhanced PHB and it must carry
an ECN field that turns on PCN marking.

As an example, the controlled load (CL) PHB might specify expedited
forwarding as its scheduling behaviour and PCN marking as its
congestion marking behaviour. Then we would say the CL PHB is a
PCN-enhanced PHB, and that packets with a DSCP that maps to the CL
PHB and with ECN turned on are PCN-capable packets.

[PCN] actually proposes that two logical limits should be used for
pre-congestion notification, with the higher limit as a back-stop for
dealing with anomalous events. It envisages PCN will be used for
admission control of inelastic real-time traffic, so marking at the
lower limit will trigger admission control, while at the higher limit
it will trigger flow pre-emption.

Because it needs two types of congestion marking, PCN seems to need
five states: Not-ECT, ECT (ECN-capable transport), the ECN Nonce,
Admission Marking (AM) and Flow Pre-emption Marking (PM). [PCN]
proposes various alternative encodings of the ECN field, attempting
various compromises to fit these five states into the four available
ECN codepoints.

One of the five states to make room for is the ECN Nonce [RFC3540],
but the capability we describe in this memo supersedes any need for
the Nonce. The ECN Nonce is an elegant scheme, but it only allows a
sending node (or its proxy) to detect suppression of congestion
marking in the feedback loop. Thus the Nonce requires the sender or
its proxy to be trusted to respond correctly to congestion. But this
is precisely the main cheat we want to protect against (as well as
many others).

One of the compromise protocol encodings that [PCN] explores
("Alternative 5") leaves out support for the ECN Nonce. Therefore we
use that one. This encoding of PCN markings is shown on the left of
Table 2. Note that these codepoints of the ECN field only take on
the semantics of pre-congestion notification if they are combined
with a Diffserv codepoint that the operator has configured to cause
PCN marking, by mapping it to a PCN-enhanced PHB.

For the rest of this memo, we will not distinguish between Admission
Marking and Pre-emption Marking unless we need to be specific. We
will call both "congestion marking". With the above encoding,
congestion marking can be read to mean any packet with the left-most
bit of the ECN field set.

The re-ECN protocol can be used to control misbehaving sources
whether congestion is with respect to a logical threshold (PCN) or
the physical line rate (ECN). In either case the RE flag can be used
to create an extended ECN field. For PCN-capable packets, the 8
possible encodings of this 3-bit extended ECN (EECN) field are
defined on the right of Table 2 below. The purposes of these
different codepoints will be introduced in subsequent sections.
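
As a reading aid, the mapping defined in Table 2 and the rule that
congestion marking means the left-most bit of the ECN field is set
could be sketched as below. The function and table names are
hypothetical, and the sketch takes the ECN field and RE flag as
already-extracted integers, because the position of the RE flag
differs between IPv4 and IPv6.

```python
# (ECN field, RE flag) -> extended ECN codepoint name for PCN-capable
# packets, transcribed from Table 2.
PCN_EECN_NAMES = {
    (0b00, 0): "Not-RECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "AM(0)",
    (0b10, 1): "AM(-1)",
    (0b11, 0): "PM(0)",
    (0b11, 1): "PM(-1)",
}


def eecn_name(ecn: int, re_flag: int) -> str:
    """Name of the extended ECN codepoint for a PCN-capable packet."""
    return PCN_EECN_NAMES[(ecn & 0b11, re_flag & 0b1)]


def is_congestion_marked(ecn: int) -> bool:
    """True for Admission or Pre-emption Marking (left-most ECN bit set)."""
    return bool(ecn & 0b10)


if __name__ == "__main__":
    for (ecn, re_flag), name in PCN_EECN_NAMES.items():
        print(f"{ecn:02b} {re_flag} {name:9} marked={is_congestion_marked(ecn)}")
```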

+-------+-----------------+------+--------------+-------------------+
| ECN   | PCN codepoint   | RE   | Extended ECN | Re-ECN meaning    |
| field | (Alternative 5) | flag | codepoint    |                   |
+-------+-----------------+------+--------------+-------------------+
| 00    | Not-ECT         | 0    | Not-RECT     | Not               |
|       |                 |      |              | re-ECN-capable    |
|       |                 |      |              | transport         |
| 00    | Not-ECT         | 1    | FNE          | Feedback not      |
|       |                 |      |              | established       |
| 01    | ECT(1)          | 0    | Re-Echo      | Re-echoed         |
|       |                 |      |              | congestion and    |
|       |                 |      |              | RECT              |
| 01    | ECT(1)          | 1    | RECT         | Re-ECN capable    |
|       |                 |      |              | transport         |
| 10    | AM              | 0    | AM(0)        | Admission Marking |
|       |                 |      |              | with Re-Echo      |
| 10    | AM              | 1    | AM(-1)       | Admission Marking |
|       |                 |      |              |                   |
| 11    | PM              | 0    | PM(0)        | Pre-emption       |
|       |                 |      |              | Marking with      |
|       |                 |      |              | Re-Echo           |
| 11    | PM              | 1    | PM(-1)       | Pre-emption       |
|       |                 |      |              | Marking           |
+-------+-----------------+------+--------------+-------------------+

    Table 2: Extended ECN Codepoints if the Diffserv codepoint uses
                  Pre-congestion Notification (PCN)

4.3. Protocol Operation

4.3.1. Protocol Operation for an Established Flow

The re-ECN protocol involves a simple tweak to the action of the
gateway at the ingress edge of the CL region. In the deployment
model just described [PCN-arch], for each active traffic aggregate
across the CL region (CL-region-aggregate) the ingress gateway will
hold a fairly recent Congestion-Level-Estimate that the egress
gateway will have fed back to it, piggybacked on the signalling that
sets up each flow. For instance, one aggregate might have been
experiencing 3% pre-congestion (that is, congestion marked octets
whether Admission Marked or Pre-emption Marked). In this case, the
ingress gateway MUST clear the RE flag to "0" for the same percentage
of octets of CL-packets (3%) and set it to "1" in the rest (97%).
Appendix A.1 gives a simple pseudo-code algorithm that the ingress
gateway may use to do this.

The RE flag is set and cleared this way round for incremental
deployment reasons (see [Re-TCP]). To avoid confusion we will use
the term `blanking' (rather than marking) when the RE flag is cleared
to "0", so we will talk of the `RE blanking fraction' as the fraction
of octets with the RE flag cleared to "0".

    ^
    |
    |     RE blanking fraction
 3% |    +----------------------------+====+
    |    |                            |    |
 2% |    |                            |    |
    |    | congestion marking fraction|    |
 1% |    |     +----------------------+    |
    |    |     |                           |
 0% +----+=====+---------------------------+------>
         ^ <--A--->  <---B--->  <---C--->  ^  domain
         |     ^                      ^    |
      ingress  |                      |  egress
             1.00%                  2.00%  marking fraction

     Figure 3: Example Extended ECN codepoint Marking fractions
                            (Imprecise)

Figure 3 illustrates our example. The horizontal axis represents the
index of each congestible resource (typically queues) along a path
through the Internet. The two superimposed plots show the fraction
of each ECN codepoint observed along this path, assuming there are
two congested routers somewhere within domains A and C. And Table 3
below shows the downstream pre-congestion measured at various border
observation points along the path. Figure 4 (later) shows the same
results of these subtractions, but in graphical form like the above
figure. The tabulated figures are actually reasonable approximations
derived from more precise formulae given in Appendix A of [Re-TCP].
The RE flag is not changed by interior routers, so it can be seen
that it acts as a reference against which the congestion marking
fraction can be compared along the path.
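
The following sketch illustrates the two operations described in this
section: blanking the RE flag on a given fraction of octets at the
ingress, and subtracting the congestion marking fraction from the RE
blanking fraction at a border. It is not the Appendix A.1 algorithm,
only one assumed way to meet the same requirement; the class
ReBlanker, its octet credit counter and the function name are
hypothetical.

```python
class ReBlanker:
    """Chooses the RE flag for each CL packet of one aggregate.

    Assumed scheme: a credit of owed octets grows by level * size for
    every packet, and a packet is blanked whenever at least one whole
    packet's worth of octets is owed.
    """

    def __init__(self, congestion_level: float) -> None:
        self.level = congestion_level  # e.g. 0.03 for 3% pre-congestion
        self.owed = 0.0                # octets of blanking still owed

    def re_flag(self, size: int) -> int:
        """Return 0 (blanked) or 1 for a packet of `size` octets."""
        self.owed += self.level * size
        if self.owed >= size:
            self.owed -= size
            return 0
        return 1


def downstream_pre_congestion(re_blanking: float, marking: float) -> float:
    """RE blanking fraction minus congestion marking fraction."""
    return re_blanking - marking


if __name__ == "__main__":
    blanker = ReBlanker(0.03)
    flags = [blanker.re_flag(1000) for _ in range(10_000)]
    print("blanked fraction:", flags.count(0) / len(flags))
    # Marking fractions observed at each border in the example path.
    for border, marking in [("ingress -- A", 0.00), ("A -- B", 0.01),
                            ("B -- C", 0.01), ("C -- egress", 0.03)]:
        print(border, round(downstream_pre_congestion(0.03, marking), 2))
```

Counting in octets rather than packets keeps the blanked share close
to the target level even when packet sizes vary.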

+--------------------------+---------------------------------------+
| Border observation point | Approximate Downstream pre-congestion |
+--------------------------+---------------------------------------+
| ingress -- A             | 3% - 0% = 3%                          |
| A -- B                   | 3% - 1% = 2%                          |
| B -- C                   | 3% - 1% = 2%                          |
| C -- egress              | 3% - 3% = 0%                          |
+--------------------------+---------------------------------------+

 Table 3: Downstream Congestion Measured at Example Observation Points

Note that the ingress determines the RE blanking fraction for each
aggregate using the most recent feedback from the relevant egress,
arriving with each new reservation, or each refresh. These updates
arrive relatively infrequently compared to the speed with which
congestion changes. Although this feedback will always be out of
date, on average positive errors should cancel out negative over a
sufficiently long duration.

In summary, the network adds pre-congestion marking in the forward
data path, the egress feeds its level back to the ingress in RSVP (or
similar signalling), then the ingress gateway re-echoes it into the
forward data path by blanking the RE flag. Hence the name re-ECN.
Then at any border within the Diffserv region, the pre-congestion
marking that every passing packet will be expected to experience
downstream can be measured to be the RE blanking fraction minus the
congestion marking fraction.

4.3.2. Aggregate Bootstrap

When a new reservation PATH message arrives at the egress, if there
are currently no flows in progress from the same ingress, there will
be no state maintaining the current level of pre-congestion marking
for the aggregate. While the reservation signalling continues onward
towards the receiving host, the egress gateway returns an RSVP
message to the ingress with a flag [RSVP-ECN] asking the ingress to
send a specified number of data probes between them. This bootstrap
behaviour is all described in the deployment model [PCN-arch].

However, with our new re-ECN scheme, the ingress does not know what
proportion of the data probes should have the RE flag blanked,
because it has no estimate yet of pre-congestion for the path across
the Diffserv region.

To be conservative, following the guidance for specifying other
re-ECN transports in [Re-TCP], the ingress SHOULD set the FNE
codepoint of the extended ECN header in all probe packets (Table 2).
As per the deployment model, the egress gateway measures the fraction
of congestion-marked probe octets and feeds back the resulting
pre-congestion level to the ingress, piggy-backed on the returning
reservation response (RESV) for the new flow. Probe packets are
identifiable by the egress because they have the ingress as the
source and the egress as the destination in the IP header.

It may seem inadvisable to expect the FNE codepoint to be set on
probes, given legacy firewalls etc. might discard such packets
(because this flag had no previous legitimate use). However, in the
deployment scenarios envisaged, each domain in the Diffserv region
has to be explicitly configured to support the controlled load
service. So, before deploying the service, the operator MUST
reconfigure such a misbehaving middlebox to allow through packets
with the RE flag set.

Note that we have said SHOULD rather than MUST for the FNE setting
behaviour of the ingress for probe packets. This entertains the
possibility of an ingress implementation having the benefit of other
knowledge of the path, which it re-uses for a newly starting
aggregate. For instance, it may hold cached information from a
recent use of the aggregate that is still sufficiently current to be
useful.

It might seem pedantic worrying about these few probe packets, but
this behaviour ensures the system is safe, even if the proportion of
probe packets becomes large.

4.3.3. Flow Bootstrap

It might be expected that a new flow within an active aggregate would
need no special bootstrap behaviour. If there was an aggregate
already in progress between the gateways the new flow was about to
use, it would inherit the prevailing RE blanking fraction. And if
there were no active aggregate, the bootstrap behaviour for an
aggregate would be appropriate and sufficient for the new flow.

However, for a number of reasons, at least the first packet of each
new flow SHOULD be set to the FNE codepoint, irrespective of whether
it is joining an active aggregate or not. If the first packet is
unlikely to be reliably delivered, a number of FNE packets MAY be
sent to increase the probability that at least one is delivered to
the egress gateway.

If each flow does not start with an FNE packet, it will be seen later
that sanctions may be too strict at the interface before the egress
gateway. It will often be possible to apply sanctions at the
granularity of aggregates rather than flows, but in an internetworked
environment it cannot be guaranteed that aggregates will be
identifiable in remote networks. So setting FNE at the start of each
flow is a safe strategy. For instance, a remote network may have
equal cost multi-path (ECMP) routing enabled, causing different flows
between the same gateways to traverse different paths.

After an idle period of more than 1 second, the ingress gateway
SHOULD set the EECN field of the next packet it sends to FNE. This
allows the design of network policers to be deterministic (see
[Re-TCP]).
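
The two flow bootstrap rules above (FNE on the first packet of a
flow, and FNE on the next packet after more than 1 second of idle
time) might be sketched as below. The sketch assumes the idle timer
is kept per flow; the class name, the flow identifier type and the
caller-supplied timestamps are hypothetical details chosen for
illustration.

```python
IDLE_LIMIT_SECONDS = 1.0  # idle period after which FNE is sent again


class FlowBootstrap:
    """Tracks when each flow last sent a packet (hypothetical helper)."""

    def __init__(self) -> None:
        self.last_sent = {}  # flow identifier -> time of last packet

    def needs_fne(self, flow_id, now: float) -> bool:
        """True if this packet SHOULD carry the FNE codepoint.

        FNE is chosen for the first packet of a flow and for the first
        packet after the flow has been idle for more than 1 second.
        """
        last = self.last_sent.get(flow_id)
        self.last_sent[flow_id] = now
        return last is None or (now - last) > IDLE_LIMIT_SECONDS


if __name__ == "__main__":
    state = FlowBootstrap()
    for t in (0.0, 0.4, 0.9, 2.5):
        print(t, state.needs_fne(("10.0.0.1", "10.0.0.2"), t))
```

An ingress that can rely on a common aggregate identifier along the
whole path could key the same state on the aggregate instead of the
flow.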

However, if the ingress gateway can guarantee that the network(s)
that will carry the flow to its egress gateway all use a common
identifier for the aggregate (e.g. a single MPLS network without ECMP
routing), it MAY omit setting FNE when it adds a new flow to an
active aggregate. In that case an FNE packet need only be sent if
the whole aggregate has been idle for more than 1 second.

4.3.4. Router Forwarding Behaviour

Adding re-ECN works well without modifying the forwarding behaviour
of any routers. However, below, two changes are proposed when
forwarding packets with a per-hop-behaviour that requires
pre-congestion notification:

Preferential drop:  When a router cannot avoid dropping ECN-capable
   packets, preferential dropping of packets with different extended
   ECN codepoints SHOULD be implemented between packets within a PHB
   that uses PCN marking. The drop preference order to use is
   defined in Table 4. Note that to reduce configuration complexity,
   Re-Echo and FNE MAY be given the same drop preference, but if
   feasible, FNE should be dropped in preference to Re-Echo.

   +---------+-------+----------------+---------+----------------------+
   | ECN     | RE    | Extended ECN   | Drop    | Re-ECN meaning       |
   | field   | flag  | codepoint      | Pref    |                      |
   +---------+-------+----------------+---------+----------------------+
   | 01      | 0     | Re-Echo        | 5/4     | Re-echoed congestion |
   |         |       |                |         | and RECT             |
   | 00      | 1     | FNE            | 4       | Feedback not         |
   |         |       |                |         | established          |
   | 01      | 1     | RECT           | 3       | Re-ECN capable       |
   |         |       |                |         | transport            |
   | 10      | 0     | AM(0)          | 3       | Admission Marking    |
   |         |       |                |         | with Re-Echo         |
   | 10      | 1     | AM(-1)         | 3       | Admission Marking    |
   |         |       |                |         |                      |
   | 11      | 0     | PM(0)          | 2       | Pre-emption Marking  |
   |         |       |                |         | with Re-Echo         |
   | 11      | 1     | PM(-1)         | 2       | Pre-emption Marking  |
   |         |       |                |         |                      |
   | 00      | 0     | Not-RECT       | 1       | Not re-ECN-capable   |
   |         |       |                |         | transport            |
   +---------+-------+----------------+---------+----------------------+

    Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

   Given this proposal is being advanced at the same time as PCN
   itself, we strongly RECOMMEND that preferential drop based on
   extended ECN codepoint is added to router forwarding at the same
   time as PCN marking. Preferential dropping can be difficult to
   implement, but we strongly RECOMMEND this security-related re-ECN
   improvement where feasible as it is an effective defence against
   flooding attacks.

Marking vs. Drop:  We propose that PCN-routers SHOULD inspect the RE
   flag as well as the ECN field to decide whether to drop or mark
   packets that carry a PCN DSCP. They MUST choose drop if the
   codepoint of this extended ECN field is Not-RECT. Otherwise they
   SHOULD mark (unless, of course, buffer space is exhausted).

   A PCN-capable router MUST NOT ever congestion mark a packet
   carrying the Not-RECT codepoint because the transport will only
   understand drop, not congestion marking. But a PCN-capable router
   can mark rather than drop an FNE packet, even though its ECN field
   when looked at in isolation is '00' which appears to be a legacy
   Not-ECT packet. Therefore, if a packet's RE flag is '1', even if
   its ECN field is '00', a PCN-enabled router SHOULD use congestion
   marking. This allows the `feedback not established' (FNE)
   codepoint to be used for probe packets, in order to pick up PCN
   marking when bootstrapping an aggregate.

   ECN marking rather than dropping of FNE packets MUST only be
   deployed in controlled environments, such as that in [PCN-arch],
   where the presence of an egress node that understands ECN marking
   is assured. Congestion events might otherwise be ignored if the
   receiver only understands drop, rather than ECN marking. This is
   because there is no guarantee that ECN capability has been
   negotiated if feedback is not established (FNE). Also, [Re-TCP]
   places the strong condition that a router MUST apply drop rather
   than marking to FNE packets unless it can guarantee that FNE
   packets are rate limited either locally or upstream.

4.3.5. Extensions

If a different signalling system, such as NSIS, were used, but it
provided admission control in a similar way, using pre-congestion
notification (e.g. with RMD [NSIS-RMD]) we believe re-ECN could be
used to protect against misbehaving networks in the same way as
proposed above.

5. Emulating Border Policing with Re-ECN

Note that the re-ECN protocol described in Section 4 above would
require standardisation, whereas operators acting in their own
interests would be expected to deploy policing and monitoring
functions similar to those proposed in the sections below without any
further need for standardisation by the IETF. Flexibility is
expected in exactly how policing and monitoring is done.

5.1. Informal Terminology

In the rest of this memo, where the context makes it clear, we will
sometimes loosely use the term `congestion' rather than using the
stricter `downstream pre-congestion'. Also we will loosely talk of
positive or negative flows, meaning flows where the moving average of
the downstream pre-congestion metric is persistently positive or
negative. The notion of a negative metric arises because it is
derived by subtracting one metric from another. Of course actual
downstream congestion cannot be negative, only the metric can
(whether due to time lags or deliberate malice).

Just as we will loosely talk of positive and negative flows, we will
also talk of positive or negative packets, meaning packets that
contribute positively or negatively to downstream pre-congestion.

Therefore packets can be considered to have a `worth' of +1, 0 or -1,
which, when multiplied by their size, indicates their contribution to
downstream congestion. Packets will usually be sent with a worth of
0. Blanking the RE flag increments the worth of a packet to +1.
Congestion marking a packet decrements its worth (whether admission
marking or pre-emption marking). Congestion marking a previously
blanked packet cancels out the positive and negative worth of the two
markings, leaving a worth of 0. The FNE codepoint is an exception.
It has the same positive worth as a packet with the Re-Echo
codepoint. The table below specifies unambiguously the worth of each
extended ECN codepoint. Note the order is different from the
previous table to emphasise how congestion marking processes
decrement the worth.

+---------+-------+-----------------+-------+-----------------------+
| ECN     | RE    | Extended ECN    | Worth | Re-ECN meaning        |
| field   | flag  | codepoint       |       |                       |
+---------+-------+-----------------+-------+-----------------------+
| 00      | 0     | Not-RECT        | n/a   | Not re-ECN-capable    |
|         |       |                 |       | transport             |
| 01      | 0     | Re-Echo         | +1    | Re-echoed congestion  |
|         |       |                 |       | and RECT              |
| 10      | 0     | AM(0)           | 0     | Admission Marking     |
|         |       |                 |       | with Re-Echo          |
| 11      | 0     | PM(0)           | 0     | Pre-emption Marking   |
|         |       |                 |       | with Re-Echo          |
| 00      | 1     | FNE             | +1    | Feedback not          |
|         |       |                 |       | established           |
| 01      | 1     | RECT            | 0     | Re-ECN capable        |
|         |       |                 |       | transport             |
| 10      | 1     | AM(-1)          | -1    | Admission Marking     |
|         |       |                 |       |                       |
| 11      | 1     | PM(-1)          | -1    | Pre-emption Marking   |
+---------+-------+-----------------+-------+-----------------------+

            Table 5: 'Worth' of Extended ECN Codepoints

5.2. Policing Overview

It will be recalled that downstream congestion can be found by
subtracting upstream congestion from path congestion. Figure 4
displays the difference between the two plots in Figure 3 to show
downstream pre-congestion across the same path through the Internet.

To emulate border policing, the general idea is for each domain to
apply penalties to its upstream neighbour in proportion to the amount
of downstream pre-congestion that the upstream network sends across
the border. That is, the penalties should be in proportion to the
height of the plot. Downward arrows in the figure show the resulting
pressure for each domain to under-declare downstream pre-congestion
in traffic they pass to the next domain, because of the penalties.
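
To make the notion of worth concrete, the sketch below sums worth
multiplied by packet size over all packets crossing a border, with no
per-flow state. The worth values are transcribed from Table 5; the
function name and the representation of a packet as a (codepoint,
size) pair are assumptions made for this illustration.

```python
# Worth of each extended ECN codepoint, transcribed from Table 5.
# Not-RECT has no worth (n/a) and is left out of the metric.
WORTH = {
    "Re-Echo": +1,
    "AM(0)": 0,
    "PM(0)": 0,
    "FNE": +1,
    "RECT": 0,
    "AM(-1)": -1,
    "PM(-1)": -1,
}


def downstream_congestion_octets(packets):
    """Bulk downstream pre-congestion, in octets, for passing packets.

    `packets` is an iterable of (codepoint name, size in octets)
    pairs. No flow identifiers are needed.
    """
    total = 0
    for codepoint, size in packets:
        if codepoint == "Not-RECT":
            continue  # worth is not defined for these packets
        total += WORTH[codepoint] * size
    return total


if __name__ == "__main__":
    sample = [("RECT", 1500), ("Re-Echo", 1500), ("AM(-1)", 1500),
              ("FNE", 100), ("Not-RECT", 500)]
    print(downstream_congestion_octets(sample))
```

A persistently negative total over an identifiable flow or aggregate
is the condition that the sanctions discussed in this section are
meant to discourage.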

             p e n a l t i e s
             /       |       \
    A                :         :           :
    |      | <--A---> <---B---> <---C--->   domain
    |      V         :         :           :
 3% |    +-----+     |         |           :
    |    |     |     V         V           :
 2% |    |     +----------------------+    :
    |    |  downstream pre-congestion |    :
 1% |    |     :                      |    :
    |    |     :                      |    :
 0% +----+----------------------------+====+------>
         :     :                      :  A :
         :     :                      :  | :
      ingress  :                      :  : egress
             1.00%                  2.00%:  pre-congestion
                                         |
                                     sanctions

 Figure 4: Policing Framework, showing creation of opposing pressures
  to under-declare and over-declare downstream pre-congestion, using
                       penalties and sanctions

These penalties seem to encourage everyone to understate downstream
congestion in order to reduce the penalties they incur. But a
balancing pressure is introduced by the last domain, which applies
sanctions to flows if downstream congestion goes negative before the
egress gateway. The upward arrow at Domain C's border with the
egress gateway represents the incentive the sanctions would create to
prevent negative traffic. The same upward pressure can be applied at
any domain border (arrows not shown).

Any flow that persistently goes negative by the time it leaves a
domain must not have been marked correctly in the first place. A
domain that discovers such a flow can adopt a range of strategies to
protect itself. Which strategy it uses will depend on policy,
because it cannot immediately assume malice--there may be an innocent
configuration error somewhere in the system.

This memo does not propose to standardise any particular mechanism to
detect persistently negative flows, but Section 5.5 does give
examples. Note that we have used the term flow, but there will be no
need to burrow into the transport layer for port numbers; identifiers
visible in the network layer will be sufficient (IP address pair,
DSCP, protocol ID). The appendix also gives a mechanism to bound the
required flow state, preventing state exhaustion attacks.

Of course, some domains may trust other domains to comply with
admission control without applying sanctions or penalties. In these
cases, the protocol should still be used but no penalties need be
applied. The re-ECN protocol ensures downstream pre-congestion
marking is passed on correctly whether or not penalties are applied
to it, so the system works just as well with a mixture of some
domains trusting each other and others not.

Providers should be free to agree the contractual terms they wish
between themselves, so this memo does not propose to standardise how
these penalties would be applied. It is sufficient to standardise
the re-ECN protocol so the downstream pre-congestion metric is
available if providers choose to use it. However, the next section
(Section 5.3) gives some examples of how these penalties might be
implemented.

5.3. Pre-requisite Contractual Arrangements

The re-ECN protocol has been chosen to solve the policing problem
because it embeds a downstream pre-congestion metric in passing CL
traffic that is difficult to lie about and can be measured in bulk.
The ability to emulate border policing depends on network operators
choosing to use this metric as one of the elements in their contracts
with each other.

Already many inter-domain agreements involve a capacity and a usage
element. The usage element may be based on volume or various
measures of peak demand. We expect that those network operators who
choose to use pre-congestion notification for admission control would
also be willing to consider using this downstream pre-congestion
metric as a usage element in their interconnection contracts for
admission controlled (CL) traffic.
Congestion (or pre-congestion) has the dimension of [octet], being the product of volume transferred [octet] and the congestion fraction [dimensionless], which is the fraction of the offered load that the network isn't able to serve (or would rather not serve in the case of pre-congestion). Measuring downstream congestion gives a measure of the volume transferred but modulated by congestion expected downstream. So volume transferred during off-peak periods counts as nearly nothing, while volume transferred at peak times counts very highly. The re-ECN protocol allows one network to measure how much pre-congestion has been `dumped' into it by another network, and then in turn how much of that pre-congestion it dumped into the next downstream network.

Section 5.6 describes mechanisms for calculating border penalties, referring to Appendix A.2 for suggested metering algorithms for downstream congestion at a border router. Conceptually, it could hardly be simpler. It broadly involves accumulating the volume of packets with the RE flag blanked and the volume of those with congestion marking, then subtracting the two.

Once this downstream pre-congestion metric is available, operators are free to choose how they incorporate it into their interconnection contracts [IXQoS]. Some may include a threshold volume of pre-congestion as a quality measure in their service level agreement, perhaps with a penalty clause if the upstream network exceeds this threshold over, say, a month. Others may agree a set of tiered monthly thresholds, with increasing penalties as each threshold is exceeded. But, it would be just as easy, and more resistant to gaming, to do away with discrete thresholds, and instead make the penalty rise smoothly with the volume of pre-congestion by applying a price to pre-congestion itself.
Then the usage element of the interconnection contract would directly relate to the volume of pre-congestion caused by the upstream network.

The direction of penalties and charges relative to the direction of traffic flow is a constant source of confusion. Typically, where capacity charges are concerned, lower tier customer networks pay higher tier provider networks. So money flows from the edges to the middle of the internetwork, towards greater connectivity, irrespective of the flow of data. But we advise that penalties or charges for usage should follow the same direction as the data flow--the direction of control at the network layer. Otherwise a network lays itself open to `denial of funds' attacks. So, where a tier 2 provider sends data into a tier 3 customer network, we would expect the penalty clauses for sending too much pre-congestion to be against the tier 2 network, even though it is the provider.

It may help to remember that data will be flowing in the other direction too. So the provider network has as much opportunity to levy usage penalties as its customer, and it can set the price or strength of its own penalties higher if it chooses. Usage charges in both directions tend to cancel each other out, which confirms that usage-charging is less to do with revenue raising and more to do with encouraging load control discipline in order to smooth peaks and troughs, improving utilisation and quality.

Further, when operators agree penalties in their interconnection contracts for sending downstream congestion, they should make sure that any level of negative marking only equates to zero penalty. In other words, penalties are always paid in the same direction as the data, and never against the data flow, even if downstream congestion seems to be negative.
This is consistent with the definition of physical congestion; when a resource is underutilised, it is not negatively congested. Its congestion is just zero. So, although short periods of negative marking can be tolerated to correct temporary over-declarations due to lags in the feedback system, persistent downstream negative congestion can have no physical meaning and therefore must signify a problem. The incentive for domains not to tolerate persistently negative traffic depends on this principle that penalties must never be paid against the data flow.

Also note that at the last egress of the Diffserv region, domain C should not agree to pay any penalties to the egress gateway for pre-congestion passed to the egress gateway. Downstream pre-congestion to the egress gateway should have reached zero here. If domain C were to agree to pay for any remaining downstream pre-congestion, it would give the egress gateway an incentive to over-declare pre-congestion feedback and take the resulting profit from domain C.

To focus the discussion, from now on, unless otherwise stated, we will assume a downstream network charges its upstream neighbour in proportion to the pre-congestion it sends (V_b in the notation of Appendix A.2). Effectively tiered thresholds would be just more coarse-grained approximations of the fine-grained case we choose to examine. If these neighbours had previously agreed that the (fixed) price per octet of pre-congestion would be L, then the bill at the end of the month would simply be the product L*V_b, plus any fixed charges they may also have agreed.

We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. Indeed, this principle is at the heart of all our own work.
Our aim here is to make a new metric available that we believe is superior to all existing metrics. Then, our aim is to show that border policing can at least work with the one model we have just outlined. We assume that operators might then experiment with the metric in other models. Of course, operators are free to complement this pre-congestion-based usage element of their charges with traditional capacity charging, and we expect they will.

Also note well that everything we discuss in this memo only concerns interconnection within the Diffserv region. ISPs are free to sell or give away reservations however they want on the retail market. But of course, interconnection charges will have a bearing on that. Indeed, in the present scenario, the ingress gateway effectively sells reservations on one side and buys congestion penalties on the other. As congestion rises, one can imagine the gateway discovering that congestion penalties have risen higher than the (probably fixed) revenue it will earn from selling the next flow reservation. This encourages the gateway to cut its losses by blocking new calls, which is why we believe downstream congestion penalties can emulate per-flow rate policing at borders, as the next section explains.

5.4. Emulation of Per-Flow Rate Policing: Rationale and Limits

The important feature of charging in proportion to congestion volume is that the penalty aggregates and disaggregates correctly along with packet flows. This is because the penalty rises linearly with bit rate (unless congestion is absolutely zero) and linearly with congestion, because it is the product of them both.
So if the packets crossing a border belong to a thousand flows, and one of those flows doubles its rate, the ingress gateway forwarding that flow will have to put twice as much congestion marking into the packets of that flow. And this extra congestion marking will add proportionately to the penalties levied at every border the flow crosses in proportion to the amount of pre-congestion remaining on the path.

Effectively, usage charges will continuously flow from ingress gateways to the places generating pre-congestion marking, in proportion to the pre-congestion marking introduced and to the data rates from those gateways.

As importantly, pre-congestion itself rises super-linearly with utilisation of a particular resource. So if someone tries to push another flow into a path that is already signalling enough pre-congestion to warrant admission control, the penalty will be a lot greater than it would have been to add the same flow to a less congested path. This makes the incentive system fairly insensitive to the actual level of pre-congestion for triggering admission control that each ingress chooses. The deterrent against exceeding whatever threshold is chosen rises very quickly with a small amount of cheating.

These are the properties that allow re-ECN to emulate per-flow border policing of both rate and admission control. It is not a perfect emulation of per-flow border policing, but we claim it is sufficient to at least ensure the cost to others of a cheat is borne by the cheater, because the penalties are at least proportionate to the level of the cheat. If an edge network operator is selling reservations at a large profit over the congestion cost, these pre-congestion penalties will not be sufficient to ensure networks in the middle get a share of those profits, but at least they can cover their costs.
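The linear scaling described at the start of this section can be illustrated with a small calculation. The following Python sketch is not part of the protocol: it assumes the proportional charging model of Section 5.3 (bill = L*V_b) and, purely for simplicity, holds the congestion fraction fixed while one flow doubles its rate. The function name, rates and price are hypothetical values chosen for the example.

```python
from fractions import Fraction


def border_penalty(price_per_octet, flow_rates, congestion_fraction, seconds):
    """Return L * V_b for one border over an accounting interval.

    V_b (octets of pre-congestion) is modelled as the congestion fraction
    times the volume sent; that simplification is an assumption of this
    sketch, not a definition taken from the memo.
    """
    volume = sum(flow_rates) * seconds      # octets crossing the border
    v_b = congestion_fraction * volume      # octets of pre-congestion
    return price_per_octet * v_b


# Hypothetical figures: 1000 flows of 10,000 octet/s at 0.1% pre-congestion.
rates = [10_000] * 1000
p = Fraction(1, 1000)
price = 1                                   # price L per octet, any currency

before = border_penalty(price, rates, p, seconds=1)
rates[0] *= 2                               # one flow doubles its rate
after = border_penalty(price, rates, p, seconds=1)

# The extra penalty equals L * p * (extra octets sent by that one flow).
extra = after - before
print(before, after, extra)
```

In practice the congestion fraction would itself rise with the added load, so the real increase in penalty would be somewhat larger than this fixed-fraction estimate.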
We will now explain with an example. When a whole inter-network is operating at normal (typically very low) congestion, the pre-congestion marking from virtual queues will be a little higher than if the real queues had been used--still low, but more noticeable. But low congestion levels do not imply that usage _charges_ must also be low. Usage charges will depend on the _price_ L as well.

If the metric of the usage element of an interconnection agreement was changed from pure volume to pre-congested volume, one would expect the price of pre-congestion to be arranged so that the total usage charge remained about the same. So, if an average pre-congestion fraction turned out to be 1/1000, one would expect that the price L (per octet) of pre-congestion would be about 1000 times the previously used (per octet) price for volume. We should add that a switch to pre-congestion is unlikely to exactly maintain the same overall level of usage charges, but this argument will be approximately true, because usage charge will rise to at least the level the market finds necessary to push back against usage.

From the above example it can be seen why a 1000x higher price will make operators become acutely sensitive to the congestion they cause in other networks, which is of course the desired effect: to encourage networks to _control_ the congestion they allow their users to cause to others.

If any network sends even one flow at a higher rate, it will immediately have to pay proportionately more usage charges. Because there is no knowledge of reservations within the Diffserv region, no interior router can police whether the rate of each flow is greater than each reservation. So the system doesn't truly emulate rate-policing of each flow.
But there is no incentive to pack a higher rate into a reservation, because the charges are directly proportional to rate, irrespective of the reservations.

However, if virtual queues start to fill on any path, even though real queues will still be able to provide low latency service, pre-congestion marking will rise fairly quickly. It may eventually reach the threshold where the ingress gateway would deny admission to new flows. If the ingress gateway cheats and continues to admit new flows, the affected virtual queues will rapidly fill, even though the real queues will still be little worse than they were when admission control should have been invoked. The ingress gateway will have to pay the penalty for such an extremely high pre-congestion level, so the pressure to invoke admission control should become unbearable.

The above mechanisms protect against rational operators. In Section 5.6.3 we discuss how networks can protect themselves from accidental or deliberate misconfiguration in neighbouring networks.

5.5. Sanctioning Dishonest Marking

As CL traffic leaves the last network before the egress gateway (domain C) the RE blanking fraction should match the congestion marking fraction, when averaged over a sufficiently long duration (perhaps ~10s to allow a few rounds of feedback through regular signalling of new and refreshed reservations).

To protect itself, domain C should install a monitor at its egress. It aims to detect flows of CL packets that are persistently negative. If flows are positive, domain C need take no action--this simply means an upstream network must be paying more penalties than it needs to. Appendix A.3 gives a suggested algorithm for the monitor, meeting the criteria below.
o It SHOULD introduce minimal false positives for honest flows;

o It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate expected losses.

Note that the monitor operates on flows but with careful design we can avoid per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a monitor is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to setting FNE (positive `worth').

Starting flows with an FNE packet also means that a monitor will be resistant to state exhaustion attacks from other networks, as the monitor can then be designed to never create state unless an FNE packet arrives. And an FNE packet counts positive, so it will cost a lot for a network to send many of them.

Monitor algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a monitor MUST ignore packets with the FNE codepoint set. An ingress gateway sets the FNE codepoint when it does not have the benefit of feedback from the egress.
So counting packets with FNE set would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the monitor detects a persistently negative flow, it could drop sufficient negative and neutral packets to force the flow to not be negative. This is the approach taken for the `egress dropper' in [Re-TCP], but for the scenario in this memo, where everyone would expect everyone else to keep to the protocol, a management alarm SHOULD be raised on detecting persistently negative traffic and any automatic sanctions taken SHOULD be logged. Even if the chosen policy is to take no automatic action, the cause can then be investigated manually.

Then no ingress can understate downstream pre-congestion without its action being logged. So network operators can deal with offending networks at the human level, out of band. As a last resort, perhaps where the ingress gateway address seems to have been spoofed in the signalling, packets can be dropped. Drops could be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal harm.

A future version of this memo may define a control message that could be used to notify an offending ingress gateway (possibly via the egress gateway) that it is sending persistently negative flows. However, we are aware that such messages could be used to test the sensitivity of the detection system, so currently we prefer silent sanctions.

An extreme scenario would be where an ingress gateway (or set of gateways) mounted a DoS attack against another network.
If their traffic caused sufficient congestion to lead to drop but they understated path congestion to avoid penalties for causing high congestion, the preferential drop recommendations in Section 4.3.4 would at least ensure that these flows would always be dropped before honest flows.

5.6. Border Mechanisms

5.6.1. Border Accounting Mechanisms

One of the main design goals of re-ECN was for border security mechanisms to be as simple as possible, otherwise they would become the pinch-points that limit scalability of the whole internetwork. As the title of this memo suggests, we want to avoid per-flow processing at borders. We also want to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline--in series with forwarding. As data rates continue to rise, we suspect that all-optical interconnection between networks will soon be a requirement. So we want to avoid any new need for buffering (even though border filtering is current practice for other reasons, we don't want to make it even less likely that we will ever get rid of it).

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing, although we do use filtering as a fail-safe to temporarily shield against extreme events in other networks, such as accidental misconfigurations (Section 5.6.3).

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: AM(-1) and PM(-1).
Even though this mechanism takes no regard of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network (see Section 5.3 for examples). Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix A.2.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 5.5 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.
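The basic accounting mechanism described above can be sketched in a few lines. This is an illustrative Python sketch only; the pseudo-code in Appendix A.2.1 remains the reference. The marking labels, the class name and the packet sizes are assumptions made for the example, and the wire encoding of the markings is not modelled.

```python
from dataclasses import dataclass

# Assumed string labels for the markings named in the text.
POSITIVE = {"Re-Echo", "FNE"}       # positive worth
NEGATIVE = {"AM(-1)", "PM(-1)"}     # negative worth


@dataclass
class BorderMeter:
    """Bulk meter for one border interface; keeps no per-flow state."""
    positive_octets: int = 0
    negative_octets: int = 0

    def count(self, marking: str, size_octets: int) -> None:
        if marking in POSITIVE:
            self.positive_octets += size_octets
        elif marking in NEGATIVE:
            self.negative_octets += size_octets
        # Any other marking is treated as neutral and left uncounted.

    def downstream_congestion(self) -> int:
        """Positive volume minus negative volume, in octets."""
        return self.positive_octets - self.negative_octets


meter = BorderMeter()
for marking, size in [("Re-Echo", 1500), ("Neutral", 1500),
                      ("AM(-1)", 1500), ("FNE", 500), ("Neutral", 1500)]:
    meter.count(marking, size)

print(meter.downstream_congestion())
```

Because the meter ignores flow identity, any persistently negative flow crossing the interface lowers the total, which is exactly the bias discussed above.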
We rely on sanctions to deter dishonest understatement of congestion. But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 4.3.4 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour. It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a neighbour. But, the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.
Note that we have not included DoS by Internet hosts in the above list of attacks, because we have restricted ourselves to a scenario with edge-to-edge admission control across a Diffserv region. In this case, the edge ingress gateways insulate the Diffserv region from DoS by Internet hosts. Re-ECN resists more general DoS attacks, but this is discussed in [Re-TCP].

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly. Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix A.2.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows.
Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders--they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative. The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical--best endeavours will be sufficient.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive. Such over-zealous push-back is unnecessary and potentially dangerous. These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far. If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion.
Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine. But not a message that claims that a positive flow will go negative later, so it should be dropped.

5.6.2. Competitive Routing

With the above penalty system, each domain seems to have a perverse incentive to fake pre-congestion. For instance domain B profits from the difference between penalties it receives at its ingress (its revenue) and those it pays at its egress (its cost). So if B overstates internal pre-congestion it seems to increase its profit. However, we can assume that domain A could bypass B, routing through other domains to reach the egress. So the competitive discipline of least-cost routing can ensure that any domain tempted to fake pre-congestion for profit risks losing _all_ its incoming traffic. The least congested route would eventually be able to win this competitive game, only as long as it didn't declare more fake pre-congestion than the next most competitive route.

The competitive effect of interdomain routing might be weaker nearer to the egress. For instance, C may be the only route B can take to reach the ultimate receiver. And if C over-penalises B, the egress gateway and the ultimate receiver seem to have no incentive to move their terminating attachment to another network, because only B and those upstream of B suffer the higher penalties. However, we must remember that we are only looking at the money flows at the unidirectional network layer. There are likely to be all sorts of higher level business models constructed over the top of these low level 'sender-pays' penalties.
For instance, we might expect a session layer charging model where the session originator pays for a pair of duplex flows, one as receiver and one as sender. Traditionally this has been a common model for telephony and we might expect it to be used, at least sometimes, for other media such as video. Wherever such a model is used, the data receiver will be directly affected if its sessions terminate through a network like C that fakes congestion to over-penalise B. So end-customers will experience a direct competitive pressure to switch to cheaper networks, away from networks like C that try to over-penalise B.

This memo does not need to standardise any particular mechanism for routing based on re-ECN. Goldenberg et al [Smart_rtg] refers to various commercial products and presents its own algorithms for moving traffic between multi-homed routes based on usage charges. None of these systems require any changes to standard protocols because the choice between the available border gateway protocol (BGP) routes is based on a combination of local knowledge of the charging regime and local measurement of traffic levels. If, as we propose, charges or penalties were based on the level of re-ECN measured in passing traffic, a similar optimisation could be achieved without requiring any changes to standard routing protocols.

We must be clear that applying pre-congestion-based routing to this admission control system remains an open research issue. Traffic engineering based on congestion requires careful damping to avoid oscillations, and should not be attempted without adult supervision :) Mortier & Pratt [ECN-BGP] have analysed traffic engineering based on congestion.
But without the benefit of re-ECN, they had to add a path attribute to BGP to advertise a route's downstream congestion (actually they proposed that BGP should advertise the charge for congestion, which we believe wrongly embeds an assumption into BGP that the only thing to do with congestion is charge for it).

5.6.3. Fail-safes

The mechanisms described so far create incentives for rational operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators.

Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar:

Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows: A small sample of congestion marked packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the RE blanking fraction minus the congestion marking fraction is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.
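The first of the two fail-safes above might be sketched as follows. This is a minimal Python illustration under stated assumptions, not a proposed implementation: the class name, the sampling probability, the threshold, the minimum packet count and the addresses are all hypothetical, and the sketch does not bound the size of its watch table, which a real monitor would need to do to resist state exhaustion.

```python
import random


class PositiveFlowSampler:
    """Samples positive packets, then watches the matching aggregate."""

    def __init__(self, sample_prob, threshold, min_packets, rng=None):
        self.sample_prob = sample_prob    # chance of sampling a positive packet
        self.threshold = threshold        # alarm if positive fraction exceeds it
        self.min_packets = min_packets    # packets needed before judging
        self.rng = rng or random.Random()
        self.watched = {}                 # key -> [positive, total] packet counts

    def observe(self, src, dst, dscp, worth):
        key = (src, dst, dscp)
        counts = self.watched.get(key)
        if counts is not None:
            # Already watched: count this subsequent packet.
            counts[1] += 1
            if worth > 0:
                counts[0] += 1
        elif worth > 0 and self.rng.random() < self.sample_prob:
            # Candidates are drawn only from positive packets.
            self.watched[key] = [0, 0]

    def alarms(self):
        return [key for key, (pos, total) in self.watched.items()
                if total >= self.min_packets and pos / total > self.threshold]


# sample_prob=1.0 makes the example deterministic; a real value would be small.
monitor = PositiveFlowSampler(sample_prob=1.0, threshold=0.5, min_packets=10)
for i in range(20):
    monitor.observe("192.0.2.1", "198.51.100.7", 46, +1)
    monitor.observe("192.0.2.2", "198.51.100.7", 46, +1 if i % 10 == 0 else 0)

print(monitor.alarms())
```

The second fail-safe would have the same shape, sampling from congestion marked packets instead and alarming when the RE blanking fraction minus the congestion marking fraction stays negative.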

   Both these mechanisms rely on the fact that highly positive (or
   negative) flows will appear more quickly in the sample by selecting
   randomly solely from positive (or negative) packets.

   Note that there is no assumption that _users_ behave rationally. The
   system is protected from the vagaries of irrational user behaviour by
   the ingress gateways, which transform internal penalties into a
   deterministic admission control mechanism that prevents users from
   misbehaving, by directly engineered means.

6.  Analysis

   The domains in Figure 1 are not expected to be completely malicious
   towards each other. After all, we can assume that they are all
   co-operating to provide an internetworking service to the benefit of
   each of them and their customers. Otherwise their routing policies
   would not interconnect them in the first place. However, we assume
   that they are also competitors of each other. So a network may try
   to contravene our proposed protocol if it would gain or make a
   competitor lose, or both, but only if it can do so without being
   caught. Therefore we do not have to consider every possible random
   attack one network could launch on the traffic of another, given
   that one network can always drop or corrupt packets that it forwards
   on behalf of another anyway.

   Therefore, we only consider new opportunities for _gainful_ attack
   that our proposal introduces. But to a certain extent we can also
   rely on the in-depth defences we have described (Section 5.6.3),
   which are intended to mitigate the potential impact if one network
   accidentally misconfigures the workings of this protocol.

   The ingress and egress gateways are shown in the most generic
   arrangement possible in Figure 1, without any surrounding network.
   This allows us to consider more specific cases where these gateways
   and a neighbouring network are operated by the same player.
   As well as cases where the same player operates neighbouring
   networks, we will also consider cases where the two gateways collude
   as one player and where the sender and receiver collude as one.
   Collusion of other sets of domains is less likely, but we will
   consider such cases. In the general case, we will assume none of the
   nine trust domains across the figure fully trust any of the others.

   As we only propose to change routers within the Diffserv region, we
   assume the operators of networks outside the region will be doing
   per-flow policing. That is, we assume the networks outside the
   Diffserv region and the gateways around its edges can protect
   themselves. So given we are proposing to remove flow policing from
   some networks, our primary concern must be to protect networks that
   don't do per-flow policing (the potential `victims') from those that
   do (the `enemy'). The ingress and egress gateways are the only way
   the outer enemy can get at the middle victim, so we can consider the
   gateways as the representatives of the enemy as far as domains A, B
   and C are concerned. We will call this trust scenario `edges against
   middles'.

   Earlier in this memo, we outlined the classic border rate policing
   problem (Section 3). It will now be useful to reiterate the
   motivations that are the root cause of the problem. The more
   reservations a gateway can allow, the more revenue it receives. The
   middle networks want the edges to comply with the admission control
   protocol when they become so congested that their service to others
   might suffer. The middle networks also want to ensure the edges
   cannot steal more service from them than they are entitled to.

   In the context of this `edges against middles' scenario, the re-ECN
   protocol has two main effects:

   o  The more pre-congestion there is on a path across the Diffserv
      region, the higher the ingress gateway must declare downstream
      pre-congestion.

   o  If the ingress gateway does not declare downstream pre-congestion
      high enough on average, it will `hit the ground before the
      runway', going negative and triggering sanctions, either directly
      against the traffic or against the ingress gateway at a management
      level.

   An executive summary of our security analysis can be stated in three
   parts, distinguished by the type of collusion considered.

   Neighbour-only Middle-Middle Collusion: Here there is no collusion
      or collusion is limited to neighbours in the feedback loop. In
      other words, two neighbouring networks can be assumed to act as
      one. Or the egress gateway might collude with domain C. Or the
      ingress gateway might collude with domain A. Or ingress and egress
      gateways might collude with each other.

      In these cases where only neighbours in the feedback loop collude,
      we conclude that all parties have a positive incentive to declare
      downstream pre-congestion truthfully, and the ingress gateway has
      a positive incentive to invoke admission control when congestion
      rises above the admission threshold in any network in the region
      (including its own). No party has an incentive to send more
      traffic than declared in reservation signalling (even though only
      the gateways read this signalling). In short, no party can gain
      at the expense of another.

   Non-neighbour Middle-Middle Collusion: In the case of other forms of
      collusion between middle networks (e.g. between domain A and C) it
      would be possible for say A & C to create a tunnel between
      themselves so that A would gain at the expense of B.
      But C would then lose the gain that A had made. Therefore the
      value to A & C of colluding to mount this attack seems
      questionable. It is made more questionable, because the attack
      can be statistically detected by B using the second `defence in
      depth' mechanism mentioned already. Note that C can defend itself
      from being attacked through a tunnel by treating the tunnel end
      point as a direct link to a neighbouring network (e.g. as if A
      were a neighbour of C, via the tunnel), which falls back to the
      safety of the neighbour-only scenario.

   Middle-Edge Collusion: Collusion between networks or gateways within
      the Diffserv region and networks or users outside the region has
      not yet been fully analysed. The presence of full per-flow
      policing at the ingress gateway seems to make this a less likely
      source of a successful attack.

   {ToDo: Due to lack of time, the full write up of the security
   analysis is deferred to the next version of this memo.}

   Finally, it is well known that the best person to analyse the
   security of a system is not the designer. Therefore, our confident
   claims must be hedged with doubt until others with perhaps a greater
   incentive to break it have mounted a full analysis.

7.  Incremental Deployment

   We believe ECN has so far not been widely deployed because it
   requires widespread end system and network deployment just to achieve
   a marginal improvement in performance. The ability to offer a new
   service (admission control) would be a much stronger driver for ECN
   deployment.

   As stated in the introduction, the aim of this memo is to "Design in
   security from the start" when admission control is based on
   pre-congestion notification.
   The proposal has been designed so that security can be added some
   time after first deployment, but only if the PCN wire protocol
   encoding is defined with the foresight to accommodate the extended
   set of codepoints defined in this document. Given admission control
   based on pre-congestion notification requires few changes to
   standards, it should be deployable fairly soon. However, re-ECN
   requires a change to IP, which may take a little longer.

   We expect that initial deployments of PCN-based admission control
   will be confined to single networks, or to clubs of networks that
   trust each other. The proposal in this memo will only become
   relevant once networks with conflicting interests wish to
   interconnect their admission controlled services, but without the
   scalability constraints of per-flow border policing. It will not be
   possible to use re-ECN, even in a controlled environment between
   consenting operators, unless it is standardised into IP. Given the
   IPv4 header has limited space for further changes, current IESG
   policy [RFC4727] is not to allow experimental use of codepoints in
   the IPv4 header, as whenever an experiment isn't taken up, the space
   it used tends to be impossible to reclaim.

   If PCN-based admission control is deployed before re-ECN is
   standardised into IP, wherever a network (or club of networks)
   connects to another network (or club of networks) with conflicting
   interests, they will place a gateway between the two regions that
   does per-flow rate policing and admission control. If re-ECN is
   eventually standardised into IP, it will be possible for these
   separate regions to upgrade all their gateways to use re-ECN before
   removing the per-flow policing gateways between them.
   Given the edge-to-edge deployment model of PCN-based admission
   control, it is reasonable to imagine this incremental deployment
   model without needing to cater for partial deployment of re-ECN in
   just some of the gateways around one Diffserv region.

   Only the edge gateways around a Diffserv region have to be upgraded
   to add re-ECN support, not interior routers. It is also necessary to
   add the mechanisms that use re-ECN to secure a network against
   misbehaving gateways and networks. Specifically, these are the
   border mechanisms (Section 5.6) and the mechanisms to sanction
   dishonest marking (Section 5.5).

   We also RECOMMEND adding improvements to forwarding on interior
   routers (Section 4.3.4). But the system works whether all, some or
   none are upgraded, so interior routers may be upgraded in a piecemeal
   fashion at any time.

8.  Design Choices and Rationale

   The primary insight of this work is that downstream congestion is the
   metric that would be most useful to control an internetwork, and
   particularly to police how one network responds to the congestion it
   causes in a remote network. This is the problem that has previously
   made it so hard to provide scalable admission control.

   The case for using re-feedback (a generalisation of re-ECN) to police
   congestion response and provide QoS is made in [Re-fb]. Essentially,
   the insight is that congestion is a factor that crosses layers from
   the physical upwards. Therefore re-feedback polices congestion where
   it emerges from a physical interface between networks. This is
   achieved by bringing the congestion information to the interface,
   rather than examining packet addressing where there is congestion.

   Then congestion crossing the physical interface at a border can be
   policed at the interface, rather than policing the congestion on
   packets that claim to come from an address (which may be spoofed).
   Also, re-feedback works in the network layer independently of other
   layers. Despite its name, re-feedback does not actually require
   feedback. It requires a source to act conservatively before it gets
   feedback.

   On the subject of lack of feedback, the feedback not established
   (FNE) codepoint is motivated by arguments for a state set-up bit in
   IP to prevent state exhaustion attacks. This idea was first put
   forward informally by David Clark and documented by Handley and
   Greenhalgh in [Steps_DoS]. The idea is that network layer datagrams
   should signal explicitly when they require state to be created in the
   network layer or the layer above (e.g. at flow start). Then a node
   can refuse to create any state unless a datagram declares this
   intent. We believe the proposed FNE codepoint serves the same
   purpose as the proposed state-set-up bit, but it has been overloaded
   with a more specific purpose, using it on more packets than just the
   first in a flow, but never fewer (i.e. it is idempotent). In effect
   the FNE codepoint serves the purpose of a `soft-state set-up
   codepoint'.

   The re-feedback paper [Re-fb] also makes the case for converting the
   economic interpretation of congestion into a hard engineering
   mechanism, which is the basis of the approach used in this memo. The
   admission control gateways around the Diffserv region use hard
   engineering, not incentives, to prevent end users from sending more
   traffic than they have reserved. Incentive-based mechanisms are only
   used between networks, because they are expected to respond to
   incentives more rationally than end-users can be expected to.
   However, even then, a network can use fail-safes to protect itself
   from excessively unusual behaviour by neighbouring networks, whether
   due to an accidental misconfiguration or malicious intent.

   The guiding principle behind the incentive-based approach used
   between networks is that any gain from subverting the protocol should
   be precisely neutralised, rather than punished. If a gain is
   punished to a greater extent than is sufficient to neutralise it, it
   will most likely open up a new vulnerability, where the amplifying
   effect of the punishment mechanism can be turned on others.

   The re-feedback paper also makes the case against the use of
   congestion charging to police congestion if it is based on classic
   feedback (where only upstream congestion is visible to network
   elements). It argues this would open up receiving networks to
   `denial of funds' attacks and would require end users to accept
   dynamic pricing (which few would).

   Re-ECN has been deliberately designed to simplify policing at the
   borders between networks. These trust boundaries are the critical
   pinch-points that will limit the scalability of the whole
   internetwork unless the overall design minimises the complexity of
   security functions at these borders. The border mechanisms described
   in this memo run passively in parallel to data forwarding and they do
   not require per-flow processing.

9.  Security Considerations

   This whole memo, and the analysis section in particular, concerns the
   security of a scalable admission control system. Below, some
   specific security issues are mentioned that did not belong elsewhere
   or which comment on the overall robustness of the security provided
   by the design.

   Firstly, we must repeat the statement of applicability in the
   analysis: that we only consider new opportunities for _gainful_
   attack that our proposal introduces, particularly if the attacker can
   avoid being identified. Despite only involving a few bits, there is
   sufficient complexity in the whole system that there are probably
   numerous possibilities for other attacks. However, as far as we are
   aware, none reap any benefit to the attacker. For instance, it would
   be possible for a downstream network to remove the congestion
   markings introduced by an upstream network, but it would only lose
   out on the penalties it could apply to a downstream network.

   When one network forwards a neighbouring network's traffic it will
   always be possible to cause damage by dropping or corrupting it.
   Therefore we do not believe networks would set their routing policies
   to interconnect in the first place if they didn't trust the other
   networks not to arbitrarily damage their traffic.

   Having said this, we do want to highlight some of the weaker parts of
   our argument. We have argued that networks will be dissuaded from
   faking congestion marking by the possibility that upstream networks
   will route round them. As we have said, these arguments are based on
   fairly delicate assumptions and will remain fairly tenuous until
   proved in practice, particularly close to the egress where less
   competitive routing is likely.

   We should also point out that the approach in this memo was only
   designed to be robust for admission control. We do not claim the
   incentives will always be strong enough to force correct flow
   pre-emption behaviour. This is because a user will tend to perceive
   much greater loss in value if a flow is pre-empted than if admission
   is denied at the start.
   However, in general the incentives for correct flow pre-emption are
   similar to those for admission control.

   Finally, it may seem that the 8 codepoints that have been made
   available by extending the ECN field with the RE flag have been used
   rather wastefully. In effect the RE flag has been used as an
   orthogonal single bit in nearly all cases, the only exception being
   when the ECN field is cleared to "00". The mapping of the codepoints
   in an earlier version of this proposal used the codepoint space more
   efficiently, but the scheme became vulnerable to a network operator
   focusing its congestion marking to mark more positive than neutral
   packets in order to reduce its penalties (see Appendix B of
   [Re-TCP]).

   With the scheme as now proposed, once the RE flag is set or cleared
   by the sender or its proxy, it should not be written by the network,
   only read. So the gateways can detect if any network maliciously
   alters the RE flag. IPSec AH integrity checking does not cover the
   IPv4 option flags (they were considered mutable, even the one we
   propose using for the RE flag that was `currently unused' when IPSec
   was defined). But it would be sufficient for a pair of gateways to
   make random checks on whether the RE flag was the same when it
   reached the egress gateway as when it left the ingress. Indeed, if
   IPSec AH had covered the RE flag, any network intending to alter
   sufficient RE flags to make a gain would have focused its alterations
   on packets without authenticating headers (AHs).

   No cryptographic algorithms have been harmed in the making of this
   proposal.

10.  IANA Considerations

   This memo includes no request to IANA.

11.  Conclusions

   This memo builds on a promising technique to solve the classic
   problem of making flow admission control scale to any size network.
   It involves the use of Diffserv in a deployment model that uses
   pre-congestion notification feedback to control admission into a
   network path [PCN-arch]. However as it stands, that deployment model
   depends on all network domains trusting each other to comply with the
   protocols, invoking admission control and flow pre-emption when
   requested.

   We propose that the congestion feedback used in that deployment model
   should be re-echoed into the forward data path, by making a trivial
   modification to the ingress gateway. We then explain how the
   resulting downstream pre-congestion metric in packets can be
   monitored in bulk at borders to sufficiently emulate flow rate
   policing.

   We claim the result of combining these two approaches is an admission
   control system that scales to any size network _and_ any number of
   interconnected networks, even if they all act in their own interests.

   This proposal aims to convince its readers to "Design in Security
   from the start," by ensuring the PCN wire protocol encoding can
   accommodate the extended set of codepoints defined in this document,
   even if border policing is not needed at first. This way, we will
   not build ourselves tomorrow's legacy problem.

   Re-echoing congestion feedback is based on a principled technique
   called Re-ECN [Re-TCP], designed to add accountability for causing
   congestion to the general-purpose IP datagram service. Re-ECN
   proposes to consume the last completely unused bit in the basic IPv4
   header.

12.  Acknowledgements

   All the following have given helpful comments and some may become
   co-authors of later drafts: Arnaud Jacquet, Alessandro Salvatori,
   Steve Rudkin, David Songhurst, John Davey, Ian Self, Anthony
   Sheppard, Carla Di Cairano-Gilfedder (BT), Mark Handley (who
   identified the excess canceled packets attack), Stephen Hailes, Adam
   Greenhalgh (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef
   Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill
   Lehr, Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy
   traffic attacks), Sally Floyd (ICIR) and comments from participants
   in the CFP/CRN Inter-Provider QoS, Broadband and DoS-Resistant
   Internet working groups.

13.  Comments Solicited

   Comments and questions are encouraged and very welcome. They can be
   addressed to the IETF Transport Area working group's mailing list,
   and/or to the authors.

14.  References

14.1.  Normative References

   [PCN]      Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Liatsos, V., Babiarz, J., Chan, K., Dudley,
              S., Westberg, L., Bader, A., and G. Karagiannis,
              "Pre-Congestion Notification Marking",
              draft-briscoe-tsvwg-cl-phb-03 (work in progress),
              October 2006.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
              Network Element Service", RFC 2211, September 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
              J., Courtney, W., Davari, S., Firoiu, V., and D.
              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
              Behavior)", RFC 3246, March 2002.

   [RSVP-ECN]
              Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
              Babiarz, J., and K. Chan, "RSVP Extensions for Admission
              Control over Diffserv using Pre-congestion Notification",
              draft-lefaucheur-rsvp-ecn-01 (work in progress),
              June 2006.

   [Re-TCP]   Briscoe, B., Jacquet, A., Salvatori, A., and M. Koyabe,
              "Re-ECN: Adding Accountability for Causing Congestion to
              TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-04 (work in
              progress), June 2007.

14.2.  Informative References

   [CLoop_pol]
              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
              Torino and Institut Eurecom Masters Thesis,
              September 2005.

   [ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
              Routeing", Proc Internet Charging and QoS Technology
              Workshop (ICQT'03) pp308--317, September 2003.

   [ECN-MPLS]
              Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
              Marking in MPLS", draft-ietf-tsvwg-ecn-mpls-01 (work in
              progress), June 2007.

   [IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
              Quality of Service Interconnect", BT Technology Journal
              (BTTJ) 23(2)171--195, April 2005.

   [NSIS-RMD]
              Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and
              T. Phelan, "RMD-QOSM - The Resource Management in Diffserv
              QOS Model", draft-ietf-nsis-rmd-09 (work in progress),
              March 2007.

   [PCN-arch]
              Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R.,
              Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion
              Notification Architecture",
              draft-eardley-pcn-architecture-00 (work in progress),
              June 2007.

   [RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
              Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
              Functional Specification", RFC 2205, September 1997.

   [RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
              Data Flows", RFC 2207, September 1997.

   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
              ReSerVation Protocol (RSVP) Version 1 Applicability
              Statement Some Guidelines on Deployment", RFC 2208,
              September 1997.

   [RFC2747]  Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic
              Authentication", RFC 2747, January 2000.

   [RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
              Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
              Felstaine, "A Framework for Integrated Services Operation
              over Diffserv Networks", RFC 2998, November 2000.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, June 2003.

   [RFC4727]  Fenner, B., "Experimental Values In IPv4, IPv6, ICMPv4,
              ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
              Congestion Response in an Internetwork Using Re-Feedback",
              ACM SIGCOMM CCR 35(4)277--288, August 2005.

   [Smart_rtg]
              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
              "Optimizing Cost and Performance for Multihoming", ACM
              SIGCOMM CCR 34(4)79--92, October 2004.

   [Steps_DoS]
              Handley, M. and A. Greenhalgh, "Steps towards a DoS-
              resistant Internet Architecture", Proc. ACM SIGCOMM
              workshop on Future directions in network architecture
              (FDNA'04) pp 49--56, August 2004.

Appendix A.  Implementation

A.1.  Ingress Gateway Algorithm for Blanking the RE flag

   The ingress gateway receives regular feedback reporting the fraction
   of congestion marked octets for each aggregate arriving at the
   egress. So for each aggregate it should blank the RE flag on the
   same fraction of octets.
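
   One direct way to blank the RE flag on a given fraction of octets is
   to keep a running credit of octets still owed as blanked. The Python
   sketch below illustrates this. It is an illustration under stated
   assumptions rather than the algorithm this appendix goes on to
   specify, and the names REBlanker, congestion_fraction and credit are
   hypothetical.

```python
class REBlanker:
    """Hypothetical per-aggregate state at the ingress gateway."""

    def __init__(self, congestion_fraction):
        self.credit = 0.0   # octets still owed with the RE flag blanked
        self.update(congestion_fraction)

    def update(self, congestion_fraction):
        """Store the latest congestion fraction fed back by the egress."""
        if not 0.0 <= congestion_fraction <= 1.0:
            raise ValueError("fraction must lie in [0, 1]")
        self.fraction = congestion_fraction

    def re_flag(self, length):
        """Return the RE flag (1 = set, 0 = blanked) for a packet of
        `length` octets."""
        self.credit += self.fraction * length
        if self.credit >= length:
            self.credit -= length   # this packet pays off `length` octets
            return 0
        return 1


blanker = REBlanker(0.25)
flags = [blanker.re_flag(100) for _ in range(8)]
print(flags)   # [1, 1, 1, 0, 1, 1, 1, 0]
```

   With equal-sized packets and a fraction of 0.25, this blanks every
   fourth packet, at the cost of one multiplication per packet.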
   It is more efficient to calculate the reciprocal of this fraction
   when the signalling arrives, Z_0 = (1 / Congestion-Level-Estimate).
   Z_0 is then the total number of octets the ingress should send for
   each octet it sends with the RE flag blanked. Z_0 will also take
   account of the sustainable rate reported during the flow pre-emption
   process, if necessary.

   A suitable pseudo-code algorithm for the ingress gateway is as
   follows:

   ====================================================================
   B_i = 0                       /* interblank volume                 */
   for each PCN-capable packet {
       b = readLength(packet)    /* set b to packet size              */
       B_i += b                  /* accumulate interblank volume      */
       if B_i < b * Z_0 {        /* test whether interblank volume... */
           writeRE(1)
       } else {                  /* ...exceeds blank RE spacing * pkt size */
           writeRE(0)            /* ...and if so, clear RE            */
           B_i = 0               /* ...and re-set interblank volume   */
       }
   }
   ====================================================================

A.2.  Downstream Congestion Metering Algorithms

A.2.1.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream pre-congestion in traffic
   crossing an inter-domain border, an algorithm is needed that
   accumulates the size of positive packets and subtracts the size of
   negative packets. We maintain two counters:

      V_b: accumulated pre-congestion volume

      B: total data volume (in case it is needed)

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   V_b = 0
   B   = 0
   for each PCN-capable packet {
       b = readLength(packet)    /* set b to packet size              */
       B += b                    /* accumulate total volume           */
       if readEECN(packet) == (Re-Echo || FNE) {
           V_b += b              /* increment...                      */
       } elseif readEECN(packet) == ( AM(-1) || PM(-1) ) {
           V_b -= b              /* ...or decrement V_b...            */
       }                         /* ...depending on EECN field        */
   }
   ====================================================================

   At the end of an accounting period this counter V_b represents the
   pre-congestion volume that penalties could be applied to, as
   described in Section 5.3.

   For instance, accumulated volume of pre-congestion through a border
   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream pre-congestion
   level of 1% on an accumulated total data volume of B = 500PB.

A.2.2.  Inflation Factor for Persistently Negative Flows

   The following process is suggested to complement the simple algorithm
   above in order to protect against the various attacks from
   persistently negative flows described in Section 5.6.1. As explained
   in that section, the most important and first step is to estimate the
   contribution of persistently negative flows to the bulk volume of
   downstream pre-congestion and to inflate this bulk volume as if these
   flows weren't there. The process below has been designed to give an
   unbiased estimate, but it may be possible to define other processes
   that achieve similar ends.

   While the above simple metering algorithm is counting the bulk of
   traffic over an accounting period, the meter should also select a
   subset of the whole flow ID space that is small enough to be able to
   realistically measure but large enough to give a realistic sample.
   Many different samples of different subsets of the ID space should be
   taken at different times during the accounting period, preferably
   covering the whole ID space. During each sample, the meter should
   count the volume of positive packets and subtract the volume of
   negative, maintaining a separate account for each flow in the sample.
   Each sample should run a lot longer than the large majority of
   flows, to avoid a bias from missing the starts and ends of flows,
   which tend to be positive and negative respectively.

   Once the accounting period finishes, the meter should calculate the
   total of the accounts V_{bI} for the subset of flows I in the sample,
   and the total of the accounts V_{fI} excluding flows with a negative
   account from the subset I. Then the weighted mean of all these
   samples should be taken:

      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}.

   If V_b is the result of the bulk accounting algorithm over the
   accounting period (Appendix A.2.1) it can be inflated by this factor
   a_S to get a good unbiased estimate, a_S.V_b, of the volume of
   downstream congestion over the accounting period, without being
   polluted by the effect of persistently negative flows.

A.3.  Algorithm for Sanctioning Negative Traffic

   {ToDo: Write up algorithms similar to Appendix D of [Re-TCP] for the
   negative flow monitor with flow management algorithm and the variant
   with bounded flow state.}

Author's Address

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgments

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA). This document was produced
   using xml2rfc v1.32 (of http://xml.resource.org/) from a source in
   RFC-2629 XML format.