PCN Working Group                                             B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Standards Track                      September 13, 2008
Expires: March 17, 2009

        Emulating Border Flow Policing using Re-PCN on Bulk Data
                  draft-briscoe-re-pcn-border-cheat-02

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 17, 2009.

Abstract

   Scaling per-flow admission control to the Internet is a hard
   problem.  The approach of combining Diffserv and pre-congestion
   notification (PCN) provides a service slightly better than Intserv
   controlled load that scales to networks of any size without needing
   Diffserv's usual overprovisioning, but only if domains trust each
   other to comply with admission control and rate policing.  This
   memo claims to solve this trust problem without losing scalability.
   It provides a sufficient emulation of per-flow policing at borders,
   but with only passive bulk metering rather than per-flow
   processing.  The measurements are sufficient to apply penalties
   against cheating neighbour networks.
Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol in IP with Two Congestion Marking Levels
     4.1.  Protocol Overview
     4.2.  Re-PCN Abstracted Network Layer Wire Protocol (IPv4 or v6)
       4.2.1.  Re-ECN Recap
       4.2.2.  Re-ECN Combined with Pre-Congestion Notification
               (re-PCN)
     4.3.  Protocol Operation
       4.3.1.  Protocol Operation for an Established Flow
       4.3.2.  Aggregate Bootstrap
       4.3.3.  Flow Bootstrap
       4.3.4.  Router Forwarding Behaviour
       4.3.5.  Extensions
   5.  Emulating Border Policing with Re-ECN
     5.1.  Informal Terminology
     5.2.  Policing Overview
     5.3.  Pre-requisite Contractual Arrangements
     5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits
     5.5.  Sanctioning Dishonest Marking
     5.6.  Border Mechanisms
       5.6.1.  Border Accounting Mechanisms
       5.6.2.  Competitive Routing
       5.6.3.  Fail-safes
   6.  Analysis
   7.  Incremental Deployment
   8.  Design Choices and Rationale
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
     A.2.  Downstream Congestion Metering Algorithms
       A.2.1.  Bulk Downstream Congestion Metering Algorithm
       A.2.2.  Inflation Factor for Persistently Negative Flows
     A.3.  Algorithm for Sanctioning Negative Traffic

   Author's Address
   Intellectual Property and Copyright Statements

Status (to be removed by the RFC Editor)

   The IETF PCN working group is initially chartered to consider PCN
   domains only under a single trust authority.  However, after its
   initial work is complete the charter says the working group may
   re-charter to consider concatenated Diffserv domains, amongst other
   new work items.  The charter ends by stating "The details of these
   work items are outside the scope of the initial phase; but the WG
   may consider their requirements to design components that are
   sufficiently general to support such extensions in the future."
   This memo is therefore contributed to describe how PCN could be
   extended to inter-domain use.  We wanted to document the solution
   to reduce the chances that something else eats up the codepoint
   space needed before PCN re-charters to consider inter-domain
   issues.  Losing the chance to standardise this simple, scalable
   solution to the problem of inter-domain flow admission control
   would be unfortunate (an understatement), given that it took years
   to find, and even then it was very difficult to find codepoint
   space for it.

   The scheme described here (Section 4) requires the PCN ingress
   gateway to re-echo any PCN feedback it receives back into the
   forward stream of IP packets (hence we call this scheme re-PCN).
   Re-PCN works in a very similar way to the re-ECN proposal on which
   it is based [I-D.briscoe-tsvwg-re-ecn-tcp], the only difference
   being that PCN might encode three states of congestion, whereas
   ECN encodes two.  This document is written to stand alone from
   re-ECN, so that readers do not have to read
   [I-D.briscoe-tsvwg-re-ecn-tcp].

   The authors seek comments from the Internet community on whether
   combining PCN and re-ECN to create re-PCN in this way is a
   sufficient solution to the problem of scaling microflow admission
   control to the Internet as a whole.  Here we emphasise that
   scaling is not just a matter of the number of flows, but also of
   the number of security entities -- networks and users -- who may
   all have conflicting interests.

   This memo is posted as an Internet-Draft with the intent that it
   eventually be broken down into two documents: one for the
   standards track and one for informational status.  But until it
   becomes an item of IETF working group business, the whole proposal
   has been kept together to aid understanding.  Only the text of
   Section 4 of this document is intended to be normative (requiring
   standardisation).
   The rest of the sections are merely informative, describing how a
   system might be built from these protocols by the operators of an
   internetwork.  Note in particular that the policing and monitoring
   functions proposed for the trust boundaries between operators
   would not need standardisation by the IETF.  They simply represent
   one possible way that the proposed protocols could be used to
   extend the PCN architecture [I-D.ietf-pcn-architecture] to span
   multiple domains without mutual trust between the operators.

Dependencies (to be removed by the RFC Editor)

   To realise the system described, this document also depends on
   other documents chartered in the IETF Transport Area progressing
   along the standards track:

   o  Pre-congestion notification (PCN) marking on interior nodes
      [I-D.eardley-pcn-marking-behaviour], chartered for
      standardisation in the PCN w-g;

   o  The baseline encoding of pre-congestion notification in the IP
      header [I-D.moncaster-pcn-baseline-encoding], also chartered
      for standardisation in the PCN w-g;

   o  Feedback of aggregate PCN measurements by suitably extending
      the admission control signalling protocol (e.g. the RSVP
      extension [RSVP-ECN] or the NSIS extension
      [I-D.arumaithurai-nsis-pcn]).

   The baseline encoding makes no new demands on codepoint space in
   the IP header but provides just two PCN encoding states (not
   marked and marked).  The PCN architecture recognises that
   operators might want PCN marking to trigger two functions
   (admission control and flow termination) at different levels of
   pre-congestion, which seems to require three encoding states.  A
   scheme has been proposed [I-D.charny-pcn-single-marking] that can
   do both functions with just two encoding states, but simulations
   have shown it performs poorly under certain conditions that might
   be typical.
   As it seems likely that PCN might need three encoding states to be
   fully operational, we want to be sure that three encoding states
   can be extended to work inter-domain.  Therefore, we have defined
   a three-state extension encoding scheme in this document, then
   added the re-PCN scheme to it.  The three-state encoding we have
   chosen depends on standardisation of yet another document in the
   IETF Transport Area:

   o  Propagation beyond the tunnel decapsulator of any changes in
      the ECN field to ECT(0) or ECT(1) made within a tunnel (the
      ideal decapsulation rules of [I-D.briscoe-tsvwg-ecn-tunnel]).

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs of incremental changes between drafts are available at
   URL:

   Changes from to (current version):

      Considerably updated the 'Status' note to explain the
      relationship of this draft to other documents in the IETF
      process (or not) and to chartered PCN w-g activity.

      Split out the dependencies into a separate note and added
      dependencies on new PCN documents in progress.

      Made the scalability motivation in the introduction clearer,
      explaining why Diffserv over-provisioning doesn't scale unless
      PCN is used.

      Clarified that the standards action in Section 4 is to define
      the meanings of the combination of fields in the IP header:
      the RE flag and 2-level congestion marking in the ECN field;
      and that it is not characterised by a particular feedback
      style in the transport.

      Switched round the two ECT codepoints to be compatible with
      the new PCN baseline encoding and used less confusing naming
      for re-PCN codepoints (Section 4).

      Generalised rules for encoding probes when bootstrapping or
      re-starting aggregates and flows (Section 4.3.2).

      Downgraded drop sanction behaviour from MUST to conditional
      SHOULD (Section 5.5).
      Added incremental deployment safety justification for the
      choice of which way round the RE flag works (Section 7).

      Added a possible vulnerability to brief attacks, and a
      possible solution, to the security considerations (Section 9).

      Updated references and terminology, particularly taking
      account of recent new PCN w-g documents.

      Replaced the suggested Ingress Gateway Algorithm for Blanking
      the RE flag (Appendix A.1).

      Clarifications throughout.

   Changes from to :

      Updated references.

   Changes from to :

      Changed the filename to associate it with the new IETF PCN
      w-g, rather than the TSVWG w-g.

      Introduction: Clarified that bulk policing only replaces
      per-flow policing at interior inter-domain borders, while
      per-flow policing is still needed at the access interface to
      the internetwork.  Also clarified that the aim is to
      neutralise any gains from cheating using local bilateral
      contracts between neighbouring networks, rather than merely
      to identify remote cheaters.

      Section 3.1: Described the traditional per-flow policing
      problem with inter-domain reservations more precisely,
      particularly with respect to the direction of reservations
      and of traffic flows.

      Clarified the status of Section 5 onwards, in particular that
      policers and monitors would not need standardisation, but
      that the protocol in Section 4 would require standardisation.

      Section 5.6.2 on competitive routing: Added discussion of
      direct incentives for a receiver to switch to a different
      provider even if the provider has a termination monopoly.

      Clarified that "Designing in security from the start" merely
      means allowing codepoint space in the PCN protocol encoding.
      There is no need to actually implement inter-domain security
      mechanisms for solutions confined to a single domain.
      Updated some references and added a reference to the Security
      Considerations, as well as other minor corrections and
      improvements.

   Changes from to :

      Added a subsection on Border Accounting Mechanisms
      (Section 5.6.1).

      Section 4.2 on the re-ECN wire protocol clarified and
      re-organised to separately discuss re-ECN for default ECN
      marking and for pre-congestion marking (PCN).

      Router Forwarding Behaviour subsection added to the
      re-organised section on Protocol Operation (Section 4.3).
      Extensions section moved within Protocol Operation.

      Emulating Border Policing (Section 5) reorganised, starting
      with a new Terminology subsection heading and a simplified
      overview section.  Added a large new subsection on Border
      Accounting Mechanisms within a new section bringing together
      other subsections on Border Mechanisms generally
      (Section 5.6).  Some text moved from old subsections into
      these new ones.

      Added a section on Incremental Deployment (Section 7),
      drawing together relevant points about deployment made
      throughout.

      Sections on Design Rationale (Section 8) and Security
      Considerations (Section 9) expanded with some new material,
      including new attacks and their defences.

      Suggested Border Metering Algorithms improved (Appendix A.2)
      for resilience to newly identified attacks.

1.  Introduction

   The Internet community largely lost interest in the Intserv
   architecture after it was clarified that it would be unlikely to
   scale to the whole Internet [RFC2208].  Although Intserv
   mechanisms proved impractical, the bandwidth reservation service
   they aimed to offer is still very much required.

   A recently proposed approach [I-D.ietf-pcn-architecture] combines
   Diffserv and pre-congestion notification (PCN) to provide a
   service slightly better than Intserv controlled load [RFC2211].
   PCN does not require the considerable over-provisioning that is
   normally needed for admission control over Diffserv [RFC2998] to
   be robust against re-routes or variation in the traffic matrix.
   It has been proved that Diffserv's over-provisioning requirement
   grows linearly with the network diameter in hops [QoS_scale].

   A number of PCN domains can be concatenated into a larger PCN
   region without any per-flow processing between them, but only if
   each domain trusts the ingress network to have checked that
   upstream customers aren't taking more bandwidth than they
   reserved, either accidentally or deliberately.  Unfortunately,
   networks can gain considerably by breaking this trust.  One way
   for a network to protect itself against others is to handle flow
   signalling at its own border and police traffic against
   reservations itself.  However, this reintroduces the per-flow
   unscalability at borders that Intserv over Diffserv suffers from.

   This memo describes a protocol called re-PCN that enables bulk
   border measurements so that one network can protect its
   interests, even if the networks around it are deliberately trying
   to cheat.  The approach provides a sufficient emulation of flow
   rate policing at trust boundaries but without per-flow
   processing.  Per-flow rate policing for each reservation is still
   expected to be used at the access edge of the internetwork, but
   at the borders between networks bulk policing can be used to
   emulate per-flow policing.  The emulation is not perfect, but it
   is sufficient to ensure that the punishment is at least
   proportionate to the severity of the cheat.  Re-PCN requires
   neither the unscalable over-provisioning of Diffserv nor the
   per-flow processing at borders of Intserv over Diffserv.
   It should therefore scale controlled load service to the whole
   internetwork without the cost of Diffserv's linearly increasing
   over-provisioning, or the cost of per-flow policing at each
   border.  To achieve such scaling, this memo combines two recent
   proposals, both of which it briefly recaps:

   o  The pre-congestion notification (PCN) architecture
      [I-D.ietf-pcn-architecture] describes how bulk pre-congestion
      notification on routers within an edge-to-edge Diffserv region
      can emulate the precision of per-flow admission control to
      provide controlled load service without unscalable per-flow
      processing;

   o  Re-ECN: Adding Accountability to TCP/IP
      [I-D.briscoe-tsvwg-re-ecn-tcp].

   We coin the term re-PCN for the combination of PCN and re-ECN.

   The trick that addresses cheating at borders is to recognise that
   border policing is mainly necessary because cheating upstream
   networks will admit traffic when they shouldn't only as long as
   they don't directly experience the downstream congestion their
   misbehaviour can cause.  The re-ECN protocol ensures that a
   network can be made to experience the congestion it causes in
   other networks.  Re-ECN requires the sending node to declare
   expected downstream congestion in all packets, and it makes it in
   the sender's interest to declare this honestly.  At the border
   between upstream network 'A' and downstream network 'B' (say),
   both networks can monitor packets crossing the border to measure
   how much congestion 'A' is causing in 'B' and beyond.  'B' can
   then include a limit or penalty based on this metric in its
   contract with 'A'.  This is how 'A' experiences the effect of the
   congestion it causes in other networks.  'A' no longer gains by
   admitting traffic when it shouldn't, which is why we can say
   re-PCN emulates flow policing, even though it doesn't measure
   flows.
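   The border measurement just described needs no per-flow state:
   each neighbour simply counts marked bytes in bulk.  The following
   is a minimal sketch of the idea (in the spirit of the bulk
   downstream congestion metering algorithm of Appendix A.2.1); the
   marking names and class are illustrative placeholders, not the
   draft's wire encoding:

```python
# Bulk meter for one direction of a border.  Marking names are
# hypothetical labels for this sketch, not re-PCN codepoints.
POSITIVE = "re-echo"     # sender's declaration of expected congestion
NEGATIVE = "pcn-marked"  # congestion already experienced upstream

class BorderMeter:
    """Passively counts bytes per marking; keeps no per-flow state."""

    def __init__(self):
        self.positive = 0
        self.negative = 0
        self.total = 0

    def observe(self, size_bytes, marking):
        self.total += size_bytes
        if marking == POSITIVE:
            self.positive += size_bytes
        elif marking == NEGATIVE:
            self.negative += size_bytes

    def downstream_congestion(self):
        """Declared congestion minus congestion already seen upstream
        approximates the fraction expected downstream of this border."""
        if self.total == 0:
            return 0.0
        return (self.positive - self.negative) / self.total

# 'B' meters traffic arriving from 'A': if senders behind 'A' declare
# 2% congestion and 0.5% of bytes were already marked upstream, 'A'
# is accountable for roughly 1.5% congestion beyond the border.
meter = BorderMeter()
for _ in range(4):
    meter.observe(1500, POSITIVE)    # 4 of 200 packets re-echoed
meter.observe(1500, NEGATIVE)        # 1 of 200 already marked
for _ in range(195):
    meter.observe(1500, "not-marked")
```

   'B' would feed this single number, accumulated over an accounting
   period, into the limit or penalty clause of its contract with 'A'.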
   The aim is not to enable a network to _identify_ some remote
   cheating party, which would rarely be useful, given that the
   victim network would be unlikely to be able to seek redress from a
   cheater in some remote part of the world with whom no direct
   contractual relationship exists.  Rather, the aim is to ensure
   that any gain from cheating will be cancelled out by penalties
   applied to the cheating party by its local network.  Further, the
   solution ensures that each of the chain of networks between the
   cheater and the victim will lose out if it doesn't apply penalties
   to its neighbour.  Thus the solution builds on the local bilateral
   contractual relationships that already exist between neighbouring
   networks.

   Rather than the end-to-end arrangement used when re-ECN was
   specified for the TCP transport [I-D.briscoe-tsvwg-re-ecn-tcp],
   this memo specifies re-ECN in an edge-to-edge arrangement, making
   it applicable to deployment models where admission control over
   Diffserv is based on pre-congestion notification.  Also, rather
   than using a TCP transport for regular congestion feedback, this
   memo specifies re-ECN using RSVP as the transport for feedback
   [RSVP-ECN].  RSVP is used to be concrete, but a similar deployment
   model with a different transport for signalling congestion
   feedback could be used (e.g. Arumaithurai
   [I-D.arumaithurai-nsis-pcn] and RMD [I-D.ietf-nsis-rmd] both use
   NSIS).

   This memo aims to do two things: i) define how to apply the re-PCN
   protocol to the admission control over Diffserv scenario; and ii)
   explain why re-PCN sufficiently emulates border policing in that
   scenario.  Most of the memo is taken up with the second aim:
   explaining why it works.  Applying re-PCN to the scenario actually
   involves quite a trivial modification to the ingress gateway.
   That modification can be added to gateways later, so our immediate
   goal is to convince everyone to have the foresight to define the
   PCN wire protocol encoding to accommodate the extended codepoints
   defined in this document, whether or not first deployments require
   border policing.  Otherwise, when we want to add policing, we will
   have built ourselves a legacy problem.  In other words, we aim to
   convince people to "Design in security from the start."

   The body of this memo is structured as follows:

      Section 3 describes the border policing problem.  We recap the
      traditional, unscalable view of how to solve the problem, and
      we recap the admission control solution that has the
      scalability we do not want to lose when we add border policing;

      Section 4 specifies the re-PCN protocol solution in detail;

      Section 5 explains how to use the protocol to emulate border
      policing, and why it works;

      Section 6 analyses the security of the proposed solution;

      Section 8 explains the sometimes subtle rationale behind our
      design decisions;

      Section 9 comments on the overall robustness of the security
      assumptions and lists specific security issues.

   It must be emphasised that we are not evangelical about removing
   per-flow processing from borders.  Network operators may choose to
   do per-flow processing at their borders for their own reasons,
   such as to support business models that require per-flow
   accounting.  Our aim is to show that per-flow processing at
   borders is no longer _necessary_ in order to provide end-to-end
   QoS using flow admission control.  Indeed, we are absolutely
   opposed to standardisation of technology that embeds particular
   business models into the Internet.  Our aim is merely to provide a
   new useful metric (downstream congestion) at trust boundaries.
   Given the well-known significance of congestion in economics,
   operators can then use this new metric in their interconnection
   contracts if they choose.  This will enable competitive evolution
   of new business models (for examples see [IXQoS]), even for sets
   of flows running alongside another set across the same border but
   using the more traditional model that depends on more costly
   per-flow processing at each border.

2.  Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in [RFC2119].

3.  The Problem

3.1.  The Traditional Per-flow Policing Problem

   If we claim to be able to emulate per-flow policing with bulk
   policing at trust boundaries, we need to know exactly what we are
   emulating.  So we will start from the traditional scenario with
   per-flow policing at trust boundaries, to explain why it has
   always been considered necessary.

   To be able to take advantage of a reservation-based service such
   as controlled load, a source-destination pair must reserve
   resources using a signalling protocol such as RSVP [RFC2205].  An
   RSVP signalling request refers to a flow of packets by its flow ID
   tuple (the filter spec [RFC2205]), or by its security parameter
   index (SPI) [RFC2207] if port numbers are hidden by IPsec
   encryption.  Other signalling protocols use similar flow
   identifiers.  But it is insufficient merely to authorise and admit
   a flow based on its identifiers, for instance merely opening a
   pin-hole for packets with identifiers that match an admitted flow
   ID, because once a flow is admitted, it cannot necessarily be
   trusted to send packets within the rate profile it requested.

   The packet rate must also be policed to keep the flow within the
   requested flow spec [RFC2205].
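   Such per-flow rate policing is classically a token bucket checked
   against each reservation's traffic parameters.  A minimal sketch
   follows; the rate and depth parameters loosely correspond to the
   token bucket of an RSVP TSpec, but the names and numbers are ours,
   invented for illustration:

```python
# Per-flow token-bucket policer of the kind a traditional border
# would keep for each admitted reservation.  Parameter names are
# illustrative, not RSVP field names.
class TokenBucketPolicer:
    def __init__(self, rate_bytes_per_s, depth_bytes):
        self.rate = rate_bytes_per_s   # token refill rate
        self.depth = depth_bytes       # maximum burst size
        self.tokens = depth_bytes      # start with a full bucket
        self.last_arrival = 0.0

    def conforms(self, now_s, pkt_bytes):
        """True if the packet fits the profile; a real policer would
        drop or re-mark non-conformant packets."""
        elapsed = now_s - self.last_arrival
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last_arrival = now_s
        if pkt_bytes <= self.tokens:
            self.tokens -= pkt_bytes
            return True
        return False

# An 8 kb/s reservation (1000 bytes/s) with a 2000-byte bucket admits
# an occasional packet but rejects back-to-back large packets.
policer = TokenBucketPolicer(rate_bytes_per_s=1000, depth_bytes=2000)
```

   The cost that the rest of this section is concerned with is
   precisely that a border router must hold and update one such
   bucket, plus the associated flow classifier state, for every
   admitted flow crossing it.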
   For instance, without data rate policing, a source-destination
   pair could reserve resources for an 8 kbps audio flow but the
   source could transmit a 6 Mbps video (theft of service).  More
   subtly, the sender could generate bursts that were outside the
   requested profile.

   In traditional architectures, per-flow packet rate policing is
   expensive and unscalable but, without it, a network is vulnerable
   to such theft of service (whether malicious or accidental).
   Perhaps more importantly, if flows are allowed to send more data
   than they were permitted, the ability of admission control to give
   assurances to other flows will break.

   Just as sources cannot be trusted to keep within the requested
   flow spec, whole networks might also try to cheat.  We will now
   set up a concrete scenario to illustrate such cheats.  Imagine
   reservations for unidirectional flows through at least two
   networks: an edge network and its downstream transit provider.
   Imagine the edge network charges its retail customers per
   reservation but also has to pay its transit provider a charge per
   reservation.  Typically, the charges both for buying from the
   transit provider and for selling to the retail customer might
   depend on the duration and rate of each reservation.  The levels
   of the actual selling and buying prices are irrelevant to our
   discussion (most likely the network will sell at a higher price
   than it buys, of course).

   A cheating ingress network could systematically reduce the size of
   its retail customers' reservation signalling requests (e.g. the
   SENDER_TSPEC object in RSVP's PATH message) before forwarding them
   to its transit provider, and systematically reinstate the
   responses on the way back (e.g. the FLOWSPEC object in RSVP's RESV
   message).  It would then receive an honest income from its
   upstream retail customer but only pay for fraudulently smaller
   reservations downstream.
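   To see the scale of the incentive, here is a toy calculation of
   the gain from the TSPEC-reduction cheat just described.  All
   tariffs and rates are invented for the example, assuming simple
   rate x duration pricing at both interfaces:

```python
# Hypothetical illustration of the TSPEC-reduction cheat's gain.
# Prices, rates and the linear tariff are assumptions of this
# sketch, not figures from any real interconnect contract.
def reservation_charge(rate_kbps, duration_s, price_per_kbps_s):
    return rate_kbps * duration_s * price_per_kbps_s

honest_tspec_kbps = 64    # what the retail customer reserved and pays for
reduced_tspec_kbps = 16   # what the cheat signals to its transit provider
duration_s = 300

retail_income = reservation_charge(honest_tspec_kbps, duration_s, 0.002)
wholesale_cost = reservation_charge(reduced_tspec_kbps, duration_s, 0.001)
honest_wholesale_cost = reservation_charge(honest_tspec_kbps, duration_s,
                                           0.001)

# The cheat pockets the wholesale charge avoided on the difference
# between the honest and the reduced TSPEC, on every reservation.
extra_gain = honest_wholesale_cost - wholesale_cost
```

   The customer still receives only the reduced downstream
   reservation, so the cheat is stealing either from its customer or,
   if the traffic is sent anyway, from the transit provider's other
   flows.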
   A similar but opposite trick (increasing the TSPEC and decreasing
   the FLOWSPEC) could be perpetrated by the receiver's access
   network if the reservation was paid for by the receiver.

   Equivalently, a cheating ingress network may feed the traffic from
   a number of flows into an aggregate reservation over the transit
   network that is smaller than the total of all the flows.  Because
   of these fraud possibilities, in traditional QoS reservation
   architectures the downstream network polices traffic at each
   border.  The policer checks that the actual sent data rate of each
   flow is within the signalled reservation.

   Reservation signalling could be authenticated end to end, but this
   wouldn't prevent the aggregation cheat just described.  For this
   reason, and to avoid the need for a global PKI, signalling
   integrity is typically only protected on a hop-by-hop basis
   [RFC2747].

   A variant of the above cheat is where a router in an honest
   downstream network denies admission to a new reservation, but a
   cheating upstream network still admits the flow.  For instance,
   the networks may be using Diffserv internally, but Intserv
   admission control at their borders [RFC2998].  The cheat would
   only work if they were using bulk Diffserv traffic policing at
   their borders, perhaps to avoid the cost and complexity of Intserv
   border policing.  As far as the cheating upstream network is
   concerned, it gets the revenue from the reservation, but it
   doesn't have to pay any downstream wholesale charges, and the
   congestion is in someone else's network.  The cheating network may
   calculate that most of the flows affected by congestion in the
   downstream network aren't likely to be its own.  It may also
   calculate that the downstream router has been configured to deny
   admission to new flows in order to protect bandwidth assigned to
   other network services (e.g. enterprise VPNs).
554 So the cheating network can steal capacity from the downstream 555 operator's VPNs that are probably not actually congested. 557 All the above cheats are framed in the context of RSVP's receiver 558 confirmed reservation model, but similar cheats are possible with 559 sender-initiated and other models. 561 To summarise, in traditional reservation signalling architectures, if 562 a network cannot trust a neighbouring upstream network to rate-police 563 each reservation, it has to check for itself that the data rate fits 564 within each of the reservations it has admitted. 566 3.2. Generic Scenario 568 We will now describe a generic internetworking scenario that we will 569 use to describe and to test our bulk policing proposal. It consists 570 of a number of networks and endpoints that do not fully trust each 571 other to behave. In Section 6 we will tie down exactly what we mean 572 by partial trust, and we will consider the various combinations where 573 some networks do not trust each other and others are colluding 574 together. 
576 _ ___ _____________________________________ ___ _ 577 | | | | _|__ ______ ______ ______ _|__ | | | | 578 | | | | | | | | | | | | | | | | | | 579 | | | | | | |Inter-| |Inter-| |Inter-| | | | | | | 580 | | | | | | | ior | | ior | | ior | | | | | | | 581 | | | | | | |Domain| |Domain| |Domain| | | | | | | 582 | | | | | | | A | | B | | C | | | | | | | 583 | | | | | | | | | | | | | | | | | | 584 | | | | +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ | | | | 585 | | | | | | |B| |B| |B| |B| |B| |B| | | | |\ | | 586 | |==| |==|Ingr|==|R| |R|==|R| |R|==|R| |R|==|Egr |==| |=>| | 587 | | | | |G/W | | | | | | | | | | | | | |G/W | | |/ | | 588 | | | | +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ | | | | 589 | | | | | | | | | | | | | | | | | | 590 | | | | |____| |______| |______| |______| |____| | | | | 591 |_| |___| |_____________________________________| |___| |_| 593 Sx Ingress Diffserv region Egress Rx 594 End Access Access End 595 Host Network Network Host 596 <-------- edge-to-edge signalling -------> 597 (for admission control) 599 <-------------------end-to-end QoS signalling protocol-------------> 601 Figure 1: Generic Scenario (see text for explanation of terms) 603 An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1) 604 connect the interior Diffserv region to the edge access networks 605 where routers (not shown) use per-flow reservation processing. 606 Within the Diffserv region are three interior domains, 'A', 'B' and 607 'C', as well as the inward facing interfaces of the ingress and 608 egress gateways. An ingress and egress border router (BR) is shown 609 interconnecting each interior domain with the next. There will 610 typically be other interior routers (not shown) within each interior 611 domain. 613 In two paragraphs we now briefly recap how pre-congestion 614 notification is intended to be used to control flow admission to a 615 large Diffserv region. 
The first paragraph describes data plane 616 functions and the second describes signalling in the control plane. 617 We omit many details from [I-D.ietf-pcn-architecture] including 618 behaviour during routing changes. For brevity here we assume other 619 flows are already in progress across a path through the Diffserv 620 region before a new one arrives, but how bootstrap works is described 621 in Section 4.3.2. 623 Figure 1 shows a single simplex reserved flow from the sending (Sx) 624 end host to the receiving (Rx) end host. The ingress gateway polices 625 incoming traffic and colours conforming traffic within an admitted 626 reservation to a combination of Diffserv codepoint and ECN field that 627 defines the traffic as 'PCN-enabled'. This redefines the meaning of 628 the ECN field as a PCN field, which is largely the same as ECN 629 [RFC3168], but with slightly different semantics defined in 630 [I-D.moncaster-pcn-baseline-encoding] (or various extensions that are 631 currently experimental). The Diffserv region is called a PCN-region 632 because all the queues within it are PCN-enabled. This means the 633 per-hop behaviour they apply to PCN-enabled traffic consists of both 634 a scheduling behaviour and a new ECN marking behaviour that we call 635 `pre-congestion notification' [I-D.eardley-pcn-marking-behaviour]. A 636 PCN-enabled queue typically re-uses the definition of expedited 637 forwarding (EF) [RFC3246] for its scheduling behaviour. The new 638 congestion marking behaviour sets the PCN field of an increasing 639 proportion of PCN packets to the PCN-marked (PM) codepoint 640 [I-D.moncaster-pcn-baseline-encoding] as their load approaches a 641 threshold rate that is lower than the line rate 642 [I-D.eardley-pcn-marking-behaviour]. This can be achieved with an 643 algorithm similar to a token-bucket called a virtual queue. 
The aim 644 is for a queue to start marking PCN traffic to trigger admission 645 control before the real queue builds up any congestion delay. The 646 level of a queue's pre-congestion marking is detected at the egress 647 of the Diffserv region and used by the signalling system to control 648 admission of further traffic that would otherwise overload that 649 queue, as follows. 651 The end-to-end QoS signalling for a new reservation (to be concrete 652 we will use RSVP) takes one giant hop from ingress to egress gateway, 653 because interior routers within the Diffserv region are configured to 654 ignore RSVP. The egress gateway holds flow state because it takes 655 part in the end-to-end reservation. So it can classify all packets 656 by flow and it can identify all flows that have the same previous 657 RSVP hop (an ingress-egress-aggregate). For each ingress-egress- 658 aggregate of flows in progress, the egress gateway maintains a per- 659 packet moving average of the fraction of pre-congestion-marked 660 traffic. Once an RSVP PATH message for a new reservation has hopped 661 across the Diffserv region and reached the destination, an RSVP RESV 662 message is returned. As the RESV message passes, the egress gateway 663 piggy-backs the relevant pre-congestion level onto it [RSVP-ECN]. 664 Again, interior routers ignore the RSVP message, but the ingress 665 gateway strips off the pre-congestion level. If the pre-congestion 666 level is above a threshold, the ingress gateway denies admission to 667 the new reservation, otherwise it returns the original RESV signal 668 back towards the data sender. 670 Once a reservation is admitted, its traffic will always receive low 671 delay service for the duration of the reservation. This is because 672 ingress gateways ensure that traffic not under a reservation cannot 673 pass into the PCN-region with a Diffserv codepoint that gives it 674 priority over the capacity used for PCN traffic. 
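The gateway behaviour just described can be sketched as below: the egress keeps a per-packet moving average of the pre-congestion-marked fraction for each ingress-egress-aggregate, and the ingress compares the level piggy-backed on the RESV against a threshold. The smoothing weight, the threshold value and all names here are illustrative assumptions, not values from the PCN drafts:

```python
class EgressAggregateMeter:
    """Per ingress-egress-aggregate EWMA of the fraction of
    pre-congestion-marked traffic, updated on every packet
    (the weight 0.01 is an illustrative choice)."""

    def __init__(self, weight=0.01):
        self.weight = weight
        self.marked_fraction = 0.0

    def on_packet(self, pcn_marked):
        sample = 1.0 if pcn_marked else 0.0
        self.marked_fraction += self.weight * (sample - self.marked_fraction)


def admit(reported_marking_fraction, admission_threshold=0.05):
    """Ingress decision on receipt of the RESV carrying the piggy-backed
    pre-congestion level: deny if the level is above the threshold."""
    return reported_marking_fraction < admission_threshold
```

A lightly pre-congested aggregate (say 3% marking) would pass the threshold test and the RESV would be forwarded on towards the sender; a heavily marked one would be denied.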
676 Even if some disaster re-routes traffic after it has been admitted, 677 if the PCN traffic through any PCN resource tips over a higher, fail- 678 safe threshold, pre-congestion notification can trigger flow 679 termination to very quickly bring every router within the whole PCN- 680 region back below its operating point. The same marking process and 681 ECN codepoint can be used for both admission control and flow 682 termination, by simply triggering them at different fractions of 683 marking [I-D.charny-pcn-single-marking]. However simulations have 684 confirmed that this approach is not robust in all circumstances that 685 might typically be encountered, so approaches with two thresholds and 686 two congestion encodings are expected to be required in production 687 networks. 689 The whole admission control system just described deliberately 690 confines per-flow processing to the access edges of the network, 691 where it will not limit the system's scalability. But ideally we 692 want to extend this approach to multiple networks, to take even more 693 advantage of its scaling potential. We would still need per-flow 694 processing at the access edges of each network, but not at the high 695 speed interfaces where they interconnect. Even though such an 696 admission control system would work technically, it would gain us no 697 scaling advantage if each network also wanted to police the rate of 698 each admitted flow for itself--border routers would still have to do 699 complex packet operations per-flow anyway, given they don't trust 700 upstream networks to do their policing for them. 702 This memo describes how to emulate per-flow rate policing using bulk 703 mechanisms at border routers. Otherwise the full scalability 704 potential of pre-congestion notification would be limited by the need 705 for per-flow policing mechanisms at borders, which would make borders 706 the most cost-critical pinch-points. 
Instead we can achieve the long 707 sought-for vision of secure Internet-wide bandwidth reservations 708 without over-generous provisioning or per-flow processing. We still 709 use per-flow processing at the edge routers closest to the end-user, 710 but we need no per-flow processing at all in core _or border 711 routers_--where scalability is most critical. 713 4. Re-ECN Protocol in IP with Two Congestion Marking Levels 715 4.1. Protocol Overview 717 First we need to recap the way routers accumulate PCN congestion 718 marking along a path (it accumulates the same way as ECN). Each PCN- 719 capable queue into a link might mark some packets with a PCN-marked 720 (PM) codepoint, the marking probability increasing with the length of 721 the queue [I-D.eardley-pcn-marking-behaviour]. With a series of PCN- 722 capable routers on a path, a stream of packets accumulates the 723 fraction of PCN markings that each queue adds. The combined effect 724 of the packet marking of all the queues along the path signals 725 congestion of the whole path to the receiver. So, for example, if 726 one queue early in a path is marking 1% of packets and another later 727 in a path is marking 2%, flows that pass through both queues will 728 experience approximately 3% marking over a sequence of packets. 730 (Note: Whenever the word 'congestion' is used in this document it 731 should be taken to mean congestion of the virtual resource assigned 732 for use by PCN-traffic. This avoids cumbersome repetition of the 733 strictly correct term 'pre-congestion'.) 735 The packets crossing an inter-domain trust boundary within the PCN- 736 region will all have come from different ingress gateways and will 737 all be destined for different egress gateways. 
We will show that the 738 key to policing against theft of service is for a border router to be 739 able to directly measure the congestion that is about to be caused by 740 the packets it forwards into any of the downstream paths between 741 itself and the egress gateways that each packet is destined for. The 742 purpose of the re-PCN protocol is to make packets automatically carry 743 this information, which then merely needs to be counted locally at 744 the border. 746 With the original PCN protocol, if a border router, e.g. that between 747 domains 'A' & 'B' (Figure 2), counts PCN markings crossing the border 748 over a period, they represent the accumulated congestion that has 749 already been experienced by those packets (congestion upstream of the 750 border, u). The idea of re-PCN is to make the ingress gateway 751 continuously encode the path congestion it knows into a new field in 752 the IP header (in this case, `path' means the path from the ingress 753 to the egress gateway). This new field is _not_ altered by queues 754 along the path. Then at any point on that path (e.g. between domains 755 'A' & 'B'), IP headers can be monitored to measure both expected path 756 congestion, p, and upstream congestion, u. Then congestion expected 757 downstream of the border, v, can be derived simply by subtracting 758 upstream congestion from expected path congestion. That is, v ~= p - 759 u. 761 Importantly, it turns out that there is no need to monitor downstream 762 congestion on a per-flow, per-path or per-aggregate basis. We will 763 show that accounting for it in bulk by counting the volume of all 764 marked packets will be sufficient.
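The bulk accounting this implies can be sketched as below. Packets are abstracted here as (size, RE flag, PCN-marked) tuples -- our own abstraction of the wire protocol fields, not a real packet format -- and the border keeps only three octet counters, with no per-flow, per-path or per-aggregate state:

```python
def downstream_congestion(packets):
    """Bulk border meter: derive expected downstream congestion
    v ~= p - u purely from octet counts over a measurement period."""
    total = blanked = marked = 0
    for size, re_flag, pcn_marked in packets:
        total += size
        if re_flag == 0:      # RE 'blanked' by the ingress gateway
            blanked += size
        if pcn_marked:        # PM codepoint set by an upstream queue
            marked += size
    p = blanked / total       # expected whole-path congestion
    u = marked / total        # congestion accumulated upstream of here
    return p - u              # congestion expected downstream, v
```

For example, if 3% of octets arrive with the RE flag blanked and 1% arrive PCN-marked, the border infers roughly 2% congestion downstream of itself.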
766 _____________________________________ 767 _|__ ______ ______ ______ _|__ 768 | | | A | | B | | C | | | 769 +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ 770 | | |B| |B| |B| |B| |B| |B| | | 771 |Ingr|==|R| |R|==|R| |R|==|R| |R|==|Egr | 772 |G/W | | | | |: | | | | | | | | |G/W | 773 +----+ +-+ +-+: +-+ +-+ +-+ +-+ +----+ 774 | | | |: | | | | | | 775 |____| |______|: |______| |______| |____| 776 |_____________:_______________________| 777 : 778 | : | 779 |<-upstream-->:<-expected downstream->| 780 | congestion : congestion | 781 | u v ~= p - u | 782 | | 783 |<--- expected path congestion, p --->| 785 Figure 2: Re-ECN concept 787 4.2. Re-PCN Abstracted Network Layer Wire Protocol (IPv4 or v6) 789 In this section we define the names of the various codepoints of the 790 extended ECN field when used with pre-congestion notification, 791 deferring description of their semantics to the following sections. 792 But first we recap the re-ECN wire protocol proposed in 793 [I-D.briscoe-tsvwg-re-ecn-tcp]. 795 4.2.1. Re-ECN Recap 797 Re-ECN uses the two-bit ECN field broadly as in RFC3168 [RFC3168]. 798 It also uses a new re-ECN extension (RE) flag. The actual position 799 of the RE flag is different between IPv4 & v6 headers, so we will use 800 an abstraction of the IPv4 and v6 wire protocols by just calling it 801 the RE flag. [I-D.briscoe-tsvwg-re-ecn-tcp] proposes using bit 48 802 (currently unused) in the IPv4 header for the RE flag, while for IPv6 803 it proposes a congestion extension header. 805 Unlike the ECN field, the RE flag is intended to be set by the sender 806 and remain unchanged along the path, although it can be read by 807 network elements that understand the re-ECN protocol. In the 808 scenario used in this memo, the ingress gateway is the 'sender' as 809 far as the scope of the PCN region is concerned, so it sets the RE 810 flag (as permitted for sender proxies in the specification of re- 811 ECN).
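As a sketch of this abstraction, the two-bit ECN field and the one-bit RE flag can be read together and mapped to the extended codepoint names recapped in Table 1 below. The lookup table and helper function are ours, purely for illustration:

```python
# Map (ECN field, RE flag) to the extended ECN codepoint names of
# Table 1; '---' is the legacy-ECN-only combination.
EECN_NAMES = {
    (0b00, 0): "Not-RECT",   # not re-ECN-capable transport
    (0b00, 1): "FNE",        # feedback not established
    (0b10, 0): "---",        # legacy ECN use only
    (0b10, 1): "--CU--",     # currently unused
    (0b01, 0): "Re-Echo",    # re-echoed congestion (and RECT)
    (0b01, 1): "RECT",       # re-ECN capable transport
    (0b11, 0): "CE(0)",      # congestion experienced with Re-Echo
    (0b11, 1): "CE(-1)",     # congestion experienced
}

def eecn_name(ecn_field, re_flag):
    """ecn_field is the two-bit ECN field; re_flag is the single RE bit."""
    return EECN_NAMES[(ecn_field & 0b11, re_flag & 1)]
```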
813 Note that general-purpose routers do not have to read the RE flag; 814 only special policing elements at borders do. And no general-purpose 815 routers have to change the RE flag, although the ingress and egress 816 gateways do because, in the edge-to-edge deployment model we are 817 using, they act as the endpoints of the PCN region. Therefore the RE 818 flag does not even have to be visible to interior routers. So the RE 819 flag has no implications for protocols like MPLS. Congested label 820 switching routers (LSRs) would have to be able to notify their 821 congestion with an ECN/PCN codepoint in the MPLS shim [RFC5129], but 822 like any interior IP router, they can be oblivious to the RE flag, 823 which need only be read by border policing functions. 825 Although the RE flag is a separate single-bit field, it can be read 826 as an extension to the two-bit ECN field; the three concatenated bits 827 in what we will call the extended ECN field (EECN) make eight 828 codepoints available. When the RE flag setting is "don't care", we 829 use the RFC3168 names of the ECN codepoints, but 830 [I-D.briscoe-tsvwg-re-ecn-tcp] proposes the following six codepoint 831 names for when there is a need to be more specific.
833 +--------+-------------+-------+-------------+----------------------+ 834 | ECN | RFC3168 | RE | Extended | Re-ECN meaning | 835 | field | codepoint | flag | ECN | | 836 | | | | codepoint | | 837 +--------+-------------+-------+-------------+----------------------+ 838 | 00 | Not-ECT | 0 | Not-RECT | Not re-ECN-capable | 839 | | | | | transport | 840 | 00 | Not-ECT | 1 | FNE | Feedback not | 841 | | | | | established | 842 | 10 | ECT(0) | 0 | --- | Legacy ECN use | 843 | | | | | only | 844 | 10 | ECT(0) | 1 | --CU-- | Currently unused | 845 | | | | | | 846 | 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | 847 | | | | | and RECT | 848 | 01 | ECT(1) | 1 | RECT | Re-ECN capable | 849 | | | | | transport | 850 | 11 | CE | 0 | CE(0) | Congestion | 851 | | | | | experienced with | 852 | | | | | Re-Echo | 853 | 11 | CE | 1 | CE(-1) | Congestion | 854 | | | | | experienced | 855 +--------+-------------+-------+-------------+----------------------+ 857 Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re- 858 ECN 860 4.2.2. Re-ECN Combined with Pre-Congestion Notification (re-PCN) 862 As permitted by the ECN specification [RFC3168] and by the guidelines 863 for specifying alternative semantics for the ECN field [RFC4774], a 864 proposal is currently being advanced in the IETF to define different 865 semantics for how queues might mark the ECN field of certain packets. 866 The idea is to be able to notify congestion when the queue's load 867 approaches a logical limit, rather than the physical limit of the 868 line. This new marking is called pre-congestion 869 notification [I-D.eardley-pcn-marking-behaviour] and we will use the 870 term PCN-enabled queue for a queue that can apply pre-congestion 871 notification marking to the ECN fields of packets. 873 [RFC3168] recommends that a packet's Diffserv codepoint should 874 determine which type of ECN marking it receives. 
A PCN-capable 875 packet must meet two conditions: it must carry a DSCP that has been 876 associated with PCN marking, and it must carry an ECN field that turns 877 on PCN marking. 879 As an example, a packet carrying the VOICE-ADMIT 880 [I-D.ietf-tsvwg-admitted-realtime-dscp] DSCP would be associated with 881 expedited forwarding [RFC3246] as its scheduling behaviour and pre- 882 congestion notification as its congestion marking behaviour. PCN 883 would only be turned on within a PCN-region by an ECN codepoint other 884 than Not-ECT (00). Then we would describe packets with the VOICE- 885 ADMIT DSCP and with ECN turned on as PCN-capable packets. 887 [I-D.eardley-pcn-marking-behaviour] actually proposes that two 888 logical limits can be used for pre-congestion notification, with the 889 higher limit as a back-stop for dealing with anomalous events. It 890 envisages that PCN will be used for admission control of inelastic real-time 891 traffic, so marking at the lower limit will trigger admission 892 control, while at the higher limit it will trigger flow termination. 894 Because it needs two types of congestion marking, PCN needs four 895 states: Not PCN-capable (Not-PCN), PCN-capable but not PCN-marked 896 (NM), Admission Marked (AM) and Flow Termination Marked (TM). A 897 proposed encoding of the four required PCN states is shown on the 898 left of Table 2. Note that these codepoints of the ECN field only 899 take on the semantics of pre-congestion notification if they are 900 combined with a Diffserv codepoint that the operator has configured 901 to be associated with PCN marking. 903 This encoding only correctly traverses an IP in IP tunnel if the 904 ideal decapsulation rules in [I-D.briscoe-tsvwg-ecn-tunnel] are 905 followed when combining the ECN fields of the outer and inner 906 headers.
If instead the decapsulation rules in [RFC3168] or 907 [RFC4301] are followed, any admission marking applied to an outer 908 header will be incorrectly removed on decapsulation at the tunnel 909 egress. 911 The RFC3168 ECN field includes space for the experimental ECN 912 Nonce [RFC3540], which seems to require a fifth state if it is also 913 needed with re-PCN. But re-PCN supersedes any need for the Nonce 914 within the PCN-region. The ECN Nonce is an elegant scheme, but it 915 only allows a sending node (or its proxy) to detect suppression of 916 congestion marking in the feedback loop. Thus the Nonce requires the 917 sender (or in our case the PCN ingress) to be trusted to respond 918 correctly to congestion. But this is precisely the main cheat we 919 want to protect against (as well as many others). Also, the ECN 920 nonce only works once the receiver has placed packets in the same 921 order as they left the ingress, which cannot be done by an edge node 922 without adding unnecessary edge-to-edge packet ordering. Nonetheless, 923 if the ECN nonce were in use outside the PCN region (end-to-end), the 924 ingress would have to tunnel the arriving IP header across the PCN 925 region ([I-D.ietf-pcn-architecture]). 927 For the rest of this memo, we will use 928 "congestion marking" or "PCN 929 marking" to mean either Admission Marking or Termination Marking, unless we need to be specific. With the above encoding, 930 congestion marking can be read to mean any packet with the right-most 931 bit of the ECN field set. 933 The re-ECN protocol can be used to control misbehaving sources 934 whether congestion is with respect to a logical threshold (PCN) or 935 the physical line rate (ECN). In either case the RE flag can be used 936 to create an extended ECN field. For PCN-capable packets, the 8 937 possible encodings of this 3-bit extended PCN (EPCN) field are 938 defined on the right of Table 2 below.
The purposes of these 939 different codepoints will be introduced in subsequent sections. 941 +--------+-----------+-------+-----------------+--------------------+ 942 | ECN | PCN | RE | Extended PCN | Re-PCN meaning | 943 | field | codepoint | flag | codepoint | | 944 +--------+-----------+-------+-----------------+--------------------+ 945 | 00 | Not-PCN | 0 | Not-PCN | Not PCN-capable | 946 | | | | | transport | 947 | 00 | Not-PCN | 1 | FNE | Feedback not | 948 | | | | | established | 949 | 10 | NM | 0 | Re-PCT-Echo | Re-echoed | 950 | | | | | congestion and | 951 | | | | | Re-PCT | 952 | 10 | NM | 1 | Re-PCT | Re-PCN capable | 953 | | | | | transport | 954 | 01 | AM | 0 | AM(0) | Admission Marking | 955 | | | | | with Re-Echo | 956 | 01 | AM | 1 | AM(-1) | Admission Marking | 957 | | | | | | 958 | 11 | TM | 0 | TM(0) | Termination | 959 | | | | | Marking with | 960 | | | | | Re-Echo | 961 | 11 | TM | 1 | TM(-1) | Termination | 962 | | | | | Marking | 963 +--------+-----------+-------+-----------------+--------------------+ 965 Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre- 966 congestion Notification (PCN) 968 Note that Table 2 shows re-PCN uses ECT(0) but Table 1 shows re-ECN 969 uses ECT(1) for the unmarked state. The difference is intended-- 970 although it makes it harder to remember the two schemes, it makes 971 them both safer during incremental deployment. 973 4.3. Protocol Operation 974 4.3.1. Protocol Operation for an Established Flow 976 The re-PCN protocol involves a simple addition to the action of the 977 gateway at the ingress edge of the PCN region (the PCN-ingress-node). 978 But first we will recap how PCN works without the addition. For each 979 active traffic aggregate across a PCN region (ingress-egress- 980 aggregate) the egress gateway measures the level of PCN marking and 981 feeds it back to the ingress piggy-backed as 'PCN-feedback- 982 information' on any control signal passing between the nodes (e.g. 
983 every flow set-up, refresh or tear-down). Therefore the ingress 984 gateway will always hold a fairly recent (typically at most 30sec) 985 estimate of the ingress-egress-aggregate congestion level. For 986 instance, one aggregate might have been experiencing 3% pre- 987 congestion (that is, congestion marked octets whether Admission 988 Marked or Termination Marked). 990 To comply with the re-PCN protocol, for all PCN packets in each 991 ingress-egress-aggregate the ingress gateway MUST clear the RE flag 992 to "0" for the same percentage of octets as its current estimate of 993 congestion on the aggregate (e.g. 3%) and set it to "1" in the rest 994 (97%). Appendix A.1 gives a simple pseudo-code algorithm that the 995 ingress gateway may use to do this. 997 The RE flag is set and cleared this way round for incremental 998 deployment reasons (see Section 7). To avoid confusion we will use 999 the term `blanking' (rather than marking) when the RE flag is cleared 1000 to "0", so we will talk of the `RE blanking fraction' as the fraction 1001 of octets with the RE flag cleared to "0". 1003 ^ 1004 | 1005 | RE blanking fraction 1006 3% | +----------------------------+====+ 1007 | | | | 1008 2% | | | | 1009 | | congestion marking fraction| | 1010 1% | | +----------------------+ | 1011 | | | | 1012 0% +----+=====+---------------------------+------> 1013 ^ <--A---> <---B---> <---C---> ^ domain 1014 | ^ ^ | 1015 ingress | | egress 1016 1.00% 2.00% marking fraction 1018 Figure 3: Example Extended ECN codepoint Marking fractions 1019 (Imprecise) 1021 Figure 3 illustrates our example. The horizontal axis represents the 1022 index of each congestible resource (typically queues) along a path 1023 through the Internet. The two superimposed plots show the fraction 1024 of each extended PCN codepoint observed along this path, assuming 1025 there are two congested routers somewhere within domains A and C. 
And 1026 Table 3 below shows the downstream pre-congestion measured at various 1027 border observation points along the path. Figure 4 (later) shows the 1028 same results of these subtractions, but in graphical form like the 1029 above figure. The tabulated figures are actually reasonable 1030 approximations derived from more precise formulae given in Appendix A 1031 of [I-D.briscoe-tsvwg-re-ecn-tcp]. The RE flag is not changed by 1032 interior routers, so it can be seen that it acts as a reference 1033 against which the congestion marking fraction can be compared along 1034 the path. 1036 +--------------------------+---------------------------------------+ 1037 | Border observation point | Approximate Downstream pre-congestion | 1038 +--------------------------+---------------------------------------+ 1039 | ingress -- A | 3% - 0% = 3% | 1040 | A -- B | 3% - 1% = 2% | 1041 | B -- C | 3% - 1% = 2% | 1042 | C -- egress | 3% - 3% = 0% | 1043 +--------------------------+---------------------------------------+ 1045 Table 3: Downstream Congestion Measured at Example Observation Points 1047 Note that the ingress determines the RE blanking fraction for each 1048 aggregate using the most recent feedback from the relevant egress, 1049 arriving with each new reservation, or each refresh. These updates 1050 arrive relatively infrequently compared to the speed with which 1051 congestion changes. Although this feedback will always be out of 1052 date, on average positive errors should cancel out negative over a 1053 sufficiently long duration. 1055 In summary, the network adds pre-congestion marking in the forward 1056 data path, the egress feeds its level back to the ingress in RSVP (or 1057 similar signalling), then the ingress gateway re-echoes it into the 1058 forward data path by blanking the RE flag. 
Then at any border within 1059 the PCN-region, the pre-congestion marking that every passing packet 1060 will be expected to experience downstream can be measured to be the 1061 RE blanking fraction minus the congestion marking fraction. 1063 4.3.2. Aggregate Bootstrap 1065 When a new reservation PATH message arrives at the egress, if there 1066 are currently no flows in progress from the same ingress, there will 1067 be no state maintaining the current level of pre-congestion marking 1068 for the aggregate. In the case of RSVP reservation signalling, while 1069 the signal continues onward towards the receiving host, the egress 1070 gateway can return an RSVP message to the ingress with a 1071 flag [RSVP-ECN] asking the ingress to send a specified number of data 1072 probes between them. The more general possibilities for bootstrap 1073 behaviour are described in the PCN 1074 architecture [I-D.ietf-pcn-architecture], including using the 1075 reservation signal itself as a probe. 1077 However, with our new re-PCN scheme, the ingress does not know what 1078 proportion of the data probes should have the RE flag blanked, 1079 because it has no estimate yet of pre-congestion for the path across 1080 the PCN-region. 1082 To be conservative, following the guidance for specifying other re- 1083 ECN transports in [I-D.briscoe-tsvwg-re-ecn-tcp], the ingress SHOULD 1084 set the FNE codepoint of the extended PCN header in all probe packets 1085 (Table 2). As per the PCN deployment model, the egress gateway 1086 measures the fraction of congestion-marked probe octets and feeds 1087 back the resulting pre-congestion level to the ingress, piggy-backed 1088 on the returning reservation response (RESV) for the new flow. Probe 1089 packets are identifiable by the egress because they carry the FNE 1090 codepoint. 1092 It may seem inadvisable to expect the FNE codepoint to be set on 1093 probes, given legacy firewalls etc. 
might discard such packets 1094 (because this flag had no previous legitimate use). However, in the 1095 deployment scenarios envisaged, each domain in the PCN-region has to 1096 be explicitly configured to support the admission controlled service. 1097 So, before deploying the service, the operator MUST reconfigure such 1098 a badly implemented middlebox to allow through packets with the RE 1099 flag set. 1101 Note that we have said SHOULD rather than MUST for the FNE setting 1102 behaviour of the ingress for probe packets. This entertains the 1103 possibility of an ingress implementation having the benefit of other 1104 knowledge of the path, which it re-uses for a newly starting 1105 aggregate. For instance, it may hold cached information from a 1106 recent use of the aggregate that is still sufficiently current to be 1107 useful. If not all probe packets are set to FNE, the ingress will 1108 have to ensure probe packets are identifiable by some other means, 1109 perhaps by using the egress as the destination address. 1111 It might seem pedantic worrying about these few probe packets, but 1112 this behaviour ensures the system is safe, even if the proportion of 1113 probe packets becomes large. 1115 4.3.3. Flow Bootstrap 1117 It might be expected that a new flow within an active aggregate would 1118 need no special bootstrap behaviour. If there was an aggregate 1119 already in progress between the gateways the new flow was about to 1120 use, it would inherit the prevailing RE blanking fraction. And if 1121 there were no active aggregate, the bootstrap behaviour for an 1122 aggregate would be appropriate and sufficient for the new flow. 1124 However, for a number of reasons, at least the first packet of each 1125 new flow SHOULD be set to the FNE codepoint, irrespective of whether 1126 it is joining an active aggregate or not. 
If the first packet is 1127 unlikely to be reliably delivered, a number of FNE packets MAY be 1128 sent to increase the probability that at least one is delivered to 1129 the egress gateway. 1131 If each flow does not start with an FNE packet, it will be seen later 1132 that sanctions may be too strict at the interface before the egress 1133 gateway. It will often be possible to apply sanctions at the 1134 granularity of aggregates rather than flows, but in an internetworked 1135 environment it cannot be guaranteed that aggregates will be 1136 identifiable in remote networks. So setting FNE at the start of each 1137 flow is a safe strategy. For instance, a remote network may have 1138 equal cost multi-path (ECMP) routing enabled, causing different flows 1139 between the same gateways to traverse different paths. 1141 After an idle period of more than 1 second, the ingress gateway 1142 SHOULD set the EPCN field of the next packet it sends to FNE. This 1143 allows the design of network policers to be deterministic (see 1144 [I-D.briscoe-tsvwg-re-ecn-tcp]). 1146 However, if the ingress gateway can guarantee that the network(s) 1147 that will carry the flow to its egress gateway all use a common 1148 identifier for the aggregate (e.g. a single MPLS network without ECMP 1149 routing), it MAY omit setting FNE when it adds a new flow to an active 1150 aggregate. And an FNE packet need only be sent if a whole aggregate 1151 has been idle for more than 1 second. 1153 4.3.4. Router Forwarding Behaviour 1155 Adding re-PCN works well with the regular PCN forwarding behaviour of 1156 interior queues.
   However, below, two optional changes are proposed when forwarding
   packets with a per-hop behaviour that requires pre-congestion
   notification:

   Preferential drop:  When a router cannot avoid dropping PCN-capable
      packets, preferential dropping of packets with different extended
      PCN codepoints SHOULD be implemented between packets within a PHB
      that uses PCN marking.  The drop preference order to use is
      defined in Table 4.  Note that, to reduce configuration
      complexity, Re-PCT-Echo and FNE MAY be given the same drop
      preference, but, if feasible, FNE SHOULD be dropped in preference
      to Re-PCT-Echo.

      If this proposal were advanced at the same time as PCN itself, we
      would recommend that preferential drop based on the extended PCN
      codepoint SHOULD be added to router forwarding at the same time
      as PCN marking.  Preferential dropping can be difficult to
      implement, but we RECOMMEND this security-related re-PCN
      improvement where feasible, as it is an effective defence against
      flooding attacks.

   Marking vs. Drop:  We propose that PCN-routers SHOULD inspect the RE
      flag as well as the ECN field to decide whether to drop or mark
      PCN DSCPs.  They MUST choose drop if the codepoint of this
      extended ECN field is Not-PCN.  Otherwise they SHOULD mark
      (unless, of course, buffer space is exhausted).

      A PCN-capable router MUST NOT ever congestion mark a packet
      carrying the Not-PCN codepoint, because the transport will only
      understand drop, not congestion marking.  But a PCN-capable
      router can mark rather than drop an FNE packet, even though its
      ECN field, when looked at in isolation, is '00', which appears to
      be a legacy Not-ECT packet.  Therefore, if a packet's RE flag is
      '1', even if its ECN field is '00', a PCN-enabled router SHOULD
      use congestion marking.
      This allows the `feedback not established' (FNE) codepoint to be
      used for probe packets, in order to pick up PCN marking when
      bootstrapping an aggregate.

      PCN marking rather than dropping of FNE packets MUST only be
      deployed in controlled environments, such as that in
      [I-D.ietf-pcn-architecture], where the presence of an egress node
      that understands PCN marking is assured.  Congestion events might
      otherwise be ignored if the receiver only understands drop,
      rather than PCN marking.  This is because there is no guarantee
      that PCN capability has been negotiated if feedback is not
      established (FNE).  Also, [I-D.briscoe-tsvwg-re-ecn-tcp] places
      the strong condition that a router MUST apply drop rather than
      marking to FNE packets unless it can guarantee that FNE packets
      are rate limited either locally or upstream.

   +---------+-------+-----------------+---------+---------------------+
   | PCN     | RE    | Extended PCN    | Drop    | Re-PCN meaning      |
   | field   | flag  | codepoint       | Pref    |                     |
   +---------+-------+-----------------+---------+---------------------+
   | 10      | 0     | Re-PCT-Echo     | 5/4     | Re-echoed           |
   |         |       |                 |         | congestion and      |
   |         |       |                 |         | Re-PCT              |
   | 00      | 1     | FNE             | 4       | Feedback not        |
   |         |       |                 |         | established         |
   | 10      | 1     | Re-PCT          | 3       | Re-PCN capable      |
   |         |       |                 |         | transport           |
   | 01      | 0     | AM(0)           | 3       | Admission Marking   |
   |         |       |                 |         | with Re-Echo        |
   | 01      | 1     | AM(-1)          | 3       | Admission Marking   |
   |         |       |                 |         |                     |
   | 11      | 0     | TM(0)           | 2       | Termination Marking |
   |         |       |                 |         | with Re-Echo        |
   | 11      | 1     | TM(-1)          | 2       | Termination Marking |
   |         |       |                 |         |                     |
   | 00      | 0     | Not-PCN         | 1       | Not PCN-capable     |
   |         |       |                 |         | transport           |
   +---------+-------+-----------------+---------+---------------------+

    Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

4.3.5. Extensions
   If a different signalling system, such as NSIS, were used, but it
   provided admission control in a similar way using pre-congestion
   notification (e.g. Arumaithurai [I-D.arumaithurai-nsis-pcn] or
   RMD [I-D.ietf-nsis-rmd]), we believe re-PCN could be used to protect
   against misbehaving networks in the same way as proposed above.

5. Emulating Border Policing with Re-ECN

   The following sections are informative, not normative.  The re-PCN
   protocol described in Section 4 above would require standardisation,
   whereas operators, acting in their own interests, would be expected
   to deploy policing and monitoring functions similar to those
   proposed in the sections below without any further need for
   standardisation by the IETF.  Flexibility is expected in exactly how
   policing and monitoring are done.

5.1. Informal Terminology

   In the rest of this memo, where the context makes it clear, we will
   sometimes loosely use the term `congestion' rather than the stricter
   `downstream pre-congestion'.  Also we will loosely talk of positive
   or negative flows, meaning flows where the moving average of the
   downstream pre-congestion metric is persistently positive or
   negative.  The notion of a negative metric arises because it is
   derived by subtracting one metric from another.  Of course, actual
   downstream congestion cannot be negative; only the metric can
   (whether due to time lags or deliberate malice).

   Just as we will loosely talk of positive and negative flows, we will
   also talk of positive or negative packets, meaning packets that
   contribute positively or negatively to downstream pre-congestion.

   Therefore packets can be considered to have a `worth' of +1, 0 or
   -1, which, when multiplied by their size, indicates their
   contribution to downstream congestion.  Packets will usually be
   initialised by the PCN ingress with a worth of 0.
   Blanking the RE flag increments the worth of a packet to +1.
   Congestion marking a packet decrements its worth (whether admission
   marking or termination marking).  Congestion marking a previously
   blanked packet cancels out the positive worth with the negative
   worth of the congestion marking (resulting in a packet worth 0).
   The FNE codepoint is an exception: it has the same positive worth as
   a packet with the Re-PCT-Echo codepoint.  The table below specifies
   unambiguously the worth of each extended PCN codepoint.  Note the
   order is different from the previous table, to emphasise how
   congestion marking processes decrement the worth (with the exception
   of FNE).

   +---------+-------+------------------+-------+----------------------+
   | ECN     | RE    | Extended PCN     | Worth | Re-PCN meaning       |
   | field   | flag  | codepoint        |       |                      |
   +---------+-------+------------------+-------+----------------------+
   | 00      | 0     | Not-PCN          | n/a   | Not PCN-capable      |
   |         |       |                  |       | transport            |
   | 10      | 0     | Re-PCT-Echo      | +1    | Re-echoed congestion |
   |         |       |                  |       | and Re-PCT           |
   | 01      | 0     | AM(0)            | 0     | Admission Marking    |
   |         |       |                  |       | with Re-Echo         |
   | 11      | 0     | TM(0)            | 0     | Termination Marking  |
   |         |       |                  |       | with Re-Echo         |
   | 00      | 1     | FNE              | +1    | Feedback not         |
   |         |       |                  |       | established          |
   | 10      | 1     | Re-PCT           | 0     | Re-PCN capable       |
   |         |       |                  |       | transport            |
   | 01      | 1     | AM(-1)           | -1    | Admission Marking    |
   |         |       |                  |       |                      |
   | 11      | 1     | TM(-1)           | -1    | Termination Marking  |
   +---------+-------+------------------+-------+----------------------+

             Table 5: 'Worth' of Extended ECN Codepoints

5.2. Policing Overview

   It will be recalled that downstream congestion can be found by
   subtracting upstream congestion from path congestion.  Figure 4
   displays the difference between the two plots in Figure 3 to show
   downstream pre-congestion across the same path through the Internet.
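   That subtraction can be illustrated with a small fragment (a
   non-normative Python sketch; the example fractions are invented, not
   taken from Figure 3):

```python
def downstream_pre_congestion(path_fraction: float,
                              upstream_fraction: float) -> float:
    """Downstream pre-congestion = path congestion - upstream congestion.

    path_fraction:     fraction of traffic with the RE flag blanked,
                       i.e. the declared whole-path pre-congestion
    upstream_fraction: fraction congestion marked so far on the path
    """
    return path_fraction - upstream_fraction

# At the ingress nothing has been marked yet, so all of the declared
# pre-congestion is still downstream; by the egress the two fractions
# should cancel, leaving zero.
print(downstream_pre_congestion(0.02, 0.0))   # 0.02 at the ingress
print(downstream_pre_congestion(0.02, 0.02))  # 0.0 at the egress
```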
   To emulate border policing, the general idea is for each domain to
   apply penalties to its upstream neighbour in proportion to the
   amount of downstream pre-congestion that the upstream network sends
   across the border.  That is, the penalties should be in proportion
   to the height of the plot.  Downward arrows in the figure show the
   resulting pressure, due to the penalties, for each domain to
   under-declare downstream pre-congestion in the traffic it passes to
   the next domain.

                      p e n a l t i e s
                     /        |        \
      A              :        :         :
      |  |      <--A---> <---B---> <---C--->     domain
      |  V           :        :         :
   3% |    +-----+   |        |         :
      |    |     |   V        V         :
   2% |    |     +----------------------+    :
      |    | downstream pre-congestion  |    :
   1% |    |         :                  |    :
      |    |         :                  |    :
   0% +----+----------------------------+====+------>
           :         :                  : A  :
           :         :                  : |  :
       ingress       :                  :   egress
         1.00%       2.00%              :  pre-congestion
                                          |
                                      sanctions

    Figure 4: Policing Framework, showing creation of opposing
    pressures to under-declare and over-declare downstream
    pre-congestion, using penalties and sanctions

   These penalties seem to encourage everyone to understate downstream
   congestion in order to reduce the penalties they incur.  But a
   balancing pressure is introduced by the last domain (strictly, by
   any domain), which applies sanctions to flows if downstream
   congestion goes negative before the egress gateway.  The upward
   arrow at Domain C's border with the egress gateway represents the
   incentive the sanctions would create to prevent negative traffic.
   The same upward pressure can be applied at any domain border (arrows
   not shown).

   Any flow that persistently goes negative by the time it leaves a
   domain must not have been marked correctly in the first place.  A
   domain that discovers such a flow can adopt a range of strategies to
   protect itself.
   Which strategy it uses will depend on policy, because it cannot
   immediately assume malice--there may be an innocent configuration
   error somewhere in the system.

   This memo does not propose to standardise any particular mechanism
   to detect persistently negative flows, but Section 5.5 does give
   examples.  Note that we have used the term flow, but there will be
   no need to delve into the transport layer for port numbers;
   identifiers visible in the network layer will be sufficient (IP
   address pair, DSCP, protocol ID).  The appendix also gives a
   mechanism to limit the required flow state, preventing state
   exhaustion attacks.

   Of course, some domains may trust other domains to comply with
   admission control without applying sanctions or penalties.  In these
   cases, the protocol should still be used, but no penalties need be
   applied.  The re-PCN protocol ensures downstream pre-congestion
   marking is passed on correctly whether or not penalties are applied
   to it, so the system works just as well with a mixture of some
   domains trusting each other and others not.

   Providers should be free to agree the contractual terms they wish
   between themselves, so this memo does not propose to standardise how
   these penalties would be applied.  It is sufficient to standardise
   the re-PCN protocol so that the downstream pre-congestion metric is
   available if providers choose to use it.  However, the next section
   (Section 5.3) gives some examples of how these penalties might be
   implemented.

5.3. Pre-requisite Contractual Arrangements

   The re-PCN protocol has been chosen to solve the policing problem
   because it embeds a downstream pre-congestion metric in passing PCN
   traffic that is difficult to lie about and can be measured in bulk.
   The ability to emulate border policing depends on network operators
   choosing to use this metric as one of the elements in their
   contracts with each other.

   Already, many inter-domain agreements involve a capacity element and
   a usage element.  The usage element may be based on volume or on
   various measures of peak demand.  We expect that those network
   operators who choose to use pre-congestion notification for
   admission control would also be willing to consider using this
   downstream pre-congestion metric as a usage element in their
   interconnection contracts for admission controlled (PCN) traffic.

   Congestion (or pre-congestion) has the dimension of [octet], being
   the product of the volume transferred [octet] and the congestion
   fraction [dimensionless], which is the fraction of the offered load
   that the network isn't able to serve (or would rather not serve, in
   the case of pre-congestion).  Measuring downstream congestion
   therefore gives a measure of the volume transferred, but modulated
   by the congestion expected downstream.  So volume transferred during
   off-peak periods counts for nearly nothing, while volume transferred
   at peak times or over temporarily congested links counts very
   highly.  The re-PCN protocol allows one network to measure how much
   pre-congestion has been `dumped' into it by another network, and
   then, in turn, how much of that pre-congestion it dumped into the
   next downstream network.

   Section 5.6 describes mechanisms for calculating border penalties,
   referring to Appendix A.2 for suggested metering algorithms for
   downstream congestion at a border router.  Conceptually, it could
   hardly be simpler.  It broadly involves accumulating the volume of
   packets with the RE flag blanked and the volume of those with
   congestion marking, then subtracting the two.
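   As a non-normative illustration of that accumulate-and-subtract
   process (a Python sketch, not the pseudo-code of Appendix A.2; the
   worths follow Table 5):

```python
# Bulk meter of downstream pre-congestion at a border interface.
# Packets of positive worth (Re-PCT-Echo, FNE) add their size in
# octets; packets of negative worth (AM(-1), TM(-1)) subtract theirs.
# Neutral codepoints contribute nothing.  No per-flow state is kept.
WORTH = {
    "Re-PCT-Echo": +1, "FNE": +1,
    "Re-PCT": 0, "AM(0)": 0, "TM(0)": 0,
    "AM(-1)": -1, "TM(-1)": -1,
}

class BorderMeter:
    def __init__(self) -> None:
        self.volume = 0  # accumulated downstream pre-congestion [octets]

    def packet(self, codepoint: str, size_octets: int) -> None:
        self.volume += WORTH.get(codepoint, 0) * size_octets

meter = BorderMeter()
meter.packet("Re-PCT-Echo", 1500)  # RE flag blanked: +1500
meter.packet("Re-PCT", 1500)       # neutral: no change
meter.packet("AM(-1)", 1500)       # admission marked: -1500
print(meter.volume)  # 0: the marking cancels the blanked packet
```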
   Once this downstream pre-congestion metric is available, operators
   are free to choose how they incorporate it into their
   interconnection contracts [IXQoS].  Some may include a threshold
   volume of pre-congestion as a quality measure in their service level
   agreement, perhaps with a penalty clause if the upstream network
   exceeds this threshold over, say, a month.  Others may agree a set
   of tiered monthly thresholds, with increasing penalties as each
   threshold is exceeded.  But it would be just as easy, and more
   resistant to gaming, to do away with discrete thresholds and instead
   make the penalty rise smoothly with the volume of pre-congestion, by
   applying a price to pre-congestion itself.  Then the usage element
   of the interconnection contract would directly relate to the volume
   of pre-congestion caused by the upstream network.

   The direction of penalties and charges relative to the direction of
   traffic flow is a constant source of confusion.  Typically, where
   capacity charges are concerned, lower-tier customer networks pay
   higher-tier provider networks.  So money flows from the edges to the
   middle of the internetwork, towards greater connectivity,
   irrespective of the flow of data.  But we advise that penalties or
   charges for usage should follow the same direction as the data
   flow--the direction of control at the network layer.  Otherwise a
   network lays itself open to `denial of funds' attacks.  So, where a
   tier 2 provider sends data into a tier 3 customer network, we would
   expect the penalty clauses for sending too much pre-congestion to be
   against the tier 2 network, even though it is the provider.

   It may help to remember that data will be flowing in the other
   direction too.
   So the provider network has as much opportunity to levy usage
   penalties as its customer, and it can set the price or strength of
   its own penalties higher if it chooses.  Usage charges in both
   directions tend to cancel each other out, which confirms that usage
   charging has less to do with raising revenue and more to do with
   encouraging load control discipline, in order to smooth peaks and
   troughs, improving utilisation and quality.

   Further, when operators agree penalties in their interconnection
   contracts for sending downstream congestion, they should make sure
   that any level of negative marking equates only to zero penalty.  In
   other words, penalties are always paid in the same direction as the
   data, and never against the data flow, even if downstream congestion
   seems to be negative.  This is consistent with the definition of
   physical congestion; when a resource is underutilised, it is not
   negatively congested.  Its congestion is just zero.  So, although
   short periods of negative marking can be tolerated, to correct
   temporary over-declarations due to lags in the feedback system,
   persistent negative downstream congestion can have no physical
   meaning and therefore must signify a problem.  The incentive for
   domains not to tolerate persistently negative traffic depends on
   this principle that negative penalties must never be paid for
   negative congestion.

   Also note that, at the last egress of the PCN-region, domain C
   should not agree to pay any penalties to the egress gateway for
   pre-congestion passed to the egress gateway.  Downstream
   pre-congestion should have reached zero by this point.  If domain C
   were to agree to pay for any remaining downstream pre-congestion, it
   would give the egress gateway an incentive to over-declare
   pre-congestion feedback and take the resulting profit from domain C.
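   The rule that negative marking never earns a negative penalty might
   be captured in a settlement calculation as a simple clamp (an
   illustrative Python sketch; the price and volumes are invented):

```python
def monthly_penalty(price_per_octet: float,
                    congestion_volume_octets: float) -> float:
    """Penalty owed by the upstream network for the downstream
    pre-congestion it sent over the accounting period.

    A negative accumulated volume has no physical meaning (an
    underutilised resource has zero congestion, not negative
    congestion), so it earns no credit: the penalty is clamped at
    zero rather than ever being paid against the data flow.
    """
    return max(0.0, price_per_octet * congestion_volume_octets)

print(monthly_penalty(0.25, 2_000_000))  # 500000.0
print(monthly_penalty(0.25, -500_000))   # 0.0: never a negative penalty
```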
   To focus the discussion, from now on, unless otherwise stated, we
   will assume a downstream network charges its upstream neighbour in
   proportion to the pre-congestion it sends (V_b in the notation of
   Appendix A.2).  Effectively, tiered thresholds would just be more
   coarse-grained approximations of the fine-grained case we choose to
   examine.  If these neighbours had previously agreed that the (fixed)
   price per octet of pre-congestion would be L, then the bill at the
   end of the month would simply be the product L*V_b, plus any fixed
   charges they may also have agreed.

   We are well aware that the IETF tries to avoid standardising
   technology that depends on a particular business model.  Indeed,
   this principle is at the heart of all our own work.  Our aim here is
   to make a new metric available that we believe is superior to all
   existing metrics.  Then, our aim is to show that bulk border
   policing can at least work with the one model we have just outlined.
   Of course, operators are free to complement this pre-congestion-
   based usage element of their charges with traditional capacity
   charging, and we expect they will.  But if operators don't want to
   use this business model at all, they don't have to do bulk border
   policing.  We also assume that operators might experiment with the
   metric in other models.

   Also note well that everything we discuss in this memo only concerns
   interconnection within the PCN-region.  ISPs are free to sell or
   give away reservations however they want on the retail market, but
   of course interconnection charges will have a bearing on that.
   Indeed, in the present scenario, the ingress gateway effectively
   sells reservations on one side and buys congestion penalties on the
   other.
   As congestion rises, one can imagine the gateway discovering that
   congestion penalties have risen higher than the (probably fixed)
   revenue it will earn from selling the next flow reservation.  This
   encourages the gateway to cut its losses by blocking new calls,
   which is why we believe downstream congestion penalties can emulate
   per-flow rate policing at borders, as the next section explains.

5.4. Emulation of Per-Flow Rate Policing: Rationale and Limits

   The important feature of charging in proportion to congestion volume
   is that the penalty aggregates and disaggregates correctly along
   with packet flows.  This is because the penalty rises linearly with
   bit rate (unless congestion is absolutely zero) and linearly with
   congestion, being the product of them both.  So if the packets
   crossing a border belong to a thousand flows, and one of those flows
   doubles its rate, the ingress gateway forwarding that flow will have
   to put twice as much congestion marking into the packets of that
   flow.  And this extra congestion marking will add proportionately to
   the penalties levied at every border the flow crosses, in proportion
   to the amount of pre-congestion remaining on the path.

   Effectively, usage charges will continuously flow from ingress
   gateways to the places generating pre-congestion marking, in
   proportion to the pre-congestion marking introduced and to the data
   rates from those gateways.

   As importantly, pre-congestion itself rises super-linearly with the
   utilisation of a particular resource.  So if someone tries to push
   another flow into a path that is already signalling enough
   pre-congestion to warrant admission control, the penalty will be a
   lot greater than it would have been to add the same flow to a less
   congested path.
   This makes the incentive system fairly insensitive to the actual
   level of pre-congestion that each ingress chooses for triggering
   admission control.  The deterrent against exceeding whatever
   threshold is chosen rises very quickly with a small amount of
   cheating.

   These are the properties that allow re-PCN to emulate per-flow
   border policing of both rate and admission control.  It is not a
   perfect emulation of per-flow border policing, but we claim it is
   sufficient to at least ensure that the cost to others of a cheat is
   borne by the cheater, because the penalties are at least
   proportionate to the level of the cheat.  If an edge network
   operator is selling reservations at a large profit over the
   congestion cost, these pre-congestion penalties will not be
   sufficient to ensure networks in the middle get a share of those
   profits, but at least they can cover their costs.

   We will now explain with an example.  When a whole inter-network is
   operating at normal (typically very low) congestion, the
   pre-congestion marking from virtual queues will be a little higher
   than if the real queues had been used--still low, but more
   noticeable.  But low congestion levels do not imply that usage
   _charges_ must also be low.  Usage charges will depend on the
   _price_ L as well.

   If the metric of the usage element of an interconnection agreement
   were changed from pure volume to pre-congested volume, one would
   expect the price of pre-congestion to be arranged so that the total
   usage charge remained about the same.  So, if the average
   pre-congestion fraction turned out to be 1/1000, one would expect
   the price L (per octet) of pre-congestion to be about 1000 times the
   previously used (per octet) price for volume.
   We should add that a switch to pre-congestion is unlikely to
   maintain exactly the same overall level of usage charges, but this
   argument will be approximately true, because the usage charge will
   rise to at least the level the market finds necessary to push back
   against usage.

   From the above example it can be seen why a 1000x higher price will
   make operators acutely sensitive to the congestion they cause in
   other networks, which is of course the desired effect: to encourage
   networks to _avoid_ the congestion they allow their users to cause
   to others.

   If any network sends even one flow at a higher rate, it will
   immediately have to pay proportionately more usage charges.  Because
   there is no knowledge of reservations within the PCN-region, no
   interior router can police whether the rate of each flow is greater
   than its reservation.  So the system doesn't truly emulate
   rate-policing of each flow.  But there is no incentive to pack a
   higher rate into a reservation, because the charges are directly
   proportional to rate, irrespective of the reservations.

   However, if virtual queues start to fill on any path, even though
   real queues will still be able to provide a low latency service,
   pre-congestion marking will rise fairly quickly.  It may eventually
   reach the threshold at which the ingress gateway would deny
   admission to new flows.  If the ingress gateway cheats and continues
   to admit new flows, the affected virtual queues will rapidly fill,
   even though the real queues will still be little worse than they
   were when admission control should have been invoked.  The ingress
   gateway will have to pay the penalty for such an extremely high
   pre-congestion level, so the pressure to invoke admission control
   should become unbearable.

   The above mechanisms protect against rational operators.
   In Section 5.6.3 we discuss how networks can protect themselves from
   accidental or deliberate misconfiguration in neighbouring networks.

5.5. Sanctioning Dishonest Marking

   As PCN traffic leaves the last network before the egress gateway
   (domain 'C' in Figure 4), the RE blanking fraction should match the
   congestion marking fraction, when averaged over a sufficiently long
   duration (perhaps ~10s, to allow a few rounds of feedback through
   regular signalling of new and refreshed reservations).

   To protect itself, domain 'C' should install a monitor at its
   egress.  The monitor aims to detect flows of PCN packets that are
   persistently negative.  If flows are positive, domain 'C' need take
   no action--this simply means an upstream network must be paying more
   penalties than it needs to.  Appendix A.3 gives a suggested
   algorithm for the monitor, meeting the criteria below.

   o  It SHOULD introduce minimal false positives for honest flows;

   o  It SHOULD quickly detect and sanction dishonest flows (minimal
      false negatives);

   o  It MUST be invulnerable to state exhaustion attacks from
      malicious sources.  For instance, if the dropper uses flow state,
      it should not be possible for a source to send numerous packets,
      each with a different flow ID, to force the dropper to exhaust
      its memory capacity;

   o  If drop is used as a sanction, it SHOULD introduce sufficient
      loss in goodput that malicious sources cannot play off losses in
      the egress dropper against higher allowed throughput.
      Salvatori [CLoop_pol] describes this attack, which involves the
      source understating path congestion then inserting forward error
      correction (FEC) packets to compensate for the expected losses.

   Note that the monitor operates on flows, but with careful design we
   can avoid per-flow state.
   This is why we have been careful to ensure that all flows MUST start
   with a packet marked with the FNE codepoint.  If a flow does not
   start with the FNE codepoint, a monitor is likely to treat it
   unfavourably.  This risk makes it worth setting the FNE codepoint at
   the start of a flow, even though there is a cost to setting FNE
   (positive `worth').

   Starting flows with an FNE packet also means that a monitor will be
   resistant to state exhaustion attacks from other networks, as the
   monitor can then be designed never to create state unless an FNE
   packet arrives.  And an FNE packet counts positive, so it will cost
   a network a lot to send many of them.

   Monitor algorithms will often maintain a moving average across flows
   of the fraction of RE-blanked packets.  When maintaining an average
   across flows, a monitor MUST ignore packets with the FNE codepoint
   set.  An ingress gateway sets the FNE codepoint when it does not
   have the benefit of feedback from the egress.  So counting FNE
   packets would be likely to make the average unnecessarily positive,
   providing headroom (or should we say footroom?) for dishonest
   (negative) traffic.

   If the monitor detects a persistently negative flow, it could drop
   sufficient negative and neutral packets to force the flow not to be
   negative.  This is the approach taken for the `egress dropper' in
   [I-D.briscoe-tsvwg-re-ecn-tcp], but for the scenario in this memo,
   where everyone would expect everyone else to keep to the protocol, a
   management alarm SHOULD be raised on detecting persistently negative
   traffic, and any automatic sanctions taken SHOULD be logged.  Even
   if the chosen policy is to take no automatic action, the cause can
   then be investigated manually.

   Then no ingress can understate downstream pre-congestion without its
   action being logged.
   So network operators can deal with offending networks at the human
   level, out of band.  As a last resort, perhaps where the ingress
   gateway address seems to have been spoofed in the signalling,
   packets can be dropped.  Drops could be focused on just sufficient
   packets in misbehaving flows to remove the negative bias while doing
   minimal harm.

   A future version of this memo may define a control message that
   could be used to notify an offending ingress gateway (possibly via
   the egress gateway) that it is sending persistently negative flows.
   However, we are aware that such messages could be used to test the
   sensitivity of the detection system, so currently we prefer silent
   sanctions.

   An extreme scenario would be where an ingress gateway (or set of
   gateways) mounted a DoS attack against another network.  If their
   traffic caused sufficient congestion to lead to drop, but they
   understated path congestion to avoid penalties for causing high
   congestion, the preferential drop recommendations in Section 4.3.4
   would at least ensure that these flows would always be dropped
   before honest flows.

5.6. Border Mechanisms

5.6.1. Border Accounting Mechanisms

   One of the main design goals of re-PCN was for border security
   mechanisms to be as simple as possible; otherwise they would become
   the pinch-points that limit the scalability of the whole
   internetwork.  As the title of this memo suggests, we want to avoid
   per-flow processing at borders.  We also want to keep to passive
   mechanisms that can monitor traffic in parallel to forwarding,
   rather than having to filter traffic inline--in series with
   forwarding.  As data rates continue to rise, we suspect that
   all-optical interconnection between networks will soon be a
   requirement.
   So we want to avoid any new need for buffering (even though border
   filtering is current practice for other reasons, we don't want to
   make it even less likely that we will ever get rid of it).

   So far, we have been able to keep the border mechanisms simple,
   despite having had to harden them against some subtle attacks on the
   re-PCN design.  The mechanisms are still passive and avoid per-flow
   processing, although we do use filtering as a fail-safe, to
   temporarily shield against extreme events in other networks, such as
   accidental misconfigurations (Section 5.6.3).

   The basic accounting mechanism at each border interface simply
   involves accumulating the volume of packets with positive worth
   (Re-PCT-Echo and FNE) and subtracting the volume of those with
   negative worth (AM(-1) and TM(-1)).  Even though this mechanism
   takes no regard of flows, over an accounting period (say a month)
   this subtraction will account for the downstream congestion caused
   by all the flows traversing the interface, wherever they come from
   and wherever they go to.  The two networks can agree to use this
   metric however they wish to determine some congestion-related
   penalty against the upstream network (see Section 5.3 for examples).
   Although the algorithm could hardly be simpler, it is spelled out in
   pseudo-code in Appendix A.2.1.

   Various attempts to subvert the re-ECN design have been made.  In
   all cases their root cause is persistently negative flows.  But,
   after describing these attacks, we will show that we don't actually
   have to get rid of all persistently negative flows in order to
   thwart the attacks.

   In honest flows, downstream congestion is measured as positive minus
   negative volume.  So if all flows are honest (i.e.
not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion.  But such simple aggregation is only possible if no flows are persistently negative.  Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion.  The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 5.5 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows.  But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion.  But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination.  A number of attacks have been identified where a sender gains from sending dummy traffic, or can attack someone or something using dummy traffic, even though it isn't communicating any information to anyone:

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network.  This is a form of denial of service perpetrated by one network on another.
The preferential drop measures in Section 4.3.4 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, yet they generally don't, because of the grave consequences of being found out.  We are only concerned if re-PCN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour.  It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream.  This attack need not be motivated by a desire to deny service, and indeed need not cause denial of service.  A network's main motivator would most likely be to reduce the penalties it pays to a neighbour.  But the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

Note that we have not included DoS by Internet hosts in the above list of attacks, because we have restricted ourselves to a scenario with edge-to-edge admission control across a PCN-region.  In this case, the edge ingress gateways insulate the PCN-region from DoS by Internet hosts.  Re-ECN resists more general DoS attacks, as discussed in [I-D.briscoe-tsvwg-re-ecn-tcp].

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly.
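The bias that needs correcting can be seen with a few invented numbers.  This is a toy sketch: the per-flow volumes are made up for illustration, and the flow-blind subtraction mirrors the bulk border mechanism described above.

```python
# Invented per-flow volumes (bytes) of positively and negatively
# worthed packets crossing one border during an accounting period.
flows = {
    "honest-1": (300, 100),   # net +200: covers its downstream congestion
    "honest-2": (500, 200),   # net +300
    "cheat":    (50, 400),    # net -350: persistently negative
}

# Flow-blind subtraction, as done by the bulk border mechanism.
aggregate = sum(pos - neg for pos, neg in flows.values())

# What the metric would read if the negative flow were removed.
honest_only = sum(pos - neg for name, (pos, neg) in flows.items()
                  if name != "cheat")

print(aggregate, honest_only)  # 150 500
```

The aggregate still comes out positive (150), but it understates the downstream congestion caused by the honest traffic (500) by the cheat's net negative volume.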
Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border.  It is more important to get an unbiased estimate of their effect than to try to remove them all.  A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix A.2.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other.  This puts all networks on the same side (only with respect to negative flows, of course), rather than pitting them against each other.  The network where a flow goes negative, as well as all the networks downstream, loses out from not being reimbursed for any congestion this flow causes.  So they all have an interest in getting rid of these negative flows.  Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders--they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative.  The problem becomes localised: once a flow goes negative, each of the networks from that point downstream has a small problem, can detect that it has a problem, and can get rid of the problem if it chooses to.  But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible.
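To make the two mechanisms concrete, here is a hedged Python sketch of the flow-blind border meter together with a sampled correction for negative flows.  It illustrates the ideas attributed above to Appendix A.2.1 and A.2.2; it does not reproduce the pseudo-code there, and the codepoint names, flow identifier, and sampling scheme are our assumptions.

```python
import random

POSITIVE = {"Re-PCT-Echo", "FNE"}   # packets of positive worth (+1)
NEGATIVE = {"AM(-1)", "TM(-1)"}     # packets of negative worth (-1)

class BorderMeter:
    """Flow-blind accounting at one border interface."""
    def __init__(self):
        self.volume = 0             # signed bytes over the accounting period

    def on_packet(self, codepoint, size):
        if codepoint in POSITIVE:
            self.volume += size
        elif codepoint in NEGATIVE:
            self.volume -= size     # neutral packets need no processing

class NegativeFlowEstimator:
    """Sampled estimate of the volume contributed by negative flows,
    to be added back to the bulk metric to remove its bias."""
    def __init__(self, sample_rate, seed=None):
        self.sample_rate = sample_rate
        self.balance = {}           # sampled flow-id -> signed byte balance
        self.skipped = set()        # flows deliberately left unsampled
        self.rng = random.Random(seed)

    def on_packet(self, flow_id, codepoint, size):
        if flow_id in self.skipped:
            return
        if flow_id not in self.balance:
            # decide once, on first sight, whether to sample this flow
            if self.rng.random() >= self.sample_rate:
                self.skipped.add(flow_id)
                return
            self.balance[flow_id] = 0
        if codepoint in POSITIVE:
            self.balance[flow_id] += size
        elif codepoint in NEGATIVE:
            self.balance[flow_id] -= size

    def correction(self):
        """Negative volume in the sample, scaled up by the sampling rate."""
        neg = sum(-b for b in self.balance.values() if b < 0)
        return neg / self.sample_rate
```

With a sampling rate of 1 (every flow monitored), a flow sending 300 positive and 100 negative bytes alongside one sending 50 positive and 400 negative bytes leaves the meter at -150, while the estimator's correction of 350 restores the honest flow's net contribution of +200.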
But importantly, complete eradication of negative flows is no longer critical--best endeavours will be sufficient.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished.  If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive.  Such over-zealous push-back is unnecessary and potentially dangerous.  These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far.  If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion.  Rather than have to authenticate such messages, re-PCN has been designed so that flows can be dropped solely on the basis of locally measurable evidence.  A message hinting that a flow should be watched closely to test for negativity is fine.  But not a message claiming that a currently positive flow will go negative later and should therefore be dropped.

5.6.2.  Competitive Routing

With the above penalty system, each domain seems to have a perverse incentive to fake pre-congestion.  For instance, domain 'B' profits from the difference between the penalties it receives at its ingress (its revenue) and those it pays at its egress (its cost).  So if 'B' overstates internal pre-congestion, it seems to increase its profit.
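That apparent arbitrage can be shown with invented numbers.  This is a toy sketch: the proportional penalty rule and the volumes are our assumptions, not part of this memo.  'B' is paid at its ingress for all pre-congestion downstream of that border ('B''s own plus 'C''s), but pays at its egress only for pre-congestion downstream of that border ('C''s), so its margin is exactly its own declared contribution.

```python
def b_margin(b_declared, c_declared, price_per_unit=1.0):
    """B's penalty margin under an invented proportional penalty rule."""
    revenue = price_per_unit * (b_declared + c_declared)  # paid to B at ingress
    cost = price_per_unit * c_declared                    # paid by B at egress
    return revenue - cost

honest = b_margin(b_declared=10, c_declared=5)   # margin == 10
faked = b_margin(b_declared=30, c_declared=5)    # overstating 3x: margin == 30
```

Tripling the declared internal pre-congestion triples 'B''s margin: this is the perverse incentive that competitive routing pressure is needed to keep in check.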
However, we can assume that domain 'A' could bypass 'B', routing through other domains to reach the egress.  So the competitive discipline of least-cost routing can ensure that any domain tempted to fake pre-congestion for profit risks losing _all_ its incoming traffic.  The least congested route would eventually win this competitive game, but only as long as it didn't declare more fake pre-congestion than the next most competitive route.

The competitive effect of interdomain routing might be weaker nearer to the egress.  For instance, 'C' may be the only route 'B' can take to reach the ultimate receiver.  And if 'C' over-penalises 'B', the egress gateway and the ultimate receiver seem to have no incentive to move their terminating attachment to another network, because only 'B' and those upstream of 'B' suffer the higher penalties.  However, we must remember that we are only looking at the money flows at the unidirectional network layer.  There are likely to be all sorts of higher level business models constructed over the top of these low level 'sender-pays' penalties.  For instance, we might expect a session layer charging model where the session originator pays for a pair of duplex flows, one as receiver and one as sender.  Traditionally this has been a common model for telephony, and we might expect it to be used, at least sometimes, for other media such as video.  Wherever such a model is used, the data receiver will be directly affected if its sessions terminate through a network like 'C' that fakes congestion to over-penalise 'B'.  So end-customers will experience a direct competitive pressure to switch to cheaper networks, away from networks like 'C' that try to over-penalise 'B'.

This memo does not need to standardise any particular mechanism for routing based on re-PCN.
Goldenberg et al [Smart_rtg] refer to various commercial products and present their own algorithms for moving traffic between multi-homed routes based on usage charges.  None of these systems require any changes to standard protocols, because the choice between the available border gateway protocol (BGP) routes is based on a combination of local knowledge of the charging regime and local measurement of traffic levels.  If, as we propose, charges or penalties were based on the level of re-PCN measured locally in passing traffic, a similar optimisation could be achieved without requiring any changes to standard routing protocols.

We must be clear that applying pre-congestion-based routing to this admission control system remains an open research issue.  Traffic engineering based on congestion requires careful damping to avoid oscillations, and should not be attempted without adult supervision :)  Mortier & Pratt [ECN-BGP] have analysed traffic engineering based on congestion, but without the benefit of re-ECN or re-PCN they had to add a path attribute to BGP to advertise a route's downstream congestion (actually they proposed that BGP should advertise the charge for congestion, which we believe wrongly embeds into BGP the assumption that the only thing to do with congestion is charge for it).

5.6.3.  Fail-safes

The mechanisms described so far create incentives for rational operators to behave.  That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits).  It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not).  But this approach does not protect against the misconfigurations and accidents of other operators.
Therefore, we propose the following two similar mechanisms at a network's borders to provide "defence in depth":

Highly positive flows:  A small sample of positive packets should be picked randomly as they cross a border interface.  Then subsequent packets matching the same source and destination address and DSCP should be monitored.  If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows:  A small sample of congestion marked packets should be picked randomly as they cross a border interface.  Then subsequent packets matching the same source and destination address and DSCP should be monitored.  If the RE blanking fraction minus the congestion marking fraction is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear in the sample more quickly, because selection is made randomly solely from positive (or negative) packets.

Note that there is no assumption that _users_ behave rationally.  The system is protected from the vagaries of irrational user behaviour by the ingress gateways, which transform internal penalties into a deterministic admission control mechanism that prevents users from misbehaving, by directly engineered means.

6.  Analysis

The domains in Figure 1 are not expected to be completely malicious towards each other.  After all, we can assume that they are all co-operating to provide an internetworking service to the benefit of each of them and their customers.  Otherwise their routing policies would not interconnect them in the first place.
However, we assume that they are also competitors of each other.  So a network may try to contravene our proposed protocol if it would gain, or make a competitor lose, or both--but only if it can do so without being caught.  Therefore we do not have to consider every possible random attack one network could launch on the traffic of another, given that one network can always drop or corrupt packets that it forwards on behalf of another anyway.

Therefore, we only consider new opportunities for _gainful_ attack that our proposal introduces.  But to a certain extent we can also rely on the defence-in-depth measures we have described (Section 5.6.3), which are intended to mitigate the potential impact if one network accidentally misconfigures the workings of this protocol.

The ingress and egress gateways are shown in the most generic arrangement possible in Figure 1, without any surrounding network.  This allows us to consider more specific cases where these gateways and a neighbouring network are operated by the same player.  As well as cases where the same player operates neighbouring networks, we will also consider cases where the two gateways collude as one player and where the sender and receiver collude as one.  Collusion of other sets of domains is less likely, but we will consider such cases.  In the general case, we will assume none of the nine trust domains across the figure fully trusts any of the others.

As we only propose to change routers within the PCN-region, we assume the operators of networks outside the region will be doing per-flow policing.  That is, we assume the networks outside the PCN-region and the gateways around its edges can protect themselves.
So, given we are proposing to remove flow policing from some networks, our primary concern must be to protect the networks that don't do per-flow policing (the potential `victims') from those that do (the `enemy').  The ingress and egress gateways are the only way the outer enemy can get at the middle victim, so we can consider the gateways as the representatives of the enemy as far as domains 'A', 'B' and 'C' are concerned.  We will call this trust scenario `edges against middles'.

Earlier in this memo, we outlined the classic border rate policing problem (Section 3).  It will now be useful to reiterate the motivations that are the root cause of the problem.  The more reservations a gateway can allow, the more revenue it receives.  The middle networks want the edges to comply with the admission control protocol when they become so congested that their service to others might suffer.  The middle networks also want to ensure the edges cannot steal more service from them than they are entitled to.

In the context of this `edges against middles' scenario, the re-PCN protocol has two main effects:

o  The more pre-congestion there is on a path across the PCN-region, the higher the ingress gateway must declare downstream pre-congestion.

o  If the ingress gateway does not declare downstream pre-congestion high enough on average, it will `hit the ground before the runway', going negative and triggering sanctions, either directly against the traffic or against the ingress gateway at a management level.

An executive summary of our security analysis can be stated in three parts, distinguished by the type of collusion considered.

Neighbour-only Middle-Middle Collusion:  Here either there is no collusion, or collusion is limited to neighbours in the feedback loop.  In other words, two neighbouring networks can be assumed to act as one.
Or the egress gateway might collude with domain 'C'.  Or the ingress gateway might collude with domain 'A'.  Or the ingress and egress gateways might collude with each other.

In these cases where only neighbours in the feedback loop collude, we conclude that all parties have a positive incentive to declare downstream pre-congestion truthfully, and the ingress gateway has a positive incentive to invoke admission control when congestion rises above the admission threshold in any network in the region (including its own).  No party has an incentive to send more traffic than declared in reservation signalling (even though only the gateways read this signalling).  In short, no party can gain at the expense of another.

Non-neighbour Middle-Middle Collusion:  In the case of other forms of collusion between middle networks (e.g. between domains 'A' and 'C'), it would be possible for, say, 'A' and 'C' to create a tunnel between themselves so that 'A' would gain at the expense of 'B'.  But 'C' would then lose the gain that 'A' had made.  Therefore the value to 'A' and 'C' of colluding to mount this attack seems questionable.  It is made more questionable because the attack can be statistically detected by 'B' using the second `defence in depth' mechanism mentioned already.  Note that 'C' can defend itself from being attacked through a tunnel by treating the tunnel end point as a direct link to a neighbouring network (e.g. as if 'A' were a neighbour of 'C', via the tunnel), which falls back to the safety of the neighbour-only scenario.

Middle-Edge Collusion:  Collusion between networks or gateways within the PCN-region and networks or users outside the region has not yet been fully analysed.  The presence of full per-flow policing at the ingress gateway seems to make this a less likely source of a successful attack.
{ToDo: Due to lack of time, the full write-up of the security analysis is deferred to the next version of this memo.}

Finally, it is well known that the best person to analyse the security of a system is not its designer.  Therefore, our confident claims must be hedged with doubt until others, with perhaps a greater incentive to break the system, have mounted a full analysis.

7.  Incremental Deployment

We believe ECN has so far not been widely deployed because it requires both end-system and widespread network deployment just to achieve a marginal improvement in performance.  The ability to offer a new service (admission control) would be a much stronger driver for ECN deployment.

As stated in the introduction, the aim of this memo is to "Design in security from the start" when admission control is based on pre-congestion notification.  The proposal has been designed so that security can be added some time after first deployment, but only if the PCN wire protocol encoding is defined with the foresight to accommodate the extended set of codepoints defined in this document.  Given that admission control based on pre-congestion notification requires few changes to standards, it should be deployable fairly soon.  However, re-PCN requires a change to IP, which may take a little longer :)

We expect that initial deployments of PCN-based admission control will be confined to single networks, or to clubs of networks that trust each other.  The proposal in this memo will only become relevant once networks with conflicting interests wish to interconnect their admission controlled services, but without the scalability constraints of per-flow border policing.  It will not be possible to use re-PCN, even in a controlled environment between consenting operators, unless it is standardised into IP.
Given that the IPv4 header has limited space for further changes, current IESG policy [RFC4727] is not to allow experimental use of codepoints in the IPv4 header, because whenever an experiment isn't taken up, the space it used tends to be impossible to reclaim.  Therefore, for IPv4 at least, we will need to find a way to run an experiment so that the header fields it uses can be reclaimed if the experiment is not a success.

If PCN-based admission control is deployed before re-PCN is standardised into IP, wherever a network (or club of networks) connects to another network (or club of networks) with conflicting interests, they will place a gateway between the two regions that does per-flow rate policing and admission control.  If re-PCN is eventually standardised into IP, it will be possible for these separate regions to upgrade all their ingress gateways to support re-PCN before removing the per-flow policing gateways between them.  Given the edge-to-edge deployment model of PCN-based admission control, it is reasonable to expect that incremental deployment of re-PCN will be feasible on a domain-by-domain basis, without needing to cater for partial deployment of re-PCN in just some of the gateways around one PCN-domain.

Nonetheless, if the upgrade of one ingress gateway is accidentally overlooked, the RE flag has been defined the safe way round for the default legacy behaviour (leaving RE cleared as "0").  A legacy ingress will appear to be declaring a high level of pre-congestion into the aggregate.  The fail-safe border mechanism in Section 5.6.3 might trigger management alarms (which would help in tracking down the need to upgrade the ingress), but all packets would continue to be delivered safely, because overstatement of downstream congestion requires no sanction.
Only the ingress edge gateways around a PCN-region have to be upgraded to add re-PCN support, not interior routers.  It is also necessary to add the mechanisms that monitor re-PCN to secure a network against misbehaving gateways and networks.  Specifically, these are the border mechanisms (Section 5.6) and the mechanisms to sanction dishonest marking (Section 5.5).

We also RECOMMEND adding improvements to forwarding on interior routers (Section 4.3.4).  But the system works whether all, some or none are upgraded, so interior routers may be upgraded in a piecemeal fashion at any time.

8.  Design Choices and Rationale

The primary insight of this work is that downstream congestion is the metric that would be most useful to control an internetwork, and particularly to police how one network responds to the congestion it causes in a remote network.  This is the problem that has previously made it so hard to provide scalable admission control.

The case for using re-feedback (a generalisation of re-ECN) to police congestion response and provide QoS is made in [Re-fb].  Essentially, the insight is that congestion is a factor that crosses layers, from the physical upwards.  Therefore re-feedback polices congestion as it crosses the physical interface between networks.  This is achieved by bringing information about congestion of resources later on the path to the interface, rather than trying to deal with congestion where it happens by examining the notoriously unreliable source address in packets.  Then congestion crossing the physical interface at a border can be policed at the interface, rather than policing the congestion on packets that claim to come from an address (which may be spoofed).  Also, re-feedback works in the network layer independently of other layers--despite its name, re-feedback does not actually require feedback.
It makes a source act conservatively before it gets feedback.

On the subject of lack of feedback, the feedback not established (FNE) codepoint is motivated by arguments for a state set-up bit in IP to prevent state exhaustion attacks.  This idea was first put forward informally by David Clark and developed by Handley and Greenhalgh in [Steps_DoS].  The idea is that network layer datagrams should signal explicitly when they require state to be created in the network layer or the layer above (e.g. at flow start).  Then a node can refuse to create any state unless a datagram declares this intent.  We believe the proposed FNE codepoint serves the same purpose as the proposed state set-up bit, but it has been overloaded with a more specific purpose, using it on more packets than just the first in a flow, but never fewer (i.e. it is idempotent).  In effect, the FNE codepoint serves the purpose of a `soft-state set-up codepoint'.

The re-feedback paper [Re-fb] also makes the case for converting the economic interpretation of congestion into a hard engineering mechanism, which is the basis of the approach used in this memo.  The admission control gateways around the PCN-region use hard engineering, not incentives, to prevent end users from sending more traffic than they have reserved.  Incentive-based mechanisms are only used between networks, because networks can be expected to respond to incentives more rationally than end-users can.  However, even then, a network can use fail-safes to protect itself from excessively unusual behaviour by neighbouring networks, whether due to accidental misconfiguration or malicious intent.

The guiding principle behind the incentive-based approach used between networks is that any gain from subverting the protocol should be precisely neutralised, rather than punished.
If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

The re-feedback paper also makes the case against the use of congestion charging to police congestion if it is based on classic feedback (where only upstream congestion is visible to network elements).  It argues this would open up receiving networks to `denial of funds' attacks and would require end users to accept dynamic pricing (which few would).

Re-PCN has been deliberately designed to simplify policing at the borders between networks.  These trust boundaries are the critical pinch-points that will limit the scalability of the whole internetwork unless the overall design minimises the complexity of security functions at these borders.  The border mechanisms described in this memo run passively, in parallel to data forwarding, and they do not require per-flow processing.

9.  Security Considerations

This whole memo concerns the security of a scalable admission control system, in particular the analysis section (Section 6).  Below, some specific security issues are mentioned that did not belong elsewhere, or that comment on the overall robustness of the security provided by the design.

Firstly, we must repeat the statement of applicability in the analysis: we only consider new opportunities for _gainful_ attack that our proposal introduces, particularly if the attacker can avoid being identified.  Despite only involving a few bits, there is sufficient complexity in the whole system that there are probably numerous possibilities for other attacks.  However, as far as we are aware, none reaps any benefit for the attacker.
For instance, it would be possible for a downstream network to remove the congestion markings introduced by an upstream network, but it would only lose out on the penalties it could apply to a downstream network.

When one network forwards a neighbouring network's traffic, it will always be possible to cause damage by dropping or corrupting it.  Therefore we do not believe networks would set their routing policies to interconnect in the first place if they didn't trust the other networks not to damage their traffic arbitrarily.

Having said this, we do want to highlight some of the weaker parts of our argument.

o  We have argued that networks will be dissuaded from faking congestion marking by the possibility that upstream networks will route round them.  As we have said, these arguments are based on fairly delicate assumptions and will remain fairly tenuous until proved in practice, particularly close to the egress, where less competitive routing is likely.

o  Given the congestion feedback system is piggy-backed on flow signalling, which can be fairly infrequent, sanctions may not be appropriate until a flow has been persistently negative for perhaps 20s.  This may allow brief attacks to go unpunished.  However, vulnerability to brief attacks may be reduced if the egress triggers asynchronous feedback when the congestion level on an aggregate has risen sufficiently since the last feedback, rather than waiting for the next opportunity to piggy-back on a signal.

o  We should also point out that the approach in this memo was only designed to be robust for admission control.  We do not claim the incentives will always be strong enough to force correct flow termination behaviour.  This is because a user will tend to perceive much greater loss in value if a flow is terminated than if admission is denied at the start.
However, in general the incentives for correct flow termination are similar to those for admission control.

Finally, it may seem that the 8 codepoints made available by extending the ECN field with the RE flag have been used rather wastefully.  In effect, the RE flag has been used as an orthogonal single bit in nearly all cases, the only exception being when the ECN field is cleared to "00".  The mapping of the codepoints in an earlier version of this proposal used the codepoint space more efficiently, but the scheme became vulnerable to a network operator focusing its congestion marking so as to mark more positive than neutral packets in order to reduce its penalties (see Appendix B of [I-D.briscoe-tsvwg-re-ecn-tcp]).

With the scheme as now proposed, once the RE flag is set or cleared by the sender or its proxy, it should not be written by the network, only read.  So the gateways can detect if any network maliciously alters the RE flag.  IPSec AH integrity checking does not cover the IPv4 option flags (they were considered mutable--even the one we propose using for the RE flag, which was `currently unused' when IPSec was defined).  But it would be sufficient for a pair of gateways to make random checks on whether the RE flag was the same when it reached the egress gateway as when it left the ingress.  Indeed, if IPSec AH had covered the RE flag, any network intending to alter sufficient RE flags to make a gain would have focused its alterations on packets without authentication headers (AHs).

Therefore, no cryptographic algorithms have been exploited in the making of this proposal.

10.  IANA Considerations

This memo includes no request to IANA.

11.  Conclusions

This memo solves the classic problem of making flow admission control scale to any size network.
It builds on a technique called PCN, which uses Diffserv within a
domain and pre-congestion notification feedback to control admission
into each network path across the domain [I-D.ietf-pcn-architecture].

Without PCN, Diffserv requires over-provisioning that must grow
linearly with network diameter to cater for variation in the traffic
matrix.  However, even with PCN, multiple network domains can only
join together into one larger PCN region if all domains trust each
other to comply with the protocols, invoking admission control and
flow termination when requested.  Domains could join together and
still police flows at their borders by requiring reservation
signalling to touch each border and only using PCN internally to each
domain.  But the per-flow processing at borders would still limit
scalability.

Instead, this memo proposes a technique called re-PCN which enables a
PCN region to extend across multiple domains, without unscalable per-
flow processing at borders, and still without the need for linear
growth in capacity over-provisioning as the hop-diameter of the
Diffserv region grows.

We propose that the congestion feedback used for PCN-based admission
control should be re-echoed into the forward data path, by making a
trivial modification to the ingress gateway.  We then explain how the
resulting downstream pre-congestion metric in packets can be
monitored in bulk at borders to sufficiently emulate flow-rate
policing.

We claim the result of combining these two approaches is an admission
control system that scales to any size network _and_ any number of
interconnected networks, even if they all act in their own interests.
This proposal aims to convince its readers to "design in security
from the start", by ensuring the PCN wire protocol encoding can
accommodate the extended set of codepoints defined in this document,
even if per-flow policing is used at first rather than the bulk
border policing described here.  This way, we will not build
ourselves tomorrow's legacy problem.

Re-echoing congestion feedback is based on a principled technique
called re-ECN [I-D.briscoe-tsvwg-re-ecn-tcp], designed to add
accountability for causing congestion to the general-purpose IP
datagram service.  Re-ECN proposes to consume the last completely
unused bit in the basic IPv4 header; in IPv6 it uses an extension
header.

12.  Acknowledgements

All the following have given helpful comments either on re-PCN or on
relevant parts of re-ECN that re-PCN uses: Arnaud Jacquet, Alessandro
Salvatori, Steve Rudkin, David Songhurst, John Davey, Ian Self,
Anthony Sheppard, Carla Di Cairano-Gilfedder (BT), Mark Handley (who
identified the excess canceled packets attack), Stephen Hailes, Adam
Greenhalgh (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef
Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill
Lehr, Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy
traffic attacks), Sally Floyd (ICIR), and participants in the
CFP/CRN Inter-Provider QoS, Broadband and DoS-Resistant Internet
working groups.

13.  Comments Solicited

Comments and questions are encouraged and very welcome.  They can be
addressed to the IETF Congestion and Pre-Congestion Notification
working group's mailing list, and/or to the author(s).

14.  References

14.1.  Normative References

[I-D.briscoe-tsvwg-ecn-tunnel]
           Briscoe, B., "Layered Encapsulation of Congestion
           Notification", draft-briscoe-tsvwg-ecn-tunnel-01 (work in
           progress), July 2008.
[I-D.briscoe-tsvwg-re-ecn-tcp]
           Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
           "Re-ECN: Adding Accountability for Causing Congestion to
           TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-06 (work in
           progress), August 2008.

[I-D.eardley-pcn-marking-behaviour]
           Eardley, P., "Marking behaviour of PCN-nodes",
           draft-eardley-pcn-marking-behaviour-01 (work in progress),
           June 2008.

[I-D.moncaster-pcn-baseline-encoding]
           Moncaster, T., Briscoe, B., and M. Menth, "Baseline
           Encoding and Transport of Pre-Congestion Information",
           draft-moncaster-pcn-baseline-encoding-02 (work in
           progress), July 2008.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
           Network Element Service", RFC 2211, September 1997.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
           of Explicit Congestion Notification (ECN) to IP",
           RFC 3168, September 2001.

[RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
           J., Courtney, W., Davari, S., Firoiu, V., and D.
           Stiliadis, "An Expedited Forwarding PHB (Per-Hop
           Behavior)", RFC 3246, March 2002.

[RFC4774]  Floyd, S., "Specifying Alternate Semantics for the
           Explicit Congestion Notification (ECN) Field", BCP 124,
           RFC 4774, November 2006.

14.2.  Informative References

[CLoop_pol]
           Salvatori, A., "Closed Loop Traffic Policing", Politecnico
           Torino and Institut Eurecom Masters Thesis,
           September 2005.

[ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
           Routeing", Proc Internet Charging and QoS Technology
           Workshop (ICQT'03) pp308--317, September 2003.
[I-D.arumaithurai-nsis-pcn]
           Arumaithurai, M., "NSIS PCN-QoSM: A Quality of Service
           Model for Pre-Congestion Notification (PCN)",
           draft-arumaithurai-nsis-pcn-00 (work in progress),
           September 2007.

[I-D.charny-pcn-single-marking]
           Charny, A., Zhang, X., Faucheur, F., and V. Liatsos,
           "Pre-Congestion Notification Using Single Marking for
           Admission and Termination",
           draft-charny-pcn-single-marking-03 (work in progress),
           November 2007.

[I-D.ietf-nsis-rmd]
           Bader, A., "RMD-QOSM - The Resource Management in Diffserv
           QOS Model", draft-ietf-nsis-rmd-12 (work in progress),
           November 2007.

[I-D.ietf-pcn-architecture]
           Eardley, P., "Pre-Congestion Notification (PCN)
           Architecture", draft-ietf-pcn-architecture-06 (work in
           progress), September 2008.

[I-D.ietf-tsvwg-admitted-realtime-dscp]
           Baker, F., Polk, J., and M. Dolly, "DSCPs for Capacity-
           Admitted Traffic",
           draft-ietf-tsvwg-admitted-realtime-dscp-04 (work in
           progress), February 2008.

[IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
           Quality of Service Interconnect", BT Technology Journal
           (BTTJ) 23(2)171--195, April 2005.

[QoS_scale]
           Reid, A., "Economics and Scalability of QoS Solutions", BT
           Technology Journal (BTTJ) 23(2)97--117, April 2005.

[RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
           Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
           Functional Specification", RFC 2205, September 1997.

[RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
           Data Flows", RFC 2207, September 1997.

[RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
           M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
           ReSerVation Protocol (RSVP) Version 1 Applicability
           Statement Some Guidelines on Deployment", RFC 2208,
           September 1997.

[RFC2747]  Baker, F., Lindell, B., and M.
Talwar, "RSVP Cryptographic Authentication", RFC 2747,
           January 2000.

[RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
           Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
           Felstaine, "A Framework for Integrated Services Operation
           over Diffserv Networks", RFC 2998, November 2000.

[RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
           Congestion Notification (ECN) Signaling with Nonces",
           RFC 3540, June 2003.

[RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
           Internet Protocol", RFC 4301, December 2005.

[RFC4727]  Fenner, B., "Experimental Values In IPv4, IPv6, ICMPv4,
           ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006.

[RFC5129]  Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
           Marking in MPLS", RFC 5129, January 2008.

[RSVP-ECN]
           Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
           Babiarz, J., and K. Chan, "RSVP Extensions for Admission
           Control over Diffserv using Pre-congestion Notification",
           draft-lefaucheur-rsvp-ecn-01 (work in progress),
           June 2006.

[Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
           Salvatori, A., Soppera, A., and M. Koyabe, "Policing
           Congestion Response in an Internetwork Using Re-Feedback",
           ACM SIGCOMM CCR 35(4)277--288, August 2005.

[Smart_rtg]
           Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
           "Optimizing Cost and Performance for Multihoming", ACM
           SIGCOMM CCR 34(4)79--92, October 2004.

[Steps_DoS]
           Handley, M. and A. Greenhalgh, "Steps towards a DoS-
           resistant Internet Architecture", Proc. ACM SIGCOMM
           workshop on Future directions in network architecture
           (FDNA'04) pp 49--56, August 2004.

Appendix A.  Implementation

A.1.
Ingress Gateway Algorithm for Blanking the RE Flag

The ingress gateway receives regular feedback 'PCN-feedback-
information' reporting the fraction of congestion-marked octets for
each aggregate arriving at the egress.  So for each aggregate it
should blank the RE flag on this fraction of octets.  A suitable
pseudo-code algorithm for the ingress gateway is as follows:

====================================================================
for each PCN-capable packet {
    if (RAND(0,1) <= PCN-feedback-information)
        writeRE(0);
    else
        writeRE(1);
}
====================================================================

A.2.  Downstream Congestion Metering Algorithms

A.2.1.  Bulk Downstream Congestion Metering Algorithm

To meter the bulk amount of downstream pre-congestion in traffic
crossing an inter-domain border, an algorithm is needed that
accumulates the size of positive packets and subtracts the size of
negative packets.  We maintain two counters:

V_b:  accumulated pre-congestion volume

B:  total data volume (in case it is needed)

A suitable pseudo-code algorithm for a border router is as follows:

====================================================================
V_b = 0
B = 0
for each PCN-capable packet {
    b = readLength(packet)    /* set b to packet size */
    B += b                    /* accumulate total volume */
    if (readEPCN(packet) == Re-PCT-Echo ||
        readEPCN(packet) == FNE) {
        V_b += b              /* increment... */
    } else if (readEPCN(packet) == AM(-1) ||
               readEPCN(packet) == TM(-1)) {
        V_b -= b              /* ...or decrement V_b... */
    }                         /* ...depending on EPCN field */
}
====================================================================

At the end of an accounting period this counter V_b represents the
pre-congestion volume that penalties could be applied to, as
described in Section 5.3.
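For concreteness, the bulk metering loop above can be sketched in
Python.  This is an illustrative translation only: packets are
modelled here as (marking, length) tuples, and the marking names
simply mirror the pseudo-code; a real border router would read the
marking from the EPCN field of each forwarded packet.

```python
# Illustrative sketch of the bulk downstream congestion meter.
# Packets are modelled as (marking, length) tuples; the marking names
# mirror the pseudo-code above and are not a real packet-capture API.

POSITIVE = {"Re-PCT-Echo", "FNE"}   # markings that increment V_b
NEGATIVE = {"AM(-1)", "TM(-1)"}     # markings that decrement V_b

def meter(packets):
    """Return (V_b, B): accumulated pre-congestion volume and total
    data volume, in bytes, over one accounting period."""
    V_b = 0
    B = 0
    for marking, length in packets:
        B += length                  # accumulate total volume
        if marking in POSITIVE:
            V_b += length            # positive packets add...
        elif marking in NEGATIVE:
            V_b -= length            # ...negative packets subtract
        # neutral markings leave V_b unchanged
    return V_b, B
```

For example, a positive 1500-byte packet, a negative 500-byte packet
and a neutral 1000-byte packet would leave V_b = 1000 and B = 3000.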
For instance, the accumulated volume of pre-congestion through a
border interface over a month might be V_b = 5 TB (terabyte = 10^12
byte).  This might have resulted from an average downstream pre-
congestion level of 0.001% on an accumulated total data volume of
B = 500 PB (petabyte = 10^15 byte).

A.2.2.  Inflation Factor for Persistently Negative Flows

The following process is suggested to complement the simple algorithm
above, in order to protect against the various attacks from
persistently negative flows described in Section 5.6.1.  As explained
in that section, the most important first step is to estimate the
contribution of persistently negative flows to the bulk volume of
downstream pre-congestion, and to inflate this bulk volume as if
these flows weren't there.  The process below has been designed to
give an unbiased estimate, but it may be possible to define other
processes that achieve similar ends.

While the simple metering algorithm (Appendix A.2.1) is counting the
bulk of traffic over an accounting period, the meter should also
select a subset of the whole flow ID space that is small enough to
measure realistically but large enough to give a representative
sample.  Many different samples of different subsets of the ID space
should be taken at different times during the accounting period,
preferably covering the whole ID space.  During each sample, the
meter should count the volume of positive packets and subtract the
volume of negative packets, maintaining a separate account for each
flow in the sample.  Each sample should run a lot longer than the
large majority of flows, to avoid a bias from missing the starts and
ends of flows, which tend to be positive and negative respectively.
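The per-flow accounting within one sample might be sketched as
follows.  This is a hypothetical illustration, assuming the same
(flow_id, marking, length) packet model as the metering sketch above;
the in_sample predicate and helper names are inventions for this
example, not part of the draft.

```python
# Hypothetical sketch of per-flow accounting within one sample of the
# flow-ID space.  Packets are (flow_id, marking, length) tuples; the
# in_sample predicate selects the sampled subset of the flow-ID space.

POSITIVE = {"Re-PCT-Echo", "FNE"}
NEGATIVE = {"AM(-1)", "TM(-1)"}

def sample_accounts(packets, in_sample):
    """Signed volume account per sampled flow: positive packets add
    to a flow's account, negative packets subtract from it."""
    accounts = {}
    for flow_id, marking, length in packets:
        if not in_sample(flow_id):
            continue                 # flow outside this sample
        if marking in POSITIVE:
            accounts[flow_id] = accounts.get(flow_id, 0) + length
        elif marking in NEGATIVE:
            accounts[flow_id] = accounts.get(flow_id, 0) - length
    return accounts

def sample_totals(accounts):
    """Return (V_bI, V_fI): the total over all accounts in the
    sample, and the total excluding flows with a negative account."""
    V_bI = sum(accounts.values())
    V_fI = sum(v for v in accounts.values() if v >= 0)
    return V_bI, V_fI
```

Accumulating these per-sample totals over many samples yields the
quantities from which an unbiased inflation factor can be formed.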
Once the accounting period finishes, the meter should calculate the
total of the accounts V_{bI} for the subset of flows I in the sample,
and the total of the accounts V_{fI} excluding flows with a negative
account from the subset I.  Then the weighted mean over all these
samples should be taken:

   a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

If V_b is the result of the bulk accounting algorithm over the
accounting period (Appendix A.2.1), it can be inflated by this factor
a_S to give a good unbiased estimate of the volume of downstream
congestion over the accounting period, a_S * V_b, without being
polluted by the effect of persistently negative flows.

A.3.  Algorithm for Sanctioning Negative Traffic

{ToDo: Write up algorithms similar to Appendix E of
[I-D.briscoe-tsvwg-re-ecn-tcp] for the negative flow monitor with
flow management algorithm and the variant with bounded flow state.}

Author's Address

Bob Briscoe
BT & UCL
B54/77, Adastral Park
Martlesham Heath
Ipswich  IP5 3RE
UK

Phone: +44 1473 645196
Email: bob.briscoe@bt.com
URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights.  Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.

Acknowledgment

This document was produced using xml2rfc v1.33 (of
http://xml.resource.org/) from a source in RFC-2629 XML format.