Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Expires: August 31, 2006                               February 27, 2006


        Emulating Border Flow Policing using Re-ECN on Bulk Data
               draft-briscoe-tsvwg-re-ecn-border-cheat-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 31, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   Scaling per flow admission control to the Internet is a hard problem.
   A recently proposed approach combines Diffserv and pre-congestion
   notification (PCN) to provide a service slightly better than Intserv
   controlled load. It scales to networks of any size, but only if
   domains trust each other to comply with admission control and rate
   policing. This memo claims to solve this trust problem without
   losing scalability. It describes bulk border policing that emulates
   per-flow policing with the help of another recently proposed
   extension to ECN, involving re-echoing ECN feedback (re-ECN). With
   only passive, bulk measurements at borders, sanctions can be applied
   against cheating networks.

Status (to be removed by the RFC Editor)

   This memo is posted as an Internet-Draft with the intent to
   eventually progress to informational status. It is envisaged that
   the necessary standards actions to realise the system described would
   sit in three other documents currently being discussed (but not on
   the standards track) in the IETF Transport Area [Re-TCP], [RSVP-ECN]
   & [PCN]. The authors seek comments from the Internet community on
   whether combining PCN and re-ECN is a sufficient solution to the
   admission control problem.

Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol for an RSVP Transport
     4.1.  Protocol Overview
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or
           v6)
     4.3.  Protocol Operation
     4.4.  Aggregate Bootstrap
     4.5.  Flow Bootstrap
   5.  Emulating Border Policing with Re-ECN
     5.1.  Policing Overview
     5.2.  Pre-requisite Contractual Arrangements
     5.3.  Emulation of Per-Flow Rate Policing: Rationale and
           Limits
     5.4.  Policing Dishonest Marking
     5.5.  Competitive Routing
     5.6.  Fail-safes
   6.  Analysis
   7.  Extensions
   8.  Design Choices and Rationale
   9.  IANA Considerations
   10. Security Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1. Normative References
     14.2. Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE bit
     A.2.  Bulk Downstream Congestion Metering Algorithm
     A.3.  Algorithm for Sanctioning Negative Traffic
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

   The Internet community largely lost interest in the Intserv
   architecture after it was clarified that it would be unlikely to
   scale to the whole Internet [RFC2208]. Although Intserv mechanisms
   proved impractical, the services it aimed to offer are still very
   much required.

   A recently proposed approach [CL-arch] combines Diffserv and pre-
   congestion notification (PCN) to provide a service slightly better
   than Intserv controlled load [RFC2211]. It scales to any size
   network, but only if domains trust each other to comply with
   admission control and rate policing. This memo describes border
   policing measures to sanction networks that cheat each other. The
   approach provides a sufficient emulation of flow rate policing at
   trust boundaries but without per-flow processing. The emulation is
   not perfect, but it is sufficient to ensure that the punishment is at
   least proportionate to the severity of the cheat.

   The aim is to be able to claim that controlled load service can scale
   to any number of endpoints, even though such scaling must take
   account of the increasing numbers of networks and users who may all
   have conflicting interests. To achieve such scaling, this memo
   combines two recent proposals, both of which it briefly recaps:

   o  A framework for admission control over Diffserv using pre-
      congestion notification [CL-arch] describes how bulk pre-
      congestion notification on routers within an edge-to-edge Diffserv
      region can emulate the precision of per-flow admission control to
      provide controlled load service without unscalable per-flow
      processing;

   o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP].
      The trick that
      addresses cheating at borders is to recognise that border policing
      is mainly necessary because cheating upstream networks will admit
      traffic when they shouldn't only as long as they don't directly
      experience the downstream congestion their misbehaviour can cause.
      The re-ECN protocol ensures upstream nodes honestly declare
      expected downstream congestion in all forwarded packets, which we
      then use to emulate border policing.

   Rather than the end-to-end arrangement used when re-ECN was specified
   for the TCP transport [Re-TCP], this memo specifies re-ECN in an
   edge-to-edge arrangement, making it applicable to the Diffserv
   admission control scenario in the framework. Also, rather than using
   a TCP transport for regular congestion feedback, this memo specifies
   re-ECN using RSVP as the transport. We use the proposed minor
   extension of RSVP that allows it to carry congestion feedback
   [RSVP-ECN], which is much less frequent but more precise than TCP.

   Of course, network operators may choose to process per-flow
   signalling at their borders for their own reasons, such as per-flow
   accounting. But the goal of this document is to show that per-flow
   processing at borders is no longer necessary in order to provide
   end-to-end QoS using flow admission control. To be clear, we are
   absolutely opposed to standardisation of technology that embeds
   particular business models into the Internet. Our aim here is to
   provide a new metric (downstream congestion) at trust boundaries.
   Given the well-known significance of congestion in economics,
   operators can then use this new metric in their interconnection
   contracts if they choose. This will enable competitive evolution of
   new business models (for examples see [IXQoS]), alongside more
   traditional models that depend on more costly per-flow processing at
   borders.

   We specify this protocol solution in detail in Section 4, after
   specifying the inter-domain policing problem more precisely and
   briefly recapping the framework for providing admission control using
   pre-congestion notification in Section 3.

   Having described the solution, this memo continues as follows: {ToDo:
   }

2.  Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  The Problem

3.1.  The Traditional Per-flow Policing Problem

   If we claim to be able to emulate per-flow policing with bulk
   policing at trust boundaries, we need to know exactly what we are
   emulating. So, even though we expect it to become a historic
   practice, we will start from the traditional scenario with per-flow
   policing at trust boundaries to explain why it has always been
   considered necessary.

   To be able to take advantage of a reservation-based service such as
   controlled load, a source must reserve resources using a signalling
   protocol such as RSVP [RFC2205]. But, even if the source is
   authorised and admitted at the flow level, it cannot necessarily be
   trusted to send packets within the rate profile it requested. For
   instance, without data rate policing, a source could reserve
   resources for an 8kbps audio flow but transmit a 6Mbps video (theft
   of service). More subtly, the sender could generate bursts that were
   outside the profile it had requested.

   In traditional architectures, per-flow packet rate-policing is
   expensive and unscalable but, without it, a network is vulnerable to
   such theft of service (whether malicious or accidental). Perhaps
   more importantly, if flows are allowed to send more data than they
   were permitted, the ability of admission control to give assurances
   to other flows will break.

   A signalled request refers to a flow of packets by its flow ID tuple
   (filter spec [RFC2205]) (or its security parameter index (SPI)
   [RFC2207] if port numbers are hidden by IPsec encryption). But
   merely opening a pin-hole for packets that match an admitted flow ID
   is an insufficient policing mechanism. The packet rate must also be
   policed to keep the flow within the requested flow spec [RFC2205].

   Just as sources need not be trusted to keep within their requested
   flow spec, whole networks might also try to cheat. We will now set
   up a concrete scenario to illustrate such cheats. Imagine
   reservations for unidirectional flows from senders, through at least
   two networks, an edge network and its downstream transit provider.
   Imagine the edge network charges its retail customers per reservation
   but also has to pay its transit provider a charge per reservation.
   Typically, both its selling and buying charges might depend on the
   duration and rate of each reservation. The levels of the actual
   selling and buying prices are irrelevant to our discussion (most
   likely the network will sell at a higher price than it buys, of
   course).

   A cheating ingress network could systematically reduce the size of
   its retail customers' reservation signalling requests before
   forwarding them to its transit provider (and systematically reinstate
   the responses on the way back). It would then receive an honest
   income from its upstream retail customer but only pay for
   fraudulently smaller reservations downstream. Equivalently, a
   cheating ingress network may feed the traffic from a number of flows
   into an aggregate reservation over the transit that is smaller than
   the total of all the flows. Because of these fraud possibilities, in
   traditional QoS reservation architectures the downstream network
   polices at each border.
   The policer checks that the actual sent data
   rate of each flow is within the signalled reservation.

   Reservation signalling could be authenticated end to end, but this
   wouldn't prevent the aggregation cheat just described. For this
   reason, and to avoid the need for a global PKI, signalling integrity
   is typically only protected on a hop-by-hop basis [RFC2747].

   A variant of the above cheat is where a router in an honest
   downstream network denies admission to a new reservation, but a
   cheating upstream network still admits the flow. For instance, the
   networks may be using Diffserv internally, but Intserv admission
   control at their borders [RFC2998]. The cheat would only work if
   they were using bulk Diffserv traffic policing at their borders,
   perhaps to avoid the cost/complexity of Intserv border policing. As
   far as the cheating upstream network is concerned, it gets the
   revenue from the reservation, but it doesn't have to pay any
   downstream wholesale charges and the congestion is in someone else's
   network. The cheating network may calculate that most of the flows
   affected by congestion in the downstream network aren't likely to be
   its own. It may also calculate that the downstream router is
   probably not actually congested, but rather it is denying admission
   to new flows to protect bandwidth assigned to other lower priority
   services.

   To summarise, in traditional reservation signalling architectures, if
   a network cannot trust a neighbouring upstream network to rate-police
   each reservation, it has to check for itself that the data fits
   within each of the reservations it has admitted.

3.2.  Generic Scenario

   We will now describe a generic internetworking scenario that we will
   use to describe and to test our bulk policing proposal. It consists
   of a number of networks and endpoints that do not fully trust each
   other to behave.
   In Section 6 we will tie down exactly what we mean
   by partial trust, and we will consider the various combinations where
   some networks do not trust each other and others are colluding
   together.

    _    ___      _____________________________________       ___    _
   | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
   | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
   | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
   | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |  | |
   | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |==| |
   | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |  | |
   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
   | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
   |_|  |___|    |_____________________________________|     |___|  |_|

   Sx   Ingress              Diffserv region                Egress  Rx
   End  Access                                              Access  End
   Host Network                                             Network Host
                <-------- edge-to-edge signalling ------->
                         (for admission control)

   <-------------------end-to-end QoS signalling protocol------------->

       Figure 1: Generic Scenario (see text for explanation of terms)

   An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1)
   connect the interior Diffserv region to the edge access networks
   where routers (not shown) use per-flow reservation processing.
   Within the Diffserv region are three interior domains, A, B and C, as
   well as the inward facing interfaces of the ingress and egress
   gateways. An ingress and egress border router (BR) is shown
   interconnecting each interior domain with the next. There may be
   other interior routers (not shown) within each interior domain.

   In two paragraphs we now briefly recap how pre-congestion
   notification is intended to be used to control flow admission to a
   large Diffserv region. The first paragraph describes data plane
   functions and the second describes signalling in the control plane.
   We omit many details from [CL-arch] including behaviour during
   routing changes. For brevity here we assume other flows are already
   in progress across a path through the Diffserv region before a new
   one arrives, but how bootstrap works is described in Section 4.4.

   Figure 1 shows a single simplex reserved flow from the sending (Sx)
   end host to the receiving (Rx) end host. The ingress gateway polices
   incoming traffic within its admitted reservation and remarks it to
   turn on an ECN-capable codepoint [RFC3168] and the controlled
   load (CL) Diffserv codepoint. Together, these codepoints define
   which traffic is entitled to the enhanced scheduling of the CL
   behaviour aggregate on routers within the Diffserv region. The CL
   PHB of interior routers consists of a scheduling behaviour and a new
   ECN marking behaviour that we call 'pre-congestion
   notification' [PCN]. The CL PHB simply re-uses the definition of
   expedited forwarding (EF) [RFC3246] for its scheduling behaviour.
   But it incorporates a new ECN marking behaviour, which sets the ECN
   field of an increasing number of CL packets to the admission marked
   (AM) codepoint as they approach a threshold rate that is lower than
   the line rate. The use of virtual queues ensures that real queues
   have hardly built up any congestion delay.

   The level of marking detected at the egress of the Diffserv region
   is then used by the signalling system in order to determine admission
   control. The end-to-end QoS signalling (e.g.
   RSVP) for a new
   reservation takes one giant hop from ingress to egress gateway,
   because interior routers within the Diffserv region are configured to
   ignore RSVP. The egress gateway holds flow state because it takes
   part in the end-to-end reservation. So it can classify all packets
   by flow and it can identify all flows that have the same previous
   RSVP hop (a CL-region-aggregate). For each CL-region-aggregate of
   flows in progress, the egress gateway maintains a per-packet moving
   average of the fraction of pre-congestion-marked traffic. Once an
   RSVP PATH message for a new reservation has hopped across the
   Diffserv region and reached the destination, an RSVP RESV message is
   returned. As the RESV message passes, the egress gateway piggy-backs
   the relevant pre-congestion level onto it [RSVP-ECN]. Again,
   interior routers ignore the RSVP message, but the ingress gateway
   strips off the pre-congestion level. If the pre-congestion level is
   above a threshold, the ingress gateway denies admission to the new
   reservation; otherwise it returns the original RESV signal back
   towards the data sender.

   Once a reservation is admitted, its traffic will always receive low
   delay service for the duration of the reservation. This is because
   ingress gateways ensure that traffic not under a reservation cannot
   pass into the Diffserv region with the CL DSCP set. So non-reserved
   traffic will always be treated with a lower priority PHB at each
   interior router.

4.  Re-ECN Protocol for an RSVP Transport

4.1.  Protocol Overview

   First we need to recap the way routers accumulate congestion marking
   along a path. Each ECN-capable router marks some packets with CE,
   the marking probability increasing with the length of the virtual
   queue at its egress link [PCN]. With multiple ECN-capable routers on
   a path, the ECN field accumulates the fraction of CE marking that
   each router adds.
   The combined effect of the packet marking of all
   the routers along the path signals congestion of the whole path to
   the receiver. So, for example, if one router early in a path is
   marking 1% of packets and another later in a path is marking 2%,
   flows that pass through both routers will experience approximately 3%
   marking.

   The packets crossing an inter-domain trust boundary within the
   Diffserv region will all have come from different ingress gateways
   and will all be destined for different egress gateways. We will show
   that the key to policing against theft of service is to be able to
   measure expected downstream pre-congestion on the paths between a
   border router and the egress gateways that packets are headed for.

   With the original ECN protocol, if CE markings crossing the border
   had been counted over a period, they would have represented the
   accumulated upstream pre-congestion that had already been experienced
   by those packets. The general idea of re-ECN is for the ingress
   gateway to continuously encode path congestion into the IP header,
   where path means from ingress to egress gateway. Then at any point
   on that path (e.g. between domains A & B in Figure 2 below), IP
   headers can be monitored to subtract upstream congestion from
   expected path congestion in order to give the expected downstream
   congestion still to be experienced until the egress gateway.
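
   As an illustration of the arithmetic in the paragraphs above, the
   following minimal Python sketch first combines per-router marking
   fractions and then estimates the congestion still to come at a
   border. The function names are hypothetical and not taken from any
   of the referenced drafts. The sketch assumes each router marks
   packets independently of the others, which is why the combined
   figure for 1% and 2% comes out slightly below the simple sum of 3%.

```python
def combined_marking(fractions):
    """Fraction of packets marked after traversing every router in turn.

    Assumption: routers mark independently, so a packet stays unmarked
    only if it escapes marking at every router on the path.
    """
    unmarked = 1.0
    for f in fractions:
        unmarked *= 1.0 - f
    return 1.0 - unmarked


def expected_downstream(path, upstream):
    """Approximate downstream congestion as path minus upstream."""
    return path - upstream


# Two congested routers marking 1% and 2% of packets respectively.
path = combined_marking([0.01, 0.02])      # whole path, close to 3%
upstream = combined_marking([0.01])        # observed after the first router
downstream = expected_downstream(path, upstream)
print(f"path={path:.4f} upstream={upstream:.4f} downstream={downstream:.4f}")
```

   The subtraction is only an approximation of what remains
   downstream, but for marking fractions of a few percent the error is
   small compared with the quantities being measured.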

         _____________________________________
       _|__    ______    ______    ______    _|__
      |    |  |  A   |  |  B   |  |  C   |  |    |
      +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+
      |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |
      |Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |
      |G/W |  | |  | |: | |  | |  | |  | |  |G/W |
      +----+  +-+  +-+: +-+  +-+  +-+  +-+  +----+
      |    |  |      |: |      |  |      |  |    |
      |____|  |______|: |______|  |______|  |____|
        |_____________:_______________________|
                      :
        |             :                       |
        |<-upstream-->:<-expected downstream->|
        | congestion  :      congestion       |
        |      u      :      v ~= p - u       |
        |                                     |
        |<--- expected path congestion, p --->|

                      Figure 2: Re-ECN concept

4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

   In this section we define the names of the various codepoints of the
   re-ECN protocol, deferring description of their semantics to the
   following sections. First we recap the re-ECN wire protocol proposed
   in [Re-TCP]. It uses the two bit ECN field broadly as in
   RFC3168 [RFC3168]. It also uses a new re-ECN extension (RE) bit.
   The actual position of the RE bit is different between IPv4 & v6
   headers so we will use an abstraction of the IPv4 and v6 wire
   protocols by just calling it the RE bit. [Re-TCP] proposes using bit
   48 (currently unused) in the IPv4 header for the RE bit, while it
   proposes an ECN extension header for IPv6.

   Unlike the ECN field, the RE bit is intended to be set by the sender
   and remain unchanged along the path, although it can be read by
   network elements that understand the re-ECN protocol. In the
   scenario used in this memo, an ingress gateway changes the setting of
   the RE bit, acting as a proxy for the sender, as permitted in the
   specification of re-ECN.

   Although the RE bit is a separate, single bit field, it can be read
   as an extension to the two-bit ECN field; the three concatenated bits
   in what we will call the extended ECN field (EECN) make eight
   codepoints available.
   When the RE bit setting is "don't care", we
   use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes
   the following six codepoint names for when there is a need to be more
   specific.

   +-------+------------+------+-------------+-------------------------+
   |  ECN  |  RFC3168   |  RE  |   re-ECN    | re-ECN meaning          |
   | field | codepoint  | bit  |  codepoint  |                         |
   +-------+------------+------+-------------+-------------------------+
   |   00  |  Not-ECT   |   0  |    NRECT    | Not re-ECN-capable      |
   |       |            |      |             | transport               |
   |   00  |  Not-ECT   |   1  |      NF     | No feedback             |
   |       |            |      |             |                         |
   |   01  |   ECT(1)   |   0  |   Re-Echo   | Re-echoed congestion    |
   |       |            |      |             | and RECT                |
   |   01  |   ECT(1)   |   1  |     RECT    | re-ECN capable          |
   |       |            |      |             | transport               |
   |   10  |   ECT(0)   |   0  |    --CU--   | Currently unused        |
   |       |            |      |             |                         |
   |   10  |   ECT(0)   |   1  |    --CU--   | Currently unused        |
   |       |            |      |             |                         |
   |   11  |     CE     |   0  |    CE(0)    | Congestion experienced  |
   |       |            |      |             | with Re-Echo            |
   |   11  |     CE     |   1  |    CE(-1)   | Congestion experienced  |
   +-------+------------+------+-------------+-------------------------+

   Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re-
                                   ECN

   As permitted by RFC3168, [PCN] proposes new semantics for the ECN
   codepoints when combined with a Diffserv codepoint (DSCP) that uses
   pre-congestion notification. It also proposes various alternative
   encodings for these semantics, attempting to fit five states into the
   four available ECN codepoints by making various compromises. The
   five states are Not-ECT, ECT (ECN-capable transport), the ECN Nonce,
   Admission Marking (AM) and Pre-emption Marking (PM).

   One of the five states was for the ECN Nonce [RFC3540], but the
   capability we describe in this memo supersedes any need for the
   Nonce. The ECN Nonce is an elegant scheme, but it only allows a
   sending node (or its proxy) to detect suppression of congestion
   marking by a cheating receiver.
   Thus the Nonce requires the sender
   or its proxy to be trusted to respond correctly to congestion. But
   this is precisely the main cheat we want to protect against (as well
   as many others).

   One of the compromises that [PCN] explores ("Alternative 5") leaves
   out support for the ECN Nonce. Therefore we use that one. Then,
   with the addition of the RE bit, the 8 encodings of the extended ECN
   (EECN) field become those defined in the table below. Note that
   these codepoints only take on the semantics in the table below when
   combined with a Diffserv codepoint that the operator has defined as
   supporting pre-congestion notification.

   +--------+-----------+------+-------------+-------------------------+
   |  ECN   |    PCN    |  RE  |   re-ECN    | re-ECN meaning          |
   | field  | codepoint | bit  |  codepoint  |                         |
   +--------+-----------+------+-------------+-------------------------+
   |   00   |  Not-ECT  |   0  |    NRECT    | Not re-ECN-capable      |
   |        |           |      |             | transport               |
   |   00   |  Not-ECT  |   1  |      NF     | No feedback             |
   |        |           |      |             |                         |
   |   01   |   ECT(1)  |   0  |   Re-Echo   | Re-echoed congestion    |
   |        |           |      |             | and RECT                |
   |   01   |   ECT(1)  |   1  |     RECT    | re-ECN capable          |
   |        |           |      |             | transport               |
   |   10   |     AM    |   0  |    AM(0)    | Admission Marking with  |
   |        |           |      |             | Re-Echo                 |
   |   10   |     AM    |   1  |    AM(-1)   | Admission Marking       |
   |   11   |     PM    |   0  |    PM(0)    | Pre-emption Marking     |
   |        |           |      |             | with Re-Echo            |
   |   11   |     PM    |   1  |    PM(-1)   | Pre-emption Marking     |
   +--------+-----------+------+-------------+-------------------------+

   Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre-
                      congestion Notification (PCN)

   For the rest of this memo, we will not distinguish between Admission
   Marking and Pre-emption Marking (unless stated otherwise). We will
   call both "congestion marking". With the above encoding, congestion
   marking can be read to mean any packet with the left-most bit of the
   ECN field set.
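
   To make the encoding in Table 2 concrete, the following minimal
   Python sketch maps an (ECN field, RE bit) pair to its re-ECN
   codepoint name and tests for congestion marking. The dictionary and
   function names are hypothetical, chosen only for illustration. The
   sketch assumes the packet carries a Diffserv codepoint that the
   operator has defined as supporting pre-congestion notification,
   since the semantics below apply only in that case.

```python
# Table 2 as a lookup: (2-bit ECN field, RE bit) -> re-ECN codepoint name.
EECN_PCN_CODEPOINTS = {
    (0b00, 0): "NRECT",
    (0b00, 1): "NF",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "AM(0)",
    (0b10, 1): "AM(-1)",
    (0b11, 0): "PM(0)",
    (0b11, 1): "PM(-1)",
}


def decode_eecn(ecn_field, re_bit):
    """Return the re-ECN codepoint name for a PCN-enabled DSCP."""
    return EECN_PCN_CODEPOINTS[(ecn_field & 0b11, re_bit & 0b1)]


def is_congestion_marked(ecn_field):
    """True for Admission or Pre-emption Marking: left-most ECN bit set."""
    return bool(ecn_field & 0b10)


def is_blanked(re_bit):
    """True when the RE bit is cleared to 0 (re-echoed congestion)."""
    return re_bit == 0


for ecn in range(4):
    for re_bit in (0, 1):
        print(f"{ecn:02b} {re_bit} {decode_eecn(ecn, re_bit):8} "
              f"marked={is_congestion_marked(ecn)}")
```

   A real router would more likely test the bits directly than look up
   names; the table form is used here only because it mirrors Table 2
   row for row.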

All but the "not re-ECN-capable transport" (NRECT) codepoint imply the
presence of an ECN-capable transport. Congested PCN-capable routers
must drop rather than mark packets carrying the NRECT codepoint. Note
that adding PCN-capability to a router will involve checking the RE
bit as well as the ECN field and DSCP before deciding whether to drop
or to mark a packet during congestion. Router implementations might
well append the RE bit to their internal representation of the ECN
field, treating them internally as one 3-bit extended ECN value.

4.3. Protocol Operation

In this section we will give an overview of the operation of the
re-ECN protocol for an RSVP transport, deferring a detailed
specification to the following sections.

The re-ECN protocol involves a simple tweak to the action of the
gateway at the ingress edge of the CL region. In the framework just
described [CL-arch], for each active traffic aggregate across the CL
region (CL-region-aggregate) the ingress gateway will hold a fairly
recent Congestion-Level-Estimate that the egress gateway will have fed
back to it, piggybacked on the signalling that sets up each flow. For
instance, one aggregate might have been experiencing 3% pre-congestion
(that is, congestion marked octets whether Admission Marked or
Pre-emption Marked). In this case, the ingress gateway MUST clear the
RE bit to "0" for the same percentage of octets of CL-packets (3%) and
set it to "1" in the rest (97%). Appendix A.1 gives a simple
pseudo-code algorithm that the ingress gateway may use to do this.

The RE bit is set and cleared this way round for incremental
deployment reasons (see [Re-TCP]). To avoid confusion we will use the
term `blanking' (rather than marking) when the RE bit is cleared to
"0", so we will talk of the `RE blanking fraction' as the fraction of
octets with the RE bit cleared to "0".
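
As an illustration of the blanking behaviour described above, the
following sketch blanks the RE bit on a fraction of octets equal to the
Congestion-Level-Estimate. It is not the pseudo-code of Appendix A.1,
which is not reproduced here; the class name ReBlanker and the
octet-credit approach are assumptions made for this example.

```python
# Illustrative sketch only; not the Appendix A.1 algorithm.
class ReBlanker:
    """Blank the RE bit on a given fraction of octets in an aggregate."""

    def __init__(self, congestion_level_estimate: float):
        self.credit = 0.0  # octets of blanking owed to the aggregate
        self.update_estimate(congestion_level_estimate)

    def update_estimate(self, congestion_level_estimate: float) -> None:
        """Store the latest feedback from the egress (a fraction in 0..1)."""
        if not 0.0 <= congestion_level_estimate <= 1.0:
            raise ValueError("estimate must be a fraction between 0 and 1")
        self.fraction = congestion_level_estimate

    def re_bit(self, packet_octets: int) -> int:
        """Return the RE bit for the next packet: 0 (blanked) or 1."""
        self.credit += packet_octets * self.fraction
        if self.credit >= packet_octets:
            self.credit -= packet_octets
            return 0  # RE blanked: re-echoes congestion
        return 1      # RE set: no re-echo
```

With an estimate of 0.03 and equal-sized packets, roughly one packet
in 33 is blanked, so about 3% of octets carry RE = "0" and the rest
carry RE = "1".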

   ^
   |
   |    RE blanking fraction
3% |    +----------------------------+====+
   |    |                            |    |
2% |    |                            |    |
   |    | congestion marking fraction|    |
1% |    |     +----------------------+    |
   |    |     |                           |
0% +----+=====+---------------------------+------>
        ^ <--A--->  <---B--->   <---C---> ^  domain
        |     ^                      ^    |
     ingress  |                      |  egress
            1.00%                  2.00%  marking fraction

Figure 3: Example Re-ECN Codepoint Marking fractions (Imprecise)

Figure 3 illustrates our example. The horizontal axis represents the
index of each congestible resource (typically queues) along a path
through the Internet. The two superimposed plots show the fraction of
each ECN codepoint observed along this path, assuming two congested
routers somewhere within domains A and C. The table below shows the
downstream pre-congestion measured at various border observation
points along the path. These figures are actually reasonable
approximations derived from more precise formulae given in Appendix A
of [Re-TCP]. The RE bit is not changed by interior routers, so it can
be seen that it acts as a reference against which the congestion
marking fraction can be compared along the path.

+--------------------------+---------------------------------------+
| Border observation point | Approximate Downstream pre-congestion |
+--------------------------+---------------------------------------+
| ingress -- A             | 3% - 0% = 3%                          |
| A -- B                   | 3% - 1% = 2%                          |
| B -- C                   | 3% - 1% = 2%                          |
| C -- egress              | 3% - 3% = 0%                          |
+--------------------------+---------------------------------------+

Note that the ingress determines the RE blanking fraction for each
aggregate using the most recent feedback from the relevant egress,
arriving with each new reservation, or each refresh. These arrive
relatively infrequently compared to the speed with which congestion
changes.
Although this feedback will always be out of date, on average positive
errors will cancel out negative over a sufficiently long duration.

In summary, the network adds pre-congestion marking in the forward
data path, the egress feeds its level back to the ingress in RSVP,
then the ingress gateway re-echoes it into the forward data path by
blanking the RE bit. Hence the name re-ECN. Then at any border within
the Diffserv region, the pre-congestion marking that every passing
packet will be expected to experience downstream can be measured to be
the RE blanking fraction minus the congestion marking fraction.

4.4. Aggregate Bootstrap

When a new reservation PATH message arrives at the egress, if there
are currently no flows in progress from the same ingress, there will
be no state maintaining the current level of pre-congestion marking
for the aggregate. While the reservation signalling continues onward
towards the receiving host, the egress gateway returns an RSVP message
to the ingress with a flag [RSVP-ECN] asking the ingress to send a
specified number of data probes between them. This bootstrap behaviour
is all described in the framework [CL-arch].

However, with our new re-ECN scheme, the ingress does not know what
proportion of the data probes should have the RE bit blanked, because
it has no estimate yet of pre-congestion for the path across the
Diffserv region.

To be conservative, following the guidance for specifying other re-ECN
transports in [Re-TCP], the ingress SHOULD set the NF codepoint of the
extended ECN header in all probe packets (Table 2). As per the
framework, the egress gateway measures the fraction of
congestion-marked probe octets and feeds back the resulting
pre-congestion level to the ingress, piggy-backed on the returning
reservation response (RESV) for the new flow.
Probe packets are identifiable by the egress because they have the
ingress as the source and the egress as the destination in the IP
header.

It may seem inadvisable to expect the NF codepoint to be set on
probes, given that legacy firewalls etc. might discard such packets
(because this flag had no previous legitimate use). However, in the
deployment scenarios envisaged for this admission control framework,
each domain in the Diffserv region has to be explicitly configured to
support the controlled load service. So, before deploying the service,
the operator MUST reconfigure such a misbehaving middlebox to allow
through packets with the RE bit set.

Note that we have said SHOULD rather than MUST for the NF setting
behaviour of the ingress for probe packets. This entertains the
possibility of an ingress implementation having the benefit of other
knowledge of the path, which it re-uses for a newly starting
aggregate. For instance, it may hold cached information from a recent
use of the aggregate that is still sufficiently current to be useful.

It might seem pedantic to worry about these few probe packets, but
this behaviour ensures the system is safe, even if the proportion of
probe packets becomes large.

4.5. Flow Bootstrap

It might be expected that a new flow within an active aggregate would
need no special bootstrap behaviour. If there was an aggregate already
in progress between the gateways the new flow was about to use, it
would inherit the prevailing RE blanking fraction. And if there were
no active aggregate, the aggregate bootstrap behaviour would be
appropriate and sufficient for the new flow.

However, for a number of reasons, at least the first packet of each
new flow SHOULD be set to the NF codepoint, irrespective of whether it
is joining an active aggregate or not.
If the first packet is unlikely to be reliably delivered, a number of
NF packets MAY be sent to increase the probability that at least one
is delivered to the egress gateway.

If each flow does not start with an NF packet, it will be seen later
that sanctions may be incorrectly applied at the interface before the
egress gateway. It will often be possible to apply sanctions at the
granularity of aggregates rather than flows, but in an internetworked
environment it cannot be guaranteed that aggregates will be
identifiable in remote networks. So setting NF at the start of each
flow is a safe strategy. For instance, a remote network may have equal
cost multi-path (ECMP) routing enabled, causing flows between the same
gateways to traverse different paths.

After an idle period of more than 1 second, the ingress gateway SHOULD
set the EECN field of the next packet it sends to NF. This behaviour
allows the design of network policers to be deterministic.

If the ingress gateway can guarantee that the network(s) that will
carry the flow to its egress gateway all use a common identifier for
the aggregate (e.g. a single MPLS network without ECMP routing), it
MAY omit the NF marking when it adds a new flow to an active
aggregate. In that case an NF packet need only be sent if a whole
aggregate has been idle for more than 1 second.

5. Emulating Border Policing with Re-ECN

Note: In the rest of this memo, where the context makes it clear, we
will loosely use the term 'congestion' rather than using the stricter
'downstream pre-congestion'. Also we will loosely talk of positive or
negative traffic, meaning traffic where the moving average of the
downstream pre-congestion metric is persistently positive or negative
respectively.

The notion of positive and negative downstream pre-congestion arises
because downstream pre-congestion is calculated by subtracting the
congestion marking fraction from the RE blanking fraction. Therefore
packets can be considered to have a 'value multiplier' of +1, 0 or -1.
Blanking the RE bit increments the 'value multiplier' of a packet.
Congestion marking a packet decrements the 'value multiplier' (whether
admission marking or pre-emption marking). Both together cancel each
other out (a neutral or zero 'value multiplier'). The NF codepoint is
an exception. It has the same positive 'value multiplier' as a
re-echoed packet. The table below specifies unambiguously the value
multipliers of each extended ECN codepoint.

+-------+------+-------------+--------------+-----------------------+
| ECN   | RE   | re-ECN      | 'Value       | re-ECN meaning        |
| field | bit  | codepoint   | multiplier'  |                       |
+-------+------+-------------+--------------+-----------------------+
| 00    | 0    | NRECT       | n/a          | Not re-ECN-capable    |
|       |      |             |              | transport             |
| 00    | 1    | NF          | +1           | No feedback           |
| 01    | 0    | Re-Echo     | +1           | Re-echoed congestion  |
|       |      |             |              | and RECT              |
| 01    | 1    | RECT        | 0            | re-ECN capable        |
|       |      |             |              | transport             |
| 10    | 0    | AM(0)       | 0            | Admission Marking     |
|       |      |             |              | with Re-Echo          |
| 10    | 1    | AM(-1)      | -1           | Admission Marking     |
| 11    | 0    | PM(0)       | 0            | Pre-emption Marking   |
|       |      |             |              | with Re-Echo          |
| 11    | 1    | PM(-1)      | -1           | Pre-emption Marking   |
+-------+------+-------------+--------------+-----------------------+

Table 4: 'Sign' of Extended ECN Codepoints

Just as we will loosely talk of positive and negative traffic when we
mean the level of downstream pre-congestion in the stream of traffic,
we will also talk of positive or negative packets, meaning whether a
packet contributes positively or negatively to downstream
pre-congestion.

5.1. Policing Overview

To emulate border policing, the general idea is for each domain to
apply financial penalties to its upstream neighbour in proportion to
the amount of downstream pre-congestion that the upstream network
sends across the border. This seems to encourage everyone to
understate downstream pre-congestion to reduce the penalties they
incur. But it is in the last domain's interest to create a balancing
upward pressure by applying sanctions to flows where the downstream
pre-congestion marking fraction goes negative before the egress
gateway.

Of course, some domains may trust other domains to comply without
applying sanctions or penalties. In these cases, no penalties need be
applied. The re-ECN protocol ensures downstream pre-congestion marking
is passed on correctly whether or not penalties are applied to it, so
the system works just as well with a mixture of some domains trusting
each other and others not.

Figure 4 uses the same example as in previous sections to show the
downstream pre-congestion marking fraction, v, across a path through
the Internet. Downward arrows show the pressure for each domain to
underdeclare downstream pre-congestion in traffic they pass to the
next domain, because of the penalties. Note that at the last egress of
the Diffserv region, domain C should not agree to pay any penalties to
the egress gateway for pre-congestion passed to the egress gateway.
Downstream pre-congestion to the egress gateway should have reached
zero here, so if domain C agreed to pay for any downstream
pre-congestion, it would give the egress gateway an incentive to
overdeclare pre-congestion feedback and take the resulting profit from
domain C.

Providers should be free to agree the contractual terms they wish
between themselves, so this memo does not propose to standardise how
these penalties would be applied.
It is sufficient to standardise the re-ECN protocol so the downstream
pre-congestion metric is available if providers choose to use it.
However, Section 5.2 gives some examples of how these penalties might
be implemented.

           p e n a l t i e s
             /     |     \
   A    :          :          :
   |    | <--A--->  <---B--->   <---C--->  domain
   |    V          :          :           :
3% |    +-----+    |          |           :
   |    |     |    V          V           :
2% |    |     +----------------------+    :
   |    | downstream pre-congestion  |    :
1% |    |     :                      |    :
   |    |     :                      |    :
0% +----+----------------------------+====+------>
        :     :                      :  A :
        :     :                      :  | :
     ingress  :                      :  : egress
            1.00%                  2.00%: pre-congestion
                                        |
                                    sanctions

Figure 4: Policing Framework, showing creation of opposing pressures
to underdeclare and overdeclare downstream pre-congestion, using
penalties and sanctions

Any traffic that persistently goes negative by the time it leaves a
domain must not have been marked correctly in the first place. A
domain that discovers such traffic can adopt a range of strategies to
protect itself. Which strategy it uses will depend on policy, because
it cannot immediately assume malice; there may be an innocent
configuration error somewhere in the system. So this memo also does
not propose to standardise any particular mechanism, but Section 5.4
does give examples of how the underlying re-ECN protocol could be used
to apply sanctions to persistently negative traffic. The ultimate
sanction would be to drop such negative traffic indiscriminately,
without regard to flows. A less drastic sanction might be to focus
drop on specific packets in specific flows to remove the negative bias
while doing minimal harm.

In all cases a management alarm SHOULD be raised on detecting
persistently negative traffic and any automatic sanctions taken SHOULD
be logged. Even if the chosen policy is to take no automatic action,
the cause can then be investigated manually.
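
A minimal sketch of the alarm behaviour above follows, using the
'value multipliers' of Table 4 to compute a signed octet balance over
one averaging window. The function names, the logger name and the
choice to treat a whole window as the unit of "persistence" are all
assumptions made for illustration; this memo does not standardise any
of them.

```python
# Illustrative sketch only; identifiers and window policy are assumptions.
import logging

log = logging.getLogger("re_ecn.border")  # hypothetical logger name

# 'Value multipliers' copied from Table 4. NRECT has none, so it is absent.
VALUE_MULTIPLIER = {
    "NF": +1, "Re-Echo": +1, "RECT": 0,
    "AM(0)": 0, "AM(-1)": -1, "PM(0)": 0, "PM(-1)": -1,
}


def downstream_balance(packets):
    """Sum multiplier * octets over (codepoint, octets) pairs."""
    return sum(VALUE_MULTIPLIER[codepoint] * octets
               for codepoint, octets in packets
               if codepoint in VALUE_MULTIPLIER)


def check_window(packets):
    """Return (balance, alarm) for one window, logging any alarm."""
    balance = downstream_balance(packets)
    alarm = balance < 0
    if alarm:
        log.warning("negative downstream pre-congestion: %d octets", balance)
    return balance, alarm
```

A window holding 300 Re-Echo octets, 9700 RECT octets and 100 AM(-1)
octets gives a balance of +200 octets and no alarm; a window with more
congestion-marked octets than blanked octets raises the alarm, which
an operator can then investigate manually.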

The incentive for domains not to tolerate negatively marked traffic
depends on financial penalties never being negative. That is, any
level of negative marking only equates to zero penalty. In other
words, penalties are always paid in the same direction as the data,
and never against the data flow. This is consistent with the
definition of physical congestion; when a resource is underutilised,
it is not negatively congested, its congestion is just zero. So,
although short periods of negative marking can be tolerated to correct
temporary overdeclarations due to lags in the feedback system,
persistent downstream negative congestion can have no physical meaning
and therefore must signify a problem.

The upward arrow at the egress of domain C at its border with the
egress gateway in Figure 4 represents this incentive not to allow
negative traffic. But the same upward pressure applies at every domain
border (arrows not shown).

With the above penalty system, each domain seems to have a perverse
incentive to fake pre-congestion. For instance domain B's profit
depends on the difference between pre-congestion at its ingress (its
revenue) and at its egress (its cost). So if B overstates internal
pre-congestion it seems to increase its profit. However, we can assume
that domain A could bypass B, routing through other domains to reach
the egress. So the competitive discipline of least-cost routing can
ensure that any domain tempted to fake pre-congestion for profit risks
losing all its usage revenue. The least congested route would
eventually be able to win this competitive game, but only as long as
it didn't declare more fake pre-congestion than the next most
competitive route.

Again, this memo does not need to standardise any particular mechanism
for routing based on re-ECN.
Section 5.5 explains why no new standards would be needed for
congestion routing as long as re-ECN marking had been standardised.
That section also points to papers concerning optimising routing in
the presence of usage charging.

5.2. Pre-requisite Contractual Arrangements

The re-ECN protocol has been chosen to solve the policing problem
because it embeds a downstream pre-congestion metric in passing CL
traffic that is difficult to lie about and can be measured in bulk.
The ability to emulate border policing depends on network operators
choosing to use this metric as one of the elements in their contracts
with each other.

Already many inter-domain agreements involve a capacity and a usage
element. The usage element may be based on volume or various measures
of peak demand. We expect that those network operators that choose to
use pre-congestion notification for admission control would also be
willing to consider using this downstream pre-congestion metric as a
usage element in their interconnection contracts for admission
controlled traffic.

Appendix A.2 gives a suggested algorithm for metering downstream
congestion at a border router. It could hardly be simpler. It involves
accumulating the volume of packets with the RE bit blanked and the
volume of those with congestion marking and subtracting the two. In
order to discard a persistent negative balance (see above), time is
slotted into periods of say 10 seconds (or a time sufficient for a few
rounds of feedback depending on the level of aggregation). Every
timeslot, a positive balance between the two counters is accumulated
into a long-term counter, after which the balance is reset. If instead
the balance during any timeslot is negative, it is discarded and a
management alarm SHOULD also be raised.
Over an accounting period (say a month) the single metric in the
long-term counter represents all the downstream congestion caused by
traffic passing the border meter.

Congestion has the dimension of [byte], being the product of volume
transferred [byte] and percentage pre-congestion [dimensionless]. The
above algorithm effectively gives a measure of the volume transferred,
but modulated by pre-congestion expected downstream. So volume
transferred during off-peak periods counts as nearly nothing, while
volume transferred at peak times counts very highly. The re-ECN
protocol allows one network to measure how much pre-congestion has
been 'dumped' into it by another network, and then in turn how much of
that pre-congestion it dumped into the next downstream network.

Once this downstream pre-congestion metric is available, operators are
free to choose how they incorporate it into their interconnection
contracts [IXQoS]. Some may include a threshold volume of
pre-congestion as a quality measure in their service level agreement,
perhaps with a penalty clause if the upstream network exceeds this
threshold over, say, a month. Others may agree a set of tiered monthly
thresholds, with increasing penalties as each threshold is exceeded.
But it would be just as easy and more precise to do away with discrete
thresholds, and instead make the penalty rise smoothly with the volume
of pre-congestion by applying a price to pre-congestion itself. Then
the usage element of the interconnection contract would directly
relate to the volume of pre-congestion caused by the upstream network.

Typically, where capacity charges are concerned, lower tier customer
networks pay higher tier provider networks. So money flows from the
edges to the middle of the internetwork where there is greater
connectivity.
But penalties or charges for usage normally follow the same direction
as the data flow, which is the direction of control at the network
layer. So, where a tier 2 provider sends data into a tier 3 customer
network, we would expect the penalty clauses for sending too much
pre-congestion to be against the tier 2 network, even though it is the
provider.

The relative direction of penalties and charges is a constant source
of confusion. It may help to remember that data will be flowing in the
other direction too. So the provider network has as much opportunity
to levy usage penalties as its customer, and it can set the price or
strength of its own penalties higher if it chooses. Usage charges in
both directions tend to cancel each other out, which confirms that
usage-charging is less to do with revenue raising and more to do with
encouraging load control discipline in order to smooth peaks and
troughs, improving utilisation and quality.

To focus the discussion, from now on, unless otherwise stated, we will
assume a downstream network charges its upstream neighbour in
proportion to the pre-congestion it sends, B_v, using the notation of
Appendix A.2. If they previously agreed the (fixed) price per byte of
pre-congestion would be L, then the bill at the end of the month will
simply be the product L.B_v, plus any fixed charges they may also have
agreed.

We are well aware that the IETF tries to avoid standardising
technology that depends on a particular business model. But our aim is
merely to show that border policing can at least work with this one
model. We can then assume that operators might experiment with the
metric in other models. Effectively, tiered thresholds are just more
coarse-grained approximations of the fine-grained case we choose to
examine.
Of course, operators are free to complement this
pre-congestion-based usage element of their charges with traditional
capacity charging, and we expect they will.

5.3. Emulation of Per-Flow Rate Policing: Rationale and Limits

The important feature of charging in proportion to congestion volume
is that the penalty aggregates and deaggregates correctly along with
packet flows. This is because the penalty rises linearly with bit rate
and linearly with congestion, because it is the product of them both.
So if the packets crossing a border consist of a thousand flows, and
one of those flows doubles its rate, the ingress gateway forwarding
that flow will have to put twice as much congestion marking into the
packets of that flow. And this extra congestion marking will add
proportionately to the charges levied at every border the flow crosses
in proportion to the amount of pre-congestion remaining on the path.

As importantly, pre-congestion itself rises super-linearly with
utilisation of a particular resource. So if someone tries to push
another flow into a path that is already signalling enough
pre-congestion to warrant admission control, the penalty will be a lot
greater than it would have been to add the same flow to a less
congested path. So, the system as a whole is fairly insensitive to the
actual level of pre-congestion that each ingress chooses for
triggering admission control. The deterrent against exceeding whatever
threshold is chosen rises very quickly with a small amount of
cheating.

These are the properties that allow re-ECN to emulate per-flow border
policing of both rate and admission control. When a whole
inter-network is operating at normal (typically very low) congestion,
the pre-congestion marking from virtual queues will be a little
higher: still low, but more noticeable.
But this does not imply that usage /charges/ must also be low. That
depends on the /price/ L.

For instance, combining capacity and volume charges is quite a common
feature of interconnection agreements in today's Internet,
particularly since p2p file-sharing became popular. Imagine that the
monthly payment between two networks is made up of a volume charge and
a capacity charge, and they usually turn out to be in a ratio of about
1:2 (not atypical). If charging for volume were replaced with charging
for congested volume, one would expect the price of congestion to be
arranged so that the total charge for usage remained about the same,
still about one third of the total settlement, because that is
evidently the charge that the market has found necessary to push back
against usage. So, if an average pre-congestion fraction turned out to
be 0.1%, one would expect that the price L per byte of pre-congestion
would be about 1000 times the previously used per byte price for
volume (before congestion metrics were available).

From the above example it can be seen why operators will become
acutely sensitive to the congestion they cause in other networks,
which is of course the desired effect to encourage networks to
/control/ the congestion they allow their users to cause to others.

Effectively, usage charges will continuously flow from ingress
gateways to the places where there is mild pre-congestion, in
proportion to the data rates from those gateways and to the path
pre-congestion.

If anyone sends even one flow at a higher rate, they will immediately
have to pay proportionately more usage charges. Because there is no
knowledge of reservations within the Diffserv region, no interior
router can police whether the rate of each flow is greater than each
reservation.
So the system doesn't truly emulate rate-policing of each flow. But
there is no incentive to pack a higher rate into a reservation,
because the charges are directly proportional to rate, irrespective of
the reservation.

However, if virtual queues start to fill on any path, even though real
queues will still be able to provide low latency service,
pre-congestion marking will rise fairly quickly. It may eventually
reach the threshold where the ingress gateway would deny admission to
new flows. If the ingress gateway cheats and continues to admit new
flows, the affected virtual queues will rapidly fill, even though the
real queues will still be little worse than they were when admission
control should have been invoked. The ingress gateway will have to pay
the penalty for such an extremely high pre-congestion level, so the
pressure to invoke admission control should become unbearable.

The above mechanisms protect against rational operators. In Section
5.6 we discuss how networks can protect themselves from accidental or
deliberate misconfiguration in neighbouring networks.

5.4. Policing Dishonest Marking

As CL traffic leaves the last network before the egress gateway
(domain C) the RE blanking fraction should match the congestion
marking fraction, when averaged over a sufficiently long duration
(perhaps ~10s to allow a few rounds of feedback through regular
signalling of new and refreshed reservations).

If domain C doesn't trust the networks around it to behave honestly,
it should install a monitor at its egress. This monitor aims to detect
flows of CL packets that are persistently negative. If flows are
positive, domain C need take no action; this simply means an upstream
network must be paying more penalties than it needs to. Appendix A.3
gives a suggested algorithm for the monitor.
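
To make the idea of "persistently negative" concrete, the sketch below
tracks a per-flow octet balance over successive averaging windows and
reports flows whose balance stays negative. It is not the algorithm of
Appendix A.3, which is not reproduced here. The class name, the
`persistence` threshold and the use of explicit per-flow state are
assumptions chosen for clarity, and the caller is assumed to have
filtered out NRECT and NF packets beforehand.

```python
# Illustrative sketch only; not the Appendix A.3 algorithm.
from collections import defaultdict


class EgressMonitor:
    """Flag flows whose RE blanking minus congestion marking stays negative."""

    def __init__(self, persistence: int = 3):
        # Assumed policy knob: consecutive negative windows before flagging.
        self.persistence = persistence
        self.balance = defaultdict(int)           # octets, current window
        self.negative_windows = defaultdict(int)  # consecutive negatives

    def observe(self, flow_id, re_bit: int, congestion_marked: bool,
                octets: int) -> None:
        """Account for one re-ECN-capable CL packet (not NRECT or NF)."""
        if re_bit == 0:
            self.balance[flow_id] += octets  # RE blanked counts positive
        if congestion_marked:
            self.balance[flow_id] -= octets  # AM or PM counts negative

    def end_window(self):
        """Close the window and return flows that are persistently negative."""
        flagged = []
        for flow_id, octets in self.balance.items():
            if octets < 0:
                self.negative_windows[flow_id] += 1
                if self.negative_windows[flow_id] >= self.persistence:
                    flagged.append(flow_id)
            else:
                self.negative_windows[flow_id] = 0
        self.balance.clear()
        return flagged
```

A caller would invoke observe() for each packet and end_window() once
per averaging period (perhaps every 10 seconds), raising a management
alarm for any flow returned.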

Note that the monitor operates on flows but we would like it not to
require per-flow state. This is why we have been careful to specify
that every flow should start with a packet marked with the NF
codepoint (Section 4.5). If a flow does not start with the NF
codepoint, a monitor is likely to treat it unfavourably. This
incentivises setting of the NF codepoint.

This also means that a monitor will be resistant to state exhaustion
attacks from other networks, as the monitor never creates state unless
an NF packet arrives. And an NF packet counts positive, so it will
cost a lot for a network to send many of them.

Monitor algorithms will often maintain an average fraction of RE
blanked packets across flows. When maintaining an average across
flows, a monitor MUST ignore packets with the NF codepoint set. An
ingress gateway sets the NF codepoint when it does not have the
benefit of feedback from the egress. So counting NF packets would be
likely to make the average unnecessarily positive, providing headroom
(or should we say footroom?) for dishonest (negative) traffic.

If the monitor detects a persistently negative flow, it could drop
sufficient negative and neutral packets to force the flow to not be
negative. This is the approach taken for the 'egress dropper' in
[Re-TCP], but for the scenario in this memo, where everyone would
expect everyone else to keep to the protocol, it is probably more
advisable to raise a management alarm. So no ingress can understate
downstream pre-congestion without being logged. Then the network
operator can deal with the offending network at the human level, out
of band.

5.5. Competitive Routing

Goldenberg et al [Smart_rtg] refer to various commercial products and
present their own algorithms for moving traffic between multihomed
routes based on usage charges.
None of these systems require any 1099 changes to standards protocols because the choice between the 1100 available border gateway protocol (BGP) routes is based on a 1101 combination of local knowledge of the charging regime and local 1102 measurement of traffic levels. If, as we propose, charges or 1103 penalties were based on the level of re-ECN measured in passing 1104 traffic, a similar optimisation could be achieved without requiring 1105 any changes to standard routing protocols. 1107 We must be clear that applying pre-congestion-based routing to this 1108 admission control system remains an open research issue. Traffic 1109 engineering based on congestion requires careful damping to avoid 1110 oscillations, and should not be attempted without adult supervision 1111 :) Mortier & Pratt [ECN-BGP] have analysed traffic engineering based 1112 on congestion. Without the benefit of re-ECN, they they had to add a 1113 path attribute to BGP to advertise a route's downstream congestion 1114 (actually they proposed that BGP should advertise the charge for 1115 congestion, which we believe wrongly embeds an assumption into BGP 1116 that congestion will be charged for). 1118 5.6. Fail-safes 1120 The mechanisms described so far create incentives for rational 1121 operators to behave. That is, one operator aims to make another 1122 behave responsibly by applying penalties and expecting a rational 1123 response that trades off costs against benefits. It is usually 1124 reasonable to assume that other network operators behave rationally 1125 (policy routing can avoid those that might not). But this approach 1126 does not protect against the misconfigurations and accidents of other 1127 operators. 1129 Therefore, we propose the following two similar mechanisms at a 1130 network's borders to provide "defence in depth": 1132 Highly positive flows RE blanked packets should be sampled and a 1133 small regular sample picked randomly as they cross a border 1134 interface. 
      Then subsequent packets matching the same source and destination
      address and DSCP should be monitored.  If the RE blanking rate is
      well above a threshold (to be determined by operational
      practice), a management alarm SHOULD be raised, and the flow MAY
      be automatically subject to focused drop.

   Persistently negative flows:  Congestion marked packets should be
      sampled and a small regular sample picked randomly as they cross
      a border interface.  Then subsequent packets matching the same
      source and destination address and DSCP should be monitored.  If
      the RE blanking rate minus the congestion marking rate is
      persistently negative, a management alarm SHOULD be raised, and
      the flow MAY be automatically subject to focused drop.

   Both these mechanisms rely on the fact that, by selecting randomly
   solely from positive (or negative) packets, highly positive (or
   negative) flows will appear more quickly in the sample.

   Note that there is no assumption that users behave rationally.  The
   system is protected from the vagaries of irrational user behaviour
   by the ingress gateways, which transform internal penalties into a
   deterministic admission control mechanism that prevents users from
   misbehaving, by directly engineered means.

6.  Analysis

   The domains in Figure 1 are not expected to be completely malicious
   towards each other.  After all, we can assume that they are all
   co-operating to provide an internetworking service to the benefit
   of each of them and their customers.  Otherwise their routing
   policies would not interconnect them in the first place.  However,
   we assume that they are also competitors of each other.  So a
   network may try to contravene our proposed protocol if it would gain
   or make a competitor lose, or both, but only if it can do so without
   being caught.
   Therefore we do not have to consider every possible random attack
   one network could launch on the traffic of another, given that one
   network can always drop or corrupt packets that it forwards on
   behalf of another.

   Therefore, we only consider new opportunities for /gainful/ attack
   that our proposal introduces.  But to a certain extent we can also
   rely on the defences in depth we have described (Section 5.6),
   which are intended to mitigate the potential impact if one network
   accidentally misconfigures the workings of this protocol.

   In the generic scenario we introduced in Figure 1 the ingress and
   egress gateways are shown in the most generic arrangement, without
   any surrounding network.  This allows us to consider more specific
   cases where these gateways and a neighbouring network are operated
   by the same player.  As well as cases where the same player operates
   neighbouring networks, we will also consider cases where the two
   gateways collude as one player and where the sender and receiver
   collude as one.  Collusion of other sets of domains is less likely,
   but we will consider such cases.  In the general case, we will
   assume none of the nine trust domains across the figure fully
   trusts any of the others.

   Taking the generic scenario in Figure 1, as we only propose to
   change routers within the Diffserv region, we assume the operators
   of networks outside the region will be doing per-flow policing.
   That is, we assume the networks outside the Diffserv region and the
   gateways around its edges can protect themselves.  So our primary
   concern is to be able to protect networks that don't do per-flow
   policing from those that do.  The ingress and egress gateways are
   the only way the outer 'enemy' can get at the middle victim, so we
   can consider the gateways as the representatives of the 'enemy' as
   far as domains A, B and C are concerned.
   We will call this trust scenario 'edges against middles'.

   Earlier in this memo, we outlined the classic border rate policing
   problem (Section 3).  It will now be useful to spell out the
   motivations that would create the lack of trust as the root cause of
   the problem.  The more reservations a gateway can allow, the more
   revenue it receives.  The middle networks want the edges to comply
   with the admission control protocol when they become so congested
   that their service to others might suffer.  The middle networks also
   want to ensure the edges cannot steal more service from them than
   they pay for.

   In the context of this 'edges against middles' scenario, the re-ECN
   protocol has two main effects:

   o  The more pre-congestion there is on a path across the Diffserv
      region, the higher the ingress gateway has to declare downstream
      pre-congestion v_0.

   o  because downstream pre-congestion should on average be zero at
      the egress

   An executive summary of our security analysis can be stated in two
   parts, distinguished by the type of collusion considered.  In the
   first case collusion is limited to neighbours in the feedback loop.
   In other words, two neighbouring networks can be assumed to act as
   one.  Or the egress gateway might collude with domain C.  Or the
   ingress gateway might collude with domain A.  Or ingress and egress
   gateways might collude with each other.

   In these cases where only neighbours in the feedback loop collude,
   all parties have a positive incentive to declare downstream
   pre-congestion truthfully, and the ingress gateway has a positive
   incentive to invoke admission control when congestion rises above
   the admission threshold in any network in the region (including its
   own).  No party has an incentive to send more traffic than declared
   in reservation signalling (even though only the gateways read this
   signalling).
   In short, no party can gain at the expense of another.

   In the case of other forms of collusion (e.g. between domains A and
   C) it would be possible for, say, A & C to create a tunnel between
   themselves so that A would gain at the expense of B.  But C would
   then lose the gain that A had made.  Therefore the value to A & C of
   colluding to mount this attack seems questionable.  It is made more
   questionable because the attack can be statistically detected by B
   using the second defence in depth mechanism mentioned already.  Note
   that C can effectively prevent A attacking it through a tunnel, by
   treating the tunnel end point as a direct link to a neighbouring
   network, which falls back to the regular scenario without collusion.

   {ToDo: Due to lack of time, the full write up of the security
   analysis is deferred to the next version of this memo.}

   Finally, it is well known that the best person to analyse the
   security of a system is not the designer.  Therefore, our confident
   claims must be hedged with doubt until others with an incentive to
   break it have mounted a full analysis.

7.  Extensions

   If a different signalling system, such as NSIS, were used, but
   providing admission control in a similar way using pre-congestion
   notification (e.g. with RMD [NSIS-RMD]), a similar approach to
   re-ECN could be used.

8.  Design Choices and Rationale

   The case for using re-feedback (a generalisation of re-ECN) to
   police congestion response and provide QoS is made in [Re-fb].
   Essentially, the insight is that congestion crosses layers from the
   physical upwards.  Therefore re-feedback polices congestion response
   based on physical interfaces not addresses.  That is, the congestion
   leaving a physical interface can be policed at the interface, rather
   than the congestion on packets that claim to come from an address,
   which may be spoofed.
   Also, re-feedback does not actually require feedback.  A source must
   act conservatively before it gets feedback.

   On the subject of lack of feedback, the no feedback (NF) codepoint
   is motivated by arguments for a state set-up bit in IP to prevent
   state exhaustion attacks.  This idea was first put forward by David
   Clark and documented in [Handley_Steps_DoS].  The idea is that
   network layer datagrams should signal explicitly when they require
   state to be created in the layer above (e.g. at flow start).  Then
   the higher layer can refuse to create any state unless a datagram
   declares this intent.  We believe the NF codepoint can be used to
   serve the same purpose as the proposed more generic state set-up
   bit.

   The re-feedback paper [Re-fb] also makes the case for using an
   economic interpretation of congestion, which is the basis of the
   incentives-based approach used in this memo.  That paper also makes
   the case against the use of classic feedback if the economic
   interpretation of congestion is to be realised.  The problem with
   using classic feedback for policing congestion is that it opens up
   receiving networks to 'denial of funds' attacks.

   {ToDo: Further Design Rationale will be included in future versions
   of this memo}

9.  IANA Considerations

   {ToDo:} This memo includes no request to IANA (yet).

10.  Security Considerations

   This whole memo concerns the security of a scalable admission
   control system, in particular the analysis section (Section 6).
   Below, some specific security issues are mentioned that did not fit
   elsewhere in the memo or which comment on the robustness of the
   security provided by the design.

   Firstly, we must repeat the statement of applicability in the
   analysis: that we only consider new opportunities for /gainful/
   attack that our proposal introduces.
   Despite only involving a few bits, there is sufficient complexity in
   the whole system that there are numerous possibilities for attacks
   not catered for.  But as far as we are aware, none yields any
   benefit to the attacker.  It will always be possible for one network
   to cause damage to another neighbouring network's traffic by
   dropping or corrupting it as it forwards it.  Therefore we do not
   believe networks would set their routing policies to interconnect in
   the first place if they didn't trust the other networks not to
   damage their traffic without any /direct/ gain to themselves.

   Having said this, we do want to highlight some of the weaker parts
   of our argument.  We have argued that networks will be dissuaded
   from faking congestion marking by the possibility that upstream
   networks will route round them.  As we have said, these arguments
   are intuitive and will remain fairly tenuous until proved in
   practice, particularly close to the egress where less competitive
   routing is likely.

   We should also point out that the approach in this memo was only
   designed to be robust for admission control.  We do not claim the
   incentives will always be strong enough to force correct flow
   pre-emption behaviour.  This is because pre-emption of flows tends
   to be associated with much higher damage to an operator's reputation
   for robust quality than denying admission.  However, in general the
   incentives for correct flow pre-emption are similar to those for
   admission control.

   Finally, it may seem that the 8 codepoints that have been made
   available by extending the ECN field with the RE bit have been used
   rather wastefully.  In effect the RE bit has been used as an
   orthogonal single bit in nearly all cases, the only exception being
   when the ECN field is cleared to "00".
   The mapping of the codepoints in an earlier version of this proposal
   used the codepoint space more efficiently, but the scheme became
   vulnerable to a network operator focusing its congestion marking to
   mark more positive than neutral packets in order to reduce its
   penalties.

   {ToDo: More security considerations will undoubtedly be added in
   future versions of this memo.}

11.  Conclusions

   Using pre-congestion is a promising technique to control flow
   admissions that will scale to any size network.  However, it
   requires a mechanism to ensure that networks can interconnect even
   if they do not trust each other to keep to the admission control
   protocols.  We claim that the re-ECN protocol provides such a
   mechanism, so that one network can detect and prevent another
   network in the system from cheating for its own gain.

12.  Acknowledgements

   All the following have given helpful comments and some may become
   co-authors of later drafts: Arnaud Jacquet, Alessandro Salvatori,
   Steve Rudkin, David Songhurst, John Davey, Ian Self, Anthony
   Sheppard (BT), Stephen Hailes (UCL), Francois Le Faucheur, Anna
   Charny (Cisco), Jozef Babiarz, Kwok-Ho Chan, Corey Alexander
   (Nortel), David Clark, Bill Lehr, Sharon Gillett (MIT) and comments
   from participants in the CFP/CRN inter-provider QoS and broadband
   working groups.

13.  Comments Solicited

   Comments and questions are encouraged and very welcome.  They can be
   addressed to the IETF Transport Area working group's mailing list,
   and/or to the authors.

14.  References

14.1.  Normative References

   [PCN]      Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Liatsos, V., Babiarz, J., Chan, K., and S.
              Dudley, "Pre-Congestion Notification",
              draft-briscoe-tsvwg-cl-phb-01 (work in progress),
              March 2006.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
              Network Element Service", RFC 2211, September 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
              J., Courtney, W., Davari, S., Firoiu, V., and D.
              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
              Behavior)", RFC 3246, March 2002.

   [RSVP-ECN]
              Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
              Babiarz, J., and K. Chan, "RSVP Extensions for Admission
              Control over Diffserv using Pre-congestion Notification",
              draft-lefaucheur-rsvp-ecn-00 (work in progress),
              October 2005.

   [Re-TCP]   Briscoe, B., Jacquet, A., and A. Salvatori, "Re-ECN:
              Adding Accountability for Causing Congestion to TCP/IP",
              draft-briscoe-tsvwg-re-ecn-tcp-01 (work in progress),
              March 2006.

14.2.  Informative References

   [CL-arch]  Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
              Charny, A., Babiarz, J., and K. Chan, "A Framework for
              Admission Control over DiffServ using Pre-Congestion
              Notification", draft-briscoe-tsvwg-cl-architecture-02
              (work in progress), March 2006.

   [ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
              Routeing", Proc Internet Charging and QoS Technology
              Workshop (ICQT'03) pp308--317, September 2003.

   [IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
              Quality of Service Interconnect", BT Technology Journal
              (BTTJ) 23(2)171--195, April 2005.

   [NSIS-RMD]
              Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and
              T. Phelan, "RMD-QOSM - The Resource Management in Diffserv
              QOS Model", draft-ietf-nsis-rmd-06 (work in progress),
              February 2006.

   [RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
              Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
              Functional Specification", RFC 2205, September 1997.

   [RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
              Data Flows", RFC 2207, September 1997.

   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
              ReSerVation Protocol (RSVP) Version 1 Applicability
              Statement Some Guidelines on Deployment", RFC 2208,
              September 1997.

   [RFC2747]  Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic
              Authentication", RFC 2747, January 2000.

   [RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
              Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
              Felstaine, "A Framework for Integrated Services Operation
              over Diffserv Networks", RFC 2998, November 2000.

   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
              Congestion Notification (ECN) Signaling with Nonces",
              RFC 3540, June 2003.

   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
              Congestion Response in an Internetwork Using Re-Feedback",
              ACM SIGCOMM CCR 35(4)277--288, August 2005.

   [Smart_rtg]
              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
              "Optimizing Cost and Performance for Multihoming", ACM
              SIGCOMM CCR 34(4)79--92, October 2004.

Appendix A.  Implementation

A.1.  Ingress Gateway Algorithm for Blanking the RE bit

   The ingress gateway receives regular feedback reporting the fraction
   of congestion marked octets for each aggregate arriving at the
   egress.  So for each aggregate it should blank the RE bit on the
   same fraction of octets.
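
   As a non-normative illustration, the direct approach of blanking the
   RE bit on a given fraction of octets could be sketched in Python as
   below.  The function name, the byte-credit counter and the
   representation of an aggregate as a list of packet sizes are
   assumptions made for this sketch only; they are not part of the
   protocol.

```python
def blank_re_by_fraction(packet_sizes, fraction):
    """Choose an RE bit value for each packet of one aggregate.

    packet_sizes: packet lengths in octets, in sending order.
    fraction:     hypothetical fed-back share of congestion marked
                  octets for this aggregate (0 <= fraction <= 1).
    Returns a list with 1 where RE is set and 0 where RE is blanked.
    """
    credit = 0.0          # octets of blanking owed but not yet sent
    re_bits = []
    for size in packet_sizes:
        credit += size * fraction      # each packet earns some credit
        if credit >= size:             # enough credit to blank this one
            re_bits.append(0)
            credit -= size
        else:
            re_bits.append(1)
    return re_bits


if __name__ == "__main__":
    print(blank_re_by_fraction([100] * 8, 0.25))
```

   With 100-octet packets and a fed-back fraction of 0.25, this sketch
   blanks RE on every fourth packet.  It performs one multiplication
   per packet and carries a fractional counter.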
   It is more efficient to calculate the reciprocal of this fraction
   when the signalling arrives, Z_0 = 1 / Congestion-Level-Estimate,
   which, multiplied by the packet size, gives the number of bytes the
   ingress should send with the RE bit set between those it sends with
   the RE bit blanked.  Z_0 will also take account of the sustainable
   rate reported during the flow pre-emption process, if necessary.

   A suitable pseudo-code algorithm for the ingress gateway is as
   follows:

   ====================================================================
   B_i = 0                   /* interblank volume                      */
   for each packet {
       b = readLength()      /* set b to packet size                   */
       B_i += b              /* accumulate interblank volume           */
       if B_i < b * Z_0 {    /* test whether interblank volume...      */
           writeRE(1)
       } else {              /* ...exceeds blank RE spacing * pkt size */
           writeRE(0)        /* ...and if so, clear RE                 */
           B_i = 0           /* ...and re-set interblank volume        */
       }
   }
   ====================================================================

A.2.  Bulk Downstream Congestion Metering Algorithm

   To meter the bulk amount of downstream pre-congestion in passing
   traffic an algorithm is needed that accumulates the size of packets
   with RE blanked (or NF set) and subtracts the size of congestion
   marked packets, but ignores a persistently negative balance over a
   duration of T ~ 10secs, say.
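
   The metering just described could be sketched in runnable form as
   follows (Python, non-normative).  The Packet record, its field names
   and the meter() function are hypothetical names introduced for this
   sketch.  It assumes packets are presented in time order and that a
   negative balance within a timeslot is counted as zero.  Unlike a
   router, it also closes the last partial timeslot when its input
   ends.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    time: float        # arrival time in seconds
    size: int          # packet length in octets
    re_blanked: bool   # True if the RE bit is 0
    nf: bool           # True if the NF codepoint is set
    marked: bool       # True if the packet is congestion marked


def meter(packets, slot=10.0):
    """Return (pre-congestion volume, total volume) in octets."""
    b_v = 0            # accumulated pre-congestion volume
    b_s = 0            # balance within the current timeslot
    b_t = 0            # total data volume
    slot_end = None
    for p in packets:
        if slot_end is None:
            slot_end = p.time + slot
        while p.time > slot_end:       # close every elapsed timeslot
            b_v += max(b_s, 0)         # a negative balance counts as zero
            b_s = 0
            slot_end += slot
        b_t += p.size
        if p.re_blanked or p.nf:       # positive packet
            b_s += p.size
        elif p.marked:                 # negative packet
            b_s -= p.size
    b_v += max(b_s, 0)                 # close the final partial timeslot
    return b_v, b_t
```

   Closing elapsed timeslots before counting the arriving packet is a
   choice made for this sketch, so that each packet is attributed to
   the timeslot in which it arrives.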
   Three counters need to be maintained:

      B_v:  accumulated pre-congestion volume

      B_s:  pre-congestion volume in timeslot

      B_t:  total data volume

   A suitable pseudo-code algorithm for a border router is as follows:

   ====================================================================
   B_v = 0
   B_s = 0
   B_t = 0
   t = timeNow() + T         /* divide into timeslots of few secs      */
   for each packet {
       b = readLength()      /* set b to packet size                   */
       B_t += b              /* accumulate total volume                */
       if readRE() == 0 || readEECN() == NF {
           B_s += b          /* increment...                           */
       } elseif readECN() == 1X {
           B_s -= b          /* ...or decrement B_s...                 */
       }                     /* ...depending on EECN field             */
       if timeNow() > t {    /* every timeslot...                      */
           if B_s > 0 {      /* count a negative balance as zero       */
               B_v += B_s    /* otherwise accumulate the balance       */
           }
           B_s = 0           /* re-set the temp counter...             */
           t += T            /* ...for the next timeslot               */
       }
   }
   ====================================================================

   At the end of an accounting period this counter B_v represents the
   pre-congestion volume that penalties could be applied to, as
   described in Section 5.2.

   For instance, accumulated volume of pre-congestion through a border
   interface over a month might be B_v = 5PB (petabyte = 10^15 byte).
   This might have resulted from an average downstream pre-congestion
   level of 1% on an accumulated total data volume of B_t = 500PB.

A.3.  Algorithm for Sanctioning Negative Traffic

   {ToDo: Write up dropper with flow management algorithm and variant
   with bounded flow state.}

Author's Address

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.