idnits 2.17.1 draft-briscoe-tsvwg-cl-architecture-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 33. -- Found old boilerplate from RFC 3978, Section 5.5 on line 2549. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2526. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2533. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line 2553), which is fine, but *also* found old RFC 2026, Section 10.4C, paragraph 1 text on line 55. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure Invitation -- however, there's a paragraph with a matching beginning. Boilerplate error? Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == There are 3 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. 
(See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 1193 has weird spacing: '... can be used ...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC2205' is mentioned on line 1184, but not defined == Missing Reference: 'RT-ECN' is mentioned on line 599, but not defined == Missing Reference: 'TEWMA' is mentioned on line 2215, but not defined == Unused Reference: 'AVQ' is defined on line 2239, but no explicit reference was found in the text == Unused Reference: 'RFC2309' is defined on line 2354, but no explicit reference was found in the text == Unused Reference: 'RFC2474' is defined on line 2358, but no explicit reference was found in the text == Unused Reference: 'RFC2597' is defined on line 2367, but no explicit reference was found in the text == Unused Reference: 'RFC3246' is defined on line 2380, but no explicit reference was found in the text == Unused Reference: 'RFC3270' is defined on line 2385, but no explicit reference was found in the text == Outdated reference: A later version (-05) exists of draft-ietf-tsvwg-rsvp-dste-03 -- Possible downref: Non-RFC (?) normative reference: ref. 'AVQ' -- Possible downref: Non-RFC (?) normative reference: ref. 'Breslau99' -- Possible downref: Non-RFC (?) normative reference: ref. 'Breslau00' -- Possible downref: Non-RFC (?) normative reference: ref. 'Briscoe' -- Possible downref: Non-RFC (?) normative reference: ref. 'DCAC' == Outdated reference: A later version (-01) exists of draft-davie-ecn-mpls-00 ** Downref: Normative reference to an Informational RFC: RFC 3689 (ref. 'EMERG-RQTS') ** Downref: Normative reference to an Informational RFC: RFC 3690 (ref. 'EMERG-TEL') -- Possible downref: Normative reference to a draft: ref. 'Floyd' -- Possible downref: Non-RFC (?) normative reference: ref. 'GSPa' -- Possible downref: Non-RFC (?) normative reference: ref. 'GSP-TR' -- Possible downref: Non-RFC (?) 
normative reference: ref. 'ITU.MLPP.1990' -- Possible downref: Non-RFC (?) normative reference: ref. 'Johnson' -- Possible downref: Non-RFC (?) normative reference: ref. 'LoadBalancing-a' -- Possible downref: Non-RFC (?) normative reference: ref. 'LoadBalancing-b' -- Possible downref: Non-RFC (?) normative reference: ref. 'Low' -- Possible downref: Non-RFC (?) normative reference: ref. 'NAC-a' -- Possible downref: Non-RFC (?) normative reference: ref. 'NAC-b' == Outdated reference: A later version (-03) exists of draft-briscoe-tsvwg-cl-phb-02 -- Possible downref: Normative reference to a draft: ref. 'PCN' == Outdated reference: A later version (-09) exists of draft-briscoe-tsvwg-re-ecn-tcp-01 -- Possible downref: Non-RFC (?) normative reference: ref. 'Re-feedback' == Outdated reference: A later version (-01) exists of draft-briscoe-tsvwg-re-ecn-border-cheat-00 -- Possible downref: Normative reference to a draft: ref. 'Re-PCN' -- Possible downref: Non-RFC (?) normative reference: ref. 'Reid' ** Obsolete normative reference: RFC 2309 (Obsoleted by RFC 7567) ** Downref: Normative reference to an Informational RFC: RFC 2475 ** Downref: Normative reference to an Informational RFC: RFC 2998 ** Downref: Normative reference to an Informational RFC: RFC 4542 == Outdated reference: A later version (-20) exists of draft-ietf-nsis-rmd-03 ** Downref: Normative reference to an Experimental draft: draft-ietf-nsis-rmd (ref. 'RMD') -- Possible downref: Normative reference to a draft: ref. 'RSVP-PCN' == Outdated reference: A later version (-05) exists of draft-babiarz-tsvwg-rtecn-04 -- Possible downref: Normative reference to a draft: ref. 'RTECN' -- Possible downref: Normative reference to a draft: ref. 'RTECN-usage' -- Possible downref: Non-RFC (?) normative reference: ref. 'Songhurst' Summary: 14 errors (**), 0 flaws (~~), 20 warnings (==), 30 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 1 TSVWG B. Briscoe 2 Internet Draft P. Eardley 3 draft-briscoe-tsvwg-cl-architecture-04.txt D. Songhurst 4 Expires: April 2007 BT 6 F. Le Faucheur 7 A. Charny 8 Cisco Systems, Inc 10 J. Babiarz 11 K. Chan 12 S. Dudley 13 Nortel 15 G. Karagiannis 16 University of Twente / Ericsson 18 A. Bader 19 L. Westberg 20 Ericsson 22 25 October, 2006 24 An edge-to-edge Deployment Model for Pre-Congestion Notification: 25 Admission Control over a DiffServ Region 26 draft-briscoe-tsvwg-cl-architecture-04.txt 28 Status of this Memo 30 By submitting this Internet-Draft, each author represents that any 31 applicable patent or other IPR claims of which he or she is aware 32 have been or will be disclosed, and any of which he or she becomes 33 aware will be disclosed, in accordance with Section 6 of BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF), its areas, and its working groups. Note that 37 other groups may also distribute working documents as Internet- 38 Drafts. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time. It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress". 45 The list of current Internet-Drafts can be accessed at 46 http://www.ietf.org/1id-abstracts.html 48 The list of Internet-Draft Shadow Directories can be accessed at 49 http://www.ietf.org/shadow.html 51 This Internet-Draft will expire on September 6, 2006. 53 Copyright Notice 55 Copyright (C) The Internet Society (2006). All Rights Reserved. 57 Abstract 59 This document describes a deployment model for pre-congestion 60 notification (PCN) operating in a large DiffServ-based region of the 61 Internet. 
PCN-based admission control protects the quality of service 62 of existing flows in normal circumstances, whilst if necessary (eg 63 after a large failure) pre-emption of some flows preserves the quality 64 of service of the remaining flows. Each link has a configured- 65 admission-rate and a configured-pre-emption-rate, and a router marks 66 packets that exceed these rates. Hence routers give an early warning of 67 their own potential congestion, before packets need to be dropped. 68 Gateways around the edges of the PCN-region convert measurements of 69 packet rates and their markings into decisions about whether to admit 70 new flows, and (if necessary) into the rate of excess traffic that 71 should be pre-empted. Per-flow admission states are kept at the 72 gateways only, while the PCN markers that are required for all routers 73 operate on the aggregate traffic - hence there is no scalability impact 74 on interior routers. 76 Authors' Note (TO BE DELETED BY THE RFC EDITOR UPON PUBLICATION) 78 This document is posted as an Internet-Draft with the intention of 79 eventually becoming an INFORMATIONAL RFC. 81 Table of Contents 83 1. Introduction................................................5 84 1.1. Summary................................................5 85 1.2. Key benefits...........................................8 86 1.3. Terminology............................................9 87 1.4. Existing terminology...................................11 88 1.5. Standardisation requirements...........................11 89 1.6. Structure of rest of the document......................12 90 2. Key aspects of the deployment model.........................13 91 2.1. Key goals.............................................13 92 2.2. Key assumptions........................................14 93 3. Deployment model...........................................17 94 3.1. Admission control......................................17 95 3.1.1. 
Pre-Congestion Notification for Admission Marking..17 96 3.1.2. Measurements to support admission control..........17 97 3.1.3. How edge-to-edge admission control supports end-to-end 98 QoS signalling..........................................18 99 3.1.4. Use case.........................................18 100 3.2. Flow pre-emption.......................................20 101 3.2.1. Alerting an ingress gateway that flow pre-emption may be 102 needed..................................................20 103 3.2.2. Determining the right amount of CL traffic to drop.23 104 3.2.3. Use case for flow pre-emption.....................24 105 3.3. Both admission control and pre-emption.................25 106 4. Summary of Functionality....................................27 107 4.1. Ingress gateways.......................................27 108 4.2. Interior routers.......................................28 109 4.3. Egress gateways........................................28 110 4.4. Failures..............................................29 111 5. Limitations and some potential solutions....................31 112 5.1. ECMP..................................................31 113 5.2. Beat down effect.......................................33 114 5.3. Bi-directional sessions................................35 115 5.4. Global fairness........................................37 116 5.5. Flash crowds..........................................39 117 5.6. Pre-empting too fast...................................41 118 5.7. Other potential extensions.............................42 119 5.7.1. Tunnelling........................................42 120 5.7.2. Multi-domain and multi-operator usage.............43 121 5.7.3. Preferential dropping of pre-emption marked packets44 122 5.7.4. Adaptive bandwidth for the Controlled Load service.44 123 5.7.5. Controlled Load service with end-to-end Pre-Congestion 124 Notification............................................45 125 5.7.6. 
MPLS-TE..........................................45 126 6. Relationship to other QoS mechanisms........................46 127 6.1. IntServ Controlled Load................................46 128 6.2. Integrated services operation over DiffServ............46 129 6.3. Differentiated Services................................46 130 6.4. ECN...................................................47 131 6.5. RTECN.................................................47 132 6.6. RMD...................................................48 133 6.7. RSVP Aggregation over MPLS-TE..........................48 134 6.8. Other Network Admission Control Approaches.............48 135 7. Security Considerations.....................................49 136 8. Acknowledgements...........................................49 137 9. Comments solicited.........................................50 138 10. Changes from earlier versions of the draft.................50 139 11. Appendices................................................52 140 11.1. Appendix A: Explicit Congestion Notification..........52 141 11.2. Appendix B: What is distributed measurement-based admission 142 control?...................................................53 143 11.3. Appendix C: Calculating the Exponentially weighted moving 144 average (EWMA).............................................54 145 12. References................................................56 146 Authors' Addresses............................................61 147 Intellectual Property Statement................................63 148 Disclaimer of Validity........................................63 149 Copyright Statement...........................................63 151 1. Introduction 153 1.1. Summary 155 This document describes a deployment model to achieve an end-to-end 156 Controlled Load service by using (within a large region of the 157 Internet) DiffServ and edge-to-edge distributed measurement-based 158 admission control and flow pre-emption. 
Controlled load service is a 159 quality of service (QoS) closely approximating the QoS that the same 160 flow would receive from a lightly loaded network element [RFC2211]. 161 Controlled Load (CL) is useful for inelastic flows such as those for 162 real-time media. 164 In line with the "IntServ over DiffServ" framework defined in 165 [RFC2998], the CL service is supported end-to-end and RSVP signalling 166 [RFC2205] is used end-to-end, over an edge-to-edge DiffServ region. 167 We call the DiffServ region the "CL-region".
169  ___    ___    _______________________________________    ____    ___
170 |   |  |   |  |Ingress         Interior         Egress|  |    |  |   |
171 |   |  |   |  |gateway         routers         gateway|  |    |  |   |
172 |   |  |   |  |-------+  +-------+  +-------+  +------|  |    |  |   |
173 |   |  |   |  | PCN-  |  | PCN-  |  | PCN-  |  |      |  |    |  |   |
174 |   |..|   |..|marking|..|marking|..|marking|..| Meter|..|    |..|   |
175 |   |  |   |  |-------+  +-------+  +-------+  +------|  |    |  |   |
176 |   |  |   |  |  \                                 /  |  |    |  |   |
177 |   |  |   |  |   \                               /   |  |    |  |   |
178 |   |  |   |  |    \  Congestion-Level-Estimate  /    |  |    |  |   |
179 |   |  |   |  |     \  (for admission control)  /     |  |    |  |   |
180 |   |  |   |  |      --<-----<----<----<-----<--      |  |    |  |   |
181 |   |  |   |  |      Sustainable-Aggregate-Rate       |  |    |  |   |
182 |   |  |   |  |        (for flow pre-emption)         |  |    |  |   |
183 |___|  |___|  |_______________________________________|  |____|  |___|
185  Sx    Access               CL-region                    Access   Rx
186  End   Network                                           Network  End
187  Host                                                             Host
188                <------ edge-to-edge signalling ----->
189              (for admission control & flow pre-emption)
191 <-------------------end-to-end QoS signalling protocol--------------->
193 Figure 1: Overall QoS architecture (NB terminology explained later)
194 Figure 1 shows an example of an overall QoS architecture, where the 195 two access networks are connected by a CL-region. Another possibility 196 is that there are several CL-regions between the access networks - 197 each would operate the Pre-Congestion Notification mechanisms 198 separately. The document assumes RSVP as the end-to-end QoS 199 signalling protocol. 
However, the RSVP signalling may itself be 200 originated or terminated by proxies still closer to the edge of the 201 network, such as home hubs or the like, triggered in turn by 202 application layer signalling. [RFC2998] and our approach are compared 203 further in Section 6.2. 205 Flows must enter and leave the CL-region through its ingress and 206 egress gateways, and they need traffic descriptors that are policed 207 by the ingress gateway (NB the policing function is out of this 208 document's scope). The overall CL-traffic between two border routers 209 is called a "CL-region-aggregate". 211 The document introduces a mechanism for flow admission control: 212 should a new flow be admitted into a specific CL-region-aggregate? 213 Admission control protects the QoS of existing CL-flows in normal 214 circumstances. In abnormal circumstances, for instance a disaster 215 affecting multiple interior routers, the QoS on existing CL 216 microflows may degrade even if care was exercised when admitting 217 those microflows before those circumstances. Therefore we also 218 propose a mechanism for flow pre-emption: how much traffic, in a 219 specific CL-region-aggregate, should be pre-empted in order to 220 preserve the QoS of the remaining CL-flows? Flow pre-emption also 221 restores QoS to lower priority traffic. 223 As a fundamental building block to enable these two mechanisms, each 224 link of the CL-region is associated with a configured-admission-rate 225 and configured-pre-emption-rate; the former is usually significantly 226 smaller than the latter. If traffic in a specific DiffServ class ("CL- 227 traffic") on the link exceeds these rates then packets are marked 228 with "Admission Marking" or "Pre-emption Marking". The algorithms 229 that determine the number of packets marked are outlined in Section 3 230 and detailed in [PCN]. 
PCN marking (Pre-Congestion Notification) 231 builds on the concepts of RFC 3168, "The addition of Explicit 232 Congestion Notification to IP" (which is briefly summarised in 233 Appendix A).
235        Traffic rate on link  ^
236                              |
237                              |   Drop packets
238              link bandwidth -|---------------------------
239                              |
240                              |   Pre-emption Mark packets
241 configured-pre-emption-rate -|---------------------------
242                              |
243                              |   Admission Mark packets
244   configured-admission-rate -|---------------------------
245                              |
246                              |   No marking of packets
247                              |
248                              +---------------------------
250 Figure 2: Packet Marking by Routers
252 Gateways of the CL-region make measurements of packet rates and their 253 PCN markings and convert them into decisions about whether to admit 254 new flows, and (if necessary) into the rate of excess traffic that 255 should be pre-empted. These mechanisms are detailed in Section 3 and 256 briefly outlined in the next few paragraphs. 258 The admission control mechanism for a new flow entering the network 259 at ingress gateway G0 and leaving it at egress gateway G1 relies on 260 feedback from the egress gateway G1 about the existing CL-region- 261 aggregate between G0 and G1. This feedback is generated as follows. 262 All routers meter the rate of the CL-traffic on their outgoing links 263 and mark the packets with the Admission Mark if the configured- 264 admission-rate is exceeded. Egress gateway G1 measures the Admission 265 Marks for each of its CL-region-aggregates separately. If the 266 fraction of traffic on a CL-region-aggregate that is Admission Marked 267 exceeds some threshold, no further flows should be admitted into this 268 CL-region-aggregate. Because sources vary their data rates (amongst 269 other reasons) the rate of the CL-traffic on a link may fluctuate 270 above and below the configured-admission-rate. Hence to get more 271 stable information, the egress gateway measures the fraction as a 272 moving average, called the Congestion-Level-Estimate. 
This is 273 signalled from the egress G1 to the ingress G0, to enable the ingress 274 to block new flows. 276 Admission control seems most useful for DiffServ's Controlled load 277 service. In order to support CL traffic we would expect PCN to 278 supplement the existing scheduling behaviour Expedited Forwarding 279 (EF). Since PCN gives an "early warning" of potential congestion 280 (hence "pre-congestion notification"), admission control can kick in 281 before there is any significant build up of packets in routers - 282 which is exactly the performance required for CL. However, PCN is not 283 only intended to supplement EF. PCN is specified (in [PCN]) as a 284 building block which can supplement the scheduling behaviour of other 285 PHBs. 287 The function to pre-empt flows (or allow the potential to pre-empt 288 them) relies on feedback from the egress gateway about the CL-region- 289 aggregates. This feedback is generated as follows. All routers meter 290 the rate of the CL-traffic on their outgoing links, and if the rate 291 is in excess of the configured-pre-emption-rate then packets 292 amounting to the excess rate are Pre-emption Marked. If the egress 293 gateway G1 sees a Pre-emption Marked packet then it measures, for 294 this CL-region-aggregate, the rate of all received packets that 295 aren't Pre-emption Marked. This is the rate of CL-traffic that the 296 network can actually support from G0 to G1, and we thus call it the 297 Sustainable-Aggregate-Rate. The ingress gateway G0 compares the 298 Sustainable-Aggregate-Rate with the rate that it is sending towards 299 G1, and hence determines the required traffic rate reduction. The 300 document assumes flow pre-emption as the way of reacting to this 301 information, ie stopping sufficient flows to reduce the rate to the 302 Sustainable-Aggregate-Rate. 
However, this isn't mandated, for 303 instance policy or regulation may prevent pre-emption of some flows - 304 such considerations are out of scope of this document. 306 1.2. Key benefits 308 We believe that the mechanisms described in this document are simple, 309 scalable, and robust because: 311 o Per flow state is only required at the ingress gateways to prevent 312 non-admitted CL traffic from entering the PCN-region. Other 313 network entities are not aware of individual flows. 315 o For each of its links a router has Admission Marking and Pre- 316 emption Marking behaviours. These markers operate on the overall 317 CL traffic of the respective link. Therefore, there are no 318 scalability concerns. 320 o The information of these measurements is implicitly signalled to 321 the egress gateways by the marks in the packet headers. No 322 protocol actions (explicit messages) are required. 324 o The egress gateways make separate measurements of packets 325 from each ingress gateway. Each meter operates on the overall CL traffic 326 of a particular CL-region-aggregate. Therefore, there are no 327 scalability concerns as long as the number of ingress gateways is 328 not overwhelmingly large. 330 o Feedback signalling is required between all pairs of ingress and 331 egress gateways and the signalled information is on the basis of 332 the corresponding CL-region-aggregate, i.e. it is also unaware of 333 individual flows. 335 o The configured-admission-rates can be chosen small enough that 336 admitted traffic can still be carried after a rerouting in most 337 failure cases. This is an important feature as QoS violations in 338 core networks due to link failures are more likely than QoS 339 violations due to increased traffic volume. 341 o The admitted load is controlled dynamically. Therefore it adapts 342 as the traffic matrix changes, and also if the network topology 343 changes (eg after a link failure). 
Hence an operator can be less 344 conservative when deploying network capacity, and less accurate in 345 their prediction of the traffic matrix. Also, controlling the load 346 using statically provisioned capacity per ingress (regardless of 347 the egress of a flow), as is typical in the DiffServ architecture 348 [RFC2475], can lead to focussed overload: many flows happen to 349 focus on a particular link and then all flows through the 350 congested link fail catastrophically (Section 6.2). 352 o The pre-emption function complements admission control. It allows 353 the network to recover from sudden unexpected surges of CL-traffic 354 on some links, thus restoring QoS to the remaining flows. Such 355 scenarios are very unlikely but not impossible. They can be caused 356 by large network failures that redirect lots of admitted CL- 357 traffic to other links, or by malfunction of the measurement-based 358 admission control in the presence of admitted flows that send for 359 a while with an atypically low rate and increase their rates in a 360 correlated way. 362 1.3. Terminology 364 EDITOR'S NOTE: Terminology in this document is (hopefully) consistent 365 with that in [PCN]. However, it may not be consistent with the 366 terminology in other PCN-related documents. The PCN Working Group (if 367 formed) will need to agree a single set of terminology. 369 This terminology is copied from the pre-congestion notification 370 marking draft [PCN]: 372 o Pre-Congestion Notification (PCN): two new algorithms that 373 determine when a PCN-enabled router Admission Marks and Pre- 374 emption Marks a packet, depending on the traffic level. 376 o Admission Marking condition: the traffic level is such that the 377 router Admission Marks packets. The router provides an "early 378 warning" that the load is nearing the engineered admission control 379 capacity, before there is any significant build-up of CL packets 380 in the queue. 
382 o Pre-emption Marking condition: the traffic level is such that the 383 router Pre-emption Marks packets. The router warns explicitly that 384 pre-emption may be needed. 386 o Configured-admission-rate: the reference rate used by the 387 admission marking algorithm in a PCN-enabled router. 389 o Configured-pre-emption-rate - the reference rate used by the pre- 390 emption marking algorithm in a PCN-enabled router. 392 The following terms are defined here: 394 o Ingress gateway: router at an ingress to the CL-region. A CL- 395 region may have several ingress gateways. 397 o Egress gateway: router at an egress from the CL-region. A CL- 398 region may have several egress gateways. 400 o Interior router: a router which is part of the CL-region, but 401 isn't an ingress or egress gateway. 403 o CL-region: A region of the Internet in which all traffic 404 enters/leaves through an ingress/egress gateway and all routers 405 run Pre-Congestion Notification marking. A CL-region is a DiffServ 406 region (a DiffServ region is either a single DiffServ domain or 407 set of contiguous DiffServ domains), but note that the CL-region 408 does not use the traffic conditioning agreements (TCAs) of the 409 (informational) DiffServ architecture. 411 o CL-region-aggregate: all the microflows between a specific pair of 412 ingress and egress gateways. Note there is no field in the flow 413 packet headers that uniquely identifies the aggregate. 415 o Congestion-Level-Estimate: the number of bits in CL packets that 416 are admission marked (or pre-emption marked), divided by the 417 number of bits in all CL packets. It is calculated as an 418 exponentially weighted moving average. It is calculated by an 419 egress gateway for the CL packets from a particular ingress 420 gateway, i.e. there is a Congestion-Level-Estimate for each CL- 421 region-aggregate. 423 o Sustainable-Aggregate-Rate: the rate of traffic that the network 424 can actually support for a specific CL-region-aggregate. 
So it is 425 measured by an egress gateway for the CL packets from a particular 426 ingress gateway. 428 o Ingress-Aggregate-Rate: the rate of traffic that is being sent on 429 a specific CL-region-aggregate. So it is measured by an ingress 430 gateway for the CL packets sent towards a particular egress 431 gateway. 433 1.4. Existing terminology 435 This is a placeholder for useful terminology that is defined 436 elsewhere. 438 1.5. Standardisation requirements 440 The framework described in this document has two new standardisation 441 requirements: 443 o new Pre-Congestion Notification for Admission Marking and Pre- 444 emption Marking are required, as detailed in [PCN]. 446 o the end-to-end signalling protocol needs to be modified to carry 447 the Congestion-Level-Estimate report (for admission control) and 448 the Sustainable-Aggregate-Rate (for flow pre-emption). With our 449 assumption of RSVP (Section 2.2) as the end-to-end signalling 450 protocol, it means that extensions to RSVP are required, as 451 detailed in [RSVP-PCN], for example to carry the Congestion-Level- 452 Estimate and Sustainable-Aggregate-Rate information from egress 453 gateway to ingress gateway. 455 o We are discussing what to standardise about the gateway's 456 behaviour. 458 Other than these things, the arrangement uses existing IETF protocols 459 throughout, although not in their usual architecture. 461 1.6. Structure of rest of the document 463 Section 2 describes some key aspects of the deployment model: our 464 goals and assumptions. Section 3 describes the deployment model, 465 whilst Section 4 summarises the required changes to the various 466 routers in the CL-region. Section 5 outlines some limitations of PCN 467 that we've identified in this deployment model; it also discusses 468 some potential solutions, and other possible extensions. Section 6 469 provides some comparison with existing QoS mechanisms. 471 2. 
Key aspects of the deployment model 473 EDITOR'S NOTE: The material in Section 2 will eventually disappear, 474 as it will be covered by the problem statement of the PCN Working 475 Group (if formed). 477 In this section we discuss the key aspects of the deployment model: 479 o At a high level, our key goals, i.e. the functionality that we 480 want to achieve 482 o The assumptions that we're prepared to make 484 2.1. Key goals 486 The deployment model achieves an end-to-end controlled load (CL) 487 service where a segment of the end-to-end path is an edge-to-edge 488 Pre-Congestion Notification region. CL is a quality of service (QoS) 489 closely approximating the QoS that the same flow would receive from a 490 lightly loaded network element [RFC2211]. It is useful for inelastic 491 flows such as those for real-time media. 493 o The CL service should be achieved despite varying load levels of 494 other sorts of traffic, which may or may not be rate adaptive 495 (i.e. responsive to packet drops or ECN marks). 497 o The CL service should be supported for a variety of possible CL 498 sources: Constant Bit Rate (CBR), Variable Bit Rate (VBR) and 499 voice with silence suppression. VBR is the most challenging to 500 support. 502 o After a localised failure in the interior of the CL-region causing 503 heavy congestion, the CL service should recover gracefully by pre- 504 empting (dropping) some of the admitted CL microflows, whilst 505 preserving as many of them as possible with their full CL QoS. 507 o It needs to be possible to complete flow pre-emption within 1-2 508 seconds. Operators will have varying requirements but, at least 509 for voice, it has been estimated that after a few seconds then 510 many affected users will start to hang up, making the flow pre- 511 emption mechanism redundant and possibly even counter-productive. 512 Until flow pre-emption kicks in, other applications using CL (e.g. 513 video) and lower priority traffic (e.g. 
      Assured Forwarding (AF)) could be receiving reduced service.
      Therefore an even faster flow pre-emption mechanism would be
      desirable (even if, in practice, operators have to add a
      deliberate pause to ride out a transient while the natural rate
      of call tear down or lower layer protection mechanisms kick in).

   o  The CL service should support emergency services ([EMERG-RQTS],
      [EMERG-TEL]) as well as the Assured Service which is the IP
      implementation of the existing ITU-T/NATO/DoD telephone system
      architecture known as Multi-Level Pre-emption and Precedence
      [ITU.MLPP.1990] [ANSI.MLPP.Spec] [ANSI.MLPP.Supplement], or MLPP.
      In particular, this involves admitting new flows that are part of
      high priority sessions even when admission control would reject
      new routine flows. Similarly, when having to choose which flows to
      pre-empt, this involves taking into account the priorities and
      properties of the sessions that flows are part of.

2.2. Key assumptions

   The framework does not try to deliver the above functionality in all
   scenarios. We make the following assumptions about the type of
   scenario to be solved.

   o  Edge-to-edge: all the routers in the CL-region are upgraded with
      Pre-Congestion Notification, and all the ingress and egress
      gateways are upgraded to perform the measurement-based admission
      control and flow pre-emption. Note that although the upgrades
      required are edge-to-edge, the CL service is provided end-to-end.

   o  Additional load: we assume that any additional load offered within
      the reaction time of the admission control mechanism doesn't move
      the CL-region directly from no congestion to overload. So it
      assumes there will always be an intermediate stage where some CL
      packets are Admission Marked, but they are still delivered without
      significant QoS degradation.
      We believe this is valid for core and backbone networks with
      typical call arrival patterns (given the reaction time is little
      more than one round trip time across the CL-region), but is
      unlikely to be valid in access networks where the granularity of
      an individual call becomes significant.

   o  Aggregation: we assume that in normal operations, there are many
      CL microflows within the CL-region, typically at least hundreds
      between any pair of ingress and egress gateways. The implication
      is that the solution is targeted at core and backbone networks and
      possibly parts of large access networks.

   o  Trust: we assume that there is trust between all the routers in
      the CL-region. For example, this trust model is satisfied if one
      operator runs the whole of the CL-region. But we make no such
      assumptions about the end hosts, i.e. depending on the scenario
      they may be trusted or untrusted by the CL-region.

   o  Signalling: we assume that the end-to-end signalling protocol is
      RSVP. Section 3 describes how the CL-region fits into such an
      end-to-end QoS scenario, whilst [RSVP-PCN] describes the
      extensions to RSVP that are required.

   o  Separation: we assume that all routers within the CL-region are
      upgraded with the CL mechanism, so the requirements of [Floyd] are
      met because the CL-region is an enclosed environment. Also, an
      operator separates CL-traffic in the CL-region from outside
      traffic by administrative configuration of the ring of gateways
      around the region. Within the CL-region we assume that the
      CL-traffic is separated from non-CL traffic.

   o  Routing: we assume that all packets between a pair of ingress and
      egress gateways follow the same path, or that they follow
      different paths but that the load balancing scheme is tuned in the
      CL-region to distribute load such that the different paths always
      receive comparable relative load.
      This ensures that the Congestion-Level-Estimate used in the
      admission control procedure (and which is computed taking into
      account packets travelling on all the paths) approximately
      reflects the status of the actual path that will be followed by
      the new microflow's packets.

   We are investigating ways of loosening the restrictions set by some
   of these assumptions, for instance:

   o  Trust: to allow the CL-region to span multiple, non-trusting
      operators, using the technique of [Re-PCN] as mentioned in Section
      5.7.2.

   o  Signalling: we believe that the solution could operate with
      another signalling protocol, such as the one produced by the NSIS
      working group. It could also work with application level
      signalling as suggested in [RT-ECN].

   o  Additional load: we believe that the assumption is valid for core
      and backbone networks, with an appropriate margin between the
      configured-admission-rate and the capacity for CL traffic.
      However, in principle a burst of admission requests can occur in a
      short time. We expect this to be a rare event under normal
      conditions, but it could happen e.g. due to a 'flash crowd'. If it
      does, then more flows may be admitted than should be, triggering
      the pre-emption mechanism. There are various ways an operator
      might try to alleviate this issue, which are discussed in the
      'Flash crowds' section 5.5 later.

   o  Separation: the assumption that CL traffic is separated from
      non-CL traffic implies that the CL traffic has its own PHB, not
      shared with other traffic. We are looking at whether it could
      share Expedited Forwarding's PHB, but supplemented with
      Pre-Congestion Notification. If this is possible, other PHBs (like
      Assured Forwarding) could be supplemented with the same new
      behaviours. This is similar to how RFC3168 ECN was defined to
      supplement any PHB.

   o  Routing: we are looking in greater detail at the solution in the
      presence of Equal Cost Multi-Path routing and at suitable
      enhancements. See also the 'ECMP' section 5.1 later.

3. Deployment model

3.1. Admission control

   In this section we describe the admission control mechanism. We
   discuss the three pieces of the solution and then give an example of
   how they fit together in a use case:

   o  the new Pre-Congestion Notification for Admission Marking used by
      all routers in the CL-region

   o  how the measurements made support our admission control mechanism

   o  how the edge to edge mechanism fits into the end to end RSVP
      signalling

3.1.1. Pre-Congestion Notification for Admission Marking

   This is discussed in [PCN]. Here we only give a brief outline.

   To support our admission control mechanism, each router in the
   CL-region runs an algorithm to determine whether to Admission Mark
   the packet. The algorithm measures the aggregate CL traffic on the
   link and ensures that packets are admission marked before the actual
   queue builds up, but when it is in danger of doing so soon; the
   probability of admission marking increases with the danger. The
   algorithm's main parameter is the configured-admission-rate, which is
   set lower than the link speed, perhaps considerably so. Admission
   marked packets indicate that the CL traffic rate is reaching the
   configured-admission-rate and so act as an "early warning" that the
   engineered capacity is nearly reached. Therefore they indicate that
   requests to admit prospective new CL flows may need to be refused.

3.1.2. Measurements to support admission control

   To support our admission control mechanism the egress measures the
   Congestion-Level-Estimate for traffic from each remote ingress
   gateway, i.e. per CL-region-aggregate.
   The Congestion-Level-Estimate is the number of bits in CL packets
   that are admission marked or pre-emption marked, divided by the
   number of bits in all CL packets. It is calculated as an
   exponentially weighted moving average. It is calculated by an egress
   gateway separately for the CL packets from each particular ingress
   gateway.

   Why are pre-emption marked packets included in the
   Congestion-Level-Estimate? Pre-emption marking over-writes admission
   marking, i.e. a packet cannot be both admission and pre-emption
   marked. So if pre-emption marked packets weren't counted we would
   have the anomaly that as the traffic rate grew above the
   configured-pre-emption-rate, the Congestion-Level-Estimate would
   fall. If a particular encoding scheme is chosen where a packet can be
   both admission and pre-emption marked (such as Alternative 4 in
   Appendix C of [PCN]), then this is not necessary.

   This Congestion-Level-Estimate provides an estimate of how near the
   links on the path inside the CL-region are getting to the
   configured-admission-rate. Note that the metering is done separately
   per ingress gateway, because there may be sufficient capacity on all
   the routers on the path between one ingress gateway and a particular
   egress, but not from a second ingress to that same egress gateway.

3.1.3. How edge-to-edge admission control supports end-to-end QoS
       signalling

   Consider a scenario that consists of two end hosts, each connected to
   their own access networks, which are linked by the CL-region. A
   source tries to set up a new CL microflow by sending an RSVP PATH
   message, and the receiving end host replies with an RSVP RESV
   message. Outside the CL-region some other method, for instance
   IntServ, is used to provide QoS.
   From the perspective of RSVP the CL-region is a single hop, so the
   RSVP PATH and RESV messages are processed by the ingress and egress
   gateways but are carried transparently across all the interior
   routers; hence, the ingress and egress gateways hold per microflow
   state, whilst no per microflow state is kept by the interior routers.
   So far this is as in IntServ over DiffServ [RFC2998]. However, in
   order to support our admission control mechanism, the egress gateway
   adds to the RESV message an opaque object which states the current
   Congestion-Level-Estimate for the relevant CL-region-aggregate.
   Details of the corresponding RSVP extensions are described in
   [RSVP-PCN].

3.1.4. Use case

   To see how the three pieces of the solution fit together, we imagine
   a scenario where some microflows are already in place between a given
   pair of ingress and egress gateways, but the traffic load is such
   that no packets from these flows are admission marked as they travel
   across the CL-region. A source wanting to start a new CL microflow
   sends an RSVP PATH message. The egress gateway adds an object to the
   RESV message with the Congestion-Level-Estimate, which is zero. The
   ingress gateway sees this and consequently admits the new flow. It
   then forwards the RSVP RESV message upstream towards the source end
   host. Hence, assuming there's sufficient capacity in the access
   networks, the new microflow is admitted end-to-end.

   The source now sends CL packets, which arrive at the ingress gateway.
   The ingress uses a five-tuple filter to identify that the packets are
   part of a previously admitted CL microflow, and it also polices the
   microflow to ensure it remains within its traffic profile. (The
   ingress has learnt the required information from the RSVP messages.)
   When forwarding a packet belonging to an admitted microflow, the
   ingress sets the packet's DSCP and ECN fields to the appropriate
   values configured for the CL region. The CL packet now travels across
   the CL-region, getting admission marked if necessary.

   Next, we imagine the same scenario but at a later time when load is
   higher at one (or more) of the interior routers, which start to
   Admission Mark CL packets, because their load on the outgoing link is
   nearing the configured-admission-rate. The next time a source tries
   to set up a CL microflow, the ingress gateway learns (from the
   egress) the relevant Congestion-Level-Estimate. If it is greater than
   some CLE-threshold value then the ingress refuses the request,
   otherwise it is accepted. The ingress gateway could also take into
   account attributes of the RSVP reservation (such as for example the
   RSVP pre-emption priority of [RSVP-PREEMPTION] or the RSVP admission
   priority of [RSVP-EMERGENCY]) as well as information provided by a
   policy decision point in order to make a more sophisticated admission
   decision. This way, flow admission can help emergency/military calls
   by taking into account the corresponding priorities (as conveyed in
   RSVP policy elements) when deciding to admit or reject a new
   reservation. Use of RSVP for the support of emergency/military
   applications is discussed in further detail in [RFC4542] and
   [RSVP-EMERGENCY].

   It is also possible for an egress gateway to get an RSVP RESV message
   and not know what the Congestion-Level-Estimate is, for example if
   there are no CL microflows at present between the relevant ingress
   and egress gateways. In this case the egress requests the ingress to
   send probe packets, from which it can initialise its meter. RSVP
   Extensions for such a request to send probe data can be found in
   [RSVP-PCN].

3.2. Flow pre-emption

   In this section we describe the flow pre-emption mechanism. We
   discuss the two parts of the solution and then give an example of how
   they fit together in a use case:

   o  How an ingress gateway is triggered to test whether flow
      pre-emption may be needed

   o  How an ingress gateway determines the right amount of CL traffic
      to drop

   The mechanism is defined in [PCN] and [RSVP-PCN].

   Two subsequent steps could be:

   o  Choose which flows to shed, influenced by their priority and other
      policy information

   o  Tear down the reservations for the chosen flows

   We provide some hints about these latter two steps in Section 3.2.3,
   but don't try to provide full guidance as it greatly depends on the
   particular detailed operational situation.

   An essential QoS issue in core and backbone networks is being able to
   cope with failures of routers and links. The consequent re-routing
   can cause severe congestion on some links and hence degrade the QoS
   experienced by on-going microflows and other, lower priority traffic.
   Even when the network is engineered to sustain a single link failure,
   multiple link failures (e.g. due to a fibre cut, router failure or a
   natural disaster) can cause violation of capacity constraints and
   resulting QoS failures. Our solution uses rate-based flow
   pre-emption, so that sufficient of the previously admitted CL
   microflows are dropped to ensure that the remaining ones again
   receive QoS commensurate with the CL service and at least some QoS is
   quickly restored to other traffic classes.

3.2.1. Alerting an ingress gateway that flow pre-emption may be needed

   Alerting an ingress gateway that flow pre-emption may be needed is a
   two stage process: a router in the CL-region alerts an egress gateway
   that flow pre-emption may be needed; in turn the egress gateway
   alerts the relevant ingress gateway.
   Every router in the CL-region has the ability to alert egress
   gateways, which may be done either explicitly or implicitly:

   o  Explicit - the router per-hop behaviour is supplemented with a new
      Pre-emption Marking behaviour, which is outlined below. Reception
      of such a packet by the egress gateway alerts it that pre-emption
      may be needed.

   o  Implicit - the router behaviour is unchanged from the Admission
      Marking behaviour described earlier. The egress gateway treats a
      Congestion-Level-Estimate of (almost) 100% as an implicit alert
      that pre-emption may be required. ('Almost' because the
      Congestion-Level-Estimate is a moving average, so can never reach
      exactly 100%.)

   To support explicit pre-emption alerting, each router in the
   CL-region runs an algorithm to determine whether to Pre-emption Mark
   the packet. The algorithm measures the aggregate CL traffic and
   ensures that packets are pre-emption marked before the actual queue
   builds up. The algorithm's main parameter is the
   configured-pre-emption-rate, which is set lower than the link speed
   (but higher than the configured-admission-rate). Thus pre-emption
   marked packets indicate that the CL traffic rate is reaching the
   configured-pre-emption-rate and so act as an "early warning" that the
   engineered capacity is nearly reached. Therefore they indicate that
   it may be advisable to pre-empt some of the existing CL flows in
   order to preserve the QoS of the others.

   Note that the pre-emption marking algorithm doesn't measure the
   packets that are already Pre-emption Marked. This ensures that, in a
   scenario with several links that are above their
   configured-pre-emption-rate, the rate of packets at the egress
   gateway excluding Pre-emption Marked ones truly does represent the
   Sustainable-Aggregate-Rate (see below for explanation).
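
   The implicit alert described above can be illustrated with a small
   Python sketch of an egress meter for one CL-region-aggregate. It is
   only a sketch under our own assumptions: the names CleMeter and
   implicit_alert, the weight of 0.05, the 0.99 trigger standing in for
   "(almost) 100%", and the choice of smoothing marked bits and total
   bits separately are all hypothetical, and are not the averaging
   procedure specified in [PCN].

```python
from dataclasses import dataclass


@dataclass
class CleMeter:
    """Illustrative per-ingress Congestion-Level-Estimate meter."""
    weight: float = 0.05      # hypothetical EWMA weight
    marked_bits: float = 0.0  # smoothed bits in marked CL packets
    total_bits: float = 0.0   # smoothed bits in all CL packets

    def update(self, size_bits: int, marked: bool) -> float:
        """Fold one CL packet into both averages and return the new CLE.

        'marked' means admission marked or pre-emption marked.
        """
        w = self.weight
        sample = size_bits if marked else 0
        self.marked_bits = (1 - w) * self.marked_bits + w * sample
        self.total_bits = (1 - w) * self.total_bits + w * size_bits
        return self.cle

    @property
    def cle(self) -> float:
        """Fraction of (smoothed) CL bits that were carried in marked packets."""
        return self.marked_bits / self.total_bits if self.total_bits else 0.0


def implicit_alert(cle: float, almost_full: float = 0.99) -> bool:
    """Assumed trigger: a CLE of (almost) 100% is an implicit alert."""
    return cle >= almost_full
```

   In this sketch, a meter that has seen a long run of unmarked packets
   and then only marked packets sees its estimate climb towards 100%
   geometrically, which is why the trigger is set just below 100%
   rather than at exactly 100%.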

   Note that the explicit mechanism only makes sense if all the routers
   in the CL-region have the functionality so that the egress gateways
   can rely on the explicit mechanism. Otherwise there is the danger
   that the traffic happens to focus on a router without it, and egress
   gateways then have also to watch for implicit pre-emption alerts.

   When one or more packets in a CL-region-aggregate alert the egress
   gateway of the need for flow pre-emption, whether explicitly or
   implicitly, the egress puts that CL-region-aggregate into the
   Pre-emption Alert state. For each CL-region-aggregate in alert state
   it measures the rate of traffic at the egress gateway (i.e. the
   traffic rate of the appropriate CL-region-aggregate) and reports this
   to the relevant ingress gateway. The steps are:

   o  Determine the relevant ingress gateway - for the explicit case the
      egress gateway examines the pre-emption marked packet and uses the
      state installed at the time of admission to determine which
      ingress gateway the packet came from. For the implicit case the
      egress gateway has already determined this information, because
      the Congestion-Level-Estimate is calculated per ingress gateway.

   o  Measure the traffic rate of CL packets - as soon as the egress
      gateway is alerted (whether explicitly or implicitly) it measures
      the rate of CL traffic from this ingress gateway (i.e. for this
      CL-region-aggregate). Note that pre-emption marked packets are
      excluded from that measurement. It should make its measurement
      quickly and accurately, but exactly how is up to the
      implementation.

   o  Alert the ingress gateway - the egress gateway then immediately
      alerts the relevant ingress gateway about the fact that flow
      pre-emption may be required. This Alert message also includes the
      measured Sustainable-Aggregate-Rate, i.e. the rate of CL-traffic
      received from this ingress gateway.
      The Alert message is sent using reliable delivery. Procedures for
      the support of such an Alert using RSVP are defined in [RSVP-PCN].

                --------------        _     _          -----------------
   CL packet   |Update        |     / Is it a \   Y   | Measure CL rate |
   arrives --->|Congestion-   |--->/pre-emption\----->| from ingress and|
               |Level-Estimate|     \  marked /       | alert ingress   |
                --------------       \packet?/         -----------------
                                      \_   _/

      Figure 2: Egress gateway action for explicit Pre-emption Alert

                                      _     _
                --------------       /       \         -----------------
   CL packet   |Update        |     /   Is    \   Y   | Measure CL rate |
   arrives --->|Congestion-   |--->/  C.L.E.   \----->| from ingress and|
               |Level-Estimate|     \(nearly) /       | alert ingress   |
                --------------       \ 100%? /         -----------------
                                      \_   _/

      Figure 3: Egress gateway action for implicit Pre-emption Alert

3.2.2. Determining the right amount of CL traffic to drop

   The method relies on the insight that the amount of CL traffic that
   can be supported between a particular pair of ingress and egress
   gateways, is the amount of CL traffic that is actually getting across
   the CL-region to the egress gateway without being Pre-emption Marked.
   Hence we term it the Sustainable-Aggregate-Rate.

   So when the ingress gateway gets the Alert message from an egress
   gateway, it compares:

   o  The traffic rate that it is sending to this particular egress
      gateway (which we term Ingress-Aggregate-Rate)

   o  The traffic rate that the egress gateway reports (in the Alert
      message) that it is receiving from this ingress gateway (which is
      the Sustainable-Aggregate-Rate)

   If the difference is significant, then the ingress gateway pre-empts
   some microflows. It only pre-empts if:

      Ingress-Aggregate-Rate > Sustainable-Aggregate-Rate + error

   The "error" term is partly to allow for inaccuracies in the
   measurements of the rates.
   It is also needed because the Ingress-Aggregate-Rate is measured at a
   slightly later moment than the Sustainable-Aggregate-Rate, and it is
   quite possible that the Ingress-Aggregate-Rate has increased in the
   interim due to natural variation of the bit rate of the CL sources.
   So the "error" term allows for some variation in the ingress rate
   without triggering pre-emption.

   The ingress gateway should pre-empt enough microflows to ensure that:

      New Ingress-Aggregate-Rate < Sustainable-Aggregate-Rate - error

   The "error" term here is used for similar reasons but in the other
   direction, to ensure slightly more load is shed than seems necessary,
   in case the two measurements were taken during a short-term fall in
   load.

   When the routers in the CL-region are using explicit pre-emption
   alerting, the ingress gateway would normally pre-empt microflows
   whenever it gets an alert (it always would if it were possible to set
   "error" equal to zero). For the implicit case however this is not so.
   It receives an Alert message when the Congestion-Level-Estimate
   reaches (almost) 100%, which is roughly when traffic exceeds the
   configured-admission-rate. However, it is only when packets are
   indeed dropped en route that the Sustainable-Aggregate-Rate becomes
   less than the Ingress-Aggregate-Rate so only then will pre-emption
   actually occur on the ingress gateway.

   Hence with the implicit scheme, pre-emption can only be triggered
   once the system starts dropping packets and thus the QoS of flows
   starts being significantly degraded. This is in contrast with the
   explicit scheme which allows flow pre-emption to be triggered before
   any packet drop, simply when the traffic reaches the
   configured-pre-emption-rate. Therefore we believe that the explicit
   mechanism is superior.
   However it does require new functionality on all the routers
   (although this is little more than a bulk token bucket - see [PCN]
   for details).

3.2.3. Use case for flow pre-emption

   To see how the pieces of the solution fit together in a use case, we
   imagine a scenario where many microflows have already been admitted.
   We confine our description to the explicit pre-emption mechanism. Now
   an interior router in the CL-region fails. The network layer routing
   protocol re-routes round the problem, but as a consequence traffic on
   other links increases. In fact let's assume the traffic on one link
   now exceeds its configured-pre-emption-rate and so the router
   pre-emption marks CL packets. When the egress sees the first one of
   the pre-emption marked packets it immediately determines which
   microflow this packet is part of (by using a five-tuple filter and
   comparing it with state installed at admission) and hence which
   ingress gateway the packet came from. It sets up a meter to measure
   the traffic rate from this ingress gateway, and as soon as possible
   sends a message to the ingress gateway. This message alerts the
   ingress gateway that pre-emption may be needed and contains the
   traffic rate measured by the egress gateway. Then the ingress gateway
   determines the traffic rate that it is sending towards this egress
   gateway and hence it can calculate the amount of traffic that needs
   to be pre-empted.

   The solution operates within a little over one round trip time - the
   time required for microflow packets that have experienced Pre-emption
   Marking to travel downstream through the CL-region and arrive at the
   egress gateway, plus some additional time for the egress gateway to
   measure the rate seen after it has been alerted that pre-emption may
   be needed, and the time for the egress gateway to report this
   information to the ingress gateway.
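
   The comparison made by the ingress gateway in Section 3.2.2 can be
   sketched in Python as follows. This is a minimal illustration and not
   part of the mechanism defined in [PCN] or [RSVP-PCN]: the Flow record,
   the function name flows_to_preempt, the integer priority (larger
   meaning more important) and the assumption that the ingress knows a
   rate for each flow are all hypothetical.

```python
from typing import List, NamedTuple


class Flow(NamedTuple):
    """One admitted CL microflow as seen by the ingress (hypothetical record)."""
    flow_id: str
    rate: float     # flow rate, in the same units as the aggregate rates
    priority: int   # assumption: a larger value means a more important flow


def flows_to_preempt(flows: List[Flow], ingress_rate: float,
                     sustainable_rate: float, error: float) -> List[Flow]:
    """Pick flows to shed after an Alert message, least important first."""
    # Trigger test: Ingress-Aggregate-Rate > Sustainable-Aggregate-Rate + error
    if ingress_rate <= sustainable_rate + error:
        return []
    # Goal: new Ingress-Aggregate-Rate < Sustainable-Aggregate-Rate - error
    target = sustainable_rate - error
    remaining = ingress_rate
    shed: List[Flow] = []
    for flow in sorted(flows, key=lambda f: f.priority):
        if remaining < target:
            break
        shed.append(flow)
        remaining -= flow.rate
    return shed
```

   For example, with four flows of rate 10 each, an
   Ingress-Aggregate-Rate of 40, a reported Sustainable-Aggregate-Rate
   of 30 and an "error" of 2, the sketch sheds the two lowest priority
   flows, leaving a rate of 20, which is below the target of 28. With a
   reported Sustainable-Aggregate-Rate of 39 nothing is shed, because
   40 does not exceed 39 + 2.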

   The ingress gateway could now just shed random microflows, but it is
   better if the least important ones are dropped. The ingress gateway
   could use information stored locally in each reservation's state
   (such as for example the RSVP pre-emption priority of
   [RSVP-PREEMPTION] or the RSVP admission priority of [RSVP-EMERGENCY])
   as well as information provided by a policy decision point in order
   to decide which of the flows to shed (or perhaps which ones not to
   shed). This way, flow pre-emption can also help emergency/military
   calls by taking into account the corresponding priorities (as
   conveyed in RSVP policy elements) when selecting calls to be
   pre-empted, which is likely to be particularly important in a
   disaster scenario. Use of RSVP for support of emergency/military
   applications is discussed in further detail in [RFC4542] and
   [RSVP-EMERGENCY].

   The ingress gateway then initiates RSVP signalling to instruct the
   relevant destinations that their reservation has been terminated, and
   to tell (RSVP) nodes along the path to tear down associated RSVP
   state. To guard against recalcitrant sources, normal IntServ policing
   may be used to block any future traffic from the dropped flows from
   entering the CL-region. Note that - with the explicit Pre-emption
   Alert mechanism - since the configured-pre-emption-rate may be
   significantly less than the physical line capacity, flow pre-emption
   may be triggered before any congestion has actually occurred and
   before any packet is dropped.

   We extend the scenario further by imagining that (due to a disaster
   of some kind) further routers in the CL-region fail during the time
   taken by the pre-emption process described above. This is handled
   naturally, as packets will continue to be pre-emption marked and so
   the pre-emption process will happen for a second time.

3.3. Both admission control and pre-emption

   This document describes both the admission control and pre-emption
   mechanisms, and we suggest that an operator uses both. However, we do
   not require this and some operators may want to implement only one.

   For example, an operator could use just admission control, solving
   heavy congestion (caused by re-routing) by 'just waiting' - as
   sessions end, existing microflows naturally depart from the system
   over time, and the admission control mechanism will prevent admission
   of new microflows that use the affected links. So the CL-region will
   naturally return to normal controlled load service, but with reduced
   capacity. The drawback of this approach would be that until flows
   naturally depart to relieve the congestion, all flows and lower
   priority services will be adversely affected. As another example, an
   operator could use just admission control, avoiding heavy congestion
   (caused by re-routing) by 'capacity planning' - by configuring
   admission control thresholds to lower levels than the network could
   accept in normal situations such that the load after failure is
   expected to stay below acceptable levels even with reduced network
   resources.

   On the other hand, an operator could just rely for admission control
   on the traffic conditioning agreements of the DiffServ architecture
   [RFC2475]. The pre-emption mechanism described in this document would
   be used to counteract the problem described at the end of Section
   1.1.1.

4. Summary of Functionality

   This section is intended to provide a systematic summary of the new
   functionality required by the routers in the CL-region.

   A network operator upgrades normal IP routers by:

   o  Adding functionality related to admission control and flow
      pre-emption to all its ingress and egress gateways

   o  Adding Pre-Congestion Notification for Admission Marking and
      Pre-emption Marking to all the routers in the CL-region.

   We consider the detailed actions required for each of the types of
   router in turn.

4.1. Ingress gateways

   Ingress gateways perform the following tasks:

   o  Classify incoming packets - decide whether they are CL or non-CL
      packets. This is done using an IntServ filter spec (source and
      destination addresses and port numbers), whose details have been
      gathered from the RSVP messaging.

   o  Police - check that the microflow conforms with what has been
      agreed (i.e. it keeps to its agreed data rate). If necessary,
      packets which do not correspond to any reservations, packets which
      are in excess of the rate agreed for their reservation, and
      packets for a reservation that has earlier been pre-empted may be
      policed. Policing may be achieved via dropping or via re-marking
      of the packet's DSCP to a value different from the CL behaviour
      aggregate.

   o  ECN colouring packets - for CL microflows, set the ECN field of
      packets appropriately (see [PCN] for some discussion of encoding).

   o  Perform 'interior router' functions (see next sub-section).

   o  Admission Control - on new session establishment, consider the
      Congestion-Level-Estimate received from the corresponding egress
      gateway and, most likely based on a simple configured
      CLE-threshold, decide if a new call is to be admitted or rejected
      (taking into account local policy information as well as,
      optionally, information provided by a policy decision point).

   o  Probe - if requested by the egress gateway to do so, the ingress
      gateway generates probe traffic so that the egress gateway can
      compute the Congestion-Level-Estimate from this ingress gateway.
      Probe packets may be simple data addressed to the egress gateway
      and require no protocol standardisation, although there will be
      best practice for their number, size and rate.

   o  Measure - when it receives a Pre-emption Alert message from an
      egress gateway, it determines the rate at which it is sending
      packets to that egress gateway

   o  Pre-empt - calculate how much CL traffic needs to be pre-empted;
      decide which microflows should be dropped, perhaps in consultation
      with a Policy Decision Point; and do the necessary signalling to
      drop them.

4.2. Interior routers

   Interior routers do the following tasks:

   o  Classify packets - examine the DSCP and ECN field to see if it's a
      CL packet

   o  Non-CL packets are handled as usual, with respect to dropping them
      or setting their CE codepoint.

   o  Pre-Congestion Notification - CL packets are Admission Marked and
      Pre-emption Marked according to the algorithm detailed in [PCN]
      and outlined in Section 3.

4.3. Egress gateways

   Egress gateways do the following tasks:

   o  Classify packets - determine which ingress gateway a CL packet has
      come from. This is the previous RSVP hop, hence the necessary
      details are obtained just as with IntServ from the state
      associated with the packet five-tuple, which has been built using
      information from the RSVP messages.

   o  Meter - for CL packets, calculate the fraction of the total number
      of bits which are in Admission marked packets or in Pre-emption
      Marked packets. The calculation is done as an exponentially
      weighted moving average (see Appendix C). A separate calculation
      is made for CL packets from each ingress gateway.
      The meter works on an aggregate basis and not per microflow.

   o  Signal the Congestion-Level-Estimate - this is piggy-backed on
      the reservation reply.  An egress gateway's interface is
      configured to know it is an egress gateway, so it always appends
      the estimate to the RESV message.  If the
      Congestion-Level-Estimate is unknown or is too stale, then the
      egress gateway can request the ingress gateway to send probes.

   o  Packet colouring - for CL packets, set the DSCP and the ECN field
      to whatever has been agreed as appropriate for the next domain.
      By default the ECN field is set to the Not-ECT codepoint.  See
      also the discussion in the Tunnelling section later.

   o  Measure the rate - when alerted (either explicitly or implicitly)
      that pre-emption may be required, measure the rate of CL traffic
      from a particular ingress gateway, excluding packets that are
      Pre-emption Marked (i.e. the Sustainable-Aggregate-Rate for the
      CL-region-aggregate).  The measured rate is reported back to the
      appropriate ingress gateway [RSVP-PCN].

4.4.  Failures

   If an interior router fails, then the regular IP routing protocol
   will re-route round it.  If the new route can carry all the admitted
   traffic, flows will gracefully continue.  If instead this causes
   early warning of pre-congestion on the new route, then admission
   control based on pre-congestion notification will ensure new flows
   will not be admitted until enough existing flows have departed.
   Finally, re-routing may result in heavy congestion, in which case
   the flow pre-emption mechanism will kick in.

   If a gateway fails then we would like regular RSVP procedures
   [RFC2205] to take care of things.
   With the local repair mechanism of [RFC2205], when a route changes
   the next RSVP PATH refresh message will establish path state along
   the new route, and thus attempt to re-establish reservations through
   the new ingress gateway.  Essentially the same procedure is used as
   described earlier in this document, with the re-routed session
   treated as a new session request.

   In more detail, consider what happens if an ingress gateway of the
   CL-region fails.  Then RSVP routers upstream of it do IP re-routing
   to a new ingress gateway.  The next time the upstream RSVP router
   sends a PATH refresh message it reaches the new ingress gateway,
   which therefore installs the associated RSVP state.  The next RSVP
   RESV refresh will pick up the Congestion-Level-Estimate from the
   egress gateway, and the ingress compares this with its threshold to
   decide whether to admit the new session.  This could result in some
   of the flows being rejected, but those accepted will receive the
   full QoS.

   An issue with this is that we have to wait until PATH and RESV
   refresh messages are sent, which may not be very often - the default
   value is 30 seconds.  [RFC2205] discusses how to speed up the local
   repair mechanism.  First, the RSVP module is notified by the local
   routing protocol module of a route change to particular
   destinations, which triggers it to rapidly send out PATH refresh
   messages.  Further, when a PATH refresh arrives with a previous hop
   address different from the one stored, then RESV refreshes are
   immediately sent to that previous hop.  Where RSVP is operating
   hop-by-hop, i.e. on every router, then triggering the PATH refresh
   is easy as the router can simply monitor its local link.
   Thus, this fast local repair mechanism can be used to deal with
   failures upstream of the ingress gateway, with failures of the
   ingress gateway and with failures downstream of the egress gateway.

   But where RSVP is not operating hop-by-hop (as is the case within
   the CL-region), it is not so easy to trigger the PATH refresh.

   Unfortunately, this problem applies if an egress gateway fails,
   since it's very likely that an egress gateway is several IP hops
   from the ingress gateway.  (If the ingress is several IP hops from
   its previous RSVP node, then there is the same issue.)  The options
   appear to be:

   o  the ingress gateway has a link state database for the CL-region,
      so it can detect that an egress gateway has failed or become
      unreachable

   o  there is an inter-gateway protocol, so the ingress can
      continuously check that the egress gateways are still alive

   o  (default) do nothing and wait for the regular PATH/RESV refreshes
      (and, if needed, the pre-emption mechanism) to sort things out.

5.  Limitations and some potential solutions

   In this section we describe various limitations of the deployment
   model, and some suggestions about potential ways of alleviating
   them.  The limitations fall into three broad categories:

   o  ECMP (Section 5.1): the assumption about routing (Section 2.2) is
      that all packets between a pair of ingress and egress gateways
      follow the same path; ECMP breaks this assumption.  A study
      regarding the accuracy of load balancing schemes can be found in
      [LoadBalancing-a] and [LoadBalancing-b].

   o  The lack of global coordination (Sections 5.2, 5.3 and 5.4): a
      decision about admission control or flow pre-emption is made for
      one aggregate independently of other aggregates.

   o  Timing and accuracy of measurements (Sections 5.5 and 5.6): the
      assumption (Section 2.2) is that additional load, offered within
      the reaction time of the measurement-based admission control
      mechanism, doesn't move the system directly from no congestion to
      overload (dropping packets).  A 'flash crowd' may break this
      assumption (Section 5.5).  There are a variety of more general
      issues associated with marking measurements, which may mean it's
      a good idea to do pre-emption 'slower' (Section 5.6).

   Each section describes a limitation and some possible solutions to
   alleviate the limitation.  These are intended as options for an
   operator to consider, based on their particular requirements.

   We would welcome feedback, for example suggestions as to which
   potential solutions are worth working out in more detail, and ideas
   on new potential solutions.

   Finally Section 5.7 considers some other potential extensions.

5.1.  ECMP

   If the CL-region uses Equal Cost Multipath Routing (ECMP), then
   traffic between a particular pair of ingress and egress gateways may
   follow several different paths.

   Why?  An ECMP-enabled router runs an algorithm to choose between
   potential outgoing links, based on a hash of fields such as the
   packet's source and destination addresses - exactly which fields
   depends on the proprietary algorithm.  Packets are addressed to the
   CL flow's end-point, and therefore different flows may follow
   different paths through the CL-region.  (All packets of an
   individual flow follow the same ECMP path.)
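
   As an illustration of why the packets of one flow stay on one path
   while different flows may diverge, the sketch below picks an
   outgoing link from a hash of a flow's five-tuple.  The function
   name ecmp_link, the use of SHA-256 and the link names are
   assumptions made for this example only; real routers use their own
   proprietary algorithms and may hash different fields.

```python
import hashlib


def ecmp_link(src, dst, sport, dport, proto, links):
    """Pick one of the equal-cost links from a hash of the five-tuple.

    Hypothetical stand-in for a router's proprietary ECMP algorithm.
    """
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(links)
    return links[index]


links = ["link-A", "link-B", "link-C"]  # hypothetical equal-cost next hops

# Every packet of one flow carries the same five-tuple, so it always
# hashes to the same link.
flow_1 = ("192.0.2.1", "198.51.100.7", 5004, 5006, "udp")
first = ecmp_link(*flow_1, links)
again = ecmp_link(*flow_1, links)

# A different flow between the same pair of gateways has a different
# five-tuple and may therefore be sent over a different link.
flow_2 = ("192.0.2.9", "198.51.100.8", 6000, 6002, "udp")
other = ecmp_link(*flow_2, links)
```

   Because the choice depends only on header fields, the egress
   gateway cannot tell from an aggregate measurement which of the
   paths a given marked or unmarked packet took.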

   The problem is that if one of the paths is congested such that
   packets are being Admission Marked, then the
   Congestion-Level-Estimate measured by the egress gateway will be
   diluted by unmarked packets from other non-congested paths.
   Similarly, the measurement of the Sustainable-Aggregate-Rate will
   also be diluted.

   Possible solution approaches are:

   o  tunnel: traffic is tunnelled across the CL-region.  Then the
      destination address (and so on) seen by the ECMP algorithm is
      that of the egress gateway, so all flows follow the same path.
      Effectively ECMP is turned off.  As a compromise, to try to
      retain some of the benefits of ECMP, there could be several
      tunnels, each following a different ECMP path, with flows
      randomly assigned to different tunnels.

   o  assume worst case: the operator sets the
      configured-admission-rate (and configured-pre-emption-rate) to
      below the optimum level to compensate for the fact that the
      effect on the Congestion-Level-Estimate (and
      Sustainable-Aggregate-Rate) of the congestion experienced over
      one of the paths may be diluted by traffic received over
      non-congested paths.  Hence lower thresholds need to be used to
      ensure early admission control rejection and pre-emption over the
      congested path.  This approach will waste capacity (e.g. flows
      following a non-congested ECMP path are not admitted or are
      pre-empted), and there is still the danger that for some traffic
      mixes the operator hasn't been cautious enough.

   o  for admission control, probe to obtain a flow-specific
      congestion-level-estimate.  Earlier this document suggests
      continuously monitoring the congestion-level-estimate.  Instead,
      probe packets could be sent for each prospective new flow.  The
      probe packets have the same IP addresses etc. as the data packets
      would have, and hence follow the same ECMP path.
      However, probing is an extra overhead, depending on how many
      probe packets need to be sent to get a sufficiently accurate
      congestion-level-estimate.  Probes also cause a processing
      overhead, either for the machine at the destination address or
      for the egress gateway to identify and remove the probe packets.

   o  for flow pre-emption, only select flows for pre-emption from
      amongst those that have actually received a Pre-emption Marked
      packet, because these flows must have followed an ECMP path that
      goes through an overloaded router.  However, it needs some extra
      work by the egress gateway, to record this information and report
      it to the ingress gateway.

   o  for flow pre-emption, a variant of this idea involves introducing
      a new marking behaviour, 'Router Marking'.  A router that is
      pre-emption marking packets on an outgoing link also 'Router
      Marks' all other packets.  When selecting flows for pre-emption,
      the selection is made from amongst those that have actually
      received a Router Marked or Pre-emption Marked packet.  Hence
      compared with the previous bullet, it may extend the range of
      flows from which the pre-emption selection is made (i.e. it
      includes those which, by chance, haven't had any pre-emption
      marked packets).  However, it also requires that the 'Router
      Marking' state is somehow encoded into a packet, i.e. it makes
      the encoding challenge discussed in Appendix C of [PCN] harder.
      The extra work required by the egress gateway would also be
      somewhat higher than for the previous bullet.

5.2.  Beat down effect

   This limitation concerns the pre-emption mechanism in the case where
   more than one router is pre-emption marking packets.  The result
   (explained in the next paragraph) is that the measurement of
   sustainable-aggregate-rate is lower than its true value, so more
   traffic is pre-empted than necessary.
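
   The effect can be reproduced numerically.  The sketch below is an
   illustration only, not the marking algorithm of [PCN]: it assumes
   that each router marks a uniform fraction of the not-yet-marked
   traffic so that the unmarked rate just meets its
   configured-pre-emption-rate, and that all aggregates meet the
   routers in the same order.  The function names are hypothetical, and
   the value 3.0 for R2 is an assumed stand-in for "CPR>2".  The rates
   are those of the Figure 4 scenario that follows.

```python
def mark_fraction(unmarked_load, cpr):
    """Fraction of still-unmarked traffic that a router pre-emption marks."""
    if unmarked_load <= cpr:
        return 0.0
    return 1.0 - cpr / unmarked_load


def sustainable_rates(ingress_rates, paths, routers):
    """Unmarked rate of each aggregate after crossing the routers in order.

    Each router decides locally, knowing nothing about marking elsewhere.
    """
    unmarked = dict(ingress_rates)
    for name, cpr in routers:
        crossing = [a for a in unmarked if name in paths[a]]
        load = sum(unmarked[a] for a in crossing)
        fraction = mark_fraction(load, cpr)
        for a in crossing:
            unmarked[a] *= 1.0 - fraction
    return unmarked


ingress_rates = {"a": 1.0, "b": 3.0}                   # IAR-a, IAR-b
paths = {"a": ["R1", "R2"], "b": ["R1", "R2", "R3"]}
routers = [("R1", 2.0), ("R2", 3.0), ("R3", 1.0)]      # (name, CPR)

sar = sustainable_rates(ingress_rates, paths, routers)
# sar is approximately {'a': 0.5, 'b': 1.0}

# Load left on R1 once each aggregate is cut back to its measured SAR:
r1_load_after = sar["a"] + sar["b"]   # about 1.5, below R1's CPR of 2
```

   Under these assumptions R1 is left carrying about 1.5 units although
   it could sustain 2, which is the over-pre-emption described in this
   section.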

   Imagine the scenario:

                 +-------+     +-------+     +-------+
   IAR-b=3 >@@@@@| CPR=2 |@@@@@| CPR>2 |@@@@@| CPR=1 |@@> SAR-b=1
   IAR-a=1 >#####|  R1   |#####|  R2   |     |  R3   |
                 +-------+     +-------+     +-------+
                                   #
                                   #
                                   #
                                   v SAR-a=0.5

      Figure 4: Scenario to illustrate 'beat down effect' limitation

   Aggregate-a (ingress-aggregate-rate, IAR, 1 unit) takes a 'short'
   route through two routers, one of which (R1) is above its
   configured-pre-emption-rate (CPR, 2 units).  Aggregate-b (IAR of 3
   units) takes a 'long' route, going through a second congested router
   (R3, with a CPR of 1 unit).

   R1's input traffic is 4 units, twice its
   configured-pre-emption-rate, so 50% of packets are pre-emption
   marked.  Hence the measured sustainable-aggregate-rate (SAR) for
   aggregate-a is 0.5, and half of its traffic will be pre-empted.

   R3's input of non-pre-emption-marked traffic is 1.5 units, and
   therefore it has to do further marking.

   But this means that aggregate-a has taken a bigger hit than it
   needed to; the router R1 could have let through all of aggregate-a's
   traffic unmarked if it had known that the second congested router,
   R3, was going to "beat down" aggregate-b's traffic further.

   Generalising, the result is that in a scenario where more than one
   router is pre-emption marking packets, only the final router is sure
   to be fully loaded after flow pre-emption.  The fundamental reason
   is that a router makes a local decision about which packets to
   pre-emption mark, i.e. independently of how other routers are
   pre-emption marking.  A very similar effect has been noted in XCP
   [Low].

   Potential solutions:

   o  a full solution would involve routers learning about other
      routers that are pre-emption marking, and being able to
      differentially mark flows (e.g. in the example above,
      aggregate-a's packets wouldn't be marked by R1).  This seems hard
      and complex.

   o  do nothing about this limitation.  It causes over-pre-emption,
      which is safe.  At the moment this is our suggested option.

   o  do pre-emption 'slowly'.  The description earlier in this
      document assumes that after the measurements of
      ingress-aggregate-rate and sustainable-aggregate-rate, sufficient
      flows are pre-empted in 'one shot' to eliminate the excess
      traffic.  An alternative is to spread pre-emption over several
      rounds: initially, only pre-empt enough to eliminate some of the
      excess traffic, then re-measure the sustainable-aggregate-rate,
      and then pre-empt some more, etc.  In the scenario above, the
      re-measurement would be lower than expected, due to the beat down
      effect, and hence in the second round of pre-emption less of
      aggregate-a's traffic would be pre-empted (perhaps none).
      Overall, therefore, the impact of the 'beat down' effect would be
      lessened, i.e. there would be a smaller degree of
      over-pre-emption.  The downside is that the overall pre-emption
      is slower, and therefore routers will be congested longer.

5.3.  Bi-directional sessions

   Earlier, this document describes how to decide whether or not to
   admit (or pre-empt) a particular flow.  However, from a
   user/application perspective, the session is the relevant unit of
   granularity.  A session can consist of several flows which may not
   all be part of the same aggregate.  The most obvious example is a
   bi-directional session, where the two flows should ideally be
   admitted or pre-empted as a pair - for instance a voice call only
   makes sense if A can send to B as well as B to A!  But the admission
   and pre-emption mechanisms described earlier in this document
   operate on a per-aggregate basis, independently of what's happening
   with other aggregates.  For admission control the problem isn't
   serious: e.g.
   the SIP server for the voice call can easily detect that the A-to-B
   flow has been admitted but the B-to-A flow blocked, and inform the
   user, perhaps via a busy tone.  For flow pre-emption, the problem is
   similar but more serious.  If both the aggregate-1-to-2 (i.e. from
   gateway 1 to gateway 2) and the aggregate-2-to-1 have to pre-empt
   flows, then it would be good if either all of the flows of a
   particular session were pre-empted or none of them.  Therefore if
   the two aggregates pre-empt flows independently of each other, more
   sessions will end up being torn down than is really necessary.  For
   instance, pre-empting one direction of a voice call will result in
   the SIP server tearing down the other direction anyway.

   Potential solutions:

   o  if it's known that all sessions are bi-directional, simply
      pre-empt roughly half as many flows as suggested by the
      measurements of {ingress-aggregate-rate -
      sustainable-aggregate-rate}.  But this makes a big assumption
      about the nature of sessions, and also assumes that the
      aggregate-1-to-2 and aggregate-2-to-1 are equally overloaded.

   o  ignore the limitation.  The penalty will be quite small if most
      sessions consist of one flow or of flows that are part of the
      same aggregate.

   o  introduce a gateway controller.  It would receive reports for all
      aggregates where the ingress-aggregate-rate exceeds the
      sustainable-aggregate-rate.  It would then make a global decision
      about which flows to pre-empt.  However, it requires quite some
      complexity; for example, the controller needs to understand which
      flows map to which sessions.  This may be an option in some
      scenarios, for example where gateways aren't handling too many
      flows (but note that this breaks the aggregation assumption of
      Section 2.2).
      A variant of this idea would be to introduce a gateway controller
      per pair of gateways, in order to handle bi-directional sessions
      but not try to deal with more complex sessions that include flows
      from an arbitrary number of aggregates.

   o  do pre-emption 'slowly'.  As in the corresponding "beat down"
      solution (Section 5.2), this would reduce the impact of this
      limitation.  The downside is that the overall pre-emption is
      slower, and therefore router(s) will be congested longer.

   o  each ingress gateway 'loosely coordinates' with other gateways
      its decision about which specific flows to pre-empt.  Each
      gateway numbers flows in the order they arrive (note that this
      number has no meaning outside the gateway), and when pre-empting
      flows, the most recent (or most recent low priority flow) is
      selected for pre-emption; the gateway then works backwards
      selecting as many flows as needed.  Gateways will therefore tend
      to pre-empt flows that are part of the same session (as they were
      admitted at the same time).  Of course this isn't guaranteed, for
      several reasons; for instance gateway A's most recent
      bi-directional sessions may be with gateway C, whereas gateway
      B's are with gateway A (so gateway A will pre-empt A-to-C flows
      and gateway B will pre-empt B-to-A flows).  Rather than
      pre-empting the most recent (low priority) flow, an alternative
      algorithm (for further study) may be to select flows based on a
      hash of particular fields in the packet, such that both gateways
      produce the same hash for flows of the same bi-directional
      session.  We believe that this approach should be investigated
      further.

5.4.  Global fairness

   The limitation here is that 'high priority' traffic may be
   pre-empted (or not admitted) when a global decision would instead
   pre-empt (or not admit) 'lower priority' traffic on a different
   aggregate.
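
   The gap between a local and a global decision can be shown with a
   small numerical sketch.  It is an illustration under stated
   assumptions, not a proposed mechanism: local_outcome models every
   aggregate being cut by the same proportion at an overloaded router,
   while global_outcome models a hypothetical coordinator that sheds
   the lowest-priority aggregate first.  The function names and the
   numeric priority values are invented for this example; the rates are
   those of the Figure 5 scenario that follows.

```python
def local_outcome(ingress_rates, cpr):
    """Each aggregate is cut by the same proportion at the router."""
    total = sum(ingress_rates.values())
    keep = min(1.0, cpr / total)
    return {name: rate * keep for name, rate in ingress_rates.items()}


def global_outcome(ingress_rates, priority, cpr):
    """A hypothetical coordinator fills capacity highest priority first."""
    remaining = cpr
    kept = {}
    ordered = sorted(ingress_rates, key=lambda n: priority[n], reverse=True)
    for name in ordered:
        kept[name] = min(ingress_rates[name], remaining)
        remaining -= kept[name]
    return kept


ingress_rates = {"aggregate_a": 1.0, "aggregate_b": 1.0}   # IAR_a, IAR_b
priority = {"aggregate_a": 2, "aggregate_b": 1}            # larger = higher
cpr = 1.0                                                  # router's CPR

local = local_outcome(ingress_rates, cpr)
# {'aggregate_a': 0.5, 'aggregate_b': 0.5}

coordinated = global_outcome(ingress_rates, priority, cpr)
# {'aggregate_a': 1.0, 'aggregate_b': 0.0}
```

   The local result loses half of the high priority aggregate, whereas
   the coordinated result carries all of it at the expense of the
   ordinary traffic.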

   Imagine the following scenario (extreme to illustrate the point
   clearly).  Aggregate_a is all Assured Services (MLPP) traffic,
   whilst aggregate_b is all ordinary traffic (i.e. comparatively low
   priority).  Together the two aggregates cause a router to be at
   twice its configured-pre-emption-rate.  Ideally we'd like all of
   aggregate_b to be pre-empted, as then all of aggregate_a could be
   carried.  However, the approach described earlier in this document
   leads to half of each aggregate being pre-empted.

                       IAR_b=1
                          v
                          v
                      +-------+
   IAR_a=1 ---->-----| CPR=1 |-----> SAR_a=0.5
                      |       |
                      +-------+
                          v
                          v
                      SAR_b=0.5

      Figure 5: Scenario to illustrate 'global fairness' limitation

   Similarly, for admission control - Section 4.1 describes how, if the
   Congestion-Level-Estimate is greater than the CLE-threshold, all new
   sessions are refused.  But it is unsatisfactory to block emergency
   calls, for instance.

   Potential solutions:

   o  in the admission control case, it is recommended that an
      'emergency / Assured Services' call is admitted immediately even
      if the CLE-threshold is exceeded.  Usually the network can
      actually handle the additional microflow, because there is a
      safety margin between the configured-admission-rate and the
      configured-pre-emption-rate.  Normal call termination behaviour
      will soon bring the traffic level down below the
      configured-admission-rate.  However, in exceptional circumstances
      the 'emergency / higher precedence' call may cause the traffic
      level to exceed the configured-pre-emption-rate; then the usual
      pre-emption mechanism will pre-empt enough (non 'emergency /
      higher precedence') microflows to bring the total traffic back
      under the configured-pre-emption-rate.

   o  all egress gateways report to a global coordinator that makes
      decisions about what flows to pre-empt.
      This solution adds complexity and probably isn't scalable, but it
      may be an option in some scenarios, for example where gateways
      aren't handling too many flows (but note that this breaks the
      aggregation assumption of Section 2.2).

   o  introduce a heuristic rule: before pre-empting a 'high priority'
      flow the egress gateway should wait to see if sufficient (lower
      priority) traffic is pre-empted on other aggregates.  This is a
      reasonable option.

   o  enhance the functionality of all the interior routers, so they
      can detect the priority of packets and then differentially mark
      them.  As well as adding complexity, in general this would be an
      unacceptable security risk for MLPP traffic, since only
      controlled nodes (like gateways) should know which packets are
      high priority, as this information can be abused by an attacker.

   o  do nothing, i.e. accept the limitation.  Whilst it's unlikely
      that high priority calls will be quite so unbalanced as in the
      scenario above, just accepting this limitation may be risky.  The
      sorts of situations that cause routers to start pre-emption
      marking are also likely to cause a surge of emergency / MLPP
      calls.

5.5.  Flash crowds

   This limitation concerns admission control and arises because there
   is a time lag between the admission control decision (which depends
   on the Congestion-Level-Estimate at the time of RSVP signalling
   during call set-up) and when the data is actually sent (after the
   called party has answered).  In PSTN terms this is the time the
   phone rings.  Normally the time lag doesn't matter much because (1)
   in the CL-region there are many flows and they terminate and are
   answered at roughly the same rate, and (2) the network can still
   operate safely when the traffic level is some margin above the
   configured-admission-rate.

   A 'flash crowd' occurs when something causes many calls to be
   initiated in a short period of time - for instance a 'tele-vote'.
   So there is a danger that a 'flash' of calls is accepted, but when
   the calls are answered and data flows the traffic overloads the
   network.  Therefore potentially the 'additional load' assumption of
   Section 2.2 doesn't hold.

   Potential solutions:

   o  The simplest option is to do nothing; an operator relies on the
      pre-emption mechanism if there is a problem.  This doesn't seem a
      good choice, as 'flash crowds' are reasonably common on the PSTN,
      unless the operator can ensure that nearly all 'flash crowd'
      events are blocked in the access network and so do not impact on
      the CL-region.

   o  A second option is to send 'dummy data' as soon as the call is
      admitted, thus effectively reserving the bandwidth whilst waiting
      for the called party to answer.  Reserving bandwidth in advance
      means that the network cannot admit as many calls.  For example,
      if sessions last 100 seconds and ringing lasts 10 seconds, the
      cost is a 10% loss of capacity.  It may be possible to offset
      this somewhat by increasing the configured-admission-rate in the
      routers, but it would need further investigation.  A concern with
      this 'dummy data' option is that it may allow an attacker to
      initiate many calls that are never answered (by a cooperating
      attacker), so eventually the network would only be carrying
      'dummy data'.  The attack exploits the fact that charging only
      starts when the call is answered and not when it is dialled.  It
      may be possible to alleviate the attack at the session layer -
      for example, when the ingress gateway gets an RSVP PATH message
      it could check that the source has been well-behaved recently,
      and the maximum time that ringing can last could be limited.  We
      believe that if this attack can be dealt with then this is a good
      option.

   o  A third option is that the egress gateway limits the rate at
      which it sends out the Congestion-Level-Estimate, or limits the
      rate at which calls are accepted by replying with a
      Congestion-Level-Estimate of 100% (this is the equivalent of
      'call gapping' in the PSTN).  There is a trade-off, which would
      need to be investigated further, between the degree of protection
      and possible adverse side-effects like slowing down call set-up.

   o  A final option is to re-perform admission control before the call
      is answered.  The ingress gateway monitors
      Congestion-Level-Estimate updates received from each egress.  If
      it notices that a Congestion-Level-Estimate has risen above the
      CLE-threshold, then it terminates all unanswered calls through
      that egress (e.g. by instructing the session protocol to stop the
      'ringing tone').  For extra safety the Congestion-Level-Estimate
      could be re-checked when the call is answered.  A potential
      drawback for an operator that wants to emulate the PSTN is that
      the PSTN network never drops a 'ringing' PSTN call.

5.6.  Pre-empting too fast

   As a general idea it seems good to pre-empt excess flows rapidly, so
   that the full QoS is restored to the remaining CL users as soon as
   possible, and partial service is restored to lower priority traffic
   classes on shared links.  Therefore the pre-emption mechanism
   described earlier in this document works in 'one shot', i.e. one
   measurement is made of the sustainable-aggregate-rate and the
   ingress-aggregate-rate, and the excess is pre-empted immediately.
   However, there are some reasons why an operator may potentially want
   to pre-empt 'more slowly':

   o  To allow time to modify the ingress gateway's policer, as the
      ingress wants to be able to drop any packets that arrive from a
      pre-empted flow.
      There will be a limit on how many new filters an ingress gateway
      can install in a certain time period, and without such a filter
      the source may cheat and ignore the instruction to drop its flow.

   o  The operator may decide to slow down pre-emption in order to
      ameliorate the 'beat down' and/or 'bi-directional sessions'
      limitations (see above).

   o  To help combat inaccuracies in measurements of the
      sustainable-aggregate-rate and ingress-aggregate-rate.  For a
      CL-region where it's assumed there are many flows in an aggregate
      these measurements can be obtained in a short period of time, but
      where there are fewer flows it will take longer.

   o  To help combat over-pre-emption because, during the time it takes
      to pre-empt flows, others may be ending anyway (either the call
      has naturally ended, or the user hangs up due to poor QoS).
      Slowing pre-emption may seem counter-intuitive here, as it makes
      it more likely that calls will terminate anyway - however it also
      gives time to adjust the amount pre-empted to take account of
      this.

   o  Earlier in this document we said that an egress starts measuring
      the sustainable-aggregate-rate as soon as it sees a single
      pre-emption marked packet.  However, when a link or router fails
      the network's underlying recovery mechanism will kick in (e.g.
      switching to a back up path), which may result in the network
      again being able to support all the traffic, in which case
      immediate pre-emption would have been unnecessary.

   Potential solutions:

   o  To combat the final issue, the egress could measure the
      sustainable-aggregate-rate over a longer time period than the
      network recovery time (say 100ms vs. 50ms).  If it detects no
      pre-emption marked packets towards the end of its measurement
      period (say in the last 30 ms) then it doesn't send a pre-emption
      alert message to the ingress.

   o  We suggest that optionally (the choice of the operator)
      pre-emption is slowed by pre-empting traffic in several rounds
      rather than in one shot.  One possible algorithm is to pre-empt
      most of the traffic in the first round and the rest in the second
      round; the amount pre-empted in the second round is influenced by
      both the first and second round measurements:

      *  Round 1: pre-empt h * S_1
            where 0.5 <= h <= 1
            where S_1 is the amount the normal mechanism calculates
            that it should shed, i.e. {ingress-aggregate-rate -
            sustainable-aggregate-rate}

      *  Round 2: pre-empt
            Predicted-S_2 - h * (Predicted-S_2 - Measured-S_2)
            where Predicted-S_2 = (1-h)*S_1

      Note that the second measurement should be made when sufficient
      time has elapsed for the first round of pre-emption to have
      happened.  One idea to achieve this is for the egress gateway to
      continuously measure and report its sustainable-aggregate-rate,
      in (say) 100ms windows.  Therefore the ingress gateway knows when
      the egress gateway made its measurement (assuming the round trip
      time is known).  Therefore the ingress gateway knows when
      measurements should reflect that it has pre-empted flows.

5.7.  Other potential extensions

   In this section we discuss some other potential extensions not
   already covered above.

5.7.1.  Tunnelling

   It is possible to tunnel all CL packets across the CL-region.
   Although there is a cost of tunnelling (additional header on each
   packet, additional processing at tunnel ingress and egress), there
   are three reasons it may be interesting.

   ECMP:

   Tunnelling is one of the possible solutions given earlier in Section
   5.1 on Equal Cost Multipath Routing (ECMP).

   Ingress gateway determination:

   If packets are tunnelled from ingress gateway to egress gateway, the
   egress gateway can very easily determine in the data path which
   ingress gateway a packet comes from (by simply looking at the source
   address of the tunnel header).  This can facilitate operations such
   as computing the Congestion-Level-Estimate on a per ingress gateway
   basis.

   End-to-end ECN:

   The ECN field is used for PCN marking (see [PCN] for details), and
   so it needs to be re-set by the egress gateway to whatever has been
   agreed as appropriate for the next domain.  Therefore if a packet
   arrives at the ingress gateway with its ECN field already set (i.e.
   not '00'), it may leave the egress gateway with a different value.
   Hence the end-to-end meaning of the ECN field is lost.

   It is open to debate whether end-to-end congestion control is ever
   necessary within an end-to-end reservation.  But if a genuine need
   is identified for end-to-end ECN semantics within a reservation,
   then one solution is to tunnel CL packets across the CL-region.
   When the egress gateway decapsulates them the original ECN field is
   recovered.

5.7.2.  Multi-domain and multi-operator usage

   This potential extension would eliminate the trust assumption
   (Section 2.2), so that the CL-region could consist of multiple
   domains run by different operators that did not trust each other.
   Then only the ingress and egress gateways of the CL-region would
   take part in the admission control procedure, i.e. at the ingress to
   the first domain and the egress from the final domain.  The border
   routers between operators within the CL-region would only have to do
   bulk accounting - they wouldn't do per microflow metering and
   policing, and they wouldn't take part in signal processing or hold
   per flow state [Briscoe].
[Re-feedback] explains how a downstream domain can 1741 police that its upstream domain does not 'cheat' by admitting traffic 1742 when the downstream path is congested. [Re-PCN] proposes how to 1743 achieve this with the help of another recently proposed extension to 1744 ECN, involving re-echoing ECN feedback [Re-ECN]. 1746 5.7.3. Preferential dropping of pre-emption marked packets 1748 When the rate of real-time traffic in the specified class exceeds the 1749 maximum configured rate, then a router has to drop some packet(s) 1750 instead of forwarding them on the out-going link. Now when the egress 1751 gateway measures the Sustainable-Aggregate-Rate, neither dropped 1752 packets nor pre-emption marked packets contribute to it. Dropping 1753 non-pre-emption-marked packets therefore reduces the measured 1754 Sustainable-Aggregate-Rate below its true value. Thus a router should 1755 preferentially drop pre-emption marked packets. 1757 Note that it is important that the operator doesn't set the 1758 configured-pre-emption-rate equal to the rate at which packets start 1759 being dropped (for the specified real-time service class). Otherwise 1760 the egress gateway may never see a pre-emption marked packet and so 1761 won't be triggered into the Pre-emption Alert state. 1763 This optimisation is optional. When considering whether to use it an 1764 operator will consider issues such as whether the over-pre-emption is 1765 serious, and whether the particular routers can easily do this sort 1766 of selective drop. 1768 5.7.4. Adaptive bandwidth for the Controlled Load service 1770 The admission control mechanism described in this document assumes 1771 that each router has a fixed bandwidth allocated to CL flows. A 1772 possible extension is that the bandwidth is flexible, depending on 1773 the level of non-CL traffic. If a large share of the current load on 1774 a path is CL, then more CL traffic can be admitted. 
And if the 1775 greater share of the load is non-CL, then the admission threshold can 1776 be proportionately lower. The approach re-arranges sharing between 1777 classes to aim for economic efficiency, whatever the traffic load 1778 matrix. It also deals with unforeseen changes to capacity during 1779 failures better than configuring fixed engineered rates. Adaptive 1780 bandwidth allocation can be achieved by changing the admission 1781 marking behaviour, so that the probability of admission marking a 1782 packet would now depend on the number of queued non-CL packets as 1783 well as the size of the virtual queue. The adaptive bandwidth 1784 approach would be supplemented by placing limits on the adaptation to 1785 prevent starvation of the CL by other traffic classes and of other 1786 classes by CL traffic. [Songhurst] has more details of the adaptive 1787 bandwidth approach. 1789 5.7.5. Controlled Load service with end-to-end Pre-Congestion 1790 Notification 1792 It may be possible to extend the framework to parts of the network 1793 where there are only a low number of CL microflows, i.e. the 1794 aggregation assumption (Section 2.2) doesn't hold. In the extreme it 1795 may be possible to operate the framework end-to-end, i.e. between end 1796 hosts. One potential method is to send probe packets to test whether 1797 the network can support a prospective new CL microflow. The probe 1798 packets would be sent at the same traffic rate as expected for the 1799 actual microflow, but in order not to disturb existing CL traffic a 1800 router would always schedule probe packets behind CL ones (compare 1801 [Breslau00]); this implies they have a new DSCP. Otherwise the 1802 routers would treat probe packets identically to CL packets. 
In order 1803 to perform admission control quickly, in parts of the network where 1804 there are only a few CL microflows, the algorithm for Admission 1805 Marking described in [PCN] would need to "switch on" very rapidly, i.e. 1806 go from marking no packets to marking them all for only a minimal 1807 increase in the size of the virtual queue. 1809 5.7.6. MPLS-TE 1811 [ECN-MPLS] discusses how to extend the deployment model to MPLS, i.e. 1812 for admission control of microflows into a set of MPLS-TE aggregates 1813 (Multi-protocol label switching traffic engineering). It would 1814 require that the MPLS header could include the ECN field, which is 1815 not precluded by RFC3270. See [ECN-MPLS]. 1817 6. Relationship to other QoS mechanisms 1819 6.1. IntServ Controlled Load 1821 The CL mechanism delivers QoS similar to Integrated Services 1822 controlled load, but rather better. The reason the QoS is better is 1823 that the CL mechanism keeps the real queues empty, by driving 1824 admission control from a bulk virtual queue on each interface. The 1825 virtual queue [AVQ, vq] can detect a rise in load before the real 1826 queue builds. It is also more robust to route changes. 1828 6.2. Integrated services operation over DiffServ 1830 Our approach to end-to-end QoS is similar to that described in 1831 [RFC2998] for Integrated services operation over DiffServ networks. 1832 As in [RFC2998], an IntServ class (CL in our case) is achieved end-to- 1833 end, with a CL-region viewed as a single reservation hop in the total 1834 end-to-end path. Interior routers of the CL-region do not process 1835 flow signalling nor do they hold per flow state. Unlike [RFC2998] we 1836 do not require the end-to-end signalling mechanism to be RSVP, 1837 although it can be. 1839 Bearing in mind these differences, we can describe our architecture 1840 in the terms of the options in [RFC2998].
The DiffServ network region 1841 is RSVP-aware, but awareness is confined to (what [RFC2998] calls) 1842 the "border routers" of the DiffServ region. We use explicit 1843 admission control into this region, with static provisioning within 1844 it. The ingress "border router" does per microflow policing and sets 1845 the DSCP and ECN fields to indicate the packets are CL ones (i.e. we 1846 use router marking rather than host marking). 1848 6.3. Differentiated Services 1850 The DiffServ architecture does not specify any way for devices 1851 outside the domain to dynamically reserve resources or receive 1852 indications of network resource availability. In practice, service 1853 providers rely on subscription-time Service Level Agreements (SLAs) 1854 that statically define the parameters of the traffic that will be 1855 accepted from a customer. The CL mechanism allows dynamic reservation 1856 of resources through the DiffServ domain and, with the potential 1857 extension mentioned in Section 5.7.2, it can span multiple domains 1858 without active policing mechanisms at the borders (unlike DiffServ). 1859 Therefore we do not use the traffic conditioning agreements (TCAs) of 1860 the (informational) DiffServ architecture [RFC2475]. 1862 An important benefit arises from the fact that the load is controlled 1863 dynamically rather than with traffic conditioning agreements (TCAs). 1865 TCAs were originally introduced in the (informational) DiffServ 1866 architecture [RFC2475] as an alternative to reservation processing in 1867 the interior region in order to reduce the burden on interior 1868 routers. With TCAs, in practice service providers rely on 1869 subscription-time Service Level Agreements that statically define the 1870 parameters of the traffic that will be accepted from a customer. The 1871 problem arises because the TCA at the ingress must allow any 1872 destination address, if it is to remain scalable. 
But for longer 1873 topologies, the chances increase that traffic will focus on an 1874 interior resource, even though it is within contract at the ingress 1875 [Reid], e.g. all flows converge on the same egress gateway. Even 1876 though networks can be engineered to make such failures rare, when 1877 they occur all inelastic flows through the congested resource fail 1878 catastrophically. 1880 [Johnson] compares admission control with a 'generously dimensioned' 1881 DiffServ network as ways to achieve QoS. The former is recommended. 1883 6.4. ECN 1885 The marking behaviour described in this document complies with the 1886 ECN aspects of the IP wire protocol RFC3168, but provides its own 1887 edge-to-edge feedback instead of the TCP aspects of RFC3168. All 1888 routers within the CL-region are upgraded with the admission marking 1889 and pre-emption marking of Pre-Congestion Notification, so the 1890 requirements of [Floyd] are met because the CL-region is an enclosed 1891 environment. The operator prevents traffic arriving at a router that 1892 doesn't understand CL by administrative configuration of the ring of 1893 gateways around the CL-region. 1895 6.5. RTECN 1897 Real-time ECN (RTECN) [RTECN, RTECN-usage] has a similar aim to this 1898 document (to achieve a low delay, jitter and loss service suitable 1899 for RT traffic) and a similar approach (per microflow admission 1900 control combined with an "early warning" of potential congestion 1901 through setting the CE codepoint). But it explores a different 1902 architecture without the aggregation assumption: host-to-host rather 1903 than edge-to-edge. We plan to document such a host-to-host framework 1904 in a parallel draft to this one, and to describe if and how [PCN] can 1905 work in this framework. 1907 6.6. 
RMD 1909 Resource Management in DiffServ (RMD) [RMD] is similar to this work, 1910 in that it pushes complex classification, traffic conditioning and 1911 admission control functions to the edge of a DiffServ domain and 1912 simplifies the operation of the interior routers. One of the RMD 1913 modes ("Congestion notification function based on probing") uses 1914 measurement-based admission control in a similar way to this 1915 document. The main difference is that in RMD probing plays a 1916 significant role in the admission control process. Other differences 1917 are that the admission control decision is taken on the egress 1918 gateway (rather than the ingress); 'admission marking' is encoded in 1919 a packet as a new DSCP (rather than in the ECN field), and that the 1920 NSIS protocols are used for signalling (rather than RSVP). 1922 RMD also includes the concept of Severe Congestion handling. The pre- 1923 emption mechanism described in the CL architecture has similar 1924 objectives but relies on different mechanisms. The main difference is 1925 that the interior routers measure the data rate that causes an 1926 overload and mark packets according to this rate. 1928 6.7. RSVP Aggregation over MPLS-TE 1930 Multi-protocol label switching traffic engineering (MPLS-TE) allows 1931 scalable reservation of resources in the core for an aggregate of 1932 many microflows. To achieve end-to-end reservations, admission 1933 control and policing of microflows into the aggregate can be achieved 1934 using techniques such as RSVP Aggregation over MPLS TE Tunnels as per 1935 [AGGRE-TE]. However, in the case of inter-provider environments, 1936 these techniques require that admission control and policing be 1937 repeated at each trust boundary or that MPLS TE tunnels span multiple 1938 domains. 1940 6.8. 
Other Network Admission Control Approaches 1942 Link admission control (LAC) describes how admission control (AC) can 1943 be done on a single link and comprises, e.g., the calculation of 1944 effective bandwidths which may be the basis for a parameter-based AC. 1945 In contrast, network AC (NAC) describes how AC can be done for a 1946 network and focuses on the locations from which data is gathered for 1947 the admission decision. Most approaches implement a link budget based 1948 NAC (LB NAC) where each link has a certain AC-budget. RSVP works 1949 according to that principle, and the new concept likewise admits 1950 additional flows as long as each link on the new flow's path still 1951 has resources available. The border-to-border budget based NAC (BBB 1952 NAC) pre-configures an AC budget for all border-to-border 1953 relationships (= CL-region-aggregates) and if this capacity budget is 1954 exhausted, new flows are rejected. The TCA-based admission control 1955 which is associated with the DiffServ architecture implements an 1956 ingress budget based NAC (IB NAC). These fundamentally different concepts 1957 differ in flexibility and efficiency with regard to the use of 1958 link bandwidths [NAC-a,NAC-b]. They can be made resilient by choosing 1959 the budgets in such a way that the network will not be congested 1960 after rerouting due to a failure. The efficiency of the approaches is 1961 different with and without such resilience requirements. 1963 7. Security Considerations 1965 To protect against denial of service attacks, the ingress gateway of 1966 the CL-region needs to police all CL packets and drop packets in 1967 excess of the reservation. This is similar to operations with 1968 existing IntServ behaviour. 1970 For pre-emption, it is considered acceptable from a security 1971 perspective that the ingress gateway can treat "emergency/military" 1972 CL flows preferentially compared with "ordinary" CL flows.
However, 1973 in the rest of the CL-region they are not distinguished (nonetheless, 1974 our proposed technique does not preclude the use of different DSCPs 1975 at the packet level as well as different priorities at the flow 1976 level.). Keeping emergency traffic indistinguishable at the packet 1977 level minimises the opportunity for new security attacks. For 1978 example, if instead a mechanism used different DSCPs for 1979 "emergency/military" and "ordinary" packets, then an attacker could 1980 specifically target the former in the data plane (perhaps for DoS or 1981 for eavesdropping). 1983 Further security aspects to be considered later. 1985 8. Acknowledgements 1987 The admission control mechanism evolved from the work led by Martin 1988 Karsten on the Guaranteed Stream Provider developed in the M3I 1989 project [GSPa, GSP-TR], which in turn was based on the theoretical 1990 work of Gibbens and Kelly [DCAC]. Kennedy Cheng, Gabriele Corliano, 1991 Carla Di Cairano-Gilfedder, Kashaf Khan, Peter Hovell, Arnaud Jacquet 1992 and June Tay (BT) helped develop and evaluate this approach. 1994 Many thanks to those who have commented on this work at Transport 1995 Area Working Group meetings and on the mailing list, including: Ken 1996 Carlberg, Ruediger Geib, Lars Westberg, David Black, Robert Hancock, 1997 Cornelia Kappler, Michael Menth. 1999 9. Comments solicited 2001 Comments and questions are encouraged and very welcome. They can be 2002 sent to the Transport Area Working Group's mailing list, 2003 tsvwg@ietf.org, and/or to the authors. 2005 10. Changes from earlier versions of the draft 2007 The main changes are: 2009 From -00 to -01 2011 The whole of the Pre-emption mechanism is added. 2013 There are several modifications to the admission control mechanism. 2015 From -01 to -02 2017 The pre-congestion notification algorithms for admission marking and 2018 pre-emption marking are now described in [PCN]. 
2020 There are new sub-sections in Section 4 on Failures, Admission of 2021 'emergency / higher precedence' session, and Tunnelling; and a new 2022 sub-section in Section 5 on Mechanisms to deal with 'Flash crowds'. 2024 From -02 to -03 2026 Section 5 has been updated and expanded. It is now about the 2027 'limitations' of the PCN mechanism, as described in the earlier 2028 sections, plus discussion of 'possible solutions' to those 2029 limitations. 2031 The measurement of the Congestion-Level-Estimate now includes pre- 2032 emption marked packets as well as admission marked ones. Section 2033 3.1.2 explains. 2035 From -03 to -04 2037 Detailed review by Michael Menth. In response, Abstract, Summary and 2038 Key benefits sections re-written. Numerous detailed comments on 2039 Sections 5 and following sections. 2041 11. Appendices 2043 11.1. Appendix A: Explicit Congestion Notification 2045 This Appendix provides a brief summary of Explicit Congestion 2046 Notification (ECN). 2048 [RFC3168] specifies the incorporation of ECN to TCP and IP, including 2049 ECN's use of two bits in the IP header. It specifies a method for 2050 indicating incipient congestion to end-hosts (e.g. as in RED, Random 2051 Early Detection), where the notification is through ECN marking 2052 packets rather than dropping them. 2054 ECN uses two bits in the IP header of both IPv4 and IPv6 packets: 2056 0 1 2 3 4 5 6 7 2057 +-----+-----+-----+-----+-----+-----+-----+-----+ 2058 | DS FIELD, DSCP | ECN FIELD | 2059 +-----+-----+-----+-----+-----+-----+-----+-----+ 2061 DSCP: differentiated services codepoint 2062 ECN: Explicit Congestion Notification 2064 Figure A.1: The Differentiated Services and ECN Fields in IP. 2066 The two bits of the ECN field have four ECN codepoints, '00' to '11': 2067 +-----+-----+ 2068 | ECN FIELD | 2069 +-----+-----+ 2070 ECT CE 2071 0 0 Not-ECT 2072 0 1 ECT(1) 2073 1 0 ECT(0) 2074 1 1 CE 2076 Figure A.2: The ECN Field in IP. 
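The layout in Figures A.1 and A.2 can be illustrated with a minimal sketch that splits the octet into its DSCP and ECN parts. The names EcnCodepoint and split_ds_ecn are hypothetical and are introduced here only for illustration; the sketch assumes the octet has already been read from the IPv4 or IPv6 header.

```python
from enum import IntEnum


class EcnCodepoint(IntEnum):
    """The four ECN codepoints of Figure A.2 (hypothetical helper type)."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def split_ds_ecn(octet: int) -> tuple[int, EcnCodepoint]:
    """Split the octet of Figure A.1 into (DSCP, ECN codepoint).

    Bits 0-5 (the most significant six bits) carry the DSCP and
    bits 6-7 (the least significant two bits) carry the ECN field.
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("expected a single octet (0..255)")
    dscp = octet >> 2                  # upper six bits
    ecn = EcnCodepoint(octet & 0b11)   # lower two bits
    return dscp, ecn
```

For instance, under these assumptions an octet of 0xBB splits into DSCP 46 and the CE codepoint.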
2078 The not-ECT codepoint '00' indicates a packet that is not using ECN. 2080 The CE codepoint '11' is set by a router to indicate congestion to 2081 the end hosts. The term 'CE packet' denotes a packet that has the CE 2082 codepoint set. 2084 The ECN-Capable Transport (ECT) codepoints '10' and '01' (ECT(0) and 2085 ECT(1) respectively) are set by the data sender to indicate that the 2086 end-points of the transport protocol are ECN-capable. Routers treat 2087 the ECT(0) and ECT(1) codepoints as equivalent. Senders are free to 2088 use either the ECT(0) or the ECT(1) codepoint to indicate ECT, on a 2089 packet-by-packet basis. The motivation for having two codepoints (the 2090 'ECN nonce') is the desire to check two things: for the data sender 2091 to verify that network elements are not erasing the CE codepoint; and 2092 for the data sender to verify that data receivers are properly 2093 reporting to the sender the receipt of packets with the CE codepoint 2094 set. 2096 ECN requires support from the transport protocol, in addition to the 2097 functionality given by the ECN field in the IP packet header. 2098 [RFC3168] addresses the addition of ECN Capability to TCP, specifying 2099 three new pieces of functionality: negotiation between the endpoints 2100 during connection setup to determine if they are both ECN-capable; an 2101 ECN-Echo (ECE) flag in the TCP header so that the data receiver can 2102 inform the data sender when a CE packet has been received; and a 2103 Congestion Window Reduced (CWR) flag in the TCP header so that the 2104 data sender can inform the data receiver that the congestion window 2105 has been reduced. 2107 The transport layer (e.g. TCP) must respond, in terms of congestion 2108 control, to a *single* CE packet as it would to a packet drop.
2110 The advantage of setting the CE codepoint as an indication of 2111 congestion, instead of relying on packet drops, is that it allows the 2112 receiver(s) to receive the packet, thus avoiding the potential for 2113 excessive delays due to retransmissions after packet losses. 2115 11.2. Appendix B: What is distributed measurement-based admission 2116 control? 2118 This Appendix briefly explains what distributed measurement-based 2119 admission control is [Breslau99]. 2121 Traditional admission control algorithms for 'hard' real-time 2122 services (those providing a firm delay bound for example) guarantee 2123 QoS by using 'worst case analysis'. Each time a flow is admitted its 2124 traffic parameters are examined and the network re-calculates the 2125 remaining resources. When the network gets a new request it therefore 2126 knows for certain whether the prospective flow, with its particular 2127 parameters, should be admitted. However, parameter-based admission 2128 control algorithms result in under-utilisation when the traffic is 2129 bursty. Therefore 'soft' real time services - like Controlled Load - 2130 can use a more relaxed admission control algorithm. 2132 This insight suggests measurement-based admission control (MBAC). The 2133 aim of MBAC is to provide a statistical service guarantee. The 2134 classic scenario for MBAC is where each router participates in hop- 2135 by-hop admission control, characterising existing traffic locally 2136 through measurements (instead of keeping an accurate track of traffic 2137 as it is admitted), in order to determine the current value of some 2138 parameter e.g. load. Note that for scalability the measurement is of 2139 the aggregate of the flows in the local system. The measured 2140 parameter(s) is then compared to the requirements of the prospective 2141 flow to see whether it should be admitted. 
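The classic hop-by-hop comparison described in the preceding paragraph can be sketched as follows. This is only an illustration of the general MBAC idea, not the mechanism of this document: the function name, the parameter names and the choice of measured load as the parameter are all assumptions made for the example.

```python
def admit_flow(measured_load_bps: float,
               requested_rate_bps: float,
               admission_threshold_bps: float) -> bool:
    """Hypothetical per-router MBAC check.

    measured_load_bps is the locally measured aggregate load of the
    existing flows; the prospective flow is admitted only if adding
    its requested rate keeps the total at or below the threshold.
    """
    return measured_load_bps + requested_rate_bps <= admission_threshold_bps
```

Note that the router keeps no record of individual admitted flows here; only the measured aggregate enters the decision, which is the property that makes the approach scale.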
2143 MBAC may also be performed centrally for a network, in which case it 2144 uses centralised measurements by a bandwidth broker. 2146 We use distributed MBAC. "Distributed" means that the measurement is 2147 accumulated for the 'whole-path' using in-band signalling. In our 2148 case, this means that the measurement of existing traffic is for the 2149 same pair of ingress and egress gateways as the prospective 2150 microflow. 2152 In fact our mechanism can be said to be distributed in three ways: 2153 all routers on the ingress-egress path affect the Congestion-Level- 2154 Estimate; the admission control decision is made just once on behalf 2155 of all the routers on the path across the CL-region; and the ingress 2156 and egress gateways cooperate to perform MBAC. 2158 11.3. Appendix C: Calculating the Exponentially weighted moving average 2159 (EWMA) 2161 At the egress gateway, for every CL packet arrival: 2163 [EWMA-total-bits]n+1 = (w * bits-in-packet) + ((1-w) * [EWMA- 2164 total-bits]n ) 2166 [EWMA-M-bits]n+1 = (B * w * bits-in-packet) + ((1-w) * [EWMA-M- 2167 bits]n ) 2169 Then, per new flow arrival: 2171 [Congestion-Level-Estimate]n+1 = [EWMA-M-bits]n+1 / [EWMA-total- 2172 bits]n+1 2174 where 2175 EWMA-total-bits is the total number of bits in CL packets, calculated 2176 as an exponentially weighted moving average (EWMA). 2178 EWMA-M-bits is the total number of bits in CL packets that are 2179 Admission Marked or Pre-emption Marked, again calculated as an EWMA. 2181 B is either 0 or 1: 2183 B = 0 if the CL packet is neither admission marked nor pre-emption marked 2185 B = 1 if the CL packet is admission marked or pre-emption marked 2187 w is the exponential weighting factor. 2189 Varying the value of the weight trades off between the smoothness and 2190 responsiveness of the Congestion-Level-Estimate.
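The per-packet and per-flow calculations above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class and method names are hypothetical, starting both averages at zero is an assumption, and so is returning an estimate of zero before any CL packet has been seen. Following the definition of EWMA-M-bits, a packet counts as marked if it is Admission Marked or Pre-emption Marked.

```python
class CongestionLevelEstimator:
    """Hypothetical egress-gateway state for one ingress-egress aggregate."""

    def __init__(self, w: float) -> None:
        if not 0.0 < w <= 1.0:
            raise ValueError("w must satisfy 0 < w <= 1")
        self.w = w
        self.ewma_total_bits = 0.0  # assumed initial value
        self.ewma_m_bits = 0.0      # assumed initial value

    def on_packet(self, bits_in_packet: int, marked: bool) -> None:
        """Per CL packet arrival: update both moving averages."""
        b = 1 if marked else 0
        self.ewma_total_bits = (self.w * bits_in_packet
                                + (1 - self.w) * self.ewma_total_bits)
        self.ewma_m_bits = (b * self.w * bits_in_packet
                            + (1 - self.w) * self.ewma_m_bits)

    def congestion_level_estimate(self) -> float:
        """Per new flow arrival: ratio of marked bits to total bits."""
        if self.ewma_total_bits == 0.0:
            return 0.0  # assumption: no traffic seen yet
        return self.ewma_m_bits / self.ewma_total_bits
```

With w close to 0 the estimate changes slowly from packet to packet; with w close to 1 it follows the most recent packets closely.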
However, in general 2191 both can be achieved, given our original assumption of many CL 2192 microflows and remembering that the EWMA is calculated on the basis 2193 of aggregate traffic between the ingress and egress gateways. 2194 There will be a threshold inter-arrival time between packets of the 2195 same aggregate above which the egress will consider the 2196 Congestion-Level-Estimate as too stale, and it will then trigger 2197 generation of probes by the ingress. 2199 The first two per-packet algorithms can be simplified, if their only 2200 use will be where the result of one is divided by the result of the 2201 other in the third, per-flow algorithm. 2203 [EWMA-total-bits]'n+1 = bits-in-packet + (w' * [EWMA-total- 2204 bits]n ) 2206 [EWMA-M-bits]'n+1 = (B * bits-in-packet) + (w' * [EWMA-M-bits]n 2207 ) 2209 where w' = (1-w)/w. 2211 If w' is arranged to be a power of 2, these per packet algorithms can 2212 be implemented solely with a shift and an add. 2214 There are alternative possibilities for smoothing out the congestion- 2215 level-estimate. For example [TEWMA] deals better with the issue of 2216 stale information when the traffic rate for 2218 12. References 2220 A later version will distinguish normative and informative 2221 references. 2223 [AGGRE-TE] Francois Le Faucheur, Michael Dibiasio, Bruce Davie, 2224 Michael Davenport, Chris Christou, Jerry Ash, Bur 2225 Goode, 'Aggregation of RSVP Reservations over MPLS 2226 TE/DS-TE Tunnels', draft-ietf-tsvwg-rsvp-dste-03 (work 2227 in progress), June 2006 2229 [ANSI.MLPP.Spec] American National Standards Institute, 2230 "Telecommunications - Integrated Services Digital 2231 Network (ISDN) - Multi-Level Precedence and Pre- 2232 emption (MLPP) Service Capability", ANSI T1.619-1992 2233 (R1999), 1992. 2235 [ANSI.MLPP.Supplement] American National Standards Institute, "MLPP 2236 Service Domain Cause Value Changes", ANSI 2237 T1.619a-1994 (R1999), 1990. 2239 [AVQ] S. Kunniyur and R.
Srikant "Analysis and Design of an 2240 Adaptive Virtual Queue (AVQ) Algorithm for Active 2241 Queue Management", In: Proc. ACM SIGCOMM'01, Computer 2242 Communication Review 31 (4) (October, 2001). 2244 [Breslau99] L. Breslau, S. Jamin, S. Shenker "Measurement-based 2245 admission control: what is the research agenda?", In: 2246 Proc. Int'l Workshop on Quality of Service 1999. 2248 [Breslau00] L. Breslau, E. Knightly, S. Shenker, I. Stoica, H. 2249 Zhang "Endpoint Admission Control: Architectural 2250 Issues and Performance", In: ACM SIGCOMM 2000 2252 [Briscoe] Bob Briscoe and Steve Rudkin, "Commercial Models for 2253 IP Quality of Service Interconnect", BT Technology 2254 Journal, Vol 23 No 2, April 2005. 2256 [DCAC] Richard J. Gibbens and Frank P. Kelly "Distributed 2257 connection acceptance control for a connectionless 2258 network", In: Proc. International Teletraffic Congress 2259 (ITC16), Edinburgh, pp. 941-952 (1999). 2261 [ECN-MPLS] Bruce Davie, Bob Briscoe, June Tay, "Explicit 2262 Congestion Marking in MPLS", draft- 2263 davie-ecn-mpls-00.txt (work in progress), June 2006 2265 [EMERG-RQTS] Carlberg, K. and R. Atkinson, "General Requirements 2266 for Emergency Telecommunication Service (ETS)", RFC 2267 3689, February 2004. 2269 [EMERG-TEL] Carlberg, K. and R. Atkinson, "IP Telephony 2270 Requirements for Emergency Telecommunication Service 2271 (ETS)", RFC 3690, February 2004. 2273 [Floyd] S.
Floyd, 'Specifying Alternate Semantics for the 2274 Explicit Congestion Notification (ECN) Field', draft- 2275 floyd-ecn-alternates-02.txt (work in progress), August 2276 2005 2278 [GSPa] Karsten (Ed.), Martin "GSP/ECN Technology & 2279 Experiments", Deliverable: 15.3 PtIII, M3I Eu Vth 2280 Framework Project IST-1999-11429, URL: 2281 http://www.m3i.org/ (February, 2002) (superseded by 2282 [GSP-TR]) 2284 [GSP-TR] Martin Karsten and Jens Schmitt, "Admission Control 2285 Based on Packet Marking and Feedback Signalling -- 2286 Mechanisms, Implementation and Experiments", TU- 2287 Darmstadt Technical Report TR-KOM-2002-03, URL: 2288 http://www.kom.e-technik.tu- 2289 darmstadt.de/publications/abstracts/KS02-5.html (May, 2290 2002) 2292 [ITU.MLPP.1990] International Telecommunications Union, "Multilevel 2293 Precedence and Pre-emption Service (MLPP)", ITU-T 2294 Recommendation I.255.3, 1990. 2296 [Johnson] DM Johnson, 'QoS control versus generous 2297 dimensioning', BT Technology Journal, Vol 23 No 2, 2298 April 2005 2300 [LoadBalancing-a] Ruediger Martin, Michael Menth, and Michael 2301 Hemmkeppler: "Accuracy and Dynamics of Hash-Based Load 2302 Balancing Algorithms for Multipath Internet Routing", 2303 IEEE Broadnets, San Jose, CA, USA, October 2006 2304 http://www3.informatik.uni- 2305 wuerzburg.de/~menth/Publications/Menth06p.pdf 2307 [LoadBalancing-b] Ruediger Martin, Michael Menth, and Michael 2308 Hemmkeppler: "Accuracy and Dynamics of Multi-Stage 2309 Load Balancing for Multipath Internet Routing", 2310 currently under submission http://www3.informatik.uni- 2311 wuerzburg.de/~menth/Publications/Menth07-Sub-6.pdf 2313 [Low] S. Low, L. Andrew, B.
Wydrowski, 'Understanding XCP: 2314 equilibrium and fairness', IEEE InfoCom 2005 2316 [NAC-a] Michael Menth: "Efficient Admission Control and 2317 Routing in Resilient Communication Networks", PhD 2318 thesis, July 2004, http://opus.bibliothek.uni- 2319 wuerzburg.de/opus/volltexte/2004/994/pdf/Menth04.pdf 2321 [NAC-b] Michael Menth, Stefan Kopf, Joachim Charzinski, and 2322 Karl Schrodi: "Resilient Network Admission Control", 2323 currently under submission. 2324 http://www3.informatik.uni- 2325 wuerzburg.de/~menth/Publications/Menth07-Sub-3.pdf 2327 [PCN] B. Briscoe, P. Eardley, D. Songhurst, F. Le Faucheur, 2328 A. Charny, V. Liatsos, S. Dudley, J. Babiarz, K. Chan, 2329 G. Karagiannis, A. Bader, L. Westberg. 'Pre-Congestion 2330 Notification marking', draft-briscoe-tsvwg-cl-phb-02 2331 (work in progress), June 2006. 2333 [Re-ECN] Bob Briscoe, Arnaud Jacquet, Alessandro Salvatori, 2334 'Re-ECN: Adding Accountability for Causing Congestion 2335 to TCP/IP', draft-briscoe-tsvwg-re-ecn-tcp-01 (work in 2336 progress), March 2006. 2338 [Re-feedback] Bob Briscoe, Arnaud Jacquet, Carla Di Cairano- 2339 Gilfedder, Andrea Soppera, 'Re-feedback for Policing 2340 Congestion Response in an Inter-network', ACM SIGCOMM 2341 2005, August 2005. 2343 [Re-PCN] B. Briscoe, 'Emulating Border Flow Policing using Re- 2344 ECN on Bulk Data', draft-briscoe-tsvwg-re-ecn-border- 2345 cheat-00 (work in progress), February 2006. 2347 [Reid] ABD Reid, 'Economics and scalability of QoS 2348 solutions', BT Technology Journal, Vol 23 No 2, April 2349 2005 2351 [RFC2211] J. Wroclawski, Specification of the Controlled-Load 2352 Network Element Service, September 1997 2354 [RFC2309] Braden, B., et al., "Recommendations on Queue 2355 Management and Congestion Avoidance in the Internet", 2356 RFC 2309, April 1998. 2358 [RFC2474] Nichols, K., Blake, S., Baker, F. and D. 
Black, 2359 "Definition of the Differentiated Services Field (DS 2360 Field) in the IPv4 and IPv6 Headers", RFC 2474, 2361 December 1998 2363 [RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, 2364 Z. and W. Weiss, 'An Architecture for Differentiated 2365 Services', RFC 2475, December 1998. 2367 [RFC2597] Heinanen, J., Baker, F., Weiss, W. and J. Wroclawski, 2368 "Assured Forwarding PHB Group", RFC 2597, June 1999. 2370 [RFC2998] Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang, 2371 L., Speer, M., Braden, R., Davie, B., Wroclawski, J. 2372 and E. Felstaine, "A Framework for Integrated Services 2373 Operation Over DiffServ Networks", RFC 2998, November 2374 2000. 2376 [RFC3168] Ramakrishnan, K., Floyd, S. and D. Black "The Addition 2377 of Explicit Congestion Notification (ECN) to IP", RFC 2378 3168, September 2001. 2380 [RFC3246] B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le 2381 Boudec, W. Courtney, S. Davari, V. Firoiu, D. 2382 Stiliadis, 'An Expedited Forwarding PHB (Per-Hop 2383 Behavior)', RFC 3246, March 2002. 2385 [RFC3270] Le Faucheur, F., Wu, L., Davie, B., Davari, S., 2386 Vaananen, P., Krishnan, R., Cheval, P., and J. 2387 Heinanen, "Multi-Protocol Label Switching (MPLS) 2388 Support of Differentiated Services", RFC 3270, May 2389 2002. 2391 [RFC4542] F. Baker & J. Polk, "Implementing an Emergency 2392 Telecommunications Service for Real Time Services in 2393 the Internet Protocol Suite", RFC 4542, May 2006. 2395 [RMD] Attila Bader, Lars Westberg, Georgios Karagiannis, 2396 Cornelia Kappler, Tom Phelan, 'RMD-QOSM - The Resource 2397 Management in DiffServ QoS model', draft-ietf-nsis- 2398 rmd-03 Work in Progress, June 2005. 2400 [RSVP-PCN] Francois Le Faucheur, Anna Charny, Bob Briscoe, Philip 2401 Eardley, Joe Babiarz, Kwok-Ho Chan, 'RSVP Extensions 2402 for Admission Control over DiffServ using Pre- 2403 Congestion Notification (PCN)', draft-lefaucheur-rsvp- 2404 ecn-01 (work in progress), June 2006.
2406 [RSVP-PREEMPTION] Herzog, S., "Signaled Preemption Priority Policy 2407 Element", RFC 3181, October 2001. 2409 [RSVP-EMERGENCY] Le Faucheur et al., RSVP Extensions for Emergency 2410 Services, draft-lefaucheur-emergency-rsvp-02.txt 2412 [RTECN] Babiarz, J., Chan, K. and V. Firoiu, 'Congestion 2413 Notification Process for Real-Time Traffic', draft- 2414 babiarz-tsvwg-rtecn-04 Work in Progress, July 2005. 2416 [RTECN-usage] Alexander, C., Ed., Babiarz, J. and J. Matthews, 2417 'Admission Control Use Case for Real-time ECN', draft- 2418 alexander-rtecn-admission-control-use-case-00, Work in 2419 Progress, February 2005. 2421 [Songhurst] David J. Songhurst, Philip Eardley, Bob Briscoe, Carla 2422 Di Cairano Gilfedder and June Tay, 'Guaranteed QoS 2423 Synthesis for Admission Control with Shared Capacity', 2424 BT Technical Report TR-CXR9-2006-001, Feb 2006, 2425 http://www.cs.ucl.ac.uk/staff/B.Briscoe/projects/ipe2e 2426 qos/gqs/papers/GQS_shared_tr.pdf 2428 [vq] Costas Courcoubetis and Richard Weber "Buffer Overflow 2429 Asymptotics for a Switch Handling Many Traffic 2430 Sources" In: Journal Applied Probability 33 pp. 886-- 2431 903 (1996). 2433 Authors' Addresses 2435 Bob Briscoe 2436 BT Research 2437 B54/77, Sirius House 2438 Adastral Park 2439 Martlesham Heath 2440 Ipswich, Suffolk 2441 IP5 3RE 2442 United Kingdom 2443 Email: bob.briscoe@bt.com 2445 Dave Songhurst 2446 BT Research 2447 B54/69, Sirius House 2448 Adastral Park 2449 Martlesham Heath 2450 Ipswich, Suffolk 2451 IP5 3RE 2452 United Kingdom 2453 Email: dsonghurst@jungle.bt.co.uk 2455 Philip Eardley 2456 BT Research 2457 B54/77, Sirius House 2458 Adastral Park 2459 Martlesham Heath 2460 Ipswich, Suffolk 2461 IP5 3RE 2462 United Kingdom 2463 Email: philip.eardley@bt.com 2465 Francois Le Faucheur 2466 Cisco Systems, Inc. 
2467 Village d'Entreprise Green Side - Batiment T3
2468 400, Avenue de Roumanille
2469 06410 Biot Sophia-Antipolis
2470 France
2471 Email: flefauch@cisco.com

2473 Anna Charny
2474 Cisco Systems
2475 300 Apollo Drive
2476 Chelmsford, MA 01824
2477 USA
2478 Email: acharny@cisco.com

2479 Kwok Ho Chan
2480 Nortel Networks
2481 600 Technology Park Drive
2482 Billerica, MA 01821
2483 USA
2484 Email: khchan@nortel.com

2486 Jozef Z. Babiarz
2487 Nortel Networks
2488 3500 Carling Avenue
2489 Ottawa, Ont K2H 8E9
2490 Canada
2491 Email: babiarz@nortel.com

2493 Stephen Dudley
2494 Nortel Networks
2495 4001 E. Chapel Hill Nelson Highway
2496 P.O. Box 13010, ms 570-01-0V8
2497 Research Triangle Park, NC 27709
2498 USA
2499 Email: smdudley@nortel.com

2501 Georgios Karagiannis
2502 University of Twente
2503 P.O. Box 217
2504 7500 AE Enschede
2505 The Netherlands
2506 Email: g.karagiannis@ewi.utwente.nl

2508 Attila Báder
2509 Email: attila.bader@ericsson.com

2511 Lars Westberg
2512 Ericsson AB
2513 SE-164 80 Stockholm
2514 Sweden
2515 Email: Lars.Westberg@ericsson.com

2517 Intellectual Property Statement

2519 The IETF takes no position regarding the validity or scope of any
2520 Intellectual Property Rights or other rights that might be claimed to
2521 pertain to the implementation or use of the technology described in
2522 this document or the extent to which any license under such rights
2523 might or might not be available; nor does it represent that it has
2524 made any independent effort to identify any such rights. Information
2525 on the procedures with respect to rights in RFC documents can be
2526 found in BCP 78 and BCP 79.

2528 Copies of IPR disclosures made to the IETF Secretariat and any
2529 assurances of licenses to be made available, or the result of an
2530 attempt made to obtain a general license or permission for the use of
2531 such proprietary rights by implementers or users of this
2532 specification can be obtained from the IETF on-line IPR repository at
2533 http://www.ietf.org/ipr.

2535 The IETF invites any interested party to bring to its attention any
2536 copyrights, patents or patent applications, or other proprietary
2537 rights that may cover technology that may be required to implement
2538 this standard. Please address the information to the IETF at
2539 ietf-ipr@ietf.org.

2541 Disclaimer of Validity

2543 This document and the information contained herein are provided on an
2544 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2545 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2546 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2547 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2548 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2549 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2551 Copyright Statement

2553 Copyright (C) The Internet Society (2006).

2555 This document is subject to the rights, licenses and restrictions
2556 contained in BCP 78, and except as set forth therein, the authors
2557 retain all their rights.