TSVWG                                                         B. Briscoe
Internet Draft                                                P. Eardley
draft-briscoe-tsvwg-cl-architecture-03.txt                  D. Songhurst
Expires: December 2006                                                BT

                                                          F. Le Faucheur
                                                               A. Charny
                                                      Cisco Systems, Inc

                                                              J. Babiarz
                                                                 K. Chan
                                                               S. Dudley
                                                                  Nortel

                                                          G. Karagiannis
                                         University of Twente / Ericsson

                                                                A. Bader
                                                             L. Westberg
                                                                Ericsson

                                                           26 June, 2006

    An edge-to-edge Deployment Model for Pre-Congestion Notification:
                Admission Control over a DiffServ Region
                draft-briscoe-tsvwg-cl-architecture-03.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 26, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).  All Rights Reserved.

Abstract

   This document describes a deployment model for pre-congestion
   notification (PCN).  PCN-based flow admission control and, if
   necessary, flow pre-emption preserve the Controlled Load service to
   admitted flows.  Routers in a large DiffServ-based region of the
   Internet use new pre-congestion notification marking to give early
   warning of their own congestion.  Gateways around the edges of the
   region convert measurements of this packet-granularity marking into
   admission control and pre-emption functions at flow granularity.
   Note that interior routers of the DiffServ-based region do not
   require flow state or signalling - they only have to do the bulk
   packet marking of PCN.  Hence an end-to-end Controlled Load service
   can be achieved without any scalability impact on interior routers.

Authors' Note (TO BE DELETED BY THE RFC EDITOR UPON PUBLICATION)

   This document is posted as an Internet-Draft with the intention of
   eventually becoming an INFORMATIONAL RFC.

Table of Contents

   1. Introduction
      1.1. Summary
         1.1.1. Flow admission control
         1.1.2. Flow pre-emption
         1.1.3. Both admission control and pre-emption
      1.2. Terminology
      1.3. Existing terminology
      1.4. Standardisation requirements
      1.5. Structure of rest of the document
   2. Key aspects of the deployment model
      2.1. Key goals
      2.2. Key assumptions
      2.3. Key benefits
   3. Deployment model
      3.1. Admission control
         3.1.1. Pre-Congestion Notification for Admission Marking
         3.1.2. Measurements to support admission control
         3.1.3. How edge-to-edge admission control supports end-to-end
                QoS signalling
         3.1.4. Use case
      3.2. Flow pre-emption
         3.2.1. Alerting an ingress gateway that flow pre-emption may
                be needed
         3.2.2. Determining the right amount of CL traffic to drop
         3.2.3. Use case for flow pre-emption
   4. Summary of Functionality
      4.1. Ingress gateways
      4.2. Interior routers
      4.3. Egress gateways
      4.4. Failures
   5. Limitations and some potential solutions
      5.1. ECMP
      5.2. Beat down effect
      5.3. Bi-directional sessions
      5.4. Global fairness
      5.5. Flash crowds
      5.6. Pre-empting too fast
      5.7. Other potential extensions
         5.7.1. Tunnelling
         5.7.2. Multi-domain and multi-operator usage
         5.7.3. Preferential dropping of pre-emption marked packets
         5.7.4. Adaptive bandwidth for the Controlled Load service
         5.7.5. Controlled Load service with end-to-end Pre-Congestion
                Notification
         5.7.6. MPLS-TE
   6. Relationship to other QoS mechanisms
      6.1. IntServ Controlled Load
      6.2. Integrated services operation over DiffServ
      6.3. Differentiated Services
      6.4. ECN
      6.5. RTECN
      6.6. RMD
      6.7. RSVP Aggregation over MPLS-TE
   7. Security Considerations
   8. Acknowledgements
   9. Comments solicited
   10. Changes from earlier versions of the draft
   11. Appendices
      11.1. Appendix A: Explicit Congestion Notification
      11.2. Appendix B: What is distributed measurement-based admission
            control?
      11.3. Appendix C: Calculating the Exponentially weighted moving
            average (EWMA)
   12. References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity
   Copyright Statement

1. Introduction

1.1. Summary

   This document describes a deployment model to achieve an end-to-end
   Controlled Load service by using (within a large region of the
   Internet) DiffServ and edge-to-edge distributed measurement-based
   admission control and flow pre-emption.  Controlled Load service is
   a quality of service (QoS) closely approximating the QoS that the
   same flow would receive from a lightly loaded network element
   [RFC2211].  Controlled Load (CL) is useful for inelastic flows such
   as those for real-time media.

   In line with the "IntServ over DiffServ" framework defined in
   [RFC2998], the CL service is supported end-to-end and RSVP
   signalling [RFC2205] is used end-to-end, over an edge-to-edge
   DiffServ region.
    ___     ___    _______________________________________    ___     ___
   |   |   |   |  |                                       |  |   |   |   |
   |   |   |   |  |Ingress    Interior routers     Egress |  |   |   |   |
   |   |   |   |  |gateway                        gateway |  |   |   |   |
   |   |   |   |  |-------+  +-------+  +-------+  +------|  |   |   |   |
   |   |   |   |  | PCN-  |  | PCN-  |  | PCN-  |  |      |  |   |   |   |
   |   |...|   |..|marking|..|marking|..|marking|..| Meter|..|   |...|   |
   |   |   |   |  |-------+  +-------+  +-------+  +------|  |   |   |   |
   |   |   |   |  |   \                            /      |  |   |   |   |
   |   |   |   |  |    \  Congestion-Level-Estimate       |  |   |   |   |
   |   |   |   |  |     \  (for admission control)/       |  |   |   |   |
   |   |   |   |  |      --<-----<----<----<-----<--      |  |   |   |   |
   |   |   |   |  |      Sustainable-Aggregate-Rate       |  |   |   |   |
   |   |   |   |  |       (for flow pre-emption)          |  |   |   |   |
   |___|   |___|  |_______________________________________|  |___|   |___|

     Sx    Access               CL-region                  Access     Rx
    End    Network                                         Network    End
    Host                                                              Host

                  <------- edge-to-edge signalling ------->
                  (for admission control & flow pre-emption)

   <---------------- end-to-end QoS signalling protocol ---------------->

   Figure 1: Overall QoS architecture (NB terminology explained later)

   Figure 1 shows an example of an overall QoS architecture, where the
   two access networks are connected by a CL-region.  Another
   possibility is that there are several CL-regions between the access
   networks - each would operate the Pre-Congestion Notification
   mechanisms separately.

   In Section 1.1.1 we summarise how admission of new CL microflows is
   controlled so as to deliver the required QoS.  In abnormal
   circumstances, for instance a disaster affecting multiple interior
   routers, the QoS of existing CL microflows may degrade even if care
   was exercised when admitting those microflows before those
   circumstances arose.  Therefore we also propose a mechanism
   (summarised in Section 1.1.2) to pre-empt some of the existing
   microflows.
   The remaining microflows then retain their expected QoS, while
   improved QoS is quickly restored to lower priority traffic.

   As a fundamental building block to support these two mechanisms, we
   introduce "Pre-Congestion Notification".  Pre-Congestion
   Notification (PCN) builds on the concepts of RFC 3168, "The Addition
   of Explicit Congestion Notification (ECN) to IP".  The [PCN]
   document defines the respective algorithms that determine when a
   PCN-enabled router marks a packet with Admission Marking or Pre-
   emption Marking, depending on the traffic level.

   In order to support CL traffic we would expect PCN to supplement the
   existing Expedited Forwarding (EF) behaviour.  Within the controlled
   edge-to-edge region, a particular packet receives the Pre-Congestion
   Notification (PCN) behaviour if the packet's differentiated services
   codepoint (DSCP) is set to EF and the ECN field indicates an ECN-
   Capable Transport.  However, PCN is not only intended to supplement
   EF.  PCN is specified (in [PCN]) as a building block which can
   supplement the scheduling behaviour of other PHBs.

   There are various possible ways to encode the markings into a
   packet, using the ECN field and perhaps other DSCPs, which are
   discussed in [PCN].  In this draft we use the abstract names
   Admission Marking and Pre-emption Marking.

   This framework assumes that the Pre-Congestion Notification
   behaviour is used in a controlled environment, i.e. within the
   controlled edge-to-edge region.

1.1.1. Flow admission control

   This document describes a new admission control procedure for an
   edge-to-edge region, which uses new per-hop Pre-Congestion
   Notification 'admission marking' as a fundamental building block.
   In turn, an end-to-end CL service would use this as a building block
   within a broader QoS architecture.

   The per-hop, edge-to-edge and end-to-end aspects are now briefly
   introduced in turn.
   Appendix A provides a brief summary of Explicit Congestion
   Notification (ECN) [RFC3168].  RFC 3168 specifies that a router sets
   the ECN field to the Congestion Experienced (CE) value as a warning
   of incipient congestion.  It does not mandate a particular algorithm
   for setting the CE codepoint, although Random Early Detection (RED)
   is expected to be used.

   Pre-Congestion Notification (PCN) builds on the concepts of ECN.
   PCN introduces a new algorithm that Admission Marks packets before
   there is any significant build-up of CL packets in the queue.
   Admission marked packets therefore act as an "early warning" that
   the amount of traffic flowing is getting close to the engineered
   capacity.  Hence PCN can be used with per-hop behaviours (PHBs)
   designed to operate with very low queue occupancy, such as Expedited
   Forwarding (EF).  Note that our use of the ECN field operates across
   the CL-region, i.e. edge-to-edge, and not host-to-host as in
   [RFC3168].

   Turning next to the edge-to-edge aspect: all routers within a region
   of the Internet, which we call the CL-region, apply the PHB used for
   CL traffic and the Pre-Congestion Notification behaviour.  Traffic
   must enter/leave the CL-region through ingress/egress gateways,
   which have special functionality.  Typically the CL-region is the
   core or backbone of an operator.  The CL service is achieved "edge-
   to-edge" across the CL-region by using distributed measurement-based
   admission control: the decision whether to admit a new microflow
   depends on a measurement of the existing traffic between the same
   pair of ingress and egress gateways (i.e. the same pair as the
   prospective new microflow).
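   Purely for illustration of the per-hop building block, an Admission
   Marking meter might be sketched as a virtual queue that drains at
   the configured-admission-rate (Section 1.2) and marks packets once
   its depth crosses a threshold, so marking begins while the real
   queue is still essentially empty.  This sketch and its parameters
   are assumptions for the example only; the normative algorithms are
   defined in [PCN].

```python
# Illustrative sketch only: the real Admission Marking algorithms are
# specified in [PCN].  A virtual queue fills at the actual CL arrival
# rate but drains at the configured-admission-rate, so a backlog (and
# hence marking) appears only when arrivals exceed that reference rate.

class AdmissionMarker:
    def __init__(self, configured_admission_rate, mark_threshold):
        self.rate = configured_admission_rate  # reference rate, bytes/s
        self.threshold = mark_threshold        # virtual backlog, bytes
        self.vqueue = 0.0                      # virtual queue depth, bytes
        self.last_time = 0.0

    def on_packet(self, size_bytes, now):
        """Return True if this CL packet should be Admission Marked."""
        # Drain the virtual queue for the elapsed interval...
        elapsed = now - self.last_time
        self.vqueue = max(0.0, self.vqueue - self.rate * elapsed)
        self.last_time = now
        # ...then add the new packet and compare against the threshold.
        self.vqueue += size_bytes
        return self.vqueue > self.threshold
```

   The units and threshold value here are arbitrary; [PCN] discusses
   how the marking behaviour is actually configured.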
   (See Appendix B for further discussion of "What is distributed
   measurement-based admission control?")

   As CL packets travel across the CL-region, routers will admission
   mark packets (according to the Pre-Congestion Notification
   algorithm) as an "early warning" of potential congestion, i.e.
   before there is any significant build-up of CL packets in the queue.
   For traffic from each remote ingress gateway, the CL-region's egress
   gateway measures the fraction of CL traffic that is admission
   marked.  The egress gateway calculates this fraction on a per-bit
   basis as a moving average (an exponentially weighted moving average
   is suggested), which we term the Congestion-Level-Estimate (CLE).
   It then reports the CLE to the CL-region's ingress gateway, piggy-
   backed on the signalling for a new flow.  The ingress gateway only
   admits the new CL microflow if the Congestion-Level-Estimate is less
   than the CLE-threshold.  Hence previously accepted CL microflows
   will suffer minimal queuing delay, jitter and loss.

   In turn, the edge-to-edge architecture is a building block in
   delivering an end-to-end CL service.  The approach is similar to
   that described in [RFC2998] for Integrated services operation over
   DiffServ networks.  As in [RFC2998], an IntServ class (CL in our
   case) is achieved end-to-end, with a CL-region viewed as a single
   reservation hop in the total end-to-end path.  Interior routers of
   the CL-region do not process flow signalling, nor do they hold per-
   flow state.  We assume that the end-to-end signalling mechanism is
   RSVP (Section 2.2).  However, the RSVP signalling may itself be
   originated or terminated by proxies still closer to the edge of the
   network, such as home hubs or the like, triggered in turn by
   application layer signalling.  [RFC2998] and our approach are
   compared further in Section 6.2.
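   The edge-to-edge admission control loop described above can be
   sketched roughly as follows.  This is not normative behaviour: the
   EWMA weight, the per-interval measurement structure and the CLE-
   threshold of 5% are invented values for the example (Appendix C
   discusses EWMA calculation in more detail).

```python
# Rough sketch of the egress-side CLE measurement and the ingress-side
# admission test.  All numeric parameters are assumed example values.

class EgressMeter:
    """Kept per remote ingress gateway (i.e. per CL-region-aggregate):
    the Congestion-Level-Estimate (CLE) is an exponentially weighted
    moving average of the admission-marked fraction of CL bits."""

    def __init__(self, weight=0.3):
        self.weight = weight   # EWMA weight per measurement interval
        self.cle = 0.0         # Congestion-Level-Estimate, 0.0 .. 1.0
        self._marked_bits = 0
        self._total_bits = 0

    def on_packet(self, bits, admission_marked):
        self._total_bits += bits
        if admission_marked:
            self._marked_bits += bits

    def end_interval(self):
        """Fold this interval's marked fraction into the EWMA."""
        if self._total_bits:
            fraction = self._marked_bits / self._total_bits
            self.cle = (1 - self.weight) * self.cle + self.weight * fraction
        self._marked_bits = self._total_bits = 0
        return self.cle

def admit(reported_cle, cle_threshold=0.05):
    """Ingress gateway's test: admit the new CL microflow only if the
    CLE piggy-backed on the reservation signalling is below the
    CLE-threshold."""
    return reported_cle < cle_threshold
```

   In practice the CLE would be carried from egress to ingress in the
   extended RSVP signalling described in Section 1.4, not computed
   locally as here.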
   An important benefit compared with the IntServ over DiffServ model
   [RFC2998] arises from the fact that the load is controlled
   dynamically rather than with traffic conditioning agreements (TCAs).
   TCAs were originally introduced in the (informational) DiffServ
   architecture [RFC2475] as an alternative to reservation processing
   in the interior region, in order to reduce the burden on interior
   routers.  With TCAs, in practice service providers rely on
   subscription-time Service Level Agreements that statically define
   the parameters of the traffic that will be accepted from a customer.
   The problem arises because the TCA at the ingress must allow any
   destination address if it is to remain scalable.  But the longer the
   topology, the greater the chance that traffic will focus on an
   interior resource, even though it is within contract at the ingress
   [Reid], e.g. all flows converge on the same egress gateway.  Even
   though networks can be engineered to make such failures rare, when
   they occur all inelastic flows through the congested resource fail
   catastrophically.

   Distributed measurement-based admission control avoids reservation
   processing (whether per-flow or aggregated) on interior routers, but
   flows are still blocked dynamically in response to actual congestion
   on any interior router.  Hence there is no need for accurate or
   conservative prediction of the traffic matrix.

1.1.2. Flow pre-emption

   An essential QoS issue in core and backbone networks is being able
   to cope with failures of routers and links.  The consequent re-
   routing can cause severe congestion on some links and hence degrade
   the QoS experienced by on-going microflows and other, lower priority
   traffic.  Even when the network is engineered to sustain a single
   link failure, multiple link failures (e.g.
   due to a fibre cut, router failure or a natural disaster) can cause
   violation of capacity constraints and resulting QoS failures.  Our
   solution uses rate-based flow pre-emption, so that enough of the
   previously admitted CL microflows are dropped to ensure that the
   remaining ones again receive QoS commensurate with the CL service
   and at least some QoS is quickly restored to other traffic classes.

   The solution involves four steps.  First, triggering the ingress
   gateway to test whether pre-emption may be needed.  A router
   enhanced with Pre-Congestion Notification may optionally include an
   algorithm that Pre-emption Marks packets.  Reception of a packet
   with such a marking alerts the egress gateway that pre-emption may
   be needed, which in turn sends a Pre-emption Alert message to the
   ingress gateway.  Secondly, calculating the right amount of traffic
   to drop.  This involves the egress gateway measuring, and reporting
   to the ingress gateway, the current rate of CL traffic received from
   that particular ingress gateway.  This is the CL rate which the
   network can actually support from that ingress gateway to that
   egress gateway, and we thus call it the Sustainable-Aggregate-Rate.
   The ingress gateway compares the Sustainable-Aggregate-Rate with the
   rate that it is sending, and hence determines how much traffic needs
   to be pre-empted.  Thirdly, choosing which flows to shed in order to
   drop the traffic calculated in the second step.  Information on the
   priority of flows may be held by the ingress gateway, or by some
   out-of-band policy decision point.  How these systems co-ordinate to
   determine which flows to drop is outside the scope of this document,
   but between them they have all the information necessary to make the
   decision.  Fourthly, tearing down reservations for the chosen flows.
   The ingress gateway triggers standard tear-down messages for the
   reservation protocol in use.  In turn, this is expected to result in
   end-systems tearing down the corresponding sessions (e.g. voice
   calls) using the corresponding session control protocols.

   The focus of this document is on the first two steps, i.e.
   determining that pre-emption may be needed and estimating how much
   traffic needs to be pre-empted.  We provide some hints about the
   latter two steps in Section 3.2.3, but don't try to provide full
   guidance, as it greatly depends on the particular detailed
   operational situation.

   The solution operates within a little over one round trip time: the
   time required for microflow packets that have experienced Pre-
   emption Marking to travel downstream through the CL-region and
   arrive at the egress gateway, plus some additional time for the
   egress gateway to measure the rate seen after it has been alerted
   that pre-emption may be needed, and the time for the egress gateway
   to report this information to the ingress gateway.

1.1.3. Both admission control and pre-emption

   This document describes both the admission control and pre-emption
   mechanisms, and we suggest that an operator uses both.  However, we
   do not require this, and some operators may want to implement only
   one.

   For example, an operator could use just admission control, solving
   heavy congestion (caused by re-routing) by 'just waiting': as
   sessions end, existing microflows naturally depart from the system
   over time, and the admission control mechanism will prevent
   admission of new microflows that use the affected links.  So the CL-
   region will naturally return to normal Controlled Load service, but
   with reduced capacity.  The drawback of this approach is that until
   flows naturally depart to relieve the congestion, all flows and
   lower priority services will be adversely affected.
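   Returning to the pre-emption mechanism summarised in Section 1.1.2:
   the second step's rate comparison, together with a purely
   hypothetical greedy flow-selection policy for the third step (which
   this document deliberately leaves out of scope), could be sketched
   as follows.

```python
# Hypothetical sketch: the draft specifies what must be compared, not
# this code, and leaves flow selection to the operator's policy system.

def preemption_target(ingress_aggregate_rate, sustainable_aggregate_rate):
    """Rate of CL traffic the ingress should shed for one
    CL-region-aggregate; zero if the aggregate is already within the
    Sustainable-Aggregate-Rate reported by the egress gateway."""
    return max(0.0, ingress_aggregate_rate - sustainable_aggregate_rate)

def choose_flows_to_drop(flows, target_rate):
    """Invented greedy policy: shed lowest-priority flows first until
    at least target_rate has been freed.

    flows: list of (flow_id, rate, priority) tuples, where a higher
    priority number means a more important session (cf. the MLPP-style
    priorities in Section 2.1)."""
    dropped, freed = [], 0.0
    for flow_id, rate, priority in sorted(flows, key=lambda f: f[2]):
        if freed >= target_rate:
            break
        dropped.append(flow_id)
        freed += rate
    return dropped
```

   For example, an ingress sending 120 Mbit/s on an aggregate whose
   egress reports a Sustainable-Aggregate-Rate of 90 Mbit/s would aim
   to shed 30 Mbit/s worth of flows.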
   As another example, an operator could use just admission control,
   avoiding heavy congestion (caused by re-routing) by 'capacity
   planning': configuring admission control thresholds to lower levels
   than the network could accept in normal situations, such that the
   load after a failure is expected to stay at acceptable levels even
   with reduced network resources.

   On the other hand, an operator could rely for admission control just
   on the traffic conditioning agreements of the DiffServ architecture
   [RFC2475].  The pre-emption mechanism described in this document
   would then be used to counteract the problem described at the end of
   Section 1.1.1.

1.2. Terminology

   This terminology is copied from the pre-congestion notification
   marking draft [PCN]:

   o  Pre-Congestion Notification (PCN): two new algorithms that
      determine when a PCN-enabled router Admission Marks and Pre-
      emption Marks a packet, depending on the traffic level.

   o  Admission Marking condition: the traffic level is such that the
      router Admission Marks packets.  The router provides an "early
      warning" that the load is nearing the engineered admission
      control capacity, before there is any significant build-up of CL
      packets in the queue.

   o  Pre-emption Marking condition: the traffic level is such that the
      router Pre-emption Marks packets.  The router warns explicitly
      that pre-emption may be needed.

   o  Configured-admission-rate: the reference rate used by the
      admission marking algorithm in a PCN-enabled router.

   o  Configured-pre-emption-rate: the reference rate used by the pre-
      emption marking algorithm in a PCN-enabled router.

   The following terms are defined here:

   o  Ingress gateway: router at an ingress to the CL-region.  A CL-
      region may have several ingress gateways.

   o  Egress gateway: router at an egress from the CL-region.  A CL-
      region may have several egress gateways.
   o  Interior router: a router which is part of the CL-region, but
      isn't an ingress or egress gateway.

   o  CL-region: a region of the Internet in which all traffic
      enters/leaves through an ingress/egress gateway and all routers
      run Pre-Congestion Notification marking.  A CL-region is a
      DiffServ region (either a single DiffServ domain or a set of
      contiguous DiffServ domains), but note that the CL-region does
      not use the traffic conditioning agreements (TCAs) of the
      (informational) DiffServ architecture.

   o  CL-region-aggregate: all the microflows between a specific pair
      of ingress and egress gateways.  Note there is no field in the
      flow packet headers that uniquely identifies the aggregate.

   o  Congestion-Level-Estimate: the number of bits in CL packets that
      are admission marked (or pre-emption marked), divided by the
      number of bits in all CL packets.  It is calculated as an
      exponentially weighted moving average, by an egress gateway, for
      the CL packets from a particular ingress gateway; i.e. there is a
      Congestion-Level-Estimate for each CL-region-aggregate.

   o  Sustainable-Aggregate-Rate: the rate of traffic that the network
      can actually support for a specific CL-region-aggregate.  It is
      measured by an egress gateway for the CL packets from a
      particular ingress gateway.

   o  Ingress-Aggregate-Rate: the rate of traffic that is being sent on
      a specific CL-region-aggregate.  It is measured by an ingress
      gateway for the CL packets sent towards a particular egress
      gateway.

1.3. Existing terminology

   This is a placeholder for useful terminology that is defined
   elsewhere.

1.4. Standardisation requirements

   The framework described in this document has two new
   standardisation requirements:

   o  new Pre-Congestion Notification behaviours for Admission Marking
      and Pre-emption Marking are required, as detailed in [PCN].
   o  the end-to-end signalling protocol needs to be modified to carry
      the Congestion-Level-Estimate report (for admission control) and
      the Sustainable-Aggregate-Rate (for flow pre-emption).  With our
      assumption of RSVP (Section 2.2) as the end-to-end signalling
      protocol, this means that extensions to RSVP are required, as
      detailed in [RSVP-PCN], for example to carry the Congestion-
      Level-Estimate and Sustainable-Aggregate-Rate information from
      egress gateway to ingress gateway.

   o  We are still discussing what to standardise about the gateways'
      behaviour.

   Other than this, the arrangement uses existing IETF protocols
   throughout, although not in their usual architecture.

1.5. Structure of rest of the document

   Section 2 describes some key aspects of the deployment model: our
   goals, assumptions and the benefits we believe it has.  Section 3
   describes the deployment model, whilst Section 4 summarises the
   required changes to the various routers in the CL-region.  Section 5
   outlines some limitations of PCN that we've identified in this
   deployment model; it also discusses some potential solutions and
   other possible extensions.  Section 6 provides some comparison with
   existing QoS mechanisms.

2. Key aspects of the deployment model

   In this section we discuss the key aspects of the deployment model:

   o  at a high level, our key goals, i.e. the functionality that we
      want to achieve;

   o  the assumptions that we're prepared to make;

   o  the consequent benefits they bring.

2.1. Key goals

   The deployment model achieves an end-to-end Controlled Load (CL)
   service where a segment of the end-to-end path is an edge-to-edge
   Pre-Congestion Notification region.  CL is a quality of service
   (QoS) closely approximating the QoS that the same flow would
   receive from a lightly loaded network element [RFC2211].  It is
   useful for inelastic flows such as those for real-time media.
   o  The CL service should be achieved despite varying load levels
      of other sorts of traffic, which may or may not be rate
      adaptive (i.e. responsive to packet drops or ECN marks).

   o  The CL service should be supported for a variety of possible CL
      sources: Constant Bit Rate (CBR), Variable Bit Rate (VBR) and
      voice with silence suppression.  VBR is the most challenging to
      support.

   o  After a localised failure in the interior of the CL-region
      causing heavy congestion, the CL service should recover
      gracefully by pre-empting (dropping) some of the admitted CL
      microflows, whilst preserving as many of them as possible with
      their full CL QoS.

   o  It needs to be possible to complete flow pre-emption within 1-2
      seconds.  Operators will have varying requirements but, at
      least for voice, it has been estimated that after a few seconds
      many affected users will start to hang up, making the flow
      pre-emption mechanism redundant and possibly even
      counter-productive.  Until flow pre-emption kicks in, other
      applications using CL (e.g. video) and lower priority traffic
      (e.g. Assured Forwarding (AF)) could be receiving reduced
      service.  Therefore an even faster flow pre-emption mechanism
      would be desirable (even if, in practice, operators have to add
      a deliberate pause to ride out a transient while the natural
      rate of call tear-down or lower layer protection mechanisms
      kick in).

   o  The CL service should support emergency services ([EMERG-RQTS],
      [EMERG-TEL]) as well as the Assured Service, which is the IP
      implementation of the existing ITU-T/NATO/DoD telephone system
      architecture known as Multi-Level Pre-emption and Precedence
      [ITU.MLPP.1990] [ANSI.MLPP.Spec] [ANSI.MLPP.Supplement], or
      MLPP.  In particular, this involves admitting new flows that
      are part of high priority sessions even when admission control
      would reject new routine flows.
      Similarly, when having to choose which flows to pre-empt, this
      involves taking into account the priorities and properties of
      the sessions that the flows are part of.

2.2.  Key assumptions

   The framework does not try to deliver the above functionality in
   all scenarios.  We make the following assumptions about the type
   of scenario to be solved.

   o  Edge-to-edge: all the routers in the CL-region are upgraded
      with Pre-Congestion Notification, and all the ingress and
      egress gateways are upgraded to perform the measurement-based
      admission control and flow pre-emption.  Note that although the
      upgrades required are edge-to-edge, the CL service is provided
      end-to-end.

   o  Additional load: we assume that any additional load offered
      within the reaction time of the admission control mechanism
      doesn't move the CL-region directly from no congestion to
      overload.  So we assume there will always be an intermediate
      stage where some CL packets are Admission Marked, but they are
      still delivered without significant QoS degradation.  We
      believe this is valid for core and backbone networks with
      typical call arrival patterns (given the reaction time is
      little more than one round trip time across the CL-region), but
      is unlikely to be valid in access networks, where the
      granularity of an individual call becomes significant.

   o  Aggregation: we assume that in normal operation there are many
      CL microflows within the CL-region, typically at least hundreds
      between any pair of ingress and egress gateways.  The
      implication is that the solution is targeted at core and
      backbone networks and possibly parts of large access networks.

   o  Trust: we assume that there is trust between all the routers in
      the CL-region.  For example, this trust model is satisfied if
      one operator runs the whole of the CL-region.  But we make no
      such assumptions about the end hosts, i.e.
      depending on the scenario they may be trusted or untrusted by
      the CL-region.

   o  Signalling: we assume that the end-to-end signalling protocol
      is RSVP.  Section 3 describes how the CL-region fits into such
      an end-to-end QoS scenario, whilst [RSVP-PCN] describes the
      extensions to RSVP that are required.

   o  Separation: we assume that all routers within the CL-region are
      upgraded with the CL mechanism, so the requirements of [Floyd]
      are met because the CL-region is an enclosed environment.
      Also, an operator separates CL traffic in the CL-region from
      outside traffic by administrative configuration of the ring of
      gateways around the region.  Within the CL-region we assume
      that the CL traffic is separated from non-CL traffic.

   o  Routing: we assume that all packets between a pair of ingress
      and egress gateways follow the same path, or that they follow
      different paths but that the load balancing scheme in the
      CL-region is tuned to distribute load such that the different
      paths always receive comparable relative load.  This ensures
      that the Congestion-Level-Estimate used in the admission
      control procedure (which is computed taking into account
      packets travelling on all the paths) approximately reflects the
      status of the actual path that will be followed by the new
      microflow's packets.

   We are investigating ways of loosening the restrictions set by
   some of these assumptions, for instance:

   o  Trust: to allow the CL-region to span multiple, non-trusting
      operators, using the technique of [Re-PCN] as mentioned in
      Section 5.7.2.

   o  Signalling: we believe that the solution could operate with
      another signalling protocol, such as the one produced by the
      NSIS working group.  It could also work with application level
      signalling as suggested in [RT-ECN].
   o  Additional load: we believe that the assumption is valid for
      core and backbone networks, with an appropriate margin between
      the configured-admission-rate and the capacity for CL traffic.
      However, in principle a burst of admission requests can occur
      in a short time.  We expect this to be a rare event under
      normal conditions, but it could happen, e.g. due to a 'flash
      crowd'.  If it does, then more flows may be admitted than
      should be, triggering the pre-emption mechanism.  There are
      various ways an operator might try to alleviate this issue,
      which are discussed in the 'Flash crowds' section (Section 5.5)
      later.

   o  Separation: the assumption that CL traffic is separated from
      non-CL traffic implies that the CL traffic has its own PHB, not
      shared with other traffic.  We are looking at whether it could
      share Expedited Forwarding's PHB, but supplemented with
      Pre-Congestion Notification.  If this is possible, other PHBs
      (like Assured Forwarding) could be supplemented with the same
      new behaviours.  This is similar to how RFC3168 ECN was defined
      to supplement any PHB.

   o  Routing: we are looking in greater detail at the solution in
      the presence of Equal Cost Multi-Path routing and at suitable
      enhancements.  See also the 'ECMP' section (Section 5.1) later.

2.3.  Key benefits

   We believe that the mechanism described in this document has
   several advantages:

   o  It achieves statistical guarantees of quality of service for
      microflows, delivering a very low delay, jitter and packet loss
      service suitable for applications like voice and video calls
      that generate real-time inelastic traffic.  This is because of
      its per-microflow admission control scheme, combined with its
      dynamic on-path "early warning" of potential congestion.
      The guarantee is at least as strong as with IntServ Controlled
      Load (Section 6.1 mentions why the guarantee may be somewhat
      better), but without the scalability problems of per-microflow
      IntServ.

   o  It can support "Emergency" and military Multi-Level Pre-emption
      and Precedence (MLPP) services, even in times of heavy
      congestion (perhaps caused by failure of a router within the
      CL-region), by pre-empting on-going "ordinary" CL microflows.
      See also Section 4.5.

   o  It scales well, because there is no signal processing or
      per-flow state held by the interior routers of the CL-region.
      Note that interior routers only hold state per outgoing
      interface - they do not hold state per CL-region-aggregate nor
      per flow.

   o  It is resilient, again because no per-flow state is held by the
      interior routers of the CL-region.  Hence during an interior
      routing change caused by a router failure, no microflow state
      has to be relocated.  The flow pre-emption mechanism further
      helps resilience because it rapidly reduces the load to one
      that the CL-region can support.

   o  It helps preserve, through the flow pre-emption mechanism, QoS
      to as many microflows as possible and to lower priority traffic
      in times of heavy congestion (e.g. caused by failure of an
      interior router).  Otherwise long-lived microflows could cause
      loss on all CL microflows for a long time.

   o  It avoids the potential catastrophic failure problem that
      arises when the DiffServ architecture is used in large networks
      with statically provisioned capacity.  This is achieved by
      controlling the load dynamically, based on real-time
      edge-to-edge measurement of Pre-Congestion Notification along
      the path, as discussed in Section 1.1.1.

   o  It requires minimal new standardisation, because it reuses
      existing QoS protocols and algorithms.

   o  It can be deployed incrementally, region by region or network
      by network.
      Not all the regions or networks on the end-to-end path need to
      have it deployed.  Two CL-regions can even be separated by a
      network that uses another QoS mechanism (e.g. MPLS-TE).

   o  It provides a deployment path for the use of ECN by real-time
      applications.  Operators can gain experience of ECN before its
      applicability to end-systems is understood and end terminals
      are ECN capable.

3.  Deployment model

3.1.  Admission control

   In this section we describe the admission control mechanism.  We
   discuss the three pieces of the solution and then give an example
   of how they fit together in a use case:

   o  the new Pre-Congestion Notification for Admission Marking used
      by all routers in the CL-region

   o  how the measurements made support our admission control
      mechanism

   o  how the edge-to-edge mechanism fits into the end-to-end RSVP
      signalling

3.1.1.  Pre-Congestion Notification for Admission Marking

   This is discussed in [PCN].  Here we only give a brief outline.

   To support our admission control mechanism, each router in the
   CL-region runs an algorithm to determine whether to Admission Mark
   a packet.  The algorithm measures the aggregate CL traffic on the
   link and ensures that packets are admission marked before the
   actual queue builds up, but when it is in danger of doing so soon;
   the probability of admission marking increases with the danger.
   The algorithm's main parameter is the configured-admission-rate,
   which is set lower than the link speed, perhaps considerably so.
   Admission marked packets indicate that the CL traffic rate is
   reaching the configured-admission-rate and so act as an "early
   warning" that the engineered capacity is nearly reached.
   Therefore they indicate that requests to admit prospective new CL
   flows may need to be refused.

3.1.2.  Measurements to support admission control

   To support our admission control mechanism the egress gateway
   measures the Congestion-Level-Estimate for traffic from each
   remote ingress gateway, i.e. per CL-region-aggregate.  The
   Congestion-Level-Estimate is the number of bits in CL packets that
   are admission marked or pre-emption marked, divided by the number
   of bits in all CL packets.  It is calculated as an exponentially
   weighted moving average.  It is calculated by an egress gateway
   separately for the CL packets from each particular ingress
   gateway.

   Why are pre-emption marked packets included in the
   Congestion-Level-Estimate?  Pre-emption marking over-writes
   admission marking, i.e. a packet cannot be both admission and
   pre-emption marked.  So if pre-emption marked packets weren't
   counted we would have the anomaly that, as the traffic rate grew
   above the configured-pre-emption-rate, the
   Congestion-Level-Estimate would fall.  If a particular encoding
   scheme is chosen in which a packet can be both admission and
   pre-emption marked (such as Alternative 4 in Appendix C of [PCN]),
   then this is not necessary.

   This Congestion-Level-Estimate provides an estimate of how near
   the links on the path inside the CL-region are getting to the
   configured-admission-rate.  Note that the metering is done
   separately per ingress gateway, because there may be sufficient
   capacity on all the routers on the path between one ingress
   gateway and a particular egress, but not from a second ingress to
   that same egress gateway.

3.1.3.  How edge-to-edge admission control supports end-to-end QoS
        signalling

   Consider a scenario that consists of two end hosts, each connected
   to their own access network, which are linked by the CL-region.  A
   source tries to set up a new CL microflow by sending an RSVP PATH
   message, and the receiving end host replies with an RSVP RESV
   message.
   Outside the CL-region some other method, for instance IntServ, is
   used to provide QoS.  From the perspective of RSVP the CL-region
   is a single hop, so the RSVP PATH and RESV messages are processed
   by the ingress and egress gateways but are carried transparently
   across all the interior routers; hence, the ingress and egress
   gateways hold per-microflow state, whilst no per-microflow state
   is kept by the interior routers.  So far this is as in IntServ
   over DiffServ [RFC2998].  However, in order to support our
   admission control mechanism, the egress gateway adds to the RESV
   message an opaque object which states the current
   Congestion-Level-Estimate for the relevant CL-region-aggregate.
   Details of the corresponding RSVP extensions are described in
   [RSVP-PCN].

3.1.4.  Use case

   To see how the three pieces of the solution fit together, we
   imagine a scenario where some microflows are already in place
   between a given pair of ingress and egress gateways, but the
   traffic load is such that no packets from these flows are
   admission marked as they travel across the CL-region.  A source
   wanting to start a new CL microflow sends an RSVP PATH message.
   The egress gateway adds an object to the RESV message with the
   Congestion-Level-Estimate, which is zero.  The ingress gateway
   sees this and consequently admits the new flow.  It then forwards
   the RSVP RESV message upstream towards the source end host.
   Hence, assuming there's sufficient capacity in the access
   networks, the new microflow is admitted end-to-end.

   The source now sends CL packets, which arrive at the ingress
   gateway.  The ingress uses a five-tuple filter to identify that
   the packets are part of a previously admitted CL microflow, and it
   also polices the microflow to ensure it remains within its traffic
   profile.  (The ingress has learnt the required information from
   the RSVP messages.)
   When forwarding a packet belonging to an admitted microflow, the
   ingress sets the packet's DSCP and ECN fields to the appropriate
   values configured for the CL-region.  The CL packet now travels
   across the CL-region, getting admission marked if necessary.

   Next, we imagine the same scenario but at a later time when load
   is higher at one (or more) of the interior routers, which start to
   Admission Mark CL packets because their load on the outgoing link
   is nearing the configured-admission-rate.  The next time a source
   tries to set up a CL microflow, the ingress gateway learns (from
   the egress) the relevant Congestion-Level-Estimate.  If it is
   greater than some CLE-threshold value then the ingress refuses the
   request, otherwise it is accepted.  The ingress gateway could also
   take into account attributes of the RSVP reservation (such as, for
   example, the RSVP pre-emption priority of [RSVP-PREEMPTION] or the
   RSVP admission priority of [RSVP-EMERGENCY]) as well as
   information provided by a policy decision point in order to make a
   more sophisticated admission decision.  This way, flow admission
   can help emergency/military calls by taking into account the
   corresponding priorities (as conveyed in RSVP policy elements)
   when deciding to admit or reject a new reservation.  Use of RSVP
   for the support of emergency/military applications is discussed in
   further detail in [RFC4542] and [RSVP-EMERGENCY].

   It is also possible for an egress gateway to get an RSVP RESV
   message and not know what the Congestion-Level-Estimate is, for
   example if there are no CL microflows at present between the
   relevant ingress and egress gateways.  In this case the egress
   requests the ingress to send probe packets, from which it can
   initialise its meter.  RSVP extensions for such a request to send
   probe data can be found in [RSVP-PCN].

3.2.  Flow pre-emption

   In this section we describe the flow pre-emption mechanism.  We
   discuss the two parts of the solution and then give an example of
   how they fit together in a use case:

   o  How an ingress gateway is triggered to test whether flow
      pre-emption may be needed

   o  How an ingress gateway determines the right amount of CL
      traffic to drop

   The mechanism is defined in [PCN] and [RSVP-PCN].

3.2.1.  Alerting an ingress gateway that flow pre-emption may be
        needed

   Alerting an ingress gateway that flow pre-emption may be needed is
   a two-stage process: a router in the CL-region alerts an egress
   gateway that flow pre-emption may be needed; in turn the egress
   gateway alerts the relevant ingress gateway.  Every router in the
   CL-region has the ability to alert egress gateways, which may be
   done either explicitly or implicitly:

   o  Explicit - the router per-hop behaviour is supplemented with a
      new Pre-emption Marking behaviour, which is outlined below.
      Reception of such a packet by the egress gateway alerts it that
      pre-emption may be needed.

   o  Implicit - the router behaviour is unchanged from the Admission
      Marking behaviour described earlier.  The egress gateway treats
      a Congestion-Level-Estimate of (almost) 100% as an implicit
      alert that pre-emption may be required.  ('Almost' because the
      Congestion-Level-Estimate is a moving average, so it can never
      reach exactly 100%.)

   To support explicit pre-emption alerting, each router in the
   CL-region runs an algorithm to determine whether to Pre-emption
   Mark a packet.  The algorithm measures the aggregate CL traffic
   and ensures that packets are pre-emption marked before the actual
   queue builds up.  The algorithm's main parameter is the
   configured-pre-emption-rate, which is set lower than the link
   speed (but higher than the configured-admission-rate).
   Thus pre-emption marked packets indicate that the CL traffic rate
   is reaching the configured-pre-emption-rate and so act as an
   "early warning" that the engineered capacity is nearly reached.
   Therefore they indicate that it may be advisable to pre-empt some
   of the existing CL flows in order to preserve the QoS of the
   others.

   Note that the explicit mechanism only makes sense if all the
   routers in the CL-region have the functionality, so that the
   egress gateways can rely on the explicit mechanism.  Otherwise
   there is the danger that the traffic happens to focus on a router
   without it, and egress gateways then also have to watch for
   implicit pre-emption alerts.

   When one or more packets in a CL-region-aggregate alert the egress
   gateway of the need for flow pre-emption, whether explicitly or
   implicitly, the egress puts that CL-region-aggregate into the
   Pre-emption Alert state.  For each CL-region-aggregate in alert
   state it measures the rate of traffic at the egress gateway (i.e.
   the traffic rate of the appropriate CL-region-aggregate) and
   reports this to the relevant ingress gateway.  The steps are:

   o  Determine the relevant ingress gateway - for the explicit case
      the egress gateway examines the pre-emption marked packet and
      uses the state installed at the time of admission to determine
      which ingress gateway the packet came from.  For the implicit
      case the egress gateway has already determined this
      information, because the Congestion-Level-Estimate is
      calculated per ingress gateway.

   o  Measure the traffic rate of CL packets - as soon as the egress
      gateway is alerted (whether explicitly or implicitly) it
      measures the rate of CL traffic from this ingress gateway (i.e.
      for this CL-region-aggregate).  Note that pre-emption marked
      packets are excluded from that measurement.
      It should make its measurement quickly and accurately, but
      exactly how is up to the implementation.

   o  Alert the ingress gateway - the egress gateway then immediately
      alerts the relevant ingress gateway that flow pre-emption may
      be required.  This Alert message also includes the measured
      Sustainable-Aggregate-Rate, i.e. the rate of CL traffic
      received from this ingress gateway.  The Alert message is sent
      using reliable delivery.  Procedures for the support of such an
      Alert using RSVP are defined in [RSVP-PCN].

                --------------       _ _         -----------------
    CL packet  |Update        |   / Is it a \  Y| Measure CL rate |
    arrives -->|Congestion-   |-->/pre-emption\>| from ingress and|
               |Level-Estimate|   \  marked   / | alert ingress   |
                --------------     \ packet? /   -----------------
                                     \_ _/

     Figure 2: Egress gateway action for explicit Pre-emption Alert

                                     _ _
                --------------      /   \        -----------------
    CL packet  |Update        |    / Is  \     Y| Measure CL rate |
    arrives -->|Congestion-   |-->/ C.L.E. \--->| from ingress and|
               |Level-Estimate|   \(nearly)/    | alert ingress   |
                --------------     \ 100%?/      -----------------
                                     \_ _/

     Figure 3: Egress gateway action for implicit Pre-emption Alert

3.2.2.  Determining the right amount of CL traffic to drop

   The method relies on the insight that the amount of CL traffic
   that can be supported between a particular pair of ingress and
   egress gateways is the amount of CL traffic that is actually
   getting across the CL-region to the egress gateway without being
   Pre-emption Marked.  Hence we term it the
   Sustainable-Aggregate-Rate.
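   The way an ingress gateway could act on this insight can be
   sketched as follows.  This Python fragment is purely illustrative:
   the function name, the flow tuples and the lowest-priority-first
   selection policy are our assumptions, not anything specified in
   this document; a real ingress would measure rates from live
   traffic and consult RSVP priorities or a policy decision point
   when choosing which flows to shed.

```python
# Illustrative sketch (not part of the specification): an ingress
# gateway's reaction to a Pre-emption Alert, using the rate
# comparison and "error" margin described in this section.

def flows_to_preempt(ingress_aggregate_rate, sustainable_aggregate_rate,
                     flows, error):
    """Return the flows to pre-empt, least important first.

    ingress_aggregate_rate:     rate this ingress sends to the egress
    sustainable_aggregate_rate: rate the egress reports in the Alert
    flows: list of (flow_id, rate, priority) in this CL-region-aggregate
    error: margin for measurement inaccuracy and source rate variation
    """
    # Only pre-empt if the ingress is sending significantly more than
    # the egress reports receiving without pre-emption marks.
    if ingress_aggregate_rate <= sustainable_aggregate_rate + error:
        return []

    # Shed slightly more load than seems necessary, in case the two
    # measurements were taken during a short-term fall in load.
    target = sustainable_aggregate_rate - error
    victims = []
    remaining = ingress_aggregate_rate
    # Drop the least important flows first (lowest priority value).
    for flow_id, rate, priority in sorted(flows, key=lambda f: f[2]):
        if remaining <= target:
            break
        victims.append(flow_id)
        remaining -= rate
    return victims
```

   For instance, with an Ingress-Aggregate-Rate of 100 units, a
   reported Sustainable-Aggregate-Rate of 90 and an error margin of
   5, only the lowest priority flow needs to be shed.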
   So when the ingress gateway gets the Alert message from an egress
   gateway, it compares:

   o  The traffic rate that it is sending to this particular egress
      gateway (which we term the Ingress-Aggregate-Rate)

   o  The traffic rate that the egress gateway reports (in the Alert
      message) that it is receiving from this ingress gateway (which
      is the Sustainable-Aggregate-Rate)

   If the difference is significant, then the ingress gateway
   pre-empts some microflows.  It only pre-empts if:

      Ingress-Aggregate-Rate > Sustainable-Aggregate-Rate + error

   The "error" term is partly to allow for inaccuracies in the
   measurements of the rates.  It is also needed because the
   Ingress-Aggregate-Rate is measured at a slightly later moment than
   the Sustainable-Aggregate-Rate, and it is quite possible that the
   Ingress-Aggregate-Rate has increased in the interim due to natural
   variation of the bit rate of the CL sources.  So the "error" term
   allows for some variation in the ingress rate without triggering
   pre-emption.

   The ingress gateway should pre-empt enough microflows to ensure
   that:

      New Ingress-Aggregate-Rate < Sustainable-Aggregate-Rate - error

   The "error" term here is used for similar reasons but in the other
   direction, to ensure slightly more load is shed than seems
   necessary, in case the two measurements were taken during a
   short-term fall in load.

   When the routers in the CL-region are using explicit pre-emption
   alerting, the ingress gateway would normally pre-empt microflows
   whenever it gets an alert (it always would if it were possible to
   set "error" equal to zero).  For the implicit case, however, this
   is not so.  The ingress receives an Alert message when the
   Congestion-Level-Estimate reaches (almost) 100%, which is roughly
   when traffic exceeds the configured-admission-rate.
   However, it is only when packets are actually dropped en route
   that the Sustainable-Aggregate-Rate becomes less than the
   Ingress-Aggregate-Rate, so only then will the ingress gateway
   actually pre-empt flows.

   Hence with the implicit scheme, pre-emption can only be triggered
   once the system starts dropping packets and thus the QoS of flows
   starts being significantly degraded.  This is in contrast with the
   explicit scheme, which allows flow pre-emption to be triggered
   before any packet drop, simply when the traffic reaches the
   configured-pre-emption-rate.  Therefore we believe that the
   explicit mechanism is superior.  However, it does require new
   functionality on all the routers (although this is little more
   than a bulk token bucket - see [PCN] for details).

3.2.3.  Use case for flow pre-emption

   To see how the pieces of the solution fit together in a use case,
   we imagine a scenario where many microflows have already been
   admitted.  We confine our description to the explicit pre-emption
   mechanism.  Now an interior router in the CL-region fails.  The
   network layer routing protocol re-routes round the problem, but as
   a consequence the traffic on other links increases.  In fact,
   let's assume the traffic on one link now exceeds its
   configured-pre-emption-rate and so the router pre-emption marks CL
   packets.  When the egress sees the first of the pre-emption marked
   packets it immediately determines which microflow the packet is
   part of (by using a five-tuple filter and comparing it with state
   installed at admission) and hence which ingress gateway the packet
   came from.  It sets up a meter to measure the traffic rate from
   this ingress gateway, and as soon as possible sends a message to
   the ingress gateway.  This message alerts the ingress gateway that
   pre-emption may be needed and contains the traffic rate measured
   by the egress gateway.
   Then the ingress gateway determines the traffic rate that it is
   sending towards this egress gateway, and hence it can calculate
   the amount of traffic that needs to be pre-empted.

   The ingress gateway could now just shed random microflows, but it
   is better if the least important ones are dropped.  The ingress
   gateway could use information stored locally in each reservation's
   state (such as, for example, the RSVP pre-emption priority of
   [RSVP-PREEMPTION] or the RSVP admission priority of
   [RSVP-EMERGENCY]) as well as information provided by a policy
   decision point in order to decide which of the flows to shed (or
   perhaps which ones not to shed).  This way, flow pre-emption can
   also help emergency/military calls by taking into account the
   corresponding priorities (as conveyed in RSVP policy elements)
   when selecting calls to be pre-empted, which is likely to be
   particularly important in a disaster scenario.  Use of RSVP for
   the support of emergency/military applications is discussed in
   further detail in [RFC4542] and [RSVP-EMERGENCY].

   The ingress gateway then initiates RSVP signalling to instruct the
   relevant destinations that their reservation has been terminated,
   and to tell (RSVP) nodes along the path to tear down the
   associated RSVP state.  To guard against recalcitrant sources,
   normal IntServ policing may be used to block any future traffic
   from the dropped flows from entering the CL-region.  Note that -
   with the explicit Pre-emption Alert mechanism - since the
   configured-pre-emption-rate may be significantly less than the
   physical line capacity, flow pre-emption may be triggered before
   any congestion has actually occurred and before any packet is
   dropped.

   We extend the scenario further by imagining that (due to a
   disaster of some kind) further routers in the CL-region fail
   during the time taken by the pre-emption process described above.
   This is handled naturally, as packets will continue to be
   pre-emption marked and so the pre-emption process will happen for
   a second time.

4.  Summary of Functionality

   This section is intended to provide a systematic summary of the
   new functionality required by the routers in the CL-region.

   A network operator upgrades normal IP routers by:

   o  Adding functionality related to admission control and flow
      pre-emption to all its ingress and egress gateways

   o  Adding Pre-Congestion Notification for Admission Marking and
      Pre-emption Marking to all the routers in the CL-region.

   We consider the detailed actions required for each of the types of
   router in turn.

4.1.  Ingress gateways

   Ingress gateways perform the following tasks:

   o  Classify incoming packets - decide whether they are CL or
      non-CL packets.  This is done using an IntServ filter spec
      (source and destination addresses and port numbers), whose
      details have been gathered from the RSVP messaging.

   o  Police - check that the microflow conforms with what has been
      agreed (i.e. it keeps to its agreed data rate).  If necessary,
      packets which do not correspond to any reservation, packets
      which are in excess of the rate agreed for their reservation,
      and packets for a reservation that has earlier been pre-empted
      may be policed.  Policing may be achieved via dropping or via
      re-marking of the packet's DSCP to a value different from the
      CL behaviour aggregate.

   o  ECN colouring - for CL microflows, set the ECN field of packets
      appropriately (see [PCN] for some discussion of encoding).

   o  Perform 'interior router' functions (see the next sub-section).
   o  Admission Control - on new session establishment, consider the
      Congestion-Level-Estimate received from the corresponding
      egress gateway and, most likely based on a simple configured
      CLE-threshold, decide if the new call is to be admitted or
      rejected (taking into account local policy information as well
      as, optionally, information provided by a policy decision
      point).

   o  Probe - if requested by the egress gateway to do so, the
      ingress gateway generates probe traffic so that the egress
      gateway can compute the Congestion-Level-Estimate from this
      ingress gateway.  Probe packets may be simple data addressed to
      the egress gateway and require no protocol standardisation,
      although there will be best practice for their number, size and
      rate.

   o  Measure - when it receives a Pre-emption Alert message from an
      egress gateway, the ingress gateway determines the rate at
      which it is sending packets to that egress gateway.

   o  Pre-empt - calculate how much CL traffic needs to be
      pre-empted; decide which microflows should be dropped, perhaps
      in consultation with a Policy Decision Point; and do the
      necessary signalling to drop them.

4.2.  Interior routers

   Interior routers perform the following tasks:

   o  Classify packets - examine the DSCP and ECN field to see
      whether it's a CL packet.

   o  Non-CL packets are handled as usual, with respect to dropping
      them or setting their CE codepoint.

   o  Pre-Congestion Notification - CL packets are Admission Marked
      and Pre-emption Marked according to the algorithm detailed in
      [PCN] and outlined in Section 3.

4.3.  Egress gateways

   Egress gateways perform the following tasks:

   o  Classify packets - determine which ingress gateway a CL packet
      has come from.
This is the previous RSVP hop, hence the necessary 1157 details are obtained just as with IntServ from the state 1158 associated with the packet five-tuple, which has been built using 1159 information from the RSVP messages. 1161 o Meter - for CL packets, calculate the fraction of the total number 1162 of bits which are in Admission marked packets or in Pre-emption 1163 Marked packets. The calculation is done as an exponentially 1164 weighted moving average (see Appendix C). A separate calculation 1165 is made for CL packets from each ingress gateway. The meter works 1166 on an aggregate basis and not per microflow. 1168 o Signal the Congestion-Level-Estimate - this is piggy-backed on the 1169 reservation reply. An egress gateway's interface is configured to 1170 know it is an egress gateway, so it always appends this to the 1171 RESV message. If the Congestion-Level-Estimate is unknown or is 1172 too stale, then the egress gateway can request the ingress gateway 1173 to send probes. 1175 o Packet colouring - for CL packets, set the DSCP and the ECN field 1176 to whatever has been agreed as appropriate for the next domain. By 1177 default the ECN field is set to the Not-ECT codepoint. See also 1178 the discussion in the Tunnelling section later. 1180 o Measure the rate - measure the rate of CL traffic from a 1181 particular ingress gateway, excluding packets that are Pre-emption 1182 Marked (i.e. the Sustainable-Aggregate-Rate for the CL-region- 1183 aggregate), when alerted (either explicitly or implicitly) that 1184 pre-emption may be required. The measured rate is reported back to 1185 the appropriate ingress gateway [RSVP-PCN]. 1187 4.4. Failures 1189 If an interior router fails, then the regular IP routing protocol 1190 will re-route round it. If the new route can carry all the admitted 1191 traffic, flows will gracefully continue. 
If instead this causes early warning of pre-congestion on the new route, then admission control based on pre-congestion notification will ensure that new flows are not admitted until enough existing flows have departed. Finally, re-routing may result in heavy congestion, at which point the flow pre-emption mechanism will kick in.

If a gateway fails then we would like regular RSVP procedures [RFC2205] to take care of things. With the local repair mechanism of [RFC2205], when a route changes the next RSVP PATH refresh message will establish path state along the new route, and thus attempt to re-establish reservations through the new ingress gateway. Essentially the same procedure is used as described earlier in this document, with the re-routed session treated as a new session request.

In more detail, consider what happens if an ingress gateway of the CL-region fails. RSVP routers upstream of it then re-route at the IP layer to a new ingress gateway. The next time the upstream RSVP router sends a PATH refresh message, it reaches the new ingress gateway, which therefore installs the associated RSVP state. The next RSVP RESV refresh will pick up the Congestion-Level-Estimate from the egress gateway, and the ingress compares this with its threshold to decide whether to admit the new session. This could result in some of the flows being rejected, but those accepted will receive the full QoS.

An issue with this is that we have to wait until PATH and RESV refresh messages are sent, which may not be very often; the default refresh interval is 30 seconds. [RFC2205] discusses how to speed up the local repair mechanism. First, the RSVP module is notified by the local routing protocol module of a route change to particular destinations, which triggers it to rapidly send out PATH refresh messages.
Further, when a PATH refresh arrives with a previous hop address different from the one stored, RESV refreshes are immediately sent to that previous hop. Where RSVP is operating hop-by-hop, i.e. on every router, triggering the PATH refresh is easy, as the router can simply monitor its local link. Thus this fast local repair mechanism can be used to deal with failures upstream of the ingress gateway, with failures of the ingress gateway, and with failures downstream of the egress gateway.

But where RSVP is not operating hop-by-hop (as is the case within the CL-region), it is not so easy to trigger the PATH refresh.

Unfortunately, this problem applies if an egress gateway fails, since it is very likely that an egress gateway is several IP hops from the ingress gateway. (If the ingress is several IP hops from its previous RSVP node, then there is the same issue.) The options appear to be:

o the ingress gateway has a link state database for the CL-region, so it can detect that an egress gateway has failed or become unreachable

o there is an inter-gateway protocol, so the ingress can continuously check that the egress gateways are still alive

o (default) do nothing and wait for the regular PATH/RESV refreshes (and, if needed, the pre-emption mechanism) to sort things out.

5. Limitations and some potential solutions

In this section we describe various limitations of the deployment model, and some suggestions about potential ways of alleviating them.
1254 The limitations fall into three broad categories: 1256 o ECMP (Section 5.1): the assumption about routing (Section 2.2) is 1257 that all packets between a pair of ingress and egress gateways 1258 follow the same path; ECMP breaks this assumption 1260 o The lack of global coordination (Sections 5.2, 5.3 and 5.4): a 1261 decision about admission control or flow pre-emption is made for 1262 one aggregate independently of other aggregates 1264 o Timing and accuracy of measurements (Sections 5.5 and 5.6): the 1265 assumption (Section 2.2) that additional load, offered within the 1266 reaction time of the measurement-based admission control 1267 mechanism, doesn't move the system directly from no congestion to 1268 overload (dropping packets). A 'flash crowd' may break this 1269 assumption (Section 5.5). There are a variety of more general 1270 issues associated with marking measurements, which may mean it's a 1271 good idea to do pre-emption 'slower' (Section 5.6). 1273 Each section describes a limitation and some possible solutions to 1274 alleviate the limitation. These are intended as options for an 1275 operator to consider, based on their particular requirements. 1277 We would welcome feedback, for example suggestions as to which 1278 potential solutions are worth working out in more detail, and ideas 1279 on new potential solutions. 1281 Finally Section 5.7 considers some other potential extensions. 1283 5.1. ECMP 1285 If the CL-region uses Equal Cost Multipath Routing (ECMP), then 1286 traffic between a particular pair of ingress and egress gateways may 1287 follow several different paths. 1289 Why? An ECMP-enabled router runs an algorithm to choose between 1290 potential outgoing links, based on a hash of fields such as the 1291 packet's source and destination addresses - exactly what depends on 1292 the proprietary algorithm. 
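As a purely illustrative sketch (the hash function, field choice and link names are assumptions, not any vendor's actual scheme), hash-based next-hop selection might look like:

```python
import hashlib

# Hypothetical outgoing links of an ECMP-enabled router.
LINKS = ["link-A", "link-B", "link-C"]

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, links=LINKS):
    """Choose an outgoing link from a hash of the flow's addressing fields.

    The choice is deterministic per flow, so all packets of one flow take
    one path, but different flows between the same pair of gateways may
    hash to different links.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]

# Every packet of a given flow maps to the same link:
path = ecmp_next_hop("10.0.0.1", "10.0.1.9", 5004, 5004)
assert path == ecmp_next_hop("10.0.0.1", "10.0.1.9", 5004, 5004)
```

Two flows that differ only in, say, a port number can hash to different links, which is exactly how a single gateway-to-gateway aggregate ends up split over several paths.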
Packets are addressed to the CL flow's 1293 end-point, and therefore different flows may follow different paths 1294 through the CL-region. (All packets of an individual flow follow the 1295 same ECMP path.) 1297 The problem is that if one of the paths is congested such that 1298 packets are being admission marked, then the Congestion-Level- 1299 Estimate measured by the egress gateway will be diluted by unmarked 1300 packets from other non-congested paths. Similarly, the measurement of 1301 the Sustainable-Aggregate-Rate will also be diluted. 1303 Possible solution approaches are: 1305 o tunnel: traffic is tunnelled across the CL-region. Then the 1306 destination address (and so on) seen by the ECMP algorithm is that 1307 of the egress gateway, so all flows follow the same path. 1308 Effectively ECMP is turned off. As a compromise, to try to retain 1309 some of the benefits of ECMP, there could be several tunnels, each 1310 following a different ECMP path, with flows randomly assigned to 1311 different tunnels. 1313 o assume worst case: the operator sets the configured-admission-rate 1314 (and configured-pre-emption-rate) to below the optimum level to 1315 compensate for the fact that the effect on the Congestion-Level- 1316 Estimate (and Sustainable-Aggregate-Rate) of the congestion 1317 experienced over one of the paths may be diluted by traffic 1318 received over non-congested paths. Hence lower thresholds need to 1319 be used to ensure early admission control rejection and pre- 1320 emption over the congested path. This approach will waste capacity 1321 (e.g. flows following a non-congested ECMP path are not admitted 1322 or are pre-empted), and there is still the danger that for some 1323 traffic mixes the operator hasn't been cautious enough. 1325 o for admission control, probe to obtain a flow-specific congestion- 1326 level-estimate. Earlier this document suggests continuously 1327 monitoring the congestion-level-estimate. 
Instead, probe packets 1328 could be sent for each prospective new flow. The probe packets 1329 have the same IP address etc as the data packets would have, and 1330 hence follow the same ECMP path. However, probing is an extra 1331 overhead, depending on how many probe packets need to be sent to 1332 get a sufficiently accurate congestion-level-estimate. 1334 o for flow pre-emption, only select flows for pre-emption from 1335 amongst those that have actually received a Pre-emption Marked 1336 packet. Because these flows must have followed an ECMP path that 1337 goes through an overloaded router. However, it needs some extra 1338 work by the egress gateway, to record this information and report 1339 it to the ingress gateway. 1341 o for flow pre-emption, a variant of this idea involves introducing 1342 a new marking behaviour, 'Router Marking'. A router that is pre- 1343 emption marking packets on an outgoing link, also 'Router Marks' 1344 all other packets. When selecting flows for pre-emption, the 1345 selection is made from amongst those that have actually received a 1346 Router Marked or Pre-emption Marked packet. Hence compared with 1347 the previous bullet, it may extend the range of flows from which 1348 the pre-emption selection is made (i.e. it includes those which, 1349 by chance, haven't had any pre-emption marked packets). However, 1350 it also requires that the 'Router Marking' state is somehow 1351 encoded into a packet, i.e. it makes harder the encoding challenge 1352 discussed in Appendix C of [PCN]. The extra work required by the 1353 egress gateway would also be somewhat higher than for the previous 1354 bullet. 1356 5.2. Beat down effect 1358 This limitation concerns the pre-emption mechanism in the case where 1359 more than one router is pre-emption marking packets. 
The result (explained in the next paragraph) is that the measurement of sustainable-aggregate-rate is lower than its true value, so more traffic is pre-empted than necessary.

Imagine the scenario:

              +-------+     +-------+     +-------+
   IAR-b=3 >@@| CPR=2 |@@@@@| CPR>2 |@@@@@| CPR=1 |@@> SAR-b=1
   IAR-a=1 >##|  R1   |#####|  R2   |     |  R3   |
              +-------+     +-------+     +-------+
                                #
                                #
                                #
                                v SAR-a=0.5

Figure 4: Scenario to illustrate 'beat down effect' limitation

Aggregate-a (ingress-aggregate-rate, IAR, 1 unit) takes a 'short' route through two routers, one of which (R1) is above its configured-pre-emption-rate (CPR, 2 units). Aggregate-b takes a 'long' route, going through a second congested router (R3, with a CPR of 1 unit).

R1's input traffic is 4 units, twice its configured-pre-emption-rate, so 50% of packets are pre-emption marked. Hence the measured sustainable-aggregate-rate (SAR) for aggregate-a is 0.5, and half of its traffic will be pre-empted.

R3's input of non-pre-emption-marked traffic is 1.5 units, and therefore it has to do further marking.

But this means that aggregate-a has taken a bigger hit than it needed to; the router R1 could have let through all of aggregate-a's traffic unmarked if it had known that router R3 was going to "beat down" aggregate-b's traffic further.

Generalising, the result is that in a scenario where more than one router is pre-emption marking packets, only the final router is sure to be fully loaded after flow pre-emption. The fundamental reason is that a router makes a local decision about which packets to pre-emption mark, i.e. independently of how other routers are pre-emption marking. A very similar effect has been noted in XCP [Low].
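The arithmetic of the Figure 4 scenario can be checked with a short sketch (the `marked_fraction` helper is an idealisation we introduce here: it assumes a router marks traffic uniformly, just enough to bring unmarked throughput down to its CPR):

```python
def marked_fraction(input_rate, cpr):
    """Idealised marking: the fraction of arriving unmarked traffic a
    router must pre-emption mark so unmarked output equals its CPR."""
    if input_rate <= cpr:
        return 0.0
    return (input_rate - cpr) / input_rate

# R1: aggregate-a (1 unit) and aggregate-b (3 units) arrive; CPR = 2.
f1 = marked_fraction(1.0 + 3.0, 2.0)   # 0.5: half of all packets marked
sar_a = 1.0 * (1.0 - f1)               # aggregate-a's measured SAR: 0.5

# R3: only aggregate-b's still-unmarked traffic arrives; CPR = 1.
b_unmarked = 3.0 * (1.0 - f1)          # 1.5 units reach R3 unmarked
f3 = marked_fraction(b_unmarked, 1.0)  # a further third gets marked
sar_b = b_unmarked * (1.0 - f3)        # aggregate-b's measured SAR: 1 unit
```

After pre-emption, aggregate-a carries 0.5 and aggregate-b carries 1 unit, so R1's input is only 1.5 units, well under its CPR of 2: half of aggregate-a was pre-empted unnecessarily, which is the beat down effect.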
1402 Potential solutions: 1404 o a full solution would involve routers learning about other routers 1405 that are pre-emption marking, and being able to differentially 1406 mark flows (e.g. in the example above, aggregate-a's packets 1407 wouldn't be marked by R1). This seems hard and complex. 1409 o do nothing about this limitation. It causes over-pre-emption, 1410 which is safe. At the moment this is our suggested option. 1412 o do pre-emption 'slowly'. The description earlier in this document 1413 assumes that after the measurements of ingress-aggregate-rate and 1414 sustainable-aggregate-rate, then sufficient flows are pre-empted 1415 in 'one shot' to eliminate the excess traffic. An alternative is 1416 to spread pre-emption over several rounds: initially, only pre- 1417 empt enough to eliminate some of the excess traffic, then re- 1418 measure the sustainable-aggregate-rate, and then pre-empt some 1419 more, etc. In the scenario above, the re-measurement would be 1420 lower than expected, due to the beat down effect, and hence in the 1421 second round of pre-emption less of aggregate-a's traffic would be 1422 pre-empted (perhaps none). Overall, therefore the impact of the 1423 'beat down' effect would be lessened, i.e. there would be a 1424 smaller degree of over pre-emption. The downside is that the 1425 overall pre-emption is slower, and therefore routers will be 1426 congested longer. 1428 5.3. Bi-directional sessions 1430 The document earlier describes how to decide whether or not to admit 1431 (or pre-empt) a particular flow. However, from a user/application 1432 perspective, the session is the relevant unit of granularity. A 1433 session can consist of several flows which may not all be part of the 1434 same aggregate. The most obvious example is a bi-directional session, 1435 where the two flows should ideally be admitted or pre-empted as a 1436 pair - for instance a voice call only makes sense if A can send to B 1437 as well as B to A! 
But the admission and pre-emption mechanisms described earlier in this document operate on a per-aggregate basis, independently of what's happening with other aggregates. For admission control the problem isn't serious: e.g. the SIP server for the voice call can easily detect that the A-to-B flow has been admitted but the B-to-A flow blocked, and inform the user, perhaps via a busy tone. For flow pre-emption, the problem is similar but more serious. If both the aggregate-1-to-2 (i.e. from gateway 1 to gateway 2) and the aggregate-2-to-1 have to pre-empt flows, then it would be good if either all of the flows of a particular session were pre-empted or none of them. Therefore if the two aggregates pre-empt flows independently of each other, more sessions will end up being torn down than is really necessary. For instance, pre-empting one direction of a voice call will result in the SIP server tearing down the other direction anyway.

Potential solutions:

o if it's known that all sessions are bi-directional, simply pre-empt roughly half as many flows as suggested by the measurement of {ingress-aggregate-rate - sustainable-aggregate-rate}. But this makes a big assumption about the nature of sessions, and also assumes that the aggregate-1-to-2 and aggregate-2-to-1 are equally overloaded.

o ignore the limitation. The penalty will be quite small if most sessions consist of one flow, or of flows that are part of the same aggregate.

o introduce a gateway controller. It would receive reports for all aggregates where the ingress-aggregate-rate exceeds the sustainable-aggregate-rate, and would then make a global decision about which flows to pre-empt. However, it requires considerable complexity; for example, the controller needs to understand which flows map to which sessions.
This may be an option in some scenarios, for example where gateways aren't handling too many flows (but note that this breaks the aggregation assumption of Section 2.2). A variant of this idea would be to introduce a gateway controller per pair of gateways, in order to handle bi-directional sessions but not try to deal with more complex sessions that include flows from an arbitrary number of aggregates.

o do pre-emption 'slowly'. As with the 'do pre-emption slowly' option for the beat down effect (Section 5.2), this would reduce the impact of this limitation. The downside is that the overall pre-emption is slower, and therefore router(s) will be congested longer.

o each ingress gateway 'loosely coordinates' with other gateways its decision about which specific flows to pre-empt. Each gateway numbers flows in the order they arrive (note that this number has no meaning outside the gateway), and when pre-empting flows, the most recent flow (or the most recent low priority flow) is selected for pre-emption; the gateway then works backwards, selecting as many flows as needed. Gateways will therefore tend to pre-empt flows that are part of the same session (as they were admitted at the same time). Of course this isn't guaranteed, for several reasons; for instance, gateway A's most recent bi-directional sessions may be with gateway C, whereas gateway B's are with gateway A (so gateway A will pre-empt A-to-C flows and gateway B will pre-empt B-to-A flows). Rather than pre-empting the most recent (low priority) flow, an alternative algorithm (for further study) may be to select flows based on a hash of particular fields in the packet, such that both gateways produce the same hash for flows of the same bi-directional session. We believe that this approach should be investigated further.

5.4.
Global fairness

The limitation here is that 'high priority' traffic may be pre-empted (or not admitted) when a global decision would instead pre-empt (or not admit) 'lower priority' traffic on a different aggregate.

Imagine the following scenario (extreme, to illustrate the point clearly). Aggregate_a is all Assured Services (MLPP) traffic, whilst aggregate_b is all ordinary traffic (i.e. comparatively low priority). Together the two aggregates cause a router to be at twice its configured-pre-emption-rate. Ideally we'd like all of aggregate_b to be pre-empted, as then all of aggregate_a could be carried. However, the approach described earlier in this document leads to half of each aggregate being pre-empted.

                     IAR_b=1
                        v
                        v
                    +-------+
   IAR_a=1 ---->----| CPR=1 |-----> SAR_a=0.5
                    |       |
                    +-------+
                        v
                        v
                     SAR_b=0.5

Figure 5: Scenario to illustrate 'global fairness' limitation

Similarly for admission control: Section 4.1 describes how, if the Congestion-Level-Estimate is greater than the CLE-threshold, all new sessions are refused. But it is unsatisfactory to block emergency calls, for instance.

Potential solutions:

o in the admission control case, it is recommended that an 'emergency / Assured Services' call is admitted immediately even if the CLE-threshold is exceeded. Usually the network can actually handle the additional microflow, because there is a safety margin between the configured-admission-rate and the configured-pre-emption-rate. Normal call termination behaviour will soon bring the traffic level down below the configured-admission-rate.
1546 However, in exceptional circumstances the 'emergency / higher 1547 precedence' call may cause the traffic level to exceed the 1548 configured-pre-emption-rate; then the usual pre-emption mechanism 1549 will pre-empt enough (non 'emergency / higher precedence') 1550 microflows to bring the total traffic back under the configured- 1551 pre-emption-rate. 1553 o all egress gateways report to a global coordinator that makes 1554 decisions about what flows to pre-empt. However this solution adds 1555 complexity and probably isn't scalable, but it may be an option in 1556 some scenarios, for example where gateways aren't handling too 1557 many flows (but note that this breaks the aggregation assumption 1558 of Section 2.2). 1560 o introduce a heuristic rule: before pre-empting a 'high priority' 1561 flow the egress gateway should wait to see if sufficient (lower 1562 priority) traffic is pre-empted on other aggregates. This is a 1563 reasonable option. 1565 o enhance the functionality of all the interior routers, so they can 1566 detect the priority of a packet, and then differentially mark 1567 them. As well as adding complexity, in general this would be an 1568 unacceptable security risk for MLPP traffic, since only controlled 1569 nodes (like gateways) should know which packets are high priority, 1570 as this information can be abused by an attacker. 1572 o do nothing, i.e. accept the limitation. Whilst it's unlikely that 1573 high priority calls will be quite so unbalanced as in the scenario 1574 above, just accepting this limitation may be risky. The sorts of 1575 situations that cause routers to start pre-emption marking are 1576 also likely to cause a surge of emergency / MLPP calls. 1578 5.5. 
Flash crowds

This limitation concerns admission control and arises because there is a time lag between the admission control decision (which depends on the Congestion-Level-Estimate during RSVP signalling at call set-up) and when the data is actually sent (after the called party has answered). In PSTN terms this is the time the phone rings. Normally the time lag doesn't matter much because (1) in the CL-region there are many flows, and they terminate and are answered at roughly the same rate, and (2) the network can still operate safely when the traffic level is some margin above the configured-admission-rate.

A 'flash crowd' occurs when something causes many calls to be initiated in a short period of time - for instance a 'tele-vote'. So there is a danger that a 'flash' of calls is accepted, but when the calls are answered and data flows, the traffic overloads the network. Therefore potentially the 'additional load' assumption of Section 2.2 doesn't hold.

Potential solutions:

o The simplest option is to do nothing; an operator relies on the pre-emption mechanism if there is a problem. This doesn't seem a good choice, as 'flash crowds' are reasonably common on the PSTN, unless the operator can ensure that nearly all 'flash crowd' events are blocked in the access network and so do not impact on the CL-region.

o A second option is to send 'dummy data' as soon as the call is admitted, thus effectively reserving the bandwidth whilst waiting for the called party to answer. Reserving bandwidth in advance means that the network cannot admit as many calls. For example, suppose sessions last 100 seconds and ringing lasts 10 seconds; then the cost is a 10% loss of capacity. It may be possible to offset this somewhat by increasing the configured-admission-rate in the routers, but it would need further investigation.
A concern with this 'dummy data' option is that it may allow an attacker to initiate many calls that are never answered (by a cooperating attacker), so that eventually the network would only be carrying 'dummy data'. The attack exploits the fact that charging only starts when the call is answered, not when it is dialled. It may be possible to alleviate the attack at the session layer - for example, by having the ingress gateway check, when it gets an RSVP PATH message, that the source has been well-behaved recently, and by limiting the maximum time that ringing can last. We believe that if this attack can be dealt with then this is a good option.

o A third option is that the egress gateway limits the rate at which it sends out the Congestion-Level-Estimate, or limits the rate at which calls are accepted by replying with a Congestion-Level-Estimate of 100% (the equivalent of 'call gapping' in the PSTN). There is a trade-off, which would need to be investigated further, between the degree of protection and possible adverse side-effects like slowing down call set-up.

o A final option is to re-perform admission control before the call is answered. The ingress gateway monitors Congestion-Level-Estimate updates received from each egress. If it notices that a Congestion-Level-Estimate has risen above the CLE-threshold, then it terminates all unanswered calls through that egress (e.g. by instructing the session protocol to stop the 'ringing tone'). For extra safety the Congestion-Level-Estimate could be re-checked when the call is answered. A potential drawback for an operator that wants to emulate the PSTN is that the PSTN never drops a 'ringing' call.

5.6.
Pre-empting too fast

As a general idea it seems good to pre-empt excess flows rapidly, so that the full QoS is restored to the remaining CL users as soon as possible, and partial service is restored to lower priority traffic classes on shared links. Therefore the pre-emption mechanism described earlier in this document works in 'one shot', i.e. one measurement is made of the sustainable-aggregate-rate and the ingress-aggregate-rate, and the excess is pre-empted immediately. However, there are some reasons why an operator may want to pre-empt 'more slowly':

o To allow time to modify the ingress gateway's policer. The ingress wants to be able to drop any packets that arrive from a pre-empted flow, since otherwise the source may cheat and ignore the instruction to drop its flow; but there will be a limit on how many new filters an ingress gateway can install in a certain time period.

o The operator may decide to slow down pre-emption in order to ameliorate the 'beat down' and/or 'bi-directional sessions' limitations (see above).

o To help combat inaccuracies in measurements of the sustainable-aggregate-rate and ingress-aggregate-rate. For a CL-region where it's assumed there are many flows in an aggregate, these measurements can be obtained in a short period of time; where there are fewer flows, it will take longer.

o To help combat over-pre-emption, because during the time it takes to pre-empt flows, others may be ending anyway (either the call has naturally ended, or the user hangs up due to poor QoS). Slowing pre-emption may seem counter-intuitive here, as it makes it more likely that calls will terminate anyway - however, it also gives time to adjust the amount pre-empted to take account of this.
o Earlier in this document we said that an egress starts measuring the sustainable-aggregate-rate as soon as it sees a single pre-emption marked packet. However, when a link or router fails the network's underlying recovery mechanism will kick in (e.g. switching to a back-up path), which may result in the network again being able to support all the traffic.

Potential solutions:

o To combat the final issue, the egress could measure the sustainable-aggregate-rate over a longer time period than the network recovery time (say 100ms vs. 50ms). If it detects no pre-emption marked packets towards the end of its measurement period (say in the last 30 ms) then it doesn't send a pre-emption alert message to the ingress.

o We suggest that, optionally (at the choice of the operator), pre-emption is slowed by pre-empting traffic in several rounds rather than in one shot. One possible algorithm is to pre-empt most of the traffic in the first round and the rest in the second round; the amount pre-empted in the second round is influenced by both the first and second round measurements:

   * Round 1: pre-empt h * S_1, where 0.5 <= h <= 1
     and S_1 is the amount the normal mechanism calculates it should
     shed, i.e. {ingress-aggregate-rate - sustainable-aggregate-rate}

   * Round 2: pre-empt Predicted-S_2 - h * (Predicted-S_2 - Measured-S_2),
     where Predicted-S_2 = (1-h) * S_1

  Note that the second measurement should be made when sufficient time has elapsed for the first round of pre-emption to have happened. One idea to achieve this is for the egress gateway to continuously measure and report its sustainable-aggregate-rate, in (say) 100ms windows. Therefore the ingress gateway knows when the egress gateway made its measurement (assuming the round trip time is known).
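As a numerical sketch of this two-round rule (the traffic figures are invented for illustration; h, S_1, Predicted-S_2 and Measured-S_2 are as defined in the bullet above):

```python
def round1_shed(iar, sar, h):
    """Round 1: pre-empt h * S_1, where S_1 = IAR - SAR and 0.5 <= h <= 1."""
    assert 0.5 <= h <= 1.0
    return h * max(0.0, iar - sar)

def round2_shed(s1, measured_s2, h):
    """Round 2: pre-empt Predicted-S_2 - h * (Predicted-S_2 - Measured-S_2),
    i.e. a blend of the predicted and re-measured residual excess."""
    predicted_s2 = (1.0 - h) * s1
    return predicted_s2 - h * (predicted_s2 - measured_s2)

# Example: IAR = 10 units, SAR = 8 units, h = 0.7, so S_1 = 2.
h = 0.7
shed1 = round1_shed(10.0, 8.0, h)  # round 1 sheds 0.7 * 2 = 1.4 units
# Suppose re-measurement finds only 0.2 units of excess left (e.g. because
# other aggregates were beaten down too), rather than the predicted 0.6.
shed2 = round2_shed(2.0, 0.2, h)   # round 2 sheds less than predicted
```

The second round sheds 0.6 - 0.7 * (0.6 - 0.2) = 0.32 units instead of the 0.6 a pure one-shot prediction would have removed, which is how the multi-round variant reduces over-pre-emption.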
Therefore the ingress gateway knows when measurements 1716 should reflect that it has pre-empted flows. 1718 5.7. Other potential extensions 1720 In this section we discuss some other potential extensions not 1721 already covered above. 1723 5.7.1. Tunnelling 1725 It is possible to tunnel all CL packets across the CL-region. 1726 Although there is a cost of tunnelling (additional header on each 1727 packet, additional processing at tunnel ingress and egress), there 1728 are three reasons it may be interesting. 1730 ECMP: 1732 Tunnelling is one of the possible solutions given earlier in Section 1733 5.1 on Equal Cost Multipath Routing (ECMP). 1735 Ingress gateway determination: 1737 If packets are tunnelled from ingress gateway to egress gateway, the 1738 egress gateway can very easily determine in the data path which 1739 ingress gateway a packet comes from (by simply looking at the source 1740 address of the tunnel header). This can facilitate operations such as 1741 computing the Congestion-Level-Estimate on a per ingress gateway 1742 basis. 1744 End-to-end ECN: 1746 The ECN field is used for PCN marking (see [PCN] for details), and so 1747 it needs to be re-set by the egress gateway to whatever has been 1748 agreed as appropriate for the next domain. Therefore if a packet 1749 arrives at the ingress gateway with its ECN field already set (i.e. 1750 not '00'), it may leave the egress gateway with a different value. 1751 Hence the end-to-end meaning of the ECN field is lost. 1753 It is open to debate whether end-to-end congestion control is ever 1754 necessary within an end-to-end reservation. But if a genuine need is 1755 identified for end-to-end ECN semantics within a reservation, then 1756 one solution is to tunnel CL packets across the CL-region. When the 1757 egress gateway decapsulates them the original ECN field is recovered. 1759 5.7.2. 
Multi-domain and multi-operator usage 1761 This potential extension would eliminate the trust assumption 1762 (Section 2.2), so that the CL-region could consist of multiple 1763 domains run by different operators that did not trust each other. 1764 Then only the ingress and egress gateways of the CL-region would take 1765 part in the admission control procedure, i.e. at the ingress to the 1766 first domain and the egress from the final domain. The border routers 1767 between operators within the CL-region would only have to do bulk 1768 accounting - they wouldn't do per microflow metering and policing, 1769 and they wouldn't take part in signal processing or hold per flow 1770 state [Briscoe]. [Re-feedback] explains how a downstream domain can 1771 police that its upstream domain does not 'cheat' by admitting traffic 1772 when the downstream path is over-congested. [Re-PCN] proposes how to 1773 achieve this with the help of another recently proposed extension to 1774 ECN, involving re-echoing ECN feedback [Re-ECN]. 1776 5.7.3. Preferential dropping of pre-emption marked packets 1778 When the rate of real-time traffic in the specified class exceeds the 1779 maximum configured rate, then a router has to drop some packet(s) 1780 instead of forwarding them on the out-going link. Now when the egress 1781 gateway measures the Sustainable-Aggregate-Rate, neither dropped 1782 packets nor pre-emption marked packets contribute to it. Dropping 1783 non-pre-emption-marked packets therefore reduces the measured 1784 Sustainable-Aggregate-Rate below its true value. Thus a router should 1785 preferentially drop pre-emption marked packets. 1787 Note that it is important that the operator doesn't set the 1788 configured-pre-emption-rate equal to the rate at which packets start 1789 being dropped (for the specified real-time service class). Otherwise 1790 the egress gateway may never see a pre-emption marked packet and so 1791 won't be triggered into the Pre-emption Alert state. 
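A toy sketch of this drop preference (the queue model and packet representation are our own illustrative assumptions, not a specific router implementation):

```python
from collections import deque

class ClInterface:
    """Toy output queue that, when full, discards an already pre-emption
    marked packet in preference to an unmarked one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def enqueue(self, packet):
        """packet: dict like {"seq": 7, "preempt_marked": True}.
        Returns the packet dropped, or None if nothing was dropped."""
        if len(self.queue) < self.capacity:
            self.queue.append(packet)
            return None
        # Queue full: sacrifice a pre-emption marked packet if one is
        # queued, so unmarked packets still reach the egress gateway's
        # Sustainable-Aggregate-Rate meter.
        for i, queued in enumerate(self.queue):
            if queued["preempt_marked"]:
                del self.queue[i]
                self.queue.append(packet)
                return queued
        return packet  # no marked packet to sacrifice: drop the arrival
```

Since dropped packets and pre-emption marked packets are both excluded from the measured Sustainable-Aggregate-Rate, sacrificing an already-marked packet avoids depressing the measurement twice.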
1793 This optimisation is optional. When considering whether to use it, an 1794 operator will consider issues such as whether the over-pre-emption is 1795 serious, and whether the particular routers can easily do this sort 1796 of selective drop. 1798 5.7.4. Adaptive bandwidth for the Controlled Load service 1800 The admission control mechanism described in this document assumes 1801 that each router has a fixed bandwidth allocated to CL flows. A 1802 possible extension is that the bandwidth is flexible, depending on 1803 the level of non-CL traffic. If a large share of the current load on 1804 a path is CL, then more CL traffic can be admitted. And if the 1805 greater share of the load is non-CL, then the admission threshold can 1806 be proportionately lower. The approach re-arranges sharing between 1807 classes to aim for economic efficiency, whatever the traffic load 1808 matrix. It also deals with unforeseen changes to capacity during 1809 failures better than configuring fixed engineered rates. Adaptive 1810 bandwidth allocation can be achieved by changing the admission 1811 marking behaviour, so that the probability of admission marking a 1812 packet would now depend on the number of queued non-CL packets as 1813 well as the size of the virtual queue. The adaptive bandwidth 1814 approach would be supplemented by placing limits on the adaptation to 1815 prevent starvation of the CL by other traffic classes and of other 1816 classes by CL traffic. [Songhurst] has more details of the adaptive 1817 bandwidth approach. 1819 5.7.5. Controlled Load service with end-to-end Pre-Congestion 1820 Notification 1822 It may be possible to extend the framework to parts of the network 1823 where there are only a small number of CL microflows, i.e. where the 1824 aggregation assumption (Section 2.2) doesn't hold. In the extreme it 1825 may be possible to operate the framework end-to-end, i.e. between end 1826 hosts.
One potential method is to send probe packets to test whether 1827 the network can support a prospective new CL microflow. The probe 1828 packets would be sent at the same traffic rate as expected for the 1829 actual microflow, but in order not to disturb existing CL traffic a 1830 router would always schedule probe packets behind CL ones (compare 1831 [Breslau00]); this implies they have a new DSCP. Otherwise the 1832 routers would treat probe packets identically to CL packets. In order 1833 to perform admission control quickly, in parts of the network where 1834 there are only a few CL microflows, the Pre-Congestion marking 1835 behaviour for probe packets would switch from admission marking no 1836 packets to admission marking them all for only a minimal increase in 1837 load. 1839 5.7.6. MPLS-TE 1841 [ECN-MPLS] discusses how to extend the deployment model to MPLS, i.e. 1842 for admission control of microflows into a set of MPLS-TE aggregates 1843 (Multi-protocol label switching traffic engineering). It would 1844 require that the MPLS header could include the ECN field, which is 1845 not precluded by RFC3270. See [ECN-MPLS]. 1847 6. Relationship to other QoS mechanisms 1849 6.1. IntServ Controlled Load 1851 The CL mechanism delivers QoS similar to Integrated Services 1852 controlled load, but rather better. The reason the QoS is better is 1853 that the CL mechanism keeps the real queues empty, by driving 1854 admission control from a bulk virtual queue on each interface. The 1855 virtual queue [AVQ, vq] can detect a rise in load before the real 1856 queue builds. It is also more robust to route changes. 1858 6.2. Integrated services operation over DiffServ 1860 Our approach to end-to-end QoS is similar to that described in 1861 [RFC2998] for Integrated services operation over DiffServ networks. 
1862 As in [RFC2998], an IntServ class (CL in our case) is achieved end-to- 1863 end, with a CL-region viewed as a single reservation hop in the total 1864 end-to-end path. Interior routers of the CL-region do not process 1865 flow signalling, nor do they hold per flow state. Unlike [RFC2998] we 1866 do not require the end-to-end signalling mechanism to be RSVP, 1867 although it can be. 1869 Bearing in mind these differences, we can describe our architecture 1870 in terms of the options in [RFC2998]. The DiffServ network region 1871 is RSVP-aware, but awareness is confined to (what [RFC2998] calls) 1872 the "border routers" of the DiffServ region. We use explicit 1873 admission control into this region, with static provisioning within 1874 it. The ingress "border router" does per microflow policing and sets 1875 the DSCP and ECN fields to indicate the packets are CL ones (i.e. we 1876 use router marking rather than host marking). 1878 6.3. Differentiated Services 1880 The DiffServ architecture does not specify any way for devices 1881 outside the domain to dynamically reserve resources or receive 1882 indications of network resource availability. In practice, service 1883 providers rely on subscription-time Service Level Agreements (SLAs) 1884 that statically define the parameters of the traffic that will be 1885 accepted from a customer. The CL mechanism allows dynamic reservation 1886 of resources through the DiffServ domain and, with the potential 1887 extension mentioned in Section 5.7.2, it can span multiple domains 1888 without active policing mechanisms at the borders (unlike DiffServ). 1889 Therefore we do not use the traffic conditioning agreements (TCAs) of 1890 the (informational) DiffServ architecture [RFC2475]. 1892 [Johnson] compares admission control with a 'generously dimensioned' 1893 DiffServ network as ways to achieve QoS, and recommends the former. 1895 6.4.
ECN 1897 The marking behaviour described in this document complies with the 1898 ECN aspects of the IP wire protocol RFC3168, but provides its own 1899 edge-to-edge feedback instead of the TCP aspects of RFC3168. All 1900 routers within the CL-region are upgraded with the admission marking 1901 and pre-emption marking of Pre-Congestion Notification, so the 1902 requirements of [Floyd] are met because the CL-region is an enclosed 1903 environment. The operator prevents traffic arriving at a router that 1904 doesn't understand CL by administrative configuration of the ring of 1905 gateways around the CL-region. 1907 6.5. RTECN 1909 Real-time ECN (RTECN) [RTECN, RTECN-usage] has a similar aim to this 1910 document (to achieve a low delay, jitter and loss service suitable 1911 for RT traffic) and a similar approach (per microflow admission 1912 control combined with an "early warning" of potential congestion 1913 through setting the CE codepoint). But it explores a different 1914 architecture without the aggregation assumption: host-to-host rather 1915 than edge-to-edge. We plan to document such a host-to-host framework 1916 in a parallel draft to this one, and to describe if and how [PCN] can 1917 work in this framework. 1919 6.6. RMD 1921 Resource Management in DiffServ (RMD) [RMD] is similar to this work, 1922 in that it pushes complex classification, traffic conditioning and 1923 admission control functions to the edge of a DiffServ domain and 1924 simplifies the operation of the interior routers. One of the RMD 1925 modes ("Congestion notification function based on probing") uses 1926 measurement-based admission control in a similar way to this 1927 document. The main difference is that in RMD probing plays a 1928 significant role in the admission control process. 
Other differences 1929 are that the admission control decision is taken at the egress 1930 gateway (rather than the ingress); that 'admission marking' is encoded in 1931 a packet as a new DSCP (rather than in the ECN field); and that the 1932 NSIS protocols are used for signalling (rather than RSVP). 1934 RMD also includes the concept of Severe Congestion handling. The pre- 1935 emption mechanism described in the CL architecture has similar 1936 objectives but relies on different mechanisms. The main difference is 1937 that the interior routers measure the data rate that causes an 1938 overload and mark packets according to this rate. 1940 6.7. RSVP Aggregation over MPLS-TE 1942 Multi-protocol label switching traffic engineering (MPLS-TE) allows 1943 scalable reservation of resources in the core for an aggregate of 1944 many microflows. To achieve end-to-end reservations, admission 1945 control and policing of microflows into the aggregate can be performed 1946 using techniques such as RSVP Aggregation over MPLS TE Tunnels as per 1947 [AGGRE-TE]. However, in the case of inter-provider environments, 1948 these techniques require that admission control and policing be 1949 repeated at each trust boundary or that MPLS TE tunnels span multiple 1950 domains. 1952 7. Security Considerations 1954 To protect against denial of service attacks, the ingress gateway of 1955 the CL-region needs to police all CL packets and drop packets in 1956 excess of the reservation. This is similar to operations with 1957 existing IntServ behaviour. 1959 For pre-emption, it is considered acceptable from a security 1960 perspective that the ingress gateway can treat "emergency/military" 1961 CL flows preferentially compared with "ordinary" CL flows. However, 1962 in the rest of the CL-region they are not distinguished (nonetheless, 1963 our proposed technique does not preclude the use of different DSCPs 1964 at the packet level as well as different priorities at the flow 1965 level).
Keeping emergency traffic indistinguishable at the packet 1966 level minimises the opportunity for new security attacks. For 1967 example, if instead a mechanism used different DSCPs for 1968 "emergency/military" and "ordinary" packets, then an attacker could 1969 specifically target the former in the data plane (perhaps for DoS or 1970 for eavesdropping). 1972 Further security aspects to be considered later. 1974 8. Acknowledgements 1976 The admission control mechanism evolved from the work led by Martin 1977 Karsten on the Guaranteed Stream Provider developed in the M3I 1978 project [GSPa, GSP-TR], which in turn was based on the theoretical 1979 work of Gibbens and Kelly [DCAC]. Kennedy Cheng, Gabriele Corliano, 1980 Carla Di Cairano-Gilfedder, Kashaf Khan, Peter Hovell, Arnaud Jacquet 1981 and June Tay (BT) helped develop and evaluate this approach. 1983 Many thanks to those who have commented on this work at Transport 1984 Area Working Group meetings and on the mailing list, including: Ken 1985 Carlberg, Ruediger Geib, Lars Westberg, David Black, Robert Hancock, 1986 Cornelia Kappler. 1988 9. Comments solicited 1990 Comments and questions are encouraged and very welcome. They can be 1991 sent to the Transport Area Working Group's mailing list, 1992 tsvwg@ietf.org, and/or to the authors. 1994 10. Changes from earlier versions of the draft 1996 The main changes are: 1998 From -00 to -01 2000 The whole of the Pre-emption mechanism is added. 2002 There are several modifications to the admission control mechanism. 2004 From -01 to -02 2006 The pre-congestion notification algorithms for admission marking and 2007 pre-emption marking are now described in [PCN]. 2009 There are new sub-sections in Section 4 on Failures, Admission of 2010 'emergency / higher precedence' session, and Tunnelling; and a new 2011 sub-section in Section 5 on Mechanisms to deal with 'Flash crowds'. 2013 From -02 to -03 2015 Section 5 has been updated and expanded. 
It is now about the 2016 'limitations' of the PCN mechanism, as described in the earlier 2017 sections, plus discussion of 'possible solutions' to those 2018 limitations. 2020 The measurement of the Congestion-Level-Estimate now includes pre- 2021 emption marked packets as well as admission marked ones. Section 2022 3.1.2 explains. 2024 11. Appendices 2026 11.1. Appendix A: Explicit Congestion Notification 2028 This Appendix provides a brief summary of Explicit Congestion 2029 Notification (ECN). 2031 [RFC3168] specifies the incorporation of ECN to TCP and IP, including 2032 ECN's use of two bits in the IP header. It specifies a method for 2033 indicating incipient congestion to end-hosts (e.g. as in RED, Random 2034 Early Detection), where the notification is through ECN marking 2035 packets rather than dropping them. 2037 ECN uses two bits in the IP header of both IPv4 and IPv6 packets: 2039 0 1 2 3 4 5 6 7 2040 +-----+-----+-----+-----+-----+-----+-----+-----+ 2041 | DS FIELD, DSCP | ECN FIELD | 2042 +-----+-----+-----+-----+-----+-----+-----+-----+ 2044 DSCP: differentiated services codepoint 2045 ECN: Explicit Congestion Notification 2047 Figure A.1: The Differentiated Services and ECN Fields in IP. 2049 The two bits of the ECN field have four ECN codepoints, '00' to '11': 2050 +-----+-----+ 2051 | ECN FIELD | 2052 +-----+-----+ 2053 ECT CE 2054 0 0 Not-ECT 2055 0 1 ECT(1) 2056 1 0 ECT(0) 2057 1 1 CE 2059 Figure A.2: The ECN Field in IP. 2061 The not-ECT codepoint '00' indicates a packet that is not using ECN. 2063 The CE codepoint '11' is set by a router to indicate congestion to 2064 the end hosts. The term 'CE packet' denotes a packet that has the CE 2065 codepoint set. 2067 The ECN-Capable Transport (ECT) codepoints '10' and '01' (ECT(0) and 2068 ECT(1) respectively) are set by the data sender to indicate that the 2069 end-points of the transport protocol are ECN-capable. Routers treat 2070 the ECT(0) and ECT(1) codepoints as equivalent. 
Senders are free to 2071 use either the ECT(0) or the ECT(1) codepoint to indicate ECT, on a 2072 packet-by-packet basis. The use of two codepoints for ECT is 2073 motivated primarily by the desire to allow mechanisms for the data 2074 sender to verify that network elements are not erasing the CE 2075 codepoint, and that data receivers are properly reporting to the 2076 sender the receipt of packets with the CE codepoint set. 2078 ECN requires support from the transport protocol, in addition to the 2079 functionality given by the ECN field in the IP packet header. 2080 [RFC3168] addresses the addition of ECN Capability to TCP, specifying 2081 three new pieces of functionality: negotiation between the endpoints 2082 during connection setup to determine if they are both ECN-capable; an 2083 ECN-Echo (ECE) flag in the TCP header so that the data receiver can 2084 inform the data sender when a CE packet has been received; and a 2085 Congestion Window Reduced (CWR) flag in the TCP header so that the 2086 data sender can inform the data receiver that the congestion window 2087 has been reduced. 2089 The transport layer (e.g. TCP) must respond, in terms of congestion 2090 control, to a *single* CE packet as it would to a packet drop. 2092 The advantage of setting the CE codepoint as an indication of 2093 congestion, instead of relying on packet drops, is that it allows the 2094 receiver(s) to receive the packet, thus avoiding the potential for 2095 excessive delays due to retransmissions after packet losses. 2097 11.2. Appendix B: What is distributed measurement-based admission 2098 control? 2100 This Appendix briefly explains what distributed measurement-based 2101 admission control is [Breslau99]. 2103 Traditional admission control algorithms for 'hard' real-time 2104 services (those providing a firm delay bound, for example) guarantee 2105 QoS by using 'worst case analysis'.
Each time a flow is admitted, its 2106 traffic parameters are examined and the network re-calculates the 2107 remaining resources. When the network gets a new request it therefore 2108 knows for certain whether the prospective flow, with its particular 2109 parameters, should be admitted. However, parameter-based admission 2110 control algorithms result in under-utilisation when the traffic is 2111 bursty. Therefore 'soft' real-time services - like Controlled Load - 2112 can use a more relaxed admission control algorithm. 2114 This insight suggests measurement-based admission control (MBAC). The 2115 aim of MBAC is to provide a statistical service guarantee. The 2116 classic scenario for MBAC is where each router participates in hop- 2117 by-hop admission control, characterising existing traffic locally 2118 through measurements (instead of keeping accurate track of traffic 2119 as it is admitted), in order to determine the current value of some 2120 parameter, e.g. load. Note that for scalability the measurement is of 2121 the aggregate of the flows in the local system. The measured 2122 parameter(s) is then compared to the requirements of the prospective 2123 flow to see whether it should be admitted. 2125 MBAC may also be performed centrally for a network, in which case it 2126 uses centralised measurements by a bandwidth broker. 2128 We use distributed MBAC. "Distributed" means that the measurement is 2129 accumulated for the 'whole-path' using in-band signalling. In our 2130 case, this means that the measurement of existing traffic is for the 2131 same pair of ingress and egress gateways as the prospective 2132 microflow.
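With a whole-path measurement in hand, the per-flow decision reduces to a threshold test. A minimal sketch (the threshold value and the function name are illustrative assumptions, not values from this document):

```python
CLE_THRESHOLD = 0.05  # illustrative configured admission threshold


def admit_flow(congestion_level_estimate, threshold=CLE_THRESHOLD):
    """Gateway admission decision sketch: admit a prospective CL
    microflow only while the Congestion-Level-Estimate measured for
    the same ingress-egress gateway pair stays below the configured
    threshold."""
    return congestion_level_estimate < threshold
```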
2134 In fact our mechanism can be said to be distributed in three ways: 2135 all routers on the ingress-egress path affect the Congestion-Level- 2136 Estimate; the admission control decision is made just once on behalf 2137 of all the routers on the path across the CL-region; and the ingress 2138 and egress gateways cooperate to perform MBAC. 2140 11.3. Appendix C: Calculating the Exponentially weighted moving average 2141 (EWMA) 2143 At the egress gateway, for every CL packet arrival: 2145 [EWMA-total-bits]n+1 = (w * bits-in-packet) + ((1-w) * [EWMA- 2146 total-bits]n ) 2148 [EWMA-M-bits]n+1 = (B * w * bits-in-packet) + ((1-w) * [EWMA-M- 2149 bits]n ) 2151 Then, per new flow arrival: 2153 [Congestion-Level-Estimate]n+1 = [EWMA-M-bits]n+1 / [EWMA-total- 2154 bits]n+1 2156 where 2157 EWMA-total-bits is the total number of bits in CL packets, calculated 2158 as an exponentially weighted moving average (EWMA) 2160 EWMA-M-bits is the total number of bits in CL packets that are 2161 Admission Marked or Pre-emption Marked, again calculated as an EWMA. 2163 B is either 0 or 1: 2165 B = 0 if the CL packet is neither admission marked nor pre-emption marked 2167 B = 1 if the CL packet is admission marked or pre-emption marked 2169 w is the exponential weighting factor. 2171 Varying the value of the weight trades off between the smoothness and 2172 responsiveness of the Congestion-Level-Estimate. However, in general 2173 both can be achieved, given our original assumption of many CL 2174 microflows and remembering that the EWMA is calculated on the basis 2175 of aggregate traffic between the ingress and egress gateways. 2176 There will be a threshold inter-arrival time between packets of the 2177 same aggregate above which the egress will consider the 2178 Congestion-Level-Estimate too stale, and it will then trigger 2179 generation of probes by the ingress.
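The per-packet and per-flow algorithms above can be sketched as follows (a Python illustration; the class name and the example weight value are assumptions for the sketch, not part of this specification):

```python
W = 0.01  # exponential weighting factor (illustrative value)


class CongestionLevelEstimator:
    """Egress-gateway EWMAs of total CL bits and of marked CL bits,
    following the per-packet algorithms of Appendix C."""

    def __init__(self, w=W):
        self.w = w
        self.ewma_total_bits = 0.0
        self.ewma_m_bits = 0.0

    def on_packet(self, bits_in_packet, marked):
        """Run once per CL packet arrival.  `marked` is True if the
        packet is admission marked or pre-emption marked (B = 1)."""
        b = 1.0 if marked else 0.0
        self.ewma_total_bits = (self.w * bits_in_packet
                                + (1 - self.w) * self.ewma_total_bits)
        self.ewma_m_bits = (b * self.w * bits_in_packet
                            + (1 - self.w) * self.ewma_m_bits)

    def congestion_level_estimate(self):
        """Run once per new flow arrival: fraction of recent CL bits
        that were marked."""
        if self.ewma_total_bits == 0:
            return 0.0
        return self.ewma_m_bits / self.ewma_total_bits
```

Because both EWMAs use the same weight and the same packet stream, the quotient is a dimensionless fraction between 0 and 1, insensitive to the aggregate's absolute bit rate.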
2181 The first two per-packet algorithms can be simplified if their only 2182 use is where the result of one is divided by the result of the 2183 other in the third, per-flow algorithm. 2185 [EWMA-total-bits]'n+1 = bits-in-packet + (w' * [EWMA-total- 2186 bits]'n ) 2188 [EWMA-M-bits]'n+1 = (B * bits-in-packet) + (w' * [EWMA-M-bits]'n 2189 ) 2191 where w' = 1-w. The primed values are the originals scaled by the constant factor 1/w, so their quotient in the per-flow algorithm is unchanged. 2193 If w' is arranged to be a power of 2 (i.e. w' = 2^-k), these per-packet algorithms can 2194 be implemented solely with a shift and an add. 2196 12. References 2198 A later version will distinguish normative and informative 2199 references. 2201 [AGGRE-TE] Francois Le Faucheur, Michael Dibiasio, Bruce Davie, 2202 Michael Davenport, Chris Christou, Jerry Ash, Bur 2203 Goode, 'Aggregation of RSVP Reservations over MPLS 2204 TE/DS-TE Tunnels', draft-ietf-tsvwg-rsvp-dste-03 (work 2205 in progress), June 2006 2207 [ANSI.MLPP.Spec] American National Standards Institute, 2208 "Telecommunications- Integrated Services Digital 2209 Network (ISDN) - Multi-Level Precedence and Pre- 2210 emption (MLPP) Service Capability", ANSI T1.619-1992 2211 (R1999), 1992. 2213 [ANSI.MLPP.Supplement] American National Standards Institute, "MLPP 2214 Service Domain Cause Value Changes", ANSI 2215 T1.619a-1994 (R1999), 1990. 2217 [AVQ] S. Kunniyur and R. Srikant "Analysis and Design of an 2218 Adaptive Virtual Queue (AVQ) Algorithm for Active 2219 Queue Management", In: Proc. ACM SIGCOMM'01, Computer 2220 Communication Review 31 (4) (October, 2001). 2222 [Breslau99] L. Breslau, S. Jamin, S. Shenker "Measurement-based 2223 admission control: what is the research agenda?", In: 2224 Proc. Int'l Workshop on Quality of Service 1999. 2226 [Breslau00] L. Breslau, E. Knightly, S. Shenker, I. Stoica, H.
2227 Zhang "Endpoint Admission Control: Architectural 2228 Issues and Performance", In: ACM SIGCOMM 2000 2230 [Briscoe] Bob Briscoe and Steve Rudkin, "Commercial Models for 2231 IP Quality of Service Interconnect", BT Technology 2232 Journal, Vol 23 No 2, April 2005. 2234 [DCAC] Richard J. Gibbens and Frank P. Kelly "Distributed 2235 connection acceptance control for a connectionless 2236 network", In: Proc. International Teletraffic Congress 2237 (ITC16), Edinburgh, pp. 941-952 (1999). 2239 [ECN-MPLS] Bruce Davie, Bob Briscoe, June Tay, "Explicit 2240 Congestion Marking in MPLS", draft- 2241 davie-ecn-mpls-00.txt (work in progress), June 2006 2243 [EMERG-RQTS] Carlberg, K. and R. Atkinson, "General Requirements 2244 for Emergency Telecommunication Service (ETS)", RFC 2245 3689, February 2004. 2247 [EMERG-TEL] Carlberg, K. and R. Atkinson, "IP Telephony 2248 Requirements for Emergency Telecommunication Service 2249 (ETS)", RFC 3690, February 2004. 2251 [Floyd] S. Floyd, 'Specifying Alternate Semantics for the 2252 Explicit Congestion Notification (ECN) Field', draft- 2253 floyd-ecn-alternates-02.txt (work in progress), August 2254 2005 2256 [GSPa] Karsten (Ed.), Martin "GSP/ECN Technology & 2257 Experiments", Deliverable: 15.3 PtIII, M3I Eu Vth 2258 Framework Project IST-1999-11429, URL: 2259 http://www.m3i.org/ (February, 2002) (superseded by 2260 [GSP-TR]) 2262 [GSP-TR] Martin Karsten and Jens Schmitt, "Admission Control 2263 Based on Packet Marking and Feedback Signalling -- 2264 Mechanisms, Implementation and Experiments", TU- 2265 Darmstadt Technical Report TR-KOM-2002-03, URL: 2266 http://www.kom.e-technik.tu- 2267 darmstadt.de/publications/abstracts/KS02-5.html (May, 2268 2002) 2270 [ITU.MLPP.1990] International Telecommunications Union, "Multilevel 2271 Precedence and Pre-emption Service (MLPP)", ITU-T 2272 Recommendation I.255.3, 1990.
2274 [Johnson] DM Johnson, 'QoS control versus generous 2275 dimensioning', BT Technology Journal, Vol 23 No 2, 2276 April 2005 2278 [Low] S. Low, L. Andrew, B. Wydrowski, 'Understanding XCP: 2279 equilibrium and fairness', IEEE InfoCom 2005 2281 [PCN] B. Briscoe, P. Eardley, D. Songhurst, F. Le Faucheur, 2282 A. Charny, V. Liatsos, S. Dudley, J. Babiarz, K. Chan, 2283 G. Karagiannis, A. Bader, L. Westberg. 'Pre-Congestion 2284 Notification marking', draft-briscoe-tsvwg-cl-phb-02 2285 (work in progress), June 2006. 2287 [Re-ECN] Bob Briscoe, Arnaud Jacquet, Alessandro Salvatori, 2288 'Re-ECN: Adding Accountability for Causing Congestion 2289 to TCP/IP', draft-briscoe-tsvwg-re-ecn-tcp-01 (work in 2290 progress), March 2006. 2292 [Re-feedback] Bob Briscoe, Arnaud Jacquet, Carla Di Cairano- 2293 Gilfedder, Andrea Soppera, 'Re-feedback for Policing 2294 Congestion Response in an Inter-network', ACM SIGCOMM 2295 2005, August 2005. 2297 [Re-PCN] B. Briscoe, 'Emulating Border Flow Policing using Re- 2298 ECN on Bulk Data', draft-briscoe-tsvwg-re-ecn-border- 2299 cheat-00 (work in progress), February 2006. 2301 [Reid] ABD Reid, 'Economics and scalability of QoS 2302 solutions', BT Technology Journal, Vol 23 No 2, April 2303 2005 2305 [RFC2211] J. Wroclawski, Specification of the Controlled-Load 2306 Network Element Service, September 1997 2308 [RFC2309] Braden, B., et al., "Recommendations on Queue 2309 Management and Congestion Avoidance in the Internet", 2310 RFC 2309, April 1998. 2312 [RFC2474] Nichols, K., Blake, S., Baker, F. and D. Black, 2313 "Definition of the Differentiated Services Field (DS 2314 Field) in the IPv4 and IPv6 Headers", RFC 2474, 2315 December 1998 2317 [RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, 2318 Z. and W. Weiss, 'A framework for Differentiated 2319 Services', RFC 2475, December 1998. 2321 [RFC2597] Heinanen, J., Baker, F., Weiss, W. and J. Wrocklawski, 2322 "Assured Forwarding PHB Group", RFC 2597, June 1999. 
2324 [RFC2998] Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang, 2325 L., Speer, M., Braden, R., Davie, B., Wroclawski, J. 2326 and E. Felstaine, "A Framework for Integrated Services 2327 Operation Over DiffServ Networks", RFC 2998, November 2328 2000. 2330 [RFC3168] Ramakrishnan, K., Floyd, S. and D. Black "The Addition 2331 of Explicit Congestion Notification (ECN) to IP", RFC 2332 3168, September 2001. 2334 [RFC3246] B. Davie, A. Charny, J.C.R. Bennet, K. Benson, J.Y. Le 2335 Boudec, W. Courtney, S. Davari, V. Firoiu, D. 2336 Stiliadis, 'An Expedited Forwarding PHB (Per-Hop 2337 Behavior)', RFC 3246, March 2002. 2339 [RFC3270] Le Faucheur, F., Wu, L., Davie, B., Davari, S., 2340 Vaananen, P., Krishnan, R., Cheval, P., and J. 2341 Heinanen, "Multi- Protocol Label Switching (MPLS) 2342 Support of Differentiated Services", RFC 3270, May 2343 2002. 2345 [RFC4542] F. Baker & J. Polk, "Implementing an Emergency 2346 Telecommunications Service for Real Time Services in 2347 the Internet Protocol Suite", RFC 4542, May 2006. 2349 [RMD] Attila Bader, Lars Westberg, Georgios Karagiannis, 2350 Cornelia Kappler, Tom Phelan, 'RMD-QOSM - The Resource 2351 Management in DiffServ QoS model', draft-ietf-nsis- 2352 rmd-03 Work in Progress, June 2005. 2354 [RSVP-PCN] Francois Le Faucheur, Anna Charny, Bob Briscoe, Philip 2355 Eardley, Joe Barbiaz, Kwok-Ho Chan, 'RSVP Extensions 2356 for Admission Control over DiffServ using Pre- 2357 Congestion Notification (PCN)', draft-lefaucheur-rsvp- 2358 ecn-01 (work in progress), June 2006. 2360 [RSVP-PREEMPTION] Herzog, S., "Signaled Preemption Priority Policy 2361 Element", RFC 3181, October 2001. 2363 [RSVP-EMERGENCY] Le Faucheur et al., RSVP Extensions for Emergency 2364 Services, draft-lefaucheur-emergency-rsvp-02.txt 2366 [RTECN] Babiarz, J., Chan, K. and V. Firoiu, 'Congestion 2367 Notification Process for Real-Time Traffic', draft- 2368 babiarz-tsvwg-rtecn-04 Work in Progress, July 2005. 
2370 [RTECN-usage] Alexander, C., Ed., Babiarz, J. and J. Matthews, 2371 'Admission Control Use Case for Real-time ECN', draft- 2372 alexander-rtecn-admission-control-use-case-00, Work in 2373 Progress, February 2005. 2375 [Songhurst] David J. Songhurst, Philip Eardley, Bob Briscoe, Carla 2376 Di Cairano Gilfedder and June Tay, 'Guaranteed QoS 2377 Synthesis for Admission Control with Shared Capacity', 2378 BT Technical Report TR-CXR9-2006-001, Feb 2006, 2379 http://www.cs.ucl.ac.uk/staff/B.Briscoe/projects/ipe2e 2380 qos/gqs/papers/GQS_shared_tr.pdf 2382 [vq] Costas Courcoubetis and Richard Weber "Buffer Overflow 2383 Asymptotics for a Switch Handling Many Traffic 2384 Sources" In: Journal Applied Probability 33 pp. 886-- 2385 903 (1996). 2387 Authors' Addresses 2389 Bob Briscoe 2390 BT Research 2391 B54/77, Sirius House 2392 Adastral Park 2393 Martlesham Heath 2394 Ipswich, Suffolk 2395 IP5 3RE 2396 United Kingdom 2397 Email: bob.briscoe@bt.com 2399 Dave Songhurst 2400 BT Research 2401 B54/69, Sirius House 2402 Adastral Park 2403 Martlesham Heath 2404 Ipswich, Suffolk 2405 IP5 3RE 2406 United Kingdom 2407 Email: dsonghurst@jungle.bt.co.uk 2409 Philip Eardley 2410 BT Research 2411 B54/77, Sirius House 2412 Adastral Park 2413 Martlesham Heath 2414 Ipswich, Suffolk 2415 IP5 3RE 2416 United Kingdom 2417 Email: philip.eardley@bt.com 2419 Francois Le Faucheur 2420 Cisco Systems, Inc. 2421 Village d'Entreprise Green Side - Batiment T3 2422 400, Avenue de Roumanille 2423 06410 Biot Sophia-Antipolis 2424 France 2425 Email: flefauch@cisco.com 2427 Anna Charny 2428 Cisco Systems 2429 300 Apollo Drive 2430 Chelmsford, MA 01824 2431 USA 2432 Email: acharny@cisco.com 2433 Kwok Ho Chan 2434 Nortel Networks 2435 600 Technology Park Drive 2436 Billerica, MA 01821 2437 USA 2438 Email: khchan@nortel.com 2440 Jozef Z. 
Babiarz 2441 Nortel Networks 2442 3500 Carling Avenue 2443 Ottawa, Ont K2H 8E9 2444 Canada 2445 Email: babiarz@nortel.com 2447 Stephen Dudley 2448 Nortel Networks 2449 4001 E. Chapel Hill Nelson Highway 2450 P.O. Box 13010, ms 570-01-0V8 2451 Research Triangle Park, NC 27709 2452 USA 2453 Email: smdudley@nortel.com 2455 Georgios Karagiannis 2456 University of Twente 2457 P.O. BOX 217 2458 7500 AE Enschede, 2459 The Netherlands 2460 Email: g.karagiannis@ewi.utwente.nl 2462 Attila Bader 2463 attila.bader@ericsson.com 2465 Lars Westberg 2466 Ericsson AB 2467 SE-164 80 Stockholm 2468 Sweden 2469 Email: Lars.Westberg@ericsson.com 2471 Intellectual Property Statement 2473 The IETF takes no position regarding the validity or scope of any 2474 Intellectual Property Rights or other rights that might be claimed to 2475 pertain to the implementation or use of the technology described in 2476 this document or the extent to which any license under such rights 2477 might or might not be available; nor does it represent that it has 2478 made any independent effort to identify any such rights. Information 2479 on the procedures with respect to rights in RFC documents can be 2480 found in BCP 78 and BCP 79. 2482 Copies of IPR disclosures made to the IETF Secretariat and any 2483 assurances of licenses to be made available, or the result of an 2484 attempt made to obtain a general license or permission for the use of 2485 such proprietary rights by implementers or users of this 2486 specification can be obtained from the IETF on-line IPR repository at 2487 http://www.ietf.org/ipr. 2489 The IETF invites any interested party to bring to its attention any 2490 copyrights, patents or patent applications, or other proprietary 2491 rights that may cover technology that may be required to implement 2492 this standard.
Please address the information to the IETF at 2493 ietf-ipr@ietf.org 2495 Disclaimer of Validity 2497 This document and the information contained herein are provided on an 2498 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 2499 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 2500 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 2501 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 2502 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 2503 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 2505 Copyright Statement 2507 Copyright (C) The Internet Society (2006). 2509 This document is subject to the rights, licenses and restrictions 2510 contained in BCP 78, and except as set forth therein, the authors 2511 retain all their rights.