TSVWG                                                        B. Briscoe
Internet Draft                                               P. Eardley
draft-briscoe-tsvwg-cl-architecture-02.txt                  D. Songhurst
Expires: September 2006                                               BT

                                                          F. Le Faucheur
                                                               A. Charny
                                                      Cisco Systems, Inc

                                                              J. Babiarz
                                                                 K. Chan
                                                               S. Dudley
                                                                  Nortel

                                                           March 6, 2006

  A Framework for Admission Control over DiffServ using Pre-Congestion
                              Notification
              draft-briscoe-tsvwg-cl-architecture-02.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 6, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006). All Rights Reserved.

Abstract

   This document describes a framework to achieve an end-to-end Controlled Load (CL) service without the scalability problems of previous approaches. Flow admission control and, if necessary, flow pre-emption preserve the CL service to admitted flows. But interior routers within a large DiffServ-based region of the Internet do not require flow state or signalling. They only have to give early warning of their own congestion by bulk packet marking, using new pre-congestion notification marking. Gateways around the edges of the region convert measurements of this packet granularity marking into admission control and pre-emption functions at flow granularity.

Authors' Note (TO BE DELETED BY THE RFC EDITOR UPON PUBLICATION)

   This document is posted as an Internet-Draft with the intention of eventually becoming an INFORMATIONAL RFC, rather than a standards track document.

Table of Contents

   1. Introduction
      1.1. Summary
           1.1.1. Flow admission control
           1.1.2. Flow pre-emption
           1.1.3. Both admission control and pre-emption
      1.2. Terminology
      1.3. Existing terminology
      1.4. Standardisation requirements
      1.5. Structure of rest of the document
   2. Key aspects of the framework
      2.1. Key goals
      2.2. Key assumptions
      2.3. Key benefits
   3. Architecture
      3.1. Admission control
           3.1.1. Pre-Congestion Notification for Admission Marking
           3.1.2. Measurements to support admission control
           3.1.3. How edge-to-edge admission control supports end-to-end QoS signalling
           3.1.4. Use case
      3.2. Flow pre-emption
           3.2.1. Alerting an ingress gateway that flow pre-emption may be needed
           3.2.2. Determining the right amount of CL traffic to drop
           3.2.3. Use case for flow pre-emption
   4. Details
      4.1. Ingress gateways
      4.2. Interior nodes
      4.3. Egress gateways
      4.4. Failures
      4.5. Admission of 'emergency / higher precedence' session
      4.6. Tunnelling
   5. Potential future extensions
      5.1. Mechanisms to deal with 'Flash crowds'
      5.2. Multi-domain and multi-operator usage
      5.3. Adaptive bandwidth for the Controlled Load service
      5.4. Controlled Load service with end-to-end Pre-Congestion Notification
      5.5. MPLS-TE
   6. Relationship to other QoS mechanisms
      6.1. IntServ Controlled Load
      6.2. Integrated services operation over DiffServ
      6.3. Differentiated Services
      6.4. ECN
      6.5. RTECN
      6.6. RMD
      6.7. RSVP Aggregation over MPLS-TE
   7. Security Considerations
   8. Acknowledgements
   9. Comments solicited
   10. Changes from earlier versions of the draft
   11. Appendices
      11.1. Appendix A: Explicit Congestion Notification
      11.2. Appendix B: What is distributed measurement-based admission control?
      11.3. Appendix C: Calculating the Exponentially weighted moving average (EWMA)
   12. References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity
   Copyright Statement

1. Introduction

1.1. Summary

   This document describes a framework to achieve an end-to-end controlled load service by using - within a large region of the Internet - DiffServ and edge-to-edge distributed measurement-based admission control and flow pre-emption. Controlled load service is a quality of service (QoS) closely approximating the QoS that the same flow would receive from a lightly loaded network element [RFC2211]. Controlled Load (CL) is useful for inelastic flows such as those for real-time media.

   In line with the "IntServ over DiffServ" framework defined in [RFC2998], the CL service is supported end-to-end and RSVP signalling [RFC2205] is used end-to-end, over an edge-to-edge DiffServ region.

 ___    ___    _______________________________________    ___    ___
|   |  |   |  |                                       |  |   |  |   |
|   |  |   |  |Ingress         Interior        Egress |  |   |  |   |
|   |  |   |  |gateway          nodes         gateway |  |   |  |   |
|   |  |   |  |-------+  +-------+  +-------+  +------|  |   |  |   |
|   |  |   |  | PCN-  |  | PCN-  |  | PCN-  |  |      |  |   |  |   |
|   |..|   |..|marking|..|marking|..|marking|..| Meter|..|   |..|   |
|   |  |   |  |-------+  +-------+  +-------+  +------|  |   |  |   |
|   |  |   |  |   \                             /     |  |   |  |   |
|   |  |   |  |    \                           /      |  |   |  |   |
|   |  |   |  |     \Congestion-Level-Estimate/       |  |   |  |   |
|   |  |   |  |      \(for admission control)/        |  |   |  |   |
|   |  |   |  |      --<-----<----<----<-----<--      |  |   |  |   |
|   |  |   |  |      Sustainable-Aggregate-Rate       |  |   |  |   |
|   |  |   |  |        (for flow pre-emption)         |  |   |  |   |
|___|  |___|  |_______________________________________|  |___|  |___|

  Sx   Access                 CL-region                  Access   Rx
 End   Network                                           Network  End
 Host                                                             Host
              <------ edge-to-edge signalling ----->
             (for admission control & flow pre-emption)

<-------------------end-to-end QoS signalling protocol--------------->

   Figure 1: Overall QoS architecture (NB terminology explained later)

   In Section 1.1.1 we summarise how admission of new CL microflows is controlled so as to deliver the required QoS. In abnormal circumstances, for instance a disaster affecting multiple interior nodes, the QoS of existing CL microflows may degrade even if care was exercised when admitting those microflows before those circumstances arose. Therefore we also propose a mechanism (summarised in Section 1.1.2) to pre-empt some of the existing microflows. Then the remaining microflows retain their expected QoS, while improved QoS is quickly restored to lower priority traffic.

   As a fundamental building block to support these two mechanisms, we introduce "Pre-Congestion Notification". Pre-Congestion Notification (PCN) builds on the concepts of RFC 3168, "The addition of Explicit Congestion Notification to IP". The draft [PCN] proposes the respective algorithms that determine when a PCN-enabled router marks a packet with Admission Marking or Pre-emption Marking, depending on the traffic level.

   Pre-Congestion Notification can supplement any Per Hop Behaviour. In order to support CL traffic we would expect it to supplement the existing Expedited Forwarding (EF) PHB. Within the controlled edge-to-edge region, a particular packet receives the Pre-Congestion Notification behaviour if the packet's DSCP (differentiated services codepoint) is set to EF (or whatever is configured for CL traffic) and the ECN field also indicates an ECN Capable Transport.

   There are various possible ways to encode the markings into a packet, using the ECN field and perhaps other DSCPs, which are discussed in [PCN]. In this draft we use the abstract names Admission Marking and Pre-emption Marking.

   This framework assumes that the Pre-Congestion Notification behaviour is used in a controlled environment, i.e. within the controlled edge-to-edge region.

1.1.1. Flow admission control

   This document describes a new admission control procedure for an edge-to-edge region, which uses new per-hop Pre-Congestion Notification 'admission marking' as a fundamental building block. In turn, an end-to-end CL service would use this as a building block within a broader QoS architecture.

   The per-hop, edge-to-edge and end-to-end aspects are now briefly introduced in turn.

   Appendix A provides a brief summary of Explicit Congestion Notification (ECN) [RFC3168]. It specifies that a router sets the ECN field to the Congestion Experienced (CE) value as a warning of incipient congestion. RFC 3168 doesn't specify a particular algorithm for setting the CE codepoint, although RED (Random Early Detection) is expected to be used.

   Pre-Congestion Notification (PCN) builds on the concepts of ECN. PCN introduces a new algorithm that Admission Marks packets before there is any significant build-up of CL packets in the queue. Admission marked packets therefore act as an "early warning" that the amount of traffic flowing is getting close to the engineered capacity. Hence PCN can be used with per-hop behaviours (PHBs) designed to operate with very low queue occupancy, such as Expedited Forwarding (EF). Note that our use of the ECN field operates across the CL-region, i.e. edge-to-edge, and not host-to-host as in [RFC3168].

   Turning next to the edge-to-edge aspect: all nodes within a region of the Internet, which we call the CL-region, apply the PHB used for CL traffic and the Pre-Congestion Notification behaviour. Traffic must enter/leave the CL-region through ingress/egress gateways, which have special functionality. Typically the CL-region is the core or backbone of an operator. The CL service is achieved "edge-to-edge" across the CL-region by using distributed measurement-based admission control: the decision whether to admit a new microflow depends on a measurement of the existing traffic between the same pair of ingress and egress gateways (i.e. the same pair as the prospective new microflow). (See Appendix B for further discussion of "What is distributed measurement-based admission control?")

   As CL packets travel across the CL-region, nodes will admission mark packets (according to the Pre-Congestion Notification algorithm) as an "early warning" of potential congestion, i.e. before there is any significant build-up of CL packets in the queue. For traffic from each remote ingress gateway, the CL-region's egress gateway measures the fraction of CL traffic that is admission marked. The egress gateway calculates this value on a per bit basis, as an exponentially weighted moving average (which we term the Congestion-Level-Estimate), and reports it to the CL-region's ingress gateway, piggy-backed on the signalling for a new flow. The ingress gateway only admits the new CL microflow if the Congestion-Level-Estimate is less than the value of the CLE-threshold. Hence previously accepted CL microflows will suffer minimal queuing delay, jitter and loss.

   In turn, the edge-to-edge architecture is a building block in delivering an end-to-end CL service. The approach is similar to that described in [RFC2998] for Integrated services operation over DiffServ networks. Like [RFC2998], an IntServ class (CL in our case) is achieved end-to-end, with a CL-region viewed as a single reservation hop in the total end-to-end path. Interior nodes of the CL-region do not process flow signalling, nor do they hold flow state. We assume that the end-to-end signalling mechanism is RSVP (Section 2.2). However, the RSVP signalling may itself be originated or terminated by proxies still closer to the edge of the network, such as home hubs or the like, triggered in turn by application layer signalling. [RFC2998] and our approach are compared further in Section 6.2.

   An important benefit compared with the IntServ over DiffServ model [RFC2998] arises from the fact that the load is controlled dynamically rather than with traffic conditioning agreements (TCAs). TCAs were originally introduced in the (informational) DiffServ architecture [RFC2475] as an alternative to reservation processing in the interior region, in order to reduce the burden on interior nodes. With TCAs, in practice service providers rely on subscription-time Service Level Agreements that statically define the parameters of the traffic that will be accepted from a customer. The problem arises because the TCA at the ingress must allow any destination address if it is to remain scalable. But for longer topologies, the chances increase that traffic will focus on an interior resource, even though it is within contract at the ingress [Reid], e.g. all flows converge on the same egress gateway. Even though networks can be engineered to make such failures rare, when they occur all inelastic flows through the congested resource fail catastrophically.

   Distributed measurement-based admission control avoids reservation processing (whether per flow or aggregated) on interior nodes, but flows are still blocked dynamically in response to actual congestion on any interior node. Hence there is no need for accurate or conservative prediction of the traffic matrix.

1.1.2. Flow pre-emption

   An essential QoS issue in core and backbone networks is being able to cope with failures of nodes and links. The consequent re-routing can cause severe congestion on some links and hence degrade the QoS experienced by on-going microflows and other, lower priority traffic. Even when the network is engineered to sustain a single link failure, multiple link failures (e.g. due to a fibre cut or a node failure, or a natural disaster) can cause violation of capacity constraints and resulting QoS failures. Our solution uses rate-based flow pre-emption, so that enough of the previously admitted CL microflows are dropped to ensure that the remaining ones again receive QoS commensurate with the CL service, and at least some QoS is quickly restored to other traffic classes.

   The solution has two aspects. First, triggering the ingress gateway to test whether pre-emption may be needed: a router enhanced with Pre-Congestion Notification may optionally include an algorithm that sets packets into the Pre-emption Marked state. Such a packet alerts the egress gateway that pre-emption may be needed, which in turn sends a Pre-emption Alert message to the ingress gateway. Secondly, calculating the right amount of traffic to drop: this involves the egress gateway measuring, and reporting to the ingress gateway, the current amount of CL traffic received from that particular ingress gateway. The ingress gateway compares this measurement (which is the amount that the network can actually support, and which we thus call the Sustainable-Aggregate-Rate) with the rate that it is sending, and hence determines how much traffic needs to be pre-empted.

   The solution operates within a little over one round trip time: the time required for microflow packets that have experienced Pre-emption Marking to travel downstream through the CL-region and arrive at the egress gateway, plus some additional time for the egress gateway to measure the rate seen after it has been alerted that pre-emption may be needed, and the time for the egress gateway to report this information to the ingress gateway.

1.1.3. Both admission control and pre-emption

   This document describes both the admission control and pre-emption mechanisms, and we suggest that an operator uses both. However, we do not require this, and some operators may want to implement only one.

   For example, an operator could use just admission control, solving heavy congestion (caused by re-routing) by 'just waiting': as sessions end, existing microflows naturally depart from the system over time, and the admission control mechanism will prevent admission of new microflows that use the affected links. So the CL-region will naturally return to normal controlled load service, but with reduced capacity. The drawback of this approach would be that until flows naturally depart to relieve the congestion, all flows and lower priority services will be adversely affected. As another example, an operator could use just admission control, avoiding heavy congestion (caused by re-routing) by 'capacity planning': configuring admission control thresholds to lower levels than the network could accept in normal situations, such that the load after failure is expected to stay below acceptable levels even with reduced network resources.

   On the other hand, an operator could rely for admission control just on the traffic conditioning agreements of the DiffServ architecture [RFC2475]. The pre-emption mechanism described in this document would then be used to counteract the problem described at the end of Section 1.1.1.

1.2. Terminology

   This terminology is copied from the pre-congestion notification marking draft [PCN]:

   o Pre-Congestion Notification (PCN): two new algorithms that determine when a PCN-enabled router Admission Marks and Pre-emption Marks a packet, depending on the traffic level.

   o Admission Marking condition: the traffic level is such that the router Admission Marks packets. The router provides an "early warning" that the load is nearing the engineered admission control capacity, before there is any significant build-up of CL packets in the queue.

   o Pre-emption Marking condition: the traffic level is such that the router Pre-emption Marks packets. The router warns explicitly that pre-emption may be needed.

   o Configured-admission-rate: the reference rate used by the admission marking algorithm in a PCN-enabled router.

   o Configured-pre-emption-rate: the reference rate used by the pre-emption marking algorithm in a PCN-enabled router.

   The following terms are defined here:

   o Ingress gateway: node at an ingress to the CL-region. A CL-region may have several ingress gateways.

   o Egress gateway: node at an egress from the CL-region. A CL-region may have several egress gateways.

   o Interior node: a node which is part of the CL-region, but isn't an ingress or egress node.

   o CL-region: a region of the Internet in which all traffic enters/leaves through an ingress/egress gateway and all nodes run Pre-Congestion Notification marking. A CL-region is a DiffServ region (a DiffServ region is either a single DiffServ domain or a set of contiguous DiffServ domains), but note that the CL-region does not use the traffic conditioning agreements (TCAs) of the (informational) DiffServ architecture.

   o CL-region-aggregate: all the microflows between a specific pair of ingress and egress gateways. Note there is no identifier unique to the aggregate.

   o Congestion-Level-Estimate: the number of bits in CL packets that are admission marked, divided by the number of bits in all CL packets. It is calculated as an exponentially weighted moving average, by an egress gateway, for the CL packets from a particular ingress gateway; i.e. there is a Congestion-Level-Estimate for each CL-region-aggregate.

   o Sustainable-Aggregate-Rate: the rate of traffic that the network can actually support for a specific CL-region-aggregate. It is measured by an egress gateway for the CL packets from a particular ingress gateway.

1.3. Existing terminology

   This is a placeholder for useful terminology that is defined elsewhere.

1.4. Standardisation requirements

   The framework described in this document has two new standardisation requirements:

   o new Pre-Congestion Notification Admission Marking and Pre-emption Marking behaviours are required, as detailed in [PCN].

   o the end-to-end signalling protocol needs to be modified to carry the Congestion-Level-Estimate report (for admission control) and the Sustainable-Aggregate-Rate (for flow pre-emption). With our assumption of RSVP (Section 2.2) as the end-to-end signalling protocol, this means that extensions to RSVP are required, as detailed in [RSVP-ECN], for example to carry the Congestion-Level-Estimate and Sustainable-Aggregate-Rate information from egress gateway to ingress gateway.

   Other than these extensions, the arrangement uses existing IETF protocols throughout, although not in their usual architecture.

1.5. Structure of rest of the document

   Section 2 describes some key aspects of the framework: our goals, assumptions and the benefits we believe it has. Section 3 describes the architecture (including a use case), whilst Section 4 summarises the required changes to the various nodes in the CL-region. Section 5 outlines some possible extensions. Section 6 provides some comparison with existing QoS mechanisms.

2. Key aspects of the framework

   In this section we discuss the key aspects of the framework:

   o At a high level, our key goals, i.e. the functionality that we want to achieve

   o The assumptions that we're prepared to make

   o The consequent benefits they bring

2.1. Key goals

   The framework achieves an end-to-end controlled load (CL) service where a segment of the end-to-end path is an edge-to-edge Pre-Congestion Notification region. CL is a quality of service (QoS) closely approximating the QoS that the same flow would receive from a lightly loaded network element [RFC2211]. It is useful for inelastic flows such as those for real-time media.

   o The CL service should be achieved despite varying load levels of other sorts of traffic, which may or may not be rate adaptive (i.e. responsive to packet drops or ECN marks).

   o The CL service should be supported for a variety of possible CL sources: Constant Bit Rate (CBR), Variable Bit Rate (VBR) and voice with silence suppression. VBR is the most challenging to support.

   o After a localised failure in the interior of the CL-region causing heavy congestion, the CL service should recover gracefully by pre-empting (dropping) some of the admitted CL microflows, whilst preserving as many of them as possible with their full CL QoS.

   o It is suggested that flow pre-emption needs to be completed within 1-2 seconds, because it is estimated that after a few seconds many affected users will start to hang up (and then not only is a flow pre-emption mechanism redundant and possibly even counter-productive, but also many more flows than necessary to reduce congestion may hang up). Also, other, lower priority traffic classes will not be restored to partial service until the higher priority CL service reduces its load on shared links.

   o The CL service should support emergency services ([EMERG-RQTS], [EMERG-TEL]) as well as the Assured Service, which is the IP implementation of the existing ITU-T/NATO/DoD telephone system architecture known as Multi-Level Pre-emption and Precedence [ITU.MLPP.1990] [ANSI.MLPP.Spec] [ANSI.MLPP.Supplement], or MLPP. In particular, this involves admitting new high priority sessions even when admission control thresholds are reached and new routine sessions are rejected. Similarly, it involves taking into account session priorities and properties at the time of pre-empting flows.

2.2. Key assumptions

   The framework does not try to deliver the above functionality in all scenarios. We make the following assumptions about the type of scenario to be solved.

   o Edge-to-edge: all the nodes in the CL-region are upgraded with Pre-Congestion Notification, and all the ingress and egress gateways are upgraded to perform the measurement-based admission control and flow pre-emption. Note that although the upgrades required are edge-to-edge, the CL service is provided end-to-end.

   o Additional load: we assume that any additional load offered within the reaction time of the admission control mechanism doesn't move the CL-region directly from no congestion to overload. So we assume there will always be an intermediate stage where some CL packets are Admission Marked, but they are still delivered without significant QoS degradation. We believe this is valid for core and backbone networks with typical call arrival patterns (given the reaction time is little more than one round trip time across the CL-region), but is unlikely to be valid in access networks, where the granularity of an individual call becomes significant.

   o Aggregation: we assume that in normal operations there are many CL microflows within the CL-region, typically at least hundreds between any pair of ingress and egress gateways. The implication is that the solution is targeted at core and backbone networks, and possibly parts of large access networks.

   o Trust: we assume that there is trust between all the nodes in the CL-region. For example, this trust model is satisfied if one operator runs the whole of the CL-region. But we make no such assumptions about the end nodes, i.e. depending on the scenario they may be trusted or untrusted by the CL-region.

   o Signalling: we assume that the end-to-end signalling protocol is RSVP. Section 3 describes how the CL-region fits into such an end-to-end QoS scenario, whilst [RSVP-ECN] describes the extensions to RSVP that are required.

   o Separation: we assume that all nodes within the CL-region are upgraded with the CL mechanism, so the requirements of [Floyd] are met because the CL-region is an enclosed environment. Also, an operator separates CL traffic in the CL-region from outside traffic by administrative configuration of the ring of gateways around the region. Within the CL-region we assume that the CL traffic is separated from non-CL traffic.

   o Routing: we assume that one of the following applies:

      (same path) all packets between a pair of ingress and egress gateways follow the same path. This ensures that the Congestion-Level-Estimate used in the admission control procedure reflects the status of the path followed by the new flow's packets.

      (load balanced) packets between a pair of ingress and egress gateways follow different paths, but the load balancing scheme is tuned in the CL-region to distribute load such that the different paths always receive comparable relative load. This ensures that the Congestion-Level-Estimate used in the admission control procedure (which is computed taking into account packets travelling on all the paths) also approximately reflects the status of the actual path followed by the new microflow's packets.

      (worst case assumed) packets between a pair of ingress and egress gateways follow different paths, but (i) it is acceptable for the operator to keep the CL traffic between this pair of gateways to a level dictated by the most loaded of all paths between this pair of gateways (so that CL flows may be rejected - or even pre-empted in some situations - even if one or more of the paths between the pair of gateways is operating below its engineered levels); and (ii) it is acceptable for that operator to configure engineered levels below optimum levels, to compensate for the fact that the effect on the Congestion-Level-Estimate of the congestion experienced over one of the paths may be diluted by traffic received over non-congested paths, so that lower thresholds need to be used in these cases to ensure early admission control rejection and pre-emption over the congested paths.

   We are investigating ways of loosening the restrictions set by some of these assumptions, for instance:

   o Trust: to allow the CL-region to span multiple, non-trusting operators, using the technique of [Re-PCN] mentioned in Section 5.1.

   o Signalling: we believe that the solution could operate with another signalling protocol such as NSIS. It could also work with application level signalling, as suggested in [RT-ECN].

   o Additional load: we believe that the assumption is valid for core and backbone networks, with an appropriate margin between the configured-admission-rate and the capacity for CL traffic. However, in principle a burst of admission requests can occur in a short time. We expect this to be a rare event under normal conditions, but it could happen e.g. due to a 'flash crowd'. If it does, then more flows may be admitted than should be, triggering the pre-emption mechanisms. There are various approaches to how an operator might try to alleviate this issue, which are discussed in the 'Flash crowds' section (Section 5.1).

   o Separation: the assumption that CL traffic is separated from non-CL traffic implies that the CL traffic has its own PHB, not shared with other traffic. We are looking at whether it could share Expedited Forwarding's PHB, but supplemented with Pre-Congestion Notification. If this is possible, other PHBs (like Assured Forwarding) could be supplemented with the same new behaviours. This is similar to how RFC 3168 ECN was defined to supplement any PHB.

   o Routing: we are looking in greater detail at the solution in the presence of Equal Cost Multi-Path routing, and at suitable enhancements. See also the "Tunnelling" section later.

2.3. Key benefits

   We believe that the mechanism described in this document has several advantages:

   o It achieves statistical guarantees of quality of service for microflows, delivering a very low delay, jitter and packet loss service suitable for applications like voice and video calls that generate real time inelastic traffic. This is because of its per microflow admission control scheme, combined with its dynamic on-path "early warning" of potential congestion. The guarantee is at least as strong as with IntServ Controlled Load (Section 6.1 mentions why the guarantee may be somewhat better), but without the scalability problems of per-microflow IntServ.

   o It can support "Emergency" and military Multi-Level Pre-emption and Priority services, even in times of heavy congestion (perhaps caused by failure of a node within the CL-region), by pre-empting on-going "ordinary" CL microflows. See also Section 4.5.

   o It scales well, because there is no signal processing or path state held by the interior nodes of the CL-region.

   o It is resilient, again because no state is held by the interior nodes of the CL-region. Hence during an interior routing change caused by a node failure, no microflow state has to be relocated. The flow pre-emption mechanism further helps resilience, because it rapidly reduces the load to one that the CL-region can support.

   o It helps preserve, through the flow pre-emption mechanism, QoS to as many microflows as possible and to lower priority traffic in times of heavy congestion (e.g. caused by failure of an interior node). Otherwise long-lived microflows could cause loss on all CL microflows for a long time.

   o It avoids the potential catastrophic failure problem when the DiffServ architecture is used in large networks using statically provisioned capacity. This is achieved by controlling the load dynamically, based on edge-to-edge-path real-time measurement of Pre-Congestion Notification, as discussed in Section 1.1.1.

   o It requires minimal new standardisation, because it reuses existing QoS protocols and algorithms.

   o It can be deployed incrementally, region by region or network by network. Not all the regions or networks on the end-to-end path need to have it deployed. Two CL-regions can even be separated by a network that uses another QoS mechanism (e.g. MPLS-TE).

   o It provides a deployment path for use of ECN for real-time applications. Operators can gain experience of ECN before its applicability to end-systems is understood and end terminals are ECN capable.

3. Architecture

3.1. Admission control

   In this section we describe the admission control mechanism. We discuss the three pieces of the solution and then give an example of how they fit together in a use case:

   o the new Pre-Congestion Notification for Admission Marking used by all nodes in the CL-region

   o how the measurements made support our admission control mechanism

   o how the edge-to-edge mechanism fits into the end-to-end RSVP signalling

3.1.1. Pre-Congestion Notification for Admission Marking

   This is discussed in [PCN]. Here we only give a brief outline.

   To support our admission control mechanism, each node in the CL-region runs an algorithm to determine whether to set a packet into the Admission Marked state. The algorithm measures the aggregate CL traffic on the link and ensures that packets are admission marked before the actual queue builds up, but when it is in danger of doing so soon; the probability of admission marking increases with the danger. The algorithm's main parameter is the configured-admission-rate, which is set lower than the link speed, perhaps considerably so. Admission marked packets indicate that the CL traffic rate is reaching the configured-admission-rate, and so act as an "early warning" that the engineered capacity is nearly reached. Therefore they indicate that requests to admit prospective new CL flows may need to be refused. A simplified sketch of such a marking algorithm is given below.
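
   As an illustration only (the normative algorithms are specified in [PCN]; the class, helper names and parameter values below are invented for this example), such admission marking can be pictured as a token bucket that fills at the configured-admission-rate, with the marking probability rising as the bucket drains:

      import random, time

      class AdmissionMarker:
          """Illustrative sketch only: a token bucket filled at the
          configured-admission-rate.  Sustained CL traffic above that
          rate drains the bucket, and packets are then admission marked
          with a probability that grows as the bucket empties."""

          def __init__(self, admission_rate_bps, bucket_depth_bits):
              self.rate = admission_rate_bps    # configured-admission-rate
              self.depth = bucket_depth_bits    # absorbs transient bursts
              self.tokens = bucket_depth_bits
              self.last = time.monotonic()

          def should_admission_mark(self, pkt_size_bits):
              now = time.monotonic()
              self.tokens = min(self.depth,
                                self.tokens + self.rate * (now - self.last))
              self.last = now
              self.tokens -= pkt_size_bits  # may go negative under overload
              fill = max(0.0, self.tokens) / self.depth
              # The emptier the bucket, the likelier the "early warning".
              return random.random() > fill

   With CL traffic comfortably below the configured-admission-rate the bucket stays full and (almost) nothing is marked; as the rate approaches it, the marked fraction - and hence the Congestion-Level-Estimate seen at the egress - rises.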

3.1.2. Measurements to support admission control

   To support our admission control mechanism, the egress gateway measures the Congestion-Level-Estimate for traffic from each remote ingress gateway, i.e. per CL-region-aggregate. The Congestion-Level-Estimate is the number of bits in CL packets that are admission marked, divided by the number of bits in all CL packets. It is calculated as an exponentially weighted moving average, by an egress node, separately for the CL packets from each particular ingress node. This Congestion-Level-Estimate provides an estimate of how near the links on the path inside the CL-region are getting to the configured-admission-rate. Note that the metering is done separately per ingress node, because there may be sufficient capacity on all the nodes on the path between one ingress gateway and a particular egress, but not from a second ingress to that same egress gateway.
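
   Purely as an illustration (Appendix C discusses how to calculate the EWMA; the per-bit decay constant below is an invented value, not a recommendation), the egress gateway's meter for one CL-region-aggregate might look like:

      class CLEMeter:
          """Sketch of one per-ingress Congestion-Level-Estimate meter:
          the fraction of CL bits carried in admission marked packets,
          smoothed as an exponentially weighted moving average."""

          def __init__(self, decay_per_bit=0.999999):
              self.decay = decay_per_bit  # how much history one bit keeps
              self.cle = 0.0              # Congestion-Level-Estimate, 0..1

          def on_cl_packet(self, size_bits, admission_marked):
              sample = 1.0 if admission_marked else 0.0
              # Per-bit decay, so larger packets move the average more.
              keep = self.decay ** size_bits
              self.cle = keep * self.cle + (1.0 - keep) * sample
              return self.cle

      # One meter per CL-region-aggregate, i.e. per remote ingress gateway:
      cle_meters = {}   # ingress gateway id -> CLEMeter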

3.1.3. How edge-to-edge admission control supports end-to-end QoS signalling

   Consider a scenario that consists of two end hosts, each connected to their own access network, which are linked by the CL-region. A source tries to set up a new CL microflow by sending an RSVP PATH message, and the receiving end host replies with an RSVP RESV message. Outside the CL-region some other method, for instance IntServ, is used to provide QoS. From the perspective of RSVP the CL-region is a single hop, so the RSVP PATH and RESV messages are processed by the ingress and egress gateways but are carried transparently across all the interior nodes; hence the ingress and egress gateways hold per microflow state, whilst no state is kept by the interior nodes. So far this is as in IntServ over DiffServ [RFC2998]. However, in order to support our admission control mechanism, the egress gateway adds to the RESV message an opaque object which states the current Congestion-Level-Estimate for the relevant CL-region-aggregate. Details of the corresponding RSVP extensions are described in [RSVP-ECN].

3.1.4. Use case

   To see how the three pieces of the solution fit together, we imagine a scenario where some microflows are already in place between a given pair of ingress and egress gateways, but the traffic load is such that no packets from these flows are admission marked as they travel across the CL-region. A source wanting to start a new CL microflow sends an RSVP PATH message. The egress gateway adds an object to the RESV message with the Congestion-Level-Estimate, which is zero. The ingress gateway sees this and consequently admits the new flow. It then forwards the RSVP RESV message upstream towards the source end host. Hence, assuming there's sufficient capacity in the access networks, the new microflow is admitted end-to-end.

   The source now sends CL packets, which arrive at the ingress gateway. The ingress uses a five-tuple filter to identify that the packets are part of a previously admitted CL microflow, and it also polices the microflow to ensure it remains within its traffic profile. (The ingress has learnt the required information from the RSVP messages.) When forwarding a packet belonging to an admitted microflow, the ingress sets the packet's DSCP and ECN fields to the appropriate values configured for the CL-region. The CL packet now travels across the CL-region, getting admission marked if necessary.

   Next, we imagine the same scenario but at a later time, when load is higher at one (or more) of the interior nodes, which start to set CL packets into the Admission Marked state because their load on the outgoing link is nearing the configured-admission-rate. The next time a source tries to set up a CL microflow, the ingress gateway learns (from the egress) the relevant Congestion-Level-Estimate. If it is greater than some CLE-threshold value then the ingress refuses the request, otherwise it is accepted.

   It is also possible for an egress gateway to get an RSVP RESV message and not know what the Congestion-Level-Estimate is, for example if there are no CL microflows at present between the relevant ingress and egress gateways. In this case the egress requests the ingress to send probe packets, from which it can initialise its meter. RSVP extensions for such a request to send probe data can be found in [RSVP-ECN].
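
   The resulting decision logic at the ingress gateway is then a simple threshold test. The sketch below is illustrative only: the RSVP object that carries the estimate is defined in [RSVP-ECN], and the threshold value is invented here. (When the egress has no usable estimate, it instead requests probes as just described, and the decision waits for a fresh report.)

      CLE_THRESHOLD = 0.05   # operator-configured; value is illustrative

      def admit_new_microflow(cle_report, policy_ok=True):
          """Ingress gateway decision on receiving the RESV: admit only
          if the reported Congestion-Level-Estimate is under threshold
          (and local and/or policy-decision-point policy agrees)."""
          return policy_ok and cle_report < CLE_THRESHOLD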

3.2. Flow pre-emption

   In this section we describe the flow pre-emption mechanism. We discuss the two parts of the solution and then give an example of how they fit together in a use case:

   o how an ingress gateway is triggered to test whether flow pre-emption may be needed

   o how an ingress gateway determines the right amount of CL traffic to drop

   The mechanism is defined in [PCN] and [RSVP-ECN].

3.2.1. Alerting an ingress gateway that flow pre-emption may be needed

   Alerting an ingress gateway that flow pre-emption may be needed is a two stage process: a router in the CL-region alerts an egress gateway that flow pre-emption may be needed; in turn the egress gateway alerts the relevant ingress gateway. Every router in the CL-region has the ability to alert egress gateways, which may be done either explicitly or implicitly:

   o Explicit - the router per-hop behaviour is supplemented with a new Pre-emption Marking behaviour, which is outlined below. Reception of such a packet by the egress gateway alerts it that pre-emption may be needed.

   o Implicit - the router behaviour is unchanged from the Admission Marking behaviour described earlier. The egress gateway treats a Congestion-Level-Estimate of (almost) 100% as an implicit alert that pre-emption may be required. ('Almost' because the Congestion-Level-Estimate is a moving average, so it can never reach exactly 100%.)

   To support explicit pre-emption alerting, each node in the CL-region runs an algorithm to determine whether to set a packet into the Pre-emption Marked state. The algorithm measures the aggregate CL traffic and ensures that packets are pre-emption marked before the actual queue builds up. The algorithm's main parameter is the configured-pre-emption-rate, which is set lower than the link speed (but higher than the configured-admission-rate). Thus pre-emption marked packets indicate that the CL traffic rate is reaching the configured-pre-emption-rate, and so act as an "early warning" that the engineered capacity is nearly reached. Therefore they indicate that it may be advisable to pre-empt some of the existing CL flows in order to preserve the QoS of the others.

   Note that the explicit mechanism only makes sense if all the routers in the CL-region have this functionality, so that the egress gateways can rely on it. Otherwise there is the danger that traffic happens to focus on a router without it, and egress gateways then also have to watch for implicit pre-emption alerts.

   When one or more packets in a CL-region-aggregate alert the egress gateway of the need for flow pre-emption, whether explicitly or implicitly, the egress puts that CL-region-aggregate into the Pre-emption Alert state. For each CL-region-aggregate in alert state it measures the rate of traffic at the egress gateway (i.e. the traffic rate of the appropriate CL-region-aggregate) and reports this to the relevant ingress gateway. The steps are:

   o Determine the relevant ingress gateway - for the explicit case, the egress gateway examines the pre-emption marked packet and uses the state installed at the time of admission to determine which ingress gateway the packet came from. For the implicit case, the egress gateway has already determined this information, because the Congestion-Level-Estimate is calculated per ingress gateway.

   o Measure the traffic rate of CL packets - as soon as the egress gateway is alerted (whether explicitly or implicitly), it measures the rate of CL traffic from this ingress gateway (i.e. for this CL-region-aggregate). Note that pre-emption marked packets are excluded from that measurement. It should make its measurement quickly and accurately, but exactly how is up to the implementation.

   o Alert the ingress gateway - the egress gateway then immediately alerts the relevant ingress gateway that flow pre-emption may be required. This Alert message also includes the measured Sustainable-Aggregate-Rate, i.e. the egress rate of CL traffic for this ingress gateway. The Alert message is sent using reliable delivery. Procedures for support of such an Alert using RSVP are defined in [RSVP-ECN].

                                       _ _
              --------------         /       \        -----------------
 CL packet   |Update        |      / Is it a \    Y  | Measure CL rate |
 arrives --->|Congestion-   |---> /pre-emption\ ---->| from ingress and|
             |Level-Estimate|     \  marked   /      | alert ingress   |
              --------------       \ packet? /        -----------------
                                    \_     _/

   Figure 2: Egress gateway action for explicit Pre-emption Alert

                                       _ _
              --------------         /       \        -----------------
 CL packet   |Update        |      /   Is    \    Y  | Measure CL rate |
 arrives --->|Congestion-   |---> /  C.L.E.   \ ---->| from ingress and|
             |Level-Estimate|     \ (nearly)  /      | alert ingress   |
              --------------       \  100%?  /        -----------------
                                    \_     _/

   Figure 3: Egress gateway action for implicit Pre-emption Alert
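
   In code form, the per-packet egress behaviour of Figures 2 and 3 might be sketched as follows (illustrative only: the measure_sustainable_rate() and send_preemption_alert() helpers are hypothetical, the CLEMeter is the sketch from Section 3.1.2, and 0.99 simply stands for "(almost) 100%"):

      IMPLICIT_ALERT_CLE = 0.99   # "(almost) 100%"; value is illustrative

      def on_cl_packet_at_egress(pkt, ingress_id, meters, alerted):
          cle = meters[ingress_id].on_cl_packet(pkt.size_bits,
                                                pkt.admission_marked)
          explicit = pkt.preemption_marked          # Figure 2
          implicit = cle >= IMPLICIT_ALERT_CLE      # Figure 3
          if (explicit or implicit) and ingress_id not in alerted:
              alerted.add(ingress_id)  # aggregate enters Pre-emption Alert
              # Measure the CL rate of this CL-region-aggregate, excluding
              # pre-emption marked packets, and report it to the ingress
              # gateway as the Sustainable-Aggregate-Rate [RSVP-ECN].
              rate = measure_sustainable_rate(ingress_id)
              send_preemption_alert(ingress_id, rate)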

3.2.2. Determining the right amount of CL traffic to drop

   The method relies on the insight that the amount of CL traffic that can be supported between a particular pair of ingress and egress gateways is the amount of CL traffic that is actually getting across the CL-region to the egress gateway without being re-marked to the Pre-emption Marked state. Hence we term it the Sustainable-Aggregate-Rate.

   So when the ingress gateway gets the Alert message from an egress gateway, it compares:

   o the traffic rate that it is sending to this particular egress gateway (which we term the ingress-aggregate-rate)

   o the traffic rate that the egress gateway reports (in the Alert message) that it is receiving from this ingress gateway (which is the Sustainable-Aggregate-Rate)

   If the difference is significant, then the ingress gateway pre-empts some microflows. It only pre-empts if:

      ingress-aggregate-rate > Sustainable-Aggregate-Rate + error

   The "error" term is partly to allow for inaccuracies in the measurements of the rates. It is also needed because the ingress-aggregate-rate is measured at a slightly later moment than the Sustainable-Aggregate-Rate, and it is quite possible that the ingress-aggregate-rate has increased in the interim due to natural variation of the bit rate of the CL sources. So the "error" term allows for some variation in the ingress rate without triggering pre-emption.

   The ingress gateway should pre-empt enough microflows to ensure that:

      new ingress-aggregate-rate < Sustainable-Aggregate-Rate - error

   The "error" term here is used for similar reasons but in the other direction, to ensure slightly more load is shed than seems necessary, in case the two measurements were taken during a short-term fall in load.

   When the routers in the CL-region are using explicit pre-emption alerting, the ingress gateway would normally pre-empt microflows whenever it gets an alert (it always would if it were possible to set "error" equal to zero). For the implicit case, however, this is not so. The ingress gateway receives an Alert message when the Congestion-Level-Estimate reaches (almost) 100%, which is roughly when traffic exceeds the configured-admission-rate. However, it is only when packets are indeed dropped en route that the Sustainable-Aggregate-Rate becomes less than the ingress-aggregate-rate, so only then will pre-emption actually occur at the ingress gateway.

   Hence with the implicit scheme, pre-emption can only be triggered once the system starts dropping packets, and thus once the QoS of flows starts being significantly degraded. This is in contrast with the explicit scheme, which allows flow pre-emption to be triggered before any packet drop, simply when the traffic reaches the configured-pre-emption-rate. Therefore we believe that the explicit mechanism is superior. However, it does require new functionality on all the routers (although this is little more than a bulk token bucket - see [PCN] for details).
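
   The two inequalities above translate directly into the amount of traffic the ingress gateway needs to shed; a minimal sketch (the function and variable names are ours, not from any specification):

      def traffic_to_preempt(ingress_rate, sustainable_rate, error):
          """Pre-empt only if ingress-aggregate-rate exceeds
          Sustainable-Aggregate-Rate + error; then shed enough that the
          new ingress rate falls below Sustainable-Aggregate-Rate - error."""
          if ingress_rate <= sustainable_rate + error:
              return 0.0   # difference within measurement error: no action
          return ingress_rate - (sustainable_rate - error)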

3.2.3. Use case for flow pre-emption

   To see how the pieces of the solution fit together in a use case, we imagine a scenario where many microflows have already been admitted. We confine our description to the explicit pre-emption mechanism. Now an interior router in the CL-region fails. The network layer routing protocol re-routes round the problem, but as a consequence traffic on other links increases. In fact, let's assume the traffic on one link now exceeds its configured-pre-emption-rate, and so the router pre-emption marks CL packets. When the egress sees the first of the pre-emption marked packets, it immediately determines which microflow this packet is part of (by using a five-tuple filter and comparing it with state installed at admission) and hence which ingress gateway the packet came from. It sets up a meter to measure the traffic rate from this ingress gateway, and as soon as possible sends a message to the ingress gateway. This message alerts the ingress gateway that pre-emption may be needed and contains the traffic rate measured by the egress gateway. Then the ingress gateway determines the traffic rate that it is sending towards this egress gateway, and hence it can calculate the amount of traffic that needs to be pre-empted.

   The ingress gateway could now just shed random microflows, but it is better if the least important ones are dropped. The ingress gateway could use information stored locally in each reservation's state (such as, for example, the RSVP pre-emption priority) as well as information provided by a policy decision point in order to decide which of the flows to shed (or perhaps which ones not to shed). The ingress gateway then initiates RSVP signalling to instruct the relevant destinations that their session has been terminated, and to tell (RSVP) nodes along the path to tear down associated RSVP state. To guard against recalcitrant sources, normal IntServ policing will block any future traffic from the dropped flows from entering the CL-region. Note that - with the explicit Pre-emption Alert mechanism - since the configured-pre-emption-rate may be significantly less than the physical line capacity, flow pre-emption may be triggered before any congestion has actually occurred and before any packet is dropped.

   We extend the scenario further by imagining that (due to a disaster of some kind) further routers in the CL-region fail during the time taken by the pre-emption process described above. This is handled naturally, as packets will continue to be pre-emption marked and so the pre-emption process will happen for a second time.

   Flow pre-emption also helps emergency/military calls, by taking into account the corresponding call priorities when selecting calls to be pre-empted, which is likely to be particularly important in a disaster scenario.
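
   As the draft deliberately leaves the shedding policy open, the following is just one plausible sketch: drop the lowest priority reservations first (and, within a priority level, the highest rate ones) until the required reduction is reached. The flow attributes are hypothetical:

      def choose_flows_to_preempt(flows, rate_to_shed):
          victims, shed = [], 0.0
          # Lowest pre-emption priority first; largest flows first within
          # a priority level, so fewer sessions need to be torn down.
          for flow in sorted(flows, key=lambda f: (f.priority, -f.rate)):
              if shed >= rate_to_shed:
                  break
              victims.append(flow)
              shed += flow.rate
          return victims   # tear these down via RSVP, then police them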
Flow pre-emption also helps emergency/military calls by taking into account the corresponding call priorities when selecting calls to be pre-empted, which is likely to be particularly important in a disaster scenario.

4. Details

This section is intended to provide a systematic summary of the new functionality required by the routers in the CL-region.

A network operator upgrades normal IP routers by:

o Adding functionality related to admission control and flow pre-emption to all its ingress and egress gateways

o Adding Pre-Congestion Notification for Admission and Pre-emption Marking to all the nodes in the CL-region.

We consider the detailed actions required for each of the types of node in turn.

4.1. Ingress gateways

Ingress gateways perform the following tasks:

o Classify incoming packets - decide whether they are CL or non-CL packets. This is done using an IntServ filter spec (source and destination addresses and port numbers), whose details have been gathered from the RSVP messaging.

o Police - check that the microflow conforms with what has been agreed (i.e. it keeps to its agreed data rate). If necessary, the following may be policed: packets which do not correspond to any reservation, packets which are in excess of the rate agreed for their reservation, and packets for a reservation that has earlier been pre-empted. Policing may be achieved via dropping or via re-marking of the packet's DSCP to a value different from the CL behaviour aggregate.

o Packet ECN colouring - for CL microflows, set the ECN field appropriately (see [PCN] for some discussion of encoding)

o Perform 'interior node' functions (see next sub-section)

o Admission Control - on new session establishment, consider the Congestion-Level-Estimate received from the corresponding egress gateway and, most likely by comparing it with a simple configured CLE-threshold, decide whether the new call is to be admitted or rejected (taking into account local policy information as well as, optionally, information provided by a policy decision point).

o Probe - if requested by the egress gateway to do so, the ingress gateway generates probe traffic so that the egress gateway can compute the Congestion-Level-Estimate from this ingress gateway. Probe packets may be simple data addressed to the egress gateway and require no protocol standardisation, although there will be best practice for their number, size and rate.

o Measure - when it receives an Alert message from an egress gateway, it determines the rate at which it is sending packets to that egress gateway

o Pre-empt - calculate how much CL traffic needs to be pre-empted; decide which microflows should be dropped, perhaps in consultation with a Policy Decision Point; and do the necessary signalling to drop them.

4.2. Interior nodes

Interior nodes do the following tasks:

o Classify packets - examine the DSCP and ECN field to see whether it is a CL packet

o Non-CL packets are handled as usual, with respect to dropping them or setting their CE codepoint.

o Pre-Congestion Notification - CL packets are Admission Marked and Pre-emption Marked according to the algorithm detailed in [PCN] and outlined in Section 3 (a rough illustrative sketch follows this list).
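As a very rough illustration only (the normative algorithms are defined in [PCN]; this sketch merely assumes a virtual queue drained at the configured-admission-rate and a token bucket filled at the configured-pre-emption-rate, with invented class and parameter names):

   class PCNMarker:
       """Simplified per-interface Pre-Congestion Notification marker."""

       def __init__(self, admission_rate, preemption_rate,
                    vq_threshold, bucket_depth):
           self.admission_rate = admission_rate    # bit/s
           self.preemption_rate = preemption_rate  # bit/s
           self.vq_threshold = vq_threshold        # bits
           self.bucket_depth = bucket_depth        # bits
           self.vq = 0.0                           # virtual queue, bits
           self.tokens = bucket_depth              # token bucket, bits
           self.last = None                        # previous packet time

       def mark(self, now, packet_bits):
           if self.last is not None:
               dt = now - self.last
               # Drain the virtual queue at the configured-admission-rate
               # and refill tokens at the configured-pre-emption-rate.
               self.vq = max(0.0, self.vq - self.admission_rate * dt)
               self.tokens = min(self.bucket_depth,
                                 self.tokens + self.preemption_rate * dt)
           self.last = now
           self.vq += packet_bits
           if self.tokens < packet_bits:
               return "preemption-marked"   # above pre-emption rate
           self.tokens -= packet_bits
           if self.vq > self.vq_threshold:
               return "admission-marked"    # early warning of congestion
           return "unmarked"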
4.3. Egress gateways

Egress gateways do the following tasks:

o Classify packets - determine which ingress gateway a CL packet has come from. This is the previous RSVP hop, hence the necessary details are obtained just as with IntServ from the state associated with the packet five-tuple, which has been built using information from the RSVP messages.

o Meter - for CL packets, calculate the fraction of the total number of bits which are in Admission Marked packets. The calculation is done as an exponentially weighted moving average (see Appendix C). A separate calculation is made for CL packets from each ingress gateway. The meter works on an aggregate basis and not per microflow.

o Signal the Congestion-Level-Estimate - this is piggy-backed on the reservation reply. An egress gateway's interface is configured to know it is an egress gateway, so it always appends this to the RESV message. If the Congestion-Level-Estimate is unknown or is too stale, then the egress gateway can request the ingress gateway to send probes.

o Packet colouring - for CL packets, set the DSCP and the ECN field to whatever has been agreed as appropriate for the next domain. By default the ECN field is set to the Not-ECT codepoint. See also the discussion in the Tunnelling section later.

o Measure the rate - measure the rate of CL traffic from a particular ingress gateway (i.e. the rate for the CL-region-aggregate), when alerted (either explicitly or implicitly) that pre-emption may be required. The measured rate is reported back to the appropriate ingress gateway [RSVP-ECN].

4.4. Failures

If an interior node fails, then the regular IP routing protocol will re-route round it. If the new route can carry all the admitted traffic, flows will gracefully continue. If instead this causes early warning of congestion on the new route, then admission control based on pre-congestion notification will ensure that new flows are not admitted until enough existing flows have departed. Finally, re-routing may result in heavy congestion, in which case the pre-emption mechanism will kick in.

If a gateway fails, then we would like regular RSVP procedures [RFC2205] to take care of things. With the local repair mechanism of [RFC2205], when a route changes the next RSVP PATH refresh message will establish path state along the new route, and thus attempt to re-establish reservations through the new ingress gateway. Essentially the same procedure is used as described earlier in this document, with the re-routed session treated as a new session request.

In more detail, consider what happens if an ingress gateway of the CL-region fails. Then RSVP routers upstream of it do IP re-routing to a new ingress gateway. The next time the upstream RSVP router sends a PATH refresh message, it reaches the new ingress gateway, which therefore installs the associated RSVP state. The next RSVP RESV refresh will pick up the Congestion-Level-Estimate from the egress gateway, and the ingress compares this with its threshold to decide whether to admit the new session. This could result in some of the flows being rejected, but those accepted will receive the full QoS.
An issue with this is that we have to wait until PATH and RESV refresh messages are sent - which may not be very often; the default period is 30 seconds. [RFC2205] discusses how to speed up the local repair mechanism. First, the RSVP module is notified by the local routing protocol module of a route change to particular destinations, which triggers it to rapidly send out PATH refresh messages. Further, when a PATH refresh arrives with a previous hop address different from the one stored, RESV refreshes are immediately sent to that previous hop. Where RSVP is operating hop-by-hop, i.e. on every router, triggering the PATH refresh is easy, as the node can simply monitor its local link. Thus, this fast local repair mechanism can be used to deal with failures upstream of the ingress gateway, with failures of the ingress gateway and with failures downstream of the egress gateway.

But where RSVP is not operating hop-by-hop (as is the case within the CL-region), it is not so easy to trigger the PATH refresh.

Unfortunately, this problem applies if an egress gateway fails, since it is very likely that an egress gateway is several IP hops from the ingress gateway. (If the ingress is several IP hops from its previous RSVP node, then there is the same issue.) The options appear to be:

o the ingress gateway has a link state database for the CL-region, so it can detect that an egress gateway has failed or become unreachable

o there is an inter-gateway protocol, so the ingress can continuously check that the egress gateways are still alive

o (default) do nothing and wait for the regular PATH/RESV refreshes (and, if needed, the pre-emption mechanism) to sort things out.

4.5. Admission of 'emergency / higher precedence' sessions

Section 4.1 describes how, if the Congestion-Level-Estimate is greater than the CLE-threshold, all new sessions are refused. But it is unsatisfactory to block emergency calls, for instance. Therefore it is recommended that an 'emergency / higher precedence' call is admitted immediately, even if the CLE-threshold is exceeded. Usually the network can actually handle the additional microflow, because there is a safety margin between the configured-admission-rate and the configured-pre-emption-rate. Normal call termination behaviour will soon bring the traffic level down below the configured-admission-rate. However, in exceptional circumstances the 'emergency / higher precedence' call may cause the traffic level to exceed the configured-pre-emption-rate; then the usual pre-emption mechanism will pre-empt enough (non 'emergency / higher precedence') microflows to bring the total traffic back under the configured-pre-emption-rate.
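Combining this rule with the CLE-threshold test of Section 4.1, the admission decision at the ingress might be sketched as follows (illustrative Python; the threshold value and names are invented):

   CLE_THRESHOLD = 0.05   # configured CLE-threshold; example value only

   def admit_session(congestion_level_estimate, higher_precedence=False):
       """Admission decision at the ingress gateway (Sections 4.1, 4.5)."""
       if higher_precedence:
           # 'Emergency / higher precedence' calls are admitted
           # immediately, relying on the safety margin between the
           # configured-admission-rate and the configured-pre-emption-
           # rate (and, in extremis, on pre-emption of ordinary flows).
           return True
       return congestion_level_estimate < CLE_THRESHOLD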
4.6. Tunnelling

It is possible to tunnel all CL packets across the CL-region. Although there is a cost to tunnelling (an additional header on each packet, and additional processing at tunnel ingress and egress), there are three reasons it may be interesting.

ECMP:

If the CL-region uses Equal Cost Multipath Routing (ECMP), then traffic between a particular pair of ingress and egress gateways may follow several different paths. Why? An ECMP-enabled router runs an algorithm to choose between potential outgoing links, based on a hash of fields such as the packet's source and destination addresses - exactly which fields depends on the proprietary algorithm. Packets are addressed to the CL flow's end-point, and therefore different flows may follow different paths through the CL-region.

The problem is that if one of the paths is congested such that packets are being admission marked, then the Congestion-Level-Estimate measured by the egress gateway will be diluted by unmarked packets from other, non-congested paths. Similarly, the measurement of the Sustainable-Aggregate-Rate will also be diluted.

One solution is to tunnel across the CL-region. Then the destination address (and so on) seen by the ECMP algorithm is that of the egress gateway, so all flows follow the same path.

Ingress gateway determination:

If packets are tunnelled from ingress gateway to egress gateway, the egress gateway can very easily determine in the datapath which ingress gateway a packet comes from (by simply looking at the source address of the tunnel header). This can facilitate operations such as computing the Congestion-Level-Estimate on a per ingress gateway basis.

End-to-end ECN:

The ECN field is used for PCN marking (see [PCN] for details), and so it needs to be re-set by the egress gateway to whatever has been agreed as appropriate for the next domain. Therefore if a packet arrives at the ingress gateway with its ECN field already set (i.e. not '00'), it may leave the egress gateway with a different value. Hence the end-to-end meaning of the ECN field is lost.

It is open to debate whether end-to-end congestion control is ever necessary within an end-to-end reservation. But if a genuine need is identified for end-to-end ECN semantics within a reservation, then one solution is to tunnel CL packets across the CL-region. When the egress gateway decapsulates them, the original ECN field is recovered.

5. Potential future extensions

5.1. Mechanisms to deal with 'Flash crowds'

There is a time lag between the admission control decision (which depends on the Congestion-Level-Estimate during RSVP signalling at call set-up) and when the data is actually sent (after the called party has answered). In PSTN terms this is the time the phone rings. Normally the time lag doesn't matter much because (1) in the CL-region there are many flows, and they terminate and are answered at roughly the same rate, and (2) the network can still operate safely when the traffic level is some margin above the configured-admission-rate.

A 'flash crowd' occurs when something causes many calls to be initiated in a short period of time - for instance a 'televote'. So there is a danger that a 'flash' of calls is accepted, but when the calls are answered and data flows, the traffic overloads the network. There are various possible ways an operator could try to address the problem.

The simplest option is to do nothing; the operator relies on the pre-emption mechanism if there is a problem. This doesn't seem a good choice, as 'flash crowds' are reasonably common on the PSTN, unless the operator can ensure that nearly all 'flash crowd' events are blocked in the access network and so do not impact on the CL-region.
A second option is to send 'dummy data' as soon as the call is admitted, thus effectively reserving the bandwidth whilst waiting for the called party to answer. Reserving bandwidth in advance means that the network cannot admit as many calls. For example, if sessions last 100 seconds and ringing lasts 10 seconds, the cost is a 10% loss of capacity. It may be possible to offset this somewhat by increasing the configured-admission-rate in the routers, but this would need further investigation.

A concern with this 'dummy data' option is that it may allow an attacker to initiate many calls that are never answered (by a cooperating attacker), so that eventually the network would only be carrying 'dummy data'. The attack exploits the fact that charging only starts when the call is answered and not when it is dialled. It may be possible to alleviate the attack at the session layer - for example, when the ingress gateway gets an RSVP PATH message it checks that the source has been well-behaved recently.

A third option is that the egress gateway limits the rate at which it sends out the Congestion-Level-Estimate, or limits the rate at which calls are accepted by replying with a Congestion-Level-Estimate of 100% (this is the equivalent of 'call gapping' in the PSTN). There is a trade-off, which would need to be investigated further, between the degree of protection and possible adverse side-effects like slowing down call set-up.

A final option is to re-perform admission control before the call is answered. The ingress gateway monitors Congestion-Level-Estimate updates received from each egress. If it notices that a Congestion-Level-Estimate has risen above the CLE-threshold, then it terminates all unanswered calls through that egress (e.g. by instructing the session protocol to stop the 'ringing tone'). For extra safety the Congestion-Level-Estimate could be re-checked when the call is answered. A potential drawback for an operator that wants to emulate the PSTN is that the PSTN never drops a 'ringing' call.

5.2. Multi-domain and multi-operator usage

This potential extension would eliminate the trust assumption (Section 2.2), so that the CL-region could consist of multiple domains run by different operators that did not trust each other. Then only the ingress and egress gateways of the CL-region would take part in the admission control procedure, i.e. at the ingress to the first domain and the egress from the final domain. The border routers between operators within the CL-region would only have to do bulk accounting - they wouldn't do per microflow metering and policing, and they wouldn't take part in signal processing or hold path state [Briscoe]. [Re-feedback] explains how a downstream domain can police that its upstream domain does not 'cheat' by admitting traffic when the downstream path is over-congested. [Re-PCN] proposes how to achieve this with the help of another recently proposed extension to ECN, involving re-echoing ECN feedback [Re-ECN].

5.3. Adaptive bandwidth for the Controlled Load service

The admission control mechanism described in this document assumes that each router has a fixed bandwidth allocated to CL flows. A possible extension is that the bandwidth is flexible, depending on the level of non-CL traffic.
If a large share of the current load on a path is CL, then more CL traffic can be admitted. And if the greater share of the load is non-CL, then the admission threshold can be proportionately lower. The approach re-arranges sharing between classes to aim for economic efficiency, whatever the traffic load matrix. It also deals with unforeseen changes to capacity during failures better than configuring fixed engineered rates. Adaptive bandwidth allocation can be achieved by changing the admission marking behaviour, so that the probability of admission marking a packet would now depend on the number of queued non-CL packets as well as on the size of the virtual queue. The adaptive bandwidth approach would be supplemented by placing limits on the adaptation, to prevent starvation of CL by other traffic classes and of other classes by CL traffic.

5.4. Controlled Load service with end-to-end Pre-Congestion Notification

It may be possible to extend the framework to parts of the network where there are only a small number of CL microflows, i.e. where the aggregation assumption (Section 2.2) doesn't hold. In the extreme it may be possible to operate the framework end-to-end, i.e. between end hosts. One potential method is to send probe packets to test whether the network can support a prospective new CL microflow. The probe packets would be sent at the same traffic rate as expected for the actual microflow, but in order not to disturb existing CL traffic a router would always schedule probe packets behind CL ones (compare [Breslau00]); this implies they have a new DSCP. Otherwise the routers would treat probe packets identically to CL packets. In order to perform admission control quickly in parts of the network where there are only a few CL microflows, the Pre-Congestion marking behaviour for probe packets would switch from admission marking no packets to admission marking them all for only a minimal increase in load.

5.5. MPLS-TE

It may be possible to extend the framework for admission control of microflows into a set of MPLS-TE (Multi-protocol label switching traffic engineering) aggregates. However, it would require that the MPLS header could include the ECN field, which is not precluded by [RFC3270].

6. Relationship to other QoS mechanisms

6.1. IntServ Controlled Load

The CL mechanism delivers QoS similar to Integrated Services Controlled Load, but rather better, as queues are kept empty by driving admission control from a bulk virtual queue [AVQ, vq] on each interface, which can detect a rise in load before real queues build. It is also more robust to route changes.

6.2. Integrated services operation over DiffServ

Our approach to end-to-end QoS is similar to that described in [RFC2998] for Integrated services operation over DiffServ networks. As in [RFC2998], an IntServ class (CL in our case) is achieved end-to-end, with a CL-region viewed as a single reservation hop in the total end-to-end path. Interior routers of the CL-region do not process flow signalling, nor do they hold state. Unlike [RFC2998], we do not require the end-to-end signalling mechanism to be RSVP, although it can be.

Bearing in mind these differences, we can describe our architecture in the terms of the options in [RFC2998].
The DiffServ network region is RSVP-aware, but awareness is confined to (what [RFC2998] calls) the "border routers" of the DiffServ region. We use explicit admission control into this region, with static provisioning within it. The ingress "border router" does per microflow policing and sets the DSCP and ECN fields to indicate that the packets are CL ones (i.e. we use router marking rather than host marking).

6.3. Differentiated Services

The DiffServ architecture does not specify any way for devices outside the domain to dynamically reserve resources or receive indications of network resource availability. In practice, service providers rely on subscription-time Service Level Agreements (SLAs) that statically define the parameters of the traffic that will be accepted from a customer. The CL mechanism allows dynamic reservation of resources through the DiffServ domain and, with the potential extension mentioned in Section 5.2, it can span multiple domains without active policing mechanisms at the borders (unlike DiffServ). Therefore we do not use the traffic conditioning agreements (TCAs) of the (informational) DiffServ architecture [RFC2475].

[Johnson] compares admission control with a 'generously dimensioned' DiffServ network as ways to achieve QoS, and recommends the former.

6.4. ECN

The marking behaviour described in this document complies with the ECN aspects of the IP wire protocol [RFC3168], but provides its own edge-to-edge feedback instead of the TCP aspects of [RFC3168]. All nodes within the CL-region are upgraded with the admission marking and pre-emption marking of Pre-Congestion Notification, so the requirements of [Floyd] are met because the CL-region is an enclosed environment. The operator prevents traffic arriving at a node that doesn't understand CL by administrative configuration of the ring of gateways around the CL-region.

6.5. RTECN

Real-time ECN (RTECN) [RTECN, RTECN-usage] has a similar aim to this document (to achieve a low delay, jitter and loss service suitable for real-time traffic) and a similar approach (per microflow admission control combined with an "early warning" of potential congestion through setting the CE codepoint). But it explores a different architecture, without the aggregation assumption: host-to-host rather than edge-to-edge. We plan to document such a host-to-host framework in a parallel draft to this one, and to describe if and how [PCN] can work in this framework.

6.6. RMD

Resource Management in DiffServ (RMD) [RMD] is similar to this work, in that it pushes complex classification, traffic conditioning and admission control functions to the edge of a DiffServ domain and simplifies the operation of the interior nodes. One of the RMD modes uses measurement-based admission control; however, it works differently: each interior node measures the user traffic load in the PHB traffic aggregate, and each interior node processes a local RESERVE message and compares the requested resources with the available resources (maximum allowed load minus current load).

Hence a difference is that the CL architecture described in this document has been designed not to require interaction between interior nodes and signalling, whereas in RMD all interior nodes are QoS-NSLP aware.
So our architecture involves less processing in interior nodes, is more agnostic to signalling, and requires fewer changes to existing standards; it therefore works with existing RSVP, as well as having the potential to work with future signalling protocols like NSIS.

RMD introduced the concept of Severe Congestion handling. The pre-emption mechanism described in the CL architecture has similar objectives but relies on different mechanisms.

We plan to work together with the authors of [RMD]; the intention is that the next versions of this draft and of [PCN] will be co-authored with them.

6.7. RSVP Aggregation over MPLS-TE

Multi-protocol label switching traffic engineering (MPLS-TE) allows scalable reservation of resources in the core for an aggregate of many microflows. To achieve end-to-end reservations, admission control and policing of microflows into the aggregate can be achieved using techniques such as RSVP Aggregation over MPLS-TE Tunnels, as per [AGGRE-TE]. However, in the case of inter-provider environments, these techniques require that admission control and policing be repeated at each trust boundary, or that MPLS-TE tunnels span multiple domains.

7. Security Considerations

To protect against denial of service attacks, the ingress gateway of the CL-region needs to police all CL packets and drop packets in excess of the reservation. This is similar to operations with existing IntServ behaviour.

For pre-emption, it is considered acceptable from a security perspective that the ingress gateway can treat "emergency/military" CL flows preferentially compared with "ordinary" CL flows. However, in the rest of the CL-region they are not distinguished (nonetheless, our proposed technique does not preclude the use of different DSCPs at the packet level as well as different priorities at the flow level). Keeping emergency traffic indistinguishable at the packet level minimises the opportunity for new security attacks. For example, if instead a mechanism used different DSCPs for "emergency/military" and "ordinary" packets, then an attacker could specifically target the former in the data plane (perhaps for DoS or for eavesdropping).

Further security aspects are to be considered in a later version of this document.

8. Acknowledgements

The admission control mechanism evolved from the work led by Martin Karsten on the Guaranteed Stream Provider developed in the M3I project [GSPa, GSP-TR], which in turn was based on the theoretical work of Gibbens and Kelly [DCAC]. Kennedy Cheng, Gabriele Corliano, Carla Di Cairano-Gilfedder, Kashaf Khan, Peter Hovell, Arnaud Jacquet and June Tay (BT) helped develop and evaluate this approach.

9. Comments solicited

Comments and questions are encouraged and very welcome. They can be sent to the Transport Area Working Group's mailing list, tsvwg@ietf.org, and/or to the authors.

10. Changes from earlier versions of the draft

The main changes are:

From -00 to -01

The whole of the pre-emption mechanism is added.

There are several modifications to the admission control mechanism.

From -01 to -02

The pre-congestion notification algorithms for admission marking and pre-emption marking are now described in [PCN].
There are new sub-sections in Section 4 on Failures, Admission of 'emergency / higher precedence' sessions, and Tunnelling; and a new sub-section in Section 5 on Mechanisms to deal with 'Flash crowds'.

11. Appendices

11.1. Appendix A: Explicit Congestion Notification

This Appendix provides a brief summary of Explicit Congestion Notification (ECN).

[RFC3168] specifies the incorporation of ECN into TCP and IP, including ECN's use of two bits in the IP header. It specifies a method for indicating incipient congestion to end-nodes (e.g. as in Random Early Detection (RED)), where the notification is through ECN marking packets rather than dropping them.

ECN uses two bits in the IP header of both IPv4 and IPv6 packets:

         0     1     2     3     4     5     6     7
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |          DS FIELD, DSCP           | ECN FIELD |
      +-----+-----+-----+-----+-----+-----+-----+-----+

        DSCP: differentiated services codepoint
        ECN:  Explicit Congestion Notification

      Figure A.1: The Differentiated Services and ECN Fields in IP.

The two bits of the ECN field have four ECN codepoints, '00' to '11':

      +-----+-----+
      | ECN FIELD |
      +-----+-----+
        ECT   CE
         0     0     Not-ECT
         0     1     ECT(1)
         1     0     ECT(0)
         1     1     CE

      Figure A.2: The ECN Field in IP.

The Not-ECT codepoint '00' indicates a packet that is not using ECN.

The CE codepoint '11' is set by a router to indicate congestion to the end nodes. The term 'CE packet' denotes a packet that has the CE codepoint set.

The ECN-Capable Transport (ECT) codepoints '10' and '01' (ECT(0) and ECT(1) respectively) are set by the data sender to indicate that the end-points of the transport protocol are ECN-capable. Routers treat the ECT(0) and ECT(1) codepoints as equivalent. Senders are free to use either the ECT(0) or the ECT(1) codepoint to indicate ECT, on a packet-by-packet basis. The use of two codepoints for ECT is motivated primarily by the desire to allow mechanisms for the data sender to verify that network elements are not erasing the CE codepoint, and that data receivers are properly reporting to the sender the receipt of packets with the CE codepoint set.

ECN requires support from the transport protocol, in addition to the functionality given by the ECN field in the IP packet header. [RFC3168] addresses the addition of ECN capability to TCP, specifying three new pieces of functionality: negotiation between the endpoints during connection setup to determine if they are both ECN-capable; an ECN-Echo (ECE) flag in the TCP header so that the data receiver can inform the data sender when a CE packet has been received; and a Congestion Window Reduced (CWR) flag in the TCP header so that the data sender can inform the data receiver that the congestion window has been reduced.

The transport layer (e.g. TCP) must respond, in terms of congestion control, to a *single* CE packet as it would to a packet drop.

The advantage of setting the CE codepoint as an indication of congestion, instead of relying on packet drops, is that it allows the receiver(s) to receive the packet, thus avoiding the potential for excessive delays due to retransmissions after packet losses.
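For concreteness, the codepoint table can be expressed in a few lines of Python (illustrative only; per [RFC3168] the ECN field occupies the two least significant bits of the former IPv4 TOS octet / IPv6 Traffic Class octet):

   ECN_CODEPOINTS = {0b00: 'Not-ECT', 0b01: 'ECT(1)',
                     0b10: 'ECT(0)', 0b11: 'CE'}

   def ecn_codepoint(traffic_class_octet):
       """Return the ECN codepoint name for a TOS / Traffic Class octet."""
       return ECN_CODEPOINTS[traffic_class_octet & 0b11]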
11.2. Appendix B: What is distributed measurement-based admission control?

This Appendix briefly explains what distributed measurement-based admission control is [Breslau99].

Traditional admission control algorithms for 'hard' real-time services (those providing a firm delay bound, for example) guarantee QoS by using 'worst case analysis'. Each time a flow is admitted, its traffic parameters are examined and the network re-calculates the remaining resources. When the network gets a new request, it therefore knows for certain whether the prospective flow, with its particular parameters, should be admitted. However, parameter-based admission control algorithms result in under-utilisation when the traffic is bursty. Therefore 'soft' real-time services - like Controlled Load - can use a more relaxed admission control algorithm.

This insight suggests measurement-based admission control (MBAC). The aim of MBAC is to provide a statistical service guarantee. The classic scenario for MBAC is where each node participates in hop-by-hop admission control, characterising existing traffic locally through measurements (instead of keeping an accurate track of traffic as it is admitted), in order to determine the current value of some parameter, e.g. load. Note that for scalability the measurement is of the aggregate of the flows in the local system. The measured parameter(s) is then compared with the requirements of the prospective flow to see whether it should be admitted.

MBAC may also be performed centrally for a network, in which case it uses centralised measurements by a bandwidth broker.

We use distributed MBAC. "Distributed" means that the measurement is accumulated for the 'whole path' using in-band signalling. In our case, this means that the measurement of existing traffic is for the same pair of ingress and egress gateways as the prospective microflow.

In fact our mechanism can be said to be distributed in three ways: all nodes on the ingress-egress path affect the Congestion-Level-Estimate; the admission control decision is made just once on behalf of all the nodes on the path across the CL-region; and the ingress and egress gateways cooperate to perform MBAC.

11.3. Appendix C: Calculating the Exponentially weighted moving average (EWMA)

At the egress gateway, for every CL packet arrival:

   [EWMA-total-bits]n+1 = (w * bits-in-packet) + ((1-w) * [EWMA-total-bits]n)

   [EWMA-AM-bits]n+1 = (B * w * bits-in-packet) + ((1-w) * [EWMA-AM-bits]n)

Then, per new flow arrival:

   [Congestion-Level-Estimate]n+1 = [EWMA-AM-bits]n+1 / [EWMA-total-bits]n+1

where

EWMA-total-bits is the total number of bits in CL packets, calculated as an exponentially weighted moving average (EWMA)

EWMA-AM-bits is the total number of bits in CL packets that are Admission Marked, again calculated as an EWMA.

B is either 0 or 1:

   B = 0 if the CL packet is not admission marked

   B = 1 if the CL packet is admission marked

w is the exponential weighting factor.

Varying the value of the weight trades off between the smoothness and responsiveness of the Congestion-Level-Estimate. However, in general both can be achieved, given our original assumption of many CL microflows and remembering that the EWMA is calculated on the basis of aggregate traffic between the ingress and egress gateways.
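A direct transcription of these algorithms into Python may make them easier to follow (illustrative only; the weight value is an arbitrary example, and an egress gateway would keep one such pair of counters per ingress gateway):

   W = 1.0 / 16   # exponential weighting factor w; example value only

   ewma_total_bits = 0.0
   ewma_am_bits = 0.0

   def on_cl_packet(bits_in_packet, admission_marked):
       """Update the EWMAs on every CL packet arrival."""
       global ewma_total_bits, ewma_am_bits
       b = 1 if admission_marked else 0
       ewma_total_bits = (W * bits_in_packet) + ((1 - W) * ewma_total_bits)
       ewma_am_bits = (b * W * bits_in_packet) + ((1 - W) * ewma_am_bits)

   def congestion_level_estimate():
       """Evaluated per new flow arrival."""
       return ewma_am_bits / ewma_total_bits if ewma_total_bits else 0.0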
There will be a threshold inter-arrival time between packets of the same aggregate, above which the egress will consider the Congestion-Level-Estimate too stale; it will then trigger generation of probes by the ingress.

The first two per-packet algorithms can be simplified if their only use is for the result of one to be divided by the result of the other in the third, per-flow algorithm:

   [EWMA-total-bits]'n+1 = bits-in-packet + (w' * [EWMA-total-bits]n)

   [EWMA-AM-bits]'n+1 = (B * bits-in-packet) + (w' * [EWMA-AM-bits]n)

where w' = (1-w)/w.

If w' is arranged to be a power of 2, these per-packet algorithms can be implemented solely with a shift and an add.

12. References

A later version will distinguish normative and informative references.

[AGGRE-TE] Francois Le Faucheur, Michael Dibiasio, Bruce Davie, Michael Davenport, Chris Christou, Jerry Ash, Bur Goode, 'Aggregation of RSVP Reservations over MPLS TE/DS-TE Tunnels', draft-ietf-tsvwg-rsvp-dste-00 (work in progress), July 2005

[ANSI.MLPP.Spec] American National Standards Institute, 'Telecommunications - Integrated Services Digital Network (ISDN) - Multi-Level Precedence and Pre-emption (MLPP) Service Capability', ANSI T1.619-1992 (R1999), 1992.

[ANSI.MLPP.Supplement] American National Standards Institute, 'MLPP Service Domain Cause Value Changes', ANSI T1.619a-1994 (R1999), 1990.

[AVQ] S. Kunniyur and R. Srikant, 'Analysis and Design of an Adaptive Virtual Queue (AVQ) Algorithm for Active Queue Management', In: Proc. ACM SIGCOMM'01, Computer Communication Review 31 (4) (October 2001).

[Breslau99] L. Breslau, S. Jamin, S. Shenker, 'Measurement-based admission control: what is the research agenda?', In: Proc. Int'l Workshop on Quality of Service 1999.

[Breslau00] L. Breslau, E. Knightly, S. Shenker, I. Stoica, H. Zhang, 'Endpoint Admission Control: Architectural Issues and Performance', In: ACM SIGCOMM 2000

[Briscoe] Bob Briscoe and Steve Rudkin, 'Commercial Models for IP Quality of Service Interconnect', BT Technology Journal, Vol 23 No 2, April 2005.

[DCAC] Richard J. Gibbens and Frank P. Kelly, 'Distributed connection acceptance control for a connectionless network', In: Proc. International Teletraffic Congress (ITC16), Edinburgh, pp. 941-952 (1999).

[EMERG-RQTS] Carlberg, K. and R. Atkinson, 'General Requirements for Emergency Telecommunication Service (ETS)', RFC 3689, February 2004.

[EMERG-TEL] Carlberg, K. and R. Atkinson, 'IP Telephony Requirements for Emergency Telecommunication Service (ETS)', RFC 3690, February 2004.
[Floyd] S. Floyd, 'Specifying Alternate Semantics for the Explicit Congestion Notification (ECN) Field', draft-floyd-ecn-alternates-02.txt (work in progress), August 2005

[GSPa] Karsten (Ed.), Martin, 'GSP/ECN Technology & Experiments', Deliverable 15.3 PtIII, M3I EU Vth Framework Project IST-1999-11429, URL: http://www.m3i.org/ (February 2002) (superseded by [GSP-TR])

[GSP-TR] Martin Karsten and Jens Schmitt, 'Admission Control Based on Packet Marking and Feedback Signalling -- Mechanisms, Implementation and Experiments', TU-Darmstadt Technical Report TR-KOM-2002-03, URL: http://www.kom.e-technik.tu-darmstadt.de/publications/abstracts/KS02-5.html (May 2002)

[ITU.MLPP.1990] International Telecommunications Union, 'Multilevel Precedence and Pre-emption Service (MLPP)', ITU-T Recommendation I.255.3, 1990.

[Johnson] DM Johnson, 'QoS control versus generous dimensioning', BT Technology Journal, Vol 23 No 2, April 2005

[PCN] B. Briscoe, P. Eardley, D. Songhurst, F. Le Faucheur, A. Charny, V. Liatsos, S. Dudley, J. Babiarz, K. Chan, 'Pre-Congestion Notification marking', draft-briscoe-tsvwg-cl-phb-01 (work in progress), March 2006.

[Re-ECN] Bob Briscoe, Arnaud Jacquet, Alessandro Salvatori, 'Re-ECN: Adding Accountability for Causing Congestion to TCP/IP', draft-briscoe-tsvwg-re-ecn-tcp-01 (work in progress), March 2006.

[Re-feedback] Bob Briscoe, Arnaud Jacquet, Carla Di Cairano-Gilfedder, Andrea Soppera, 'Re-feedback for Policing Congestion Response in an Inter-network', ACM SIGCOMM 2005, August 2005.

[Re-PCN] B. Briscoe, 'Emulating Border Flow Policing using Re-ECN on Bulk Data', draft-briscoe-tsvwg-re-ecn-border-cheat-00 (work in progress), February 2006.

[Reid] ABD Reid, 'Economics and scalability of QoS solutions', BT Technology Journal, Vol 23 No 2, April 2005

[RFC2211] J. Wroclawski, 'Specification of the Controlled-Load Network Element Service', RFC 2211, September 1997

[RFC2309] Braden, B., et al., 'Recommendations on Queue Management and Congestion Avoidance in the Internet', RFC 2309, April 1998.

[RFC2474] Nichols, K., Blake, S., Baker, F. and D. Black, 'Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers', RFC 2474, December 1998

[RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z. and W. Weiss, 'A framework for Differentiated Services', RFC 2475, December 1998.

[RFC2597] Heinanen, J., Baker, F., Weiss, W. and J. Wroclawski, 'Assured Forwarding PHB Group', RFC 2597, June 1999.

[RFC2998] Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang, L., Speer, M., Braden, R., Davie, B., Wroclawski, J. and E. Felstaine, 'A Framework for Integrated Services Operation Over DiffServ Networks', RFC 2998, November 2000.

[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, 'The Addition of Explicit Congestion Notification (ECN) to IP', RFC 3168, September 2001.

[RFC3246] B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le Boudec, W. Courtney, S. Davari, V. Firoiu, D. Stiliadis, 'An Expedited Forwarding PHB (Per-Hop Behavior)', RFC 3246, March 2002.

[RFC3270] Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen, P., Krishnan, R., Cheval, P., and J. Heinanen, 'Multi-Protocol Label Switching (MPLS) Support of Differentiated Services', RFC 3270, May 2002.
[RMD] Attila Bader, Lars Westberg, Georgios Karagiannis, Cornelia Kappler, Tom Phelan, 'RMD-QOSM - The Resource Management in DiffServ QoS model', draft-ietf-nsis-rmd-03 (work in progress), June 2005.

[RSVP-ECN] Francois Le Faucheur, Anna Charny, Bob Briscoe, Philip Eardley, Joe Babiarz, Kwok-Ho Chan, 'RSVP Extensions for Admission Control over DiffServ using Pre-congestion Notification', draft-lefaucheur-rsvp-ecn-00 (work in progress), October 2005.

[RTECN] Babiarz, J., Chan, K. and V. Firoiu, 'Congestion Notification Process for Real-Time Traffic', draft-babiarz-tsvwg-rtecn-04 (work in progress), July 2005.

[RTECN-usage] Alexander, C., Ed., Babiarz, J. and J. Matthews, 'Admission Control Use Case for Real-time ECN', draft-alexander-rtecn-admission-control-use-case-00 (work in progress), February 2005.

[vq] Costas Courcoubetis and Richard Weber, 'Buffer Overflow Asymptotics for a Switch Handling Many Traffic Sources', In: Journal of Applied Probability 33 pp. 886-903 (1996).

Authors' Addresses

   Bob Briscoe
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: bob.briscoe@bt.com

   Dave Songhurst
   BT Research
   B54/69, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: dsonghurst@jungle.bt.co.uk

   Philip Eardley
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: philip.eardley@bt.com

   Francois Le Faucheur
   Cisco Systems, Inc.
   Village d'Entreprise Green Side - Batiment T3
   400, Avenue de Roumanille
   06410 Biot Sophia-Antipolis
   France
   Email: flefauch@cisco.com

   Anna Charny
   Cisco Systems
   300 Apollo Drive
   Chelmsford, MA 01824
   USA
   Email: acharny@cisco.com

   Kwok Ho Chan
   Nortel Networks
   600 Technology Park Drive
   Billerica, MA 01821
   USA
   Email: khchan@nortel.com

   Jozef Z. Babiarz
   Nortel Networks
   3500 Carling Avenue
   Ottawa, Ont K2H 8E9
   Canada
   Email: babiarz@nortel.com

   Stephen Dudley
   Nortel Networks
   4001 E. Chapel Hill Nelson Highway
   P.O. Box 13010, ms 570-01-0V8
   Research Triangle Park, NC 27709
   USA
   Email: smdudley@nortel.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.