TSVWG Working Group                                         G. Fairhurst
Internet-Draft                                    University of Aberdeen
Intended status: Best Current Practice                 November 01, 2015
Expires: May 4, 2016


                   Network Transport Circuit Breakers
                  draft-ietf-tsvwg-circuit-breaker-08

Abstract

   This document explains what is meant by the term "network transport
   Circuit Breaker" (CB).
   It describes the need for circuit breakers when using network
   tunnels, and other non-congestion controlled applications, and
   explains where circuit breakers are, and are not, needed. It also
   defines requirements for building a circuit breaker and the expected
   outcomes of using a circuit breaker within the Internet.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Types of Circuit Breaker
   2.  Terminology
   3.  Design of a Circuit-Breaker (What makes a good circuit breaker?)
     3.1.  Functional Components
   4.  Requirements for a Network Transport Circuit Breaker
   5.  Other network topologies
     5.1.  Use with a multicast control/routing protocol
     5.2.  Use with control protocols supporting pre-provisioned
           capacity
     5.3.  Unidirectional Circuit Breakers over Controlled Paths
   6.  Examples of Circuit Breakers
     6.1.  A Fast-Trip Circuit Breaker
       6.1.1.  A Fast-Trip Circuit Breaker for RTP
     6.2.  A Slow-trip Circuit Breaker
     6.3.  A Managed Circuit Breaker
       6.3.1.  A Managed Circuit Breaker for SAToP Pseudo-Wires
       6.3.2.  A Managed Circuit Breaker for Pseudowires (PWs)
   7.  Examples where circuit breakers may not be needed
     7.1.  CBs over pre-provisioned Capacity
     7.2.  CBs with tunnels carrying Congestion-Controlled Traffic
     7.3.  CBs with Uni-directional Traffic and no Control Path
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgments
   11. Revision Notes
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Author's Address

1.  Introduction

   A network transport Circuit Breaker (CB) is an automatic mechanism
   that is used to continuously monitor a flow or aggregate set of
   flows. The mechanism seeks to detect when the flow(s) experience
   persistent congestion and, when this is detected, to terminate (or
   significantly reduce the rate of) the flow(s). This is a safety
   measure to prevent starvation of network resources denying other
   flows access to the Internet. Such measures are essential for an
   Internet that is heterogeneous and for traffic that is hard to
   predict in advance. Avoiding persistent congestion is important to
   reduce the potential for "Congestion Collapse" [RFC2914].

   The term "Circuit Breaker" originates in electricity supply, and has
   nothing to do with network circuits or virtual circuits. In
   electricity supply, a Circuit Breaker is intended as a protection
   mechanism of last resort. Under normal circumstances, a Circuit
   Breaker ought not to be triggered; it is designed to protect the
   supply network and attached equipment when there is overload.
   Similarly, people do not expect the electrical circuit-breaker (or
   fuse) in their home to be triggered, except when there is a wiring
   fault or a problem with an electrical appliance.

   In networking, the Circuit Breaker principle can be used as a
   protection mechanism of last resort to avoid persistent congestion
   impacting other flows that share network capacity. Persistent
   congestion was a feature of the early Internet of the 1980s. This
   resulted in excess traffic starving other connections of access to
   the Internet. It was countered by the requirement to use congestion
   control (CC) in the Transmission Control Protocol (TCP) [Jacobsen88]
   [RFC1112]. These mechanisms operate in Internet hosts to cause TCP
   connections to "back off" during congestion.
   The introduction of a Congestion Controller in TCP (currently
   documented in [RFC5681]) ensured the stability of the Internet,
   because it was able to detect congestion and promptly react. This
   worked well while TCP was by far the dominant traffic in the
   Internet, and most TCP flows were long-lived (ensuring that they
   could detect and respond to congestion before the flows terminated).
   This is no longer the case, and non-congestion-controlled traffic,
   including many applications of the User Datagram Protocol (UDP), can
   form a significant proportion of the total traffic traversing a
   link. The current Internet therefore requires that
   non-congestion-controlled traffic be considered to avoid persistent
   congestion.

   There are important differences between a transport circuit-breaker
   and a congestion-control method. Specifically, congestion control
   (as implemented in TCP, SCTP, and DCCP) operates on a timescale on
   the order of a packet round-trip-time (RTT), the time from sender to
   destination and return. Congestion control methods are able to react
   to a single packet loss/marking and continuously adapt to reduce the
   transmission rate for each loss or congestion event. The goal is
   usually to limit the maximum transmission rate to a rate that
   reflects a fair use of the available capacity across a network path.
   These methods typically operate on individual traffic flows (e.g., a
   5-tuple).

   In contrast, Circuit Breakers are recommended for non-congestion-
   controlled Internet flows and for traffic aggregates, e.g., traffic
   sent using a network tunnel. They operate on timescales much longer
   than the packet RTT, and trigger under situations of abnormal
   (excessive) congestion.
   People have been implementing what this draft characterizes as
   circuit breakers on an ad hoc basis to protect Internet traffic; this
   draft therefore provides guidance on how to deploy and use these
   mechanisms. Later sections provide examples of cases where circuit-
   breakers may or may not be desirable.

   A Circuit Breaker needs to measure (meter) the traffic to determine
   if the network is experiencing congestion and needs to be designed to
   trigger robustly when there is persistent congestion.

   A Circuit Breaker trigger will often utilize a series of successive
   sample measurements metered at an ingress point and an egress point
   (either of which could be a transport endpoint). The trigger needs
   to operate on a timescale much longer than the path round trip time
   (e.g., seconds to possibly many tens of seconds). This longer period
   is needed to provide sufficient time for transports (or applications)
   to adjust their rate following congestion, and for the network load
   to stabilize after any adjustment. This is to ensure that a Circuit
   Breaker does not accidentally trigger following a single congestion
   event (or even successive congestion events). Congestion events are
   what trigger congestion control, and are to be regarded as normal on
   a network link operating near its capacity. Once triggered, a
   control function needs to remove traffic from the network, either by
   disabling the flow or by significantly reducing the level of traffic.
   This reaction provides the required protection to prevent persistent
   congestion being experienced by other flows that share the congested
   part of the network path.

   Section 4 defines requirements for building a Circuit Breaker.

   The operational conditions that cause a Circuit Breaker to trigger
   should be regarded as abnormal.
   Examples of situations that could trigger a Circuit Breaker include:

   o  anomalous traffic that exceeds the provisioned capacity (or whose
      traffic characteristics exceed the threshold configured for the
      Circuit Breaker);

   o  traffic generated by an application at a time when the provisioned
      network capacity is being utilised for other purposes;

   o  routing changes that cause additional traffic to start using the
      path monitored by the Circuit Breaker;

   o  misconfiguration of a service/network device where the capacity
      available is insufficient to support the current traffic
      aggregate;

   o  misconfiguration of an admission controller or traffic policer
      that allows more traffic than expected across the path monitored
      by the Circuit Breaker.

   In many cases the reason for triggering a Circuit Breaker will not be
   evident to the source of the traffic (user, application, endpoint,
   etc.). In contrast, an application that uses congestion control will
   generate elastic traffic that may be expected to regulate the load it
   introduces under congestion. This will therefore often be a
   preferred solution for applications that can respond to congestion
   signals or that can use a congestion-controlled transport. A Circuit
   Breaker can be used with traffic that is unable, or chooses not, to
   use congestion control, or where the congestion control properties of
   the traffic cannot be relied upon (e.g., traffic carried over a
   network tunnel).

1.1.  Types of Circuit Breaker

   There are various forms of network transport circuit breaker. These
   are differentiated mainly on the timescale over which they are
   triggered, but also in the intended protection they offer:

   o  Fast-Trip Circuit Breakers: The relatively short timescale used by
      this form of circuit breaker is intended to provide protection for
      network traffic from a single flow or related group of flows.

   o  Slow-Trip Circuit Breakers: This circuit breaker utilizes a longer
      timescale and is designed to protect network traffic from
      congestion by traffic aggregates.

   o  Managed Circuit Breakers: Utilize the operations and management
      functions that might be present in a managed service to implement
      a circuit breaker.

   Examples of each type of circuit breaker are provided in Section 6.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Design of a Circuit-Breaker (What makes a good circuit breaker?)

   Although circuit breakers have been talked about in the IETF for many
   years, there has not yet been guidance on the cases where circuit
   breakers are needed or upon the design of circuit breaker mechanisms.
   This document seeks to offer advice on these two topics.

   Circuit Breakers are RECOMMENDED for IETF protocols and tunnels that
   carry non-congestion-controlled Internet flows and for traffic
   aggregates. This includes traffic sent using a network tunnel.
   Designers of other protocols and tunnel encapsulations also ought to
   consider the use of these techniques as a last resort to protect
   traffic that shares the network path being used.

   This document defines the requirements for design of a Circuit
   Breaker and provides examples of how a Circuit Breaker can be
   constructed. The specifications of individual protocols and tunnel
   encapsulations need to detail the protocol mechanisms needed to
   implement a Circuit Breaker.

   Section 3.1 describes the functional components of a circuit breaker
   and Section 4 defines requirements for implementing a Circuit
   Breaker.

3.1.  Functional Components

   The basic design of a transport circuit breaker involves
   communication between an ingress point (a sender) and an egress point
   (a receiver) of a network flow or set of flows. A simple picture of
   Circuit Breaker operation is provided in figure 1. This shows a set
   of routers (each labelled R) connecting a set of endpoints.

   A Circuit Breaker is used to control traffic passing through a subset
   of these routers, acting between the network devices at the ingress
   and egress points. The path between the ingress and egress could be
   provided by a tunnel or other network-layer technique. One expected
   use would be at the ingress and egress of a service, where all
   traffic being considered terminates beyond the egress point, and
   hence the ingress and egress carry the same set of flows.

 +--------+                                                   +--------+
 |Endpoint|                                                   |Endpoint|
 +--+-----+          >>> circuit breaker traffic >>>          +--+-----+
    |                                                            |
    | +-+  +-+  +---------+  +-+  +-+  +-+  +--------+  +-+  +-+ |
    +-+R+--+R+->+ Ingress +--+R+--+R+--+R+--+ Egress |--+R+--+R+-+
      +++  +-+  +------+--+  +-+  +-+  +-+  +-----+--+  +++  +-+
       |         ^     |                          |      |
       |         |  +--+------+            +------+--+   |
       |         |  | Ingress |            | Egress  |   |
       |         |  | Meter   |            | Meter   |   |
       |         |  +----+----+            +----+----+   |
       |         |       |                      |        |
  +-+  |         |  +----+----+                 |        |  +-+
  |R+--+         |  | Measure +<----------------+        +--+R|
  +++            |  +----+----+      Reported               +++
   |             |       |           Egress                  |
   |             |  +----+----+      Measurement             |
+--+-----+       |  | Trigger |                           +--+-----+
|Endpoint|       |  +----+----+                           |Endpoint|
+--------+       |       |                                +--------+
                 +---<---+
                  Reaction

   Figure 1: A CB controlling the part of the end-to-end path between an
   ingress point and an egress point. (Note: In some cases, the trigger
   and measure functions could alternatively be located at other
   locations, e.g., at a network operations centre.)

   In the context of a Circuit Breaker, the ingress and egress functions
   could be implemented in different places.
   For example, they could be located in network devices at a tunnel
   ingress and at the tunnel egress. In some cases, they could be
   located at one or both network endpoints (see figure 2), implemented
   as components within a transport protocol.

          +----------+                 +----------+
          | Ingress  |  +-+  +-+  +-+  | Egress   |
          | Endpoint +->+R+--+R+--+R+--+ Endpoint |
          +--+----+--+  +-+  +-+  +-+  +----+-----+
             ^    |                         |
             | +--+------+             +----+----+
             | | Ingress |             | Egress  |
             | | Meter   |             | Meter   |
             | +----+----+             +----+----+
             |      |                       |
             | +----+----+                  |
             | | Measure +<-----------------+
             | +----+----+      Reported
             |      |           Egress
             | +----+----+      Measurement
             | | Trigger |
             | +----+----+
             |      |
             +---<--+
              Reaction

   Figure 2: An endpoint CB implemented at the sender (ingress) and
   receiver (egress).

   The set of components needed to implement a Circuit Breaker are:

   1.  An ingress meter (at the sender or tunnel ingress) records the
       number of packets/bytes sent in each measurement interval. This
       measures the offered network load for a flow or set of flows.
       For example, the measurement interval could be many seconds (or
       a few tens of seconds, or a series of successive shorter
       measurements that are combined by the Circuit Breaker Measurement
       function).

   2.  An egress meter (at the receiver or tunnel egress) records the
       number of packets/bytes received in each measurement interval.
       This measures the supported load for the flow or set of flows,
       and could utilize other signals to detect the effect of
       congestion (e.g., loss/marking experienced over the path). The
       measurements at the egress could be synchronised (including an
       offset for the time of flight of the data, or referencing the
       measurements to a particular packet) to ensure any counters refer
       to the same span of packets.

   3.  The measured values at the ingress and egress are communicated to
       the Circuit Breaker Measurement function. This could use several
       methods including: sending return measurement packets from a
       receiver to a trigger function at the sender; an implementation
       using Operations, Administration and Management (OAM); or
       sending another in-band signalling datagram to the trigger
       function. This could also be implemented purely as a control
       plane function, e.g., using a software-defined network
       controller.

   4.  The measurement function combines the ingress and egress
       measurements to assess the present level of network congestion.
       (For example, the loss rate for each measurement interval could
       be deduced from calculating the difference between ingress and
       egress counter values.) Note that the method does not require
       high accuracy for the period of the measurement interval (or
       therefore for the measured value), since isolated and/or
       infrequent loss events need to be disregarded.

   5.  A trigger function determines whether the measurements indicate
       persistent congestion. This function defines an appropriate
       threshold for determining that there is persistent congestion
       between the ingress and egress. This preferably considers a rate
       or ratio, rather than an absolute value (e.g., more than 10%
       loss, but other methods could also be based on the rate of
       transmission as well as the loss rate). The transport Circuit
       Breaker is triggered when the threshold is exceeded in multiple
       measurement intervals (e.g., 3 successive measurements). Designs
       need to be robust so that single or spurious events do not
       trigger a reaction.

   6.  A reaction that is applied at the Ingress when the Circuit
       Breaker is triggered. This seeks to automatically remove the
       traffic causing persistent congestion.

   7.  A feedback mechanism that triggers when the measurements from
       the receiver (egress), or from both the ingress and egress, are
       not available, since this also could indicate a loss of control
       packets (also a symptom of heavy congestion or inability to
       control the load).

4.  Requirements for a Network Transport Circuit Breaker

   The requirements for implementing a Circuit Breaker are:

   o  There MUST be a communication path used for control messages from
      the ingress meter and the egress meter to the point of
      measurement. The Circuit Breaker MUST trigger if there is a
      failure of the communication path used for the control messages.
      That is, the feedback indicating a congested period needs to be
      designed so that the Circuit Breaker is triggered when it fails to
      receive measurement reports that indicate an absence of
      congestion, rather than relying on the successful transmission of
      a "congested" signal back to the sender. (The feedback signal
      could itself be lost under congestion.)

   o  A Circuit Breaker MUST define a measurement period over which the
      Circuit Breaker Measurement function measures the level of
      congestion or loss. This method does not have to detect
      individual packet loss, but MUST have a way to know that packets
      have been lost/marked from the traffic flow. If Explicit
      Congestion Notification (ECN) is enabled [RFC3168], an egress
      meter MAY also count the number of ECN congestion marks/events per
      measurement interval, but even if ECN is used, loss MUST still be
      measured, since this better reflects the impact of persistent
      congestion. In this context, loss represents a reliable
      indication of congestion, as opposed to the finer-grain marking of
      incipient congestion that can be provided via ECN. The type of
      Circuit Breaker will determine how long this measurement period
      needs to be.

   o  The measurement period used by a Circuit Breaker Measurement
      function MUST be longer than the time that current Congestion
      Control algorithms need to reduce their rate following detection
      of congestion. This is important because end-to-end Congestion
      Control algorithms require at least one RTT to notify and adjust
      the traffic to experienced congestion, and congestion bottlenecks
      can share traffic with a diverse range of RTTs. The measurement
      period is therefore expected to be significantly longer than the
      RTT experienced by the Circuit Breaker itself.

   o  If necessary, a Circuit Breaker MAY combine successive individual
      meter samples from the ingress and egress to ensure observation of
      an average over a sufficiently long interval. (Note that when
      meter samples need to be combined, the combination needs to
      reflect the sum of the individual sample counts divided by the
      total time/volume over which the samples were measured.
      Individual samples over different intervals cannot be directly
      combined to generate an average value.)

   o  A Circuit Breaker is REQUIRED to define a threshold to determine
      whether the measured congestion is considered excessive.

   o  A Circuit Breaker is REQUIRED to define the triggering interval,
      defining the period over which the trigger uses the collected
      measurements. Circuit Breakers need to trigger over a
      sufficiently long period (e.g., many path RTTs) to avoid
      additionally penalizing flows with a long path RTT.

   o  A Circuit Breaker MUST be robust to multiple congestion events.
      This usually will define a number of measured persistent
      congestion events per triggering period. For example, a Circuit
      Breaker MAY combine the results of several measurement periods to
      determine if the Circuit Breaker is triggered (e.g., triggered
      when persistent congestion is detected in 3 of the measurements
      within the triggering interval).

   o  A Circuit Breaker SHOULD be constructed so that it does not
      trigger under light or intermittent congestion.

   o  The default response to a trigger SHOULD disable all traffic that
      contributed to congestion.

   o  Once triggered, the Circuit Breaker MUST react decisively by
      disabling or significantly reducing traffic at the source (e.g.,
      ingress). A reaction that results in a reduction SHOULD result in
      reducing the traffic by at least an order of magnitude, each time
      the Circuit Breaker is triggered. This response needs to be much
      more severe than that of a Congestion Controller algorithm (such
      as TCP's congestion control [RFC5681] or TCP-Friendly Rate
      Control, TFRC [RFC5348]), because the Circuit Breaker reacts to
      more persistent congestion and operates over longer timescales
      (i.e., the overload condition will have persisted for a longer
      time before the Circuit Breaker is triggered).

   o  A Circuit Breaker that reduces the rate of a flow MUST continue
      to monitor the level of congestion and MUST further reduce the
      rate if the Circuit Breaker is again triggered.

   o  The reaction to a triggered Circuit Breaker MUST continue for a
      period that is at least the triggering interval. Operator
      intervention will usually be required to restore a flow. If an
      automated response is needed to reset the trigger, then this needs
      to not be immediate. The design of an automated reset mechanism
      needs to be sufficiently conservative that it does not adversely
      interact with other mechanisms (including other Circuit Breaker
      algorithms that control traffic over a common path). It SHOULD
      NOT perform an automated reset when there is evidence of continued
      congestion.

   o  When a Circuit Breaker is triggered, it SHOULD be regarded as an
      abnormal network event. As such, this event SHOULD be logged.
      The measurements that lead to triggering of the Circuit Breaker
      SHOULD also be logged.

   o  A Circuit Breaker requires control communication between endpoints
      and/or network devices. The source and integrity of control
      information (measurements and triggers) MUST be protected from
      off-path attacks (Section 8). The circuit breaker MUST be
      designed to be robust to packet loss that can also be experienced
      during congestion/overload. This does not imply that it is
      desirable to provide reliable delivery (e.g., over TCP), since
      this can incur additional delay in responding to congestion.
      Appropriate mechanisms could be to duplicate control messages to
      provide increased robustness to loss, and/or to regard a lack of
      control traffic as an indication that excessive congestion may be
      being experienced [ID-ietf-tsvwg-RFC5405.bis].

   o  The control communication may be in-band or out-of-band. In-band
      communication is RECOMMENDED when either design would be possible.
      If this traffic is sent over a shared path, it is RECOMMENDED that
      this control traffic is prioritized to reduce the probability of
      loss under congestion. Control traffic also needs to be
      considered when provisioning a network that uses a circuit
      breaker.

      In-Band:  An in-band control method SHOULD assume that loss of
         control messages is an indication of potential congestion on
         the path, and repeated loss ought to cause the Circuit Breaker
         to be triggered. This design has the advantage that it
         provides fate-sharing of the traffic flow(s) and the control
         communications.

      Out-of-Band:  An out-of-band control method SHOULD NOT trigger
         Circuit Breaker reaction when there is loss of control messages
         (e.g., a loss of measurements). This avoids failure
         amplification/propagation when the measurement and data paths
         fail independently.
         A failure of an out-of-band communication path SHOULD be
         regarded as an abnormal network event and be handled as
         appropriate for the network, e.g., this event SHOULD be logged,
         and additional network operator action might be appropriate,
         depending on the network and the traffic involved.

5.  Other network topologies

   A Circuit Breaker can be deployed in networks with topologies
   different to that presented in figure 2. This section describes
   examples of such usage, and possible places where functions may be
   implemented.

5.1.  Use with a multicast control/routing protocol

  +----------+                 +--------+  +----------+
  | Ingress  |  +-+  +-+  +-+  | Egress |  |  Egress  |
  | Endpoint +->+R+--+R+--+R+--+ Router |--+ Endpoint +->+
  +----+-----+  +-+  +-+  +-+  +---+--+-+  +----+-----+  |
       ^         ^    ^    ^       |  ^         |        |
       |         |    |    |       |  |         |        |
  +----+----+    + - - - < - - - - +  |    +----+----+   | Reported
  | Ingress |      multicast Prune    |    | Egress  |   | Ingress
  | Meter   |                         |    | Meter   |   | Measurement
  +---------+                         |    +----+----+   |
                                      |         |        |
                                      |    +----+----+   |
                                      |    | Measure +<--+
                                      |    +----+----+
                                      |         |
                                      |    +----+----+
                            multicast |    | Trigger |
                            Leave     |    +----+----+
                            Message   |         |
                                      +----<----+

   Figure 3: An example of a multicast CB controlling the end-to-end
   path between an ingress endpoint and an egress endpoint.

   Figure 3 shows one example of how a multicast circuit breaker could
   be implemented at a pair of multicast endpoints (e.g., to implement a
   Fast-Trip Circuit Breaker, Section 6.1). The ingress endpoint (the
   sender that sources the multicast traffic) meters the ingress load,
   generating an ingress measurement (e.g., recording timestamped packet
   counts), and sends this measurement to the multicast group together
   with the traffic it has measured.

   Routers along a multicast path forward the multicast traffic
   (including the ingress measurement) to all active endpoint receivers.
   Each last hop (egress) router forwards the traffic to one or more
   egress endpoint(s).

   In this figure, each endpoint includes a meter that performs a local
   egress load measurement. An endpoint also extracts the received
   ingress measurement from the traffic, and compares the ingress and
   egress measurements to determine if the Circuit Breaker ought to be
   triggered. This measurement has to be robust to loss (see previous
   section). If the Circuit Breaker is triggered, it generates a
   multicast leave message for the egress (e.g., an IGMP or MLD message
   sent to the last hop router), which causes the upstream router to
   cease forwarding traffic to the egress endpoint.

   Any multicast router that has no active receivers for a particular
   multicast group will prune traffic for that group, sending a prune
   message to its upstream router. This starts the process of releasing
   the capacity used by the traffic and is a standard multicast routing
   function (e.g., using the PIM-SM routing protocol). Each egress
   operates autonomously, and the circuit breaker "reaction" is executed
   by the multicast control plane (e.g., by the PIM multicast routing
   protocol), requiring no explicit signalling by the circuit breaker
   along the communication path used for the control messages. Note:
   there is no direct communication with the Ingress, and hence a
   triggered Circuit Breaker only controls traffic downstream of the
   first hop router. It does not stop traffic flowing from the sender
   to the first hop router; this is however the common practice for
   multicast deployment.

   The method could also be used with a multicast tunnel or subnetwork
   (e.g., Section 6.2, Section 6.3), where a meter at the ingress
   generates additional control messages to carry the measurement data
   towards the egress where the egress metering is implemented.

5.2.  Use with control protocols supporting pre-provisioned capacity

   Some paths are provisioned using a control protocol, e.g., flows
   provisioned using the Multi-Protocol Label Switching (MPLS) services,
   paths provisioned using the Resource Reservation Protocol (RSVP),
   networks utilizing Software Defined Network (SDN) functions, or
   admission-controlled Differentiated Services.

   Figure 1 shows one expected use case, in which a separate device
   could be used to perform the measurement and trigger functions. The
   reaction generated by the trigger could take the form of a network
   control message sent to the ingress and/or other network elements
   causing these elements to react to the Circuit Breaker. Examples of
   this type of use are provided in Section 6.3.

5.3.  Unidirectional Circuit Breakers over Controlled Paths

   A Circuit Breaker can be used to control uni-directional UDP traffic,
   providing that there is a communication path that can be used for
   control messages to connect the functional components at the Ingress
   and Egress. This communication path for the control messages can
   exist in networks for which the traffic flow is purely
   unidirectional. For example, a multicast stream that sends packets
   across an Internet path can use multicast routing to prune flows to
   shed network load. Some other types of subnetwork also utilize
   control protocols that can be used to control traffic flows.

6.  Examples of Circuit Breakers

   There are multiple types of Circuit Breaker that could be defined for
   use in different deployment cases. This section provides examples of
   different types of circuit breaker.

6.1.  A Fast-Trip Circuit Breaker

   [RFC2309] discusses the dangers of congestion-unresponsive flows and
   states that "all UDP-based streaming applications should incorporate
   effective congestion avoidance mechanisms".
   All applications ought to use a full-featured transport (TCP, SCTP,
   DCCP), and if not, an application (e.g., those using UDP and its
   UDP-Lite variant) needs to provide appropriate congestion avoidance.
   Guidance for applications that do not use congestion-controlled
   transports is provided in [ID-ietf-tsvwg-RFC5405.bis]. Such
   mechanisms can be designed to react on much shorter timescales than a
   circuit breaker, which only observes a traffic envelope. Congestion-
   control methods can also interact with an application to more
   effectively control its sending rate.

   A fast-trip circuit breaker is the most responsive form of Circuit
   Breaker. It has a response time that is only slightly larger than
   that of the traffic that it controls. It is suited to traffic with
   well-understood characteristics (and could include one or more
   trigger functions specifically tailored to the type of traffic for
   which it is designed). It is not suited to arbitrary network traffic
   and may be unsuitable for traffic aggregates, since it could
   prematurely trigger (e.g., when multiple congestion-controlled flows
   lead to short-term overload).

   These mechanisms are suitable for implementation in endpoints (e.g.,
   as a part of the transport system), where they can also complement
   end-to-end congestion control methods. A shorter response time
   enables these mechanisms to trigger before other forms of circuit
   breaker (e.g., circuit breakers operating on traffic aggregates at a
   point along the network path).

6.1.1. A Fast-Trip Circuit Breaker for RTP

   A set of fast-trip Circuit Breaker methods have been specified for
   use together by a Real-time Transport Protocol (RTP) flow using the
   RTP/AVP Profile [RTP-CB]. It is expected that, in the absence of
   severe congestion, all RTP applications running on best-effort IP
   networks will be able to run without triggering these circuit
   breakers.
   A fast-trip RTP Circuit Breaker is therefore implemented as a
   fail-safe that, when triggered, will terminate RTP traffic.

   The sending endpoint monitors reception of in-band RTP Control
   Protocol (RTCP) reception report blocks, as contained in SR or RR
   packets, that convey reception quality feedback information. This is
   used to measure (congestion) loss, possibly in combination with ECN
   [RFC6679].

   The Circuit Breaker action (shutdown of the flow) is triggered when
   any of the following trigger conditions are true:

   1. An RTP Circuit Breaker triggers on reported lack of progress.

   2. An RTP Circuit Breaker triggers when no receiver report messages
      are received.

   3. An RTP Circuit Breaker uses a TFRC-style check and sets a hard
      upper limit to the long-term RTP throughput (over many RTTs).

   4. An RTP Circuit Breaker includes the notion of Media Usability.
      This circuit breaker is triggered when the quality of the
      transported media falls below some required minimum acceptable
      quality.

6.2. A Slow-trip Circuit Breaker

   A slow-trip Circuit Breaker could be implemented in an endpoint or
   network device. This type of Circuit Breaker is much slower at
   responding to congestion than a fast-trip Circuit Breaker and is
   expected to be more common.

   One example where a slow-trip Circuit Breaker is needed is where
   flows or traffic-aggregates use a tunnel or encapsulation and the
   flows within the tunnel do not all support TCP-style congestion
   control (e.g., TCP, SCTP, TFRC); see Section 3.1.3 of
   [ID-ietf-tsvwg-RFC5405.bis]. A use case is where tunnels are
   deployed in the general Internet (rather than "controlled
   environments" within an Internet service provider or enterprise
   network), especially when the tunnel could need to cross a customer
   access router.

6.3. A Managed Circuit Breaker

   A managed Circuit Breaker is implemented in the signalling protocol
   or management plane that relates to the traffic aggregate being
   controlled. This type of circuit breaker is typically applicable
   when the deployment is within a "controlled environment".

   A Circuit Breaker requires more than the ability to determine that a
   network path is forwarding data, or to measure the rate of a path,
   both of which are often normal network operational functions. There
   is an additional need to determine a metric for congestion on the
   path and to trigger a reaction when a threshold is crossed that
   indicates persistent congestion.

   The control messages can use either in-band or out-of-band
   communications.

6.3.1. A Managed Circuit Breaker for SAToP Pseudo-Wires

   Section 8 of [RFC4553], SAToP Pseudo-Wires (PWE3), describes an
   example of a managed circuit breaker for isochronous flows.

   If such flows were to run over a pre-provisioned (e.g., Multi-
   Protocol Label Switching, MPLS) infrastructure, then it could be
   expected that the Pseudowire (PW) would not experience congestion,
   because such a flow is not expected to increase (or decrease) its
   rate. If instead Pseudo-Wire traffic is multiplexed with other
   traffic over the general Internet, it could experience congestion.
   [RFC4553] states: "If SAToP PWs run over a PSN providing best-effort
   service, they SHOULD monitor packet loss in order to detect "severe
   congestion". The currently recommended measurement period is 1
   second, and the trigger operates when there are more than three
   measured Severely Errored Seconds (SES) within a period. If such a
   condition is detected, a SAToP PW ought to shut down bidirectionally
   for some period of time...".

   The concept was that when the packet loss ratio (congestion) level
   increased above a threshold, the PW was by default disabled.
   This use case considered fixed-rate transmission, where the PW had no
   reasonable way to shed load.

   The trigger needs to be set at the loss ratio at which the PW was
   likely to experience a serious problem, possibly making the service
   non-compliant. At this point, triggering the Circuit Breaker would
   remove the traffic, preventing undue impact on congestion-responsive
   traffic (e.g., TCP). Part of the rationale was that high loss
   ratios typically indicated that something was "broken" and ought to
   have already resulted in operator intervention, and therefore needed
   to trigger this intervention.

   An operator-based response provides opportunity for other action to
   restore the service quality, e.g., by shedding other loads or
   assigning additional capacity, or to consciously avoid reacting to
   the trigger while engineering a solution to the problem. This could
   require the trigger to be sent to a third location (e.g., a network
   operations centre, NOC) responsible for operation of the tunnel
   ingress, rather than the tunnel ingress itself.

6.3.2. A Managed Circuit Breaker for Pseudowires (PWs)

   Pseudowires (PWs) [RFC3985] have become a common mechanism for
   tunneling traffic, and may compete for network resources both with
   other PWs and with non-PW traffic, such as TCP/IP flows.

   [ID-ietf-pals-congcons] discusses congestion conditions that can
   arise when PWs compete with elastic (i.e., congestion responsive)
   network traffic (e.g., TCP traffic). Elastic PWs carrying IP traffic
   (see [RFC4488]) do not raise major concerns because all of the
   traffic involved responds, reducing the transmission rate when
   network congestion is detected.
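
   As an illustration of the loss-based trigger quoted from [RFC4553]
   in Section 6.3.1, the following is a minimal sketch in Python. It
   assumes one measurement per second and trips when more than three
   Severely Errored Seconds are seen within a sliding window. The class
   name, the per-second loss ratio that defines an SES (0.3), and the
   window length (10 seconds) are hypothetical values chosen for the
   example; they are not taken from [RFC4553] or from this document.

```python
from collections import deque


class SesTrigger:
    """Hypothetical loss-based trigger for a managed circuit breaker."""

    def __init__(self, ses_loss_ratio=0.3, window_seconds=10, max_ses=3):
        # ses_loss_ratio and window_seconds are assumed example values.
        self.ses_loss_ratio = ses_loss_ratio
        self.max_ses = max_ses
        # One boolean per 1-second measurement; old entries fall out.
        self.window = deque(maxlen=window_seconds)

    def record_second(self, sent, received):
        """Record one 1-second measurement; return True if the CB trips."""
        lost = max(sent - received, 0)
        loss_ratio = lost / sent if sent > 0 else 0.0
        self.window.append(loss_ratio >= self.ses_loss_ratio)
        # Trip only when MORE than max_ses errored seconds are in the window.
        return sum(self.window) > self.max_ses
```

   A caller would feed the trigger one (sent, received) packet count
   pair per second and, when it returns True, either shut down the PW
   or notify the operator, as discussed in Section 6.3.1.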

   In contrast, inelastic PWs (e.g., a fixed-bandwidth Time Division
   Multiplex, TDM) [RFC4553] [RFC5086] [RFC5087] have the potential to
   harm congestion responsive traffic or to contribute to excessive
   congestion because inelastic PWs do not adjust their transmission
   rate in response to congestion. [ID-ietf-pals-congcons] analyses TDM
   PWs, with an initial conclusion that a TDM PW operating with a degree
   of loss that may result in congestion-related problems is also
   operating with a degree of loss that results in an unacceptable TDM
   service. For that reason, the draft suggests that a managed circuit
   breaker that shuts down a PW when it persistently fails to deliver
   acceptable TDM service is a useful means for addressing these
   congestion concerns.

7. Examples where circuit breakers may not be needed

   A Circuit Breaker is not required for a single Congestion Controller-
   controlled flow using TCP, SCTP, TFRC, etc. In these cases, the
   Congestion Control methods are already designed to prevent persistent
   congestion.

7.1. CBs over pre-provisioned Capacity

   One common question is whether a Circuit Breaker is needed when a
   tunnel is deployed in a private network with pre-provisioned
   capacity.

   In this case, compliant traffic that does not exceed the provisioned
   capacity ought not to result in persistent congestion. A Circuit
   Breaker will hence only be triggered when there is non-compliant
   traffic. It could be argued that this event ought never to happen,
   but it could also be argued that the Circuit Breaker equally ought
   never to be triggered. If a Circuit Breaker were to be implemented,
   it would provide an appropriate response if persistent congestion
   occurs in an operational network.
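
   The argument above can be illustrated with a minimal sketch in
   Python. It assumes a hypothetical trigger that compares per-interval
   ingress and egress packet counts and fires only when the loss ratio
   exceeds a threshold for several consecutive measurement intervals.
   The function name, the 10% threshold, and the three-interval
   persistence requirement are assumptions made for the example, not
   values recommended by this document.

```python
def should_trigger(intervals, loss_threshold=0.1, persistent_intervals=3):
    """Return True if loss exceeds the threshold in N consecutive intervals.

    intervals: iterable of (ingress_packets, egress_packets) tuples,
    one per measurement interval. All parameters are assumed values.
    """
    consecutive = 0
    for ingress, egress in intervals:
        loss = (ingress - egress) / ingress if ingress > 0 else 0.0
        # Count consecutive lossy intervals; any clean interval resets it.
        consecutive = consecutive + 1 if loss > loss_threshold else 0
        if consecutive >= persistent_intervals:
            return True
    return False


# Compliant traffic on a provisioned path: negligible loss, no trigger.
compliant = [(1000, 1000), (1000, 998), (1000, 1000), (1000, 999)]
# Non-compliant traffic: sustained loss over three intervals.
overload = [(1000, 700), (1000, 650), (1000, 600)]
```

   Under these assumptions, the compliant sequence never triggers the
   Circuit Breaker, while the overload sequence does, which matches the
   expectation that the mechanism only acts on persistent congestion.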

   Implementing a Circuit Breaker will not reduce the performance of the
   flows, but in the event that persistent congestion occurs it protects
   network traffic that shares network capacity with these flows. A
   Circuit Breaker could also be used to protect other traffic sharing
   the network from a failure that causes the Circuit Breaker traffic to
   be routed over a non-pre-provisioned path.

7.2. CBs with tunnels carrying Congestion-Controlled Traffic

   IP-based traffic is generally assumed to be congestion-controlled,
   i.e., it is assumed that the transport protocols generating IP-based
   traffic at the sender already employ mechanisms that are sufficient
   to address congestion on the path [ID-ietf-tsvwg-RFC5405.bis]. A
   question therefore arises when people deploy a tunnel that is thought
   to only carry an aggregate of TCP (or some other Congestion
   Controller-controlled) traffic: Is there an advantage in this case in
   using a Circuit Breaker?

   Traffic in such a tunnel will certainly respond to congestion.
   However, the answer to the question is not always obvious, because
   the overall traffic formed by an aggregate of flows that implement a
   Congestion Controller mechanism does not necessarily prevent
   persistent congestion. For instance, most Congestion Controller
   mechanisms require long-lived flows to react to reduce the rate of a
   flow; an aggregate of many short flows could result in many
   terminating before they experience congestion. It is also often
   impossible for a tunnel service provider to know that the tunnel only
   contains CC-controlled traffic (e.g., inspecting packet headers might
   not be possible). The important thing to note is that if the
   aggregate of the traffic does not result in persistent congestion
   (impacting other flows), then the Circuit Breaker will not trigger.
   This is the expected case in this context, so implementing a Circuit
   Breaker will not reduce performance of the tunnel, but in the event
   that persistent congestion occurs this protects other network traffic
   that shares capacity with the tunnel traffic.

7.3. CBs with Uni-directional Traffic and no Control Path

   A one-way forwarding path could have no associated communication path
   for sending control messages, and therefore cannot be controlled
   using an automated process. This service could be provided using a
   path that has dedicated capacity and does not share this capacity
   with other elastic Internet flows (i.e., flows that vary their rate).

   A way to mitigate the impact on other flows when capacity could be
   shared is to manage the traffic envelope by using ingress policing.

   Supporting this type of traffic in the general Internet requires
   operator monitoring to detect and respond to persistent congestion.

8. Security Considerations

   All Circuit Breaker mechanisms rely upon coordination between the
   ingress and egress meters and communication with the trigger
   function. This is usually achieved by passing network control
   information (or protocol messages) across the network. Timely
   operation of a circuit breaker depends on the choice of measurement
   period. If the receiver has an interval that is overly long, then
   the responsiveness of the circuit breaker decreases. This impacts
   the ability of the circuit breaker to detect and react to congestion.

   A Circuit Breaker could potentially be exploited by an attacker to
   mount a denial of service attack against the traffic being measured.
   Mechanisms therefore need to be implemented to prevent attacks on the
   network control information that would result in Denial of Service
   (DoS). The source and integrity of control information (measurements
   and triggers) MUST be protected from off-path attacks.
   Without protection, it could be trivial for an attacker to inject
   packets with values that could prematurely trigger a circuit breaker,
   resulting in DoS. Simple protection can be provided by using a
   randomized source port, or equivalent field in the packet header
   (such as the RTP SSRC value and the RTP sequence number) that is not
   expected to be known to an off-path attacker. Stronger protection
   can be achieved using a secure authentication protocol.

   Transmission of network control information consumes network
   capacity. This control traffic needs to be considered in the design
   of a Circuit Breaker and could potentially add to network congestion.
   If this traffic is sent over a shared path, it is RECOMMENDED that
   this control traffic is prioritized to reduce the probability of loss
   under congestion. Control traffic also needs to be considered when
   provisioning a network that uses a circuit breaker.

   The circuit breaker MUST be designed to be robust to packet loss that
   can also be experienced during congestion/overload. Loss of control
   messages could be a side-effect of a congested network, but also
   could arise from other causes (see Section 4).

   The security implications depend on the design of the mechanisms, the
   type of traffic being controlled and the intended deployment
   scenario. Each design of a Circuit Breaker MUST therefore evaluate
   whether the particular circuit breaker mechanism has new security
   implications.

9. IANA Considerations

   This document makes no request of IANA.

10. Acknowledgments

   There are many people who have discussed and described the issues
   that have motivated this draft. Contributions and comments included:
   Lars Eggert, Colin Perkins, David Black, Matt Mathis, Andrew
   McGregor, Bob Briscoe and Elliot Lear.
   This work was part-funded by the European Community under its Seventh
   Framework Programme through the Reducing Internet Transport Latency
   (RITE) project (ICT-317700).

11. Revision Notes

   XXX RFC-Editor: Please remove this section prior to publication XXX

   Draft 00

   This was the first revision. Help and comments are greatly
   appreciated.

   Draft 01

   Contained clarifications and changes in response to received
   comments, plus addition of diagram and definitions. Comments are
   welcome.

   WG Draft 00

   Approved as a WG work item on 28th Aug 2014.

   WG Draft 01

   Incorporates feedback after Dallas IETF TSVWG meeting. This version
   is thought ready for WGLC comments. Definitions of abbreviations.

   WG Draft 02

   Minor fixes for typos. Rewritten security considerations section.

   WG Draft 03

   Updates following WGLC comments (see TSV mailing list). Comments
   from C Perkins; D Black and off-list feedback.

   A clear recommendation of intended scope.

   Changes include: Improvement of language on timescales and minimum
   measurement period; clearer articulation of endpoint and multicast
   examples - with new diagrams; separation of the controlled network
   case; updated text on position of trigger function; corrections to
   RTP-CB text; clarification of loss vs. ECN metrics; checks against
   submission checklist (use of keywords, added meters to diagrams).

   WG Draft 04

   Added section on PW CB for TDM - a newly adopted draft (D. Black).

   WG Draft 05

   Added clarifications requested during AD review.

   WG Draft 06

   Fixed some remaining typos.

   Update following detailed review by Bob Briscoe, and comments by D.
   Black.

   WG Draft 07

   Additional update following review by Bob Briscoe.

   WG Draft 08

   Updated text on the response to lack of meter measurements with
   managed circuit breakers. Additional comments from Elliot Lear (APPs
   area).

12. References

12.1. Normative References

   [ID-ietf-tsvwg-RFC5405.bis]
              Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines (Work-in-Progress)", 2015.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

12.2. Informative References

   [ID-ietf-pals-congcons]
              Stein, YJ., Black, D., and B. Briscoe, "Pseudowire
              Congestion Considerations (Work-in-Progress)", 2015.

   [Jacobsen88]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              Symposium proceedings on Communications architectures and
              protocols, August 1988.

   [RFC1112]  Deering, S., "Host extensions for IP multicasting", STD 5,
              RFC 1112, DOI 10.17487/RFC1112, August 1989.

   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
              S., Wroclawski, J., and L. Zhang, "Recommendations on
              Queue Management and Congestion Avoidance in the
              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, DOI 10.17487/RFC2914, September 2000.

   [RFC3985]  Bryant, S., Ed. and P. Pate, Ed., "Pseudo Wire Emulation
              Edge-to-Edge (PWE3) Architecture", RFC 3985,
              DOI 10.17487/RFC3985, March 2005.

   [RFC4488]  Levin, O., "Suppression of Session Initiation Protocol
              (SIP) REFER Method Implicit Subscription", RFC 4488,
              DOI 10.17487/RFC4488, May 2006.

   [RFC4553]  Vainshtein, A., Ed. and YJ. Stein, Ed., "Structure-
              Agnostic Time Division Multiplexing (TDM) over Packet
              (SAToP)", RFC 4553, DOI 10.17487/RFC4553, June 2006.

   [RFC5086]  Vainshtein, A., Ed., Sasson, I., Metz, E., Frost, T., and
              P. Pate, "Structure-Aware Time Division Multiplexed (TDM)
              Circuit Emulation Service over Packet Switched Network
              (CESoPSN)", RFC 5086, DOI 10.17487/RFC5086, December 2007.

   [RFC5087]  Stein, Y(J)., Shashoua, R., Insler, R., and M. Anavi,
              "Time Division Multiplexing over IP (TDMoIP)", RFC 5087,
              DOI 10.17487/RFC5087, December 2007.

   [RFC5348]  Floyd, S., Handley, M., Padhye, J., and J. Widmer, "TCP
              Friendly Rate Control (TFRC): Protocol Specification",
              RFC 5348, DOI 10.17487/RFC5348, September 2008.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC6679]  Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P.,
              and K. Carlberg, "Explicit Congestion Notification (ECN)
              for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, August
              2012.

   [RTP-CB]   Perkins, and Singh, "Multimedia Congestion Control:
              Circuit Breakers for Unicast RTP Sessions", February 2014.

Author's Address

   Godred Fairhurst
   University of Aberdeen
   School of Engineering
   Fraser Noble Building
   Aberdeen, Scotland AB24 3UE
   UK

   Email: gorry@erg.abdn.ac.uk
   URI: http://www.erg.abdn.ac.uk