2 ConEx B. Briscoe 3 Internet-Draft BT 4 Intended status: Informational February 14, 2014 5 Expires: August 18, 2014 7 Network Performance Isolation using Congestion Policing 8 draft-briscoe-conex-policing-01 10 Abstract 12 This document describes why policing using congestion information can 13 isolate users from network performance degradation due to each 14 other's usage, but without losing the multiplexing benefits of a LAN- 15 style network where anyone can use any amount of any resource. 16 Extensive numerical examples and diagrams are given. The document is 17 agnostic to how the congestion information reaches the policer. The 18 Congestion Exposure (ConEx) protocol is recommended, but other tunnel 19 feedback mechanisms have been proposed. 21 Status of This Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on August 18, 2014. 38 Copyright Notice 40 Copyright (c) 2014 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document.
Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 56 2. Example Bulk Congestion Policer . . . . . . . . . . . . . . . 4 57 3. Network Performance Isolation: Intuition . . . . . . . . . . 8 58 3.1. The Problem . . . . . . . . . . . . . . . . . . . . . . . 8 59 3.2. Approach . . . . . . . . . . . . . . . . . . . . . . . . 10 60 3.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 11 61 3.4. Simple Boundary Model of Congestion Control . . . . . . . 11 62 3.5. Long-Running Flows . . . . . . . . . . . . . . . . . . . 12 63 3.6. On-Off Flows . . . . . . . . . . . . . . . . . . . . . . 14 64 3.6.1. Numerical Examples Without Policing . . . . . . . . . 16 65 3.6.2. Congestion Policing of On-Off Flows . . . . . . . . . 19 66 3.7. Weighted Congestion Controls . . . . . . . . . . . . . . 20 67 3.8. A Network of Links . . . . . . . . . . . . . . . . . . . 22 68 3.8.1. Numerical Example: Isolation from Focused Load . . . 23 69 3.8.2. Isolation in the Short-Term . . . . . . . . . . . . . 24 70 3.8.3. Encouraging Load Balancing . . . . . . . . . . . . . 24 71 3.9. Tenants of Different Sizes . . . . . . . . . . . . . . . 25 72 3.10. Links of Different Sizes . . . . . . . . . . . . . . . . 25 73 3.11. Diverse Congestion Control Algorithms . . . . . . . . . . 26 74 4. Parameter Setting . . . . . . . . . . . . . . . . . . . . . . 29 75 5. Security Considerations . . . . . . . . . . . . . . . . . . . 30 76 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 77 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 30 78 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 30 79 9. Informative References . . . . . . . . . . . . . . . . . . . 30 80 Appendix A. Summary of Changes between Drafts . . . . . . . . . 33 82 1. Introduction 84 This document is informative, not normative. It describes why 85 congestion policing isolates the network performance of different 86 parties who are using a network, and why it is more useful than other 87 forms of performance isolation such as peak and committed information 88 rate policing, weighted round-robin or weighted fair-queueing. 90 The main point is that congestion policing isolates performance even 91 when the load applied by everyone is highly variable, where variation 92 may be over time, or by spreading traffic across different paths (the 93 hose model). Unlike other isolation techniques, congestion policing 94 is an edge mechanism that works over a pool of links across whole 95 network, not just at one link. If the load from everyone happens to 96 be constant, and there is a single bottleneck, congestion policing 97 gives a nearly identical outcome to weighted round-robin or weighted 98 fair-queuing at that bottleneck. But otherwise, congestion policing 99 is a far more general solution. 101 The document complements others that describe various network 102 arrangements that could use congestion policing, such as in a 103 broadband access network [I-D.briscoe-conex-initial-deploy], in a 104 mobile access network [I-D.ietf-conex-mobile] or in a data centre 105 network [I-D.briscoe-conex-data-centre]. Nonetheless, all the 106 numerical examples use the data centre scenario. 
108 The key to the solution is the use of congestion-bit-rate rather than 109 bit-rate as the policing metric. _How_ this works is very simple and 110 quick to describe (Section 2). 112 However, it is much more difficult to understand _why_ this approach 113 provides performance isolation; in particular, why it provides 114 performance isolation across a network of links, even though there is 115 apparently no isolation mechanism in each link. Section 3 builds up 116 an intuition for why the approach works, and why other approaches 117 fall down in different ways. The bullets below provide a summary of 118 that explanation, which builds from the simple case of long-running 119 flows through a single link up to a fully meshed network with on-off 120 flows of different sizes and different behaviours: 122 o Starting with the simple case of long-running flows focused on a 123 single bottleneck link, tenants get weighted shares of the link, 124 much like weighted round robin, but with no mechanism in any of 125 the links. This is because losses (or ECN marks) are random, so 126 if one tenant sends twice as much bit-rate it will suffer twice as 127 many lost bits (or ECN-marked bits). So, at least for constant 128 long-running flows, regulating congestion-bits gives the same 129 outcome as regulating bits; 131 o In the more realistic case where flows are not all long-running 132 but a mix of short to very long, it is explained that bit-rate is 133 not a sufficient metric for isolating performance; how _often_ a 134 tenant is sending (or not sending) is the significant factor for 135 performance isolation, not whether bit-rate is shared equally 136 whenever a source happens to be sending; 138 o Although it might seem that data volume would be a good measure of 139 how often a tenant is sending, we then show why it is not. For 140 instance, a tenant can send a large volume of data but hardly 141 affect the performance of others -- by being more responsive to 142 congestion. Using congestion-volume (congestion-bit-rate over 143 time) in a policer encourages large data senders to be more 144 responsive (to yield), giving other tenants much higher 145 performance while hardly affecting their own performance. 146 Whereas, using straight volume as an allocation metric provides no 147 distinction between high volume sources that yield and high volume 148 sources that do not yield (the widespread behaviour today); 150 o We then show that a policer based on the congestion-bit-rate 151 metric works across a network of links treating it as a pool of 152 capacity, whereas other approaches treat each link independently, 153 which is why the proposed approach requires none of the 154 configuration complexity on switches that is involved in other 155 approaches. 157 o We also show that a congestion policer can be arranged to limit 158 bursts of congestion from sources that focus traffic onto a single 159 link, even where one source may consist of a large aggregate of 160 sources. 162 o We show that a congestion policer rewards traffic that shifts to 163 less congested paths (e.g. multipath TCP or virtual machine 164 motion). 166 o We show that congestion policing works on the pool of links, 167 irrespective of whether individual links have significantly 168 different capacities. 170 o We show that a congestion policer allows a wide variety of 171 responses to congestion (e.g.
New Reno TCP [RFC5681], Cubic TCP, 172 Compound TCP, Data Centre TCP [DCTCP] and even unresponsive UDP 173 traffic), while still encouraging and enforcing a sufficient 174 response to congestion from all sources taken together, so that 175 the performance of each application is sufficiently isolated from 176 others. 178 2. Example Bulk Congestion Policer 180 In order to explain why a congestion policer works, we first describe 181 a particular design of congestion policer, called a bulk congestion 182 policer. The aim is to give a concrete example to motivate the 183 protocol changes necessary to implement such a policer. This is 184 merely an informative document, so there is no intention to 185 standardise this or any specific congestion policing algorithm. The 186 aim is for different implementers to be free to develop their own 187 improvements to this design. 189 A number of other documents describe deployment arrangements that 190 include a congestion policer, for instance in a broadband access 191 network [I-D.briscoe-conex-initial-deploy], in a mobile access 192 network [I-D.ietf-conex-mobile] and in a data centre network 194 [I-D.briscoe-conex-data-centre]. The congestion policer described in 195 the present document could be deployed in any of these arrangements. 196 However, to be concrete, the example link capacities, topology and 197 terminology used in the present document assume a multi-tenant data 198 centre scenario [I-D.briscoe-conex-data-centre]. It is believed that 199 similar examples could be drawn up for the other scenarios, just 200 using different numbers for link capacities, etc. and replacing some 201 terminology, e.g. substituting 'end-customer' for 'tenant' 202 throughout. 204 Figure 1 illustrates a bulk congestion policer. It is very similar 205 to a regular token bucket for policing bit-rate except the tokens do 206 not represent bits, rather they represent 'congestion-bits'. Using 207 Congestion Exposure (ConEx) [I-D.ietf-conex-abstract-mech] is 208 probably the easiest way to understand what congestion-bits are. 209 However, there are other valid ways to signal congestion information 210 to a congestion policer without ConEx (e.g. the tunnelled feedback 211 described in [I-D.briscoe-conex-data-centre]). 213 Using ConEx as an example, a ConEx-based congestion token bucket 214 ignores any packets unless they are ConEx-marked. The ConEx protocol 215 includes audit capabilities that force the sender to reveal 216 information it receives about losses anywhere on the downstream path 217 (e.g. at the points marked X in Figure 1). The sender signals this 218 information with a ConEx-marking in the IP header of the packets it 219 sends into the network, using an arrangement called re-inserted 220 feedback or re-feedback. So, every time the sender detects a loss of 221 a 1500B packet, it has to apply a ConEx-marking to a subsequent 1500B 222 packet (or it has to ConEx-mark at least as many bytes made up of 223 smaller packets). In Figure 1 ConEx-marked packets are shown tagged 224 with a '*'. 226 A bulk congestion policer is deployed at every point where traffic 227 enters an operator's network (similar to the Diffserv architecture). 228 At any one point of entry, the operator assigns a congestion token 229 bucket to each tenant that is sending traffic into the network. The 230 operator assigns a contracted token-fill-rate w1 to tenant 1, w2 to 231 tenant 2, and in general wi to the tenant with index i, where i = 1,2,...
232 This contracted token-fill-rate, wi, can be thought of like a weight, 233 where a higher weight allows a tenant to contribute more traffic 234 through a downstream congested link or to contribute traffic into 235 more congested links, as we shall see. 237 Tenants wanting to send more traffic can be assigned a higher weight 238 (probably paying more for it). Typically, tenants might sign-up to a 239 traffic contract in two parts: 241 1. a peak rate, which represents the capacity of their access to the 242 network 244 2. a congestion token-fill rate, which limits how much traffic can 245 be sent that limits what others are sending. 247 In Figure 1, as each packet arrives, the classifier assigns it to the 248 congestion token bucket of the relevant tenant. The token bucket 249 ignores all non-ConEx-marked packets - they do not drain any tokens 250 from the bucket. Otherwise, if a packet is ConEx-marked and ,say, 251 1500B in size it will drain 1500B of tokens from the congestion token 252 bucket. If the tenant is contributing to more congestion than its 253 contracted congestion-fill-rate, the bucket level will decrease, and 254 if it persists, the bucket level will become low enough to cause the 255 policer to start discarding a proportion of all that tenant's traffic 256 (whether ConEx-marked or not). 258 ______________________ 259 | | | | Legend | 260 |w1 |w2 |wi | | 261 | | | | [_] [_]packet stream | 262 V V V | | 263 congestion . . . | [*] marked packet | 264 token bucket| . | | . | __|___| | ___ | 265 __|___| | . | | |:::| | \ / policer | 266 | |:::| __|___| | |:::| | /_\ | 267 | +---+ | +---+ | +---+ | | 268 bucket depth | : | : | : | /\ marking meter | 269 controls the | . | : | . | \/ | 270 policer _V_ . | : | . |______________________| 271 ____\ /__/\___________________________ downstream 272 /[*] /_\ \/ [_] | : [_] | : [_] \ /->network 273 class-/ | . | . \ / /---> 274 ifier/ _V_ . | . \ / / 275 __,--.__________________\ /__/\___________________\______/____/ loss 276 `--' [_] [*] [_] /_\ \/ [_] | . [*] / \ \-X---> 277 \ | . / \--> 278 \ _V_ : / \ loss 279 \__________________________\ /__/\_____/ \-X---> 280 [_] [*] [_] [*] [_] /_\ \/ [_] \ 281 \---> 283 Figure 1: Bulk Congestion Policer Schematic 285 Note that this is called a bulk congestion policer because it polices 286 all the flows of a tenant even if they spread out in a hose pattern 287 across numerous downstream networks and even if only a few of these 288 flows are primarily to blame for contributing heavily to downstream 289 congestion, while others find plenty of capacity downstream. This 290 approach is motivated by simplicity rather than precision (if 291 precision were required other policer designs could be used). 292 Simplicity might be more important than precision if capacity were 293 generally well-provisioned and the policer was intended to intervene 294 only rarely to limit unusually extreme behaviour. Also, a network 295 might take such a simple but somewhat heavy-handed approach in order 296 to motivate tenants to implement more complex flow-specific controls 297 themselves [CongPol]. 299 Various more sophisticated congestion policer designs have been 300 evaluated [CPolTrilogyExp]. In these experiments, it was found that 301 it is better if the policer gradually increases discards as the 302 bucket becomes empty. Also isolation between tenants is better if 303 each tenant is policed based on the combination of two buckets, not 304 one (Figure 2): 306 1. 
A deep bucket (that would take minutes or even hours to fill at 307 the contracted fill-rate) that constrains the tenant's long-term 308 average rate of congestion (wi) 310 2. a very shallow bucket (e.g. only two or three MTU) that is filled 311 considerably faster than the deep bucket (c * wi), where c = ~10, 312 which prevents a tenant storing up a large backlog of tokens then 313 causing congestion in one large burst. 315 In this arrangement each marked packet drains tokens from both 316 buckets, and the probability of policer discard is taken as the worse 317 of the two buckets. 319 | | 320 Legend: |c*wi |wi 321 See previous figure V V 322 . . 323 . | . | deep bucket 324 _ _ _ _ _ _ _ _ _ _ _ _ |___| 325 | . |:::| 326 |_ _ _ _ _ _ _ |___| |:::| 327 | shallow +---+ +---+ 328 worse of the| bucket 329 two buckets| \____ ____/ 330 triggers| \ / both buckets 331 policing V : drained by 332 ___ . marked packets 333 ___________\ /___________________/ \__________________ 334 [_] [_] /_\ [_] [*] [_] \ / [_] [_] [_] 336 Figure 2: Dual Congestion Token Bucket (in place of each single 337 bucket in the previous figure) 339 This design of congestion policer will be sufficient to understand 340 the rest of the document. However, it should be remembered that this 341 is only an example, not an ideal design. Also it should be 342 remembered that other mechanisms such as tunnel feedback can be used 343 instead of ConEx.
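To make the dual bucket arrangement concrete, the sketch below shows one way it could be coded. It is written in Python purely for illustration: the class names, the one-hour deep-bucket depth, the linear discard ramp and the treatment of whole packets (rather than just their ConEx-marked bytes) are all arbitrary choices made for this document, not part of any specification.

   import random

   class CongestionTokenBucket:
       def __init__(self, fill_rate_bps, depth_bits):
           self.fill_rate = fill_rate_bps  # token fill-rate (bits/s)
           self.depth = depth_bits         # maximum tokens held (bits)
           self.level = depth_bits         # start full
           self.last = 0.0                 # time of last update (s)

       def update(self, now):
           # add tokens for the elapsed time, up to the bucket depth
           elapsed = now - self.last
           self.level = min(self.depth, self.level + self.fill_rate * elapsed)
           self.last = now

       def drain(self, bits):
           # called only for ConEx-marked bits
           self.level = max(0.0, self.level - bits)

       def discard_prob(self):
           # discard probability ramps up gradually as the bucket
           # approaches empty (here over the bottom 10%, an arbitrary choice)
           threshold = 0.1 * self.depth
           if self.level >= threshold:
               return 0.0
           return 1.0 - self.level / threshold

   class DualCongestionPolicer:
       def __init__(self, w_bps, c=10, mtu_bits=12000):
           # deep bucket: takes about an hour to fill at the contracted rate wi
           self.deep = CongestionTokenBucket(w_bps, depth_bits=3600 * w_bps)
           # shallow bucket: only a few MTU deep, filled at c * wi
           self.shallow = CongestionTokenBucket(c * w_bps, depth_bits=3 * mtu_bits)

       def packet(self, now, size_bits, conex_marked):
           self.deep.update(now)
           self.shallow.update(now)
           if conex_marked:
               # ConEx-marked bits drain both buckets
               self.deep.drain(size_bits)
               self.shallow.drain(size_bits)
           # police all of the tenant's packets, marked or not, using the
           # worse of the two buckets
           p = max(self.deep.discard_prob(), self.shallow.discard_prob())
           return random.random() >= p     # True = forward, False = discard

A real deployment would place one such policer instance behind the per-tenant classifier of Figure 1, at each point where the tenant's traffic enters the network.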
345 3. Network Performance Isolation: Intuition 347 3.1. The Problem 349 Network performance isolation traditionally meant that each user 350 could be sure of a minimum guaranteed bit-rate. Such assurances are 351 useful if traffic from each tenant follows relatively predictable 352 paths and is fairly constant. If traffic demand is more dynamic and 353 unpredictable (both over time and across paths), minimum bit-rate 354 assurances can still be given, but they have to be very small 355 relative to the available capacity, because a large number of users 356 might all want to simultaneously share any one link, even though 357 they rarely all use it at the same time. 359 This either means the shared capacity has to be greatly overprovided 360 so that the assured level is large enough, or the assured level has 361 to be small. The former is unnecessarily expensive; the latter 362 doesn't really give a sufficiently useful assurance. 364 Round robin or fair queuing are other forms of isolation that 365 guarantee that each user will get 1/N of the capacity of each link, 366 where N is the number of active users at each link. This is fine if 367 the number of active users (N) sharing a link is fairly predictable. 368 However, if large numbers of tenants do not typically share any one 369 link but at any time they all could (as in a data centre), a 1/N 370 assurance is fairly worthless. Again, given N is typically small but 371 could be very large, either the shared capacity has to be expensively 372 overprovided, or the assured bit-rate has to be worthlessly small. 373 The argument is no different for the weighted forms of these 374 algorithms (WRR & WFQ). 376 Both these traditional forms of isolation try to give one tenant 377 assured instantaneous bit-rate by constraining the instantaneous bit- 378 rate of everyone else. This approach is flawed except in the special 379 case when the load from every tenant on every link is continuous and 380 fairly constant. The reality is usually very different: sources are 381 on-off and the route taken varies, so that on any one link a source 382 is more often off than on. 384 In these more realistic (non-constant) scenarios, the capacity 385 available for any one tenant depends much more on _how often_ 386 everyone else uses a link, not just _how much_ bit-rate everyone 387 else would be entitled to if they did use it. 389 For instance, if 100 tenants are using a 1Gb/s link for 1% of the 390 time, there is a good chance each will get the full 1Gb/s link 391 capacity. But if just six of those tenants suddenly start using the 392 link 50% of the time, whenever the other 94 tenants need the link, 393 they will typically find 3 of these heavier tenants using it already. 394 If a 1/N approach like round-robin were used, then the light tenants 395 would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin 396 cannot claim to isolate tenants from each other if they usually get 397 1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in 398 the worst case when all 100 tenants are active). 400 In contrast, congestion policing is the key to network performance 401 isolation because it focuses policing only on those tenants that go 402 fast over congested path(s) excessively and persistently over time. 403 This keeps congestion below a design threshold everywhere so that 404 everyone else can go fast. In this way, congestion policing takes 405 account of highly variable loads (varying in time and varying across 406 routes). And, if everyone's load happens to be constant, congestion 407 policing converges on the same outcome as the traditional forms of 408 isolation. 410 The other flaw in the traditional approaches to isolation, like WRR & 411 WFQ, is that they actually prevent long-running flows from yielding 412 to brief bursts from lighter tenants. A long-running flow can yield 413 to brief flows and still complete nearly as soon as it would have 414 otherwise (the brief flows complete sooner, freeing up the capacity 415 for the longer flow sooner--see Section 3.7). However, WRR & WFQ 416 prevent flows from even seeing the congestion signals that would 417 allow them to co-ordinate between themselves, because they isolate 418 each tenant completely into separate queues. 420 In summary, superficially, traditional approaches with separate 421 queues sound good for isolation, but: 423 1. not when everyone's load is variable, so each tenant has no 424 assurance about how many other queues there will be; 426 2. and not when each tenant can no longer even see the congestion 427 signals from other tenants, so no-one's control algorithms can 428 determine whether they would benefit most by pushing harder or 429 yielding. 431 3.2. Approach 433 It has not been easy to find a way to give the intuition on why 434 congestion policing isolates performance, particularly across a 435 network of links not just on a single link. The approach used in 436 this section is to describe the system as if everyone is using the 437 congestion response they would be forced to use if congestion 438 policing had to intervene. We therefore call this the boundary model 439 of congestion control. It is a very simple congestion response, so 440 it is much easier to understand than if we introduced all the square 441 root terms and other complexity of New Reno TCP's response. And it 442 means we don't have to try to describe a mix of responses.
444 The purpose of congestion policing is not to intervene in everyone's 445 rate control all the time. Rather it is to encourage each tenant to 446 _avoid_ being policed -- to keep the aggregate of all their flows' 447 rates within an overall envelope that is sufficiently responsive to 448 congestion. Nonetheless, congestion policing can and will enforce a 449 congestion response if a particular tenant sends traffic that is 450 completely unresponsive to congestion. This upper bound enforced by 451 everyone else's congestion policers is what ensures that each 452 tenant's minimum performance is isolated from the combined effect of 453 everyone else. 455 We cannot emphasise enough that the intention is not to make 456 individual flows conform to this boundary response to congestion. 457 Indeed the intention is to allow a diverse evolving mix of congestion 458 responses, but constrained in total within a simple overall envelope. 460 After describing and further justifying using the simple boundary 461 model of congestion control, we start by considering long-running 462 flows sharing one link. Then we will consider on-off traffic, before 463 widening the scope from one link to a network of links and to links 464 of different sizes. Then we will depart from the initial simplified 465 model of congestion control and consider diverse congestion control 466 algorithms, including completely unresponsive end-systems. 468 Formal analysis to back up the intuition provided by this section 469 will be made available in a more extensive companion technical report 470 [conex-dc_tr]. 472 3.3. Terminology 474 The term congestion probability will be used as a generalisation of 475 either loss probability or ECN marking probability. 477 It is necessary to carefully distinguish: 479 o congestion bit-rate (or just congestion rate), which is an 480 absolute measure of the rate of congested bits 482 o congestion probability, which is a relative measure of the 483 proportion of congested bits to all bits 485 Sometimes people loosely talk of loss rate when they mean loss 486 probability (measured in %). In this document, we will keep to 487 strict terminology, so loss rate would be measured in lost packets 488 per second, or lost bits per second. 490 3.4. Simple Boundary Model of Congestion Control 492 The boundary model of congestion control ensures a flow's bit-rate is 493 inversely proportional to the congestion level that it detects. For 494 instance, if congestion probability doubles, the flow's bit-rate 495 halves. This is called a scalable congestion control because it 496 maintains the same rate of congestion signals (marked or dropped 497 packets) no matter how fast it goes. Examples from the research 498 community are Relentless TCP [Relentless] and Scalable TCP [STCP]. 500 In production systems, TCP algorithms based on New Reno [RFC5681] 501 have been widely replaced by alternatives closer to this scalable 502 ideal (e.g. Cubic TCP in Linux and Compound TCP in Windows), because 503 at high rates New Reno generated congestion signals too infrequently 504 to track available capacity fast enough [RFC3649]. More recent TCP 505 updates (e.g. Data Centre TCP [DCTCP] in Windows 8 and available in 506 Linux) are becoming closer still to the scalable ideal. 508 For instance, consider a scenario where a flow with scalable 509 congestion control is alone in a 1Gb/s link, then another similar 510 flow from another tenant joins it. Both will push up the congestion 511 probability, which will push down their rates until they together fit 512 into the link. Because the flow's rate has to halve to accommodate 513 the new flow, congestion probability will double (let's say from 514 0.002% to 0.004%), by our initial assumption of two similar 515 scalable congestion controls. When it is alone on the link, the 516 congestion-bit-rate of the flow is 20kb/s (=1Gb/s * 0.002%), and when 517 it shares the link it is still 20kb/s (= 500Mb/s * 0.004%).
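As a quick arithmetic check of the figures just quoted: the boundary model makes a flow's bit-rate equal to w divided by the congestion probability it sees, so the product of the two (the congestion-bit-rate) stays at w however the link is shared. A minimal sketch, using only the numbers above:

   w = 20e3                                  # congestion-bit-rate, 20kb/s
   for bit_rate, p in [(1e9, 0.002e-2),      # the flow alone on the 1Gb/s link
                       (500e6, 0.004e-2)]:   # sharing it with one similar flow
       assert round(bit_rate * p) == w       # both cases give 20kb/s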
519 In summary, a congestion control can be considered scalable if the 520 bit-rate of packets carrying congestion signals (the congestion-bit- 521 rate) always stays the same no matter how much capacity it finds 522 available. This ensures there will always be enough signals in a 523 round trip time to keep the dynamics under control. 525 Reminder: Making individual flows conform to this boundary (scalable) 526 response to congestion is a non-goal. Although we start this 527 explanation with this specific simple end-system congestion response, 528 this is just to aid intuition. 530 3.5. Long-Running Flows 532 Table 1 shows various scenarios where each of five tenants has 533 contracted for 40kb/s of congestion-bit-rate in order to share a 1Gb/ 534 s link. In order to help intuition, we start with the (unlikely) 535 scenario where all their flows are long-running. A number of long- 536 running flows will try to use all the link capacity, so for 537 simplicity we take utilisation as a round figure of 100%. 539 In the case we have just described (scenario A) neither tenant's 540 policer is intervening at all, because both their congestion 541 allowances are 40kb/s and each sends only one flow that contributes 542 20kb/s of congestion -- half the allowance. 544 +--------+-------------+----------+----------+----------+-----------+ 545 | Tenant | contracted | scenario | scenario | scenario | scenario | 546 | | congestion- | A | B | C | D | 547 | | bit-rate | # : Mb/s | # : Mb/s | # : Mb/s | # : Mb/s | 548 | | kb/s | | | | | 549 +--------+-------------+----------+----------+----------+-----------+ 550 | | | | | | | 551 | (a) | 40 | 1 : 500 | 5 : 250 | 5 : 200 | 5 : 250 | 552 | (b) | 40 | 1 : 500 | 3 : 250 | 3 : 200 | 2 : 250 | 553 | (c) | 40 | - : --- | 3 : 250 | 3 : 200 | 2 : 250 | 554 | (d) | 40 | - : --- | 2 : 250 | 2 : 200 | 1 : 125 | 555 | (e) | 40 | - : --- | - : --- | 2 : 200 | 1 : 125 | 556 | | Congestion | 0.004% | 0.016% | 0.02% | 0.016% | 557 | | probability | | | | | 558 +--------+-------------+----------+----------+----------+-----------+ 560 Table 1: Bit-rates that a congestion policer allocates to five 561 tenants sharing a 1Gb/s link with various numbers (#) of long-running 562 flows all using 'scalable congestion control' of weight 20kb/s 564 Scenario B shows a case where four of the tenants all send 2 or more 565 long-running flows. Recall, from the assumption of scalable 566 controls, that each flow always contributes 20kb/s no matter how fast 567 it goes. Therefore the policers of tenants (a-c) limit them to two 568 flows-worth of congestion (2 x 20kb/s = 40kb/s). Tenant (d) is only 569 asking for 2 flows, so it gets them without being policed, and all 570 four get the same quarter share of the link. 572 Scenario C is similar, except the fifth tenant (e) joins in, so they 573 all get equal 1/5 shares of the link.
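The outcomes of scenarios B to D can be reproduced with a few lines of arithmetic: under the boundary model each tenant is effectively limited to (allowance / w) flows-worth of congestion, and the link capacity is shared in proportion to the flows-worth each tenant actually uses. The sketch below (Python, illustrative only; the function name is invented here) re-derives the Table 1 figures:

   LINK = 1e9          # 1Gb/s link
   w = 20e3            # congestion-bit-rate of each flow, 20kb/s
   allowance = 40e3    # contracted congestion-bit-rate per tenant, 40kb/s

   def shares(flows_requested):
       # the policer caps each tenant at allowance/w flows-worth of congestion
       f = [min(n, allowance / w) for n in flows_requested]
       total = sum(f)
       rates = [LINK * fi / total for fi in f]   # per-tenant bit-rate (b/s)
       congestion_prob = total * w / LINK        # congestion probability at the link
       return rates, congestion_prob

   print(shares([5, 3, 3, 2]))        # scenario B: 250Mb/s each, 0.016%
   print(shares([5, 3, 3, 2, 2]))     # scenario C: 200Mb/s each, 0.02%
   print(shares([5, 2, 2, 1, 1]))     # scenario D: 250, 250, 250, 125, 125 Mb/s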
575 In Scenario D, only tenant (a) sends more than two flows, so (a)'s 576 policer limits it to two flows-worth of congestion, and everyone else 577 gets the number of flows-worth that they ask for. This means that 578 tenants (d) & (e) get less than everyone else, which is fine because 579 they asked for less than they would have been allowed. (Similarly, 580 in Scenarios A & B, some of the tenants are inactive, so they get 581 zero, which is also less than they could have had if they had 582 wanted.) 584 With lots of long-running flows, as in scenarios B & C, congestion 585 policing seems to emulate per-tenant round robin scheduling, 586 equalising the bit-rate of each tenant, no matter how many flows they 587 run. By configuring different contracted allowances for each tenant, 588 it can easily be seen that congestion policing could emulate weighted 589 round robin (WRR), with the relative sizes of the allowances acting 590 as the weights. 592 Scenario D departs from round-robin. This is deliberate, the idea 593 being that tenants are free to take less than their share in the 594 short term, which allows them to take more at other times, as we 595 shall see in Section 3.7. In Scenario D, policing focuses only on 596 the tenant (a) that is continually exceeding its contract. This 597 policer focuses discard solely on tenant a's traffic so that it 598 cannot cause any more congestion at the shared link (shown as 0.016% 599 in the last row). 601 To summarise so far, ingress congestion policers control congestion- 602 bit-rate in order to indirectly assure a minimum bit-rate per tenant. 603 With lots of long-running flows, the outcome is somewhat similar to 604 WRR, but without the need for any mechanism in each queue. 606 However, aiming to behave like WRR is only useful when all flows are 607 infinitely long. When load is variable because transfers are of 608 finite size, congestion policing properly isolates one tenant's 609 performance from the behaviour of others, unlike WRR would, as will 610 now be explained. 612 3.6. On-Off Flows 614 Figure 3 compares two example scenarios where tenant 'b' regularly 615 sends small files in the top chart and the same size files but more 616 often in the bottom chart (a higher 'on-off ratio'). This is the 617 typical behaviour of a Web server as more clients request more files 618 at peak time. Meanwhile, in this example, tenant c's behaviour 619 doesn't change between the two scenarios -- it sends a couple of 620 large files, each starting at the same time in both cases. 622 The capacity of the link that 'b' and 'c' share is shown as the full 623 height of the plot. The files sent by 'b' are shown as little 624 rectangles. 'b' can go at the full bit-rate of the link when 'c' is 625 not sending, which is represented by the tall thin rectangles part- 626 way through the trace labelled 'b'. We assume for simplicity that 627 'b' and 'c' divide up the bit-rate equally. So, when both 'b' and 628 'c' are sending, the 'b' rectangles are half the height (bit-rate) 629 and twice the width (duration) relative to when 'b' sends alone. The 630 area of a file to be transferred stays the same, whether tall and 631 thin or short and fat, because the area represents the size of the 632 file (bit-rate x duration = file size). The files from 'c' look like 633 inverted castellations, because 'c' uses half the link rate while 634 each file from 'b' completes, then 'c' can fill the link until 'b' 635 starts the next file. 
The cross-hatched areas represent idle times 636 when no-one is sending. 638 For this simple scenario we ignore start-up dynamics and just focus 639 on the rate and duration of flows that are long enough to stabilise, 640 which is why they can be represented as simple rectangles. We will 641 introduce the effect of flow startups later. 643 In the bottom case, where 'b' sends more often, the gaps between b's 644 transfers are smaller, so 'c' has less opportunity to use the whole 645 line rate. This squeezes out the time it takes for 'c' to complete 646 its file transfers (recall a file will always have the same area 647 which represents its size). Although 'c' finishes later, it still 648 starts the next flow at the same time. In turn, this means 'c' is 649 sending during a greater proportion of b's transfers, which extends 650 b's average completion time too. 652 ^ bit-rate 653 | 654 |---------------------------------------,--.---------,--.----,------- 655 | | |\/\/\/\/\| |/\/\| 656 | c | b|/\/\/\/\/| b|\/\/| c 657 |------. ,-----. ,-----. | |\/\/\/\/\| |/\/\| ,-- 658 | b | | b | | b | | |/\/\/\/\/| |\/\/| | b 659 | | | | | | | |\/\/\/\/\| |/\/\| | 660 +------'------'-----'------'-----'------'--'---------'--'----'----'--> 661 time 662 ^ bit-rate 663 | 664 |---------------------------------------------------.--,--.--,------- 665 | |/\| |\/| 666 | c |\/| b|/\| c 667 |------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,---- 668 | b | | b | | b | | b | | b | | b |\/| |/\| | b 669 | | | | | | | | | | | |/\| |\/| | 670 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'----> 671 time 673 Figure 3: In the lower case, the on-off ratio of 'b' has increased, 674 which extends all the completion times of 'c' and 'b' 676 Round-robin would do little if anything to isolate 'c' from the 677 effect of 'b' sending files more often. Round-robin is designed to 678 force 'b' and 'c' to share the capacity equally when they are both 679 active. But in both scenarios they already share capacity equally 680 when they are both active. The difference is in how often they are 681 active. Round-robin and other traditional fair queuing techniques 682 don't have any memory to sense that 'b' has been active more of the 683 time. 685 In contrast, a congestion policer can tell when one tenant is sending 686 files more frequently, by measuring the rate at which the tenant is 687 contributing to congestion. Our aim is to show that policers will be 688 able to isolate performance properly by using the right metric 689 (congestion bit-rate), rather than using the wrong metric (bit-rate), 690 which doesn't sense whether the load over time is large or small. 692 3.6.1. Numerical Examples Without Policing 694 The usefulness of the congestion bit-rate metric will now be 695 illustrated with the numerical examples in Table 2. The scenarios 696 illustrate what the congestion bit-rate would be without any policing 697 or scheduling action in the network. Then this metric can be 698 monitored and limited by a policer, to prevent one tenant from 699 harming the performance of others. 701 The 2nd & 3rd columns (file-size and inter-arrival time) fully 702 represent the behaviour of each tenant in each scenario. All the 703 other columns merely characterise the outcome in various ways. The 704 inter-arrival time (T) is the average time between starting one file 705 and the next. For instance, in scenario E tenant 'b' sends a 16Mb 706 file every 200ms on average.
The formula in the heading of some 707 columns shows how the column was derived from other columns. 709 Scenario E is contrived so that the three tenants all offer the same 710 load to the network, even though they send files of very different 711 size (S). The files sent by tenant 'a' are 100 times smaller than 712 those of tenant 'b', but 'a' sends them 100 times more often. In 713 turn, b's files are 100 times smaller than c's, but 'b' in turn sends 714 them 100 times more often. Graphically, the scenario would look 715 similar to Figure 3, except with three sizes of file, not just two. 716 Scenarios E-G are designed to roughly represent various distributions 717 of file sizes found in data centres, but still to be simple enough to 718 facilitate intuition, even though each tenant would not normally send 719 just one size file. 721 The average completion time (t) and the maximum were calculated from 722 a fairly simple analytical model (documented in a companion technical 723 report [conex-dc_tr]). Using one data point as an example, it can be 724 seen that a 1600Mb (200MB) file from tenant 'c' completes in 1905ms 725 (about 1.9s). The files that are 100 times smaller complete 100 726 times more quickly on average. In fact, in this scenario with equal 727 loads, each tenant perceives that their files are being transferred 728 at the same rate of 840Mb/s on average (file-size divided by 729 completion time, as shown in the 'apparent bit-rate' column). Thus 730 all three tenants perceive they are getting 84% of the 1Gb 731 /s link on average (due to the benefit of multiplexing and 732 utilisation being low at 240Mb/s / 1Gb/s = 24% in this case). 734 The completion times of the smaller files vary significantly, 735 depending on whether a larger file transfer is proceeding at the same 736 time. We have already seen this effect in Figure 3, where, when 737 tenant b's files share with 'c', they take twice as long to complete 738 as when they don't. This is why the maximum completion time is 739 greater than the average for the small files, whereas there is 740 imperceptible variance for the largest files. 742 The final column shows how congestion bit-rate will be a useful 743 metric to enforce performance isolation (as we said, the table 744 illustrates the situation before any enforcement mechanism is added). 745 In the case of equal loads (scenario E), average congestion bit-rates 746 are all equal. In scenarios F and G average congestion bit-rates are 747 higher, because all tenants are placing much more load on the network 748 over time, even though each still sends at equal rate to others when 749 they are active together. Figure 3 illustrated a similar effect in 750 the difference between the top and bottom scenarios. 752 The maximum instantaneous congestion bit-rate is nearly always 20kb/ 753 s. That is because, by definition, all the tenants are using scalable 754 congestion controls with a constant congestion rate of 20kb/s, and 755 each tenant's flows rarely overlap because the per-tenant load in 756 these scenarios is fairly low. As we saw in Section 3.4, the 757 congestion rate of a particular scalable congestion control is always 758 the same, no matter how many other flows it competes with.
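The derived columns of Table 2 follow directly from the file size S, the inter-arrival time T and the completion time t, using the formulas shown in the column headings. The sketch below (Python, illustrative only) re-computes the scenario E rows:

   w = 20e3                                # congestion-bit-rate per flow (b/s)
   scenario_E = [                          # (tenant, S in bits, T in s, t in s)
       ('a', 0.16e6,   0.002, 0.00019),
       ('b', 16e6,     0.2,   0.019),
       ('c', 1600e6,  20.0,   1.905),
   ]
   for name, S, T, t in scenario_E:
       load = S / T                        # average load: 80Mb/s for each tenant
       apparent = S / t                    # apparent bit-rate: roughly 840Mb/s
       congestion = w * t / T              # average congestion-bit-rate: ~2kb/s
       print(name, load / 1e6, apparent / 1e6, congestion / 1e3)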
760 Once it is understood that the congestion bit-rate of one scalable 761 flow is always 'w' and doesn't change whenever a flow is active, it 762 becomes clear what the congestion bit-rate will be when averaged over 763 time; it will simply be 'w' multiplied by the proportion of time that 764 the tenant's file transfers are active. That is, w*t/T. For 765 instance, in scenario E, on average tenant b's flows start 200ms 766 apart, but they complete in 19ms. So they are active for 19/200 = 767 10% of the time (rounded). A tenant that causes a congestion bit- 768 rate of 20kb/s for 10% of the time will have an average congestion- 769 bit-rate of 2kb/s, as shown. 771 To summarise so far, no matter how many more files other tenants 772 transfer at the same time, each scalable flow still contributes to 773 congestion at the same rate, but it contributes for more of the time, 774 because it squeezes out into the gap before the next flow from that 775 tenant starts. 777 +-----+------+--------+------+-------------+-----------+------------+ 778 | Ten | File | Ave. | Ave. | Completion | Apparent | Congest- | 779 | ant | size | inter- | load | time | bit-rate | ion bit- | 780 | | S | arr- | S/T | ave : max | ave : min | rate | 781 | | Mb | ival | Mb/s | t | S/t | ave : max | 782 | | | T | | ms | Mb/s | w*t/T | 783 | | | ms | | | | kb/s | 784 +-----+------+--------+------+-------------+-----------+------------+ 785 | | | | | Scenario E | | | 786 | a | 0.16 | 2 | 80 | 0.19 : 0.48 | 840 : 333 | 2 : 20 | 787 | b | 16 | 200 | 80 | 19 : 35 | 840 : 460 | 2 : 20 | 788 | c | 1600 | 20000 | 80 | 1905 : 1905 | 840 : 840 | 2 : 20 | 789 | | | | ____ | | | | 790 | | | | 240 | | | | 791 | | | | | Scenario F | | | 792 | a | 0.16 | 0.67 | 240 | 0.31 : 0.48 | 516 : 333 | 9 : 20 | 793 | b | 16 | 50 | 320 | 29 : 42 | 557 : 380 | 11 : 20 | 794 | c | 1600 | 10000 | 160 | 3636 : 3636 | 440 : 440 | 7 : 20 | 795 | | | | ____ | | | | 796 | | | | 720 | | | | 797 | | | | | Scenario G | | | 798 | a | 0.16 | 0.67 | 240 | 0.33 : 0.64 | 481 : 250 | 10 : 20 | 799 | b | 16 | 40 | 400 | 32 : 46 | 505 : 345 | 16 : 40 | 800 | c | 1600 | 10000 | 160 | 4543 : 4543 | 352 : 352 | 9 : 20 | 801 | | | | ____ | | | | 802 | | | | 800 | | | | 803 +-----+------+--------+------+-------------+-----------+------------+ 805 Single link of capacity 1Gb/s. Each tenant uses a scalable congestion 806 control which contributes a congestion-bit-rate for each flow of w = 807 20kb/s. 809 Table 2: How the effect on others of various file-transfer behaviours 810 can be measured by the resulting congestion-bit-rate 812 In scenario F, clients have increased the rate they request files 813 from tenants a, b and c respectively by 3x, 4x and 2x relative to 814 scenario E. The tenants send the same size files but 3x, 4x and 2x 815 more often. For instance tenant 'b' is sending 16Mb files four times 816 as often as before, and they now take longer as well -- nearly 29ms 817 rather than 19ms -- because the other tenants are active more often 818 too, so completion gets squeezed to later. Consequently, tenant 'b' 819 is now sending 57% of the time, so its congestion-bit-rate is 20kb/s 820 * 57% = 11kb/s. This is nearly 6x higher than in scenario E, 821 reflecting both b's own increase by 4x and that this increase 822 coincides with everyone else increasing their load. 824 In scenario G, tenant 'b' increases even more, to 5x the load it 825 offered in scenario E. 
This results in average utilisation of 800Mb/s 826 / 1Gb/s = 80%, compared to 72% in scenario F and only 24% in scenario 827 E. 'b' sends the same files but 5x more often, so its load rises 5x. 829 Completion times rise for everyone due to the overall rise in load, 830 but the congestion rates of 'a' and 'c' don't rise anything like as 831 much as that of 'b', because they still leave large gaps between 832 files. For instance, tenant 'c' completes each large file transfer 833 in 4.5s (compared to 1.9s in scenario E), but it still only sends 834 files every 10s. So 'c' only sends 45% of the time, which is 835 reflected in its congestion bit-rate of 20kb/s * 45% = 9kb/s. 837 In contrast, on average tenant 'b' can only complete each medium- 838 sized file transfer in 32ms (compared to 19ms in scenario E), but on 839 average it starts sending another file after 40ms. So 'b' sends 79% 840 of the time, which is reflected in its congestion bit-rate of 20kb/s 841 * 79% = 16kb/s (rounded). 843 However, during the 45% of the time that 'c' sends a large file, b's 844 completion time is higher than average (as shown in Figure 3). In 845 fact, as shown in the maximum completion time column, 'b' completes 846 in 46ms, but it starts sending a new file after 40ms, which is before 847 the previous one has completed. Therefore, during each of c's large 848 files, 'b' sends 46/40 = 116% of the time on average. 850 This actually means 'b' is overlapping two files for 16% of the time 851 on average and sending one file for the remaining 84%. Whenever two 852 file transfers overlap, 'b' will be causing 2 x 20kb/s = 40kb/s of 853 congestion, which explains why tenant b in scenario G is the only 854 case with a maximum congestion rate of 40kb/s rather than 20kb/s as 855 in every other case. Over the duration of c's large files, 'b' would 856 therefore cause congestion at an average rate of 20kb/s * 84% + 40kb/ 857 s * 16% = 23kb/s (or more simply 20kb/s * 116% = 23kb/s). Of course, 858 when 'c' is not sending a large file, 'b' will contribute less to 859 congestion, which is why its average congestion rate is 16kb/s 860 overall, as discussed earlier. 862 3.6.2. Congestion Policing of On-Off Flows 864 Still referring to the numerical examples in Table 2, we will now 865 discuss the effect of limiting each tenant with a congestion policer. 867 The network operator might have deployed congestion policers to cap 868 each tenant's average congestion rate to 16kb/s. None of the tenants 869 are exceeding this limit in any of the scenarios, but tenant 'b' is 870 just shy of it in scenario G. Therefore all the tenants would be free 871 to behave in all sorts of ways, such as those of scenarios E-G. 873 However, they would be prevented from degrading the performance of 874 the other tenants beyond the point reached by tenant 'b' in scenario 875 G. If tenant 'b' added more load, the policer would prevent the extra 876 load entering the network by focusing drop solely on tenant 'b', 877 preventing the other tenants from experiencing any more congestion 878 due to tenant 'b'. Then tenants 'a' and 'c' would be assured the 879 (average) apparent bit-rates shown, whatever the behaviour of 'b'. 881 If 'a' added more load, 'c' would not suffer. Instead 'b' would go 882 over limit and its rate would be trimmed during congestion peaks, 883 sacrificing some of its lead to 'a'. Similarly, if 'c' added more 884 load, 'b' would be made to sacrifice some of its performance, so that 885 'a' would not suffer. 
Further, if more tenants arrived to share the 886 same link, the policer would force 'b' to sacrifice performance in 887 favour of the additional tenants. 889 There is nothing special about a policer limit of 16kb/s. The example 890 when discussing infinite flows used a limit of 40kb/s per tenant. 891 And some tenants can be given higher limits than others (e.g. at an 892 additional charge). If the operator gives out congestion limits that 893 together add up to a higher amount but it doesn't increase the link 894 capacity, it merely allows the tenants to apply more load (e.g. more 895 files of the same size in the same time), but each with lower bit- 896 rate. 898 3.7. Weighted Congestion Controls 900 At high speed, congestion controls such as Cubic TCP, Data Centre 901 TCP, Compound TCP etc. all contribute to congestion at widely 902 differing rates, which is called their 'aggressiveness' or 'weight'. 903 So far, we have made the simplifying assumption of a scalable 904 congestion control algorithm that contributes to congestion at a 905 constant rate of w = 20kb/s. We now assume tenant 'c' uses a similar 906 congestion control to before, but with different parameters in the 907 algorithm so that its weight is still constant, but much lower, at w 908 = 2.2kb/s. 910 Tenant 'b' is sending smaller files than 'c', so it still uses w = 911 20kb/s. Then, when the two compete for the 1Gb/s link, they will 912 share it in proportion to their weights, 20:2.2 (or 90%:10%). That 913 is, when competing, 'b' and 'c' will respectively get (20/22.2)*1Gb/s 914 = 900Mb/s and (2.2/22.2)*1Gb/s = 100Mb/s of the 1Gb/s link. Figure 4 915 shows the situation before (upper) and after (lower) this change. 917 When the two compete, 'b' transfers each file 9/5 times as fast as before 918 (900Mb/s rather than 500Mb/s), so it completes them in 5/9 of the 919 time. 'b' still contributes congestion at the same rate of 20kb/s, 920 but for only 5/9 of the time it did before. Therefore, relative to before, 921 'b' uses up its allowance only just over half as quickly, and it 922 completes each transfer nearly twice as fast. 924 Even though 'b' completes each small file much faster, 'c' should 925 complete its file transfer hardly any later than before. Although 926 tenant 'b' goes faster, as each 'b' file finishes, it gets out of the 927 way sooner. Irrespective of its lower weight, 'c' can still use the 928 full link when 'b' is not present, because weights only determine 929 shares between active flows. Because 'c' has the whole link sooner 930 it can catch up to the same point it would have reached in its 931 download if 'b' had been slower but longer. Tenant 'c' will probably 932 lose some completion time because it has to accelerate and decelerate 933 more. But, whenever it is sending a file, 'c' gains (20kb/s - 2.2kb/s) 934 -- nearly 18kb of allowance every second -- which it can use for other 935 transfers. 937 ^ bit-rate 938 | 939 |---------------------------------------------------.--,--.--,------- 940 | |/\| |\/| 941 | c |\/| b|/\| c 942 |------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,---- 943 | b | | b | | b | | b | | b | | b |\/| |/\| | b 944 | | | | | | | | | | | |/\| |\/| | 945 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'----> 946 time 948 ^ bit-rate 949 | 950 |---------------------------------------------------.--,--.--,------- 951 |---. ,---. ,---. ,---. ,---. ,---. |/\| |\/| ,---.
952 | | | | | | c | | | | | | |\/| b|/\| c| | 953 | | | | | | | | | | | | |/\| |\/| | | 954 | b | | b | | b | | b | | b | | b | |\/| |/\| | b | 955 | | | | | | | | | | | | |/\| |\/| | | 956 +---'-----'---'----'---'----'---'----'---'----'---'-'--'--'--'--'---'> 957 time 959 Figure 4: Weighted congestion controls with equal weights (upper) and 960 unequal (lower) 962 It seems too good to be true that both tenants gain so much and lose 963 so little by 'c' reducing its aggressiveness. Effectively 'c' allows 964 everyone else's shorter flows to proceed as if the long flows were 965 hardly there, and 'c' hardly loses out at all itself. The gains are 966 unlikely to be as perfect as this simple model predicts, but we 967 believe they will be nearly as substantial. 969 It might seem that everyone can keep gaining by everyone agreeing to 970 reduce their weights, ad infinitum. However, the lower the weight, 971 the less signals the congestion control gets, so it starts to lose 972 its control during dynamics. Nonetheless, congestion policing should 973 encourage congestion control designs to keep reducing their weights, 974 but they will have to stop when they reach the minimum necessary 975 congestion in order to maintain sufficient control signals. 977 3.8. A Network of Links 979 So far we have only considered a single link. Congestion policing at 980 the network edge is designed to work across a network of links, 981 treating them all as a pool of resources, as we shall now explain. 982 We will use the dual-homed data centre topology shown in Figure 5 983 (stretching the bounds of ASCII art) as a simple example of a pool of 984 resources. 986 In this case where there are 48 servers (H1, H2, ... Hn where n=48) 987 on the left, with on average 8 virtual machines (VMs) running on each 988 (e.g. server n is running Vn1, Vn2, ... to Vnm where m = 8). Each 989 server is connected by two 1Gb/s links, one to each top-of-rack 990 switch S1 & S2. To the right of the switches, there are 6 links of 991 10Gb/s each, connecting onwards to customer networks or to the rest 992 of the data centre. There is a total of 48 *2 *1Gb/s = 96Gb/s 993 capacity between the 48 servers and the 2 switches, but there is only 994 6 *10Gb/s = 60Gb/s to the right of the switches. Nonetheless, data 995 centres are often designed with some level of contention like this, 996 because at the ToR switches a proportion of the traffic from certain 997 hosts turns round locally towards other hosts in the same rack. 999 virtual hosts switches 1000 machines 1002 V11 V12 V1m __/ 1003 * * ... * H1 ,-.__________+--+__/ 1004 \___\__ __\____/`-' __-|S1|____,-- 1005 `. _ ,' ,'| |_______ 1006 . H2 ,-._,`. ,' +--+ 1007 . . `-'._ `. 1008 . . `,' `. 1009 . ,' `-. `.+--+_______ 1010 Vn1 Vn2 Vnm / `-_|S2|____ 1011 * * ... * Hn ,-.__________| |__ `-- 1012 \___\__ __\____/`-' +--+ \__ 1013 \ 1015 Figure 5: Dual-Homed Topology -- a Simple Resource Pool 1017 The congestion policer proposed in this document is based on the 1018 'hose' model, where a tenant's congestion allowance can be used for 1019 sending data over any path, including many paths at once. Therefore, 1020 any one of the virtual machines on the left can use its allowance to 1021 contribute to congestion on any or all of the 6 links on the right 1022 (or any other link in the diagram actually, including those from the 1023 server to the switches and those turning back to other hosts). 
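One practical consequence of the hose model is that the only state a congestion policer needs is per tenant, at the ingress; there is no per-link or per-path state. The fragment below (Python, purely illustrative; the names are invented) shows the accounting step applied to every ConEx-marked packet, regardless of which downstream link the congestion occurred on:

   from collections import defaultdict

   marked_bits = defaultdict(int)    # tenant -> ConEx-marked bits seen at ingress

   def account(tenant_id, dst, size_bits, conex_marked):
       # the destination, and hence the downstream path, plays no part in
       # the accounting; only the identity of the sending tenant matters
       if conex_marked:
           marked_bits[tenant_id] += size_bits

In contrast, a per-link scheduler such as WRR would need weight state configured at every link of Figure 5 for every tenant that might use it.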
1025 Nonetheless, if congestion policers are to enforce performance 1026 isolation, they should stop one tenant squeezing the capacity 1027 available to another tenant who needs to use a particular bottleneck 1028 link or links (e.g. to reach a particular receiver). Policing should 1029 work whether the offending tenant is acting deliberately or merely 1030 carelessly. 1032 3.8.1. Numerical Example: Isolation from Focused Load 1034 Consider the following scenario: two tenants have equal congestion 1035 allowances of 40kb/s, and let's imagine they each run servers that 1036 send about 1 flow per second. All their flows use the same scalable 1037 congestion control and all use the same aggressiveness or weight of 1038 20kb/s. 1040 Tenant A has clients all over the network, so it spreads its traffic 1041 over numerous links, and its flows usually complete within 1 second. 1042 Therefore, on average, A has one flow running at any one instant, so 1043 it consumes 20kb/s from its 40kb/s congestion allowance. 1045 Then tenant B decides (carelessly or deliberately) to concentrate all 1046 its flows into one of the links that A sometimes uses for one of its 1047 flows. All the flows through this new bottleneck (A's, B's and those 1048 from other tenants) will still each consume 20kb/s of allowance while 1049 they are active (Section 3.6), but they will all take longer to 1050 complete. Let's say they take 15 times longer to complete on average 1051 (i.e. 15s not 1s). And let's assume 5% of tenant A's flows pass 1052 through this link. Then each tenant will consume congestion 1053 allowance at: 1055 o Tenant A: (95% * 1 * 20kb/s) + (5% * 15 * 20kb/s) = 34kb/s 1057 o Tenant B: (100% * 15 * 20kb/s) = 300kb/s 1059 Tenant B's congestion token bucket will therefore drain 7.5 times 1060 faster than it is filling (300 / 40 = 7.5) so it will rapidly empty 1061 and start dropping some of tenant B's traffic before it enters the 1062 network. In fact, once tenant B's bucket is empty, it will be 1063 limited to no more than 40kb/s of congestion at the bottleneck link, 1064 rather than 300kb/s otherwise. In effect, the policer shifts packet 1065 drops in excess of the 40kb/s allowance out of the shared link and 1066 into itself, to focus excess discards only on tenant B. 1068 Once tenant B's bucket empties, tenant A's completion time for the 1069 flows through this bottleneck would not return to 1s, but it would be 1070 limited to about 2s (not 15s), and it would consume its congestion 1071 allowance at about 21kb/s, well within its limit of 40kb/s. 1073 3.8.2. Isolation in the Short-Term 1075 It seems problematic that tenant B's congestion policer takes time to 1076 empty before it kicks in, during which time other tenants suffer 1077 reduced performance. This is deliberate--to allow some give and take 1078 over time. Nonetheless, it is possible to limit this problem as 1079 well, using the dual token bucket already described in Figure 2. The 1080 second shallow bucket sets an immediate limit on how much tenant B 1081 can harm the performance of others. The fill-rate of the shallow 1082 bucket is c times that of the deeper bucket (let's say c = 8), so 1083 tenant B is immediately limited to 8 * 40kb/s = 320kb/s of 1084 congestion. In the scenario described above, tenant B only reaches 1085 300kb/s but at least this backstop prevents congestion from tenant B 1086 ever exceeding 320kb/s.
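The numbers in the last two subsections can be re-derived in a few lines; every constant below is taken from the text above (Python, illustrative only):

   w = 20e3             # congestion-bit-rate of one active flow, 20kb/s
   allowance = 40e3     # each tenant's contracted fill-rate, 40kb/s

   # Tenant A: 95% of its flows still complete in 1s; the 5% through the
   # hot link now take 15 times longer
   tenant_A = 0.95 * 1 * w + 0.05 * 15 * w     # = 34kb/s
   # Tenant B: all of its flows go through the hot link and take 15x longer
   tenant_B = 1.00 * 15 * w                    # = 300kb/s

   drain_ratio = tenant_B / allowance          # = 7.5, so B's deep bucket empties
   short_term_cap = 8 * allowance              # shallow bucket with c = 8: 320kb/s
   print(tenant_A, tenant_B, drain_ratio, short_term_cap)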
1088 3.8.3. Encouraging Load Balancing

1090 A policer may not even have to intervene directly for tenants to be 1091 protected; it might encourage load balancing to remove the problem 1092 first. Load balancing might either be provided by the network 1093 (usually just random), or some of the tenants might themselves 1094 actively shift traffic off the increasingly congested bottleneck and 1095 onto other paths. Some of them might be using the Multipath TCP 1096 protocol (MPTCP -- see the experimental [RFC6356]), which would achieve 1097 this automatically; or ultimately, if the congestion signals 1098 persisted, automatic VM placement algorithms might shift their 1099 virtual machines to different endpoints to circumvent the congestion 1100 hot-spot completely. Even if some tenants were not using MPTCP or 1101 could not shift easily, others shifting away would achieve the same 1102 outcome. Essentially, the deterrent effect of congestion policers 1103 encourages everyone to even out congestion across the network, 1104 shifting load away from hot spots. Then performance isolation 1105 becomes an emergent property of everyone's behaviour, due to the 1106 deterrent effect of policers, rather than always through explicit 1107 policer intervention.

1109 In such an environment, a policer is needed that shares out the whole 1110 pool of network resources, not one that just controls shares of each 1111 link individually.

1113 In contrast, enforcement mechanisms based on scheduling algorithms 1114 like WRR or WFQ have to be deployed at each link, and each one works 1115 in ignorance of how much of other links the tenant is using. These 1116 algorithms were designed for networks with a single known bottleneck 1117 per customer (e.g. many access networks). However, in data centres 1118 there are many potential bottlenecks and each tenant generally only 1119 uses a share of a small number of them. A mechanism like WRR would 1120 not isolate anyone's performance if it gave every tenant the right to 1121 use the same share of all the links in the network, without regard to 1122 how many they were using.

1124 3.9. Tenants of Different Sizes

1126 The scenario in Section 3.8 assumed tenants A & B had equal 1127 congestion allowances. If one tenant bought a huge allowance (e.g. 1128 500kb/s) while another tenant bought a small allowance (e.g. 10kb/s), 1129 it would seem that the former could focus all its traffic on one link 1130 to swamp the latter.

1132 At this point, it needs to be pointed out that congestion allowances 1133 do not remove the need for decent capacity planning. If we further 1134 assume that tenant A has bought the same access capacity as tenant B, 1135 50 times more congestion allowance implies that tenant A expects to be 1136 provided with 50 times the contention ratio of tenant B, i.e. any 1137 single link they share will be big enough for B still to have enough 1138 if it gets 1 share for every 50 used by A.

1140 However, there is still a problem in that A can store up congestion 1141 allowance and cause much more than 500kb/s of congestion in one 1142 burst. The dual token bucket arrangement (Figure 2) limits tenant A 1143 to c * 500kb/s of congestion, but if c=8 as before, 8 * 500kb/s = 1144 4Mb/s of congestion in a burst would seriously swamp tenant B if all 1145 focused on one link where tenant B was quietly consuming its 10kb/s 1146 allowance.

1148 The answer to this problem is that the burst limit c can be smaller 1149 for larger tenants, tending to 1 for the largest tenants (i.e. a 1150 single shallow bucket would suffice for huge tenants). The reasoning 1151 is that the c factor allows a tenant some give-and-take over time, 1152 but the aggregate of traffic from a larger tenant consists of enough 1153 flows that it has sufficient give-and-take within its own aggregate 1154 without needing give-and-take from others.
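The effect of the burst factor on these two tenants can be seen from a trivial calculation (illustrative only; the figures are those of the example above):

   # Burst exposure in Section 3.9's example.
   small_fill = 10e3    # small tenant's allowance [b/s]
   large_fill = 500e3   # large tenant's allowance [b/s]

   print(8 * large_fill)  # 4000000.0 b/s: a 4Mb/s burst if the large
                          # tenant keeps the c=8 factor of Section 3.8.2
   print(1 * large_fill)  # 500000.0 b/s: bounded by its average rate if
                          # its c factor tends to 1 (single shallow bucket)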
1156 3.10. Links of Different Sizes

1158 Congestion policing treats each Mb/s of capacity in one link as 1159 identical to each Mb/s of capacity in another link, even if the size of 1160 each link is different. For instance, consider the case where one of 1161 the three links to the right of each switch in Figure 5 were upgraded 1162 to 40Gb/s while the other two remained at 10Gb/s (perhaps to 1163 accommodate the extra traffic from a couple of the dual-homed 1Gb/s 1164 servers being upgraded to dual-homed 10Gb/s).

1166 A congestion control algorithm running at the same rate will cause 1167 the same level of congestion probability, whatever size of link it is 1168 sharing. For example:

1170 o If 50 equal flows share a 10Gb/s link (10Gb/s / 50 = 200Mb/s each) 1171 they will cause 0.01% congestion probability;

1173 o If 200 of the same flows share a 40Gb/s link (40Gb/s / 200 = 1174 200Mb/s each) they will still cause 0.01% congestion probability.

1176 This is because the congestion probability is determined by the 1177 congestion control algorithms, not by the link.

1179 Therefore, if an average of 300 flows were spread across the above 1180 links (1 x 40Gb/s and 2 x 10Gb/s), the numbers on each link would tend 1181 towards 200:50:50 respectively, so that each flow would get 200Mb/s 1182 and each link would have 0.01% congestion on it.

1184 Sometimes, there might be more flows on one link, resulting in less 1185 than 200Mb/s per flow and congestion higher than 0.01%. However, 1186 whenever the congestion level was greater on one link than another, 1187 congestion policing would encourage flows to balance out the 1188 congestion level across the links (as long as some flows could use 1189 congestion balancing mechanisms like MPTCP).

1191 In summary, all the outcomes of congestion policing described so far 1192 (emulating WRR, etc.) apply across a pool of diverse link sizes just 1193 as much as they apply to single links, without any need for the links 1194 to signal to each other or to a central controller.

1196 3.11. Diverse Congestion Control Algorithms

1198 Throughout this explanation we have assumed a scalable congestion 1199 control algorithm, which we justified (Section 3.4) as the 'boundary' 1200 case if congestion policing were to intervene, which is all that is 1201 relevant when considering whether the policer can enforce performance 1202 isolation.

1204 The defining difference between the scalable congestion control we have 1205 assumed and the congestion controls in widespread production 1206 operating systems (New Reno, Compound, Cubic, Data Centre TCP, etc.) is 1207 the way congestion probability decreases as flow-rate increases (for 1208 a long-running flow). With a scalable congestion control, if flow- 1209 rate doubles, congestion probability halves. Whereas, with most 1210 production congestion controls, if flow-rate doubles, congestion 1211 probability reduces to less than half. For instance, New Reno TCP 1212 reduces congestion to a quarter. The responses of Cubic and Compound 1213 are closer to the ideal scalable control than New Reno's, but they 1214 do not depart so far from New Reno that they cannot co-exist 1215 happily with it.
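This difference can be checked with a small illustrative calculation, using the steady-state models quoted later in this section -- rate x = v/p for a scalable control of weight v, and x = k/sqrt(p) for New Reno -- with purely hypothetical parameter values:

   # How congestion probability p scales with flow-rate x for a single
   # long-running flow (illustrative; parameter values are hypothetical).
   def p_scalable(x, v):          # scalable control: x = v/p  ->  p = v/x
       return v / x

   def p_new_reno(x, k):          # New Reno: x = k/sqrt(p)  ->  p = (k/x)^2
       return (k / x) ** 2

   v, k, x = 20e3, 1.5e6, 100e6   # weight, Reno constant, flow-rate [b/s]
   print(p_scalable(2*x, v) / p_scalable(x, v))   # 0.5: halves
   print(p_new_reno(2*x, k) / p_new_reno(x, k))   # 0.25: reduces to a quarter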
1217 Congestion policing still works, whether or not the congestion 1218 controls in daily use by tenants fit the scalable model. A bulk 1219 congestion policer constrains the sum of all the congestion controls 1220 being used by a tenant so that they collectively remain below an 1221 aggregate envelope that is itself shaped like the sum of many 1222 scalable algorithms. A bulk congestion policer will constrain the 1223 overall congestion effect (the sum) of any mix of algorithms within 1224 this envelope, including flows that are completely unresponsive to congestion.

1226 This will now be explained using Figure 6 (if the ASCII art is 1227 incomprehensible, see a similar plot at Figure 3 of [CongPol]).

1229 bit-rate[b/s] 1230 / \ 1231 | * / / / / / / / / / / / / / / / / / 1232 | +New Reno * Policed / / / / / / / / / / / / / 1233 | + TCP * / / / / / / / / / / / / / / / / 1234 | + * / / / / / / / / / / / / / / / 1235 | - - - - - -+- - - - - - - - -*- - - - - - - - - - - - - - - 1236 | + * / / / / / unresponsive/ / 1237 | + * / / / / / / / / / / / 1238 | + * / / / / / / / / / / 1239 | +* / / / / / / / 1240 | * / / /+/ / 1241 | * / / 1242 |____________________________________________________________\ 1243 0 ' 0.2%' ' 0.4%' 'congestion/

1245 Figure 6: Schematic Plot of Bulk Policing of Traffic with Different 1246 Congestion Responses

1248 The congestion policer prevents the aggregate of a tenant's traffic 1249 from operating in the hatched region shown in Figure 6 (there is 1250 deliberately no vertical scale, because the plot is schematic only). 1251 The tenant can have either high congestion or high rate, but not both 1252 at the same time. It can be seen that the single New Reno TCP flow 1253 responds to congestion in a similar way, but with a flatter slope. 1254 It can also be seen that, at 0.25% prevailing congestion, this 1255 policer would allow roughly two simultaneous New Reno TCP flows 1256 without intervening. At about 0.45% congestion it would allow only 1257 one, and above 0.45% it would start to police even one flow. But at 1258 the much lower congestion levels typical in a high-capacity data centre 1259 (e.g. 0.01%) the policer would allow numerous New Reno TCP flows 1260 (extrapolating the curve upwards).

1262 The horizontal plot represents an unresponsive (UDP) flow. The 1263 congestion policer would allow one of these flows to run unimpeded up 1264 to roughly 0.3% congestion, when the policer would effectively take 1265 control and impose its own congestion response on this flow, dropping 1266 more UDP packets for higher levels of prevailing congestion. If a 1267 tenant ran two of these UDP flows (or one at twice the rate), the 1268 policer would leave them alone unless prevailing congestion was above 1269 about 0.22%, at which point it would intervene (by extrapolation of the visible 1270 plots).

1272 The New Reno TCP curve follows the characteristic k/sqrt(p) rule, 1273 while the 'Policed' curve follows a simple w/p rule, where k is the 1274 New Reno constant and w is the constant contracted fill-rate of a 1275 particular policer. A scalable TCP (not shown) would follow a v/p 1276 rule where v is the weight of the individual flow. If we imagine a 1277 scenario where w = 8v, the policer would allow 8 of these scalable 1278 TCP flows (averaged over time), and this would be true at any 1279 congestion level, because the curve of 8 scalable TCP flows would 1280 exactly match the 'Policed' curve.
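The envelope can also be expressed as a simple test: the tenant's aggregate congestion-bit-rate (each flow's bit-rate multiplied by the prevailing congestion probability, summed over all flows) must stay within the contracted fill-rate w. The sketch below is illustrative only, with hypothetical figures; it also confirms the w = 8v observation above:

   # Illustrative sketch of the aggregate envelope in Figure 6: a tenant
   # stays unpoliced while the sum of (flow bit-rate x prevailing
   # congestion probability p) stays within the contracted fill-rate w.
   def congestion_bit_rate(flow_rates_bps, p):
       return sum(x * p for x in flow_rates_bps)

   w = 160e3                        # hypothetical contracted fill-rate [b/s]
   v = w / 8                        # weight of each scalable flow if w = 8v
   for p in (0.0001, 0.001, 0.01):  # 0.01%, 0.1% and 1% congestion
       flows = [v / p] * 8          # 8 scalable flows, each at rate x = v/p
       print(p, congestion_bit_rate(flows, p))  # ~160000 b/s = w at every
                                                # congestion level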
1282 So far, the discussion has focused on the congestion avoidance phase 1283 of each congestion response. As each TCP flow starts, it typically 1284 increases exponentially (TCP slow-start) until it overshoots 1285 available capacity, causing a spike of loss due to the feedback delay 1286 over the following round-trip. The ConEx audit mechanism requires 1287 the source to send credit markings in advance to cover the risk of 1288 these spikes of loss. Credit-marked ConEx packets drain tokens from 1289 the congestion policer similarly to regular ConEx-marked packets, 1290 which reduces the available congestion allowance for other flows. If 1291 the tenant starts new flows fast enough, these credit markings alone 1292 will empty the token bucket, so that the policer starts to discard 1293 some packets from the start-up phase before they can enter the 1294 network. (The tunnel feedback method for signalling congestion back 1295 to the policer lacks such a credit mechanism, which is an important 1296 advantage of the ConEx protocol, given Internet traffic generally 1297 consists of a high proportion of short flows.)

1299 A tenant could run a mix of multiple different types of TCP and UDP 1300 flows, as long as, when all stacked up, the sum of them all still 1301 fitted under the 'Policed' curve at the prevailing level of 1302 congestion.

1304 4. Parameter Setting

1306 The primary parameter to set for each tenant is the contracted fill- 1307 rate of their congestion token bucket. For a server sending 1308 predominantly long-running flows (e.g. a video server or an indexing 1309 robot filling search databases), this fill-rate can be determined 1310 from the required number of simultaneous TCP flows and the typical 1311 congestion-bit-rate of one TCP flow.

1313 For a scalable TCP flow, the congestion-bit-rate is constant. But 1314 currently (2013) TCP Cubic is predominant, and the congestion-bit- 1315 rate of one Cubic flow is not constant, but depends somewhat on bit- 1316 rate and RTT. For example, a Cubic flow with 100Mb/s throughput and 1317 50ms RTT has a congestion-bit-rate of ~2kb/s, whereas at 200Mb/s and 1318 5ms RTT, its congestion-bit-rate is ~6kb/s. Therefore, a reasonable 1319 rule of thumb for Cubic's congestion-bit-rate at rates and RTTs of 1320 about this order is 5kb/s.

1322 Therefore, if a tenant wanted to serve no more than 12 1323 simultaneous TCP Cubic flows on average, their contracted congestion-rate should 1324 be roughly 12 * 5kb/s = 60kb/s.

1326 The contracted fill-rate of a congestion policer should not need to 1327 change as flow-rates and link capacities scale, assuming the tenant 1328 still utilises the higher capacity to the same extent.

1330 A congestion policer based on the dual token bucket in Figure 2 also 1331 needs to specify the c factor that determines the maximum congestion- 1332 rate allowed, rather than the average. A theoretical approach to 1333 setting this factor is being worked on but, in the meantime, a 1334 figure of about 10 seems reasonable for a tenant sending only a few 1335 simultaneous flows (see Section 3.8.2), while Section 3.9 explains 1336 that the c factor should tend to reduce to 1 for very large 1337 tenants.
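As a worked illustration of the above rules of thumb (all figures are taken from this section; nothing here is normative):

   # Sizing a small tenant's congestion policer from the rules of thumb.
   CUBIC_CONGESTION_BIT_RATE = 5e3  # ~5kb/s per long-running Cubic flow
   simultaneous_flows = 12          # tenant's average concurrent flows
   c = 10                           # suggested burst factor for a small tenant

   deep_fill_rate = simultaneous_flows * CUBIC_CONGESTION_BIT_RATE
   print(deep_fill_rate)            # 60000.0 b/s: contract roughly 60kb/s
   print(c * deep_fill_rate)        # 600000.0 b/s: the shallow bucket's
                                    # fill-rate, capping the instantaneous
                                    # congestion-rate (Section 3.8.2)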
1339 The dual token bucket makes isolation fairly insensitive to the depth 1340 of the deeper bucket, because the shallower bucket constrains the 1341 congestion burst rate, which makes the congestion burst size less 1342 important. It is believed that the deeper bucket depth could be 1343 minutes or even hours of token fill. The depth of the shallower 1344 bucket should be just a few MTUs' worth--more experimentation is needed to 1345 determine how few.

1347 Indeed, all the above recommendations on parameter setting are best 1348 considered as initial values to be used in experiments to determine 1349 whether other values would be better.

1351 No recommendations can yet be given for parameter settings for a 1352 tenant running a large proportion of short flows. This requires 1353 further experimentation.

1355 5. Security Considerations

1357 This document is about performance isolation; therefore the whole 1358 document concerns security. Where ConEx is used, the specification 1359 of the ConEx Abstract Mechanism [I-D.ietf-conex-abstract-mech] should 1360 be consulted for security matters, particularly the design of audit 1361 to assure the integrity of ConEx signals.

1363 6. IANA Considerations

1365 {Section to be removed by the RFC Editor}. This document does not 1366 require actions by IANA.

1368 7. Conclusions

1370 {ToDo}

1372 8. Acknowledgments

1374 Bob Briscoe is part-funded by the European Community under its 1375 Seventh Framework Programme through the Trilogy 2 project (ICT- 1376 317756). The views expressed here are solely those of the author.

1378 9. Informative References

1380 [CPolTrilogyExp] 1381 Raiciu, C., Ed., "Progress on resource control", Trilogy 1382 EU 7th Framework Project ICT-216372 Deliverable 9, 1383 December 2009, .

1386 [CongPol] Jacquet, A., Briscoe, B., and T. Moncaster, "Policing 1387 Freedom to Use the Internet Resource Pool", Proc ACM 1388 Workshop on Re-Architecting the Internet (ReArch'08), 1389 December 2008, 1390 .

1392 [DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, 1393 P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data 1394 Center TCP (DCTCP)", ACM SIGCOMM CCR 40(4):63--74, October 1395 2010, .

1397 [I-D.briscoe-conex-data-centre] 1398 Briscoe, B. and M. Sridharan, "Network Performance 1399 Isolation in Data Centres using Congestion Policing", 1400 draft-briscoe-conex-data-centre-01 (work in progress), 1401 February 2013.

1403 [I-D.briscoe-conex-initial-deploy] 1404 Briscoe, B., "Initial Congestion Exposure (ConEx) 1405 Deployment Examples", draft-briscoe-conex-initial- 1406 deploy-03 (work in progress), July 2012.

1408 [I-D.ietf-conex-abstract-mech] 1409 Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1410 Concepts and Abstract Mechanism", draft-ietf-conex- 1411 abstract-mech-08 (work in progress), October 2013.

1413 [I-D.ietf-conex-mobile] 1414 Kutscher, D., Mir, F., Winter, R., Krishnan, S., Zhang, 1415 Y., and C. Bernardos, "Mobile Communication Congestion 1416 Exposure Scenario", draft-ietf-conex-mobile-03 (work in 1417 progress), February 2014.

1419 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 1420 RFC 3649, December 2003.

1422 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1423 Control", RFC 5681, September 2009.

1425 [RFC6356] Raiciu, C., Handley, M., and D. Wischik, "Coupled 1426 Congestion Control for Multipath Transport Protocols", RFC 1427 6356, October 2011.

1429 [Relentless] 1430 Mathis, M., "Relentless Congestion Control", Proc. Int'l 1431 Wkshp on Protocols for Future, Large-scale & Diverse 1432 Network Transports (PFLDNeT'09), May 2009, 1433 .

1435 [STCP] Kelly, T., "Scalable TCP: Improving Performance in 1436 Highspeed Wide Area Networks", ACM SIGCOMM Computer 1437 Communication Review 32(2), April 2003, 1438 .
1440 [conex-dc_tr] 1441 Briscoe, B., "Network Performance Isolation in Data Centres 1442 by Congestion Exposure to Edge Policers", BT Technical 1443 Report TR-DES8-2011-004, November 2011. 1445 To appear.

1447 Appendix A. Summary of Changes between Drafts

1449 From briscoe-00 to briscoe-01: No technical differences; merely 1450 clarified throughout & updated references.

1452 From briscoe-conex-data-centre-00 to briscoe-conex-policing-00: This 1453 draft was separated out from the -00 version of 1454 [I-D.briscoe-conex-data-centre], taken mostly from what was 1455 section 4. Changes from that draft are listed below:

1457 From section 4 of draft-briscoe-conex-data-centre-00:

1459 + New Introduction

1461 + Added Section 2. "Example Bulk Congestion Policer"

1463 + Rearranged and clarified the previous introductory text that 1464 preceded what is now Section 3.4 into three subsections: 1466 3.1 "The Problem" 1468 3.2 "Approach" 1470 3.3 "Terminology"

1472 + Corrected numerical examples: 1474 - Section 3.4 "Long-Running Flows": 0.04% to 0.004% and 1475 400kb/s to 40kb/s 1477 - Section 3.6.1 "Numerical Examples Without Policing": 1478 10kb/s to 20kb/s 1480 - Section 3.7 "Weighted Congestion Controls": 22.2 to 20

1482 + Clarified where necessary, especially the descriptions of 1483 the scenarios in 3.7 "Weighted Congestion Controls" and 3.8 1484 "A Network of Links".

1486 + Section 3.8 "A Network of Links": added a numerical example 1487 including congestion burst limiting that was not covered at 1488 all previously

1490 + Added Section 3.9 "Tenants of Different Sizes"

1492 + Filled in absent text in: 1494 - Section 3.11 "Diverse Congestion Control Algorithms" 1495 - Section 4. "Parameter Setting" 1497 - Section 5. "Security Considerations"

1499 + Minor corrections throughout and updated references.

1501 Author's Address

1503 Bob Briscoe 1504 BT 1505 B54/77, Adastral Park 1506 Martlesham Heath 1507 Ipswich IP5 3RE 1508 UK

1510 Phone: +44 1473 645196 1511 EMail: bob.briscoe@bt.com 1512 URI: http://bobbriscoe.net/