ConEx                                                         B. Briscoe
Internet-Draft                                                         BT
Intended status: Informational                              M. Sridharan
Expires: August 18, 2014                                        Microsoft
                                                        February 14, 2014

 Network Performance Isolation in Data Centres using Congestion Policing
                   draft-briscoe-conex-data-centre-02

Abstract

This document describes how a multi-tenant (or multi-department) data centre operator can isolate tenants from network performance degradation due to each other's usage, but without losing the multiplexing benefits of a LAN-style network where anyone can use any amount of any resource. Zero per-tenant configuration and no implementation change is required on network equipment. Instead the solution is implemented with a simple change to the hypervisor (or container) beneath the tenant's virtual machines on every physical server connected to the network. These collectively enforce a very simple distributed contract - a single network allowance that each tenant can allocate among their virtual machines, even if distributed around the network. The solution uses layer-3 switches that support explicit congestion notification (ECN). It is best if the sending operating system supports congestion exposure (ConEx). Nonetheless, the operator can unilaterally deploy a complete solution while operating systems are being incrementally upgraded to support ConEx and ECN.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).
Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on August 18, 2014. 48 Copyright Notice 49 Copyright (c) 2014 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2. Features of the Solution . . . . . . . . . . . . . . . . . . . 4 66 3. Outline Design . . . . . . . . . . . . . . . . . . . . . . . . 7 67 4. Performance Isolation: Intuition . . . . . . . . . . . . . . . 9 68 4.1. Performance Isolation: The Problem . . . . . . . . . . . . 9 69 4.2. Why Congestion Policing Works . . . . . . . . . . . . . . 11 70 5. Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 71 5.1. Trustworthy Congestion Signals at Ingress . . . . . . . . 13 72 5.1.1. Tunnel Feedback vs. ConEx . . . . . . . . . . . . . . 14 73 5.1.2. ECN Recommended . . . . . . . . . . . . . . . . . . . 14 74 5.1.3. Summary: Trustworthy Congestion Signals at Ingress . . 15 75 5.2. Switch/Router Support . . . . . . . . . . . . . . . . . . 16 76 5.3. Congestion Policing . . . . . . . . . . . . . . . . . . . 17 77 5.4. Distributed Token Buckets . . . . . . . . . . . . . . . . 18 78 6. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 19 79 6.1. Migration . . . . . . . . . . . . . . . . . . . . . . . . 19 80 6.2. Evolution . . . . . . . . . . . . . . . . . . . . . . . . 20 81 7. Related Approaches . . . . . . . . . . . . . . . . . . . . . . 20 82 8. Security Considerations . . . . . . . . . . . . . . . . . . . 21 83 9. IANA Considerations (to be removed by RFC Editor) . . . . . . 21 84 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 21 85 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 21 86 12. Informative References . . . . . . . . . . . . . . . . . . . . 21 87 Appendix A. Summary of Changes between Drafts (to be removed 88 by RFC Editor) . . . . . . . . . . . . . . . . . . . 23 90 1. Introduction 92 A number of companies offer hosting of virtual machines on their data 93 centre infrastructure--so-called infrastructure as a service (IaaS) 94 or 'cloud computing'. A set amount of processing power, memory, 95 storage and network are offered. Although processing power, memory 96 and storage are relatively simple to allocate on the 'pay as you go' 97 basis that has become common, the network is less easy to allocate, 98 given it is a naturally distributed system. 
100 This document describes how a data centre infrastructure provider can 101 offer isolated network performance to each tenant by deploying 102 congestion policing at every ingress to the data centre network, e.g. 103 in all the hypervisors (or containers). The data packets pick up 104 congestion information as they traverse the network, which is brought 105 to the ingress using one of two approaches: feedback tunnels or ConEx 106 (or a mix of the two). Then, these ingress congestion policers have 107 sufficient information to limit the amount of congestion any tenant 108 can cause anywhere in the whole meshed pool of data centre network 109 resources. This isolates the network performance experienced by each 110 tenant from the behaviour of all the others, without any tenant- 111 related configuration on any of the switches. 113 _How _it works is very simple and quick to describe. _Why_ this 114 approach provides performance isolation may be more difficult to 115 grasp. In particular, why it provides performance isolation across a 116 network of links, even though there is no isolation mechanism in each 117 link. Essentially, rather than limiting how much traffic can go 118 where, traffic is allowed anywhere and the policer finds out whenever 119 and wherever any traffic causes a small amount of congestion so that 120 it can prevent heavier congestion. 122 This document explains how it works, while a companion document 123 [conex-policing] builds up an intuition for why it works. 124 Nonetheless to make this document self-contained, brief summaries of 125 both the 'how' and the 'why' are given in sections 3 & 4. Then 126 Section 5 gives details of the design and Section 6 explains the 127 aspects of the design that enable incremental deployment. Finally 128 Section 7 introduces other attempts to solve the network performance 129 isolation problem and why they fall down in various ways. 131 The solution would also be just as applicable to isolate the network 132 performance of different departments within the private data centre 133 of an enterprise, which could be implemented without virtualisation. 134 However, it will be described as a multi-tenant scenario, which is 135 the more difficult case from a security point of view. 137 2. Features of the Solution 139 The following goals are met by the design, each of which is explained 140 subsequently: 142 o Performance isolation 144 o No loss of LAN-like openness and multiplexing benefits 146 o Zero tenant-related switch configuration 148 o No change to existing switch implementations 150 o Weighted performance differentiation 152 o Ultra-Simple contract--per-tenant network-wide allowance 154 o Sender constraint, but with transferrable allowance 156 o Transport-agnostic 158 o Extensible to wide-area and inter-data-centre interconnection 160 o Doesn't require traffic classes, or manages traffic within each 161 class 163 Performance Isolation with Openness of a LAN: The primary goal is to 164 ensure that each tenant of a data centre receives a minimum 165 assured performance from the whole network resource pool, but 166 without losing the efficiency savings from multiplexed use of 167 shared infrastructure (work-conserving). There is no need for 168 partitioning or reservation of network resources. 170 Zero Tenant-Related Switch Configuration: Performance isolation is 171 achieved with no per-tenant configuration of switches. All switch 172 resources are potentially available to all tenants. 
174 Separately, _forwarding_ isolation may (or may not) be configured 175 to ensure one tenant cannot receive traffic from another's virtual 176 network. However, _performance_ isolation is kept completely 177 orthogonal, and adds nothing to the configuration complexity of 178 the network. 180 No New Switch Implementation: Straightforward commodity switches (or 181 routers) are sufficient. Bulk explicit congestion notification 182 (ECN) is recommended, which is available in a growing range of 183 layer-3 switches (a layer-3 switch does switching at layer-2, but 184 it can use the Diffserv and ECN fields for traffic control if it 185 can find an IP header). 187 Weighted Performance Differentiation: A tenant gets network 188 performance in proportion to their allowance when constrained by 189 others, with no constraint otherwise. Importantly, this assurance 190 is not just instantaneous, but over time. And the assurance is 191 not just localised to each link but network-wide. This will be 192 explained later with reference to the numerical examples in 193 [conex-policing]. 195 Ultra-Simple Contract: The tenant needs to decide only two things: 196 The peak bit-rate connecting each virtual machine to the network 197 (as today) and an overall 'usage' allowance. This document 198 focuses on the latter. A tenant just decides one number for this 199 contracted allowance that can be shared between all the tenant's 200 virtual machines (VMs). The 'usage' allowance is a measure of 201 congestion-bit-rate, which will be explained later, but most 202 tenants will just think of it as a number, where more is better. 204 Multi-machine: A tenant operating multiple VMs has no need to decide 205 in advance which VMs will need more allowance and which less--an 206 automated process can allocate the allowance across the VMs, 207 shifting more to those that need it most, as they use it. 208 Therefore, performance cannot be constrained by poor choice of 209 allocations between VMs, removing a whole dimension from the 210 problem that tenants face when choosing their traffic contract. 211 The allocation process can be operated by the tenant, or provided 212 by the data centre operator as part of an enhanced platform to 213 complement the basic infrastructure (platform as a service or 214 PaaS). 216 Sender Constraint with transferrable allowance: By default, 217 constraints are always placed on data senders, determined by the 218 sending party's traffic contract. Nonetheless, if the receiving 219 party (or any other party) wishes to enhance performance it can 220 arrange this with the sender at the expense of its own sending 221 allowance. 223 For instance, when a VM sends data to a storage facility the 224 tenant that owns the VM consumes as much of their allowance as 225 necessary to achieve the desired sending performance. But by 226 default when that tenant later retrieves data from storage, the 227 storage facility is the sender, so the storage facility consumes 228 its allowance to determine performance in the reverse direction. 229 Nonetheless, during the retrieval request, the storage facility 230 can require that its sending 'costs' are covered by the receiving 231 VM's allowance. The design of this feature is beyond the scope of 232 this document, but the system provides all the hooks to build it 233 at the application (or transport) layer. 235 Transport-Agnostic: In a well-provisioned network, enforcement of 236 performance isolation rarely introduces constraints on network 237 behaviour. 
However, it continually counts how much each tenant is limiting the performance of others, and it will intervene to enforce performance isolation against only those tenants who most persistently constrain others. By default, this intervention is oblivious to flows and to the protocols and algorithms being used above the IP layer. However, flow-aware or application-aware prioritisation can be built on top, either by the tenant or by the data centre operator as a complementary PaaS facility.

Interconnection: The solution is designed so that interconnected networks can ensure each is accountable for the performance degradation it contributes to in other networks. If necessary, one network has the information to intervene at its ingress to limit traffic from another network that is degrading performance. Alternatively, with the proposed protocols, networks can see sufficient information in traffic arriving at their borders to give their neighbours financial incentives to limit the traffic themselves.

The present document focuses on a single-provider scenario, but evolution to interconnection with other data centres over wide-area networks, and interconnection with access networks, is briefly discussed in Section 6.2.

Intra-class: The solution does not need traffic to have been classified into classes. Or, if traffic is divided into classes, it manages contention for the resources of each class, independently of any scheduling between classes.

3.  Outline Design

   [Figure 1 (ASCII art, not reproduced here): each tenant's virtual
   machines (V11..V1m, ..., Vn1..Vnm) attach to the network through
   per-tenant congestion policers (T1, T2) in the hypervisor on each
   host (H1, H2, ..., Hn); the hosts connect to the network switches
   (S1, S2).]

The two (or more) policers associated with tenant T1 act as one logical policer.

           Figure 1: Edge Policing and the Hose Traffic Model

Edge policing: Traffic policing is located at the policy enforcement point where each sending host connects to the network, typically beneath the tenant's operating system in the hypervisor controlled by the infrastructure operator (Figure 1). In this respect, the approach has a similar arrangement to the Diffserv architecture with traffic policers forming a ring around the network [RFC2475].

(Multi-)Hose model: Each policer controls all traffic from the set of VMs associated with each tenant without regard to destination, similar to the Diffserv 'hose' model. If the tenant has VMs spread across multiple physical hosts, they are all constrained by one logical policer that feeds tokens to individual sub-policers within each hypervisor on each physical host (e.g. the two policers associated with tenant T1 in Figure 1). In other words, the network is treated as one resource pool.

Congestion policing: A congestion policer is very similar to a traditional bit-rate policer. A classifier associates each packet with the relevant tenant's meter to drain tokens from the associated token bucket, while at the same time the bucket fills with tokens at the tenant's contracted rate (Figure 2).

However, unlike a traditional policer, the tokens in a congestion policer represent congested bits (i.e. discarded or ECN-marked bits), not just any bits. So, the bits in ECN-marked packets in Figure 2 count as congested bits, while all other bits don't drain anything from the token bucket--unmarked packets are invisible to the meter. And a tenant's contracted fill rate (wi for tenant Ti in Figure 2) is only the rate of congested bits, not all bits. Then if, on average, any tenant tries to cause more congestion than their allowance, the policer will focus discards on that tenant's traffic to prevent any further increase in congestion for everyone else.

The detailed design section describes how congestion policers at the network ingress know the congestion that each packet will encounter in the network, as well as how the congestion policer limits both peak and average rates of congestion. (A minimal sketch of such a policer is given at the end of this section.)

   [Figure 2 (ASCII art, not reproduced here): a classifier separates
   the packet streams of tenants T1, T2, ..., Ti; a marking meter
   detects marked packets [*] in each stream, and only those marked
   bytes drain the tenant's congestion token bucket, which fills at
   the tenant's contracted rate (w1, w2, ..., wi); the bucket depth
   controls a policer that discards packets of tenants whose bucket
   is exhausted.]

              Figure 2: Bulk Congestion Policer Schematic

Optional Per-Flow policing: A congestion policer could be designed to focus policing on the particular data flow(s) contributing most to the excess congestion-bit-rate. However, since bulk per-tenant congestion policing is sufficient to protect other tenants, each tenant can choose per-flow policing for itself if it wants.

FIFO forwarding: If scheduling by traffic class is used in network buffers (for whatever reason), congestion policing can be used to isolate tenants from each other within each class. However, congestion policing will tend to keep queues short, therefore it is more likely that simple first-in first-out (FIFO) will be sufficient, with no need for any priority scheduling.

ECN marking recommended: All queues that might become congested should support bulk ECN marking. For any non-ECN-capable flows or packets, the solution enables ECN universally in the outer IP header of an edge-to-edge tunnel. It can use the edge-to-edge tunnel created by one of the network virtualisation overlay approaches, e.g. [nvgre, vxlan].

In the proposed approach, the network operator deploys capacity as usual--using previous experience to determine a reasonable contention ratio at every tier of the network. Then, the tenant contracts with the operator for the rate at which their congestion policer will allow them to contribute to congestion. [conex-policing] explains how the operator or tenant would determine an appropriate allowance.
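
To make the outline concrete, the following minimal sketch (in Python, with invented class and parameter names; it is an illustration under stated assumptions, not part of the specified design) shows the behaviour described above: only congestion-marked bytes drain the bucket, and discards ramp up gradually as the bucket approaches empty. The 90% ramp threshold is purely illustrative.

   import random
   import time

   class BulkCongestionPolicer:
       """Illustrative per-tenant congestion policer (hypervisor shim).

       Tokens represent congestion-bytes: the bucket fills at the
       tenant's contracted congestion-bit-rate and is drained only by
       the size of packets that carry a congestion signal (an ECN
       mark, a ConEx marking, or feedback from the egress tunnel)."""

       def __init__(self, fill_rate_Bps, depth_B):
           self.fill_rate = fill_rate_Bps   # contracted rate wi, bytes/s
           self.depth = depth_B             # bucket depth, bytes
           self.tokens = depth_B            # start with a full bucket
           self.last = time.monotonic()

       def _refill(self):
           now = time.monotonic()
           self.tokens = min(self.depth,
                             self.tokens + self.fill_rate * (now - self.last))
           self.last = now

       def admit(self, pkt_len_B, congestion_signalled):
           """Return True to forward the packet, False to discard it."""
           self._refill()
           if congestion_signalled:
               # Only congestion-marked bytes drain the bucket; all other
               # traffic is invisible to the policer.
               self.tokens -= pkt_len_B
           # Policing ramps up gradually as the bucket empties, rather
           # than switching abruptly to 100% discard when it is empty.
           fraction_empty = 1.0 - max(self.tokens, 0.0) / self.depth
           discard_prob = max(0.0, (fraction_empty - 0.9) / 0.1)
           return random.random() >= discard_prob

A hypervisor shim would call admit() for every packet sent by the tenant's VMs, setting congestion_signalled from an ECN mark, a ConEx marking or tunnel feedback, as described in Section 5.1.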

4.  Performance Isolation: Intuition

4.1.  Performance Isolation: The Problem

Network performance isolation traditionally meant that each user could be sure of a minimum guaranteed bit-rate. Such assurances are useful if traffic from each tenant follows relatively predictable paths and is fairly constant. If traffic demand is more dynamic and unpredictable (both over time and across paths), minimum bit-rate assurances can still be given, but they have to be very small relative to the available capacity, because a large number of users might all want to simultaneously share any one link, even though they rarely all use it at the same time.

This either means the shared capacity has to be greatly overprovided so that the assured level is large enough, or the assured level has to be small. The former is unnecessarily expensive; the latter doesn't really give a sufficiently useful assurance.

Round robin or fair queuing are other forms of isolation that guarantee that each user will get 1/N of the capacity of each link, where N is the number of active users at each link. This is fine if the number of active users (N) sharing a link is fairly predictable. However, if large numbers of tenants do not typically share any one link but at any time they all could (as in a data centre), a 1/N assurance is fairly worthless. Again, given N is typically small but could be very large, either the shared capacity has to be expensively overprovided, or the assured bit-rate has to be worthlessly small. The argument is no different for the weighted forms of these algorithms (WRR & WFQ).

Both these traditional forms of isolation try to give one tenant assured instantaneous bit-rate by constraining the instantaneous bit-rate of everyone else. This approach is flawed except in the special case when the load from every tenant on every link is continuous and fairly constant. The reality is usually very different: sources are on-off and the route taken varies, so that on any one link a source is more often off than on.

In these more realistic (non-constant) scenarios, the capacity available for any one tenant depends much more on _how often_ everyone else uses a link, not just _how much_ bit-rate everyone else would be entitled to if they did use it.

For instance, if 100 tenants are using a 1Gb/s link for 1% of the time, there is a good chance each will get the full 1Gb/s link capacity. But if just six of those tenants suddenly start using the link 50% of the time, whenever the other 94 tenants need the link, they will typically find 3 of these heavier tenants using it already. If a 1/N approach like round-robin were used, then the light tenants would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin cannot claim to isolate tenants from each other if they usually get 1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in the worst case when all 100 tenants are active).
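
The figures in this example can be checked with a few lines of arithmetic; the following snippet is purely illustrative and uses only the numbers given above:

   # Check of the round-robin example above (values from the text; the
   # light tenants' own 1% load is low enough to ignore here).
   n_heavy, p_heavy = 6, 0.5      # six tenants active 50% of the time
   link_Gbps = 1.0

   # On average, a light tenant arriving at the link finds
   # n_heavy * p_heavy = 3 heavy tenants already using it.
   expected_heavy_active = n_heavy * p_heavy

   # Under round-robin the light tenant then gets a 1/N share.
   typical_share = link_Gbps / (expected_heavy_active + 1)   # 0.25 Gb/s
   worst_case    = link_Gbps / 100                           # all 100 active

   print(typical_share, worst_case)   # 0.25 and 0.01 (Gb/s)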

In contrast, congestion policing is the key to network performance isolation because it focuses policing only on those tenants that go fast over congested path(s) excessively and persistently over time. This keeps congestion below a design threshold everywhere so that everyone else can go fast. In this way, congestion policing takes account of highly variable loads (varying in time and varying across routes). And, if everyone's load happens to be constant, congestion policing converges on the same outcome as the traditional forms of isolation.

The other flaw in the traditional approaches to isolation, like WRR & WFQ, is that they actually prevent long-running flows from yielding to brief bursts from lighter tenants. A long-running flow can yield to brief flows and still complete nearly as soon as it would have otherwise (the brief flows complete sooner, freeing up the capacity for the longer flow sooner). However, WRR & WFQ prevent flows from even seeing the congestion signals that would allow them to co-ordinate between themselves, because they isolate each tenant completely into separate queues.

In summary, superficially, traditional approaches with separate queues sound good for isolation, but:

1. not when everyone's load is variable, so each tenant has no assurance about how many other queues there will be;

2. and not when each tenant can no longer even see the congestion signals from other tenants, so no-one's control algorithms can determine whether they would benefit most by pushing harder or yielding.

4.2.  Why Congestion Policing Works

[conex-policing] explains why congestion policing works using numerical examples from a data centre and schematic traffic plots (in ASCII art). The bullets below provide a summary of that explanation, which builds from the simple case of long-running flows through a single link up to a full meshed network with on-off flows of different sizes and different behaviours:

o Starting with the simple case of long-running flows focused on a single bottleneck link, tenants get weighted shares of the link, much like weighted round robin, but with no mechanism in any of the links. This is because losses (or ECN marks) are random, so if one tenant sends twice as much bit-rate it will suffer twice as many lost bits (or ECN-marked bits). So, at least for constant long-running flows, regulating congestion-bits gives the same outcome as regulating bits;

o In the more realistic case where flows are not all long-running but a mix of short to very long, it is explained that bit-rate is not a sufficient metric for isolating performance; how _often_ a tenant is sending (or not sending) is the significant factor for performance isolation, not whether bit-rate is shared equally whenever a source happens to be sending;

o Although it might seem that data volume would be a good measure of how often a tenant is sending, we then show why it is not. For instance, a tenant can send a large volume of data but hardly affect the performance of others -- by being more responsive to congestion. Using congestion-volume (congestion-bit-rate over time) in a policer encourages large data senders to be more responsive (to yield), giving other tenants much higher performance while hardly affecting their own performance.
Whereas, using straight volume as an allocation metric provides no distinction between high volume sources that yield and high volume sources that do not yield (the widespread behaviour today); a small worked comparison of the two metrics is given at the end of this section;

o We then show that a policer based on the congestion-bit-rate metric works across a network of links treating it as a pool of capacity, whereas other approaches treat each link independently, which is why the proposed approach requires none of the configuration complexity on switches that is involved in other approaches.

o We also show that a congestion policer can be arranged to limit bursts of congestion from sources that focus traffic onto a single link, even where one source may consist of a large aggregate of sources.

o We show that a congestion policer rewards traffic that shifts to less congested paths (e.g. multipath TCP or virtual machine motion). This means congestion policing encourages and ultimately forces end-systems to balance their load over the whole pool of bandwidth. The network can attempt to balance the load, but bulk congestion policing is particularly designed to encourage end-systems to do the job, either at the transport layer with multipath TCP [RFC6356] or at the application layer by moving virtual machines or choosing peer virtual machines in a similar way to BitTorrent.

o We show that congestion policing works on the pool of links, irrespective of whether individual links have significantly different capacities.

o We show that a congestion policer allows a wide variety of responses to congestion (e.g. New Reno TCP, Cubic TCP, Compound TCP, Data Centre TCP and even unresponsive UDP traffic), while still encouraging and enforcing a sufficient response to congestion from all sources taken together.

o Congestion policing can and will enforce a congestion response if a tenant persistently causes excessive congestion. This ensures that each tenant's minimum performance is isolated from the combined effects of everyone else. However, the purpose of congestion policing is not to intervene in everyone's rate control all the time. Rather it is to encourage each tenant to avoid being policed -- to keep the aggregate of all their flows' responses to congestion within an overall envelope and balanced across the network.

[conex-policing] also includes a section that gives guidance on how to estimate appropriate fill rates and sizes for congestion token buckets.
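
As a purely illustrative comparison of the volume and congestion-volume metrics (the traffic figures below are invented for illustration, not taken from [conex-policing]), consider two tenants that transfer the same volume in an hour but respond differently to congestion:

   # Two tenants each transfer the same volume over an hour, but one
   # yields to congestion while the other keeps pushing through it.
   GB = 8e9   # bits per gigabyte

   volume_sent   = {"yielding": 100 * GB, "unresponsive": 100 * GB}
   # Fraction of each tenant's bits that ended up ECN-marked (or lost).
   marking_ratio = {"yielding": 0.0002, "unresponsive": 0.01}

   for tenant, vol in volume_sent.items():
       congestion_volume = vol * marking_ratio[tenant]     # congestion-bits
       congestion_bit_rate = congestion_volume / 3600      # per second
       print(tenant, f"{congestion_volume/1e6:.0f} Mb congestion-volume,"
                     f" {congestion_bit_rate/1e3:.0f} kb/s average")

   # A volume-based allowance sees both tenants as identical (100 GB
   # each); a congestion-volume allowance sees the unresponsive tenant
   # consuming 50x more of the resource that actually matters to others.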

5.  Design

The design involves the following elements, each detailed in the following subsections:

1. Trustworthy Congestion Signals at Ingress

2. Switch/Router Support

3. Congestion Policing

4. Distributed Token Buckets

5.1.  Trustworthy Congestion Signals at Ingress

   [Figure 3 (ASCII art, not reproduced here): a Transport Sender and
   Transport Receiver exchange congestion feedback signals in the
   transport-layer feedback flow; a congested network device at the IP
   layer adds congestion signals to the data flow; the sender
   re-inserts these as (new) IP-layer ConEx signals, which an audit
   function near the receiver can compare against the actual
   congestion signals.]

        Figure 3: The ConEx Protocol in the Internet Architecture

The operator of the data centre infrastructure needs to trust this information, therefore it cannot just use the feedback in the end-to-end transport (e.g. TCP SACK or ECN echo congestion experienced flags) that might anyway be encrypted. Trusted congestion feedback may be implemented in either of the following two ways:

a. Either as a shim in both sending and receiving hypervisors using an edge-to-edge (host-host) tunnel controlled by the infrastructure operator, with feedback messages reporting congestion back to the sending host's hypervisor (in addition to the e2e feedback at the transport layer);

b. Or in the sending operating system using the congestion exposure protocol (ConEx [ConEx-Abstract-Mech]) with a ConEx audit function at the egress edge to check ConEx signals against actual congestion signals (Figure 3).

5.1.1.  Tunnel Feedback vs. ConEx

The feedback tunnel approach (a) is inefficient because it duplicates end-to-end feedback and it introduces at least a round trip's delay, whereas the ConEx approach (b) is more efficient and not delayed, because ConEx packets signal a conservative estimate of congestion in the upcoming round trip. Avoiding feedback delay is important for controlling congestion from aggregated short flows. However, ConEx signals will not necessarily be supported by the sending operating system.

Therefore, given ConEx IP packets are self-identifying, the best approach is to rely on ConEx signals when present and fill in with tunnelled feedback when not, on a packet-by-packet basis.

5.1.2.  ECN Recommended

Both approaches are much easier if explicit congestion notification (ECN [RFC3168]) is enabled on network switches and if all packets are ECN-capable. For non-ECN-capable packets, ECN support can be turned on in the outer of an edge-to-edge tunnel. The reasons that ECN helps in each case are given below (a sketch of the corresponding egress functions follows this list):

a. Tunnel Feedback: To feed back congestion signals, the tunnel egress needs to be able to detect forward congestion signals in the first place. If the only symptom of congestion is dropped packets, the egress has to watch for gaps in the sequence space of the transport protocol, which cannot be guaranteed to be possible--the IP payload may be encrypted, or an unknown protocol, or parts of the flow may be sent over diverse paths. The tunnel ingress could add its own sequence numbers (as done by some pseudowire protocols), but it is easier to simply turn on ECN at the ingress so that the egress can detect ECN markings.

b. ConEx: The audit function needs to be able to compare ConEx signals with actual congestion. So, as before, it needs to be able to detect congestion at the egress. Therefore the same arguments for ECN apply.
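
The following sketch indicates the two egress-edge functions implied above: counting congestion-marked bytes for tunnel feedback (a), and comparing ConEx markings against actual congestion for audit (b). The packet attributes, feedback message and all names are invented for illustration only.

   from collections import defaultdict

   ECN_CE = 0b11   # congestion experienced codepoint in the IP ECN field

   class EgressEdge:
       """Illustrative egress-edge functions for approaches (a) and (b).

       Packet objects are assumed to expose the outer ECN field, the
       ingress edge they came from, a flow identifier and their ConEx
       markings; the feedback mechanism is left abstract."""

       def __init__(self, send_feedback):
           self.send_feedback = send_feedback    # callable(ingress, bytes)
           self.ce_bytes = defaultdict(int)      # per-ingress CE counters
           self.conex_deficit = defaultdict(int) # per-flow audit state

       def on_packet(self, pkt):
           # (a) Tunnel feedback: count congestion-marked bytes so they
           # can be reported back to the ingress hypervisor periodically.
           if pkt.outer_ecn == ECN_CE:
               self.ce_bytes[pkt.ingress] += pkt.length

           # (b) ConEx audit: ConEx markings should cover at least the
           # actual congestion seen; a persistent deficit indicates an
           # understating sender and would trigger sanctions (not shown).
           if pkt.conex_capable:
               if pkt.outer_ecn == ECN_CE:
                   self.conex_deficit[pkt.flow] += pkt.length
               if pkt.conex_marked:
                   self.conex_deficit[pkt.flow] -= pkt.length

       def report(self):
           # Periodic feedback to each ingress (approach (a) only).
           for ingress, count in self.ce_bytes.items():
               self.send_feedback(ingress, count)
           self.ce_bytes.clear()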

5.1.3.  Summary: Trustworthy Congestion Signals at Ingress

The above cases can be arranged in a 2x2 matrix, to show when edge-to-edge tunnelling is needed and what function the tunnel would need to serve:

   +----------------+----------------+---------------------------------+
   | ConEx-capable? | ECN-capable: Y | ECN-capable: N                  |
   +----------------+----------------+---------------------------------+
   | Y              | No tunnel      | ECN-enabled tunnel              |
   |                | needed         |                                 |
   | N              | Tunnel         | ECN-enabled tunnel + Tunnel     |
   |                | Feedback       | feedback                        |
   +----------------+----------------+---------------------------------+

We can now summarise the steps necessary to ensure an ingress congestion policer obtains trustworthy congestion signals (an illustrative sketch of the ingress-edge steps follows at the end of this list):

1. Sending operating system:

   1. The sender SHOULD send ConEx-enabled and ECN-enabled packets whenever possible.

   2. If the sender uses IPv6 it can signal ConEx in a destination option header [conex-destopt].

   3. If the sender uses IPv4, it can signal ConEx markings by encoding them within the packet ID field as proposed in [ipv4-id-reuse].

2. Ingress edge:

   1. If an arriving packet is either not ConEx-capable or not ECN-capable it SHOULD be tunnelled to the appropriate egress edge in an outer IP header.

   2. A pre-existing edge-to-edge tunnel (e.g. [nvgre, vxlan]) can be used, irrespective of whether the packet is not ConEx-capable or not ECN-capable.

   3. Incoming ConEx signals MUST be copied to the outer. For an incoming IPv4 packet, this implies copying the ID field. For an incoming IPv6 packet, this implies copying the Destination Option header.

   4. In all cases, the tunnel ingress MUST use the normal mode of ECN tunnelling [RFC6040].

3. Directly after encapsulation (but not if the packet was not encapsulated):

   1. If and only if the ECN field of the outer header is not ECN-capable (Not-ECT, i.e. 00) it MUST be made ECN-capable by remarking it to ECT(0), i.e. 01.

   2. If the outer ECN field carries any other value than 00, it should be left unchanged.

4. Directly before the edge egress (irrespective of whether the packet is encapsulated):

   1. If the outer IP header is ConEx-capable, it MUST be passed through a ConEx audit function.

   2. If the packet is not ConEx-capable, it MUST be passed to a function that feeds back ECN marking statistics to the tunnel ingress. Such a function is also a requirement of [tunnel-cong-exp], which may be re-usable for this purpose {ToDo: to be confirmed}.

5. Egress Edge Decapsulator:

   1. Decapsulation must comply with [RFC6040]. This ensures that a congestion experienced marking (CE or 11) on the outer will lead to the packet being dropped if the inner indicates that the endpoints will not understand ECN (i.e. the inner ECN field is Not-ECT or 00). Effectively the egress edge drops such packets on behalf of the congested upstream buffer that marked it, because the packet appeared to be ECN-capable on the outside, but it is not ECN-capable on the inside. [RFC6040] was deliberately arranged like this so that it would drop such packets to give an equivalent congestion signal to the end-to-end transport.
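
As a rough illustration of ingress-edge steps 2 and 3 above, the following sketch shows the per-packet decisions in Python. The packet model, field names and the encapsulate() helper are assumptions of this sketch, not defined by this document.

   NOT_ECT, ECT0 = 0b00, 0b01   # IP ECN field codepoints

   def ingress_edge_process(pkt, encapsulate):
       """Illustrative per-packet logic for steps 2 and 3 of Section
       5.1.3.  `pkt` is assumed to expose .conex_capable, .ecn,
       .version and its ConEx encoding; `encapsulate` builds the outer
       header of the pre-existing edge-to-edge tunnel."""
       needs_tunnel = not pkt.conex_capable or pkt.ecn == NOT_ECT
       if not needs_tunnel:
           return pkt                    # forward natively; no tunnel needed

       outer = encapsulate(pkt)          # e.g. an NVGRE/VXLAN outer header

       # Step 2.3: copy incoming ConEx signals to the outer header.
       if pkt.conex_capable:
           if pkt.version == 4:
               outer.ipid = pkt.ipid                     # IPv4 ID encoding
           else:
               outer.conex_destopt = pkt.conex_destopt   # IPv6 dest. option

       # Step 2.4: normal-mode ECN tunnelling [RFC6040] copies the
       # inner ECN field into the outer at encapsulation.
       outer.ecn = pkt.ecn

       # Step 3.1: make the outer ECN-capable if it is not already, so
       # that congested switches can mark rather than drop.
       if outer.ecn == NOT_ECT:
           outer.ecn = ECT0
       # Step 3.2: any other outer ECN value is left unchanged.

       return outer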

5.2.  Switch/Router Support

Network switches/routers do not need any modification. However, both congestion detection by the tunnel (approach a) and ConEx audit (approach b) are significantly easier if switches support ECN.

Once switches support ECN, Data centre TCP [DCTCP] could optionally be used (DCTCP requires ECN). It also requires modified sender and receiver TCP algorithms as well as a more aggressive configuration of the active queue management (AQM) in the L3 switches or routers.

5.3.  Congestion Policing

Innovation in the design of congestion policers is expected and encouraged, but here we will describe one specific design to be concrete.

A bulk congestion policing function would most likely be implemented as a shim in the hypervisor. The hypervisor would create one instance of a bulk congestion policer per tenant on the physical machine, and it would ensure that all traffic sent by that tenant's VMs into the network would pass through the relevant congestion policer by associating every new virtual machine with the relevant policer.

A bulk congestion policing function has already been outlined in Section 3. To recap, it consists of a token bucket that is filled with congestion tokens at a constant rate. The bucket is drained by the size of every packet that carries a congestion marking. If the tunnel-feedback approach (a) were used, the bucket would be drained by congestion feedback from the tunnel egress, rather than markings on packets. If the ConEx approach (b) were used, the bucket would be drained by ConEx markings on the actual data packets being forwarded. A congestion policer will need to drain in response to either form of signal, because it is recommended that both approaches are used in combination.

Various more sophisticated congestion policer designs have been evaluated [CPolTrilogyExp]. In these experiments, it was found that it is better if the policer gradually increases discards as the bucket becomes empty. Also, isolation between tenants is better if each tenant is policed based on the combination of two buckets, not one (Figure 4; a sketch of this dual-bucket arrangement follows the figure):

1. A deep bucket (that would take minutes or even hours to fill at the contracted fill-rate) that constrains the tenant's long-term average rate of congestion (wi);

2. a very shallow bucket (e.g. only two or three MTU) that is filled considerably faster than the deep bucket (c * wi), where c = ~10, which prevents a tenant storing up a large backlog of tokens then causing congestion in one large burst.

In this arrangement each marked packet drains tokens from both buckets, and the probability of policer discard is taken as the worse of the two buckets.

   [Figure 4 (ASCII art, not reproduced here; legend as in the
   previous figure): a deep bucket fills at rate wi and a very shallow
   bucket fills at rate c*wi; marked packets drain both buckets, and
   the worse (emptier) of the two buckets triggers policing.]

    Figure 4: Dual Congestion Token Bucket (in place of each single
                      bucket in the previous figure)
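
A minimal sketch of this dual-bucket policer follows (Python; the class names, the 3-MTU shallow depth and the 90% ramp threshold are illustrative assumptions, with c = 10 as suggested above):

   import random
   import time

   class _Bucket:
       def __init__(self, fill_rate_Bps, depth_B):
           self.fill_rate, self.depth = fill_rate_Bps, depth_B
           self.tokens, self.last = depth_B, time.monotonic()

       def _refill(self):
           now = time.monotonic()
           self.tokens = min(self.depth,
                             self.tokens + self.fill_rate * (now - self.last))
           self.last = now

       def drain(self, n_bytes):
           self._refill()
           self.tokens -= n_bytes

       def emptiness(self):
           self._refill()
           return 1.0 - max(self.tokens, 0.0) / self.depth

   class DualCongestionPolicer:
       """The deep bucket limits the long-term congestion-bit-rate wi;
       the shallow bucket, filled at c*wi, limits bursts of congestion."""

       def __init__(self, wi_Bps, deep_depth_B, c=10, mtu_B=1500):
           self.deep = _Bucket(wi_Bps, deep_depth_B)
           self.shallow = _Bucket(c * wi_Bps, 3 * mtu_B)

       def admit(self, pkt_len_B, congestion_signalled):
           if congestion_signalled:
               # Each congestion-marked packet drains both buckets.
               self.deep.drain(pkt_len_B)
               self.shallow.drain(pkt_len_B)
           # Discard probability is taken from the worse (emptier)
           # bucket, ramping up gradually as it approaches empty.
           worst = max(self.deep.emptiness(), self.shallow.emptiness())
           discard_prob = max(0.0, (worst - 0.9) / 0.1)
           return random.random() >= discard_prob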

While the data centre network operator only needs to police congestion in bulk, tenants may wish to enforce their own limits on individual users or applications, as sub-limits of their overall allowance. Given that all the information used for policing is readily available within the transport layer of their own operating system, tenants can readily apply any such per-flow, per-user or per-application limitations. The tenant may operate their own fine-grained policing software, or such detailed control capabilities may be offered as part of the platform (platform as a service or PaaS).

5.4.  Distributed Token Buckets

A customer may run virtual machines on multiple physical nodes, in which case at the time each VM is instantiated the data centre operator will deploy a congestion policer in the hypervisor on each node where the customer is running a VM. The DC operator can arrange for these congestion policers to collectively enforce the per-customer congestion allowance, as a distributed policer.

A function to distribute a customer's tokens to the policer associated with each of the customer's VMs would be needed. This could be similar to the distributed rate limiting of [DRL], which uses a gossip-like protocol to fill the sub-buckets. Alternatively, a logically centralised bucket of congestion tokens could be used. It could be replicated for reliability, and then there could be simple 1-1 communication between the central bucket and each local token bucket. (A sketch of such a centralised allocation function is given at the end of this section.)

Importantly, congestion tokens can be freely reassigned between different VMs, because a congestion token is equivalent at any place or time in a network. In contrast, traditional bit-rate tokens cannot simply be reassigned from one VM to another without implications on the balance of network loading. This is because the parameters used for bit-rate policing depend on the topology and its capacity planning (open loop), whereas congestion policing complements the closed loop congestion avoidance system that adapts to the prevailing traffic and topology.

As well as distribution of tokens between the VMs of a tenant, it would similarly be feasible to allow transfer of tokens between tenants, also without breaking the performance isolation properties of the system. Secure token transfer mechanisms could be built above the underlying policing design described here, but that is beyond the current scope and therefore deferred to future work.
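
A sketch of one possible centralised allocation function follows. The demand-proportional policy and all names are illustrative assumptions; this document leaves the distribution function open (gossip, as in [DRL], is another option).

   class CentralTokenAllocator:
       """Illustrative central allocation of one tenant's contracted
       congestion-token fill-rate across per-host sub-policers."""

       def __init__(self, contracted_fill_Bps):
           self.total_fill = contracted_fill_Bps
           self.recent_use = {}     # host id -> congestion-bytes consumed

       def report_usage(self, host, congestion_bytes):
           # Each hypervisor periodically reports how many congestion
           # tokens its local sub-policer consumed in the last interval.
           self.recent_use[host] = congestion_bytes

       def allocations(self):
           # Divide the contracted fill rate in proportion to recent
           # consumption; +1 smoothing keeps a small share flowing to
           # hosts that were idle so their VMs can start sending.
           hosts = list(self.recent_use)
           if not hosts:
               return {}
           weights = {h: self.recent_use[h] + 1 for h in hosts}
           total = sum(weights.values())
           return {h: self.total_fill * weights[h] / total for h in hosts}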

6.  Incremental Deployment

6.1.  Migration

A mechanism to bring trustworthy congestion signals to the ingress (Section 5.1) is critical to this performance isolation solution. Section 5.1.1 compares the two solutions: b) ConEx, which is efficient and timely enough to police short flows; and a) tunnel-feedback, which is neither. However, ConEx requires deployment in host operating systems first, while tunnel feedback can be deployed unilaterally by the data centre operator in all hypervisors (or containers), without requiring support in guest operating systems.

Section 5.1.3 describes the steps necessary to support both approaches. This would provide an incremental deployment route with the best of both worlds: tunnel feedback could be deployed initially for unmodified guest OSs despite its weaknesses, and ConEx could gradually take over as it was deployed more widely in guest OSs. It is important not to deploy the tunnel feedback approach without checking for ConEx-capable packets, otherwise it will never be possible to migrate to ConEx. The advantages of being able to migrate to ConEx are:

o no duplicate feedback channel between hypervisors (sending and forwarding a large proportion of tiny packets), which would cause considerable packet processing overhead

o performance isolation includes the contribution to congestion from short (sub-round-trip-time) flows

6.2.  Evolution

Initially, the approach would be confined to intra-data centre traffic. With the addition of ECN support on network equipment (at least bottleneck access routers) in the WAN between data centres, it could straightforwardly be extended to inter-data centre scenarios, including across interconnected backbone networks.

Once this approach becomes deployed within data centres and possibly across interconnects between data centres and enterprise LANs, the necessary support will be implemented in a wide range of equipment used in these scenarios. Similar equipment is also used in other networks (e.g. broadband access and backhaul), so that it would start to be possible for these other networks to deploy a similar approach.

7.  Related Approaches

The Related Work section of [CongPol] provides a useful comparison of the approach proposed here against other attempts to solve similar problems.

When the hose model is used with Diffserv, capacity has to be considerably over-provisioned for all the unfortunate cases when multiple sources of traffic happen to coincide even though they are all in-contract at their respective ingress policers. Even so, every node within a Diffserv network also has to be configured to limit higher traffic classes to a maximum rate in case of really unusual traffic distributions that would starve lower priority classes. Therefore, for really important performance assurances, Diffserv is used in the 'pipe' model where the policer constrains traffic separately for each destination, and sufficient capacity is provided at each network node for the sum of all the peak contracted rates for paths crossing that node.

In contrast, the congestion policing approach is designed to give full performance assurances across a meshed network (the hose model), without having to divide a network up into pipes. If an unexpected distribution of traffic from all sources focuses on a congestion hotspot, it will increase the congestion-bit-rate seen by the policers of all sources contributing to the hotspot. The congestion policers then focus on these sources, which in turn limits the severity of the hotspot.

The critical improvement over Diffserv is that the ingress edges receive information about any congestion occurring in the middle, so they can limit how much congestion occurs, wherever it happens to occur. Previously Diffserv edge policers had to limit traffic generally in case it caused congestion, because they never knew whether it would (open loop control).

Congestion policing mechanisms could be used to assure the performance of one data flow (the 'pipe' model), but this would involve unnecessary complexity, given the approach works well for the 'hose' model.

Therefore, congestion policing allows capacity to be provisioned for the average case, not for the near-worst case when many unlikely cases coincide.
It assures performance for all traffic using just one traffic class, whereas Diffserv only assures performance for a small proportion of traffic by partitioning it off into higher priority classes and over-provisioning relative to the traffic contracts sold for this class.

{ToDo: Refer to [conex-policing] for comparison with WRR & WFQ}

Seawall {ToDo} [Seawall]

8.  Security Considerations

{ToDo}

9.  IANA Considerations (to be removed by RFC Editor)

This document does not require actions by IANA.

10.  Conclusions

{ToDo}

11.  Acknowledgments

Thanks to Yu-Shun Wang for comments on some of the practicalities.

Bob Briscoe is part-funded by the European Community under its Seventh Framework Programme through the Trilogy 2 project (ICT-317756). The views expressed here are solely those of the author.

12.  Informative References

   [CPolTrilogyExp]      Raiciu, C., Ed., "Progress on resource
                         control", Trilogy EU 7th Framework Project
                         ICT-216372 Deliverable 9, December 2009.

   [ConEx-Abstract-Mech] Mathis, M. and B. Briscoe, "Congestion
                         Exposure (ConEx) Concepts and Abstract
                         Mechanism", draft-ietf-conex-abstract-mech-08
                         (work in progress), October 2013.

   [CongPol]             Jacquet, A., Briscoe, B., and T. Moncaster,
                         "Policing Freedom to Use the Internet Resource
                         Pool", Proc ACM Workshop on Re-Architecting
                         the Internet (ReArch'08), December 2008.

   [DCTCP]               Alizadeh, M., Greenberg, A., Maltz, D.,
                         Padhye, J., Patel, P., Prabhakar, B.,
                         Sengupta, S., and M. Sridharan, "Data Center
                         TCP (DCTCP)", ACM SIGCOMM CCR 40(4)63--74,
                         October 2010.

   [DRL]                 Raghavan, B., Vishwanath, K., Ramabhadran, S.,
                         Yocum, K., and A. Snoeren, "Cloud control with
                         distributed rate limiting", ACM SIGCOMM
                         CCR 37(4)337--348, 2007.

   [RFC2475]             Blake, S., Black, D., Carlson, M., Davies, E.,
                         Wang, Z., and W. Weiss, "An Architecture for
                         Differentiated Services", RFC 2475,
                         December 1998.

   [RFC3168]             Ramakrishnan, K., Floyd, S., and D. Black,
                         "The Addition of Explicit Congestion
                         Notification (ECN) to IP", RFC 3168,
                         September 2001.

   [RFC6040]             Briscoe, B., "Tunnelling of Explicit
                         Congestion Notification", RFC 6040,
                         November 2010.

   [RFC6356]             Raiciu, C., Handley, M., and D. Wischik,
                         "Coupled Congestion Control for Multipath
                         Transport Protocols", RFC 6356, October 2011.

   [Seawall]             Shieh, A., Kandula, S., Greenberg, A., and C.
                         Kim, "Seawall: Performance Isolation in Cloud
                         Datacenter Networks", Proc 2nd USENIX Workshop
                         on Hot Topics in Cloud Computing, June 2010.

   [conex-destopt]       Krishnan, S., Kuehlewind, M., and C. Ucendo,
                         "IPv6 Destination Option for ConEx",
                         draft-ietf-conex-destopt-05 (work in
                         progress), October 2013.

   [conex-policing]      Briscoe, B., "Network Performance Isolation
                         using Congestion Policing",
                         draft-briscoe-conex-policing-01 (work in
                         progress), February 2014.

   [ipv4-id-reuse]       Briscoe, B., "Reusing the IPv4 Identification
                         Field in Atomic Packets",
                         draft-briscoe-intarea-ipv4-id-reuse-04 (work
                         in progress), February 2014.

   [nvgre]               Sridharan, M., Greenberg, A., Wang, Y., Garg,
                         P., Duda, K., Venkataramaiah, N., Ganga, I.,
                         Lin, G., Pearson, M., Thaler, P., and C.
                         Tumuluri, "NVGRE: Network Virtualization using
                         Generic Routing Encapsulation",
                         draft-sridharan-virtualization-nvgre-04 (work
                         in progress), February 2014.

   [tunnel-cong-exp]     Zhu, L., Zhang, H., and X. Gong, "Tunnel
                         Congestion Exposure",
                         draft-zhang-tsvwg-tunnel-congestion-exposure-00
                         (work in progress), October 2012.

   [vxlan]               Mahalingam, M., Dutt, D., Duda, K., Agarwal,
                         P., Kreeger, L., Sridhar, T., Bursell, M., and
                         C. Wright, "VXLAN: A Framework for Overlaying
                         Virtualized Layer 2 Networks over Layer 3
                         Networks",
                         draft-mahalingam-dutt-dcops-vxlan-08 (work in
                         progress), February 2014.

Appendix A.  Summary of Changes between Drafts (to be removed by RFC Editor)

Detailed changes are available from http://tools.ietf.org/html/draft-briscoe-conex-data-centre

From briscoe-01 to briscoe-02: Added clarification about intra-class applicability. Updated references.

From briscoe-conex-data-centre-00 to briscoe-conex-data-centre-01:

* Took out the text of Section 4 "Performance Isolation Intuition" and Section 6 "Parameter Setting" into a separate draft [conex-policing] and instead included only a summary in these sections, referring out for details.

* Considerably updated Section 5 "Design"

* Clarifications and updates throughout, including addition of diagrams

From briscoe-conex-initial-deploy-02 to briscoe-conex-data-centre-00:

* Split off data-centre scenario as a separate document, by popular request.

Authors' Addresses

   Bob Briscoe
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/

   Murari Sridharan
   Microsoft
   1 Microsoft Way
   Redmond, WA 98052

   Phone:
   Fax:
   EMail: muraris@microsoft.com
   URI: