Internet Engineering Task Force                                W. George
Internet-Draft                                         Time Warner Cable
Intended status: Informational                                 R. Shakir
Expires: August 18, 2014                                              BT
                                                       February 14, 2014

                     IP VPN Scaling Considerations
                        draft-gs-vpn-scaling-03

Abstract

This document discusses scaling considerations unique to implementation of Layer 3 (IP) Virtual Private Networks, discusses a few best practices, and identifies gaps in the current tools and techniques which are making it more difficult for operators to cost-effectively scale and manage their L3VPN deployments.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 18, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .   2
   1.1. Intention of this Document . . . . . . . . . . . . . . . .   3
   1.2. Horizontal vs. Vertical Scaling . . . . . . . . . . . . . .   5
   1.3. Developing Requirements for Scaled L3VPN Environments . . .   6
2. PE-CE routing protocols . . . . . . . . . . . . . . . . . . . .   6
   2.1. Best Common Practice  . . . . . . . . . . . . . . . . . . .   7
   2.2. Common Problems at Scale Limits . . . . . . . . . . . . . .   9
3. Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . .  10
   3.1. Best Common Practices . . . . . . . . . . . . . . . . . . .  10
   3.2. Common Problems at Scale Limits . . . . . . . . . . . . . .  11
4. Network Events  . . . . . . . . . . . . . . . . . . . . . . . .  11
   4.1. Best Common Practices . . . . . . . . . . . . . . . . . . .  11
   4.2. Common Problems at Scale Limits . . . . . . . . . . . . . .  12
5. General Route Scale . . . . . . . . . . . . . . . . . . . . . .  13
   5.1. Route-reflection and scaling  . . . . . . . . . . . . . . .  16
   5.2. Best Common Practices . . . . . . . . . . . . . . . . . . .  18
      5.2.1. Topology-related optimizations . . . . . . . . . . . .  19
   5.3. Common problems at scale limits . . . . . . . . . . . . . .  20
6. Known issues and gaps . . . . . . . . . . . . . . . . . . . . .  21
   6.1. PE-CE routing protocols . . . . . . . . . . . . . . . . . .  21
   6.2. Multicast . . . . . . . . . . . . . . . . . . . . . . . . .  22
   6.3. Network Events  . . . . . . . . . . . . . . . . . . . . . .  22
   6.4. General Route Scale . . . . . . . . . . . . . . . . . . . .  22
   6.5. Modeling and Capacity planning  . . . . . . . . . . . . . .  22
   6.6. Performance issues  . . . . . . . . . . . . . . . . . . . .  23
   6.7. High Availability and Network Resiliency  . . . . . . . . .  24
   6.8. New methods of horizontal scaling . . . . . . . . . . . . .  25
7. To-Do list  . . . . . . . . . . . . . . . . . . . . . . . . . .  25
8. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . .  26
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . .  26
10. Security Considerations  . . . . . . . . . . . . . . . . . . .  26
11. Informative References . . . . . . . . . . . . . . . . . . . .  26
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . . .  28

1. Introduction

As IP networking has become more ubiquitous and mature, many enterprises have begun migrating away from legacy point-to-point or layer 2 virtual private network (VPN) implementations toward layer 3 VPNs.  The VPN implementation defined by RFC 4364 [RFC4364] enables flexible and robust implementations of IP VPNs.  However, in practice, it has become clear that it is subject to significant scaling challenges beyond the considerations discussed in RFC 4364.  In many cases, the limits of scale for a given platform are not in sync with the maximum physical and logical interface density supported by the platform, such that a platform may be considered "full" long before the physical slots and ports have all been filled with equipment and connections.  This represents an inefficient use of space and power, as well as stranded capital assets, all of which increase the operator's cost to provide the service as well as the complexity of managing the platform to ensure proper service levels in a wide variety of circumstances.  While these scaling considerations are somewhat similar to the scaling concerns experienced in the Global Internet, those are at best a subset of the overall problem, and the applicable solutions and best practices may not overlap a great deal.
The added complexity and feature set required to support today's enterprise IP networks drives additional scaling considerations for large deployments.  A common response to concerns about control plane scale is simply to "throw hardware at the problem" in the form of ever-increasing amounts of memory and CPU resources.  In some cases, this may be the only solution, but similarly to the concerns identified in RFC 4984 [RFC4984], there are limits to the growth curve that can be supported and cost-effectively deployed by a VPN provider such that their service remains profitable, and therefore it is necessary to explore the potential for optimization to make the existing resources stretch further.

Generally, router scale can be considered in one of three areas: forwarding capacity, interface density, and control plane capacity.  This draft will focus almost exclusively on control plane capacity, because while the others are important considerations for most operators, they are less affected by the details of how L3VPN is implemented either by the router vendor or the operator.  Interface density is usually a factor of the forwarding capacity of a given module or slot as well as physical packaging.  In this application, interface density is interesting from the perspective of its impact on the control plane - more interfaces mean more of all of the different factors that contribute to control plane load, and the operator wants to be able to strike a balance between interface density and control plane capacity such that neither grows out of pace with the other.

1.1. Intention of this Document

This document is intended to provide a discussion of the challenges that network operators face in deploying large-scale L3VPN environments at the time of writing, together with two key sets of recommendations: those that apply to network operators regarding the deployment of particular technologies, and those that apply to network protocol and operating system implementors, relating to providing a better understanding of the scaling characteristics of deployed equipment.

The best practices defined in this document are intended to allow more optimal scaling of L3VPN networks, whilst minimising the impact on end-customer network behaviour.  It is intended that such guidance can be directly utilised by Service Providers to improve the scalability of network elements.  However, the guidance in this document should not be viewed as a panacea to the problems of scaling network elements.  It is the intention of the authors to document a number of key problems experienced in such environments and to provide information to the SP that may result in more optimal deployment of existing technologies.  Additionally, there is a point at which the limits of hardware will be reached, and hence new network elements are required.  The recommendations provided to Service Providers within this document are intended to allow the resources that exist within existing elements to be utilised in the most efficient manner.  Clearly, the optimal point in this balance is that the data-plane and control-plane scale to support similar levels of service termination, so as to result in minimal "over provisioning" of one element.
The scaling considerations presented in this document are intended to provide both network operators and network equipment implementors with further guidance around the toolset and information required for accurate capacity planning in L3VPN environments.  Again, the authors consider that the scaling characteristics and toolsets required of L3VPN PE equipment diverge somewhat from those required by Internet network equipment.  In Internet deployments, relatively standardised interconnects exist across all deployments, typically utilising either static routing or BGP-4.  As such, each connected port comes with a relatively standard overhead in terms of the protocols required.  Whilst there is some variance in how "chatty" each customer connection may be, this is balanced by the fact that the whole Internet routing table is typically held on such edge equipment (and hence an individual customer's instability tends to be relatively small when compared to the instability of the Internet DFZ).  In addition, since such instability is limited to relatively few impacts to a node (interface or BGP session flapping, and BGP UPDATE messages), routers can be optimised to cope with such instability.  Counter to this, the L3VPN environment does not have a standardised connectivity model, and typically connects to much less controlled environments.  Further details of this are provided within later sections of this document.  The result of this difference is that 'headline' scaling figures presented for particular equipment tend to be of limited utility to a network operator.  The recommendations within this document outline some of the considerations that must be made when assessing the scaling of such elements, and provide guidance as to the missing inputs and tools that are required to provide information around the capacity of such elements.

1.2. Horizontal vs. Vertical Scaling

Within this document, two forms of 'scaling' are referred to.  The "throw hardware at the problem" approach outlined previously involves deploying additional network elements in order to provide further network capacity; throughout this document, this approach is referred to as horizontal scaling, insofar as it requires parallel deployment of numerous similar elements and balancing the load across the combined capacity of all of the elements.  The approach of increasing the capacity of an individual node, by allowing the control plane capacity to support the maximum forwarding plane capacity (be it data forwarded or available ports), is referred to as vertical scaling.  It is obvious that at some point the approach of horizontal scaling of elements is required, due to either exhausting available port capacities or available forwarding plane; however, there are a number of motivations for delaying such provisioning, some of which relate directly to the characteristics of L3VPN environments.

Since a significant proportion of the customers who purchase L3VPN services are Enterprise customers, typically the service is utilised as a WAN for their inter-location connectivity.  Since such a customer base tends to be distributed according to differing factors, such customers connect in numerous geographical locations.
The requirement to support service in these locations therefore requires the service provider network architecture to support geographically distributed access into such services.  A balance must be struck between the extent to which access networks are utilised to backhaul traffic to the service layer, and the geographical distribution of the service layer itself.  Both the scale and performance characteristics of such networks tend to result in more geographical distribution of service layer elements than in Internet deployments.  This distribution results in two particular changes.  Primarily, the idea of a "point-of-presence" must be reconsidered: where an assumption in Internet environments may be that there are separate core and access elements within a single location, within a distributed L3VPN environment a point of presence may be a single PE device.  The result of these small-scale points of presence is that numerous core and edge functions must be collapsed onto a single device.  For this reason, the approach of adding additional devices to the network may have an impact on a further subset of devices within the network (particularly due to any mesh-based protocols that are deployed), and hence result in a change in the scaling characteristics of these devices.  In this case, there is further motivation to avoid large numbers of devices in the network where possible.  Further to this, the smaller PoP profile may result in physical constraints around the deployment of additional network elements, particularly due to the availability of power and physical space to deploy such elements.

1.3. Developing Requirements for Scaled L3VPN Environments

Whilst the collected scaling considerations outlined in this document are based on the authors' collective experience within various Service Provider networks, and discussions with operators of similar networks, it should be noted that the problems outlined in this document are not static.  With the growth in the use of IP as the underlying transport of many services, the demand for L3VPN environments has grown.  As such, this has meant that various technologies are being considered to allow growth of these networks at a lower cost point and to a wider footprint than was previously required.  A network operator must therefore consider the extent to which the service layer must be built, both to meet economic and technical requirements.  With newer aggregation methods, the service layer edge (and hence the L3VPN PE) acquires responsibility for inter-working between newer dynamic aggregation technologies and the existing IP network.  As such, these edge functionalities result in further load being placed onto these network elements.

*** Author's note: Do we want to put anything about NNI for footprint extension here? Datacenter edge - perhaps Ning's problem around the L3VPN edge in his datacentres? ***

2. PE-CE routing protocols

One of the things that makes IP VPNs so flexible and robust is their ability to participate in the encapsulated network's routing protocols, where the customer edge (CE) router has a direct neighbor relationship with its upstream provider edge (PE) router in order to exchange routing information about the Virtual Routing and Forwarding (VRF) instance that represents the VPN.
In many cases, this is managed through a combination of static routes and BGP neighbors, but IGPs such as OSPF [RFC4577] are often supported because they enable a more complete integration into an existing enterprise network design and topology.  In some single-vendor implementations, carriers support proprietary routing protocols such as EIGRP [EIGRP].  IGPs may also be chosen due to a belief that they will respond more rapidly during a failure than BGP will.  In reality, this may not be true, because VRF routing information is still carried in MP-BGP from PE to PE, and the PE-CE routing protocol's characteristics are only locally significant.  In fact, the increased overhead may lead to slower convergence times than a more standard BGP implementation.

IGPs often translate to a significant increase in overhead due to their inherent characteristics as link-state routing protocols requiring full topology databases and flooding of updates to all participants, and the fact that they invoke additional processes on the router when compared to simply using BGP (which is already going to be running on a router using MP-BGP for VPNs).  While a router may be able to scale almost effortlessly with a few thousand routes in a single IGP plus hundreds of thousands of routes and many neighbors in BGP, it may be quickly challenged if it is also required to run multiple instances of an IGP, each with a certain number of routes that must be moved into MP-BGP to be passed to the rest of the VPN infrastructure.  The advent of support for IPv6 within a VPN (6VPE) [RFC4659] has the potential to make this problem worse, especially in the case of OSPF, where it now requires both OSPFv2 [RFC4577] and OSPFv3 [RFC6565] to run as separate instances for the two address families, or the use of multiple instances of OSPFv3 to support multiple address families as documented in RFC 5838 [RFC5838].

Another consideration in PE-CE routing protocols is the timers used for each session.  These will be discussed in greater detail in the best practices section.

2.1. Best Common Practice

Ultimately, the decision as to which PE-CE routing protocols to support is a business decision much more often than it is a technical one, because there are few use cases where something other than BGP and static routing as PE-CE routing protocols is a technical requirement.  If a provider chooses to support additional protocols, especially IGPs, they should consider the effects that these have on the overall scaling profile of the PE routers and the network as a whole when determining if and to what extent they will support other protocols.

Often, those designing VPN solutions attempt to use extremely aggressive routing protocol timer and keepalive values as a means of rapid failure detection and reconvergence.  This tends to make PE-CE routing protocols more fragile and increase the load on the PE router with questionable benefit.  This is especially common in scenarios where the network designer is attempting to replicate native IGP-like failure detection and reroute capabilities using BGP.  In order to avoid this, the preferred values should be set to something that is appropriate for large-scale implementations.  Further, because timer and keepalive values are often negotiated based on the more aggressive neighbor, it is a good idea to set a minimum acceptable value, so that instead of being forced to support negotiated timer values that are too aggressive for the scale that a given PE router is expected to support, the neighbor session will simply stay down until the remote end timers are reconfigured to a more acceptable value.  This acts as a safety valve against abuse that can destabilize a router used by multiple customers.  Because aggressive timers may be unavoidable in certain situations, it may be advisable to track the number of sessions which are provisioned with aggressive timers versus how many are using more conservative timers on a per-router basis, so that effort can be made to balance aggressive and conservative timers on each router.  This will help to prevent "hot-spots" where, given a similar port and VRF density, some routers have significantly higher CPU usage in steady-state than others.
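As an illustration only, the following sketch (Python) shows the two behaviours described above: refusing to bring up a session whose proposed hold time is below a provider-configured floor, and tallying sessions by timer profile so that aggressive sessions can be balanced across routers.  The numeric thresholds are hypothetical, not recommended values.

   # Sketch of a provider-side PE-CE timer policy.  Thresholds are
   # illustrative assumptions, not recommendations.

   from collections import Counter

   MIN_ACCEPTABLE_HOLD_TIME = 90  # seconds; floor enforced by the PE
   AGGRESSIVE_THRESHOLD = 30      # sessions at/below this are "aggressive"

   def negotiate_hold_time(local_proposed, peer_proposed):
       """Return the negotiated hold time, or None if the session should
       stay down because the peer's proposal is below the provider floor."""
       negotiated = min(local_proposed, peer_proposed)
       if negotiated < MIN_ACCEPTABLE_HOLD_TIME:
           # Rather than running with timers that are too aggressive for
           # the PE's scale, leave the session down until the CE side is
           # reconfigured to an acceptable value.
           return None
       return negotiated

   def timer_profile_tally(hold_times_by_neighbor):
       """Classify per-neighbor hold times so that provisioning can
       balance aggressive and conservative sessions across PE routers."""
       tally = Counter()
       for hold_time in hold_times_by_neighbor.values():
           if hold_time <= AGGRESSIVE_THRESHOLD:
               tally["aggressive"] += 1
           else:
               tally["conservative"] += 1
       return tally

   print(negotiate_hold_time(180, 9))     # None: session stays down
   print(negotiate_hold_time(180, 120))   # 120
   print(timer_profile_tally({"ce1": 30, "ce2": 180, "ce3": 90}))
   # Counter({'conservative': 2, 'aggressive': 1})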
It is important to realize that while use of aggressive routing protocol timers is not a scalable way to do fast failure detection, fast failure detection is still a requirement for many customers.  Because this is becoming such a table-stakes requirement, the provider must consider other alternatives such as Bidirectional Forwarding Detection [RFC5880], Ethernet OAM 802.1ag [IEEE802.1], ITU-T Y.1731 [Y.1731], LACP 802.3ad [IEEE802.3], and the like.  These extensions often come with their own scaling considerations, but more and more they are implemented in a distributed fashion so that instead of affecting the main router CPU like a routing protocol might, they offload that processing to the linecard CPU, and therefore can support more aggressive scale.  The general philosophy is that these lower-layer detection mechanisms should serve as the primary detection and failure point, with the upper layer routing protocols only serving as a backstop if the failure is not detected by the lower level protocols for some period of time.

Another important consideration is that there is not likely to be a "one-size-fits-all" solution when it comes to setting timers and policies around PE-CE routing protocols.  At a minimum, a distinction should be made between sites that have only a single upstream connection and those that have two or more diverse connections to the network.  Further distinction can be made based on the importance of the site, whether it is a hub site or an end site.  These can all be used to determine the aggressiveness that is appropriate for the timers and perhaps even which routing protocols are appropriate.  For example, an end site with a single upstream connection likely does not need very aggressive timers and may be able to get by using only static routing, while a hub site with multiple connections and a need for rapid restoration and reaction to any routing changes may need BGP along with aggressive lower-layer timers for fault detection.

2.2. Common Problems at Scale Limits

Two problems are commonly seen on a heavily-loaded system.

The first is that CPU cycle constraints, even before the system reaches the point of scheduler thrashing, often lead to one or more routing protocol neighbor hello drops.  If several consecutive drops occur, the remote neighbor may declare the session dead, which triggers a restart of the connection and a resync of the routing data.  Because this connection initialization requires dedicated CPU cycles to generate, receive, acknowledge, and process the updates, it increases the CPU utilization further, which may trigger additional hello failures and neighbor resets, resulting in a snowball effect where a relatively minor event rapidly becomes a major one due to interactions between multiple scaling limitations.  This problem is made worse by extremely aggressive timer values, because they raise the baseline CPU load with more frequent hellos and responses, and are more sensitive to drops caused by increased CPU load.  Further, because failures brought on by loss of hello packets are unlikely to invoke any graceful restart [RFC4781] machinery that the system may support, it is unlikely that the session reset will be able to take advantage of optimizations like only syncing the changes that occurred while the session was dead, thus increasing the outage time and the CPU cycles to get things back into sync.
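The sensitivity to aggressive timers can be illustrated with simple arithmetic.  Assuming the common convention of a hold time equal to three times the hello or keepalive interval (the values below are examples only), the sketch (Python) shows how few consecutive hellos may be lost before the peer declares the session dead:

   # Illustration of how aggressive timers shrink the margin for missed
   # hellos during a CPU spike.  Timer values are examples only.

   def missed_hellos_tolerated(hello_interval, hold_time):
       """Consecutive hellos that can be lost before hold timer expiry."""
       return int(hold_time // hello_interval) - 1

   for hello, hold in [(60.0, 180.0), (10.0, 30.0), (1.0, 3.0)]:
       print(f"hello={hello}s hold={hold}s: tolerates "
             f"{missed_hellos_tolerated(hello, hold)} missed hellos, "
             f"fails after {hold}s of silence")

With conservative timers the router can skip a couple of hellos during a momentary CPU spike and recover; with one-second hellos the same spike takes the session down and triggers the resync behaviour described above.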
Another potential issue during times of high-CPU operation is related to process prioritization.  This is applicable in different ways for both multithreaded and interrupt-driven OS architectures.  In each case, the scheduling algorithm that the router uses to prioritize different CPU cycle work items and manage the timeslices individual tasks are given to complete may require significant tuning and prioritization in order to ensure the desired behavior during high CPU usage.  Improperly tuned or prioritized processes may significantly delay completion of routing table/update processing such that it may take an excessive amount of time for the routing table to converge properly.  This issue is further exacerbated if the VRF instance has a large number of routes, or is prone to frequent event-driven route churn.  In some cases, the routing table in a given VRF may never fully converge, leading to routing loops, traffic loss, inconsistent latency, and a generally adverse customer experience.

These items can also have a cascade effect on other routers in the system if they also participate in a given VRF that is being affected by this type of scaling issue.  Not only is the local PE router affected, but any upstream route-reflectors, as well as other PEs, and even CEs participating in this VRF will see increased CPU cycles in order to receive and process the increased flow of updates driven by the local churn.

***specific items related to different PE-CE protocols?***

3. Multicast

Multicast support within a VPN [RFC6513] has become an increasingly popular feature, but comes with its own scaling considerations.  Depending on the application, the frequency at which multicast state changes within a given VPN (e.g. PIM joins and prunes) will contribute to the CPU load on the router, and any instability in the network can potentially increase these changes as remote sites flap.  In extreme cases, PIM neighborships can be lost during events, disrupting the flow of multicast traffic.
It should be noted that, in some cases, dynamic action is required by a PE device to support the transition of flooding of multicast data from a non-optimal distribution tree (the default MDT in [RFC6037], or the I-PMSI) onto a more optimal one (a data MDT or S-PMSI).  Where such a transition is required, the nature of the traffic sourced by an end user of the L3VPN service must be considered.  The net result of this consideration is that it becomes increasingly difficult to reliably gauge the scaling impact of specific end-site deployments.  Additional scaling considerations around multicast in a VPN are related to the size and number of multicast streams.  While this is a consideration whenever multicast is used, even outside of a VPN, because of the bandwidth utilization it may generate in the core, the additional overhead of implementing multicast within a VPN makes this a more significant consideration in this case.  Related to the previous consideration is the stream fanout - the number of P and PE router paths in the network that could potentially carry a given multicast stream based on the number of PEs that are configured with a given multicast-enabled VRF, and the number that actually do carry the stream based on actual receivers joining the stream behind that PE.

*** This section is quite weak. We're looking for contributors who can assist in fleshing this out ***

3.1. Best Common Practices

Multicast BCPs???

3.2. Common Problems at Scale Limits

Multicast tree interruptions

PIM neighbor adjacency drops

4. Network Events

Network events are an important scaling consideration because they can have wide-ranging impacts far beyond the individual VRF or even PE router that experiences the event.  At high scale, a seemingly innocuous event on one router or VRF can trigger secondary impacts and outages on remote routers elsewhere in the network.  Correlating these events for root cause analysis can be challenging by itself, and trying to characterize the impacts as they relate to scale in a way that informs the provider's decisions is even more difficult.  Different types of network events that can contribute include interface flaps, hardware and software outages (both planned and unplanned), externally driven route-churn events (such as those that originate on an NNI partner's network), and configuration changes.

4.1. Best Common Practices

While this document suggests that lower layer failure detection protocols like BFD and Ethernet OAM be more aggressive so that routing protocol timers can be more conservative, it is still important to remember that this can generate false positives or excessive churn that will cascade into a scaling problem in other parts of the system, so the timers should not automatically be configured to their minimum supported values.  Rather, each application may be slightly different, and the timers should only be set as aggressively as necessary to ensure acceptable performance of the applications in question.  It may be appropriate to set limits (e.g. in provisioning logic/rules) as to the number of interfaces per router and per VRF that can use aggressive, moderate, and conservative interface timers.
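One way to apply such limits is in the provisioning logic itself.  The sketch below (Python) rejects a new attachment circuit whose timer profile would exceed the allowed mix on the target PE or within the VRF; the caps shown are hypothetical and would need to be derived from the platform's tested limits.

   # Sketch of a provisioning-time check on interface timer profiles.
   # Caps are illustrative only.

   PER_ROUTER_CAPS = {"aggressive": 50, "moderate": 200}  # none on conservative
   PER_VRF_CAPS = {"aggressive": 4, "moderate": 20}

   def can_provision(profile, router_counts, vrf_counts):
       """True if one more interface with this timer profile fits within
       both the per-router and the per-VRF caps."""
       router_cap = PER_ROUTER_CAPS.get(profile)
       vrf_cap = PER_VRF_CAPS.get(profile)
       if router_cap is not None and router_counts.get(profile, 0) >= router_cap:
           return False
       if vrf_cap is not None and vrf_counts.get(profile, 0) >= vrf_cap:
           return False
       return True

   # The fifth aggressive interface in a VRF is refused (or flagged for
   # manual review), while a conservative one is always accepted.
   print(can_provision("aggressive", {"aggressive": 10}, {"aggressive": 4}))  # False
   print(can_provision("conservative", {"aggressive": 10}, {}))               # True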
Even with timers set as conservatively as the application will allow, churn is unavoidable.  For this reason, it is also a good idea to use interface-level dampening such as hold-down timers or event dampening in order to ensure that interfaces that flap too rapidly will not telegraph that churn into the upper-layer routing protocols any more than necessary.  BGP Peer Oscillation Dampening (DampPeerOscillations, [RFC4271]) may also help to reduce intermittent outage-based churn while leaving the interface itself unaffected.  All of these dampening measures help to ensure that problems are localized to a single PE or even a single interface, rather than causing instability and routing churn throughout the VRF and the provider network.

In addition to interface dampening, it may be advisable to consider implementing some manner of route flap dampening (RFD) [I-D.ietf-idr-rfd-usable] to assist in reducing the impact that route churn may have on the SP's network infrastructure.  This is currently fairly uncommon within VPN environments, and is not without controversy.  While it may help with scaling, it also requires each PE to maintain more state to store and compute the per-prefix penalty values, which may reduce the benefits gained by implementing RFD.  Further, customers typically expect a fair amount of transparency in the provider's participation in their routing instances.  Many providers and customers view a VPN or VRF as a part of the customer's internal network, and therefore compartmentalized, so that the customer can only affect their own routing if they have a problem with excessive route flaps.  In addition, if routes are dampened, it requires intervention from the SP to clear the dampening, which can potentially add to the outage time that a customer experiences once the issue that triggered the dampening is resolved.  Implementing RFD may even drive the need for a customer-accessible looking glass, which is far more complex in the VPN space owing to the requirement to prevent one customer from looking at another's VRF routes on a common platform.
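For readers unfamiliar with the mechanism, the per-prefix state referred to above follows the general exponential-decay model used by common RFD implementations, sketched below in Python.  The penalty, suppress, reuse, and half-life figures are placeholders; as discussed in Section 6.4, appropriate values for VPN routes remain an open question.

   # Minimal sketch of route flap dampening state for a single prefix.
   # All figures of merit are placeholders, not recommendations.

   import math

   PENALTY_PER_FLAP = 1000.0
   SUPPRESS_THRESHOLD = 3000.0
   REUSE_THRESHOLD = 750.0
   HALF_LIFE = 900.0            # seconds

   class DampenedPrefix:
       def __init__(self):
           self.penalty = 0.0
           self.last_update = 0.0
           self.suppressed = False

       def _decay(self, now):
           elapsed = now - self.last_update
           self.penalty *= math.pow(0.5, elapsed / HALF_LIFE)
           self.last_update = now

       def flap(self, now):
           """Record one withdraw/re-announce cycle for this prefix."""
           self._decay(now)
           self.penalty += PENALTY_PER_FLAP
           if self.penalty >= SUPPRESS_THRESHOLD:
               self.suppressed = True

       def usable(self, now):
           """True if the prefix may be used/advertised at time 'now'."""
           self._decay(now)
           if self.suppressed and self.penalty < REUSE_THRESHOLD:
               self.suppressed = False
           return not self.suppressed

   p = DampenedPrefix()
   for t in (0, 60, 120, 180):           # four flaps in three minutes
       p.flap(t)
   print(p.usable(180))                  # False: suppressed
   print(p.usable(180 + 4 * HALF_LIFE))  # True after ~4 quiet half-lives

The point of the sketch is simply that this penalty state must be stored and decayed for potentially every prefix in every VRF, which is the additional memory and CPU cost weighed above.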
4.2. Common Problems at Scale Limits

Network events are both a cause and a symptom of a system running at or near its scaling limits.  As noted above, event-driven routing table churn or routing protocol interactions can significantly drive up CPU usage on the locally connected PE as well as on other PEs and CEs participating in the VRF.  If routes are constantly changing due to a preferred path repeatedly being added and removed, latency and jitter numbers can be affected in a way that adversely affects applications sensitive to this sort of change.  Network events can also be triggered by routers with high CPU, because similarly to systems which may have aggressive routing protocol timers for enhanced failure detection, systems with centralized CPU-based implementations for lower-layer protocols (such as HDLC [ISO13239], PPP [RFC1661], LACP, and BFD/EOAM) may start losing keepalives and declaring outages that result in physical interfaces being torn down and restored.  Again, implementations that choose timer and multiplier values or numbers of sessions at or near the maximum rated scaling for the device put the operator in a position where there is very little headroom to deal with an event that momentarily spikes CPU usage, meaning that the likelihood of a cascade failure dramatically increases.

As above, these network events may be something that occurs elsewhere in the network, and may trigger a failure on a completely different PE or CE router.  The danger with this is that it is extremely difficult to troubleshoot and correlate root causes when the outage observed isn't caused by an event on the same router.  Failures become increasingly non-deterministic and difficult for operators to manage and address.

5. General Route Scale

PE routers in a carrier network can have many different implementation scenarios.  Some carriers implement a dedicated PE router that is only responsible for carrying VPN routes and therefore may only carry IGP routes in its global routing table, rather than a full Internet routing table.  Others use combined edge routers that carry full routes plus a complement of customer VPN routes, and some even place the full Internet routing table into one or more VRF instances.  The issue here is that the weight of all of these routes and paths must be combined when considering the maximum scale of the router, both in terms of memory footprint and in terms of convergence times.  The addition of an 8-byte RD prepended to the IP address to ensure uniqueness means that each VPN prefix takes up incrementally more physical space in memory than an equivalent non-VPN route.  Further, the greater the number of address families running simultaneously on the same router, the more sensitive it will be to event-induced churn, since each address family (and VRF) often has its own independent computation/SPF run.  The addition of IPv6 support within both the global routing table and within a VPN adds yet another source of routing table bloat.  A PE router can be running a combination of any of the following address families:

o  Global IPv4 unicast

o  Global IPv4 multicast

o  VPN IPv4 unicast

o  VPN IPv4 multicast

o  Global IPv6 unicast

o  Global IPv6 multicast

o  VPN IPv6 unicast

o  VPN IPv6 multicast

Even PE routers that do not carry the full Internet routing table are still required to carry a minimal number of IGP routes, LDP information, and some amount of TE tunnel state, adding to the items competing for scale.  On high-scale PE routers, the VPN routing tables are often as large as or larger than the equivalent global routing table in both number of routes and number of paths.  This is at least partially due to the fact that there are no constraints on the customer addressing plan within a VPN other than that the addresses cannot conflict within a given VRF, or with any extranet with which the VRF interconnects.  As such, customers may not necessarily adhere to any best practices to control the deaggregation of the routing table, such as hierarchical addressing, aggregation and summarization of announcements, and minimum prefix lengths.  It is also quite likely that connected interfaces will be redistributed, and little or no route filtering may take place.  Most PE routers use the absence of a given VRF instance (or RD/RT filtering) to limit the number of routes that they must actually carry, but this is sometimes of limited utility for a couple of reasons.  First, it leads to an inconsistent routing table footprint from one PE router to the next, and that footprint can change with every new customer turned up on the router.  These variations lead to non-deterministic performance and scale from PE to PE and from customer to customer.  In other words, PE1 may be fine from a scale perspective, while PE2, which has the same number of occupied ports, has significant scaling problems on account of which VRFs are present or absent.  Then, PE1 may find itself suddenly having the same scaling concerns because a new customer was provisioned with a large or high-churn VRF that was previously not present on the router.  Second, many customer VPNs are so large and have such stringent diversity requirements that they have a presence on nearly every PE router in a provider's network, meaning that one cannot rely heavily on statistical distribution to reduce the percentage of VRFs that must be installed on a specific PE router.  In addition, customers may request the use of BGP multipath for faster failover or better load balancing, which has the net effect of installing more active routes into the table, rather than simply selecting the single best path.  The scaling considerations for enabling BGP multipath are not unique to L3VPN, but they are more pertinent here because SPs are less likely to be willing to enable multipath for standard Internet traffic, while they will do so for L3VPN.  The application as an enterprise network instead of Internet connectivity drives a different set of expectations about the performance of the network, design tradeoffs that must be made to meet the SP's requirements, etc.  In many cases, L3VPNs are replacing old point-to-point networks or L2VPNs using legacy Frame Relay, ATM, or L2TPv3.  Customers often don't want to make major architectural changes to their routing, and therefore expect the SP to do the same things that they were doing between their routers before, including things like multipath.

In addition to such intended behaviour, within many L3VPN networks a balance must be struck between complexity in OSS such as provisioning and inventory systems, and complexity in network deployments.  One such example of this is the assignment of route distinguisher (RD) attributes.  Where it may be possible to assign a single RD per L3VPN instance, and hence achieve some level of route aggregation for multi-homed CE routes on BGP speakers within the solution, this has consequences for convergence in the VPN (due to BGP convergence being relied upon) and can exacerbate the effects of geographic distance between PE and route-reflector, and is therefore undesirable in some circumstances.  In order to avoid this, multiple RDs are required, which in turn requires OSS and inventory support to control the namespace.  As such, often each VRF instance is deployed with a specific RD, which, whilst achieving the desired convergence effect, places load on all BGP control-plane elements of the provider network.
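The trade-off can be made concrete by counting the state that a route-reflector holds for a multi-homed customer prefix under the two RD assignment schemes.  The sketch below (Python; the RD values, prefix, and PE names are invented for the example) shows that with a single shared RD the two PE-originated routes share one VPN-IPv4 NLRI, so best-path selection leaves only one path to be reflected onward, whereas unique per-VRF (per-PE) RDs keep the NLRIs distinct and preserve both paths at the cost of additional state on every BGP speaker that carries the VPN.

   # Illustration of RD assignment versus state held at a route-reflector.
   # Names and numbers are invented for the example.

   def paths_per_nlri(advertisements):
       """Group advertisements by VPN-IPv4 NLRI (RD + prefix).  Routes
       that share an NLRI are subject to best-path selection, so only one
       of them is reflected onward to other PEs."""
       held = {}
       for rd, prefix, pe in advertisements:
           held.setdefault(rd + ":" + prefix, []).append(pe)
       return {nlri: len(pes) for nlri, pes in held.items()}

   # A CE dual-homed to PE1 and PE2 advertises 192.0.2.0/24 into its VRF.
   shared_rd = [("65000:1", "192.0.2.0/24", "PE1"),
                ("65000:1", "192.0.2.0/24", "PE2")]
   per_pe_rd = [("65000:101", "192.0.2.0/24", "PE1"),
                ("65000:102", "192.0.2.0/24", "PE2")]

   print(paths_per_nlri(shared_rd))
   # {'65000:1:192.0.2.0/24': 2}  -> one NLRI; one best path reflected
   print(paths_per_nlri(per_pe_rd))
   # {'65000:101:192.0.2.0/24': 1, '65000:102:192.0.2.0/24': 1}
   #   -> both paths reflected; more state, but remote PEs already hold
   #      the backup path when one PE fails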
Total supportable route scale on a given PE router will be driven by multiple different variables, which have a roughly inverse relationship to one another: the number of VRFs per router, the number of routes per VRF, and the number of neighbors per VRF.  For example, a router can support only a low number of VRFs if each VRF has a large number of routes and/or a large number of neighbors.  Conversely, a router can support a relatively high number of VRFs if each VRF is kept to a much lower number of routes and/or a lower number of neighbors.  This provides a baseline that then must be reduced based on the expected level of event-driven churn, the type of protocol chosen, etc.  In short, this is a difficult problem from a modeling and capacity planning perspective.
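A very rough way to reason about this inverse relationship is to treat the PE control plane as a single budget that each VRF draws against in proportion to its routes, neighbors, and expected churn.  The sketch below (Python) only illustrates the shape of the trade-off; the weights and budget are entirely hypothetical and would need to be calibrated per platform and software release.

   # Toy control-plane budget model for a PE.  Weights and budget are
   # hypothetical; they only illustrate the inverse relationship between
   # VRF count, routes per VRF, and neighbors per VRF.

   ROUTE_WEIGHT = 1.0
   NEIGHBOR_WEIGHT = 50.0
   CHURN_WEIGHT = 200.0          # per expected update/second in steady state
   PE_BUDGET = 1_000_000.0

   def vrf_cost(routes, neighbors, churn_per_sec):
       return (routes * ROUTE_WEIGHT
               + neighbors * NEIGHBOR_WEIGHT
               + churn_per_sec * CHURN_WEIGHT)

   def fits_on_pe(vrfs):
       """True if the combined cost of the candidate VRFs is within budget."""
       return sum(vrf_cost(*v) for v in vrfs) <= PE_BUDGET

   large = [(200_000, 20, 5.0)] * 4      # a few large, busy VRFs
   small = [(500, 2, 0.1)] * 800         # many small, quiet VRFs
   print(fits_on_pe(large))              # True
   print(fits_on_pe(small))              # True
   print(fits_on_pe(large + small))      # False: cannot host both populations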
It is fairly common for the contract or Service Level Agreement between SP and customer to include a maximum limit as to how many routes can be carried in a VRF.  At its most basic, this maximum can be used as a method to estimate the number of VRFs that can be present on a PE given its scaling limitations.  However, there is a wide gulf between a contractual limit of no more than N routes per VRF (with a corresponding configured limit) and the reality that many customers will not carry nearly that many routes.  This leads to the potential for significant stranded capacity.  Therefore the provider needs a way to have different tiers of "maximum routes allowed" so that capacity management can be done in such a way as to enable better loading of PE routers to take this relationship into account (e.g. populating a PE with a combination of high-scale and low-scale VRFs).  The alternative to this method would be to assume a standard maximum number of routes per VRF, and then, similarly to the way that carriers use statistical multiplexing and oversubscription to assume that not all customers will have their pipes full of bandwidth at the same time, make some assumptions about control plane capacity.  This may come in the form of an average that is calculated based on the actual number of routes in each VRF.  This has many challenges.  Among them: should the average be calculated per PE or network-wide?  What happens when there are too many VRFs that exceed the average on a given PE?  How does one add control plane capacity to a "full" router?  This may be a manageable model in a network with a robust and flexible provisioning system, such that high-scale VRFs can be moved between PE routers to balance the load, but each of these moves likely represents an outage for the customer and an opportunity for other errors to creep in, and it is not likely to be attractive due to the operational costs of managing the network.  In other words, it doesn't scale, but for a completely different reason.  Further, this VRF route limit may or may not be a physically enforced value.  Some PEs have an additional configuration knob per VRF that places a hard limit on the number of routes the VRF will accept.  This works well as a last-chance safety valve to protect the PE and the network in the case where a misconfiguration in the VRF causes a sudden and significant increase in the number of routes, but it can create inconsistencies in the VRF's routing table if there is a periodic or intermittent increase in the routes that causes the maximum to be periodically exceeded.  Unlike a BGP maximum prefix limit, which shuts down the BGP neighbor when a threshold is exceeded, there is no direct feedback to the peers that the VRF route limit threshold has been exceeded, and different implementations handle this in different ways in terms of how they drop or buffer routes, and how they resync once the routes are below the threshold again.  It may be appropriate to identify a common way for implementations to handle this limit, perhaps by triggering one or more PE-CE peering sessions to drop, so that this becomes a more useful tool to protect the PE from increases that would cause it to have scaling problems.
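Because the behaviour at the limit is implementation-specific, the following sketch (Python) shows only one hypothetical interpretation, and is not intended to describe any particular vendor's handling: routes beyond the configured per-VRF maximum are silently rejected and counted, with a warning raised at a soft threshold so the provider can react before the hard limit is reached.

   # One possible (hypothetical) handling of a per-VRF maximum-routes
   # limit: warn at a soft threshold, reject routes above the hard limit,
   # and keep a count so the condition is visible to monitoring.

   class VrfRouteTable:
       def __init__(self, max_routes, warn_fraction=0.8):
           self.max_routes = max_routes
           self.warn_at = int(max_routes * warn_fraction)
           self.routes = set()
           self.rejected = 0

       def add(self, prefix):
           if prefix in self.routes:
               return True
           if len(self.routes) >= self.max_routes:
               self.rejected += 1      # note: no feedback to the CE peer,
               return False            # which is exactly the gap noted above
           self.routes.add(prefix)
           if len(self.routes) == self.warn_at:
               print("WARNING: VRF at %d/%d routes"
                     % (self.warn_at, self.max_routes))
           return True

       def withdraw(self, prefix):
           self.routes.discard(prefix)

   vrf = VrfRouteTable(max_routes=1000)
   for i in range(1200):
       vrf.add("10.%d.%d.0/24" % (i // 256, i % 256))
   print(len(vrf.routes), vrf.rejected)   # 1000 200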
5.1. Route-reflection and scaling

Most of this document focuses on scaling at the PE router, but a discussion of route scaling would not be complete without at least a cursory mention of route-reflection [RFC4456].  While using route-reflectors to eliminate the need for a full mesh of PE routers is a common optimization, there are many different deployment models as far as whether dedicated route-reflectors are deployed vs. running an existing PE or P router as a route-reflector, how many are deployed and where, the method for ensuring diversity and redundancy, and even whether a router is used vs. a commodity PC running some sort of routing daemon.  From a scaling perspective, there are several considerations that are unique to the route-reflector design that will be discussed here.

Starting with the route-reflector itself, these devices often experience a worst-case scenario when it comes to storing entries in the RIB, exposure to route-churn, etc.  This is because they are not capable of filtering the routes from VRFs not locally configured on themselves, and they must carry all of the routes for all of the VRFs in the ASN.  This requires significant amounts of CPU and memory to store and manage these updates, and an underpowered route-reflector can quickly cause widespread convergence problems if it is unable to keep up with the load of receiving, processing, and propagating these updates.  Beyond CPU and memory, it may also be necessary to know how the router manages its FIB when running as a route-reflector.  A route-reflector is almost 100% control-plane, but if it tries to install all of the routes that it has in its RIB into the FIB, it may require very high-scale (and therefore costly) forwarding hardware to manage the large FIB.  It may be useful to select a device that is capable of optimizing for this control-plane-only mode and suppressing unnecessary routes from its FIB to reduce the overhead.  This is why some providers choose to use commodity PCs, which are well-suited for high-scale, processor- and memory-intensive control plane work, and can easily and cost-effectively be horizontally scaled.  The main consideration with using a PC instead of a router for route-reflection is that there may be implementation differences that lead to incompatibilities in terms of supported features, and there may be a different model in terms of how high-scale applications are managed, or even what bugs are exposed at maximum scale, all of which will require significant additional testing.

Route-reflector placement is another important consideration.  Because route-reflectors are control-plane devices, and the scale requirements for them are high enough that they can be expensive, the tendency might be to deploy two large, geographically-diverse, and horizontally scaled sets of them in order to provide an acceptable amount of diversity while deploying the fewest possible devices.
However, this leads to potential problems, with the geographic distance between the PE and the route-reflector producing geographic "routing artifacts".  (Geographic routing artifacts in this case refers to the phenomenon where the PE and the route-reflector are significantly distant from one another in the network, and the route-reflector chooses one or more best paths based on its view of the IGP, and then reflects those to its neighbors, even though there may be a better path at a given PE based on its location in the network and its view of the IGP.  Also, propagation delay and the latency it induces for updates and convergence may be a factor.)  Use of a small number of route-reflectors network-wide can also result in scaling problems based on the number of BGP sessions a given route-reflector must maintain.  Both of these items point to a larger deployment of smaller, more geographically diverse route-reflectors throughout the network, so that a given route-reflector is maintaining fewer BGP sessions with PE routers, has an IGP view of the network that is closer to that of the PEs it peers with, and can rapidly propagate local updates to the surrounding PEs.

The number of route-reflectors peering with each PE is a scaling consideration as well.  While two discrete route-reflector BGP sessions are the minimum needed to ensure proper redundancy, adding additional route-reflectors requires each PE to carry the additional state of those sessions, adding significant overhead with questionable value.

Related to route-reflector placement and the number of PE to route-reflector peering sessions is the use of cluster-IDs within a set of route-reflectors.  Cluster-IDs can be effectively used to reduce the number of duplicate route updates propagated between route-reflectors, thus reducing some of the same state and churn impact that is so critical in high-scale implementations.  However, their use can have unintended side effects.  In order to prevent inconsistency in the routing table, a PE must peer with all of the route-reflectors in a given cluster.  As a result, depending on how route-reflectors are spread out throughout the network and clustered together, this may create the need for a PE to either peer with multiple clusters, or to peer with one or more route-reflectors that are not optimal in terms of geographic placement in relation to the PE.  For example, if each cluster has two route-reflectors for redundancy, and there are three regional clusters (East, Central, West), PEs that sit in the overlap area between two cluster "regions" may have to peer with one or more route-reflectors that are farther away, or else peer with four route-reflectors (both clusters) in order to include the two closest to them.

5.2. Best Common Practices

A number of things can be done to improve the general route scaling.  Most BGP sessions can be configured with a similar set of protections as they would be if they were global Internet eBGP sessions, such as maximum prefix limits, inbound and outbound prefix filtering, etc.
Prefix filtering is less common within VPNs because the session is treated more like iBGP, where filtering is typically not recommended (***reference?***), or, as noted above, the VRF is viewed as part of the customer's network, and it is therefore not the SP's business (or problem) to do filtering in an application where doing so can only break that customer's network.  What is often more important in the case of individual VRFs is to configure an acceptable maximum number of routes that the VRF is permitted to carry.  This allows the SP to control their exposure to sudden increases in the memory footprint of the routing table, especially if a misconfiguration on the CE side leads to significant amounts of route leakage, such as when a significant portion of the global Internet routing table is suddenly leaked into the VRF.  However, it can also be used to enforce the assumptions about the number of routes per VRF that the SP has used to determine the other maximum scaling values, such as the number of VRFs per router, the number of sessions per router, etc.

As noted above, the number of VRFs per router, number of routes per VRF, and number of sessions per router and per VRF are all inter-related values in the way that they contribute to overall router scale.  The more of this information that is known in advance based on the design of the customer's network, the more it can be used as input to the provisioning system to determine the best available PE router on which to terminate the connections for consistent loading.  Since these values are usually estimates, and considerations like diverse router terminations may drive a specific choice, this is not by any means fool-proof, but it is a valuable optimization to improve the density of customers on a given router and maximize the return on investment for the capacity deployed.  It is worth noting, however, that many SP VPN networks have a different geographic spread than do their Internet service counterparts, with more POPs containing fewer routers, as it is important to provide more local handoffs to customers.  This may limit the SP's flexibility in terms of homing locations and router choices, and thus may be of limited value when controlling scale impacts on individual PE routers.

*** Authors' note: Should we discuss incremental SPF, next-hop tracking, SPF timer tuning (by protocol and AF), prefix prioritization, etc.?  All of these are generally thought of as convergence optimizations, and may be applicable here as a way to both reduce the CPU load and ensure that behavior is more deterministic, but I'm not sure how much depth we want to get into here, especially since some are vendor-specific or FIB-specific optimizations... ***

5.2.1. Topology-related optimizations

As has been discussed above, the topology of a given VPN and its placement on the available PE routers can be a significant contributing factor to the impacts of that VPN on the scaling limits of a given PE.  For example, a hub and spoke arrangement allows for some amount of aggregation and route summarization to be used, but there are limitations to its effectiveness at minimizing routing table growth, since this is typically implemented by the end customer and is dependent on how hierarchical their topology and IP addressing plan is.
While there are plenty of other good reasons to use a hub and spoke design, including security (traffic separation) between spoke sites, etc., generally a customer does not have much incentive to expend the time and effort to maintain a proper hierarchy or deal with the added complexity of a hub and spoke design if the only benefit is to improve route scaling.  A possible solution for some full-mesh topologies is to use Virtual Hub-and-Spoke in BGP/MPLS VPNs [RFC7024].  From the abstract:

   "With BGP/MPLS VPNs, any-to-any connectivity among sites of a given
   Virtual Private Network would require each Provider Edge router that
   has one or more of these sites connected to it to hold all the
   routes of that Virtual Private Network.  The approach described in
   this document allows to reduce the number of Provider Edge routers
   that have to maintain all these routes by requiring only a subset of
   these routers to maintain all these routes."

The value of this approach is that it is much less dependent on the individual customer to implement a hierarchy in order to conserve routing table entries.  The potential downside to this approach is that it introduces additional provisioning and troubleshooting complexity due to the way that routes are or are not imported, the use of default/summary routes, etc.  This approach also potentially exacerbates the problem discussed above where PEs are inconsistently loaded (in terms of total number of routes) from one PE to the next, as well as the provisioning difficulty that comes from a desire to find and use as much spare control plane capacity as possible without overloading a given PE.

5.3. Common problems at scale limits

As mentioned above, systems that are carrying a large number of VRFs and/or VRFs with large numbers of routes tend to be more sensitive during events due to the increased amount of periodic and event-driven processing that must be done to complete a walk of the routing table to process updates.  While optimization techniques may reduce the overhead of (re)programming the FIB after an update, there are fewer tricks to be employed in managing the RIB, and they are often vendor-specific, which leads to a lowest-common-denominator threshold in multivendor environments.

In addition to CPU constraints, it is common for route memory footprint to be a consideration if there are large numbers of VRFs with large numbers of routes.  Similarly to the way that high scale reduces the cushion of available CPU resources to absorb temporary peaks, as memory use reaches its high threshold, allocation of the remaining memory becomes less efficient and more fragmented, such that memory allocations may begin to fail well before the available memory is actually exhausted.  Depending on the specific implementation, the "largest free" may be more important than the "total free", and it may be difficult or impossible to coalesce the free memory to reduce fragmentation to an acceptable level.  As with other scaling problems, a failure of this type has the nasty habit of causing a cascade of problems.  Depending on how robust the system is at recovering from memory allocation failures, it may trigger restarts of critical routing processes or even the entire system.  These restarts may or may not be graceful and hitless, and even if they are locally a fairly low impact, they may trigger events on other routers due to the ripple effect of the network event itself.  It is also worth noting that there are hardware and software limits to how much memory a given system can use - if the router in question does not use a 64-bit OS, then it is unable to address more than 4GB of RAM, for example.  This may make an otherwise robust system incapable of scaling to the necessary level, and make memory usage an even more significant consideration.
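Where the platform exposes both figures, tracking the ratio between the largest contiguous free block and the total free memory provides an early warning for this failure mode.  The sketch below (Python) assumes the two values have already been collected by whatever platform-specific means are available (SNMP, CLI scraping, or streaming telemetry); the thresholds are illustrative only.

   # Sketch of a fragmentation early-warning check.  Thresholds are
   # illustrative; how total_free and largest_free are obtained is
   # platform-specific and outside the scope of the sketch.

   TOTAL_FREE_FLOOR = 256 * 1024 * 1024     # bytes
   FRAGMENTATION_RATIO_FLOOR = 0.25         # largest_free / total_free

   def memory_alerts(total_free, largest_free):
       alerts = []
       if total_free < TOTAL_FREE_FLOOR:
           alerts.append("low total free memory")
       if total_free and largest_free / total_free < FRAGMENTATION_RATIO_FLOOR:
           # Plenty of memory free in aggregate, but allocations of large
           # contiguous blocks may already be failing.
           alerts.append("free memory heavily fragmented")
       return alerts

   print(memory_alerts(total_free=2_000_000_000, largest_free=100_000_000))
   # ['free memory heavily fragmented']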
As with 950 other scaling problems, a failure of this type has the nasty habit of 951 causing a cascade of problems. Depending on how robust the system is 952 at recovering from memory allocation failures, it may trigger 953 restarts of critical routing processes or even the entire system. 954 These may or may not be graceful and hitless, and even if they are 955 locally fairly low-impact, they may trigger events on other 956 routers due to the ripple effect of the network event itself. It is 957 also worth noting that there are hardware and software limits to how 958 much memory a given system can use; if the router in question does 959 not use a 64-bit OS, for example, it cannot address more than 4 GB of 960 RAM. This may make an otherwise robust system incapable 961 of scaling to the necessary level, and make memory usage an even more 962 significant consideration.

964 6. Known issues and gaps

966 6.1. PE-CE routing protocols

968 While support for route flap dampening in BGP as a PE-CE routing 969 protocol is equivalent to its support in non-VPN applications, the 970 addition of IGP routing protocols such as OSPF creates a new problem, 971 in that there is not really a way to manage route dampening, either 972 by configuring it within the context of the IGP itself, or by 973 configuring it at the translation point where the IGP's routing 974 information is moved into the MP-BGP control plane infrastructure to 975 be exchanged between participating PEs across the VPN network. This 976 means that where IGPs are used, which is often a more 977 CPU-intensive and performance-sensitive arrangement to start with, the route flaps 978 associated with an unstable network will make a bad problem even 979 worse. It may be advisable for the IETF to document updates to 980 standards governing the use of IGPs as PE-CE routing protocols to 981 explicitly define the use of RFD in this application.

983 There are also no clear guidelines, based on testing and real-world 984 experience, for recommended timer values or appropriate use cases for 985 an IGP versus BGP as a PE-CE routing protocol. In other words, rather 986 than enterprises simply defaulting to whatever IGP is already in use 987 or the one they are most comfortable with, there may be certain cases where 988 use of an IGP is recommended, and others where it is not. Guidance in 989 this area may be very useful to both the SPs supporting these 990 networks and the engineers designing the corporate networks that make 991 use of them.

993 6.2. Multicast

995 Issues in multicast VPN scale?

997 6.3. Network Events

999 Guidance is needed on interface event dampening values (based on research and testing), 1000 as well as correlation tools to help determine root cause in a cascade failure.

1002 6.4. General Route Scale

1004 Route flap dampening may potentially be a best practice, but it has a 1005 number of shortcomings. First, there is no systematic way for end 1006 customers to view and clear dampening without some sort of 1007 advanced-functionality looking glass that allows them to view only the routes 1008 in their authorized VRFs. Also, allowing customers to make 1009 unattended clears of dampened routes may defeat the purpose of having 1010 dampening enabled at all, since customers may clear the dampening 1011 without addressing the underlying cause of the problem. In addition, 1012 as noted in [I-D.ietf-idr-rfd-usable] and 1013 [I-D.shishio-grow-isp-rfd-implement-survey], Route Flap Dampening is 1014 not widely used even within the Global Internet routing table, and 1015 its default parameter values likely need to be adjusted. Due to the differences in 1016 the characteristics of VPN routes compared with the global routing 1017 table, additional study and recommendations as to appropriate RFD 1018 values within a VPN are likely required. Additionally, it is not 1019 possible to configure RFD on IGPs, either natively within the PE-CE 1020 routing protocol or upstream where the learned routes are carried in 1021 MP-BGP. This means that in some cases, there is no way to insulate 1022 the SP network from the adverse impacts of rapid route churn.
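To make the parameter question concrete, the sketch below applies the usual exponentially-decaying-penalty model of RFD. The figures used (penalty per flap, suppress and reuse limits, half-life) are common defaults seen in deployed BGP implementations rather than a recommendation for VPN routes, and the whole fragment is illustrative only.

   import math

   # Illustrative RFD model: each flap adds a fixed penalty, and the
   # penalty decays exponentially with the configured half-life.  The
   # values below are typical defaults, not VPN-specific guidance.
   PENALTY_PER_FLAP = 1000
   SUPPRESS_LIMIT = 2000
   REUSE_LIMIT = 750
   HALF_LIFE_SECONDS = 900

   def penalty_at(flap_times, now):
       """Total decayed penalty at time 'now' for flaps at 'flap_times'."""
       return sum(PENALTY_PER_FLAP * 0.5 ** ((now - t) / HALF_LIFE_SECONDS)
                  for t in flap_times)

   def seconds_until_reuse(penalty):
       """Time for a suppressed route's penalty to decay to the reuse limit."""
       return HALF_LIFE_SECONDS * math.log2(penalty / REUSE_LIMIT)

   p = penalty_at(flap_times=[0, 60, 120], now=120)   # three quick flaps
   print(p > SUPPRESS_LIMIT, round(seconds_until_reuse(p) / 60))
   # -> True 29: three flaps in two minutes exceed the suppress limit,
   #    and the route stays dampened for roughly half an hour.

Whether such time constants are appropriate for VPN routes, given the different characteristics noted above, is exactly the kind of question that additional study would need to answer.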
1024 6.5. Modeling and Capacity planning

1026 There is a significant lack of multidimensional scale guidance and 1027 modeling for capacity planning and troubleshooting large-scale VPN 1028 deployments. This has a number of contributing factors. First, 1029 behavior at scale becomes increasingly non-deterministic as more 1030 variables interact simultaneously, so this is classically 1031 a difficult problem to model. Even worse, it's difficult to account 1032 in a model for latent design/implementation flaws: things that work 1033 well enough at moderate scale, but are not efficient enough for high 1034 scale, or suffer some sort of secondary impact due to dependencies, 1035 race conditions, etc. These problems are often found only through 1036 extensive testing, or they escape into production. Second, it is 1037 difficult to characterize an "average" implementation in such a way 1038 that it can be tested to failure in multiple permutations to provide 1039 a reasonably accurate multidimensional model. Consequently, the 1040 guidance available normally takes the form of multiple uni-dimensional 1041 scale thresholds plus some very conservative multi-dimensional 1042 thresholds. These conservative recommendations avoid 1043 risk to both the vendor and the implementer by catering to the lowest 1044 common denominator, but they have the adverse effect of leaving a lot 1045 of capacity sitting idle. Some vendors make an effort to 1046 characterize their customers' large-scale implementations such that 1047 they can better replicate real-world conditions, but gathering this 1048 information and devising ways to replicate the behavior in a lab is 1049 problematic and time-consuming.

1051 This leads to a follow-on issue, which is that there is a lack of 1052 instrumentation on critical scaling vectors. Some routers have very 1053 limited ability to provide useful data about critical scaling 1054 vectors (routing updates per second, changes in multicast state, 1055 sources of internal bottlenecks, etc.), either for use in a model or 1056 for use as additional capacity monitoring thresholds. While most 1057 routers can provide information about CPU usage and memory 1058 thresholds, and even which processes are consuming large amounts of 1059 resources, it often takes specially instrumented versions of the OS to 1060 provide a window into what is actually causing some sort of failure 1061 at scale. Because these are not routinely monitored, 1062 the provider may be blind to one or more early warning signs that the 1063 router is nearing its scaling limits and cannot take action to 1064 prevent exceeding those limits before it causes customer impacts.

1066 Additionally, even if this information is available, the provisioning 1067 systems used by most providers do not currently have the intelligence 1068 or visibility to make a decision regarding which PE to provision new 1069 customers on in order to evenly load the available PE routers. The 1070 provisioning system is often aware of the available physical or 1071 logical port capacity on a given router or site, and uses this as a 1072 key input to its port choice for newly provisioned customers. 1073 However, these additional capacity and scale vectors are based on 1074 real-time statistics from the router (CPU, memory load, etc.) and 1075 there is no interaction or feedback loop between the provisioning 1076 system and these types of real-time router scale stats. As a result, 1077 manual intervention is often required to either remove busy routers 1078 from the available capacity pool, move spare port capacity from a 1079 busy router to a full one, or even to reprovision customers to move 1080 them from one device to another to rebalance the load on each router.
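A minimal sketch of the missing feedback loop might look like the following. The counter names, thresholds, and polling interval are hypothetical placeholders, since what can actually be exported, and at what granularity, varies by platform.

   # Hypothetical feedback loop: derive simple scaling-vector rates from
   # successive counter samples and decide whether a PE should remain in
   # the provisioning candidate pool.  The names and thresholds are
   # invented for illustration and do not match any vendor's counters.
   THRESHOLDS = {"cpu_pct": 80, "mem_pct": 85, "route_updates_per_sec": 500}

   def rates(prev, curr, interval_sec):
       """Convert two counter samples into per-second rates."""
       return {"route_updates_per_sec":
               (curr["route_updates"] - prev["route_updates"]) / interval_sec}

   def eligible_for_new_service(curr, prev, interval_sec=300):
       """Return (eligible, reasons); not eligible if any early-warning
       threshold is crossed, so the provisioning system can skip this PE."""
       observed = {"cpu_pct": curr["cpu_pct"], "mem_pct": curr["mem_pct"]}
       observed.update(rates(prev, curr, interval_sec))
       reasons = [k for k, limit in THRESHOLDS.items() if observed[k] >= limit]
       return (not reasons, reasons)

   prev = {"cpu_pct": 40, "mem_pct": 70, "route_updates": 1200000}
   curr = {"cpu_pct": 65, "mem_pct": 88, "route_updates": 1230000}
   print(eligible_for_new_service(curr, prev))
   # -> (False, ['mem_pct']): memory crossed its early-warning threshold,
   #    so this PE would be skipped for new service until it recovers.

Even a crude check of this sort would let busy routers age out of the candidate pool automatically rather than through the manual intervention described above.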
1082 6.6. Performance issues

1084 In many ways, it's difficult to define a hard-and-fast scale limit, 1085 because each provider and customer has a differing view on what is 1086 an acceptable performance envelope both in steady state and during 1087 recovery from outages, whether planned or unplanned. In the most 1088 extreme sorts of network events, such as a heavily loaded PE router 1089 undergoing a cold restart, the scale considerations may take 1090 boot and convergence times that the involved 1091 parties would otherwise consider acceptable and extend them to the point where they 1092 significantly prolong the pain to which an end customer is 1093 exposed. They also have the added problem of making it difficult to 1094 predict the duration of an outage, because individual customer VRFs 1095 may be affected for differing amounts of time based on all of the 1096 factors that contribute to scaling and affect convergence. For 1097 example, if a customer has one critical route that happens to be 1098 among the last to converge, they perceive the outage to be ongoing 1099 until that last route converges, even if the entire rest of their 1100 network has been functional for a significant amount of time prior to 1101 that point.

1103 When dealing with scheduled outages, customers obviously prefer that 1104 they are never impacted. Since this is not really possible, they 1105 expect the provider to give them very clear and accurate guidance on 1106 what the impacts will be, when they will occur, and for what 1107 duration, so that they can set expectations for their customers. 1108 VPNs are often carrying mission-critical services and data, so any 1109 downtime is bad downtime. While a customer may be understanding of a 1110 scheduled maintenance with a 15-30 minute traffic interruption while 1111 a router reloads, they may be less so if the outage actually 1112 stretches for 60-90 minutes while the router runs at 100% CPU trying 1113 to deal with this worst-case sort of load or suffers intermittent 1114 cascade problems while any remaining cushion is used up dealing with 1115 the results of the event. These impacts may be largely invisible to 1116 the provider unless they have probes within each VRF or other means 1117 to verify that traffic is no longer impacted for a given customer. 1118 It's often difficult or impossible for a provider to tell the 1119 difference between a router that is fully converged but running near 1120 100% CPU after a reload and one that is thrashing, causing delays 1121 in convergence and customer traffic impacts while it runs at 100% CPU 1122 after a reload. Even worse, a scheduled or known outage on one 1123 router may trigger unplanned outages on other high-CPU devices. Even 1124 in unplanned outages, communication regarding impacts and duration is 1125 key, and these sorts of scale issues make it difficult to predict the 1126 impacts.
1128 6.7. High Availability and Network Resiliency

1130 In many cases, L3VPN services are carrying significant amounts of 1131 business-critical data. Customers and carriers design their networks 1132 to be robust enough to absorb single and sometimes even dual faults 1133 with little or no impact to the network as a whole. However, 1134 expectations as to the frequency and duration of outages due to 1135 either scheduled or unscheduled events continue to become more 1136 stringent. This is leading more providers to adopt features such as 1137 Non-Stop Forwarding and Non-Stop Routing, as well as In-Service 1138 Software Upgrades, to improve the chances that outages will be 1139 transparent to the underlying customers, networks, and applications 1140 using the network elements. As these become more common within the 1141 L3VPN space, they must be considered when evaluating PE scale. 1142 Often, the machinery necessary to make these reliability enhancements 1143 work requires duplication and sharing of state between multiple 1144 elements. At its most basic level, this state sharing takes more 1145 resources and more time the more state there is to be shared, so 1146 increases in the different scaling vectors discussed in this document 1147 will cause proportional increases in the complexity and resource 1148 requirements necessary for the combined feature set. In more complex 1149 scenarios and implementations, it may contribute to the complexity 1150 associated with capacity planning, and may make the system's response even 1151 more non-deterministic as scale increases.

1153 6.8. New methods of horizontal scaling

1155 When this document was being written, there was considerable 1156 discussion around the area of Software Defined Networking and 1157 OpenFlow [ONF]. These are technologies which provide a way to offload 1158 some of the more complex control plane elements to a more central 1159 controller device, which then programs the routing elements for 1160 correct forwarding plane operation. This is interesting for solving a 1161 problem such as the one described in this document, because it effectively 1162 decouples the growth of the control plane from the growth of the 1163 forwarding plane. In other words, it would be possible to continue 1164 allocating more and more CPU resources to the high-overhead control 1165 plane elements discussed above, and keep that growth almost totally 1166 independent of the physical forwarding plane resources required. 1167 While in some ways this would simply move the need for horizontal 1168 scaling elsewhere, rather than actually reducing the scaling 1169 considerations, the benefit is that an SP could use commodity compute 1170 hardware, which would potentially be lower cost and more easily 1171 scaled than the average PE router's CPU. The application of 1172 SDN/OpenFlow or any other interface to the routing system that offloads 1173 some control plane elements for improved BGP VPN scale is beyond the 1174 scope of this document, but may be a valid use case for future 1175 discussion within the IETF.

1177 7. To-Do list

1179 RFC EDITOR: Please remove this section before publication.

1181 Still not discussed in the document:

1183 Inter-AS VPN NNI scaling considerations (separate discussions on 10A, 1184 10B/hybrid, 10C?)
- include discussion on number of VRFs per NNI, 1185 routes per VRF, NNIs per router

1187 Label Exhaustion

1189 BGP Fast External Fallover

1191 additional scaling considerations if using L2TPv3 or RSVP-TE 1192 tunneling for PE-PE transport

1194 Future scaling considerations (MPLS-TP at the edge, interworking with 1195 L2 technologies, significant increases in density, etc.)

1197 8. Acknowledgements

1199 The idea for this draft came from a presentation made by Ning So 1200 during the CDNI working group meeting at IETF 81 in Quebec City where 1201 some of these same scaling considerations were discussed. Thanks also 1202 to Yakov Rekhter, Luay Jalil, Jeff Loughridge, Stephane Litkowski, 1203 Rajiv Asati, and Daniel Cohn for their reviews and comments.

1205 9. IANA Considerations

1207 This draft makes no request to IANA.

1209 10. Security Considerations

1211 Security considerations for IP VPNs are covered in the protocol 1212 definitions. This draft does not introduce any new security 1213 considerations, but it is worth noting that attack vectors that 1214 result in minor impacts in a low-scale environment may make the 1215 problems observed in a high-scale or resource-constrained environment 1216 worse, thereby magnifying the potential for impacts.

1218 11. Informative References

1220 [EIGRP] Wikipedia.org, "Enhanced Interior Gateway Routing 1221 Protocol".

1224 [I-D.ietf-idr-rfd-usable] 1225 Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O. 1226 Maennel, "Making Route Flap Damping Usable", 1227 draft-ietf-idr-rfd-usable-04 (work in progress), October 2013.

1229 [I-D.shishio-grow-isp-rfd-implement-survey] 1230 Tsuchiya, S., Kawamura, S., Bush, R., and C. Pelsser, 1231 "Route Flap Damping Deployment Status Survey", 1232 draft-shishio-grow-isp-rfd-implement-survey-05 (work in 1233 progress), June 2012.

1235 [IEEE802.1] 1236 IEEE, "Connectivity Fault Management".

1240 [IEEE802.3] 1241 IEEE, "Carrier Sense Multiple Access with Collision 1242 Detection (CSMA/CD) Access Method and Physical Layer 1243 Specifications".

1246 [ISO13239] 1247 ISO, "High-level Data Link Control protocol".

1251 [ONF] ONF, "The Open Networking Foundation".

1254 [RFC1661] Simpson, W., "The Point-to-Point Protocol (PPP)", STD 51, 1255 RFC 1661, July 1994.

1257 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 1258 Protocol 4 (BGP-4)", RFC 4271, January 2006.

1260 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1261 Networks (VPNs)", RFC 4364, February 2006.

1263 [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route 1264 Reflection: An Alternative to Full Mesh Internal BGP 1265 (IBGP)", RFC 4456, April 2006.

1267 [RFC4577] Rosen, E., Psenak, P., and P. Pillay-Esnault, "OSPF as the 1268 Provider/Customer Edge Protocol for BGP/MPLS IP Virtual 1269 Private Networks (VPNs)", RFC 4577, June 2006.

1271 [RFC4659] De Clercq, J., Ooms, D., Carugi, M., and F. Le Faucheur, 1272 "BGP-MPLS IP Virtual Private Network (VPN) Extension for 1273 IPv6 VPN", RFC 4659, September 2006.

1275 [RFC4781] Rekhter, Y. and R. Aggarwal, "Graceful Restart Mechanism 1276 for BGP with MPLS", RFC 4781, January 2007.

1278 [RFC4984] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB 1279 Workshop on Routing and Addressing", RFC 4984, September 1280 2007.

1282 [RFC5838] Lindem, A., Mirtorabi, S., Roy, A., Barnes, M., and R. 1283 Aggarwal, "Support of Address Families in OSPFv3", RFC 1284 5838, April 2010.

1286 [RFC5880] Katz, D. and D.
Ward, "Bidirectional Forwarding Detection 1287 (BFD)", RFC 5880, June 2010. 1289 [RFC6037] Rosen, E., Cai, Y., and IJ. Wijnands, "Cisco Systems' 1290 Solution for Multicast in BGP/MPLS IP VPNs", RFC 6037, 1291 October 2010. 1293 [RFC6513] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP 1294 VPNs", RFC 6513, February 2012. 1296 [RFC6565] Pillay-Esnault, P., Moyer, P., Doyle, J., Ertekin, E., and 1297 M. Lundberg, "OSPFv3 as a Provider Edge to Customer Edge 1298 (PE-CE) Routing Protocol", RFC 6565, June 2012. 1300 [RFC7024] Jeng, H., Uttaro, J., Jalil, L., Decraene, B., Rekhter, 1301 Y., and R. Aggarwal, "Virtual Hub-and-Spoke in BGP/MPLS 1302 VPNs", RFC 7024, October 2013. 1304 [Y.1731] ITU-T, "OAM functions and mechanisms for Ethernet based 1305 networks", . 1307 Authors' Addresses 1309 Wesley George 1310 Time Warner Cable 1311 13820 Sunrise Valley Drive 1312 Herndon, VA 20171 1313 US 1315 Phone: +1 703-561-2540 1316 Email: wesley.george@twcable.com 1318 Rob Shakir 1319 BT 1320 London 1321 UK 1323 Phone: + 1324 Email: rob.shakir@bt.com