Internet Engineering Task Force                                W. George
Internet-Draft                                         Time Warner Cable
Intended status: Informational                                 R. Shakir
Expires: August 18, 2014                                              BT
                                                       February 14, 2014

                     IP VPN Scaling Considerations
                        draft-gs-vpn-scaling-03

Abstract

This document discusses scaling considerations unique to implementation of Layer 3 (IP) Virtual Private Networks, discusses a few best practices, and identifies gaps in the current tools and techniques which are making it more difficult for operators to cost-effectively scale and manage their L3VPN deployments.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 18, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .   2
   1.1. Intention of this Document . . . . . . . . . . . . . . . .   3
   1.2. Horizontal vs. Vertical Scaling . . . . . . . . . . . . . .   5
   1.3. Developing Requirements for Scaled L3VPN Environments . . .   6
2. PE-CE routing protocols . . . . . . . . . . . . . . . . . . . .   6
   2.1. Best Common Practice  . . . . . . . . . . . . . . . . . . .   7
   2.2. Common Problems at Scale Limits . . . . . . . . . . . . . .   9
3. Multicast . . . . . . . . . . . . . . . . . . . . . . . . . . .  10
   3.1. Best Common Practices . . . . . . . . . . . . . . . . . . .  10
   3.2. Common Problems at Scale Limits . . . . . . . . . . . . . .  11
4. Network Events  . . . . . . . . . . . . . . . . . . . . . . . .  11
   4.1. Best Common Practices . . . . . . . . . . . . . . . . . . .  11
   4.2. Common Problems at Scale Limits . . . . . . . . . . . . . .  12
5. General Route Scale . . . . . . . . . . . . . . . . . . . . . .  13
   5.1. Route-reflection and scaling  . . . . . . . . . . . . . . .  16
   5.2. Best Common Practices . . . . . . . . . . . . . . . . . . .  18
      5.2.1. Topology-related optimizations . . . . . . . . . . . .  19
   5.3. Common problems at scale limits . . . . . . . . . . . . . .  20
6. Known issues and gaps . . . . . . . . . . . . . . . . . . . . .  21
   6.1. PE-CE routing protocols . . . . . . . . . . . . . . . . . .  21
   6.2. Multicast . . . . . . . . . . . . . . . . . . . . . . . . .  22
   6.3. Network Events  . . . . . . . . . . . . . . . . . . . . . .  22
   6.4. General Route Scale . . . . . . . . . . . . . . . . . . . .  22
   6.5. Modeling and Capacity planning  . . . . . . . . . . . . . .  22
   6.6. Performance issues  . . . . . . . . . . . . . . . . . . . .  23
   6.7. High Availability and Network Resiliency  . . . . . . . . .  24
   6.8. New methods of horizontal scaling . . . . . . . . . . . . .  25
7. To-Do list  . . . . . . . . . . . . . . . . . . . . . . . . . .  25
8. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . .  26
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . .  26
10. Security Considerations  . . . . . . . . . . . . . . . . . . .  26
11. Informative References . . . . . . . . . . . . . . . . . . . .  26
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . . .  28

1. Introduction

As IP networking has become more ubiquitous and mature, many enterprises have begun migrating away from legacy point-to-point or layer 2 virtual private network (VPN) implementations toward layer 3 VPNs.  The VPN implementation defined by RFC 4364 [RFC4364] enables flexible and robust implementations of IP VPNs.  However, in practice, it has become clear that it is subject to significant scaling challenges beyond the considerations discussed in RFC 4364.  In many cases, the limits of scale for a given platform are not in sync with the maximum physical and logical interface density supported by the platform, such that a platform may be considered "full" long before the physical slots and ports have all been filled with equipment and connections.  This represents an inefficient use of space and power, as well as stranded capital assets, all of which increase the operator's cost to provide the service as well as the complexity of managing the platform to ensure proper service levels in a wide variety of circumstances.  While these scaling considerations are somewhat similar to the scaling concerns experienced in the Global Internet, those are at best a subset of the overall problem, and the applicable solutions and best practices may not overlap a great deal.
The added complexity and feature set required to support today's enterprise IP networks drives additional scaling considerations for large deployments.  A common response to concerns about control plane scale is simply to "throw hardware at the problem" in the form of ever-increasing amounts of memory and CPU resources.  In some cases, this may be the only solution, but similarly to the concerns identified in RFC 4984 [RFC4984], there are limits to the growth curve that can be supported and cost-effectively deployed by a VPN provider such that their service remains profitable, and therefore it is necessary to explore the potential for optimization to make the existing resources stretch further.

Generally, router scale can be considered in one of three areas: forwarding capacity, interface density, and control plane capacity.  This draft will focus almost exclusively on control plane capacity, because while the others are important considerations for most operators, they are less affected by the details of how L3VPN is implemented either by the router vendor or the operator.  Interface density is usually a factor of the forwarding capacity of a given module or slot as well as physical packaging.  In this application, interface density is interesting from the perspective of its impact on the control plane - more interfaces mean more of all of the different factors that contribute to control plane load, and the operator wants to be able to strike a balance between interface density and control plane capacity such that neither grows out of pace with the other.

1.1. Intention of this Document

This document is intended to provide a discussion of the challenges that network operators face in deploying large-scale L3VPN environments at the time of writing, together with two key sets of recommendations: those that apply to network operators regarding the deployment of particular technologies, and those that apply to network protocol and operating system implementors, relating to providing a better understanding of the scaling characteristics of deployed equipment.

The best practices defined in this document are intended to allow more optimal scaling of L3VPN networks, whilst minimising the impact on end-customer network behaviour.  It is intended that such guidance can be directly utilised by Service Providers to improve the scalability of network elements.  However, the guidance in this document should not be viewed as a panacea to the problems of scaling network elements.  It is the intention of the authors to document a number of key problems experienced in such environments and to provide information to the SP that may result in more optimal deployment of existing technologies.  Additionally, there is a point at which the limits of hardware will be reached, and hence new network elements are required.  The recommendations provided to Service Providers within this document are intended to allow the resources that exist within existing elements to be utilised in the most efficient manner.  Clearly, the optimal point in this balance is that the data-plane and control-plane scale to support similar levels of service termination, so as to result in minimal "over provisioning" of one element.
The scaling considerations presented in this document are intended to provide both network operators and network equipment implementors with further guidance around the toolset and information required for accurate capacity planning in L3VPN environments.  Again, the authors consider that the scaling characteristics and toolsets required of L3VPN PE equipment diverge somewhat from those required by Internet network equipment.  In Internet deployments, relatively standardised interconnects exist across all deployments, typically utilising either static routing or BGP-4.  As such, each connected port comes with a relatively standard overhead in terms of the protocols required.  Whilst there is some variance in how "chatty" each customer connection may be, this is balanced by the fact that the whole Internet routing table is typically held on such edge equipment (and hence an individual customer's instability tends to be relatively small when compared to the instability of the Internet DFZ).  In addition, since such instability is limited to relatively few impacts to a node (interface or BGP session flapping, and BGP UPDATE messages), routers can be optimised to cope with such instability.  Counter to this, the L3VPN environment does not have a standardised connectivity model, and typically connects to much less controlled environments.  Further details of this are provided within later sections of this document.  The result of this difference is that 'headline' scaling figures presented for particular equipment tend to be of limited utility to a network operator.  The recommendations within this document outline some of the considerations that must be made when assessing the scaling of such elements, and provide guidance as to the missing inputs and tools that are required to provide information around the capacity of such elements.

1.2. Horizontal vs. Vertical Scaling

Within this document, two forms of 'scaling' are referred to.  The "throw hardware at the problem" approach outlined previously involves deploying additional network elements in order to provide further network capacity; throughout this document, this approach is referred to as horizontal scaling, insofar as it requires parallel deployment of numerous similar elements and balancing the load across the combined capacity of all of the elements.  The approach of increasing the capacity of an individual node, by allowing the control plane capacity to support the maximum forwarding plane capacity (be it data forwarded or available ports), is referred to as vertical scaling.  It is obvious that at some point the approach of horizontal scaling of elements is required, due to either exhausting available port capacities or available forwarding plane; however, there are a number of motivations for delaying such provisioning, some of which relate directly to the characteristics of L3VPN environments.

Since a significant proportion of the customers who purchase L3VPN services are Enterprise customers, typically the service is utilised as a WAN for their inter-location connectivity.  Since such a customer base tends to be distributed according to differing factors, such customers connect in numerous geographical locations.
The requirement to support service in these locations therefore requires the service provider network architecture to support geographically distributed access into such services.  A balance must be struck between the extent to which access networks are utilised to backhaul traffic to the service layer, and the geographical distribution of the service layer itself.  Both the scale and performance characteristics of such networks tend to result in more geographical distribution of service layer elements than in Internet deployments.  This distribution results in two particular changes.  Primarily, the idea of a "point-of-presence" must be reconsidered: where an assumption in Internet environments may be that there are separate core and access elements within a single location, within a distributed L3VPN environment a point of presence may be a single PE device.  The result of these small-scale points of presence is that numerous core and edge functions must be collapsed onto a single device.  For this reason, the approach of adding additional devices to the network may have an impact on a further subset of devices within the network (particularly due to any mesh-based protocols that are deployed), and hence result in a change in the scaling characteristics of these devices.  In this case, there is further motivation to avoid large numbers of devices in the network where possible.  Further to this, the smaller PoP profile may result in physical constraints around the deployment of additional network elements, particularly due to the availability of power and physical space to deploy such elements.

1.3. Developing Requirements for Scaled L3VPN Environments

Whilst the collected scaling considerations outlined in this document are based on the authors' collective experience within various Service Provider networks, and discussions with operators of similar networks, it should be noted that the problems outlined in this document are not static.  With the growth in the use of IP as the underlying transport of many services, the demand for L3VPN environments has grown.  As such, this has meant that various technologies are being considered to allow growth of these networks at a lower cost point and to a wider footprint than was previously required.  A network operator must therefore consider the extent to which the service layer must be built, both to meet economic and technical requirements.  With newer aggregation methods, the service layer edge (and hence the L3VPN PE) acquires responsibility for inter-working between newer dynamic aggregation technologies and the existing IP network.  As such, these edge functionalities result in further load being placed onto these network elements.

*** Author's note: Do we want to put anything about NNI for footprint extension here? Datacenter edge - perhaps Ning's problem around the L3VPN edge in his datacentres? ***

2. PE-CE routing protocols

One of the things that makes IP VPNs so flexible and robust is their ability to participate in the encapsulated network's routing protocols, where the customer edge (CE) router has a direct neighbor relationship with its upstream provider edge (PE) router in order to exchange routing information about the Virtual Routing and Forwarding (VRF) instance that represents the VPN.
In many cases, this is managed through a combination of static routes and BGP neighbors, but IGPs such as OSPF [RFC4577] are often supported because they enable a more complete integration into an existing enterprise network design and topology.  In some single-vendor implementations, carriers support proprietary routing protocols such as EIGRP [EIGRP].  IGPs may also be chosen due to a belief that they will respond more rapidly during a failure than BGP will.  In reality, this may not be true, because VRF routing information is still carried in MP-BGP from PE to PE, and the PE-CE routing protocol's characteristics are only locally significant.  In fact, the increased overhead may lead to slower convergence times than a more standard BGP implementation.

IGPs often translate to a significant increase in overhead due to their inherent characteristics as link-state routing protocols requiring full topology databases and flooding of updates to all participants, and the fact that they invoke additional processes on the router when compared to simply using BGP (which is already going to be running on a router using MP-BGP for VPNs).  While a router may be able to scale almost effortlessly with a few thousand routes in a single IGP plus hundreds of thousands of routes and many neighbors in BGP, it may be quickly challenged if it is also required to run multiple instances of an IGP, each with a certain number of routes that must be moved into MP-BGP to be passed to the rest of the VPN infrastructure.  The advent of support for IPv6 within a VPN (6VPE) [RFC4659] has the potential to make this problem worse, especially in the case of OSPF, where it now requires both OSPFv2 [RFC4577] and OSPFv3 [RFC6565] to run as separate instances for the two address families, or the use of multiple instances of OSPFv3 to support multiple address families as documented in RFC 5838 [RFC5838].

Another consideration in PE-CE routing protocols is the timers used for each session.  These will be discussed in greater detail in the best practices section.

2.1. Best Common Practice

Ultimately, the decision as to which PE-CE routing protocols to support is a business decision much more often than it is a technical one, because there are few use cases where something other than BGP and static routing as PE-CE routing protocols is a technical requirement.  If a provider chooses to support additional protocols, especially IGPs, they should consider the effects that these have on the overall scaling profile of the PE routers and the network as a whole when determining if and to what extent they will support other protocols.

Often, those designing VPN solutions attempt to use extremely aggressive routing protocol timer and keepalive values as a means of rapid failure detection and reconvergence.  This tends to make PE-CE routing protocols more fragile and increase the load on the PE router with questionable benefit.  This is especially common in scenarios where the network designer is attempting to replicate native IGP-like failure detection and reroute capabilities using BGP.  In order to avoid this, the preferred values should be set to something that is appropriate for large-scale implementations.  Further, because timer and keepalive values are often negotiated based on the more aggressive neighbor, it is a good idea to set a minimum acceptable value, so that instead of being forced to support negotiated timer values that are too aggressive for the scale that a given PE router is expected to support, the neighbor session will simply stay down until the remote end timers are reconfigured to a more acceptable value.  This acts as a safety valve against abuse that can destabilize a router used by multiple customers.  Because aggressive timers may be unavoidable in certain situations, it may be advisable to track the number of sessions which are provisioned with aggressive timers versus how many are using more conservative timers on a per-router basis, so that effort can be made to balance aggressive and conservative timers on each router.  This will help to prevent "hot-spots" where, given a similar port and VRF density, some routers have significantly higher CPU usage in steady-state than others.
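As an illustration only, the following sketch (Python) shows the two behaviours described above: refusing to bring up a session whose proposed hold time is below a provider-configured floor, and tallying sessions by timer profile so that aggressive sessions can be balanced across routers.  The numeric thresholds are hypothetical, not recommended values.

   # Sketch of a provider-side PE-CE timer policy.  Thresholds are
   # illustrative assumptions, not recommendations.

   from collections import Counter

   MIN_ACCEPTABLE_HOLD_TIME = 90  # seconds; floor enforced by the PE
   AGGRESSIVE_THRESHOLD = 30      # sessions at/below this are "aggressive"

   def negotiate_hold_time(local_proposed, peer_proposed):
       """Return the negotiated hold time, or None if the session should
       stay down because the peer's proposal is below the provider floor."""
       negotiated = min(local_proposed, peer_proposed)
       if negotiated < MIN_ACCEPTABLE_HOLD_TIME:
           # Rather than running with timers that are too aggressive for
           # the PE's scale, leave the session down until the CE side is
           # reconfigured to an acceptable value.
           return None
       return negotiated

   def timer_profile_tally(hold_times_by_neighbor):
       """Classify per-neighbor hold times so that provisioning can
       balance aggressive and conservative sessions across PE routers."""
       tally = Counter()
       for hold_time in hold_times_by_neighbor.values():
           if hold_time <= AGGRESSIVE_THRESHOLD:
               tally["aggressive"] += 1
           else:
               tally["conservative"] += 1
       return tally

   print(negotiate_hold_time(180, 9))     # None: session stays down
   print(negotiate_hold_time(180, 120))   # 120
   print(timer_profile_tally({"ce1": 30, "ce2": 180, "ce3": 90}))
   # Counter({'conservative': 2, 'aggressive': 1})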
It is important to realize that while use of aggressive routing protocol timers is not a scalable way to do fast failure detection, fast failure detection is still a requirement for many customers.  Because this is becoming such a table-stakes requirement, the provider must consider other alternatives such as Bidirectional Forwarding Detection [RFC5880], Ethernet OAM 802.1ag [IEEE802.1], ITU-T Y.1731 [Y.1731], LACP 802.3ad [IEEE802.3], and the like.  These extensions often come with their own scaling considerations, but more and more they are implemented in a distributed fashion so that instead of affecting the main router CPU like a routing protocol might, they offload that processing to the linecard CPU, and therefore can support more aggressive scale.  The general philosophy is that these lower-layer detection mechanisms should serve as the primary detection and failure point, with the upper layer routing protocols only serving as a backstop if the failure is not detected by the lower level protocols for some period of time.

Another important consideration is that there is not likely to be a "one-size-fits-all" solution when it comes to setting timers and policies around PE-CE routing protocols.  At a minimum, a distinction should be made between sites that have only a single upstream connection and those that have two or more diverse connections to the network.  Further distinction can be made based on the importance of the site, whether it is a hub site or an end site.  These can all be used to determine the aggressiveness that is appropriate for the timers and perhaps even which routing protocols are appropriate.  For example, an end site with a single upstream connection likely does not need very aggressive timers and may be able to get by using only static routing, while a hub site with multiple connections and a need for rapid restoration and reaction to any routing changes may need BGP along with aggressive lower-layer timers for fault detection.

2.2. Common Problems at Scale Limits

Two problems are commonly seen on a heavily-loaded system.

The first is that CPU cycle constraints, even before the system reaches the point of scheduler thrashing, often lead to one or more routing protocol neighbor hello drops.  If several consecutive drops occur, the remote neighbor may declare the session dead, which triggers a restart of the connection and a resync of the routing data.  Because this connection initialization requires dedicated CPU cycles to generate, receive, acknowledge, and process the updates, it increases the CPU utilization further, which may trigger additional hello failures and neighbor resets, resulting in a snowball effect where a relatively minor event rapidly becomes a major one due to interactions between multiple scaling limitations.  This problem is made worse by extremely aggressive timer values, because they raise the baseline CPU load with more frequent hellos and responses, and are more sensitive to drops caused by increased CPU load.  Further, because failures brought on by loss of hello packets are unlikely to invoke any graceful restart [RFC4781] machinery that the system may support, it is unlikely that the session reset will be able to take advantage of optimizations like only syncing the changes that occurred while the session was dead, thus increasing the outage time and the CPU cycles to get things back into sync.
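The sensitivity to aggressive timers can be illustrated with simple arithmetic.  Assuming the common convention of a hold time equal to three times the hello or keepalive interval (the values below are examples only), the sketch (Python) shows how few consecutive hellos may be lost before the peer declares the session dead:

   # Illustration of how aggressive timers shrink the margin for missed
   # hellos during a CPU spike.  Timer values are examples only.

   def missed_hellos_tolerated(hello_interval, hold_time):
       """Consecutive hellos that can be lost before hold timer expiry."""
       return int(hold_time // hello_interval) - 1

   for hello, hold in [(60.0, 180.0), (10.0, 30.0), (1.0, 3.0)]:
       print(f"hello={hello}s hold={hold}s: tolerates "
             f"{missed_hellos_tolerated(hello, hold)} missed hellos, "
             f"fails after {hold}s of silence")

With conservative timers the router can skip a couple of hellos during a momentary CPU spike and recover; with one-second hellos the same spike takes the session down and triggers the resync behaviour described above.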
Another potential issue during times of high-CPU operation is related to process prioritization.  This is applicable in different ways for both multithreaded and interrupt-driven OS architectures.  In each case, the scheduling algorithm that the router uses to prioritize different CPU cycle work items and manage the timeslices individual tasks are given to complete may require significant tuning and prioritization in order to ensure the desired behavior during high CPU usage.  Improperly tuned or prioritized processes may significantly delay completion of routing table/update processing such that it may take an excessive amount of time for the routing table to converge properly.  This issue is further exacerbated if the VRF instance has a large number of routes, or is prone to frequent event-driven route churn.  In some cases, the routing table in a given VRF may never fully converge, leading to routing loops, traffic loss, inconsistent latency, and a generally adverse customer experience.

These items can also have a cascade effect on other routers in the system if they also participate in a given VRF that is being affected by this type of scaling issue.  Not only is the local PE router affected, but any upstream route-reflectors, as well as other PEs, and even CEs participating in this VRF will see increased CPU cycles in order to receive and process the increased flow of updates driven by the local churn.

***specific items related to different PE-CE protocols?***

3. Multicast

Multicast support within a VPN [RFC6513] has become an increasingly popular feature, but comes with its own scaling considerations.  Depending on the application, the frequency at which multicast state changes within a given VPN (e.g. PIM joins and prunes) will contribute to the CPU load on the router, and any instability in the network can potentially increase these changes as remote sites flap.  In extreme cases, PIM neighborships can be lost during events, disrupting the flow of multicast traffic.
It should be noted that, in some cases, dynamic action is required by a PE device to support the transition of flooding of multicast data from a non-optimal distribution tree (the default MDT in [RFC6037], or the I-PMSI) onto a more optimal one (a data MDT or S-PMSI).  Where such a transition is required, the nature of the traffic sourced by an end user of the L3VPN service must be considered.  The net result of this consideration is that it becomes increasingly difficult to reliably gauge the scaling impact of specific end-site deployments.  Additional scaling considerations around multicast in a VPN are related to the size and number of multicast streams.  While this is a consideration whenever multicast is used, even outside of a VPN, because of the bandwidth utilization it may generate in the core, the additional overhead of implementing multicast within a VPN makes this a more significant consideration in this case.  Related to the previous consideration is the stream fanout - the number of P and PE router paths in the network that could potentially carry a given multicast stream based on the number of PEs that are configured with a given multicast-enabled VRF, and the number that actually do carry the stream based on actual receivers joining the stream behind that PE.

*** This section is quite weak. We're looking for contributors who can assist in fleshing this out ***

3.1. Best Common Practices

Multicast BCPs???

3.2. Common Problems at Scale Limits

Multicast tree interruptions

PIM neighbor adjacency drops

4. Network Events

Network events are an important scaling consideration because they can have wide-ranging impacts far beyond the individual VRF or even PE router that experiences the event.  At high scale, a seemingly innocuous event on one router or VRF can trigger secondary impacts and outages on remote routers elsewhere in the network.  Correlating these events for root cause analysis can be challenging by itself, and trying to characterize the impacts as they relate to scale in a way that informs the provider's decisions is even more difficult.  Different types of network events that can contribute include interface flaps, hardware and software outages (both planned and unplanned), externally driven route-churn events (such as those that originate on an NNI partner's network), and configuration changes.

4.1. Best Common Practices

While this document suggests that lower layer failure detection protocols like BFD and Ethernet OAM be more aggressive so that routing protocol timers can be more conservative, it is still important to remember that this can generate false positives or excessive churn that will cascade into a scaling problem in other parts of the system, so the timers should not automatically be configured to their minimum supported values.  Rather, each application may be slightly different, and the timers should only be set as aggressively as necessary to ensure acceptable performance of the applications in question.  It may be appropriate to set limits (e.g. in provisioning logic/rules) as to the number of interfaces per router and per VRF that can use aggressive, moderate, and conservative interface timers.
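One way to apply such limits is in the provisioning logic itself.  The sketch below (Python) rejects a new attachment circuit whose timer profile would exceed the allowed mix on the target PE or within the VRF; the caps shown are hypothetical and would need to be derived from the platform's tested limits.

   # Sketch of a provisioning-time check on interface timer profiles.
   # Caps are illustrative only.

   PER_ROUTER_CAPS = {"aggressive": 50, "moderate": 200}  # none on conservative
   PER_VRF_CAPS = {"aggressive": 4, "moderate": 20}

   def can_provision(profile, router_counts, vrf_counts):
       """True if one more interface with this timer profile fits within
       both the per-router and the per-VRF caps."""
       router_cap = PER_ROUTER_CAPS.get(profile)
       vrf_cap = PER_VRF_CAPS.get(profile)
       if router_cap is not None and router_counts.get(profile, 0) >= router_cap:
           return False
       if vrf_cap is not None and vrf_counts.get(profile, 0) >= vrf_cap:
           return False
       return True

   # The fifth aggressive interface in a VRF is refused (or flagged for
   # manual review), while a conservative one is always accepted.
   print(can_provision("aggressive", {"aggressive": 10}, {"aggressive": 4}))  # False
   print(can_provision("conservative", {"aggressive": 10}, {}))               # True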
Even with timers set as conservatively as the application will allow, churn is unavoidable.  For this reason, it is also a good idea to use interface-level dampening such as hold-down timers or event dampening in order to ensure that interfaces that flap too rapidly will not telegraph that churn into the upper-layer routing protocols any more than necessary.  BGP Peer Oscillation Dampening (DampPeerOscillations, [RFC4271]) may also help to reduce intermittent outage-based churn while leaving the interface itself unaffected.  All of these dampening measures help to ensure that problems are localized to a single PE or even a single interface, rather than causing instability and routing churn throughout the VRF and the provider network.

In addition to interface dampening, it may be advisable to consider implementing some manner of route flap dampening (RFD) [I-D.ietf-idr-rfd-usable] to assist in reducing the impact that route churn may have on the SP's network infrastructure.  This is currently fairly uncommon within VPN environments, and is not without controversy.  While it may help with scaling, it also requires each PE to maintain more state to store and compute the per-prefix penalty values, which may reduce the benefits gained by implementing RFD.  Further, customers typically expect a fair amount of transparency in the provider's participation in their routing instances.  Many providers and customers view a VPN or VRF as a part of the customer's internal network, and therefore compartmentalized, so that the customer can only affect their own routing if they have a problem with excessive route flaps.  In addition, if routes are dampened, it requires intervention from the SP to clear the dampening, which can potentially add to the outage time that a customer experiences once the issue that triggered the dampening is resolved.  Implementing RFD may even drive the need for a customer-accessible looking glass, which is far more complex in the VPN space owing to the requirement to prevent one customer from looking at another's VRF routes on a common platform.
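For readers unfamiliar with the mechanism, the per-prefix state referred to above follows the general exponential-decay model used by common RFD implementations, sketched below in Python.  The penalty, suppress, reuse, and half-life figures are placeholders; as discussed in Section 6.4, appropriate values for VPN routes remain an open question.

   # Minimal sketch of route flap dampening state for a single prefix.
   # All figures of merit are placeholders, not recommendations.

   import math

   PENALTY_PER_FLAP = 1000.0
   SUPPRESS_THRESHOLD = 3000.0
   REUSE_THRESHOLD = 750.0
   HALF_LIFE = 900.0            # seconds

   class DampenedPrefix:
       def __init__(self):
           self.penalty = 0.0
           self.last_update = 0.0
           self.suppressed = False

       def _decay(self, now):
           elapsed = now - self.last_update
           self.penalty *= math.pow(0.5, elapsed / HALF_LIFE)
           self.last_update = now

       def flap(self, now):
           """Record one withdraw/re-announce cycle for this prefix."""
           self._decay(now)
           self.penalty += PENALTY_PER_FLAP
           if self.penalty >= SUPPRESS_THRESHOLD:
               self.suppressed = True

       def usable(self, now):
           """True if the prefix may be used/advertised at time 'now'."""
           self._decay(now)
           if self.suppressed and self.penalty < REUSE_THRESHOLD:
               self.suppressed = False
           return not self.suppressed

   p = DampenedPrefix()
   for t in (0, 60, 120, 180):           # four flaps in three minutes
       p.flap(t)
   print(p.usable(180))                  # False: suppressed
   print(p.usable(180 + 4 * HALF_LIFE))  # True after ~4 quiet half-lives

The point of the sketch is simply that this penalty state must be stored and decayed for potentially every prefix in every VRF, which is the additional memory and CPU cost weighed above.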
4.2. Common Problems at Scale Limits

Network events are both a cause and a symptom of a system running at or near its scaling limits.  As noted above, event-driven routing table churn or routing protocol interactions can significantly drive up CPU usage on the locally connected PE as well as on other PEs and CEs participating in the VRF.  If routes are constantly changing due to a preferred path repeatedly being added and removed, latency and jitter numbers can be affected in a way that adversely affects applications sensitive to this sort of change.  Network events can also be triggered by routers with high CPU, because similarly to systems which may have aggressive routing protocol timers for enhanced failure detection, systems with centralized CPU-based implementations for lower-layer protocols (such as HDLC [ISO13239], PPP [RFC1661], LACP, and BFD/EOAM) may start losing keepalives and declaring outages that result in physical interfaces being torn down and restored.  Again, implementations that choose timer and multiplier values or numbers of sessions at or near the maximum rated scaling for the device put the operator in a position where there is very little headroom to deal with an event that momentarily spikes CPU usage, meaning that the likelihood of a cascade failure dramatically increases.

As above, these network events may be something that occurs elsewhere in the network, and may trigger a failure on a completely different PE or CE router.  The danger with this is that it is extremely difficult to troubleshoot and correlate root causes when the outage observed isn't caused by an event on the same router.  Failures become increasingly non-deterministic and difficult for operators to manage and address.

5. General Route Scale

PE routers in a carrier network can have many different implementation scenarios.  Some carriers implement a dedicated PE router that is only responsible for carrying VPN routes and therefore may only carry IGP routes in its global routing table, rather than a full Internet routing table.  Others use combined edge routers that carry full routes plus a complement of customer VPN routes, and some even place the full Internet routing table into one or more VRF instances.  The issue here is that the weight of all of these routes and paths must be combined when considering the maximum scale of the router, both in terms of memory footprint and in terms of convergence times.  The addition of an 8-byte RD prepended to the IP address to ensure uniqueness means that each VPN prefix takes up incrementally more physical space in memory than an equivalent non-VPN route.  Further, the greater the number of address families running simultaneously on the same router, the more sensitive it will be to event-induced churn, since each address family (and VRF) often has its own independent computation/SPF run.  The addition of IPv6 support within both the global routing table and within a VPN adds yet another source of routing table bloat.  A PE router can be running a combination of any of the following address families:

o  Global IPv4 unicast

o  Global IPv4 multicast

o  VPN IPv4 unicast

o  VPN IPv4 multicast

o  Global IPv6 unicast

o  Global IPv6 multicast

o  VPN IPv6 unicast

o  VPN IPv6 multicast

Even PE routers that do not carry the full Internet routing table are still required to carry a minimal number of IGP routes, LDP information, and some amount of TE tunnel state, adding to the items competing for scale.  On high-scale PE routers, the VPN routing tables are often as large as or larger than the equivalent global routing table in both number of routes and number of paths.  This is at least partially due to the fact that there are no constraints on the customer addressing plan within a VPN other than that the addresses cannot conflict within a given VRF, or with any extranet with which the VRF interconnects.  As such, customers may not necessarily adhere to any best practices to control the deaggregation of the routing table, such as hierarchical addressing, aggregation and summarization of announcements, and minimum prefix lengths.  It is also quite likely that connected interfaces will be redistributed, and little or no route filtering may take place.  Most PE routers use the absence of a given VRF instance (or RD/RT filtering) to limit the number of routes that they must actually carry, but this is sometimes of limited utility for a couple of reasons.  First, it leads to an inconsistent routing table footprint from one PE router to the next, and that footprint can change with every new customer turned up on the router.  These variations lead to non-deterministic performance and scale from PE to PE and from customer to customer.  In other words, PE1 may be fine from a scale perspective, while PE2, which has the same number of occupied ports, has significant scaling problems on account of which VRFs are present or absent.  Then, PE1 may find itself suddenly having the same scaling concerns because a new customer was provisioned with a large or high-churn VRF that was previously not present on the router.  Second, many customer VPNs are so large and have such stringent diversity requirements that they have a presence on nearly every PE router in a provider's network, meaning that one cannot rely heavily on statistical distribution to reduce the percentage of VRFs that must be installed on a specific PE router.  In addition, customers may request the use of BGP multipath for faster failover or better load balancing, which has the net effect of installing more active routes into the table, rather than simply selecting the single best path.  The scaling considerations for enabling BGP multipath are not unique to L3VPN, but they are more pertinent here because SPs are less likely to be willing to enable multipath for standard Internet traffic, while they will do so for L3VPN.  The application as an enterprise network instead of Internet connectivity drives a different set of expectations about the performance of the network, design tradeoffs that must be made to meet the SP's requirements, etc.  In many cases, L3VPNs are replacing old point-to-point networks or L2VPNs using legacy Frame Relay, ATM, or L2TPv3.  Customers often don't want to make major architectural changes to their routing, and therefore expect the SP to do the same things that they were doing between their routers before, including things like multipath.

In addition to such intended behaviour, within many L3VPN networks a balance must be struck between complexity in OSS such as provisioning and inventory systems, and complexity in network deployments.  One such example of this is the assignment of route distinguisher (RD) attributes.  Where it may be possible to assign a single RD per L3VPN instance, and hence achieve some level of route aggregation for multi-homed CE routes on BGP speakers within the solution, this has consequences for convergence in the VPN (due to BGP convergence being relied upon) and can exacerbate the effects of geographic distance between PE and route-reflector, and is therefore undesirable in some circumstances.  In order to avoid this, multiple RDs are required, which in turn requires OSS and inventory support to control the namespace.  As such, often each VRF instance is deployed with a specific RD, which, whilst achieving the desired convergence effect, places load on all BGP control-plane elements of the provider network.
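The trade-off can be made concrete by counting the state that a route-reflector holds for a multi-homed customer prefix under the two RD assignment schemes.  The sketch below (Python; the RD values, prefix, and PE names are invented for the example) shows that with a single shared RD the two PE-originated routes share one VPN-IPv4 NLRI, so best-path selection leaves only one path to be reflected onward, whereas unique per-VRF (per-PE) RDs keep the NLRIs distinct and preserve both paths at the cost of additional state on every BGP speaker that carries the VPN.

   # Illustration of RD assignment versus state held at a route-reflector.
   # Names and numbers are invented for the example.

   def paths_per_nlri(advertisements):
       """Group advertisements by VPN-IPv4 NLRI (RD + prefix).  Routes
       that share an NLRI are subject to best-path selection, so only one
       of them is reflected onward to other PEs."""
       held = {}
       for rd, prefix, pe in advertisements:
           held.setdefault(rd + ":" + prefix, []).append(pe)
       return {nlri: len(pes) for nlri, pes in held.items()}

   # A CE dual-homed to PE1 and PE2 advertises 192.0.2.0/24 into its VRF.
   shared_rd = [("65000:1", "192.0.2.0/24", "PE1"),
                ("65000:1", "192.0.2.0/24", "PE2")]
   per_pe_rd = [("65000:101", "192.0.2.0/24", "PE1"),
                ("65000:102", "192.0.2.0/24", "PE2")]

   print(paths_per_nlri(shared_rd))
   # {'65000:1:192.0.2.0/24': 2}  -> one NLRI; one best path reflected
   print(paths_per_nlri(per_pe_rd))
   # {'65000:101:192.0.2.0/24': 1, '65000:102:192.0.2.0/24': 1}
   #   -> both paths reflected; more state, but remote PEs already hold
   #      the backup path when one PE fails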
Total supportable route scale on a given PE router will be driven by multiple different variables, which have a roughly inverse relationship to one another: the number of VRFs per router, the number of routes per VRF, and the number of neighbors per VRF.  For example, a router can support only a low number of VRFs if each VRF has a large number of routes and/or a large number of neighbors.  Conversely, a router can support a relatively high number of VRFs if each VRF is kept to a much lower number of routes and/or a lower number of neighbors.  This provides a baseline that then must be reduced based on the expected level of event-driven churn, the type of protocol chosen, etc.  In short, this is a difficult problem from a modeling and capacity planning perspective.
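A very rough way to reason about this inverse relationship is to treat the PE control plane as a single budget that each VRF draws against in proportion to its routes, neighbors, and expected churn.  The sketch below (Python) only illustrates the shape of the trade-off; the weights and budget are entirely hypothetical and would need to be calibrated per platform and software release.

   # Toy control-plane budget model for a PE.  Weights and budget are
   # hypothetical; they only illustrate the inverse relationship between
   # VRF count, routes per VRF, and neighbors per VRF.

   ROUTE_WEIGHT = 1.0
   NEIGHBOR_WEIGHT = 50.0
   CHURN_WEIGHT = 200.0          # per expected update/second in steady state
   PE_BUDGET = 1_000_000.0

   def vrf_cost(routes, neighbors, churn_per_sec):
       return (routes * ROUTE_WEIGHT
               + neighbors * NEIGHBOR_WEIGHT
               + churn_per_sec * CHURN_WEIGHT)

   def fits_on_pe(vrfs):
       """True if the combined cost of the candidate VRFs is within budget."""
       return sum(vrf_cost(*v) for v in vrfs) <= PE_BUDGET

   large = [(200_000, 20, 5.0)] * 4      # a few large, busy VRFs
   small = [(500, 2, 0.1)] * 800         # many small, quiet VRFs
   print(fits_on_pe(large))              # True
   print(fits_on_pe(small))              # True
   print(fits_on_pe(large + small))      # False: cannot host both populations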
It is fairly common for the contract or Service Level Agreement between SP and customer to include a maximum limit as to how many routes can be carried in a VRF.  At its most basic, this maximum can be used as a method to estimate the number of VRFs that can be present on a PE given its scaling limitations.  However, there is a wide gulf between a contractual limit of no more than N routes per VRF (with a corresponding configured limit) and the reality that many customers will not carry nearly that many routes.  This leads to the potential for significant stranded capacity.  Therefore the provider needs a way to have different tiers of "maximum routes allowed" so that capacity management can be done in such a way as to enable better loading of PE routers to take this relationship into account (e.g. populating a PE with a combination of high-scale and low-scale VRFs).  The alternative to this method would be to assume a standard maximum number of routes per VRF, and then, similarly to the way that carriers use statistical multiplexing and oversubscription to assume that not all customers will have their pipes full of bandwidth at the same time, make some assumptions about control plane capacity.  This may come in the form of an average that is calculated based on the actual number of routes in each VRF.  This has many challenges.  Among them: should the average be calculated per PE or network-wide?  What happens when there are too many VRFs that exceed the average on a given PE?  How does one add control plane capacity to a "full" router?  This may be a manageable model in a network with a robust and flexible provisioning system, such that high-scale VRFs can be moved between PE routers to balance the load, but each of these moves likely represents an outage for the customer and an opportunity for other errors to creep in, and it is not likely to be attractive due to the operational costs of managing the network.  In other words, it doesn't scale, but for a completely different reason.  Further, this VRF route limit may or may not be a physically enforced value.  Some PEs have an additional configuration knob per VRF that places a hard limit on the number of routes the VRF will accept.  This works well as a last-chance safety valve to protect the PE and the network in the case where a misconfiguration in the VRF causes a sudden and significant increase in the number of routes, but it can create inconsistencies in the VRF's routing table if there is a periodic or intermittent increase in the routes that causes the maximum to be periodically exceeded.  Unlike a BGP maximum prefix limit, which shuts down the BGP neighbor when a threshold is exceeded, there is no direct feedback to the peers that the VRF route limit threshold has been exceeded, and different implementations handle this in different ways in terms of how they drop or buffer routes, and how they resync once the routes are below the threshold again.  It may be appropriate to identify a common way for implementations to handle this limit, perhaps by triggering one or more PE-CE peering sessions to drop, so that this becomes a more useful tool to protect the PE from increases that would cause it to have scaling problems.
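Because the behaviour at the limit is implementation-specific, the following sketch (Python) shows only one hypothetical interpretation, and is not intended to describe any particular vendor's handling: routes beyond the configured per-VRF maximum are silently rejected and counted, with a warning raised at a soft threshold so the provider can react before the hard limit is reached.

   # One possible (hypothetical) handling of a per-VRF maximum-routes
   # limit: warn at a soft threshold, reject routes above the hard limit,
   # and keep a count so the condition is visible to monitoring.

   class VrfRouteTable:
       def __init__(self, max_routes, warn_fraction=0.8):
           self.max_routes = max_routes
           self.warn_at = int(max_routes * warn_fraction)
           self.routes = set()
           self.rejected = 0

       def add(self, prefix):
           if prefix in self.routes:
               return True
           if len(self.routes) >= self.max_routes:
               self.rejected += 1      # note: no feedback to the CE peer,
               return False            # which is exactly the gap noted above
           self.routes.add(prefix)
           if len(self.routes) == self.warn_at:
               print("WARNING: VRF at %d/%d routes"
                     % (self.warn_at, self.max_routes))
           return True

       def withdraw(self, prefix):
           self.routes.discard(prefix)

   vrf = VrfRouteTable(max_routes=1000)
   for i in range(1200):
       vrf.add("10.%d.%d.0/24" % (i // 256, i % 256))
   print(len(vrf.routes), vrf.rejected)   # 1000 200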
5.1. Route-reflection and scaling

Most of this document focuses on scaling at the PE router, but a discussion of route scaling would not be complete without at least a cursory mention of route-reflection [RFC4456].  While using route-reflectors to eliminate the need for a full mesh of PE routers is a common optimization, there are many different deployment models as far as whether dedicated route-reflectors are deployed vs. running an existing PE or P router as a route-reflector, how many are deployed and where, the method for ensuring diversity and redundancy, and even whether a router is used vs. a commodity PC running some sort of routing daemon.  From a scaling perspective, there are several considerations that are unique to the route-reflector design that will be discussed here.

Starting with the route-reflector itself, these devices often experience a worst-case scenario when it comes to storing entries in the RIB, exposure to route-churn, etc.  This is because they are not capable of filtering the routes from VRFs not locally configured on themselves, and they must carry all of the routes for all of the VRFs in the ASN.  This requires significant amounts of CPU and memory to store and manage these updates, and an underpowered route-reflector can quickly cause widespread convergence problems if it is unable to keep up with the load of receiving, processing, and propagating these updates.  Beyond CPU and memory, it may also be necessary to know how the router manages its FIB when running as a route-reflector.  A route-reflector is almost 100% control-plane, but if it tries to install all of the routes that it has in its RIB into the FIB, it may require very high-scale (and therefore costly) forwarding hardware to manage the large FIB.  It may be useful to select a device that is capable of optimizing for this control-plane-only mode and suppressing unnecessary routes from its FIB to reduce the overhead.  This is why some providers choose to use commodity PCs, which are well-suited for high-scale, processor- and memory-intensive control plane work, and can easily and cost-effectively be horizontally scaled.  The main consideration with using a PC instead of a router for route-reflection is that there may be implementation differences that lead to incompatibilities in terms of supported features, and there may be a different model in terms of how high-scale applications are managed, or even what bugs are exposed at maximum scale, all of which will require significant additional testing.

Route-reflector placement is another important consideration.  Because route-reflectors are control-plane devices, and the scale requirements for them are high enough that they can be expensive, the tendency might be to deploy two large, geographically-diverse, and horizontally scaled sets of them in order to provide an acceptable amount of diversity while deploying the fewest possible devices.
However, this leads to potential problems, with the geographic distance between the PE and the route-reflector producing geographic "routing artifacts".  (Geographic routing artifacts in this case refers to the phenomenon where the PE and the route-reflector are significantly distant from one another in the network, and the route-reflector chooses one or more best paths based on its view of the IGP, and then reflects those to its neighbors, even though there may be a better path at a given PE based on its location in the network and its view of the IGP.  Also, propagation delay and the latency it induces for updates and convergence may be a factor.)  Use of a small number of route-reflectors network-wide can also result in scaling problems based on the number of BGP sessions a given route-reflector must maintain.  Both of these items point to a larger deployment of smaller, more geographically diverse route-reflectors throughout the network, so that a given route-reflector is maintaining fewer BGP sessions with PE routers, has an IGP view of the network that is closer to that of the PEs it peers with, and can rapidly propagate local updates to the surrounding PEs.

The number of route-reflectors peering with each PE is a scaling consideration as well.  While two discrete route-reflector BGP sessions are the minimum needed to ensure proper redundancy, adding additional route-reflectors requires each PE to carry the additional state of those sessions, adding significant overhead with questionable value.

Related to route-reflector placement and the number of PE to route-reflector peering sessions is the use of cluster-IDs within a set of route-reflectors.  Cluster-IDs can be effectively used to reduce the number of duplicate route updates propagated between route-reflectors, thus reducing some of the same state and churn impact that is so critical in high-scale implementations.  However, their use can have unintended side effects.  In order to prevent inconsistency in the routing table, a PE must peer with all of the route-reflectors in a given cluster.  As a result, depending on how route-reflectors are spread out throughout the network and clustered together, this may create the need for a PE to either peer with multiple clusters, or to peer with one or more route-reflectors that are not optimal in terms of geographic placement in relation to the PE.  For example, if each cluster has two route-reflectors for redundancy, and there are three regional clusters (East, Central, West), PEs that sit in the overlap area between two cluster "regions" may have to peer with one or more route-reflectors that are farther away, or else peer with four route-reflectors (both clusters) in order to include the two closest to them.

5.2. Best Common Practices

A number of things can be done to improve the general route scaling.  Most BGP sessions can be configured with a similar set of protections as they would be if they were global Internet eBGP sessions, such as maximum prefix limits, inbound and outbound prefix filtering, etc.
Prefix filtering is less common within VPNs because the session is treated more like iBGP, where filtering is typically not recommended (***reference?***), or, as noted above, the VRF is viewed as part of the customer's network, and it is therefore not the SP's business (or problem) to do filtering in an application where doing so can only break that customer's network.  What is often more important in the case of individual VRFs is to configure an acceptable maximum number of routes that the VRF is permitted to carry.  This allows the SP to control their exposure to sudden increases in the memory footprint of the routing table, especially if a misconfiguration on the CE side leads to significant amounts of route leakage, such as when a significant portion of the global Internet routing table is suddenly leaked into the VRF.  However, it can also be used to enforce the assumptions about the number of routes per VRF that the SP has used to determine the other maximum scaling values, such as the number of VRFs per router, the number of sessions per router, etc.

As noted above, the number of VRFs per router, number of routes per VRF, and number of sessions per router and per VRF are all inter-related values in the way that they contribute to overall router scale.  The more of this information that is known in advance based on the design of the customer's network, the more it can be used as input to the provisioning system to determine the best available PE router on which to terminate the connections for consistent loading.  Since these values are usually estimates, and considerations like diverse router terminations may drive a specific choice, this is not by any means fool-proof, but it is a valuable optimization to improve the density of customers on a given router and maximize the return on investment for the capacity deployed.  It is worth noting, however, that many SP VPN networks have a different geographic spread than do their Internet service counterparts, with more POPs containing fewer routers, as it is important to provide more local handoffs to customers.  This may limit the SP's flexibility in terms of homing locations and router choices, and thus may be of limited value when controlling scale impacts on individual PE routers.

*** Authors' note: Should we discuss incremental SPF, next-hop tracking, SPF timer tuning (by protocol and AF), prefix prioritization, etc.?  All of these are generally thought of as convergence optimizations, and may be applicable here as a way to both reduce the CPU load and ensure that behavior is more deterministic, but I'm not sure how much depth we want to get into here, especially since some are vendor-specific or FIB-specific optimizations... ***

5.2.1. Topology-related optimizations

As has been discussed above, the topology of a given VPN and its placement on the available PE routers can be a significant contributing factor to the impacts of that VPN on the scaling limits of a given PE.  For example, a hub and spoke arrangement allows for some amount of aggregation and route summarization to be used, but there are limitations to its effectiveness at minimizing routing table growth, since this is typically implemented by the end customer and is dependent on how hierarchical their topology and IP addressing plan is.
While there are plenty of other good reasons to use a hub and spoke design, including security (traffic separation) between spoke sites, etc., generally a customer does not have much incentive to expend the time and effort to maintain a proper hierarchy or deal with the added complexity of a hub and spoke design if the only benefit is to improve route scaling.  A possible solution for some full-mesh topologies is to use Virtual Hub-and-Spoke in BGP/MPLS VPNs [RFC7024].  From the abstract:

   "With BGP/MPLS VPNs, any-to-any connectivity among sites of a given
   Virtual Private Network would require each Provider Edge router that
   has one or more of these sites connected to it to hold all the
   routes of that Virtual Private Network.  The approach described in
   this document allows to reduce the number of Provider Edge routers
   that have to maintain all these routes by requiring only a subset of
   these routers to maintain all these routes."

The value of this approach is that it is much less dependent on the individual customer to implement a hierarchy in order to conserve routing table entries.  The potential downside to this approach is that it introduces additional provisioning and troubleshooting complexity due to the way that routes are or are not imported, the use of default/summary routes, etc.  This approach also potentially exacerbates the problem discussed above where PEs are inconsistently loaded (in terms of total number of routes) from one PE to the next, as well as the provisioning difficulty that comes from a desire to find and use as much spare control plane capacity as possible without overloading a given PE.

5.3. Common problems at scale limits

As mentioned above, systems that are carrying a large number of VRFs and/or VRFs with large numbers of routes tend to be more sensitive during events due to the increased amount of periodic and event-driven processing that must be done to complete a walk of the routing table to process updates.  While optimization techniques may reduce the overhead of (re)programming the FIB after an update, there are fewer tricks to be employed in managing the RIB, and they are often vendor-specific, which leads to a lowest-common-denominator threshold in multivendor environments.

In addition to CPU constraints, it is common for route memory footprint to be a consideration if there are large numbers of VRFs with large numbers of routes.  Similarly to the way that high scale reduces the cushion of available CPU resources to absorb temporary peaks, as memory use reaches its high threshold, allocation of the remaining memory becomes less efficient and more fragmented, such that memory allocations may begin to fail well before the available memory is actually exhausted.  Depending on the specific implementation, the "largest free" may be more important than the "total free", and it may be difficult or impossible to coalesce the free memory to reduce fragmentation to an acceptable level.  As with other scaling problems, a failure of this type has the nasty habit of causing a cascade of problems.  Depending on how robust the system is at recovering from memory allocation failures, it may trigger restarts of critical routing processes or even the entire system.  These restarts may or may not be graceful and hitless, and even if they are locally a fairly low impact, they may trigger events on other routers due to the ripple effect of the network event itself.  It is also worth noting that there are hardware and software limits to how much memory a given system can use - if the router in question does not use a 64-bit OS, then it is unable to address more than 4GB of RAM, for example.  This may make an otherwise robust system incapable of scaling to the necessary level, and make memory usage an even more significant consideration.
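Where the platform exposes both figures, tracking the ratio between the largest contiguous free block and the total free memory provides an early warning for this failure mode.  The sketch below (Python) assumes the two values have already been collected by whatever platform-specific means are available (SNMP, CLI scraping, or streaming telemetry); the thresholds are illustrative only.

   # Sketch of a fragmentation early-warning check.  Thresholds are
   # illustrative; how total_free and largest_free are obtained is
   # platform-specific and outside the scope of the sketch.

   TOTAL_FREE_FLOOR = 256 * 1024 * 1024     # bytes
   FRAGMENTATION_RATIO_FLOOR = 0.25         # largest_free / total_free

   def memory_alerts(total_free, largest_free):
       alerts = []
       if total_free < TOTAL_FREE_FLOOR:
           alerts.append("low total free memory")
       if total_free and largest_free / total_free < FRAGMENTATION_RATIO_FLOOR:
           # Plenty of memory free in aggregate, but allocations of large
           # contiguous blocks may already be failing.
           alerts.append("free memory heavily fragmented")
       return alerts

   print(memory_alerts(total_free=2_000_000_000, largest_free=100_000_000))
   # ['free memory heavily fragmented']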
As with 950 other scaling problems, a failure of this type has the nasty habit of 951 causing a cascade of problems. Depending on how robust the system is 952 at recovering from memory allocation failures, it may trigger 953 restarts of critical routing processes or even the entire system. 954 These may or may not be graceful and hitless, and even if they are 955 locally fairly low-impact, they may trigger events on other 956 routers due to the ripple effect of the network event itself. It is 957 also worth noting that there are hardware and software limits to how 958 much memory a given system can use; if the router in question does 959 not use a 64-bit OS, for example, it cannot address more than 4 GB of 960 RAM. This may make an otherwise robust system incapable 961 of scaling to the necessary level, and make memory usage an even more 962 significant consideration.

964 6. Known issues and gaps

966 6.1. PE-CE routing protocols

968 While support for route flap dampening in BGP as a PE-CE routing 969 protocol is equivalent to its support in non-VPN applications, the 970 addition of IGP routing protocols such as OSPF creates a new problem, 971 in that there is not really a way to manage route dampening, either 972 by configuring it within the context of the IGP itself, or by 973 configuring it at the translation point where the IGP's routing 974 information is moved into the MP-BGP control plane infrastructure to 975 be exchanged between participating PEs across the VPN network. This 976 means that where IGPs are used, which is often a more 977 CPU-intensive and performance-sensitive arrangement to start with, the route flaps 978 associated with an unstable network will make a bad problem even 979 worse. It may be advisable for the IETF to document updates to 980 standards governing the use of IGPs as PE-CE routing protocols to 981 explicitly define the use of RFD in this application.

983 There are also no clear guidelines, based on testing and real-world 984 experience, for recommended timer values or appropriate use cases for 985 an IGP versus BGP as a PE-CE routing protocol. In other words, rather 986 than enterprises simply defaulting to whatever IGP is already in use 987 or the one they are most comfortable with, there may be certain cases where 988 use of an IGP is recommended, and others where it is not. Guidance in 989 this area may be very useful to both the SPs supporting these 990 networks and the engineers designing the corporate networks that make 991 use of them.

993 6.2. Multicast

995 Issues in multicast VPN scale?

997 6.3. Network Events

999 Guidance is needed on interface event dampening values (based on research and testing), 1000 as well as correlation tools to help determine root cause in a cascade failure.

1002 6.4. General Route Scale

1004 Route flap dampening may potentially be a best practice, but it has a 1005 number of shortcomings. First, there is no systematic way for end 1006 customers to view and clear dampening without some sort of 1007 advanced-functionality looking glass that allows them to view only the routes 1008 in their authorized VRFs. Also, allowing customers to make 1009 unattended clears of dampened routes may defeat the purpose of having 1010 dampening enabled at all, since customers may clear the dampening 1011 without addressing the underlying cause of the problem. In addition, 1012 as noted in [I-D.ietf-idr-rfd-usable] and 1013 [I-D.shishio-grow-isp-rfd-implement-survey], Route Flap Dampening is 1014 not widely used even within the Global Internet routing table, and 1015 its default parameter values likely need to be adjusted. Due to the differences in 1016 the characteristics of VPN routes compared with the global routing 1017 table, additional study and recommendations as to appropriate RFD 1018 values within a VPN are likely required. Additionally, it is not 1019 possible to configure RFD on IGPs, either natively within the PE-CE 1020 routing protocol or upstream where the learned routes are carried in 1021 MP-BGP. This means that in some cases, there is no way to insulate 1022 the SP network from the adverse impacts of rapid route churn.
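To make the parameter question concrete, the sketch below applies the usual exponentially-decaying-penalty model of RFD. The figures used (penalty per flap, suppress and reuse limits, half-life) are common defaults seen in deployed BGP implementations rather than a recommendation for VPN routes, and the whole fragment is illustrative only.

   import math

   # Illustrative RFD model: each flap adds a fixed penalty, and the
   # penalty decays exponentially with the configured half-life.  The
   # values below are typical defaults, not VPN-specific guidance.
   PENALTY_PER_FLAP = 1000
   SUPPRESS_LIMIT = 2000
   REUSE_LIMIT = 750
   HALF_LIFE_SECONDS = 900

   def penalty_at(flap_times, now):
       """Total decayed penalty at time 'now' for flaps at 'flap_times'."""
       return sum(PENALTY_PER_FLAP * 0.5 ** ((now - t) / HALF_LIFE_SECONDS)
                  for t in flap_times)

   def seconds_until_reuse(penalty):
       """Time for a suppressed route's penalty to decay to the reuse limit."""
       return HALF_LIFE_SECONDS * math.log2(penalty / REUSE_LIMIT)

   p = penalty_at(flap_times=[0, 60, 120], now=120)   # three quick flaps
   print(p > SUPPRESS_LIMIT, round(seconds_until_reuse(p) / 60))
   # -> True 29: three flaps in two minutes exceed the suppress limit,
   #    and the route stays dampened for roughly half an hour.

Whether such time constants are appropriate for VPN routes, given the different characteristics noted above, is exactly the kind of question that additional study would need to answer.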
1024 6.5. Modeling and Capacity planning

1026 There is a significant lack of multidimensional scale guidance and 1027 modeling for capacity planning and troubleshooting large-scale VPN 1028 deployments. This has a number of contributing factors. First, 1029 behavior at scale becomes increasingly non-deterministic as more 1030 variables interact simultaneously, so this is classically 1031 a difficult problem to model. Even worse, it's difficult to account 1032 in a model for latent design/implementation flaws: things that work 1033 well enough at moderate scale, but are not efficient enough for high 1034 scale, or suffer some sort of secondary impact due to dependencies, 1035 race conditions, etc. These problems are often found only through 1036 extensive testing, or they escape into production. Second, it is 1037 difficult to characterize an "average" implementation in such a way 1038 that it can be tested to failure in multiple permutations to provide 1039 a reasonably accurate multidimensional model. Consequently, the 1040 guidance available normally takes the form of multiple uni-dimensional 1041 scale thresholds plus some very conservative multi-dimensional 1042 thresholds. These conservative recommendations avoid 1043 risk to both the vendor and the implementer by catering to the lowest 1044 common denominator, but they have the adverse effect of leaving a lot 1045 of capacity sitting idle. Some vendors make an effort to 1046 characterize their customers' large-scale implementations such that 1047 they can better replicate real-world conditions, but gathering this 1048 information and devising ways to replicate the behavior in a lab is 1049 problematic and time-consuming.

1051 This leads to a follow-on issue, which is that there is a lack of 1052 instrumentation on critical scaling vectors. Some routers have very 1053 limited ability to provide useful data about critical scaling 1054 vectors (routing updates per second, changes in multicast state, 1055 sources of internal bottlenecks, etc.), either for use in a model or 1056 for use as additional capacity monitoring thresholds. While most 1057 routers can provide information about CPU usage and memory 1058 thresholds, and even which processes are consuming large amounts of 1059 resources, it often takes specially instrumented versions of the OS to 1060 provide a window into what is actually causing some sort of failure 1061 at scale. Because these are not routinely monitored, 1062 the provider may be blind to one or more early warning signs that the 1063 router is nearing its scaling limits and cannot take action to 1064 prevent exceeding those limits before it causes customer impacts.

1066 Additionally, even if this information is available, the provisioning 1067 systems used by most providers do not currently have the intelligence 1068 or visibility to make a decision regarding which PE to provision new 1069 customers on in order to evenly load the available PE routers. The 1070 provisioning system is often aware of the available physical or 1071 logical port capacity on a given router or site, and uses this as a 1072 key input to its port choice for newly provisioned customers. 1073 However, these additional capacity and scale vectors are based on 1074 real-time statistics from the router (CPU, memory load, etc.) and 1075 there is no interaction or feedback loop between the provisioning 1076 system and these types of real-time router scale stats. As a result, 1077 manual intervention is often required to either remove busy routers 1078 from the available capacity pool, move spare port capacity from a 1079 busy router to a full one, or even to reprovision customers to move 1080 them from one device to another to rebalance the load on each router.
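A minimal sketch of the missing feedback loop might look like the following. The counter names, thresholds, and polling interval are hypothetical placeholders, since what can actually be exported, and at what granularity, varies by platform.

   # Hypothetical feedback loop: derive simple scaling-vector rates from
   # successive counter samples and decide whether a PE should remain in
   # the provisioning candidate pool.  The names and thresholds are
   # invented for illustration and do not match any vendor's counters.
   THRESHOLDS = {"cpu_pct": 80, "mem_pct": 85, "route_updates_per_sec": 500}

   def rates(prev, curr, interval_sec):
       """Convert two counter samples into per-second rates."""
       return {"route_updates_per_sec":
               (curr["route_updates"] - prev["route_updates"]) / interval_sec}

   def eligible_for_new_service(curr, prev, interval_sec=300):
       """Return (eligible, reasons); not eligible if any early-warning
       threshold is crossed, so the provisioning system can skip this PE."""
       observed = {"cpu_pct": curr["cpu_pct"], "mem_pct": curr["mem_pct"]}
       observed.update(rates(prev, curr, interval_sec))
       reasons = [k for k, limit in THRESHOLDS.items() if observed[k] >= limit]
       return (not reasons, reasons)

   prev = {"cpu_pct": 40, "mem_pct": 70, "route_updates": 1200000}
   curr = {"cpu_pct": 65, "mem_pct": 88, "route_updates": 1230000}
   print(eligible_for_new_service(curr, prev))
   # -> (False, ['mem_pct']): memory crossed its early-warning threshold,
   #    so this PE would be skipped for new service until it recovers.

Even a crude check of this sort would let busy routers age out of the candidate pool automatically rather than through the manual intervention described above.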
1082 6.6. Performance issues

1084 In many ways, it's difficult to define a hard-and-fast scale limit, 1085 because each provider and customer has a differing view on what is 1086 an acceptable performance envelope both in steady state and during 1087 recovery from outages, whether planned or unplanned. In the most 1088 extreme sorts of network events, such as a heavily loaded PE router 1089 undergoing a cold restart, the scale considerations may take 1090 boot and convergence times that the involved 1091 parties would otherwise consider acceptable and extend them to the point where they 1092 significantly prolong the pain to which an end customer is 1093 exposed. They also have the added problem of making it difficult to 1094 predict the duration of an outage, because individual customer VRFs 1095 may be affected for differing amounts of time based on all of the 1096 factors that contribute to scaling and affect convergence. For 1097 example, if a customer has one critical route that happens to be 1098 among the last to converge, they perceive the outage to be ongoing 1099 until that last route converges, even if the entire rest of their 1100 network has been functional for a significant amount of time prior to 1101 that point.

1103 When dealing with scheduled outages, customers obviously prefer that 1104 they are never impacted. Since this is not really possible, they 1105 expect the provider to give them very clear and accurate guidance on 1106 what the impacts will be, when they will occur, and for what 1107 duration, so that they can set expectations for their customers. 1108 VPNs are often carrying mission-critical services and data, so any 1109 downtime is bad downtime. While a customer may be understanding of a 1110 scheduled maintenance with a 15-30 minute traffic interruption while 1111 a router reloads, they may be less so if the outage actually 1112 stretches for 60-90 minutes while the router runs at 100% CPU trying 1113 to deal with this worst-case sort of load or suffers intermittent 1114 cascade problems while any remaining cushion is used up dealing with 1115 the results of the event. These impacts may be largely invisible to 1116 the provider unless they have probes within each VRF or other means 1117 to verify that traffic is no longer impacted for a given customer. 1118 It's often difficult or impossible for a provider to tell the 1119 difference between a router that is fully converged but running near 1120 100% CPU after a reload and one that is thrashing, causing delays 1121 in convergence and customer traffic impacts while it runs at 100% CPU 1122 after a reload. Even worse, a scheduled or known outage on one 1123 router may trigger unplanned outages on other high-CPU devices. Even 1124 in unplanned outages, communication regarding impacts and duration is 1125 key, and these sorts of scale issues make it difficult to predict the 1126 impacts.
1128 6.7. High Availability and Network Resiliency

1130 In many cases, L3VPN services are carrying significant amounts of 1131 business-critical data. Customers and carriers design their networks 1132 to be robust enough to absorb single and sometimes even dual faults 1133 with little or no impact to the network as a whole. However, 1134 expectations as to the frequency and duration of outages due to 1135 either scheduled or unscheduled events continue to become more 1136 stringent. This is leading more providers to adopt features such as 1137 Non-Stop Forwarding and Non-Stop Routing, as well as In-Service 1138 Software Upgrades, to improve the chances that outages will be 1139 transparent to the underlying customers, networks, and applications 1140 using the network elements. As these become more common within the 1141 L3VPN space, they must be considered when evaluating PE scale. 1142 Often, the machinery necessary to make these reliability enhancements 1143 work requires duplication and sharing of state between multiple 1144 elements. At its most basic level, this state sharing takes more 1145 resources and more time the more state there is to be shared, so 1146 increases in the different scaling vectors discussed in this document 1147 will cause proportional increases in the complexity and resource 1148 requirements necessary for the combined feature set. In more complex 1149 scenarios and implementations, it may contribute to the complexity 1150 associated with capacity planning, and may make the system's response even 1151 more non-deterministic as scale increases.

1153 6.8. New methods of horizontal scaling

1155 When this document was being written, there was considerable 1156 discussion around the area of Software Defined Networking and 1157 OpenFlow [ONF]. These are technologies which provide a way to offload 1158 some of the more complex control plane elements to a more central 1159 controller device, which then programs the routing elements for 1160 correct forwarding plane operation. This is interesting for solving a 1161 problem such as the one described in this document, because it effectively 1162 decouples the growth of the control plane from the growth of the 1163 forwarding plane. In other words, it would be possible to continue 1164 allocating more and more CPU resources to the high-overhead control 1165 plane elements discussed above, and keep that growth almost totally 1166 independent of the physical forwarding plane resources required. 1167 While in some ways this would simply move the need for horizontal 1168 scaling elsewhere, rather than actually reducing the scaling 1169 considerations, the benefit is that an SP could use commodity compute 1170 hardware, which would potentially be lower cost and more easily 1171 scaled than the average PE router's CPU. The application of 1172 SDN/OpenFlow or any other interface to the routing system that offloads 1173 some control plane elements for improved BGP VPN scale is beyond the 1174 scope of this document, but may be a valid use case for future 1175 discussion within the IETF.

1177 7. To-Do list

1179 RFC EDITOR: Please remove this section before publication.

1181 Still not discussed in the document:

1183 Inter-AS VPN NNI scaling considerations (separate discussions on 10A, 1184 10B/hybrid, 10C?)
- include discussion on number of VRFs per NNI, 1185 routes per VRF, NNIs per router

1187 Label Exhaustion

1189 BGP Fast External Fallover

1191 additional scaling considerations if using L2TPv3 or RSVP-TE 1192 tunneling for PE-PE transport

1194 Future scaling considerations (MPLS-TP at the edge, interworking with 1195 L2 technologies, significant increases in density, etc.)

1197 8. Acknowledgements

1199 The idea for this draft came from a presentation made by Ning So 1200 during the CDNI working group meeting at IETF 81 in Quebec City where 1201 some of these same scaling considerations were discussed. Thanks also 1202 to Yakov Rekhter, Luay Jalil, Jeff Loughridge, Stephane Litkowski, 1203 Rajiv Asati, and Daniel Cohn for their reviews and comments.

1205 9. IANA Considerations

1207 This draft makes no request to IANA.

1209 10. Security Considerations

1211 Security considerations for IP VPNs are covered in the protocol 1212 definitions. This draft does not introduce any new security 1213 considerations, but it is worth noting that attack vectors that 1214 result in minor impacts in a low-scale environment may make the 1215 problems observed in a high-scale or resource-constrained environment 1216 worse, thereby magnifying the potential for impacts.

1218 11. Informative References

1220 [EIGRP] Wikipedia.org, "Enhanced Interior Gateway Routing 1221 Protocol".

1224 [I-D.ietf-idr-rfd-usable] 1225 Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O. 1226 Maennel, "Making Route Flap Damping Usable", 1227 draft-ietf-idr-rfd-usable-04 (work in progress), October 2013.

1229 [I-D.shishio-grow-isp-rfd-implement-survey] 1230 Tsuchiya, S., Kawamura, S., Bush, R., and C. Pelsser, 1231 "Route Flap Damping Deployment Status Survey", 1232 draft-shishio-grow-isp-rfd-implement-survey-05 (work in 1233 progress), June 2012.

1235 [IEEE802.1] 1236 IEEE, "Connectivity Fault Management".

1240 [IEEE802.3] 1241 IEEE, "Carrier Sense Multiple Access with Collision 1242 Detection (CSMA/CD) Access Method and Physical Layer 1243 Specifications".

1246 [ISO13239] 1247 ISO, "High-level Data Link Control protocol".

1251 [ONF] ONF, "The Open Networking Foundation".

1254 [RFC1661] Simpson, W., "The Point-to-Point Protocol (PPP)", STD 51, 1255 RFC 1661, July 1994.

1257 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 1258 Protocol 4 (BGP-4)", RFC 4271, January 2006.

1260 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1261 Networks (VPNs)", RFC 4364, February 2006.

1263 [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route 1264 Reflection: An Alternative to Full Mesh Internal BGP 1265 (IBGP)", RFC 4456, April 2006.

1267 [RFC4577] Rosen, E., Psenak, P., and P. Pillay-Esnault, "OSPF as the 1268 Provider/Customer Edge Protocol for BGP/MPLS IP Virtual 1269 Private Networks (VPNs)", RFC 4577, June 2006.

1271 [RFC4659] De Clercq, J., Ooms, D., Carugi, M., and F. Le Faucheur, 1272 "BGP-MPLS IP Virtual Private Network (VPN) Extension for 1273 IPv6 VPN", RFC 4659, September 2006.

1275 [RFC4781] Rekhter, Y. and R. Aggarwal, "Graceful Restart Mechanism 1276 for BGP with MPLS", RFC 4781, January 2007.

1278 [RFC4984] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB 1279 Workshop on Routing and Addressing", RFC 4984, September 1280 2007.

1282 [RFC5838] Lindem, A., Mirtorabi, S., Roy, A., Barnes, M., and R. 1283 Aggarwal, "Support of Address Families in OSPFv3", RFC 1284 5838, April 2010.

1286 [RFC5880] Katz, D. and D.
Ward, "Bidirectional Forwarding Detection 1287 (BFD)", RFC 5880, June 2010. 1289 [RFC6037] Rosen, E., Cai, Y., and IJ. Wijnands, "Cisco Systems' 1290 Solution for Multicast in BGP/MPLS IP VPNs", RFC 6037, 1291 October 2010. 1293 [RFC6513] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP 1294 VPNs", RFC 6513, February 2012. 1296 [RFC6565] Pillay-Esnault, P., Moyer, P., Doyle, J., Ertekin, E., and 1297 M. Lundberg, "OSPFv3 as a Provider Edge to Customer Edge 1298 (PE-CE) Routing Protocol", RFC 6565, June 2012. 1300 [RFC7024] Jeng, H., Uttaro, J., Jalil, L., Decraene, B., Rekhter, 1301 Y., and R. Aggarwal, "Virtual Hub-and-Spoke in BGP/MPLS 1302 VPNs", RFC 7024, October 2013. 1304 [Y.1731] ITU-T, "OAM functions and mechanisms for Ethernet based 1305 networks", . 1307 Authors' Addresses 1309 Wesley George 1310 Time Warner Cable 1311 13820 Sunrise Valley Drive 1312 Herndon, VA 20171 1313 US 1315 Phone: +1 703-561-2540 1316 Email: wesley.george@twcable.com 1318 Rob Shakir 1319 BT 1320 London 1321 UK 1323 Phone: + 1324 Email: rob.shakir@bt.com