idnits 2.17.1 

draft-lapukhov-bgp-routing-large-dc-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 14, 2012) is 4297 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'Servers' is mentioned on line 148, but not defined

  -- Obsolete informational reference (is this intentional?): RFC 2385
     (Obsoleted by RFC 5925)

  == Outdated reference: A later version (-08) exists of
     draft-ietf-grow-diverse-bgp-path-dist-07

  == Outdated reference: A later version (-01) exists of
     draft-mitchell-idr-as-private-reservation-00


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	IDR                                                          P. Lapukhov
3	Internet-Draft                                           Microsoft Corp.
4	Intended status: Informational                                 A. Premji
5	Expires: January 15, 2013                                Arista Networks
6	                                                           July 14, 2012

8	           Using BGP for routing in large-scale data centers
9	                 draft-lapukhov-bgp-routing-large-dc-01

11	Abstract

13	   Some service providers build and operate data centers that support
14	   over 100,000 servers.  In this document, such data centers are
15	   referred to as "large-scale" data centers to differentiate them
16	   from the more common smaller infrastructures.  Data centers of this
17	   scale have a unique set of network requirements, with emphasis on
18	   operational simplicity and network stability.

20	   This document attempts to summarize the authors' experiences in
21	   designing and supporting large data centers, using BGP as the only
22	   control-plane protocol.  The intent here is to describe a proven and
23	   stable routing design that could be leveraged by others in the
24	   industry.

26	Status of this Memo

28	   This Internet-Draft is submitted in full conformance with the
29	   provisions of BCP 78 and BCP 79.

31	   Internet-Drafts are working documents of the Internet Engineering
32	   Task Force (IETF).  Note that other groups may also distribute
33	   working documents as Internet-Drafts.  The list of current Internet-
34	   Drafts is at http://datatracker.ietf.org/drafts/current/.

36	   Internet-Drafts are draft documents valid for a maximum of six months
37	   and may be updated, replaced, or obsoleted by other documents at any
38	   time.  It is inappropriate to use Internet-Drafts as reference
39	   material or to cite them other than as "work in progress."

41	   This Internet-Draft will expire on January 15, 2013.

43	Copyright Notice

45	   Copyright (c) 2012 IETF Trust and the persons identified as the
46	   document authors.  All rights reserved.

48	   This document is subject to BCP 78 and the IETF Trust's Legal
49	   Provisions Relating to IETF Documents
50	   (http://trustee.ietf.org/license-info) in effect on the date of
51	   publication of this document.  Please review these documents
52	   carefully, as they describe your rights and restrictions with respect
53	   to this document.  Code Components extracted from this document must
54	   include Simplified BSD License text as described in Section 4.e of
55	   the Trust Legal Provisions and are provided without warranty as
56	   described in the Simplified BSD License.

58	Table of Contents

60	   1.  Introduction
61	   2.  Traditional data center designs
62	     2.1.  Layer 2 Designs
63	     2.2.  Fully routed network designs
64	   3.  Document structure
65	   4.  Network design requirements
66	     4.1.  Traffic patterns
67	     4.2.  CAPEX minimization
68	     4.3.  OPEX minimization
69	     4.4.  Traffic Engineering
70	   5.  Requirement List
71	   6.  Network topology
72	     6.1.  Clos topology overview
73	     6.2.  Clos topology properties
74	     6.3.  Scaling Clos topology
75	   7.  Routing design
76	     7.1.  Choosing the routing protocol
77	     7.2.  BGP configuration for Clos topology
78	       7.2.1.  BGP Autonomous System numbering layout
79	       7.2.2.  Non-unique private BGP ASN's
80	       7.2.3.  Prefix advertisement
81	       7.2.4.  External connectivity
82	     7.3.  ECMP Considerations
83	       7.3.1.  Basic ECMP
84	       7.3.2.  BGP ECMP over multiple ASN
85	     7.4.  BGP convergence properties
86	       7.4.1.  Convergence timing
87	       7.4.2.  Failure impact scope
88	       7.4.3.  Third-party route injection
89	   8.  Security Considerations
90	   9.  IANA Considerations
91	   10. Acknowledgements
92	   11. Informative References
93	   Authors' Addresses

95	1.  Introduction

97	   This document presents a practical routing design that can be used in
98	   large-scale data centers.  Such data centers, also known as hyper-
99	   scale or warehouse-scale data centers, have the unique attribute of
100	   supporting over 100,000 end hosts.  In order to support networks of
101	   such scale, operators are revisiting networking designs and platforms
102	   to address this need.  Contrary to the more traditional data center
103	   designs, the approach presented in this document does not have any
104	   dependency on building a large Layer-2 domain and instead relies on
105	   routing at every layer in the network.  Implementing a pure Layer-3
106	   design using BGP further ensures broad vendor support and almost
107	   guarantees interoperability between vendors given that BGP is one of
108	   the most widely deployed protocols on the Internet.

110	2.  Traditional data center designs

112	   This section provides an overview of two types of traditional data
113	   center designs - Layer-2 and fully routed Layer-3 topologies.

115	2.1.  Layer 2 Designs

117	   In the networking industry, a common design choice for data centers
118	   is to use a mix of Ethernet-based Layer 2 technologies.  Network
119	   topologies typically look like a tree with redundant uplinks and
120	   three levels of hierarchy commonly named Core, Aggregation and
121	   Access layers (see Figure 1).  To accommodate bandwidth demands,
122	   every next level has higher port density and bandwidth capacity,
123	   moving upwards in the topology.  To keep terminology uniform, in this
124	   document, these topology layers will be referred to as "tiers", e.g.
125	   Tier 1, Tier 2 and Tier 3 instead of Core, Aggregation or Access
126	   layers.

128	                       +------+  +------+
129	                       |      |  |      |
130	                       |      |--|      |           Tier1
131	                       |      |  |      |
132	                       +------+  +------+
133	                         |  |      |  |
134	               +---------+  |      |  +----------+
135	               | +-------+--+------+--+-------+  |
136	               | |       |  |      |  |       |  |
137	             +----+     +----+    +----+     +----+
138	             |    |     |    |    |    |     |    |
139	             |    |-----|    |    |    |-----|    | Tier2
140	             |    |     |    |    |    |     |    |
141	             +----+     +----+    +----+     +----+
142	                |         |          |         |
143	                |         |          |         |
144	                | +-----+ |          | +-----+ |
145	                +-|     |-+          +-|     |-+    Tier3
146	                  +-----+              +-----+
147	                   | | |                | | |
148	                 [Servers]            [Servers]

150	               Figure 1: Typical Data Center network layout

152	   IP routing is normally used only at the upper layers in the topology,
153	   e.g.  Tier 1 or Tier 2.  Some of the reasons for introducing such
154	   large (sometimes called stretched) layer-2 domains are:

156	   o  Supporting legacy applications that may require direct Layer 2
157	      adjacency or use non-IP protocols
158	   o  Seamless mobility for virtual machines, to allow the preservation
159	      of IP addresses when a virtual machine moves across physical hosts
160	   o  Simplified IP addressing - fewer IP subnets are required for the
161	      data center
162	   o  Application load-balancing may require direct Layer 2 reachability
163	      to perform certain functions such as Layer 2 Direct Server Return
164	      (DSR)

166	2.2.  Fully routed network designs

168	   Network designs that leverage IP routing down to the access layer
169	   (Tier 3) of the network have gained popularity as well.  The main
170	   benefit of such designs is improved network stability and
171	   scalability, as a result of confining L2 broadcast domains.  A common
172	   choice of routing protocol for data center designs would be an IGP,
173	   such as OSPF or IS-IS.  As data centers grow in scale, and server
174	   count exceeds tens of thousands, such fully routed designs become
175	   more attractive.

177	   Although BGP is the de-facto standard protocol for routing on the
178	   Internet, having wide support from both the vendor and service
179	   provider communities, it is not generally deployed in data centers
180	   for a number of reasons:

182	   o  BGP is perceived as a "WAN-only" protocol and is not often
183	      considered for enterprise or data center applications.
184	   o  BGP is believed to have a "much slower" routing convergence than
185	      traditional IGPs.
186	   o  BGP deployment within an Autonomous System (iBGP mesh) is assumed
187	      to have a dependency on the presence of an IGP, which assists with
188	      recursive next-hop resolution.
189	   o  BGP is perceived to require significant configuration overhead and
190	      does not support any form of neighbor auto-discovery.

192	   In this document we demonstrate a practical approach for using BGP as
193	   the single routing protocol for data center networks.

195	3.  Document structure

197	   The remainder of this document is organized as follows.  First, the
198	   design requirements for large-scale data centers are presented.
199	   Next, the document gives an overview of Clos network topology and its
200	   properties.  After that, the reasons for selecting BGP as the single
201	   routing protocol are presented.  Finally, the document discusses the
202	   design in more detail and covers specific BGP policy features.

204	4.  Network design requirements

206	   This section describes and summarizes the network design requirements
207	   for a large-scale data center.

209	4.1.  Traffic patterns

211	   The primary requirement when building an interconnection network for
212	   a large number of servers is to accommodate application bandwidth and
213	   latency requirements.  Until recently it was quite common to see
214	   traffic flows mostly entering and leaving the data center (also known
215	   as north-south traffic).  There were no intense, highly meshed flows or
216	   traffic patterns between the machines within the same tier.  As a
217	   result, traditional "tree" topologies were sufficient to accommodate
218	   such flows, even with high oversubscription ratios in network
219	   equipment.  If more bandwidth was required, it was added by "scaling
220	   up" the network elements, by upgrading line-cards or switch fabrics.

222	   In contrast, large-scale data centers often host applications that
223	   generate a significant amount of server-to-server traffic, also known
224	   as "east-west" traffic.  Examples of such applications could be
225	   compute clusters such as Hadoop or live virtual machine migrations.
226	   Scaling up traditional tree topologies to match these bandwidth
227	   demands becomes either too expensive or impossible due to physical
228	   limitations.

230	4.2.  CAPEX minimization

232	   The cost of the network infrastructure alone (CAPEX) constitutes
233	   about 10-15% of total data center expenditure [GREENBERG2009].
234	   However, the absolute cost is significant, and there is a need to
235	   constantly drive down the cost of networking elements themselves.
236	   This can be accomplished in two ways:

238	   o  Unifying all network elements, preferably using the same hardware
239	      type or even the same device.  This allows for bulk purchases with
240	      discounted pricing.
241	   o  Driving costs down by introducing multiple network equipment
242	      vendors.

244	   In order to allow for vendor diversity, it is important to minimize
245	   the software feature requirements for the network elements.
246	   Furthermore, this strategy provides the maximum flexibility of vendor
247	   equipment choices while enforcing interoperability using open
248	   standards.

250	4.3.  OPEX minimization

252	   Operating a large-scale infrastructure can be expensive, given that
253	   a larger number of elements will statistically fail more often.  Having
254	   a simpler design and operating using a limited software feature-set
255	   ensures that failures will mostly result from hardware malfunction
256	   and not software issues.

258	   An important aspect of OPEX minimization is reducing the size of failure
259	   domains in the network.  Ethernet networks are known to be
260	   susceptible to broadcast or unicast storms.  The use of a fully
261	   routed design significantly reduces the size of the data-plane
262	   failure domains (e.g. limits them to Tier 3 switches only).  However,
263	   such designs also introduce the problem of distributed control-plane
264	   failures.  This calls for simpler control-plane protocols that are
265	   expected to have a lower chance of causing a network meltdown.

267	4.4.  Traffic Engineering

269	   In any data center, application load-balancing is a critical function
270	   performed by network devices.  Traditionally, load-balancers are
271	   deployed as dedicated devices in the traffic forwarding path.  The
272	   problem arises in scaling load-balancers under growing traffic
273	   demand.  A preferable solution would be to scale the load-balancing
274	   layer horizontally, by adding more uniform nodes and distributing
275	   incoming traffic across these nodes.

277	   In a situation like this, an ideal choice would be to use the network
278	   infrastructure itself to distribute traffic across a group of load-
279	   balancers.  A combination of features such as Anycast prefix
280	   advertisement [RFC4786] along with Equal Cost Multipath (ECMP)
281	   functionality could be used to accomplish this.  To allow for more
282	   granular load-distribution, it is beneficial for the network to
283	   support the ability to perform controlled per-hop traffic
284	   engineering.  For example, it is beneficial to directly control the
285	   ECMP next-hop set for anycast prefixes at every level of network
286	   hierarchy.
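
   As a rough illustration of how ECMP can spread flows across a group
   of load-balancers that all advertise the same anycast prefix, the
   sketch below hashes a flow's 5-tuple onto a next-hop set.  The
   function name, the addresses and the use of SHA-256 are assumptions
   made for this example only; real switches use vendor-specific
   hardware hash functions.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow (5-tuple) onto one member of an ECMP next-hop set.

    Every packet of the same flow hashes to the same next hop, so a
    flow is never split across load-balancers.
    """
    if not next_hops:
        raise ValueError("empty ECMP next-hop set")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical next-hop set: three load-balancers behind one anycast prefix.
load_balancers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Hypothetical flow: (source IP, destination IP, protocol, source port,
# destination port).
flow = ("192.0.2.10", "198.51.100.1", 6, 49152, 443)
chosen = pick_next_hop(flow, load_balancers)
```

   Controlling the contents of the next-hop list at each hop is what
   the per-hop traffic engineering described above amounts to in this
   simplified model.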

288	5.  Requirement List

290	   This section summarizes the list of requirements, based on the
291	   discussion so far:

293	   o  REQ1: Select a network topology where capacity could be scaled
294	      "horizontally" by adding more links and network switches of the
295	      same type, without requiring an upgrade to the network elements
296	      themselves.
297	   o  REQ2: Define a narrow set of software features/protocols supported
298	      by a multitude of networking equipment vendors.
299	   o  REQ3: Among the network protocols, choose the one that has a
300	      simpler implementation in terms of minimal programming code
301	      complexity.
302	   o  REQ4: The network routing protocol should allow for explicit
303	      control of the routing prefix next-hop set on a per-hop basis.

305	6.  Network topology

307	   This section outlines the most common choice for a horizontally
308	   scalable topology in large-scale data centers.

310	6.1.  Clos topology overview

312	   A common choice for a horizontally scalable topology is a folded Clos
313	   topology, sometimes called "fat-tree" (see, for example, [INTERCON]
314	   and [ALFARES2008]).  This topology features an odd number of stages
315	   (dimensions) and is commonly made of the same uniform elements, e.g.
316	   switches with the same port count.  Therefore, the choice of Clos
317	   topology satisfies both REQ1 and REQ2.  See Figure 2 below for an
318	   example of a folded 3-stage Clos topology:

320	             +-------+
321	             |       |----------------------------+
322	             |       |------------------+         |
323	             |       |--------+         |         |
324	             +-------+        |         |         |
325	             +-------+        |         |         |
326	             |       |--------+---------+-------+ |
327	             |       |--------+-------+ |       | |
328	             |       |------+ |       | |       | |
329	             +-------+      | |       | |       | |
330	             +-------+      | |       | |       | |
331	             |       |------+-+-------+-+-----+ | |
332	             |       |------+-+-----+ | |     | | |
333	             |       |----+ | |     | | |     | | |
334	             +-------+    | | |     | | |   ---------> M links
335	               Tier1      | | |     | | |     | | |
336	                        +-------+ +-------+ +-------+
337	                        |       | |       | |       |
338	                        |       | |       | |       | Tier2
339	                        |       | |       | |       |
340	                        +-------+ +-------+ +-------+
341	                          | | |     | | |     | | |
342	                          | | |     | | |   ---------> N Links
343	                          | | |     | | |     | | |
344	                          O O O     O O O     O O O   Servers

346	                  Figure 2: 3-Stage Folded Clos topology

348	   In the networking industry, a topology like this is sometimes
349	   referred to as a "Leaf and Spine" network, where "Spine" is the name
350	   given to the middle stage of the Clos topology (Tier 1) and "Leaf" is
351	   the name of the input/output stage (Tier 2).  However, for consistency,
352	   we will refer to these layers as "Tier n".

354	6.2.  Clos topology properties

356	   The following are some key properties of the Clos topology:

358	   o  The topology is fully non-blocking (or more accurately, non-
359	      interfering) if M >= N and oversubscribed by a factor of N/M
360	      otherwise.  Here M and N are the uplink and downlink port counts
361	      respectively, for a Tier 2 switch, as shown in Figure 2.
362	   o  Implementing a Clos topology requires a routing protocol supporting
363	      ECMP with a fan-out of M or more.
364	   o  Every Tier 1 device has exactly one path to every end host
365	      (server) in this topology.
366	   o  Traffic flowing from server to server is naturally load-balanced
367	      over all available paths using simple ECMP behavior.
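
   The first property can be checked with a few lines of arithmetic.
   The sketch below is only an illustration of the N/M rule stated
   above; the function name and the port counts in the example are
   assumptions, not values taken from a real deployment.

```python
from fractions import Fraction


def oversubscription(uplinks_m, downlinks_n):
    """Return the oversubscription factor of a Tier 2 switch.

    A result of 1 means the switch is non-blocking (M >= N);
    otherwise the factor is N/M.
    """
    if uplinks_m <= 0 or downlinks_n <= 0:
        raise ValueError("port counts must be positive")
    if uplinks_m >= downlinks_n:
        return Fraction(1)
    return Fraction(downlinks_n, uplinks_m)


# Hypothetical Tier 2 switch with 4 uplinks and 12 downlinks.
ratio = oversubscription(uplinks_m=4, downlinks_n=12)
```

   In this model the routing protocol would also need to support an
   ECMP fan-out of at least 4, the value of M.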

369	6.3.  Scaling Clos topology

371	   A Clos topology could be scaled either by increasing network switch
372	   port count or adding more stages, e.g. moving to a 5-stage Clos, as
373	   illustrated in Figure 3 below:

375	                                  Tier1
376	                                 +-----+
377	                                 |     |
378	                              +--|     |--+
379	                              |  +-----+  |
380	                      Tier2   |           |   Tier2
381	                     +-----+  |  +-----+  |  +-----+
382	       +-------------| DEV |--+--|     |--+--|     |-------------+
383	       |       +-----|  C  |--+  |     |  +--|     |-----+       |
384	       |       |     +-----+     +-----+     +-----+     |       |
385	       |       |                                         |       |
386	       |       |     +-----+     +-----+     +-----+     |       |
387	       | +-----+-----| DEV |--+  |     |  +--|     |-----+-----+ |
388	       | |     | +---|  D  |--+--|     |--+--|     |---+ |     | |
389	       | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
390	       | |     | |            |           |            | |     | |
391	     +-----+ +-----+          |  +-----+  |          +-----+ +-----+
392	     | DEV | | DEV |          +--|     |--+          |     | |     |
393	     |  A  | |  B  | Tier3       |     |       Tier3 |     | |     |
394	     +-----+ +-----+             +-----+             +-----+ +-----+
395	       | |     | |                                     | |     | |
396	       O O     O O            <- Servers ->            O O     O O

398	                      Figure 3: 5-Stage Clos topology

400	   The topology in Figure 3 is built from switches with a port count of 4
401	   and provides full bisection bandwidth to all connected servers.  We
402	   will refer to the collection of directly connected Tier 2 and Tier 3
403	   switches as a "cluster" in this document.  For example, devices A, B,
404	   C, and D in Figure 3 form a cluster.

406	   In practice, the Tier 3 level of the network (typically top of rack
407	   switches, or ToRs) is where oversubscription is introduced to allow
408	   for packaging of more servers in the data center.  The main reason to
409	   limit oversubscription to a single layer of the network is to
410	   simplify application development that would otherwise need to account
411	   for two bandwidth pools: within the same access switch (e.g. rack)
412	   and outside of the local switch.  Since oversubscription itself does
413	   not have any effect on routing, we will not be discussing it further
414	   in this document.

416	7.  Routing design

418	   This section discusses the motivation for choosing BGP as the routing
419	   protocol and BGP configuration for routing in Clos topology.

421	7.1.  Choosing the routing protocol

423	   The set of requirements discussed earlier calls for a single routing
424	   protocol (REQ2) to reduce complexity and interdependencies.  While it
425	   is common to rely on an IGP in this situation, the document proposes
426	   the use of BGP only.  The advantages of using BGP are discussed
427	   below.

429	   o  BGP inherently has less complexity within its protocol design -
430	      internal data structures and state-machines are simpler when
431	      compared to a link-state IGP.  For example, instead of
432	      implementing adjacency formation, adjacency maintenance and/or
433	      flow-control, BGP simply relies on TCP as the underlying
434	      transport.  This fulfills REQ2 and REQ3.
435	   o  BGP information flooding overhead is less when compared to link-
436	      state IGPs.  Indeed, since every BGP router normally re-calculates
437	      and propagates best-paths only, a network failure is masked as
438	      soon as the BGP speaker finds an alternate path.  In contrast, the
439	      event propagation scope of a link-state IGP is a single flooding
440	      domain, regardless of the failure type.  Furthermore, all well-
441	      known link-state IGPs feature periodic refresh updates, while BGP
442	      does not expire routing state.
443	   o  BGP supports third-party (recursively resolved) next-hops.  This
444	      allows for ECMP or forwarding based on customer-defined forwarding
445	      paths.  This satisfies REQ4 stated above.  Some IGPs, such as
446	      OSPF, support similar functionality using special concepts such as
447	      "Forwarding Address", but do not satisfy other requirements, such
448	      as protocol simplicity.
449	   o  Vanilla BGP configuration, without routing policies, is easier to
450	      troubleshoot for network reachability issues.  For example, it is
451	      straightforward to dump the contents of the Loc-RIB and compare them
452	      to the router's RIB and FIB.  Furthermore, every BGP neighbor has
453	      corresponding Adj-RIB-In and Adj-RIB-Out structures with incoming/
454	      outgoing NLRI information that could be easily correlated on both
455	      sides of the BGP peering session.  Thus BGP fully satisfies REQ3.

457	7.2.  BGP configuration for Clos topology

459	   Topologies that have more than 5 stages are very uncommon due to the
460	   large numbers of interconnects required by such a design.

462	7.2.1.  BGP Autonomous System numbering layout

464	   The diagram below illustrates the suggested BGP Autonomous System
465	   Number (BGP ASN) allocation scheme.  The following is a list of
466	   guidelines that can be used:

468	   o  All BGP peering sessions are external BGP (eBGP) established over
469	      direct point-to-point links interconnecting the network nodes.
470	   o  16-bit (two octet) BGP ASNs are used, since these are widely
471	      supported and have better vendor interoperability (e.g. no need to
472	      support BGP capability negotiation).
473	   o  Private BGP ASNs from the range 64512-65534 are used so as to
474	      avoid ASN conflicts.  The private ASN stripping feature can be
475	      leveraged as a result (see below).
476	   o  A single BGP ASN is allocated to the Clos middle stage ("Tier 1"),
477	      e.g.  ASN 64534 as shown in Figure 4.
478	   o  A unique BGP ASN is allocated per group of "Tier 2" switches.  All
479	      Tier 2 switches in the same group share the BGP ASN.
480	   o  A unique BGP ASN is allocated to every Tier 3 switch (e.g.  ToR) in
481	      this topology.

483	                                ASN 64534
484	                               +---------+
485	                               | +-----+ |
486	                               | |     | |
487	                             +-|-|     |-|-+
488	                             | | +-----+ | |
489	                  ASN 64XXX  | |         | |  ASN 64XXX
490	                 +---------+ | |         | | +---------+
491	                 | +-----+ | | | +-----+ | | | +-----+ |
492	     +-----------|-|     |-|-+-|-|     |-|-+-|-|     |-|-----------+
493	     |       +---|-|     |-|-+ | |     | | +-|-|     |-|---+       |
494	     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
495	     |       |   |         |   |         |   |         |   |       |
496	     |       |   |         |   |         |   |         |   |       |
497	     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
498	     | +-----+---|-|     |-|-+ | |     | | +-|-|     |-|---+-----+ |
499	     | |     | +-|-|     |-|-+-|-|     |-|-+-|-|     |-|-+ |     | |
500	     | |     | | | +-----+ | | | +-----+ | | | +-----+ | | |     | |
501	     | |     | | +---------+ | |         | | +---------+ | |     | |
502	     | |     | |             | |         | |             | |     | |
503	   +-----+ +-----+           | | +-----+ | |           +-----+ +-----+
504	   | ASN | |     |           +-|-|     |-|-+           |     | |     |
505	   |65YYY| | ... |             | |     | |             | ... | | ... |
506	   +-----+ +-----+             | +-----+ |             +-----+ +-----+
507	     | |     | |               +---------+               | |     | |
508	     O O     O O              <- Servers ->              O O     O O

510	                 Figure 4: BGP ASN layout for 5-stage Clos

512	7.2.2.  Non-unique private BGP ASN's

514	   The use of private BGP ASNs limits the usable range to 1023 unique
515	   numbers.  Since it is very likely that the number of network switches
516	   could exceed this number, a workaround is required.  One approach
517	   would be to re-use the private ASNs assigned to the Tier 3 switches
518	   across different clusters.  For example, private BGP ASNs 65001,
519	   65002 ... 65032 could be used within every individual cluster to be
520	   assigned to Tier 3 switches.
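
   The allocation guidelines and the re-use of Tier 3 ASNs across
   clusters can be sketched as a small planning function.  This is a
   minimal sketch under stated assumptions: the function name, the
   Tier 2 base value of 64601 and the cluster sizes are hypothetical,
   while ASN 64534 for Tier 1 and the 65001 ... 65032 block for Tier 3
   follow the examples in this document.

```python
PRIVATE_ASN_MIN = 64512
PRIVATE_ASN_MAX = 65534


def allocate_asns(num_clusters, tors_per_cluster,
                  tier1_asn=64534, tier2_base=64601, tier3_base=65001):
    """Build an ASN plan: one Tier 1 ASN, one ASN per cluster of
    Tier 2 switches, and a Tier 3 block re-used in every cluster."""
    tier2_asns = [tier2_base + i for i in range(num_clusters)]
    tier3_asns = [tier3_base + i for i in range(tors_per_cluster)]
    used = [tier1_asn] + tier2_asns + tier3_asns
    if len(set(used)) != len(used):
        raise ValueError("Tier 1, Tier 2 and Tier 3 ASN blocks overlap")
    if min(used) < PRIVATE_ASN_MIN or max(used) > PRIVATE_ASN_MAX:
        raise ValueError("ASN outside the two-octet private range")
    return {
        "tier1": tier1_asn,
        "clusters": [
            # The same Tier 3 block is copied into every cluster.
            {"tier2": asn, "tier3": list(tier3_asns)}
            for asn in tier2_asns
        ],
    }


# Hypothetical data center: 3 clusters of 32 ToR switches each.
plan = allocate_asns(num_clusters=3, tors_per_cluster=32)
```

   Because only Tier 2 ASNs must be unique across clusters in this
   scheme, the number of Tier 3 switches is no longer bounded by the
   size of the private range.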

522	   To avoid route suppression due to AS PATH loop prevention, upstream
523	   eBGP sessions on Tier 3 switches must be configured with the "AllowAS
524	   In" feature that allows accepting a device's own ASN in received
525	   route advertisements.  Introducing this feature does not create the
526	   opportunity for routing loops under misconfiguration since the AS
527	   PATH is always incremented when routes are propagated from tier to
528	   tier.
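
   The effect of this feature on AS PATH loop prevention can be shown
   with a simplified model.  The function below is an illustrative
   sketch, not a vendor implementation; the name "accepts_route", the
   "allowas_in" occurrence limit and the Tier 2 ASNs 64601 and 64602
   are assumptions made for the example.

```python
def accepts_route(as_path, local_asn, allowas_in=0):
    """Simplified eBGP loop check.

    The route is accepted if the local ASN appears in the received
    AS PATH no more than `allowas_in` times (0 is the default
    loop-prevention behavior).
    """
    return as_path.count(local_asn) <= allowas_in


# A prefix originated by a Tier 3 switch (ASN 65001) in one cluster,
# as received by a Tier 3 switch in another cluster that re-uses
# ASN 65001.  The path is listed most recent ASN first:
# local Tier 2, Tier 1, remote Tier 2, originating Tier 3.
received_path = [64602, 64534, 64601, 65001]

rejected_by_default = not accepts_route(received_path, local_asn=65001)
accepted_with_feature = accepts_route(received_path, local_asn=65001,
                                      allowas_in=1)
```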

530	   Another solution to this problem would be to use four-octet (32-
531	   bit) BGP ASNs.  However, there is no reserved private ASN range in
532	   the four-octet numbering scheme, although efforts are underway to
533	   support this, see [I-D.mitchell-idr-as-private-reservation].  This
534	   will also require vendors to implement specific policy features, such
535	   as four-octet private AS removal from the AS-PATH attribute.

537	7.2.3.  Prefix advertisement

539	   A Clos topology has a large number of point-to-point links and
540	   associated prefixes.  Advertising all of these routes into BGP may
541	   create FIB overload conditions.  There are two possible solutions
542	   that can help prevent FIB overload:

544	   o  Do not advertise any of the point-to-point links into BGP.  Since
545	      eBGP peering changes the next-hop address anyway at every node,
546	      distant networks will automatically be reachable via the
547	      advertising eBGP peer.
548	   o  Advertise point-to-point links, but summarize them on every
549	      advertising device.  This requires proper address allocation, for
550	      example allocating a consecutive block of IP addresses per Tier 1
551	      and Tier 2 device to be used for point-to-point interface
552	      addressing.
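
   The second option depends on the address plan.  The sketch below
   uses Python's standard "ipaddress" module to carve point-to-point
   subnets out of one consecutive block per device and to verify that
   they collapse into a single summary route.  The 10.1.0.0/26 block,
   the use of /31 link subnets and the function name are assumptions
   made for illustration.

```python
import ipaddress


def p2p_links(device_block, count):
    """Carve `count` /31 point-to-point subnets out of a device's
    consecutive address block."""
    block = ipaddress.ip_network(device_block)
    subnets = list(block.subnets(new_prefix=31))
    if count > len(subnets):
        raise ValueError("device block too small for requested links")
    return subnets[:count]


# Hypothetical Tier 2 device with 32 point-to-point interfaces.
links = p2p_links("10.1.0.0/26", 32)

# All 32 link prefixes collapse into the single device block, so only
# one route needs to be advertised instead of 32.
summary = list(ipaddress.collapse_addresses(links))
```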

554	   Server-facing subnets on Tier 3 switches are announced into BGP
555	   without using summarization on Tier 2 and Tier 1 switches.
556	   Summarizing subnets in the Clos topology will result in route black-
557	   holing under a single link failure (e.g. between a Tier 2 and a Tier 3
558	   switch) and hence must be avoided.  The use of peer links within the
559	   same tier to resolve the black-holing problem is undesirable due to
560	   O(N^2) complexity of the peering mesh and waste of ports on the
561	   switches.
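
The black-holing effect can be illustrated with a small longest-prefix-match sketch. Device names, prefixes, and the FIB contents are hypothetical, and default routes are ignored for simplicity; this is a simplified model, not a description of any particular forwarding implementation.

```python
import ipaddress


def lookup(fib, dst):
    """Longest-prefix match: return the next-hop list for dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [prefix for prefix in fib if addr in prefix]
    if not matches:
        return None
    return fib[max(matches, key=lambda prefix: prefix.prefixlen)]


net = ipaddress.ip_network

# Tier 2 switch T2a after its link to T3-1 failed: T3-1's subnet is gone.
t2a_fib = {net("10.1.2.0/24"): ["T3-2"]}

# Tier 1 FIB when the Tier 2 switches advertise only a summary.
tier1_summarized = {net("10.1.0.0/16"): ["T2a", "T2b"]}

# Tier 1 FIB without summarization, after T2a withdrew 10.1.1.0/24.
tier1_specific = {
    net("10.1.1.0/24"): ["T2b"],
    net("10.1.2.0/24"): ["T2a", "T2b"],
}

dst = "10.1.1.5"  # a server behind T3-1
print(lookup(tier1_summarized, dst))  # T2a is still an ECMP candidate
print(lookup(t2a_fib, dst))           # no route on T2a: traffic is dropped
print(lookup(tier1_specific, dst))    # only T2b remains
```

With the summary in place, Tier 1 keeps hashing part of the traffic to T2a, which no longer has a matching route. With specific prefixes, the withdrawal removes T2a from the ECMP set.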

563	7.2.4.  External connectivity

565	   A dedicated cluster (or clusters) in the Clos topology could be used
566	   solely for the purpose of connecting to the Wide Area Network (WAN)
567	   edge devices, or WAN routers.  Tier 3 switches in such a cluster
568	   would be replaced with WAN routers, and eBGP peering would still be
569	   used, even though WAN routers are likely to belong to a public ASN.

571	   The Tier 2 devices in such a dedicated cluster will be referred to as
572	   "Border Routers" in this document.  These devices have to perform a
573	   few special functions:

575	   o  Hide network topology information when advertising paths to WAN
576	      routers, i.e. remove private BGP ASNs from the AS PATH attribute.
577	      This is typically done to avoid BGP ASN collisions across data
578	      centers.  A BGP policy feature called "Remove Private AS" is
579	      commonly used to accomplish this.  This feature strips a
580	      contiguous sequence of private ASNs found in the AS PATH attribute
581	      prior to advertising the path to a neighbor.  This assumes that
582	      all BGP ASNs used for intra data center numbering are from the
583	      private ASN range.
584	   o  Originate a default route to the data center devices.  This is the
585	      only place where a default route could be originated, as route
586	      summarization is highly undesirable for the "scale-out" topology.
587	      Alternatively, Border Routers may simply relay the default route
588	      learned from WAN routers.
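
A minimal sketch of the "Remove Private AS" behavior follows. It assumes the two-octet private range 64512-65534 and assumes that only the leading contiguous run of private ASNs is stripped; actual vendor behavior varies, and the function name and example ASNs are hypothetical.

```python
# Assumed two-octet private-use ASN range (64512 through 65534 inclusive).
PRIVATE_ASNS = range(64512, 65535)


def remove_private_as(as_path):
    """Strip the leading contiguous run of private ASNs from an AS PATH."""
    index = 0
    while index < len(as_path) and as_path[index] in PRIVATE_ASNS:
        index += 1
    return as_path[index:]


# Path made up entirely of intra data center (private) ASNs.
print(remove_private_as([64601, 65001]))

# A non-private ASN (64496, a documentation ASN) ends the stripped run.
print(remove_private_as([64601, 64496, 65001]))
```

The second call shows why the text assumes that all intra data center ASNs are private: in this sketch, a non-private ASN in the middle of the path leaves the private ASNs behind it untouched.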

590	7.3.  ECMP Considerations

592	   This section covers the Equal Cost Multipath (ECMP) functionality for
593	   Clos topology and discusses a few special requirements.

595	7.3.1.  Basic ECMP

597	   ECMP is the fundamental load-sharing mechanism used by a Clos
598	   topology.  Effectively, every lower-tier switch will use all of its
599	   directly attached upper-tier devices to load-share traffic destined
600	   to the same prefix.  The number of ECMP paths between two
601	   input/output switches in a Clos topology equals the number of
602	   switches in the middle stage (Tier 1).  For example, Figure 5
603	   illustrates a topology where Tier 3 device A has four paths to reach
604	   servers X and Y: two via Tier 2 device B and Tier 1 devices 1 and 2,
605	   and two via Tier 2 device C and Tier 1 devices 3 and 4.

607	                                  Tier 1
608	                                 +-----+
609	                                 | DEV |
610	                              +->|  1  |--+
611	                              |  +-----+  |
612	                      Tier 2  |           |   Tier 2
613	                     +-----+  |  +-----+  |  +-----+
614	       +------------>| DEV |--+->| DEV |--+--|     |-------------+
615	       |       +-----|  B  |--+  |  2  |  +--|     |-----+       |
616	       |       |     +-----+     +-----+     +-----+     |       |
617	       |       |                                         |       |
618	       |       |     +-----+     +-----+     +-----+     |       |
619	       | +-----+---->| DEV |--+  | DEV |  +--|     |-----+-----+ |
620	       | |     | +---|  C  |--+->|  3  |--+--|     |---+ |     | |
621	       | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
622	       | |     | |            |           |            | |     | |
623	     +-----+ +-----+          |  +-----+  |          +-----+ +-----+
624	     | DEV | |     | Tier 3   +->| DEV |--+   Tier 3 |     | |     |
625	     |  A  | |     |             |  4  |             |     | |     |
626	     +-----+ +-----+             +-----+             +-----+ +-----+
627	       | |     | |                                     | |     | |
628	       O O     O O            <- Servers ->            X Y     O O

630	               Figure 5: ECMP fan-out tree from A to X and Y

632	   The ECMP requirement implies that the BGP implementation must support
633	   multi-path fan-out for up to the maximum number of devices directly
634	   attached at any point in the topology.  Normally, this number does
635	   not exceed half of the ports found on a switch in the topology.  For
636	   example, an ECMP max-path of 32 would be required when building a
637	   Clos network using 64-port devices.
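
The fan-out in Figure 5 and the max-path arithmetic can be checked with a short sketch. The adjacency table is read off the figure (A connects to B and C, B to devices 1 and 2, C to devices 3 and 4); the function names are illustrative only.

```python
# Upward adjacencies as drawn in Figure 5.
UPLINKS = {
    "A": ["B", "C"],
    "B": ["1", "2"],
    "C": ["3", "4"],
}


def upward_paths(device, uplinks=UPLINKS):
    """Enumerate all paths from a device up to the Tier 1 stage."""
    parents = uplinks.get(device, [])
    if not parents:
        return [[device]]
    return [[device] + rest
            for parent in parents
            for rest in upward_paths(parent, uplinks)]


def required_max_paths(port_count):
    """ECMP fan-out needed when half of the ports face the upper tier."""
    return port_count // 2


for path in upward_paths("A"):
    print(" -> ".join(path))

print(required_max_paths(64))
```

The enumeration yields one path per Tier 1 device, matching the four paths described for device A, and the second function reproduces the 64-port example.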

639	   Most implementations declare paths to be equal from an ECMP
640	   perspective if they match up to and including step (e) in Section
641	   9.1.2.2 of [RFC4271].  In the proposed network design there is no
642	   underlying IGP, so all IGP costs are automatically assumed to be zero
643	   (or otherwise the same value across all paths).  Loop prevention is
644	   assumed to be handled by the BGP best-path selection process.

646	7.3.2.  BGP ECMP over multiple ASN

648	   For application load-balancing purposes we may want the same prefix
649	   to be advertised from multiple Tier 3 switches.  From the perspective
650	   of other devices, such a prefix would have BGP paths with different
651	   AS PATH attribute values but the same AS PATH attribute length.
652	   Therefore, BGP implementations must support load-sharing over such
653	   paths.  This feature is sometimes known as "AS PATH multipath relax"
654	   and effectively allows for ECMP to be done across different
655	   neighboring ASNs.
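
The difference between strict and relaxed multipath selection can be sketched as follows. This is a simplified model under stated assumptions: paths are ranked only by local preference, AS PATH length, origin, and MED, and the strict mode is assumed to require identical AS PATH contents. Real implementations differ in detail; all names and ASNs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    as_path: tuple
    local_pref: int = 100
    origin: int = 0
    med: int = 0


def rank(path):
    """Simplified tie-breaking key; lower is better."""
    return (-path.local_pref, len(path.as_path), path.origin, path.med)


def ecmp_next_hops(paths, relax=False):
    """Return the next hops eligible for ECMP.

    Without relax, the AS PATH contents must match the best path exactly
    (an assumed strict mode); with relax, equal length is sufficient.
    """
    best = min(paths, key=rank)
    return [p.next_hop for p in paths
            if rank(p) == rank(best)
            and (relax or p.as_path == best.as_path)]


# The same prefix learned via two clusters with different ASNs.
paths = [
    Path("10.0.0.1", (64601, 65001)),
    Path("10.0.0.3", (64602, 65002)),
]

print(ecmp_next_hops(paths))              # strict: one next hop
print(ecmp_next_hops(paths, relax=True))  # relaxed: both next hops
```

Both paths have the same AS PATH length, so only the relaxed comparison installs them together.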

657	7.4.  BGP convergence properties

659	   This section reviews routing convergence properties of BGP in the
660	   proposed design.  A case is made that sub-second convergence is
661	   achievable provided that the implementation supports fast BGP peering
662	   session shutdown upon failure of an associated link.

664	7.4.1.  Convergence timing

666	   BGP typically relies on an IGP to route around link/node failures
667	   inside an AS, and implements either a polling based or an event-
668	   driven mechanism to obtain updates on IGP state changes.  The
669	   proposed routing design omits the use of an IGP, so the only
670	   mechanisms that could be used for fault detection are BGP keep-alives
671	   and link-failure triggers.

673	   Relying solely on BGP keep-alive packets may result in high
674	   convergence delays, on the order of multiple seconds (normally, the
675	   minimum recommended BGP hold time value is 3 seconds).  However, many
676	   BGP implementations can shut down local eBGP peering sessions in
677	   response to the "link down" event for the outgoing interface used for
678	   BGP peering.  This feature is sometimes called "fast fail-over".
679	   Since the majority of the links in modern data centers are
680	   point-to-point fiber connections, a physical interface failure is
681	   often detected in milliseconds and subsequently triggers BGP
682	   re-convergence.

684	   Furthermore, popular link technologies, such as 10Gbps Ethernet, may
685	   support a simple form of OAM for failure signaling such as
686	   [FAULTSIG10GE], which makes failure detection more robust.
687	   Alternatively, rather than relying on the physical layer for fault
688	   signaling, some platforms may support Bidirectional Forwarding
689	   Detection ([RFC5880]) to allow for sub-second failure detection and
690	   fault signaling to the BGP process.  This, however, presents
691	   additional requirements to vendor software and possibly hardware, and
692	   may contradict REQ1.
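
To put the sub-second claim in perspective, the sketch below compares a 3-second BGP hold time with a BFD detection time computed as the detect multiplier times the negotiated interval (asynchronous mode, per [RFC5880]). The timer values are assumptions chosen for illustration, not recommendations.

```python
def bfd_detection_time_ms(remote_detect_mult,
                          local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Detection time in asynchronous mode, in milliseconds."""
    negotiated_interval = max(local_required_min_rx_ms,
                              remote_desired_min_tx_ms)
    return remote_detect_mult * negotiated_interval


BGP_HOLD_TIME_MS = 3000  # minimum hold time mentioned in the text

# Hypothetical BFD timers: 300 ms intervals and a multiplier of 3.
bfd_ms = bfd_detection_time_ms(3, 300, 300)

print("BFD detection time:", bfd_ms, "ms")
print("BGP hold time:     ", BGP_HOLD_TIME_MS, "ms")
```

Under these assumed timers BFD detects a failure in under one second, whereas keep-alive based detection may take up to the full hold time.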

694	7.4.2.  Failure impact scope

696	   BGP is inherently a distance-vector protocol, and as such some
697	   failures could be masked if the local node can immediately find a
698	   backup path.  The worst case is that all devices in the data center
699	   topology would have to either withdraw a prefix completely, or
700	   recalculate the ECMP paths in the FIB.  Reducing the fault domain
701	   using summarization is not possible with the proposed design, since
702	   using this technique may create route black-holing issues as
703	   mentioned previously.  Thus, the control-plane failure impact scope
704	   is the network as a whole.  It is worth pointing out that this
705	   property is not a result of choosing BGP, but rather a result of
706	   using the "scale-out" Clos topology.

708	7.4.3.  Third-party route injection

710	   BGP allows for a third-party BGP speaker (not necessarily directly
711	   attached to the network devices) to inject routes anywhere in the
712	   network topology.  This could be achieved by peering an external
713	   speaker using an eBGP multi-hop session with some or even all devices
714	   in the topology.  Furthermore, BGP diverse path distribution
715	   [I-D.ietf-grow-diverse-bgp-path-dist] could be used to inject
716	   multiple next-hops for the same prefix to facilitate load-balancing.
717	   Using such a technique would make it possible to implement unequal-
718	   cost load-balancing across multiple clusters in the data-center, by
719	   associating the same prefix with next-hops mapped to different
720	   clusters.

722	   For example, a third-party BGP speaker may peer with Tier 3 and Tier
723	   1 switches, injecting the same prefix, but using a special set of BGP
724	   next-hops for Tier 1 devices.  Those next-hops are assumed to resolve
725	   recursively via BGP, and could be, for example, IP addresses on Tier
726	   3 switches.  The resulting forwarding table programming could provide
727	   the desired proportional distribution of traffic among clusters.
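
The traffic proportions that result from such an injection can be estimated with a short sketch. It assumes that a Tier 1 device hashes flows evenly across all injected next-hops, so that each cluster's share is proportional to the number of next-hops mapped to it. The addresses and cluster names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def cluster_shares(next_hops, cluster_of):
    """Fraction of traffic per cluster, assuming even hashing per next-hop."""
    counts = Counter(cluster_of[next_hop] for next_hop in next_hops)
    total = sum(counts.values())
    return {cluster: Fraction(count, total)
            for cluster, count in counts.items()}


# Hypothetical mapping of Tier 3 switch addresses to clusters.
cluster_of = {
    "10.1.0.1": "cluster-1",
    "10.1.0.2": "cluster-1",
    "10.1.0.3": "cluster-1",
    "10.2.0.1": "cluster-2",
}

# Next-hops injected for one prefix on the Tier 1 devices.
shares = cluster_shares(list(cluster_of), cluster_of)
for cluster, share in shares.items():
    print(cluster, share)
```

Injecting three next-hops in one cluster and one in another yields a 3:1 split under the even-hashing assumption.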

729	8.  Security Considerations

731	   The design does not introduce any additional security concerns.  For
732	   control plane security, BGP peering sessions could be authenticated
733	   using the TCP MD5 Signature Option [RFC2385].  Furthermore, BGP
734	   TTL security [I-D.gill-btsh] could be used to reduce the risk of
735	   session spoofing and TCP SYN flooding attacks against the control
736	   plane.

738	9.  IANA Considerations

740	   There are no considerations associated with IANA for this document.

742	10.  Acknowledgements

744	   This publication summarizes the work of many people who participated
745	   in developing, testing and deploying the proposed design.  Their
746	   names, in alphabetical order, are George Chen, Parantap Lahiri, Dave
747	   Maltz, Edet Nkposong, Robert Toomey, and Lihua Yuan.  The authors
748	   would also like to thank Jon Mitchell, Linda Dunbar and Susan Hares
749	   for reviewing and providing valuable feedback on the document.

751	11.  Informative References

753	   [RFC4786]  Abley, J. and K. Lindqvist, "Operation of Anycast
754	              Services", BCP 126, RFC 4786, December 2006.

756	   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
757	              Protocol 4 (BGP-4)", RFC 4271, January 2006.

759	   [RFC2385]  Heffernan, A., "Protection of BGP Sessions via the TCP MD5
760	              Signature Option", RFC 2385, August 1998.

762	   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
763	              (BFD)", RFC 5880, June 2010.

765	   [I-D.ietf-grow-diverse-bgp-path-dist]
766	              Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K.
767	              Kumaki, "Distribution of diverse BGP paths.",
768	              draft-ietf-grow-diverse-bgp-path-dist-07 (work in
769	              progress), May 2012.

771	   [I-D.mitchell-idr-as-private-reservation]
772	              Mitchell, J., "Autonomous System (AS) Reservation for
773	              Private Use", draft-mitchell-idr-as-private-reservation-00
774	              (work in progress), June 2012.

776	   [I-D.gill-btsh]
777	              Gill, V., Heasley, J., and D. Meyer, "The BGP TTL Security
778	              Hack (BTSH)", draft-gill-btsh-02 (work in progress),
779	              May 2003.

781	   [GREENBERG2009]
782	              Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a
783	              Cloud: Research Problems in Data Center Networks",
784	              January 2009.

786	   [FAULTSIG10GE]
787	              Frazier, H. and S. Muller, "Remote Fault & Break Link
788	              Proposal for 10-Gigabit Ethernet", September 2000.

790	   [INTERCON]
791	              Dally, W. and B. Towles, "Principles and Practices of
792	              Interconnection Networks", ISBN 978-0122007514,
793	              January 2004.

795	   [ALFARES2008]
796	              Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
797	              Commodity Data Center Network Architecture", August 2008.

799	Authors' Addresses

801	   Petr Lapukhov
802	   Microsoft Corp.
803	   One Microsoft Way
804	   Redmond, WA  98052
805	   US

807	   Phone: +1 425 7032723 X 32723
808	   Email: petrlapu@microsoft.com
809	   URI:   http://microsoft.com/

811	   Ariff Premji
812	   Arista Networks
813	   5470 Great America Parkway
814	   Santa Clara, CA  95054
815	   US

817	   Phone: +1 408-547-5699
818	   Email: ariff@aristanetworks.com
819	   URI:   http://aristanetworks.com/