IDR                                                             J. Heitz
Internet-Draft                                                    D. Rao
Intended status: Standards Track                                   Cisco
Expires: April 25, 2019                                 October 22, 2018

         Aggregating BGP routes in Massive Scale Data Centers
                 draft-heitz-idr-msdc-bgp-aggregation-00

Abstract

   A design for a fabric of switches to connect up to one million
   servers in a data center is described.  At that scale, it is
   impractical for every switch to maintain knowledge about every other
   switch and every other link in the fabric.  Aggregation of routes is
   an excellent way to scale such a fabric.  However, aggregation
   presents some problems under link failures or switch failures.  This
   design solves those problems.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 25, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Problems with negative routes
   4.  Use of a negative route in BGP
   5.  Implementation Notes to Reduce CPU Time Consumption
   6.  Smooth Startup and Avoidance of Too Many Negative Routes
   7.  Avoidance of Transients
   8.  Configuration
   9.  South Triggered Automatic Disaggregation (STAD)
   10. Configuration for STAD
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1. Normative References
     14.2. Informative References
   Authors' Addresses

1.  Introduction

   [RFC7938] defines a massive scale data center as one that contains
   over one hundred thousand servers.  It describes the advantages of
   using BGP as a routing protocol in a Clos switching fabric that
   connects these servers.  It laments the need to announce all routes
   individually, because of the problems associated with route
   aggregation.  A fabric design that scales to one million servers is
   considered enough for the foreseeable future and is the design goal
   of this document.  Of course, the design should also work for
   smaller fabrics.

   A switch fabric to connect one million servers will consist of
   between 35,000 and 130,000 switches and 1.5 million to 8 million
   links, depending on how redundantly the servers are connected to the
   fabric and the level of oversubscription in the fabric.  A switch
   that needs to store, send and operate on hundreds of routes is
   clearly cheaper than one that needs to store, send and operate on
   millions of links.

   A switch running BGP and aggregating its routes needs to send only
   one route.  In the ideal case, each switch receives just one route
   from each of its neighbors.  For each link or neighbor that fails,
   the switch should send just one extra route.  No single link failure
   needs to be known by every switch in the fabric, and some switch
   failures do not need to be known by every switch either.  The routes
   that advertise these failures should only propagate to those
   switches that need to know about them.  During normal operation,
   failures are few, so the advertisements are few.

   A route that advertises a failure is called a negative route.
   Negative routes are not a new idea, but they are unpopular, because
   they cause a number of problems.  This document solves those
   problems.

2.  Solution Overview

   In a Clos network, all northbound links can reach all destinations,
   and there is typically only one or very few southbound links to
   reach any specific destination.  Therefore, traffic from source to
   destination is spread over all available northbound links, reaches
   the spines and then concentrates southbound towards its destination.
   When a link fails, a spine will lose connectivity to some southbound
   destinations.  That means any northbound link to that spine also
   loses connectivity to the same destinations.

   When the fabric is fully connected with no failed links, the
   forwarding tables in the switches can simply contain multipath
   aggregate routes over all the northbound links.  Each of the
   multipath routes is the same, so traffic is spread out smoothly
   among these routes.  As soon as a link fails, the forwarding tables
   must exclude the resultant unreachable destinations from some of the
   northbound links.  The way to do that is to add specific routes for
   the failed destinations that point at the remaining links that can
   reach those destinations.  Since traffic will always prefer specific
   routes to aggregate routes, the traffic to the failed destinations
   will no longer take the aggregate routes.

   Two methods to create these specific routes are described.  One way
   is to send a negative route from the point where the failure is
   detected.  Receivers use the negative route to punch holes out of
   the aggregate routes and create the specific routes by subtracting
   the negative route from the aggregates.  This method is described
   starting at Section 4.  The other method creates the specific routes
   at the point of the failure and announces them in BGP.  This method
   is described starting at Section 9.
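   As a non-normative illustration of why this works, the following
   minimal sketch (hypothetical prefixes and link names) shows
   longest-prefix-match forwarding choosing a punched-out specific
   route over the covering aggregate, so only traffic to the failed
   destinations avoids the failed link:

      # Minimal longest-prefix-match demonstration (illustrative only;
      # prefixes and link names are hypothetical).
      import ipaddress

      # Forwarding table: a multipath aggregate plus one specific route
      # created after a failure made 10.1.5.0/24 unreachable via link3.
      fib = {
          ipaddress.ip_network("10.1.0.0/16"): ["link1", "link2", "link3"],
          ipaddress.ip_network("10.1.5.0/24"): ["link1", "link2"],
      }

      def lookup(addr):
          """Return the next hops of the longest matching prefix."""
          dest = ipaddress.ip_address(addr)
          matches = [net for net in fib if dest in net]
          return fib[max(matches, key=lambda net: net.prefixlen)]

      print(lookup("10.1.7.9"))  # ['link1', 'link2', 'link3'] (aggregate)
      print(lookup("10.1.5.9"))  # ['link1', 'link2'] (hole punched)

   Traffic to all other destinations in the aggregate continues to use
   every northbound link.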
3.  Problems with negative routes

   -  Massive failures can cause lots of negative routes and overwhelm
      the switches.

   -  In order for a switch to know what has failed, it must know what
      is supposed to be up.  Knowing this requires either an error-
      prone algorithm or an error-prone configuration.

   -  During certain network events that cause multiple routes to be
      sent and/or withdrawn, the messages may race each other and cause
      transient loss of connectivity to paths that were otherwise
      unaffected by the event.  This occurs in link state routing
      protocols as well.

   -  Computation of forwarding table entries may consume a lot of CPU
      time in pathological cases.  However, even in pathological cases,
      this is still much less CPU time than it takes to compute an SPF
      over a million links.

4.  Use of a negative route in BGP

   Three new BGP well known communities are defined:

   -  Hole-Punch: A route with this community can punch a hole out of
      another route with a shorter netmask that covers the address
      space of this route.

   -  Punch-Accept: A route with this community can have holes punched
      out of it by Hole-Punch routes.

   -  Do-not-Aggregate: Do not aggregate this route.

   A fabric switch will aggregate routes learned from neighbors to its
   south.  It must know all the routes that are expected to complete
   the aggregate.  It will announce the aggregate with the Punch-Accept
   community.  If any of the routes that are expected to complete the
   aggregate are missing, then it will announce those missing routes
   with the Hole-Punch and Do-not-Aggregate communities along with the
   aggregate route.
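   The following non-normative sketch (hypothetical names; the actual
   community code points would be assigned by IANA) illustrates this
   sender-side rule: the aggregate is announced with Punch-Accept, and
   every expected-but-missing component is announced with Hole-Punch
   and Do-not-Aggregate:

      # Sender-side sketch of the rule above (illustrative only; the
      # community names stand in for code points assigned by IANA).
      HOLE_PUNCH = "Hole-Punch"
      PUNCH_ACCEPT = "Punch-Accept"
      DO_NOT_AGGREGATE = "Do-not-Aggregate"

      def announcements(aggregate, expected, received):
          """Build the routes to announce northbound.

          aggregate -- the configured aggregate prefix
          expected  -- set of prefixes needed to complete the aggregate
          received  -- set of prefixes actually learned from the south
          """
          out = [(aggregate, {PUNCH_ACCEPT})]
          for prefix in sorted(expected - received):
              # A missing component becomes a negative route: it punches
              # a hole out of the aggregate and must not be re-aggregated.
              out.append((prefix, {HOLE_PUNCH, DO_NOT_AGGREGATE}))
          return out

      # Example: one component is missing after a southbound failure.
      expected = {"10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24"}
      received = {"10.1.0.0/24", "10.1.2.0/24"}
      for prefix, comms in announcements("10.1.0.0/16", expected, received):
          print(prefix, sorted(comms))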
   A receiver of a route with the Hole-Punch community will give it a
   lower than normal local preference and will search the BGP table for
   other routes with the following properties:

   -  a shorter netmask than this route,

   -  covers the address space of this route,

   -  has the Punch-Accept community,

   -  is installed in the Routing Table.

   This is the candidate set.  Then, it will remove any routes that
   have a shorter netmask than the route with the longest netmask in
   the set.  The final candidate set of routes will all have the same
   prefix.  For each route in the candidate set, BGP will create a new
   route with the same prefix as the Hole-Punch route and the same
   attributes as the Punch-Accept route.  This new route is called a
   chad route.  If a route has an MPLS label, then the label is
   considered part of the attributes, not part of the prefix.

   Chad routes will take part in bestpath and multipath selection.  If
   a chad route becomes a bestpath or a multipath, it will be installed
   in the Routing Table.  However, chad routes are not advertised by
   default.  That means if a chad route is bestpath and other routes
   exist for the same prefix, then no route is advertised for that
   prefix.

   If a chad route has the same nexthop (and MPLS label, if labels are
   used) as a Hole-Punch route of the same prefix, then the chad route
   becomes hidden.  Hidden means that it cannot take part in route
   selection.

5.  Implementation Notes to Reduce CPU Time Consumption

   This section is not normative.

   When a Punch-Accept route is received, BGP needs to scan a subtree
   of the BGP prefix table rooted at the prefix of the Punch-Accept
   route to look for Hole-Punch routes that might create chad routes
   from it.  That subtree could be large.  To reduce the number of
   routes to scan, a separate prefix table is created to store copies
   of the Hole-Punch routes.  The number of Hole-Punch routes is
   expected to be much smaller than the total number of routes.  That
   makes the scan much quicker.  The Hole-Punch routes must
   additionally be stored in the regular BGP route table.
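   The following non-normative sketch (illustrative types and names
   only) shows the receiver-side derivation of chad routes described in
   Section 4: collect installed Punch-Accept routes with a shorter
   netmask that cover the Hole-Punch prefix, keep only those with the
   longest netmask, and mint chad routes that carry the Punch-Accept
   attributes with the Hole-Punch prefix.  Chads that would share the
   Hole-Punch route's nexthop are hidden, which the sketch models by
   omitting them:

      # Receiver-side sketch of chad route creation (Section 4).
      # All names are illustrative; this is not a normative algorithm.
      import ipaddress
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class Route:
          prefix: str            # e.g. "10.1.0.0/16"
          nexthop: str
          communities: frozenset
          installed: bool = True

      def chad_routes(hole_punch, bgp_table):
          hp_net = ipaddress.ip_network(hole_punch.prefix)
          # Candidate set: shorter netmask, covers the Hole-Punch
          # prefix, carries Punch-Accept, installed in the Routing Table.
          candidates = [
              r for r in bgp_table
              if "Punch-Accept" in r.communities and r.installed
              and ipaddress.ip_network(r.prefix).prefixlen < hp_net.prefixlen
              and ipaddress.ip_network(r.prefix).supernet_of(hp_net)
          ]
          if not candidates:
              return []
          # Keep only the candidates with the longest netmask.
          longest = max(ipaddress.ip_network(r.prefix).prefixlen
                        for r in candidates)
          candidates = [r for r in candidates
                        if ipaddress.ip_network(r.prefix).prefixlen == longest]
          # Chad route: Hole-Punch prefix, Punch-Accept attributes.  A
          # chad sharing the Hole-Punch nexthop is hidden (omitted here).
          return [Route(hole_punch.prefix, r.nexthop, r.communities)
                  for r in candidates if r.nexthop != hole_punch.nexthop]

      table = [
          Route("10.1.0.0/16", "spine1", frozenset({"Punch-Accept"})),
          Route("10.1.0.0/16", "spine2", frozenset({"Punch-Accept"})),
      ]
      hp = Route("10.1.5.0/24", "spine2", frozenset({"Hole-Punch"}))
      for chad in chad_routes(hp, table):
          print(chad.prefix, "via", chad.nexthop)  # 10.1.5.0/24 via spine1

   In the example, the hole punched by spine2 leaves a chad route that
   steers traffic for 10.1.5.0/24 to spine1 only.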
6.  Smooth Startup and Avoidance of Too Many Negative Routes

   When several switches of a data center fabric start up at the same
   time, many negative routes can be transiently created before the
   whole system is up.

   When the BGP process starts, it will typically start in receive-only
   mode for some time, then perform route selection and send out its
   own updates.  To ensure a smooth startup of the data center when
   many nodes start at the same time, the startup sequence is modified
   as follows.

   -  All BGP speakers SHOULD send EOR after sending all routes after
      the BGP session becomes established.

   -  When all southbound configured BGP neighbors have sent their EOR,
      the BGP speaker will perform route selection, send all updates to
      the northbound neighbors and then send EOR.  If some southbound
      neighbors cannot establish, a timer will be used to prevent
      waiting forever.

   -  After the previous step completes, when all northbound configured
      BGP neighbors have sent their EOR, the BGP speaker will perform
      route selection, send all updates to the southbound neighbors and
      then send EOR.  If some northbound neighbors cannot establish, a
      timer will be used to prevent waiting forever.

   -  If the number of received negative routes causes too many
      forwarding entries, then BGP can look for aggregate routes that
      are accompanied by many Hole-Punch routes and invalidate some of
      the aggregate routes and their accompanying Hole-Punch routes.
      If the number of received negative routes is too large to hold in
      the BGP table, then BGP can shut down the neighbor sessions that
      are sending the most negative routes.

7.  Avoidance of Transients

   If one event were to cause both an aggregate and a Hole-Punch route
   to be announced at the same time, but the Hole-Punch route were to
   arrive late, a transient could result.  The following rules prevent
   that.

   -  It is common practice for aggregate routes to be withdrawn when
      no components of the aggregate exist.  Hole-Punch routes need to
      always be announced, even if the aggregate is not.

   -  After a BGP session establishes, no routes that are received from
      it should be installed in the RIB until the EOR is received from
      that session.

   -  If overlapping Hole-Punch routes need to be updated and
      withdrawn, then the updates must be sent before the withdraws.

   -  If overlapping Hole-Punch and Punch-Accept routes need to be
      updated, then the Hole-Punch routes must be updated first.

   -  If overlapping Hole-Punch and Punch-Accept routes need to be
      withdrawn, then the Punch-Accept routes must be withdrawn first.

8.  Configuration

   All the BGP sessions need to be configured on each switch.  The BGP
   sessions need to be configured as northbound or southbound.  The
   routes that are expected to complete an aggregate route must be
   configured.

   A companion document describes a protocol that can discover and
   configure the entire fabric.  If that companion document is used,
   then no IP addresses, tier designations or any other location
   dependent configuration is required on the switches.

9.  South Triggered Automatic Disaggregation (STAD)

   In this method, a node that is south of a failed link or node
   announces its prefix(es) along alternative links with a hint to
   trigger automatic disaggregation or inhibit their suppression on
   upstream tier nodes.  These disaggregated or unsuppressed routes
   traverse along redundant paths and disjoint planes to switches in
   other clusters in the topology, where they are used in forwarding.

   The hint is in the form of a well known BGP community.  A few new
   well known communities are used in this scheme.

   -  Do-not-Aggregate: Do not aggregate this route.

   -  Tier: An Extended Community identifying the tier of the
      originated route.

   -  Dis-Aggregate: Triggers announcement of more specific routes at
      the receiving node.

   The techniques in this draft assume a Clos topology of the form
   described in Figure 3 of [RFC7938], where an access switch such as a
   TOR forms the lowest tier and is connected to multiple northbound
   upper tier switches, which in turn are connected to multiple upper
   tier switches, forming disjoint planes across the topology with
   fan-outs.

   A figure illustrating the topology will be added in a subsequent
   version.

   Upon a link failure, the node south of the failure announces its
   prefix on its other northbound BGP sessions with the
   Do-not-Aggregate community.

   A higher tier node that receives a route with a Do-not-Aggregate
   community will not suppress this route when there is a local
   covering aggregate, but will propagate it further as is.

   This procedure enables the more specific route to reach the
   appropriate tier switches in other clusters where the topology fans
   out on multiple northbound links.  The received paths for the more
   specific prefix form a multipath excluding the links that would lead
   to failed paths in the topology.

   A route that is advertised with the Do-not-Aggregate community as
   per this section will also add a Tier extended community.  If this
   extended community is present, then the Do-not-Aggregate community
   is only applicable at tiers that are more north than the tier
   indicated in the extended community.

   The Tier Extended Community ensures that the unsuppressed specific
   routes do not propagate further beyond the corresponding fan-out
   points in the other clusters.

   If all the northbound links or BGP sessions at a node have failed,
   then the node will announce its southbound route with the
   Dis-Aggregate community.  This signals all its south-side nodes to
   advertise their north-bound routes with the Do-not-Aggregate
   community along the other north-bound links.

   These techniques are applicable to any tier in the topology.

   At the lowest tier, if there are servers that are attached to more
   than one fabric switch (e.g., a TOR), then the host routes (or
   configured more-specific routes) for the server are not aggregated
   by the TOR to its connected upper tier switches.  In this case,
   these routes are aggregated by the upper-tier switches towards the
   rest of the topology.
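   The following non-normative sketch (the tier numbering, community
   names and suppression model are assumptions made for illustration)
   shows the receiver-side decision of whether a local covering
   aggregate may suppress a more specific route under these rules:

      # Receiver-side STAD sketch (Section 9).  Illustrative only.
      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class StadRoute:
          prefix: str
          do_not_aggregate: bool = False
          tier: Optional[int] = None   # Tier extended community, if any

      def suppress_by_aggregate(route, my_tier, has_covering_aggregate):
          """Return True if the more specific route should be suppressed
          in favor of the local covering aggregate at this node.

          Tiers are numbered so that larger numbers are further north.
          """
          if not has_covering_aggregate:
              return False              # nothing to suppress into
          if not route.do_not_aggregate:
              return True               # normal aggregation applies
          if route.tier is None:
              return False              # unconditional Do-not-Aggregate
          # Do-not-Aggregate applies only at tiers more north than the
          # Tier community, so the specific stops at the fan-out points.
          return my_tier <= route.tier

      r = StadRoute("10.1.5.0/24", do_not_aggregate=True, tier=2)
      print(suppress_by_aggregate(r, 3, True))   # False: propagate as is
      print(suppress_by_aggregate(r, 2, True))   # True: aggregate again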
10.  Configuration for STAD

   Each switch has the notion of northbound and southbound sessions or
   links.  In addition, it is assigned to a tier in the hierarchy.  The
   switch uses this configuration to drive the procedures described in
   the section above.  A switch at the lowest tier (e.g., a TOR) will
   have server subnet prefixes configured.  Switches at higher tiers
   have aggregates configured.

11.  Security Considerations

   TBD

12.  IANA Considerations

   TBD

13.  Acknowledgements

14.  References

14.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

14.2.  Informative References

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

Authors' Addresses

   Jakob Heitz
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: jheitz@cisco.com

   Dhananjaya Rao
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: dhrao@cisco.com