idnits 2.17.1 

draft-ietf-spring-segment-routing-msdc-11.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 29, 2018) is 1975 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '16000' on line 252

  -- Looks like a reference, but probably isn't: '23999' on line 252

  -- Looks like a reference, but probably isn't: '1000' on line 734

  -- Looks like a reference, but probably isn't: '1999' on line 734

  -- Looks like a reference, but probably isn't: '2000' on line 735

  -- Looks like a reference, but probably isn't: '2999' on line 735

  == Unused Reference: 'RFC2119' is defined on line 883, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-26) exists of
     draft-ietf-6man-segment-routing-header-15


     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Working Group                                   C. Filsfils, Ed.
3	Internet-Draft                                                S. Previdi
4	Intended status: Informational                       Cisco Systems, Inc.
5	Expires: June 2, 2019                                           G. Dawra
6	                                                                LinkedIn
7	                                                                E. Aries
8	                                                        Juniper Networks
9	                                                             P. Lapukhov
10	                                                                Facebook
11	                                                       November 29, 2018

13	             BGP-Prefix Segment in large-scale data centers
14	               draft-ietf-spring-segment-routing-msdc-11

16	Abstract

18	   This document describes the motivation and benefits for applying
19	   segment routing in BGP-based large-scale data-centers.  It describes
20	   the design to deploy segment routing in those data-centers, for both
21	   the MPLS and IPv6 dataplanes.

23	Status of This Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at https://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on June 2, 2019.

40	Copyright Notice

42	   Copyright (c) 2018 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (https://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.  Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction
58	   2.  Large Scale Data Center Network Design Summary
59	     2.1.  Reference design
60	   3.  Some open problems in large data-center networks
61	   4.  Applying Segment Routing in the DC with MPLS dataplane
62	     4.1.  BGP Prefix Segment (BGP-Prefix-SID)
63	     4.2.  eBGP Labeled Unicast (RFC8277)
64	       4.2.1.  Control Plane
65	       4.2.2.  Data Plane
66	       4.2.3.  Network Design Variation
67	       4.2.4.  Global BGP Prefix Segment through the fabric
68	       4.2.5.  Incremental Deployments
69	     4.3.  iBGP Labeled Unicast (RFC8277)
70	   5.  Applying Segment Routing in the DC with IPv6 dataplane
71	   6.  Communicating path information to the host
72	   7.  Additional Benefits
73	     7.1.  MPLS Dataplane with operational simplicity
74	     7.2.  Minimizing the FIB table
75	     7.3.  Egress Peer Engineering
76	     7.4.  Anycast
77	   8.  Preferred SRGB Allocation
78	   9.  IANA Considerations
79	   10. Manageability Considerations
80	   11. Security Considerations
81	   12. Acknowledgements
82	   13. Contributors
83	   14. References
84	     14.1.  Normative References
85	     14.2.  Informative References
86	   Authors' Addresses

88	1.  Introduction

90	   Segment Routing (SR), as described in
91	   [I-D.ietf-spring-segment-routing], leverages the source routing
92	   paradigm.  A node steers a packet through an ordered list of
93	   instructions, called segments.  A segment can represent any
94	   instruction, topological or service-based.  A segment can have a
95	   local semantic to an SR node or global within an SR domain.  SR
96	   allows a flow to be enforced through any topological path while
97	   maintaining per-flow state only at the ingress node to the SR domain.
98	   Segment Routing can be applied to the MPLS and IPv6 data-planes.

100	   The use-cases described in this document should be considered in the
101	   context of the BGP-based large-scale data-center (DC) design
102	   described in [RFC7938].  This document extends it by applying SR with
103	   both the IPv6 and MPLS dataplanes.

105	2.  Large Scale Data Center Network Design Summary

107	   This section provides a brief summary of the informational document
108	   [RFC7938] that outlines a practical network design suitable for data-
109	   centers of various scales:

111	   o  Data-center networks have highly symmetric topologies with
112	      multiple parallel paths between two server attachment points.  The
113	      well-known Clos topology is most popular among the operators (as
114	      described in [RFC7938]).  In a Clos topology, the minimum number
115	      of parallel paths between two elements is determined by the
116	      "width" of the "Tier-1" stage.  See Figure 1 below for an
117	      illustration of the concept.

119	   o  Large-scale data-centers commonly use a routing protocol, such as
120	      BGP-4 [RFC4271], in order to provide endpoint connectivity.
121	      Recovery after a network failure is therefore driven either by
122	      local knowledge of directly available backup paths or by
123	      distributed signaling between the network devices.

125	   o  Within data-center networks, traffic is load-shared using the
126	      Equal Cost Multipath (ECMP) mechanism.  With ECMP, every network
127	      device implements a pseudo-random decision, mapping packets to one
128	      of the parallel paths by means of a hash function calculated over
129	      certain parts of the packet, typically a combination of various
130	      packet header fields.
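
   The per-flow ECMP decision described above can be sketched as follows.
   This is a minimal illustration, not a description of any particular
   device: the function name ecmp_pick, the choice of SHA-256 as the hash,
   and the sample 5-tuple are all assumptions made for the example.  Real
   forwarding hardware typically uses much cheaper hash functions.

```python
import hashlib


def ecmp_pick(flow, next_hops):
    """Map a flow (tuple of header fields) to one of the parallel next hops."""
    # Serialize the header fields into a single byte string to hash.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 32 bits of the hash to an index into the ECMP set.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical flow: (source, destination, protocol, source port, dest port).
flow = ("192.0.2.1", "192.0.2.11", 6, 49152, 443)
paths = ["Node3", "Node4"]

first = ecmp_pick(flow, paths)
second = ecmp_pick(flow, paths)
```

   Because the decision depends only on the header fields, every packet of
   the same flow follows the same path, which is what makes the mechanism
   per-flow rather than per-packet.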

132	   The following is a schematic of a five-stage Clos topology, with four
133	   devices in the "Tier-1" stage.  Notice that the number of paths between
134	   Node1 and Node12 equals four: the paths have to cross all of the
135	   Tier-1 devices.  At the same time, the number of paths between Node1
136	   and Node2 equals two, and the paths only cross Tier-2 devices.  Other
137	   topologies are possible, but for simplicity only the topologies that
138	   have a single path from Tier-1 to Tier-3 are considered below.  The
139	   rest could be treated similarly, with a few modifications to the
140	   logic.

142	2.1.  Reference design

144	                                   Tier-1
145	                                  +-----+
146	                                  |NODE |
147	                               +->|  5  |--+
148	                               |  +-----+  |
149	                       Tier-2  |           |   Tier-2
150	                      +-----+  |  +-----+  |  +-----+
151	        +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
152	        |       +-----|  3  |--+  |  6  |  +--|  9  |-----+       |
153	        |       |     +-----+     +-----+     +-----+     |       |
154	        |       |                                         |       |
155	        |       |     +-----+     +-----+     +-----+     |       |
156	        | +-----+---->|NODE |--+  |NODE |  +--|NODE |-----+-----+ |
157	        | |     | +---|  4  |--+->|  7  |--+--|  10 |---+ |     | |
158	        | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
159	        | |     | |            |           |            | |     | |
160	      +-----+ +-----+          |  +-----+  |          +-----+ +-----+
161	      |NODE | |NODE | Tier-3   +->|NODE |--+   Tier-3 |NODE | |NODE |
162	      |  1  | |  2  |             |  8  |             | 11  | |  12 |
163	      +-----+ +-----+             +-----+             +-----+ +-----+
164	        | |     | |                                     | |     | |
165	        A O     B O            <- Servers ->            Z O     O O

167	                      Figure 1: 5-stage Clos topology
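
   The path counts quoted for Figure 1 can be checked with a short
   sketch.  The link list below is an assumption: it is read off the
   figure and the forwarding tables later in this document (Node3
   attaches to Node5 and Node6, Node4 to Node7 and Node8, and
   symmetrically for Node9 and Node10).  The helper names are
   hypothetical.

```python
from collections import deque

# Links of the reference topology, as inferred from Figure 1 (assumption).
LINKS = [
    (1, 3), (1, 4), (2, 3), (2, 4),
    (3, 5), (3, 6), (4, 7), (4, 8),
    (5, 9), (6, 9), (7, 10), (8, 10),
    (9, 11), (9, 12), (10, 11), (10, 12),
]


def build_adjacency(links):
    """Build an undirected adjacency map from a list of links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def count_shortest_paths(adjacency, src, dst):
    """Count equal-cost shortest paths from src to dst with a BFS."""
    dist = {src: 0}
    ways = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                # First time this neighbor is reached: record its distance.
                dist[nbr] = dist[node] + 1
                ways[nbr] = ways[node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                # Another shortest path into an already discovered node.
                ways[nbr] += ways[node]
    return ways.get(dst, 0)


ADJACENCY = build_adjacency(LINKS)
node1_to_node12 = count_shortest_paths(ADJACENCY, 1, 12)
node1_to_node2 = count_shortest_paths(ADJACENCY, 1, 2)
```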

169	   In the reference topology illustrated in Figure 1, it is assumed:

171	   o  Each node is its own AS (Node X has AS X). 4-byte AS numbers are
172	      recommended ([RFC6793]).

174	      *  For simple and efficient route propagation filtering, Node5,
175	         Node6, Node7 and Node8 use the same AS, Node3 and Node4 use the
176	         same AS, Node9 and Node10 use the same AS.

178	      *  In case 2-byte autonomous system numbers are used, and for
179	         efficient usage of the scarce 2-byte Private Use AS pool,
180	         different Tier-3 nodes might use the same AS.

182	      *  Without loss of generality, this document simplifies these
183	         details and assumes that each node has its own AS.

185	   o  Each node peers with its neighbors with a BGP session.  If not
186	      specified, eBGP is assumed.  In a specific use-case, iBGP will be
187	      used but this will be called out explicitly in that case.

189	   o  Each node originates the IPv4 address of its loopback interface
190	      into BGP and announces it to its neighbors.

192	      *  The loopback of Node X is 192.0.2.x/32.

194	   In this document, the Tier-1, Tier-2 and Tier-3 nodes are referred to
195	   respectively as Spine, Leaf and ToR (top of rack) nodes.  When a ToR
196	   node acts as a gateway to the "outside world", it is referred to as a
197	   border node.

199	3.  Some open problems in large data-center networks

201	   The data-center network design summarized above provides means for
202	   moving traffic between hosts with reasonable efficiency.  There are
203	   a few open performance and reliability problems that arise in such a
204	   design:

206	   o  ECMP routing is most commonly realized per-flow.  This means that
207	      large, long-lived "elephant" flows may affect the performance of
208	      smaller, short-lived "mouse" flows and reduce the efficiency of per-
209	      flow load-sharing.  In other words, per-flow ECMP does not perform
210	      efficiently when the flow lifetime distribution is heavy-tailed.
211	      Furthermore, due to hash-function inefficiencies it is possible to
212	      have frequent flow collisions, where more flows get placed on one
213	      path over the others.

215	   o  Shortest-path routing with ECMP implements an oblivious routing
216	      model, which is not aware of the network imbalances.  If the
217	      network symmetry is broken, for example due to link failures,
218	      utilization hotspots may appear.  For example, if a link fails
219	      between Tier-1 and Tier-2 devices (e.g.  Node5 and Node9), Tier-3
220	      devices Node1 and Node2 will not be aware of that, since there are
221	      other paths available from the perspective of Node3.  They will
222	      continue sending roughly equal traffic to Node3 and Node4 as if
223	      the failure did not exist, which may cause a traffic hotspot.

225	   o  Isolating faults in the network with multiple parallel paths and
226	      ECMP-based routing is non-trivial due to lack of determinism.
227	      Specifically, the connections from HostA to HostB may take a
228	      different path every time a new connection is formed, thus making
229	      consistent reproduction of a failure much more difficult.  This
230	      complexity scales linearly with the number of parallel paths in
231	      the network, and stems from the random nature of path selection by
232	      the network devices.

234	   The following sections first explain how to apply SR in the DC, for
235	   the MPLS and IPv6 data-planes.

237	4.  Applying Segment Routing in the DC with MPLS dataplane

239	4.1.  BGP Prefix Segment (BGP-Prefix-SID)

241	   A BGP Prefix Segment is a segment associated with a BGP prefix.  A
242	   BGP Prefix Segment is a network-wide instruction to forward the
243	   packet along the ECMP-aware best path to the related prefix.

245	   The BGP Prefix Segment is defined as the BGP-Prefix-SID Attribute in
246	   [I-D.ietf-idr-bgp-prefix-sid] which contains an index.  Throughout
247	   this document the BGP Prefix Segment Attribute is referred to as the
248	   BGP-Prefix-SID and the encoded index as the label-index.

250	   In this document, the network design decision has been made to assume
251	   that all the nodes are allocated the same SRGB (Segment Routing
252	   Global Block), e.g. [16000, 23999].  This provides operational
253	   simplification as explained in Section 8, but this is not a
254	   requirement.

256	   For illustration purposes, when considering an MPLS data-plane, it is
257	   assumed that the label-index allocated to prefix 192.0.2.x/32 is x.
258	   As a result, a local label (16000+x) is allocated for prefix
259	   192.0.2.x/32 by each node throughout the DC fabric.
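
   The label derivation above is simple arithmetic and can be sketched
   as follows.  The function name local_label and the range check are
   assumptions added for illustration; the SRGB bounds are the example
   values used in this document.

```python
# Example SRGB used throughout this document.
SRGB_START, SRGB_END = 16000, 23999


def local_label(label_index, srgb_start=SRGB_START, srgb_end=SRGB_END):
    """Derive the local label as SRGB start plus the received label-index."""
    label = srgb_start + label_index
    if not srgb_start <= label <= srgb_end:
        # The index points outside the block, so no SRGB label can be derived.
        raise ValueError(f"label-index {label_index} does not fit in the SRGB")
    return label


# Label for 192.0.2.11/32, whose label-index is 11.
node11_label = local_label(11)
```

   With a homogeneous SRGB, every node computes the same value for a
   given label-index, which is why the same label identifies the prefix
   throughout the fabric.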

261	   When IPv6 data-plane is considered, it is assumed that Node X is
262	   allocated IPv6 address (segment) 2001:DB8::X.

264	4.2.  eBGP Labeled Unicast (RFC8277)

266	   Referring to Figure 1 and [RFC7938], the following design
267	   modifications are introduced:

269	   o  Each node peers with its neighbors via an eBGP session with
270	      extensions defined in [RFC8277] (named "eBGP8277" throughout this
271	      document) and with the BGP-Prefix-SID attribute extension as
272	      defined in [I-D.ietf-idr-bgp-prefix-sid].

274	   o  The forwarding plane at Tier-2 and Tier-1 is MPLS.

276	   o  The forwarding plane at Tier-3 is either IP2MPLS (if the host
277	      sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
278	      encapsulated traffic).

280	   Figure 2 zooms into a path from server A to server Z within the
281	   topology of Figure 1.

283	                      +-----+     +-----+     +-----+
284	          +---------->|NODE |     |NODE |     |NODE |
285	          |           |  4  |--+->|  7  |--+--|  10 |---+
286	          |           +-----+     +-----+     +-----+   |
287	          |                                             |
288	      +-----+                                         +-----+
289	      |NODE |                                         |NODE |
290	      |  1  |                                         | 11  |
291	      +-----+                                         +-----+
292	        |                                              |
293	        A                    <- Servers ->             Z

295	          Figure 2: Path from A to Z via nodes 1, 4, 7, 10 and 11

297	   Referring to Figure 1 and Figure 2 and assuming the IP address with
298	   the AS and label-index allocation previously described, the following
299	   sections detail the control plane operation and the data plane states
300	   for the prefix 192.0.2.11/32 (loopback of Node11).

302	4.2.1.  Control Plane

304	   Node11 originates 192.0.2.11/32 in BGP and allocates to it a BGP-
305	   Prefix-SID with label-index: index11 [I-D.ietf-idr-bgp-prefix-sid].

307	   Node11 sends the following eBGP8277 update to Node10:

309	   . IP Prefix:  192.0.2.11/32
310	   . Label: Implicit-Null
311	   . Next-hop: Node11's interface address on the link to Node10
312	   . AS Path: {11}
313	   . BGP-Prefix-SID: Label-Index 11

315	   Node10 receives the above update.  As it is SR capable, Node10 is
316	   able to interpret the BGP-Prefix-SID and hence understands that it
317	   should allocate the label from its own SRGB block, offset by the
318	   Label-Index received in the BGP-Prefix-SID (16000+11 hence 16011) to
319	   the NLRI instead of allocating a non-deterministic label out of a
320	   dynamically allocated portion of the local label space.  The
321	   implicit-null label in the NLRI tells Node10 that it is the
322	   penultimate hop and must pop the top label on the stack before
323	   forwarding traffic for this prefix to Node11.

325	   Then, Node10 sends the following eBGP8277 update to Node7:

327	   . IP Prefix:  192.0.2.11/32
328	   . Label: 16011
329	   . Next-hop: Node10's interface address on the link to Node7
330	   . AS Path: {10, 11}
331	   . BGP-Prefix-SID: Label-Index 11

333	   Node7 receives the above update.  As it is SR capable, Node7 is able
334	   to interpret the BGP-Prefix-SID and hence allocates the local
335	   (incoming) label 16011 (16000 + 11) to the NLRI (instead of
336	   allocating a "dynamic" local label from its label manager).  Node7
337	   uses the label in the received eBGP8277 NLRI as the outgoing label
338	   (the index is only used to derive the local/incoming label).

340	   Node7 sends the following eBGP8277 update to Node4:

342	   . IP Prefix:  192.0.2.11/32
343	   . Label: 16011
344	   . Next-hop: Node7's interface address on the link to Node4
345	   . AS Path: {7, 10, 11}
346	   . BGP-Prefix-SID: Label-Index 11

348	   Node4 receives the above update.  As it is SR capable, Node4 is able
349	   to interpret the BGP-Prefix-SID and hence allocates the local
350	   (incoming) label 16011 to the NLRI (instead of allocating a "dynamic"
351	   local label from its label manager).  Node4 uses the label in the
352	   received eBGP8277 NLRI as outgoing label (the index is only used to
353	   derive the local/incoming label).

355	   Node4 sends the following eBGP8277 update to Node1:

357	   . IP Prefix:  192.0.2.11/32
358	   . Label: 16011
359	   . Next-hop: Node4's interface address on the link to Node1
360	   . AS Path: {4, 7, 10, 11}
361	   . BGP-Prefix-SID: Label-Index 11

363	   Node1 receives the above update.  As it is SR capable, Node1 is able
364	   to interpret the BGP-Prefix-SID and hence allocates the local
365	   (incoming) label 16011 to the NLRI (instead of allocating a "dynamic"
366	   local label from its label manager).  Node1 uses the label in the
367	   received eBGP8277 NLRI as outgoing label (the index is only used to
368	   derive the local/incoming label).
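
   The sequence of updates above can be summarized in a small model.
   This is a sketch under simplifying assumptions, not a BGP
   implementation: the Update class and the originate and readvertise
   functions are hypothetical names, the next hop is modeled as a node
   number rather than an interface address, and every node is assumed to
   be SR capable with SRGB start 16000.

```python
from dataclasses import dataclass

SRGB_BASE = 16000
IMPLICIT_NULL = "Implicit-Null"


@dataclass(frozen=True)
class Update:
    prefix: str
    label: object       # an integer label, or IMPLICIT_NULL
    next_hop: int        # simplified: the advertising node's number
    as_path: tuple
    label_index: int


def originate(node, label_index):
    """The originating node advertises its loopback with Implicit-Null."""
    prefix = f"192.0.2.{node}/32"
    return Update(prefix, IMPLICIT_NULL, node, (node,), label_index)


def readvertise(node, received):
    """An SR-capable node allocates SRGB base + index as its local label,
    sets itself as next hop, and prepends its AS number to the AS path.
    The BGP-Prefix-SID label-index is propagated unchanged."""
    local = SRGB_BASE + received.label_index
    return Update(received.prefix, local, node,
                  (node,) + received.as_path, received.label_index)


# Node11 originates; Node10, Node7 and Node4 re-advertise in turn.
update = originate(11, 11)
history = [update]
for node in (10, 7, 4):
    update = readvertise(node, update)
    history.append(update)
```

   The last element of history corresponds to the update that Node4
   sends to Node1.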

370	4.2.2.  Data Plane

372	   Referring to Figure 1, and assuming all nodes apply the same
373	   advertisement rules described above and all nodes have the same SRGB
374	   (16000-23999), here are the IP/MPLS forwarding tables for prefix
375	   192.0.2.11/32 at Node1, Node4, Node7 and Node10.

377	              -----------------------------------------------
378	              Incoming label    | outgoing label | Outgoing
379	              or IP destination |                | Interface
380	              ------------------+----------------+-----------
381	                   16011        |      16011     | ECMP{3, 4}
382	                192.0.2.11/32   |      16011     | ECMP{3, 4}
383	              ------------------+----------------+-----------

385	                     Figure 3: Node1 Forwarding Table

387	              -----------------------------------------------
388	              Incoming label    | outgoing label | Outgoing
389	              or IP destination |                | Interface
390	              ------------------+----------------+-----------
391	                   16011        |      16011     | ECMP{7, 8}
392	                192.0.2.11/32   |      16011     | ECMP{7, 8}
393	              ------------------+----------------+-----------

395	                     Figure 4: Node4 Forwarding Table

397	              -----------------------------------------------
398	              Incoming label    | outgoing label | Outgoing
399	              or IP destination |                | Interface
400	              ------------------+----------------+-----------
401	                   16011        |      16011     |    10
402	                192.0.2.11/32   |      16011     |    10
403	              ------------------+----------------+-----------

405	                     Figure 5: Node7 Forwarding Table

407	              -----------------------------------------------
408	              Incoming label    | outgoing label | Outgoing
409	              or IP destination |                | Interface
410	              ------------------+----------------+-----------
411	                   16011        |      POP       |    11
412	                192.0.2.11/32   |      N/A       |    11
413	              ------------------+----------------+-----------

415	                     Figure 6: Node10 Forwarding Table
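
   The forwarding tables above can be exercised with a small walk-through.
   The sketch below is illustrative only: the TABLES structure, the
   forward function and its prefer argument (used here to pin the ECMP
   choice to the path of Figure 2) are assumptions, and the packet is
   assumed to carry at most one label.

```python
POP = "POP"
DEST = "192.0.2.11/32"

# node -> {incoming label or IP destination: (outgoing label, next hops)}
TABLES = {
    1: {16011: (16011, (3, 4)), DEST: (16011, (3, 4))},
    4: {16011: (16011, (7, 8)), DEST: (16011, (7, 8))},
    7: {16011: (16011, (10,)), DEST: (16011, (10,))},
    10: {16011: (POP, (11,)), DEST: (None, (11,))},
}


def forward(ingress, destination, prefer):
    """Follow the tables hop by hop and return (node, out label, next hop)."""
    node, key, trace = ingress, destination, []
    while node in TABLES:
        out_label, next_hops = TABLES[node][key]
        # Stand-in for the ECMP hash: take a preferred member if present.
        next_hop = next((h for h in next_hops if h in prefer), next_hops[0])
        trace.append((node, out_label, next_hop))
        # After a pop (or plain IP forwarding) the next lookup is on the IP
        # destination; otherwise it is on the outgoing label.
        key = destination if out_label in (POP, None) else out_label
        node = next_hop
    return trace


# An IP packet from server A enters at Node1 and follows Figure 2.
trace = forward(1, DEST, prefer={4, 7, 10, 11})
```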

417	4.2.3.  Network Design Variation

419	   A network design choice could consist of switching all the traffic
420	   through Tier-1 and Tier-2 as MPLS traffic.  In this case, one could
421	   filter away the IP entries at Node4, Node7 and Node10.  This might be
422	   beneficial in order to optimize the forwarding table size.

424	   A network design choice could consist of allowing the hosts to send
425	   MPLS-encapsulated traffic based on the Egress Peer Engineering (EPE)
426	   use-case as defined in [I-D.ietf-spring-segment-routing-central-epe].
427	   For example, applications at HostA would send their Z-destined
428	   traffic to Node1 with an MPLS label stack where the top label is
429	   16011 and the next label is an EPE peer segment
430	   ([I-D.ietf-spring-segment-routing-central-epe]) at Node11 directing
431	   the traffic to Z.

433	4.2.4.  Global BGP Prefix Segment through the fabric

435	   When the previous design is deployed, the operator enjoys global BGP-
436	   Prefix-SID and label allocation throughout the DC fabric.

438	   A few examples follow:

440	   o  Normal forwarding to Node11: a packet with top label 16011
441	      received by any node in the fabric will be forwarded along the
442	      ECMP-aware BGP best-path towards Node11 and the label 16011 is
443	      penultimate-popped at Node10 (or at Node9).

445	   o  Traffic-engineered path to Node11: an application on a host behind
446	      Node1 might want to restrict its traffic to paths via the Spine
447	      node Node5.  The application achieves this by sending its packets
448	      with a label stack of {16005, 16011}.  BGP-Prefix-SID 16005 directs
449	      the packet up to Node5 along the path (Node1, Node3, Node5).  BGP-
450	      Prefix-SID 16011 then directs the packet down to Node11 along the
451	      path (Node5, Node9, Node11).
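
   The traffic-engineered example can be reproduced with a sketch that
   resolves each label of the stack to its node and follows a shortest
   path to it.  The link list is inferred from Figure 1, and the names
   shortest_path and steer are hypothetical.  The sketch returns a single
   shortest path per segment instead of modeling ECMP; in this example
   each segment happens to have exactly one shortest path.

```python
from collections import deque

SRGB_BASE = 16000

# Links of the reference topology, as inferred from Figure 1 (assumption).
LINKS = [
    (1, 3), (1, 4), (2, 3), (2, 4),
    (3, 5), (3, 6), (4, 7), (4, 8),
    (5, 9), (6, 9), (7, 10), (8, 10),
    (9, 11), (9, 12), (10, 11), (10, 12),
]

ADJACENCY = {}
for a, b in LINKS:
    ADJACENCY.setdefault(a, set()).add(b)
    ADJACENCY.setdefault(b, set()).add(a)


def shortest_path(src, dst):
    """Return one shortest path from src to dst, found with a BFS."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nbr in sorted(ADJACENCY[node]):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def steer(ingress, label_stack):
    """Visit the node of each label in turn (label = SRGB base + node index)."""
    path = [ingress]
    for label in label_stack:
        target = label - SRGB_BASE
        path += shortest_path(path[-1], target)[1:]
    return path


te_path = steer(1, [16005, 16011])
```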

453	4.2.5.  Incremental Deployments

455	   The design previously described can be deployed incrementally.  Let
456	   us assume that Node7 does not support the BGP-Prefix-SID and let us
457	   show how the fabric connectivity is preserved.

459	   From a signaling viewpoint, nothing would change: even though Node7
460	   does not support the BGP-Prefix-SID, it does propagate the attribute
461	   unmodified to its neighbors.

463	   From a label allocation viewpoint, the only difference is that Node7
464	   would allocate a dynamic (random) label to the prefix 192.0.2.11/32
465	   (e.g. 12345) instead of the "hinted" label as instructed by the BGP-
466	   Prefix-SID.  The neighbors of Node7 adapt automatically as they
467	   always use the label in the BGP8277 NLRI as outgoing label.

469	   Node4 does understand the BGP-Prefix-SID and hence allocates the
470	   indexed label in the SRGB (16011) for 192.0.2.11/32.

472	   As a result, all the data-plane entries across the network would be
473	   unchanged except the entries at Node7 and its neighbor Node4 as shown
474	   in the figures below.

476	   The key point is that the end-to-end Label Switched Path (LSP) is
477	   preserved because the outgoing label is always derived from the
478	   received label within the BGP8277 NLRI.  The index in the BGP-Prefix-
479	   SID is only used as a hint on how to allocate the local label (the
480	   incoming label) but never for the outgoing label.

482	                ------------------------------------------
483	                Incoming label     | outgoing | Outgoing
484	                or IP destination  |  label   | Interface
485	                -------------------+----------------------
486	                     12345         |  16011   |   10

488	                     Figure 7: Node7 Forwarding Table

490	                ------------------------------------------
491	                Incoming label     | outgoing | Outgoing
492	                or IP destination  |  label   | Interface
493	                -------------------+----------------------
494	                     16011         |  12345   |   7

496	                     Figure 8: Node4 Forwarding Table

498	   The BGP-Prefix-SID can thus be deployed incrementally one node at a
499	   time.

501	   When deployed together with a homogeneous SRGB (same SRGB across the
502	   fabric), the operator incrementally enjoys the global prefix segment
503	   benefits as the deployment progresses through the fabric.
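
   The incremental-deployment argument can be checked with a short
   sketch.  It assumes the path Node1, Node4, Node7, Node10 towards
   Node11, models a node without BGP-Prefix-SID support by giving it a
   dynamic label, and uses hypothetical names (build_lsp,
   dynamic_labels).

```python
SRGB_BASE = 16000
POP = "POP"


def build_lsp(chain, label_index, dynamic_labels=None):
    """Return {node: (incoming label, outgoing label)} along the chain.

    chain lists the nodes from ingress to penultimate hop.  Nodes found in
    dynamic_labels do not understand the BGP-Prefix-SID and use the given
    dynamic label; all other nodes use SRGB base + label-index.
    """
    dynamic_labels = dynamic_labels or {}
    local = {node: dynamic_labels.get(node, SRGB_BASE + label_index)
             for node in chain}
    entries = {}
    for position, node in enumerate(chain):
        if position + 1 < len(chain):
            # The outgoing label is always the label advertised by the
            # downstream neighbor, never a value derived from the index.
            outgoing = local[chain[position + 1]]
        else:
            # The egress node advertised Implicit-Null: pop the label.
            outgoing = POP
        entries[node] = (local[node], outgoing)
    return entries


all_sr = build_lsp([1, 4, 7, 10], 11)
node7_legacy = build_lsp([1, 4, 7, 10], 11, dynamic_labels={7: 12345})
```

   In the second case only the entries of Node7 and its upstream
   neighbor Node4 differ from the homogeneous case, and each outgoing
   label still matches the incoming label of the next node, so the LSP
   remains continuous.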

505	4.3.  iBGP Labeled Unicast (RFC8277)

507	   Exactly the same design as for eBGP8277 is used, with the following
508	   modifications:

510	      All nodes use the same AS number.

512	      Each node peers with its neighbors via an internal BGP session
513	      (iBGP) with extensions defined in [RFC8277] (named "iBGP8277"
514	      throughout this document).

516	      Each node acts as a route-reflector for each of its neighbors,
517	      with the next-hop-self option.  Next-hop-self is a well-known
518	      operational feature which consists of rewriting the next-hop of a
519	      BGP update prior to sending it to the neighbor.  It is common
520	      practice to apply next-hop-self behavior towards iBGP peers
521	      for eBGP-learned routes.  In the case outlined in this section it
522	      is proposed to use the next-hop-self mechanism also for iBGP-
523	      learned routes.

525	                                  Cluster-1
526	                               +-----------+
527	                               |  Tier-1   |
528	                               |  +-----+  |
529	                               |  |NODE |  |
530	                               |  |  5  |  |
531	                    Cluster-2  |  +-----+  |  Cluster-3
532	                   +---------+ |           | +---------+
533	                   | Tier-2  | |           | |  Tier-2 |
534	                   | +-----+ | |  +-----+  | | +-----+ |
535	                   | |NODE | | |  |NODE |  | | |NODE | |
536	                   | |  3  | | |  |  6  |  | | |  9  | |
537	                   | +-----+ | |  +-----+  | | +-----+ |
538	                   |         | |           | |         |
539	                   |         | |           | |         |
540	                   | +-----+ | |  +-----+  | | +-----+ |
541	                   | |NODE | | |  |NODE |  | | |NODE | |
542	                   | |  4  | | |  |  7  |  | | |  10 | |
543	                   | +-----+ | |  +-----+  | | +-----+ |
544	                   +---------+ |           | +---------+
545	                               |           |
546	                               |  +-----+  |
547	                               |  |NODE |  |
548	             Tier-3            |  |  8  |  |         Tier-3
549	         +-----+ +-----+       |  +-----+  |      +-----+ +-----+
550	         |NODE | |NODE |       +-----------+      |NODE | |NODE |
551	         |  1  | |  2  |                          | 11  | |  12 |
552	         +-----+ +-----+                          +-----+ +-----+

554	         Figure 9: iBGP Sessions with Reflection and Next-Hop-Self

556	      For simple and efficient route propagation filtering and as
557	      illustrated in Figure 9:

559	         Node5, Node6, Node7 and Node8 use the same Cluster ID (Cluster-
560	         1)
561	         Node3 and Node4 use the same Cluster ID (Cluster-2)

563	         Node9 and Node10 use the same Cluster ID (Cluster-3)

565	      The control-plane behavior is mostly the same as described in the
566	      previous section: the only difference is that the eBGP8277 path
567	      propagation is simply replaced by an iBGP8277 path reflection with
568	      next-hop changed to self.

570	      The data-plane tables are exactly the same.

572	5.  Applying Segment Routing in the DC with IPv6 dataplane

574	   The design described in [RFC7938] is reused with one single
575	   modification.  It is highlighted using the example of the
576	   reachability to Node11 via spine node Node5.

578	   Node5 originates 2001:DB8::5/128 with the attached BGP-Prefix-SID
579	   advertising the support of the SRH for IPv6 packets destined to
580	   segment 2001:DB8::5 ([I-D.ietf-idr-bgp-prefix-sid]).

582	   Node11 originates 2001:DB8::11/128 with the attached BGP-Prefix-SID
583	   advertising the support of the SRH for IPv6 packets destined to
584	   segment 2001:DB8::11.

586	   The control-plane and data-plane processing of all the other nodes in
587	   the fabric is unchanged.  Specifically, the routes to 2001:DB8::5 and
588	   2001:DB8::11 are installed in the FIB along the eBGP best-path to
589	   Node5 (spine node) and Node11 (ToR node) respectively.

591	   An application on HostA that needs to send traffic to HostZ via only
592	   Node5 (spine node) can do so by sending IPv6 packets with a Segment
593	   Routing header (SRH, [I-D.ietf-6man-segment-routing-header]).  The
594	   destination address, which is the active segment, is set to
595	   2001:DB8::5.  The next and last segment is set to 2001:DB8::11.
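
	   The following minimal Python sketch illustrates how the host could
	   derive the destination address and the SRH fields for this example.
	   It assumes the reverse-order segment list encoding of the SRH
	   specification (Segment List[0] holds the last segment); the
	   function names are hypothetical and are not part of any API.

```python
import ipaddress


def build_srh_state(path):
    """Return (destination, segment_list, segments_left) for an SR path.

    `path` lists the segments in the order they are visited.  The SRH is
    assumed to encode them in reverse order, so segment_list[0] is the
    last segment of the path.
    """
    segments = [ipaddress.IPv6Address(s) for s in path]
    segment_list = list(reversed(segments))
    segments_left = len(segments) - 1          # index of the active segment
    destination = segment_list[segments_left]  # active segment goes in the DA
    return destination, segment_list, segments_left


def advance(segment_list, segments_left):
    """Processing at a segment endpoint: activate the next segment."""
    segments_left -= 1
    return segment_list[segments_left], segments_left


# HostA steers traffic via Node5 (2001:DB8::5) towards Node11 (2001:DB8::11).
da, seg_list, sl = build_srh_state(["2001:DB8::5", "2001:DB8::11"])
print(da, [str(s) for s in seg_list], sl)

# At Node5 the next (and last) segment becomes the destination address.
next_da, next_sl = advance(seg_list, sl)
print(next_da, next_sl)
```

	   All other nodes simply forward on the destination address, which
	   is why their control-plane and data-plane processing is unchanged.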

597	   The application must only use IPv6 addresses that have been
598	   advertised as capable of SRv6 segment processing (e.g., those for
599	   which the BGP prefix segment capability has been advertised).  How
600	   applications learn this (e.g., from a centralized controller and
601	   orchestration) is outside the scope of this document.

603	6.  Communicating path information to the host

605	   There are two general methods for communicating path information to
606	   the end-hosts: "proactive" and "reactive", aka "push" and "pull"
607	   models.  There are multiple ways to implement either of these
608	   methods.  One way is to use a centralized controller: the
609	   controller either tells the hosts the prefix-to-path mappings
610	   beforehand and updates them as needed (network-event-driven push),
611	   or responds to hosts requesting a path to a specific destination
612	   (host-event-driven pull).  It is also possible to use a hybrid
613	   model, i.e., pushing some state from the controller in response to
614	   particular network events, while the host pulls other state on
615	   demand.

617	   Note also that when disseminating network-related data to the
618	   end-hosts, a trade-off is made between the amount of information
619	   disseminated and the level of visibility into the network state.
620	   This applies to both push and pull models.  In one extreme case,
621	   the host would request path information on every flow and keep no
622	   local state at all.  At the other end of the spectrum, information
623	   for every prefix in the network along with available paths could be
624	   pushed and continuously updated on all hosts.
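
	   As an illustration only, the hybrid model can be sketched in Python
	   as follows.  The class names, the "HostZ" key and the segment lists
	   are assumptions made for this sketch; the document does not define
	   a controller interface.

```python
class Controller:
    """Hypothetical centralized controller holding prefix-to-path mappings."""

    def __init__(self, paths):
        self.paths = dict(paths)   # destination -> segment list (tuple)
        self.hosts = []

    def attach(self, host):
        self.hosts.append(host)

    def request_path(self, prefix):
        # Pull: a host asks for the path to a specific destination.
        return self.paths[prefix]

    def network_event(self, prefix, new_path):
        # Push: update only the hosts that already hold state for `prefix`.
        self.paths[prefix] = new_path
        for host in self.hosts:
            if prefix in host.cache:
                host.cache[prefix] = new_path


class Host:
    """End-host that pulls on a cache miss and accepts pushed updates."""

    def __init__(self, controller):
        self.controller = controller
        self.cache = {}
        controller.attach(self)

    def path_to(self, prefix):
        if prefix not in self.cache:
            self.cache[prefix] = self.controller.request_path(prefix)
        return self.cache[prefix]


controller = Controller({"HostZ": (16011,)})
host_a = Host(controller)
first = host_a.path_to("HostZ")                    # pulled on demand
controller.network_event("HostZ", (16005, 16011))  # pushed by the controller
second = host_a.path_to("HostZ")                   # served from local state
print(first, second)
```

	   The size of the host cache is the knob for the trade-off discussed
	   above: an empty cache corresponds to pulling on every flow, while a
	   cache pre-filled for every prefix corresponds to a pure push model.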

626	7.  Additional Benefits

628	7.1.  MPLS Dataplane with operational simplicity

630	   As required by [RFC7938], no new signaling protocol is introduced.
631	   The BGP-Prefix-SID is a lightweight extension to BGP Labeled Unicast
632	   [RFC8277].  It applies either to eBGP or iBGP based designs.

634	   Specifically, LDP and RSVP-TE are not used.  These protocols would
635	   drastically impact the operational complexity of the Data Center and
636	   would not scale.  This is in line with the requirements expressed in
637	   [RFC7938].

639	   Provided the same SRGB is configured on all nodes, all nodes use the
640	   same MPLS label for a given IP prefix.  This is simpler from an
641	   operational standpoint, as discussed in Section 8.

643	7.2.  Minimizing the FIB table

645	   The designer may decide to switch all the traffic at Tier-1 and
646	   Tier-2 nodes based on MPLS, hence drastically decreasing the IP
647	   table size at these nodes.

649	   This is easily accomplished by encapsulating the traffic either
650	   directly at the host or at the source ToR node, by pushing the BGP-
651	   Prefix-SID of the destination ToR for intra-DC traffic, or the BGP-
652	   Prefix-SID of the border node for inter-DC or DC-to-outside-world
653	   traffic.
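
	   A minimal Python sketch of this label selection at the host or
	   source ToR is given below.  The subnet-to-ToR mapping and the use
	   of SID index 20 for the border nodes are assumptions chosen for
	   illustration; only the SRGB base of 16000 and the "label = SRGB
	   base + index" rule come from the examples in this document.

```python
import ipaddress

SRGB_BASE = 16000  # same SRGB on all nodes, as in this document's examples

# Hypothetical server subnets mapped to the SID index of their ToR node.
TOR_SUBNETS = {
    ipaddress.ip_network("198.51.100.0/25"): 1,     # servers behind Node1
    ipaddress.ip_network("198.51.100.128/25"): 11,  # servers behind Node11
}

BORDER_SID_INDEX = 20  # assumed SID index shared by the border nodes


def label_to_push(destination):
    """Pick the single MPLS label to push for a destination IP address."""
    address = ipaddress.ip_address(destination)
    for subnet, sid_index in TOR_SUBNETS.items():
        if address in subnet:
            # Intra-DC traffic: BGP-Prefix-SID of the destination ToR.
            return SRGB_BASE + sid_index
    # Inter-DC or outside-world traffic: BGP-Prefix-SID of the border node.
    return SRGB_BASE + BORDER_SID_INDEX


print(label_to_push("198.51.100.130"))
print(label_to_push("203.0.113.9"))
```

	   With such a scheme, the Tier-1 and Tier-2 nodes only need label
	   entries for the ToR and border nodes, not per-subnet IP routes.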

655	7.3.  Egress Peer Engineering

657	   It is straightforward to combine the design illustrated in this
658	   document with the Egress Peer Engineering (EPE) use-case described in
659	   [I-D.ietf-spring-segment-routing-central-epe].

661	   In such a case, the operator is able to engineer its outbound
662	   traffic on a per host-flow basis, without incurring any additional
663	   state at intermediate points in the DC fabric.

665	   For example, the controller only needs to inject per-flow state on
666	   HostA to force it to send its traffic destined to a specific
667	   Internet destination D via a selected border node (say Node12 in
668	   Figure 1 instead of another border node, Node11) and a specific
669	   egress peer of Node12 (say peer AS 9999 of local PeerNode segment
670	   9999 at Node12 instead of any other peer which provides a path to the
671	   destination D).  Any packet matching this state at HostA would be
672	   encapsulated with SR segment list (label stack) {16012, 9999}.  16012
673	   would steer the flow through the DC fabric, leveraging any ECMP,
674	   along the best path to border node Node12.  Once the flow gets to
675	   border node Node12, the active segment is 9999 (because of PHP on the
676	   upstream neighbor of Node12).  This EPE PeerNode segment forces
677	   border node Node12 to forward the packet to peer AS 9999, without any
678	   IP lookup at the border node.  There is no per-flow state for this
679	   engineered flow in the DC fabric.  A benefit of segment routing is
680	   that the per-flow state is only required at the source.
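
	   The label operations in this example can be traced with the
	   following Python sketch.  The transit path (Node1, Node3, Node5,
	   Node9) is one possible ECMP path assumed for illustration, and the
	   function names are hypothetical.

```python
SRGB_BASE = 16000                         # same SRGB on all nodes
EPE_PEER_LABELS = {9999: "peer AS 9999"}  # local PeerNode segment at Node12


def prefix_sid_label(index):
    return SRGB_BASE + index


def traverse_fabric(stack, transit_nodes):
    """Carry the stack across the fabric; the last transit node does PHP."""
    trace = []
    for position, node in enumerate(transit_nodes):
        if position == len(transit_nodes) - 1:
            stack = stack[1:]              # penultimate hop pops the top label
            action = "pop (PHP)"
        else:
            action = "swap to same label"  # same SRGB, so the label is unchanged
        trace.append((node, action, list(stack)))
    return stack, trace


def border_node_forward(stack):
    """Node12 forwards on the EPE label alone, without any IP lookup."""
    peer = EPE_PEER_LABELS[stack[0]]
    return peer, stack[1:]


host_stack = [prefix_sid_label(12), 9999]  # {16012, 9999} pushed by HostA
stack_at_node12, trace = traverse_fabric(
    host_stack, ["Node1", "Node3", "Node5", "Node9"])
peer, remaining = border_node_forward(stack_at_node12)

for step in trace:
    print(step)
print(peer, remaining)
```

	   No table in the sketch is keyed by flow: the fabric nodes act on
	   16012 only, and Node12 acts on 9999 only.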

682	   As well as allowing full traffic engineering control, such a design
683	   also offers FIB table minimization benefits, as the Internet-scale
684	   FIB at border node Node12 is not required if all FIB lookups are
685	   avoided there by using EPE.

687	7.4.  Anycast

689	   The design presented in this document preserves the availability and
690	   load-balancing properties of the base design presented in
691	   [I-D.ietf-spring-segment-routing].

693	   For example, one could assign an anycast loopback 192.0.2.20/32 and
694	   associate segment index 20 to it on border nodes Node11 and Node12
695	   (in addition to their node-specific loopbacks).  Doing so, the EPE
696	   controller could express a default "go-to-the-Internet via any border
697	   node" policy as segment list {16020}.  Indeed, from any host in the
698	   DC fabric or from any ToR node, 16020 steers the packet towards
699	   border node Node11 or Node12, leveraging ECMP where available along
700	   the best paths to these nodes.

702	8.  Preferred SRGB Allocation

704	   In the MPLS case, using the same SRGB at each node is recommended.

706	   Different SRGBs in each node likely increase the complexity of the
707	   solution both from an operational viewpoint and from a controller
708	   viewpoint.

710	   From an operational viewpoint, it is much simpler to have the same
711	   global label at every node for the same destination (the MPLS
712	   troubleshooting is then similar to the IPv6 troubleshooting where
713	   this global property is a given).

715	   From a controller viewpoint, this allows us to construct simple
716	   policies applicable across the fabric.

718	   Let us consider two applications A and B respectively connected to
719	   Node1 and Node2 (ToR nodes).  A has two flows FA1 and FA2 destined to
720	   Z.  B has two flows FB1 and FB2 destined to Z.  The controller wants
721	   FA1 and FB1 to be load-shared across the fabric while FA2 and FB2
722	   must be respectively steered via Node5 and Node8.

724	   Assuming a consistent unique SRGB across the fabric as described in
725	   this document, the controller can simply do so by instructing A and
726	   B to use {16011} for FA1 and FB1 respectively, and by instructing A
727	   and B to use {16005, 16011} and {16008, 16011} respectively for FA2
728	   and FB2.

730	   Let us assume a design where the SRGB is different at every node and
731	   where the SRGB of each node is advertised using the Originator SRGB
732	   TLV of the BGP-Prefix-SID as defined in
733	   [I-D.ietf-idr-bgp-prefix-sid]: SRGB of Node K starts at value K*1000
734	   and the SRGB length is 1000 (e.g., Node1's SRGB is [1000, 1999],
735	   Node2's SRGB is [2000, 2999], ...).

737	   In this case, not only would the controller need to collect and store
738	   all of these different SRGBs (e.g., through the Originator SRGB TLV
739	   of the BGP-Prefix-SID), it would also need to adapt the policy for
740	   each host.  Indeed, the controller would instruct A to use {1011}
741	   for FA1 while it would have to instruct B to use {2011} for FB1
742	   (while with the same SRGB, both policies are the same {16011}).

744	   Even worse, the controller would instruct A to use {1005, 5011} for
745	   FA2 while it would instruct B to use {2008, 8011} for FB2 (while with
746	   the same SRGB, the second segment is the same across both policies:
747	   16011).  When combining segments to create a policy, one needs to
748	   carefully update the label of each segment.  This is obviously more
749	   error-prone, more complex and more difficult to troubleshoot.
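
	   The label arithmetic behind this comparison can be checked with
	   the short Python sketch below.  It assumes, as in the examples of
	   this document, that Node K uses SID index K and that each label is
	   taken from the SRGB of the node that processes it: the ingress ToR
	   for the first label, then the endpoint of the preceding segment.
	   The function names are hypothetical.

```python
def uniform_srgb(node):
    """Same SRGB on every node, starting at 16000."""
    return 16000


def per_node_srgb(node):
    """Different SRGB per node: the SRGB of Node K starts at K * 1000."""
    return node * 1000


def label_stack(ingress, path, srgb_base):
    """Build the label stack for a host attached to ToR node `ingress`.

    `path` is the ordered list of node numbers to traverse; Node K is
    assumed to own SID index K.
    """
    stack = []
    processing_node = ingress
    for sid_index in path:
        stack.append(srgb_base(processing_node) + sid_index)
        processing_node = sid_index  # next label is read by this node
    return stack


# A is attached to Node1 and B to Node2; all flows are destined to Node11.
for name, srgb in (("uniform", uniform_srgb), ("per-node", per_node_srgb)):
    print(name,
          label_stack(1, [11], srgb),     # FA1
          label_stack(2, [11], srgb),     # FB1
          label_stack(1, [5, 11], srgb),  # FA2 via Node5
          label_stack(2, [8, 11], srgb))  # FB2 via Node8
```

	   With per-node SRGBs, B's stack for FB2 starts with 2008 because
	   its first label is read by Node2 (2000 + 8) and continues with
	   8011 because the second label is read by Node8 (8000 + 11).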

751	9.  IANA Considerations

753	   This document does not make any IANA request.

755	10.  Manageability Considerations

757	   The design and deployment guidelines described in this document are
758	   based on the network design described in [RFC7938].

760	   The deployment model assumed in this document is based on a single
761	   domain where the interconnected DCs are part of the same
762	   administrative domain (which, of course, is split into different
763	   autonomous systems).  The operator has full control of the whole
764	   domain and the usual operational and management mechanisms and
765	   procedures are used in order to prevent any information related to
766	   internal prefixes and topology from being leaked outside the domain.

768	   As recommended in [I-D.ietf-spring-segment-routing], the same SRGB
769	   should be allocated in all nodes in order to facilitate the design,
770	   deployment and operations of the domain.

772	   When EPE ([I-D.ietf-spring-segment-routing-central-epe]) is used (as
773	   explained in Section 7.3), the same operational model is assumed.
774	   EPE information is originated and propagated throughout the domain
775	   towards an internal server and, unless explicitly configured by the
776	   operator, no EPE information is leaked outside the domain boundaries.

778	11.  Security Considerations

780	   This document proposes to apply Segment Routing to a well-known
781	   scalability requirement expressed in [RFC7938] using the BGP-Prefix-
782	   SID as defined in [I-D.ietf-idr-bgp-prefix-sid].

784	   It has to be noted, as described in Section 10, that the design
785	   illustrated in [RFC7938] and in this document refers to a deployment
786	   model where all nodes are under the same administration.  In this
787	   context, it is assumed that the operator doesn't want to leak outside
788	   of the domain any information related to internal prefixes and
789	   topology.  The internal information includes prefix-sid and EPE
790	   information.  In order to prevent such leaking, the standard BGP
791	   mechanisms (filters) are applied on the boundary of the domain.

793	   Therefore, the solution proposed in this document does not introduce
794	   any additional security concerns beyond those expressed in [RFC7938]
795	   and [I-D.ietf-idr-bgp-prefix-sid].  It is assumed that the security
796	   and confidentiality of the prefix and topology information is
797	   preserved by outbound filters at each peering point of the domain as
798	   described in Section 10.

800	12.  Acknowledgements

802	   The authors would like to thank Benjamin Black, Arjun Sreekantiah,
803	   Keyur Patel, Acee Lindem and Anoop Ghanwani for their comments and
804	   review of this document.

806	13.  Contributors

808	   Gaya Nagarajan
809	   Facebook
810	   US

812	   Email: gaya@fb.com

814	   Gaurav Dawra
815	   Cisco Systems
816	   US

818	   Email: gdawra.ietf@gmail.com

820	   Dmitry Afanasiev
821	   Yandex
822	   RU

824	   Email: fl0w@yandex-team.ru

826	   Tim Laberge
827	   Cisco
828	   US

830	   Email: tlaberge@cisco.com

832	   Edet Nkposong
833	   Salesforce.com Inc.
834	   US

836	   Email: enkposong@salesforce.com

838	   Mohan Nanduri
839	   Microsoft
840	   US

842	   Email: mnanduri@microsoft.com

843	   James Uttaro
844	   ATT
845	   US

847	   Email: ju1738@att.com

849	   Saikat Ray
850	   Unaffiliated
851	   US

853	   Email: raysaikat@gmail.com

855	   Jon Mitchell
856	   Unaffiliated
857	   US

859	   Email: jrmitche@puck.nether.net

861	14.  References

863	14.1.  Normative References

865	   [I-D.ietf-idr-bgp-prefix-sid]
866	              Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A.,
867	              and H. Gredler, "Segment Routing Prefix SID extensions for
868	              BGP", draft-ietf-idr-bgp-prefix-sid-27 (work in progress),
869	              June 2018.

871	   [I-D.ietf-spring-segment-routing]
872	              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
873	              Litkowski, S., and R. Shakir, "Segment Routing
874	              Architecture", draft-ietf-spring-segment-routing-15 (work
875	              in progress), January 2018.

877	   [I-D.ietf-spring-segment-routing-central-epe]
878	              Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D.
879	              Afanasiev, "Segment Routing Centralized BGP Egress Peer
880	              Engineering", draft-ietf-spring-segment-routing-central-
881	              epe-10 (work in progress), December 2017.

883	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
884	              Requirement Levels", BCP 14, RFC 2119,
885	              DOI 10.17487/RFC2119, March 1997,
886	              <https://www.rfc-editor.org/info/rfc2119>.

888	   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
889	              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
890	              DOI 10.17487/RFC4271, January 2006,
891	              <https://www.rfc-editor.org/info/rfc4271>.

893	   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
894	              BGP for Routing in Large-Scale Data Centers", RFC 7938,
895	              DOI 10.17487/RFC7938, August 2016,
896	              <https://www.rfc-editor.org/info/rfc7938>.

898	   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
899	              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
900	              <https://www.rfc-editor.org/info/rfc8277>.

902	14.2.  Informative References

904	   [I-D.ietf-6man-segment-routing-header]
905	              Filsfils, C., Previdi, S., Leddy, J., Matsushima, S., and
906	              D. Voyer, "IPv6 Segment Routing Header (SRH)",
907	              draft-ietf-6man-segment-routing-header-15 (work in
908	              progress), October 2018.

910	   [RFC6793]  Vohra, Q. and E. Chen, "BGP Support for Four-Octet
911	              Autonomous System (AS) Number Space", RFC 6793,
912	              DOI 10.17487/RFC6793, December 2012,
913	              <https://www.rfc-editor.org/info/rfc6793>.

915	Authors' Addresses

917	   Clarence Filsfils (editor)
918	   Cisco Systems, Inc.
919	   Brussels
920	   BE

922	   Email: cfilsfil@cisco.com

924	   Stefano Previdi
925	   Cisco Systems, Inc.
926	   Italy

928	   Email: stefano@previdi.net

929	   Gaurav Dawra
930	   LinkedIn
931	   USA

933	   Email: gdawra.ietf@gmail.com

935	   Ebben Aries
936	   Juniper Networks
937	   1133 Innovation Way
938	   Sunnyvale  CA 94089
939	   US

941	   Email: exa@juniper.net

943	   Petr Lapukhov
944	   Facebook
945	   US

947	   Email: petr@fb.com