SPRING Working Group                                          A. Farrel
Internet-Draft                                                 J. Drake
Intended status: Informational                         Juniper Networks
Expires: January 1, 2018                                  June 30, 2017

   Interconnection of Segment Routing Domains - Problem Statement and
                          Solution Landscape
            draft-farrel-spring-sr-domain-interconnect-00

Abstract

Segment Routing (SR) is now a popular forwarding paradigm for use in MPLS and IPv6 networks. It is typically deployed in discrete domains that may be data centers, access networks, or other networks that are under the control of a single operator and that can easily be upgraded to support this new technology.

Traffic originating in one SR domain often terminates in another SR domain, but must transit a backbone network that provides interconnection between those domains.

This document describes a mechanism for providing connectivity between SR domains to enable end-to-end or domain-to-domain traffic engineering.
The approach described here allows connectivity between SR domains, utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing) across the backbone network, and makes heavy use of pre-existing technologies, requiring the specification of very few additional mechanisms.

This document provides some background and a problem statement, explains the solution mechanism, and provides examples. It does not define any new protocol mechanisms.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 1, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Problem Statement
   3.  Solution Technologies
     3.1.  Characteristics of Solution Technologies
   4.  Decomposing the Problem
   5.  Solution Space
     5.1.  Global Optimization of the Paths
     5.2.  Figuring Out the GWs at a Destination Domain for a Given Prefix
     5.3.  Figuring Out the Backbone Egress ASBRs
     5.4.  Making use of RSVP-TE LSPs Across the Backbone
     5.5.  Data Plane
     5.6.  Centralized and Distributed Controllers
   6.  BGP-LS Considerations
   7.  Worked Examples
   8.  Label Stack Depth Considerations
     8.1.  Worked Example
   9.  Gateway Considerations
     9.1.  Domain Gateway Auto-Discovery
     9.2.  Relationship to BGP Link State and Egress Peer Engineering
     9.3.  Advertising a Domain Route Externally
     9.4.  Encapsulations
   10. Security Considerations
   11. Management Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. Informative References
   Authors' Addresses

1. Introduction

Data Centers are a growing market sector. They are being set up by new specialist companies, by enterprises for their own use, by legacy ISPs, and by the new wave of network operators such as Microsoft and Amazon.

The networks inside Data Centers are currently well-planned, but the traffic loads can be unpredictable. There is a need to be able to direct traffic within a Data Center to follow a specific path.

Data Centers are attached to external ("backbone") networks to allow access by users and to facilitate communication among Data Centers. An individual Data Center may be attached to multiple backbone networks, and may have multiple points of attachment to each backbone network. Traffic to or from a Data Center may need to be directed to or from any of these points of attachment.

A variety of networking technologies exist and have been proposed to steer traffic within the Data Center and across the backbone networks. This document proposes an approach that builds on existing technologies to produce mechanisms that provide scalable and flexible interconnection of Data Centers, and that will be easy to operate.

Segment Routing (SR) is a new technology that places forwarding state into each packet as a stack of loose hops, as distinct from other pre-existing techniques that require signaling protocols to install state in the network. SR is a popular option for building Data Centers, and is also seeing increasing traction in edge and access networks as well as in backbone networks.

This paper describes mechanisms to provide end-to-end SR connectivity between SR-capable domains across an MPLS backbone network that supports SR and/or MPLS-TE. This is the generalization of the requirement to provide inter-Data Center connectivity.

2. Problem Statement

Consider the network in Figure 1. Without loss of generality, this figure can be used to represent the architecture and problem space for steering traffic within and between SR edge domains. The figure shows a single destination for all traffic that we will consider. In this figure we distinguish between the PEs that provide access to the backbone networks and the Gateways that provide access to the SR edge domains: these may, in fact, be the same equipment, and the PEs might be located at the domain edges.

In describing the problem space and the solution we use the following terms for network elements:

SR edge domain : A collection of SR-capable nodes in an edge network attached to the backbone network through one or more gateways. Examples include access networks and Data Center sites.

Host : A node within an edge domain. May be an end system or a transit node in the edge domain.

Gateway (GW) : Provides access to or from an edge domain. Examples are CEs, ASBRs, and Data Center gateways.

Provider Edge (PE) : Provides access to or from the backbone network.

Autonomous System Border Router (ASBR) : Provides access to one AS in the backbone network from another AS in the backbone network.
These terms can be seen in use in Figure 1, where the various sources and destinations are hosts.

   -------------------------------------------------------------------
   | |
   | AS1 |
   | ---- ---- ---- ---- |
   -|PE1a|--|PE1b|-------------------------------------|PE2a|--|PE2b|-
   ---- ---- ---- ----
   : : ------------ ------------ : :
   : : | AS2 | | AS3 | : :
   : : | ------ ------ | : :
   : : | |ASBR2a|...|ASBR3a| | : :
   : : | ------ ------ | : :
   : : | | | | : :
   : : | ------ ------ | : :
   : : | |ASBR2b|...|ASBR3b| | : :
   : : | ------ ------ | : :
   : : | | | | : :
   : ......: | ---- | | ---- | : :
   : : -|PE2a|----- -----|PE3a|- : :
   : : ---- ---- : :
   : : ......: :....... : :
   : : : : : :
   ---- ---- ---- ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|-
   | ---- ---- | | ---- ---- |
   | | | |
   | | | |
   | | | Source3 |
   | Source2 | | |
   | | | Source4 |
   | Source1 | | |
   | | | Destination |
   | | | |
   | Dom1 | | Dom2 |
   ---------------- ----------------

   Figure 1: Reference Architecture for SR Domain Interconnect

Traffic to the destination may be sourced from multiple sources within that domain (we show two such sources: Source3 and Source4). Furthermore, traffic intended for the destination may arrive from outside the domain through any of the points of attachment to the backbone networks (we show GW2a and GW2b). This traffic may need to be steered within the domain to achieve load-balancing across network resources, to avoid degraded or out-of-service resources (including planned service outages), and to achieve different qualities of service. Of course, traffic in a remote source domain may also need to be steered within that domain. We class this problem as "Intra-Domain Traffic Steering".

Traffic across the backbone networks may need to be steered to conform to common Traffic Engineering paradigms. That is, the path across any network (shown in the figure as an AS) or across any collection of networks may need to be chosen. Furthermore, the points of inter-connection between networks may need to be selected, and that selection may influence the path chosen for the data. We class this problem as "Inter-Domain Traffic Steering".

The composite end-to-end path comprises steering in the source domain, choice of source domain exit point, steering across the backbone networks, choice of network interconnections, choice of destination domain entry point, and steering in the destination domain. These issues may be inter-dependent (for example, the best traffic steering in the source domain may help select the best exit point from that domain, but the connectivity options across the backbone network may drive the selection of a different exit point). We class this combination of problems as "End-to-End Domain Interconnect Traffic Steering".

It should be noted that the solution to the End-to-End Domain Interconnect Traffic Steering problem depends on a number of factors:

o What technology is deployed in the domains.

o What technology is deployed in the backbone networks.

o How much information the domains are willing to share with each other.

o How much information the backbone network operators and the domain operators are willing to share.
In some cases, the domains and backbone networks are all owned and operated by the same company (with the backbone network often being a private network). In other cases, the domains are operated by one company, with other companies operating the backbone.

3. Solution Technologies

Within the Data Center, Segment Routing (SR from the SPRING working group in the IETF [RFC7855] and [I-D.ietf-spring-segment-routing]) is becoming a dominant solution. SR introduces traffic steering capabilities into an MPLS network [I-D.ietf-spring-segment-routing-mpls] by utilizing existing data plane capabilities (label pop and packet forwarding - "pop and go") in combination with additions to existing IGPs [I-D.ietf-ospf-segment-routing-extensions], [I-D.ietf-isis-segment-routing-extensions], BGP (as BGP-LU) [I-D.ietf-mpls-rfc3107bis], or a centralized controller to distribute "per-hop" labels. An MPLS label stack can be imposed on a packet to describe a sequence of links/nodes to be transited by the packet; as each hop is transited, the label that represents it is popped from the stack and the packet is forwarded. Thus, on a packet-by-packet basis, traffic can be steered within the Data Center network.

Note that other Data Center data plane technologies also exist. While this document focuses on connecting domains that use MPLS Segment Routing, the techniques are equally applicable to non-MPLS domains (such as those using IP, VXLAN, and NVGRE). See Section 9 for details.

This document broadens the problem space to consider interconnection of any type of edge domain. These may be Data Center sites, but they may equally be access networks, VPN sites, or any other form of domain that includes packet sources and destinations. We particularly focus on "SR edge domains" being source or destination domains that utilize SR, but the domains could use other technologies as described in Section 9.

Backbone networks are commonly based on MPLS hardware. In these networks, a number of different options exist to establish TE paths. Among these options are static LSPs (perhaps set up by an SDN controller), LSP tunnels established using a signaling protocol (such as RSVP-TE), and inter-domain use of SR (as described above for intra-domain steering). Where traffic steering (without resource reservation) is needed, SR may be adequate. Where Traffic Engineering is needed (i.e., traffic steering with resource reservation), RSVP-TE or centralized SDN control is preferred. However, in a network that is fully managed and controlled through a centralized planning tool, resource reservation can be achieved and SR can be used for full Traffic Engineering. These solutions are already used in support of a number of edge-to-edge services such as L3VPN and L2VPN.

3.1. Characteristics of Solution Technologies

Each of the solution technologies mentioned in the previous section has certain characteristics, and the combined solution needs to recognize and address the characteristics in order to make a workable solution.

o When SR is used for traffic steering, the size of the MPLS label stack used in SR scales linearly with the length of the source route. This can cause issues with MPLS implementations that only support label stacks of a limited size.
For example, some MPLS implementations cannot push enough labels on the stack to represent an entire source route. Other implementations may be unable to do the proper "ECMP hashing" if the label stack is too long; they may be unable to read enough of the packet header to find an entropy label or to find the IP header of the payload. Increasing the packet header size also reduces the size of the payload that can be carried in an MPLS packet. There are techniques that can be used to reduce the size of the label stack. For example, a single label (known as a "binding SID") can be used to represent a sequence of nodes; this label can be replaced with a set of labels when the packet reaches the first node in the sequence. It is also possible to combine SR with conventional RSVP-TE by using a binding SID in the label stack to represent an LSP tunnel set up by RSVP-TE.

o Most of the work on using SR for traffic steering assumes that traffic only needs to be steered within a single administrative domain. If the backbone consists of multiple ASes that are not part of a common administrative domain, the use of SR across the backbone may prove to be a challenge, and its use in the backbone may be limited to cases where private networks connect the domains, rather than cases where the domains are connected by third-party network operators or by the public Internet.

o RSVP-TE has been used to provide edge-to-edge tunnels through which flows to/from many endpoints can be routed, and this provides a reduction in state while still offering Traffic Engineering across the backbone network. However, this requires O(n^2) connections, and as the number of edge domains increases this becomes unsustainable.

o A centralized control system, while capable of producing more optimal results than a distributed control system, may present challenges in large and dynamic networks. It relies on all network state being held centrally, and it is difficult to make central control as robust and self-correcting as distributed control.

This paper introduces an approach that blends the best points of each of these solution technologies to achieve a trade-off where RSVP-TE tunnels in the backbone network are stitched together using SR, and end-to-end SR paths can be created under the control of a central controller with routing devolved to the constituent networks where possible.

4. Decomposing the Problem

It is important to decompose the problem to take account of different regions spanned by the end-to-end path. These regions may use different technologies and may be under different administrative control. The separation of administrative control is particularly important because the operator of one region may be unwilling to share information about their networks, and may be resistant to allowing a third party to exert control over their network resources.

Using the reference model in Figure 1, we can consider how to get a packet from Source1 to the Destination. The following decisions must be made:

o In which domain the Destination lies.

o Which exit point from Dom1 to use.

o Which entry point to Dom2 to use.

o How to reach the exit point of Dom1 from Source1.

o How to reach the entry point to Dom2 from the exit point of Dom1.

o How to reach the Destination from the entry point to Dom2.
As already mentioned, these decisions may be inter-related. This enables us to break down the problem into three steps:

1. Get the packet from Source1 to the exit point of Dom1.

2. Get the packet from the exit point of Dom1 to the entry point of Dom2.

3. Get the packet from the entry point of Dom2 to the Destination.

The solution needs to achieve this in a way that allows:

o Adequate discovery of preferred elements in the end-to-end path (such as location of destination, destination domain entry point).

o Full control of the end-to-end path if all of the operators are willing.

o Re-use of existing techniques and technologies.

From a technology point of view we must support several functions and mixtures of those functions:

o If the domain uses MPLS Segment Routing, the labels within the domain may be populated by any means including BGP-LU [I-D.ietf-mpls-rfc3107bis], IGP, and central control. Source routes within the domain may be expressed as label stacks pushed by a controller or computed by a source router, or expressed as a single label and programmed into the domain routers by a controller.

o If the domain uses other (non-MPLS) forwarding, the domain processing is specific to that technology. See Section 9 for details.

o If the domains use Segment Routing, the source and destination domains may or may not be in the same Segment Routing domain, so that the prefix-SIDs may be the same or different in the two domains.

o The backbone network may be a single private network under the control of the owner of the domains and comprising one or more ASes, or may be a network operated by one or more third parties.

o The backbone network may utilize MPLS Traffic Engineering tunnels in conjunction with MPLS Segment Routing, and the domain-to-domain source route may be provided by stitching TE LSPs.

o A single controller may be used to handle the source and destination domains as well as the backbone network, or there may be a different controller for the backbone network separate from the one that controls the two domains, or there may be separate controllers for each network. The controllers may cooperate and share information to different degrees.

All of these different decompositions of the problem reflect different deployment choices and different commercial and operational practices, each with different functional trade-offs. For example, with separate controllers that do not share information and that only cooperate to a limited extent, it will be possible to achieve end-to-end connectivity with optimal routing at each step (domain or backbone AS), but the end-to-end path that is achieved might not be optimal.

5. Solution Space

5.1. Global Optimization of the Paths

Global optimization of the path from one domain to another requires either that the source controller has a complete view of the end-to-end topology or some form of cooperation between controllers (such as BRPC [RFC5441]).

BGP-LS [RFC7752] can be used to provide the "source" controller with a view of the topology of the backbone. This requires some of the BGP speakers in each AS to have BGP-LS sessions to the controller. Other means of obtaining this view are of course possible.
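As an illustration of the sort of processing such a "source" controller might perform, the Python sketch below merges per-AS BGP-LS exports into a single graph and computes a least-cost path over the merged view. This is only a sketch under assumed inputs: the record keys ("node", "neighbor", "te_metric", "adj_sid") are invented for the example and are not a protocol encoding.

   # Sketch: merge per-AS BGP-LS link exports and compute a path.
   import heapq

   def merge_topologies(bgp_ls_feeds):
       """Combine link records learned over BGP-LS sessions to each AS."""
       graph = {}
       for feed in bgp_ls_feeds:
           for link in feed:
               graph.setdefault(link["node"], []).append(
                   (link["neighbor"], link["te_metric"], link["adj_sid"]))
       return graph

   def least_cost_path(graph, src, dst):
       """Plain Dijkstra over the merged view; returns the SID list
       that the controller would tell the source to impose."""
       queue, seen = [(0, src, [])], set()
       while queue:
           cost, node, sids = heapq.heappop(queue)
           if node == dst:
               return sids
           if node in seen:
               continue
           seen.add(node)
           for neighbor, metric, sid in graph.get(node, ()):
               heapq.heappush(queue, (cost + metric, neighbor, sids + [sid]))
       return None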
5.2. Figuring Out the GWs at a Destination Domain for a Given Prefix

Suppose GW1 and GW2 both advertise a route to prefix X, each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically both routes do not get distributed across the backbone: only the "best" route, as selected by BGP, is distributed. But the best route according to the BGP selection process might not be the route via the GW that we want to use for traffic engineering purposes.

The obvious solution would be to use the ADD-PATH mechanism [RFC7911] to ensure that all routes to X get advertised. However, even if one does this, the identity of the GWs would get lost as soon as the routes got distributed through an ASBR that sets next hop self. And if there are multiple ASes in the backbone, not only will the next hop change several times, but the ADD-PATH mechanism experiences scaling issues. So this "obvious" solution only works within a single AS.

A better solution can be achieved using the Tunnel Encapsulation attribute [I-D.ietf-idr-tunnel-encaps] as follows:

We define a new tunnel type, "SR tunnel", and when the GWs to a given domain advertise a route to a prefix X within the domain, they each include a Tunnel Encapsulation attribute with multiple remote endpoint sub-TLVs, each identifying a specific GW to the domain.

In other words, each route advertised by any GW identifies all of the GWs to the same domain (see Section 9 for a discussion of how GWs discover each other). Therefore, only one of the routes needs to be distributed to other ASes, and it doesn't matter how many times the next hop changes: the Tunnel Encapsulation attribute (and its remote endpoint sub-TLVs) remains unchanged.

Further, when a packet destined for prefix X is sent on a TE path to GW1, we want the packet to arrive at GW1 carrying, at the top of its label stack, GW1's label for prefix X. To achieve this we will place the SID/SRGB in a sub-TLV of the Tunnel Encapsulation attribute. We will define the prefix-SID sub-TLV to be essentially identical in syntax to the prefix-SID attribute (see [I-D.ietf-idr-bgp-prefix-sid]), but the semantics are somewhat different.

It is also possible to define an "MPLS Label Stack" sub-TLV for the Tunnel Encapsulation attribute, and put this in the "SR tunnel" TLV. This allows the destination GW to specify a label stack that it wants packets destined for prefix X to have. This label stack represents a source route through the destination domain.

5.3. Figuring Out the Backbone Egress ASBRs

We need to figure out the backbone egress ASBRs that are attached to a given GW at the destination domain in order to properly engineer the path across the backbone.

The "cleanest" way to figure this out is to have the backbone egress ASBRs distribute the information to the source controller using the EPE extensions of BGP-LS [I-D.ietf-idr-bgpls-segment-routing-epe]. The EPE extensions to BGP-LS allow a BGP speaker to say, "Here is a list of my EBGP neighbors, and here is a (locally significant) adjacency-SID for each one."

It may also be possible to consider utilizing cooperating PCEs or a Hierarchical PCE approach [RFC6805]. But it should be observed that this question is dependent on the question in Section 5.2. That is, it is not possible to even start the selection of egress ASBRs until it is known which GWs at the destination domain provide access to a given prefix. Once that question has been answered, any number of PCE approaches can be used to select the right egress ASBR and, more generally, the ASBR path across the backbone.
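The Python sketch below shows how a controller might combine the information from Sections 5.2 and 5.3: the Tunnel Encapsulation attribute yields the set of GWs for a prefix, and the BGP-LS EPE reports yield the egress ASBRs (with peer-SIDs) that can reach each GW. The data structures are invented for the example, and the "SR tunnel" type is not yet allocated by IANA.

   # Sketch: pair destination-domain GWs with backbone egress ASBRs.
   def gws_for_prefix(route):
       """All GWs to the destination domain, taken from the remote
       endpoint sub-TLVs of the SR tunnel instances."""
       return [t["remote_endpoint"] for t in route["tunnel_encap"]
               if t["type"] == "SR-tunnel"]

   def candidate_exits(route, epe_reports):
       """Egress ASBRs that report a GW as an EBGP neighbor via the
       BGP-LS EPE extensions, with the peer-SID for each neighbor."""
       pairs = []
       for gw in gws_for_prefix(route):
           for asbr, neighbors in epe_reports.items():
               if gw in neighbors:
                   pairs.append((asbr, gw, neighbors[gw]))
       return pairs

   route = {"prefix": "X", "tunnel_encap": [
       {"type": "SR-tunnel", "remote_endpoint": "GW2a"},
       {"type": "SR-tunnel", "remote_endpoint": "GW2b"}]}
   epe = {"PE3a": {"GW2a": 9001}, "PE2b": {"GW2b": 9002}}
   print(candidate_exits(route, epe))
   # [('PE3a', 'GW2a', 9001), ('PE2b', 'GW2b', 9002)]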
5.4. Making use of RSVP-TE LSPs Across the Backbone

There are a number of ways to carry traffic across the backbone from one domain to another. RSVP-TE is a popular tunneling mechanism in similar scenarios (e.g., L3VPN) because it allows for reservation of resources as well as traffic steering.

A controller can cause an RSVP-TE LSP to be set up by using PCEP to talk to the LSP headend, using the PCEP extensions in [I-D.ietf-pce-pce-initiated-lsp]. That draft specifies an "LSP-initiate" message that the controller uses to specify the RSVP-TE LSP endpoints, the ERO, a "symbolic pathname", and optionally other attributes (specified in the PCEP specification, RFC 5440 [RFC5440]) such as bandwidth.

When the headend receives an LSP-initiate message, it sets up the RSVP-TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to the controller in a PCRpt message [I-D.ietf-pce-stateful-pce]. The PCRpt message also contains the symbolic name that the controller assigned to the LSP, as well as containing some information identifying the LSP-initiate message from the controller, and details of exactly how the LSP was set up (RRO, bandwidth, etc.).

The headend can add to the PCRpt message a TE-PATH-BINDING TLV [I-D.sivabalan-pce-binding-label-sid]. This allows the headend to assign a "binding SID" to the LSP, and to report to the controller that a particular binding SID corresponds to a particular LSP. The binding SID is locally scoped to the headend.

The controller can make this label be part of the label stack that it tells the source (or the GW at the source domain) to put on the data packets being sent to prefix X. When the headend receives a packet with this label at the top of the stack it will send the packet onward on the LSP.
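The exchange just described can be paraphrased in a few lines of Python. This is a sketch of the information flow only: the field names are invented, and no real PCEP message encoding is attempted.

   # Sketch: LSP-initiate answered by a report carrying a binding SID.
   class Headend:
       def __init__(self):
           self.lsps, self.next_plsp_id, self.next_bsid = {}, 1, 24000

       def lsp_initiate(self, endpoints, ero, symbolic_name):
           """Set up the RSVP-TE LSP and report back to the controller."""
           plsp_id, bsid = self.next_plsp_id, self.next_bsid
           self.next_plsp_id, self.next_bsid = plsp_id + 1, bsid + 1
           self.lsps[bsid] = {"endpoints": endpoints, "ero": ero}
           return {"plsp_id": plsp_id,            # reported in the PCRpt
                   "symbolic_name": symbolic_name,
                   "te_path_binding": bsid}       # TE-PATH-BINDING TLV

   pe2a = Headend()
   report = pe2a.lsp_initiate(("PE2a", "ASBR2a"), ["hop1", "hop2"],
                              "lsp-to-asbr2a")
   # The controller can now place report["te_path_binding"] in the
   # label stack that it tells the source (or the source-domain GW)
   # to impose on packets sent to prefix X.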
5.5. Data Plane

Consolidating all of the above, consider what happens when we want to move a data packet from Source1 to the Destination in Figure 1 via the following source route:

Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination

Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a that we want to use, as well as an RSVP-TE LSP from ASBR3a to PE3a that we want to use.

Let's suppose that Source1 pushes a label stack following instructions from the controller (for example, using BGP-LU [I-D.ietf-mpls-rfc3107bis]). We won't worry for now about source routing through the domains themselves: that is, in practice there may be additional labels in the stack to cover the source route from Source1 to GW1b and from GW2a to the Destination, but we will focus only on the labels necessary to leave the source domain, traverse the backbone, and enter the egress domain. So we only care what the stack looks like when the packet gets to GW1b.

When the packet gets to GW1b, the stack should have six labels:

Top Label:

Peer-SID or adjacency-SID identifying the link or links to PE2a. These SIDs are distributed from GW1b to the controller via the EPE extensions of BGP-LS. (This label will get popped by GW1b, which will then send the packet to PE2a.)

Second Label:

Binding SID advertised by PE2a to the controller for the RSVP-TE LSP to ASBR2a. This binding SID is advertised via the PCEP extensions discussed above. (This label will get swapped by PE2a for the label that the LSP's next hop has assigned to the LSP.)

Third Label:

Peer-SID or adjacency-SID identifying the link or links to ASBR3a, as advertised to the controller by ASBR2a using the BGP-LS EPE extensions. (This label gets popped by ASBR2a, which then sends the packet to ASBR3a.)

Fourth Label:

Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a. This binding SID is advertised via the PCEP extensions discussed above. ASBR3a treats this label just like PE2a treated the second label above.

Fifth Label:

Peer-SID or adjacency-SID identifying the link or links to GW2a, as advertised to the controller by PE3a using the BGP-LS EPE extensions. PE3a pops this label and sends the packet to GW2a.

Sixth Label:

Prefix-SID or other label identifying the Destination advertised in a Tunnel Encapsulation attribute by GW2a. (This can be omitted if GW2a is happy to accept IP packets, or prefers a VXLAN tunnel for example. That would be indicated through the Tunnel Encapsulation attribute of course.)

Note that the size of the label stack is proportional to the number of RSVP-TE LSPs that get stitched together by SR.

See Section 7 for some detailed examples that show the concrete use of labels in a sample topology.

In the above example, all labels except the sixth are locally significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs. Only the sixth label, a prefix-SID, has a domain-wide unique value. To impose that label, the source needs to know the SRGB of GW2a. If all nodes have the same SRGB, this is not a problem. Otherwise, there are a number of different ways GW2a can advertise its SRGB. This can be done via the segment routing extensions of BGP-LS, or it can be done using the prefix-SID attribute or BGP-LU [I-D.ietf-mpls-rfc3107bis], or it can be done using the BGP Tunnel Encapsulation attribute. The exact technique to be used will depend on the details of the deployment scenario.

The reason the above example is primarily based on locally significant labels is that it creates a "strict source route", and it presupposes the EPE extensions of BGP-LS. In some scenarios, the EPE extension to BGP-LS might not be available (or BGP-LS might not be available at all). In other scenarios, it may be desirable to steer a packet through a "loose source route". In such scenarios, the label stack imposed by the source will be based upon a sequence of domain-wide unique "node-SIDs", each representing one of the hops of the source route. Each label has to be computed by adding the corresponding node-SID to the SRGB of the node that will act upon the label. One way to learn the node-SIDs and SRGBs is to use the segment routing extensions of BGP-LS. Another way is to use BGP-LU as follows. Each node that may be part of a source route would originate a BGP-LU route with one of its own loopback addresses as the prefix. The BGP prefix-SID attribute would be attached to this route. The prefix-SID attribute would contain a SID, which is the domain-wide unique SID corresponding to the node's loopback address. The attribute would also contain the node's SRGB.
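A hypothetical worked example of that label arithmetic follows: each label in a loose source route is the SRGB base of the node that will act upon the label, plus the node-SID index of the hop the label names. The SRGB bases and SID indices below are invented.

   # Sketch: computing a loose source route from node-SIDs and SRGBs.
   srgb_base = {"GW1b": 16000, "PE2a": 20000}    # learned via BGP-LS
   node_sid  = {"PE2a": 21, "PE3a": 31}          # or BGP-LU, as above

   def loose_route_stack(hops):
       """Label for each hop, computed against the SRGB of the node
       that will act on that label (the previous node in the list)."""
       return [srgb_base[acting] + node_sid[target]
               for acting, target in zip(hops, hops[1:])]

   print(loose_route_stack(["GW1b", "PE2a", "PE3a"]))
   # [16021, 20031]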
While this technique is useful when BGP-LS is not available, it presupposes that the source controller has some other means of discovering the topology. In this document, we focus primarily on the scenario where BGP-LS, rather than BGP-LU, is used.

5.6. Centralized and Distributed Controllers

A controller or set of controllers is needed to collate topology and TE information from the constituent networks, to apply policies and service requirements to compute paths across those networks, to select an end-to-end path, and to program key nodes in the network to take the right forwarding actions (pushing label stacks, stitching LSPs, forwarding traffic).

o It is commonly understood that a fully optimal end-to-end path can only be computed with full knowledge of the end-to-end topology and available Traffic Engineering resources. Thus, one option is for all information about the domain networks and backbone network to be collected by a central controller that makes all path computations and is responsible for issuing the necessary programming commands. Such a model works best when there is no commercial or administrative impediment (for example, where the domains and the backbone network are owned and operated by the same organization). There may, however, be some scaling concerns if the component networks are large.

In this mode of operation, each network may use BGP-LS to export Traffic Engineering and topology information to the central controller, and the controller may use PCEP to program the network behavior.

o A similar centralized control mechanism can be used with a scalability improvement that risks a reduction in optimality. In this case, the domain networks can export to the controller just the feasibility of connectivity between data source/sink and gateway, perhaps enhancing this with some information about the Traffic Engineering metrics of the path.

This approach allows the central controller to understand the end-to-end path that it is selecting, but not to control it fully. The source route from data source to domain egress gateway is left to the source host or a controller in the source domain, while the source route from domain ingress gateway to destination is left as a decision for the domain ingress gateway or to a controller in the destination domain.

This mode of operation still leaves overall control with a centralized server, and that may not be considered suitable when there is separate commercial or administrative control of the networks.

o When there is separate commercial or administrative control of the networks, the domain operator will not want the backbone operator to have control of the source routes within the domain and may be reluctant to disclose any information about the topology or resource availability within the domains. Conversely, the backbone operator may be very unwilling to allow the domain operator (a customer) any control over or knowledge about the backbone network.
This "problem" has already been solved for Traffic Engineering in MPLS networks that span multiple administrative domains, and leads to multiple potential solutions:

* Per-domain path computation [RFC5152] can be seen as "best effort optimization". In this mode the controller for each domain is responsible for finding the best path to the next domain, but has no way of knowing which is the best exit point from the local domain. The resulting path may end up significantly sub-optimal or even blocked.

* Backward recursive path computation (BRPC) [RFC5441] is a mechanism that allows controllers to cooperate across a small set of domains (such as ASes) to build a tree of possible paths and so allow the controller for the ingress domain to select the optimal path. The details of the paths within each domain that might reveal confidential information can be hidden using Path Keys [RFC5520]. BRPC produces optimal paths but scales poorly with an increase in domains and with an increase in connectivity between domains. It can also lead to slow computation times.

* Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation process between PCEs. The child PCEs remain responsible for computing paths across their domains, and they coordinate with a parent PCE that stitches these paths together to form the end-to-end path. This approach has many similarities with BRPC but can scale better through the maintenance of a "domain topology" that shows how the domains are interconnected, and through the ability to pipe-line computation requests to all of the child domains. It has the drawback that some party has to own and operate the parent PCE.

* An alternative approach is documented by the TEAS working group [RFC7926]. In this model each network advertises to controllers for adjacent networks (using BGP-LS) selected information about potential connectivity across the network. It does not have to show full topology and can make its own decisions about which paths it considers optimal for use by its different neighbors and customers. This approach is suitable for the End-to-End Domain Interconnect Traffic Steering problem where the backbone is under different control from the domains because it allows the overlay nature of the use of the backbone network to be treated as a peer network relationship by the controllers of the domains - the domains can be operated using a single controller or a separate controller for each domain.

It is also possible to operate domain interconnection when some or all domains do not have a controller. Segment Routing is capable of routing a packet toward the next hop based on the top label on the stack, and that label does not need to indicate an immediately adjacent node or link. In these cases, the packet may be forwarded untouched, or the forwarding router may impose a locally-determined additional set of labels that define the path to the next hop.

PCE can be used to instruct the source host or a transit node on what label stacks to add to packets.
That is, a node that needs to impose labels (either to start routing the packet from the source host, or to advance the packet from a transit router toward the destination) can determine the label stack to use based on local function or can have that stack supplied by a PCE. The PCE Protocol (PCEP) has been extended to allow the PCE to supply a label stack for reaching a specific destination either in response to a request or in an unsolicited manner [I-D.ietf-pce-segment-routing].

6. BGP-LS Considerations

This section gives an overview of the use of BGP-LS to export an abstraction (or summary) of the connectivity across the backbone network by means of two figures that show different views of a sample network.

Figure 2 shows a more complex reference architecture.

Figure 3 represents the minimum set of nodes and links that need to be advertised in BGP-LS with SR in order to perform Domain Interconnect with traffic engineering across the backbone network: the PEs, ASBRs, and gateways (GWs), and the links between them. In particular, EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with associated segment IDs is advertised in BGP-LS with SR.

Links that are advertised may be physical links, links realized by LSP tunnels, or abstract links. It is assumed that intra-AS links are either real links, RSVP-TE LSPs with allocated bandwidth, or SR TE policies as described in [I-D.previdi-idr-segment-routing-te-policy]. Additional nodes internal to an AS and their links to PEs, ASBRs, and/or GWs may also be advertised (for example to avoid full mesh problems).

   -------------------------------------------------------------------
   | |
   | AS1 |
   | ---- ---- ---- ---- |
   -|PE1a|--|PE1b|-------------------------------------|PE2a|--|PE2b|-
   ---- ---- ---- ----
   : : ------------ ------------ : : :
   : : | AS2 | | AS3 | : : :
   : : | ------.....------ | : : :
   : : | |ASBR2a| |ASBR3a| | : : :
   : : | ------ ..:------ | : : :
   : : | | : | | : : :
   : : | ------..: ------ | : : :
   : : | |ASBR2b|...|ASBR3b| | : : :
   : : | ------ ------ | : : :
   : : | | | | : : :
   : : | | ------ | : : :
   : : | | ..|ASBR3c| | : : :
   : : | | : ------ | : ....: :
   : ......: | ---- | : | ---- | : : :
   : : -|PE2a|----- : -----|PE3b|- : : :
   : : ---- : ---- : : :
   : : .......: : :....... : : :
   : : : ------ : : : :
   : : : ----|ASBR4b|---- : : : :
   : : : | ------ | : : : :
   : : : ---- | : : : :
   : : : .........|PE4b| AS4 | : : : :
   : : : : ---- | : : : :
   : : : : | ---- | : : : :
   : : : : -----|PE4a|----- : : : :
   : : : : ---- : : : :
   : : : : ..: :.. : : : :
   : : : : : : : : : :
   ---- ---- ---- ---- ----: ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|-
   | ---- ---- | | ---- ---- | | ---- ---- |
   | | | | | |
   | | | | | |
   | Host1a Host1b | | Host2a Host2b | | Host3a Host3b |
   | | | | | |
   | | | | | |
   | Dom1 | | Dom2 | | Dom3 |
   ---------------- ---------------- ----------------

   Figure 2: Network View of Example Configuration

   .............................................................
   : :
   ---- ---- ---- ----
   |PE1a| |PE1b|.....................................|PE2a| |PE2b|
   ---- ---- ---- ----
   : : : : :
   : : : : :
   : : ------.....------ : : :
   : : ......|ASBR2a| |ASBR3a|...... : : :
   : : : ------ ..:------ : : : :
   : : : : : : : :
   : : : ------..: ------ : : : :
   : : : ...|ASBR2b|...|ASBR3b| : : : :
   : : : : ------ ------ : : : :
   : : : : : : : : :
   : : : : ------ : : : :
   : : : : ..|ASBR3c|... : : : :
   : : : : : ------ : : : ....: :
   : ......: ---- : ---- : : :
   : : |PE2a| : |PE3b| : : :
   : : ---- : ---- : : :
   : : .......: : :....... : : :
   : : : ------ : : : :
   : : : |ASBR4b| : : : :
   : : : ------ : : : :
   : : : ---- : : : : :
   : : : .........|PE4b|..... : : : : :
   : : : : ---- : : : : : :
   : : : : ---- : : : :
   : : : : |PE4a| : : : :
   : : : : ---- : : : :
   : : : : ..: :.. : : : :
   : : : : : : : : : :
   ---- ---- ---- ---- ----: ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|-
   | ---- ---- | | ---- ---- | | ---- ---- |
   | | | | | |
   | | | | | |
   | Host1a Host1b | | Host2a Host2b | | Host3a Host3b |
   | | | | | |
   | | | | | |
   | Dom1 | | Dom2 | | Dom3 |
   ---------------- ---------------- ----------------

   Figure 3: Topology View of Example Configuration

A node (a PCE, router, or host) that is computing a full or partial path correlates the topology information disseminated in BGP-LS with SR with the information advertised with the Tunnel Encapsulation attributes to compute that path and obtain the SIDs for the elements on that path. In order to allow a source host to compute exit points from its domain, some subset of the above information needs to be disseminated within that domain.

What is advertised external to a given AS is controlled by policy at the ASes' PEs, ASBRs, and GWs. Central control of what each node should advertise, based upon analysis of the network as a whole, is an important additional function. This and the amount of policy involved may make the use of a Route Reflector an attractive option.

The configuration of which links to other nodes, and which characteristics of those links, a given node advertises in BGP-LS with SR is done locally at each node; pairwise coordination between link end-points is required to ensure consistency.

Path Weighted ECMP (PWECMP) is assumed to be used by a GW for a given source domain to send all flows to a given destination domain using all paths in the backbone network to that destination domain in proportion to the minimum bandwidth on each path. PWECMP is also assumed to be used by hosts within a source domain to send flows to that domain's GWs.

7. Worked Examples

Figure 4 shows a view of the links, paths, and labels that can be assigned to part of the sample network shown in Figure 2 and Figure 3. The double-dash lines (===) indicate LSP tunnels across backbone ASes and dotted lines (...) are physical links.

At each node, a label may be assigned to each outgoing link. This is shown in Figure 4. For example, at GW1a the label L201 is assigned to the link connecting GW1a to PE1a. At PE1c, the label L302 is assigned to the link connecting PE1c to GW3b. Labels ("binding SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c.

At the destination domain, labels L302 and L305 are "node-SIDs"; they represent GW3b and Host3b respectively, rather than representing particular links.
When a node processes a packet, the label at the top of the label stack indicates the link (or RSVP-TE LSP) on which that node is to transmit the packet. The node pops that label off the label stack before transmitting the packet on the link. However, if the top label is a node-SID, the node processing the packet is expected to transmit the packet on whatever link it regards as the shortest path to the node represented by the label.

   ---- L202 ----
   | |=======================================================| |
   |PE1a| |PE1c|
   | |=======================================================| |
   ---- L203 ----
   : : :
   : ---- L205 ---- : :
   : |PE1b|============================================|PE1d| : :
   : ---- ---- : :
   : : : : :
   : : : : :
   : : ---- L207 ------ L209 ------ L303: : :
   :L201 : | |======|ASBR2a|......| | : : :
   : : | | ------ | | L210 ---- : : :
   : : |PE2a| |ASBR3a|======|PE3b| : : :
   : : | | L208 ------ L211 | | ---- : : :
   : : | |======|ASBR2b|......| | : : : :
   : L204: ---- ------ ------ ...: : : :
   : : : : : : :
   : ....: : : .......: : :
   : : : : : : :
   : : :L206 L301: : .........: :
   : : : : : : L304 :
   : : ....: : : : ....:
   : : : : : : : L302
   ---- ---- ----- ----
   -|GW1a|--|GW1b|- -|GW3a |--|GW3b|-
   | ---- ---- | | ----- ---- |
   | : : | | : : |
   |L103: :L102| | L303: :L304|
   | : : | | : : |
   | N1 N2 | | N3 N4 |
   | :.. ..: | | : ....: |
   | L101 : : | | : : |
   | Host1a | | Host3b (L305) |
   | | | |
   | Dom1 | | Dom3 |
   ---------------- -----------------

   Figure 4: Tunnels and Labels in Example Configuration

Let's consider several different possible ways to direct a packet from Host1a in Dom1 to Host3b in Dom3.

a. Full source route imposed at source

In this case it is assumed that the entity responsible for determining an end-to-end path has access to the topologies of both domains and of the backbone network. This might happen if all of the networks are owned by the same operator in which case the information can be shared into a single database for use by an offline tool, or the information can be distributed using routing protocols such that the source host can see enough to select the path. Alternatively, the end-to-end path could be produced through cooperation between computation entities each responsible for different domains along the path.

If the path is computed externally it is pushed to the source host. Otherwise, it is computed by the source host itself.

Suppose it is desired for a packet from Host1a to travel to Host3b via the following source route:

Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->N4->Host3b

Host1a would impose the following label stack (with the first label representing the top of stack) and then send the packet to N1:

L103, L201, L202, L302, L304, L305

N1 sees L103 at the top of the stack, so it pops the stack and forwards the packet to GW1a. GW1a sees L201 at the top of the stack, so it pops the stack and forwards the packet to PE1a. PE1a sees L202 at the top of the stack, so it pops the stack and forwards the packet over the RSVP-TE LSP to PE1c. As the packet travels over this LSP, its top label will be an RSVP-TE signaled label representing the LSP. That is, PE1a imposes an additional label stack entry for the tunnel LSP.
At the end of the LSP tunnel, the MPLS tunnel label will be popped, and PE1c will see L302 at the top of the stack. PE1c pops the stack and forwards the packet to GW3b. GW3b will see L304 at the top of the stack, so it pops the stack and forwards the packet to N4. Finally, N4 sees L305 at the top of the stack, so it pops the stack and forwards the packet to Host3b.

b. No remote visibility into Dom3

It is possible that the source domain does not have visibility into the destination domain. This occurs if the destination domain does not export its topology, but even in this case, it will export reachability information so that the source host or the path computation entity will know:

* The GWs through which the destination can be reached.

* The SID to use for the destination prefix.

Suppose we want a packet to follow the source route:

Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->...->Host3b

(The ellipsis indicates a part of the path that is not explicitly specified.) Thus, the label stack imposed at the source host would be:

L103, L201, L202, L302, L305

Processing is as per case a., but when the packet reaches the GW of the destination domain, it can either simply forward the packet along the shortest path to Host3b, or it can insert additional labels to direct the path to the destination.

c. Dom1 only has reachability information

The source domain (or the path computation entity) may be further restricted in its view of the network. It is possible that it knows the location of the destination in the destination domain, and knows the GWs to the destination domain that provide reachability to the destination, but that it has no view of the backbone network. This leads to the packet being forwarded in a manner similar to 'per-domain path computation' described in Section 5.6.

At the source host a simple label stack is imposed navigating the domain and indicating the destination GW and the destination host.

L101, L103, L302, L305

As the packet leaves the source domain, the source GW determines the PE to use to enter the backbone using nothing more than the BGP preferred route to the destination GW.

When the packet reaches the first PE it has a label stack just identifying the destination GW and host (L302, L305). The PE uses information it has about the backbone network topology and available LSPs to select an LSP tunnel, impose the tunnel label, and forward the packet.

When the packet reaches the end of the LSP tunnel, it is processed as described in case b.

d. Stitched LSPs across the backbone

A variant of all these cases arises when the packet is sent using a path that spans multiple ASes. For example, one that crosses AS2 and AS3 as shown in Figure 2.

In this case, basing the example on case a., the source host would impose the label stack:

L102, L206, L207, L209, L210, L301, L303, L305

and would then send the packet to N2.

The packet reaches PE2a as previously described, and the top label (L207) selects an LSP tunnel that leads to ASBR2a. At the end of that LSP tunnel the next label (L209) routes the packet from ASBR2a to ASBR3a, where the next label (L210) identifies the next LSP tunnel to use. Thus, SR has been used to stitch together LSPs to make a longer path segment. As the packet emerges from the final LSP tunnel, forwarding continues as previously described.
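The pop-and-forward behavior in these worked examples can be captured in a short, table-driven Python sketch. The table below paraphrases Figure 4 for case a.; an RSVP-TE LSP is modeled as a single hop rather than as the hop-by-hop signaled labels that would be seen in a real network.

   # Sketch: walking the case a. label stack through Figure 4.
   forwarding = {                        # (node, label) -> next hop
       ("N1",   "L103"): "GW1a",
       ("GW1a", "L201"): "PE1a",
       ("PE1a", "L202"): "PE1c",         # binding SID for the LSP tunnel
       ("PE1c", "L302"): "GW3b",
       ("GW3b", "L304"): "N4",
       ("N4",   "L305"): "Host3b",
   }

   def forward(node, stack):
       """Pop the top label at each node and forward on the indicated
       link (or LSP), as described in Section 7."""
       path = [node]
       while stack:
           node = forwarding[(node, stack.pop(0))]
           path.append(node)
       return path

   print(" -> ".join(forward("N1", ["L103", "L201", "L202",
                                    "L302", "L304", "L305"])))
   # N1 -> GW1a -> PE1a -> PE1c -> GW3b -> N4 -> Host3b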
8. Label Stack Depth Considerations

As described in Section 3.1, one of the issues with a Segment Routing approach is that the label stack can get large, for example when the source route becomes long. A mechanism to mitigate this problem is needed if the solution is to be fully applicable in all environments.

An Internet-Draft called "Segment Routing Traffic Engineering Policy using BGP" [I-D.previdi-idr-segment-routing-te-policy] introduces the concept of hierarchical source routes as a way to compress source route headers. It functions by having the egress node for a set of source routes advertise those source routes along with an explicit request that each node that is an ingress node for one or more of those source routes should advertise a binding SID for the set of source routes for which it is the ingress. (It should be noted that the set of source routes can either be advertised by the egress node as described here, or could be advertised by a controller on behalf of the egress node.) Such an ingress node advertises its set of source routes and a binding SID as an adjacency in BGP-LS as described in Section 6. These source routes represent the weighted ECMP paths between the ingress node and the egress node. (Note also that the binding SID may be supplied by the node that advertises the source routes - the egress or the controller - or may be chosen by the ingress node.)

A remote node that wishes to reach the egress node would then construct a source route consisting of the segment IDs necessary to reach one of the ingress nodes for the path it wishes to use, along with the binding SID that the ingress node advertised to identify the set of paths. When the selected ingress node receives a packet with a binding SID it has advertised, it replaces the binding SID with the labels for one of its source routes to the egress node (it will choose one of the source routes in the set according to its own weighting algorithms and policy).

8.1. Worked Example

Consider the topology in Figure 4. Suppose that it is desired to construct full segment routed paths from ingress to egress, but that the resulting label stack (segment route) is too large. In this case the gateways to Dom3 (GW3a and GW3b) can advertise all of the source routes from the gateways to Dom1 (GW1a and GW1b). The gateways to Dom1 then assign binding SIDs to those source routes and advertise those SIDs into BGP-LS.

Thus, GW3b would advertise the two source routes (L201, L202, L302 and L201, L203, L302), and GW1a would advertise into BGP-LS its adjacency to GW3b along with a binding SID. Should Host1a wish to send a packet via GW1a and GW3b, it can include L103 and this binding SID in the source route. GW1a is free to choose which source route to use between itself and GW3b using its weighted ECMP algorithm.

Similarly, GW3a would advertise the following set of source routes:

o L201, L202, L304

o L201, L203, L304

o L204, L205, L303

o L206, L207, L209, L210, L301

o L206, L208, L211, L210, L301

GW1a would advertise a binding SID for the first three, and GW1b would advertise a binding SID for the other two.
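A sketch of the binding-SID expansion performed by GW1a in this example is shown below. The weights are invented: a real GW would choose among the bound source routes according to its own weighted ECMP algorithm and policy, as noted above.

   # Sketch: GW1a replaces a binding SID with one bound source route.
   import random

   bound_routes = {                       # advertised into BGP-LS
       "B-GW3b": [(["L201", "L202", "L302"], 1),    # (labels, weight)
                  (["L201", "L203", "L302"], 1)],
   }

   def expand(stack):
       """If the top label is a binding SID that this GW advertised,
       replace it with one of the corresponding source routes."""
       if stack and stack[0] in bound_routes:
           routes = bound_routes[stack[0]]
           labels = random.choices([r for r, _ in routes],
                                   weights=[w for _, w in routes])[0]
           return labels + stack[1:]
       return stack

   # Host1a imposes (L103, B-GW3b); N1 pops L103 and delivers to GW1a.
   print(expand(["B-GW3b"]))   # e.g. ['L201', 'L202', 'L302']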
9. Gateway Considerations

As described in Section 5, we define a new tunnel type, "SR tunnel". When the GWs to a given domain advertise a route to a prefix X within the domain, they will each include a Tunnel Encapsulation attribute with multiple tunnel instances, each of type "SR tunnel": one for each GW, each containing a Remote Endpoint sub-TLV with that GW's address.

In other words, each route advertised by any GW identifies all of the GWs to the same domain.

Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute (and its Remote Endpoint sub-TLVs) will remain unchanged.

9.1. Domain Gateway Auto-Discovery

To allow a given domain's GWs to auto-discover each other and to coordinate their operations, the following procedures are implemented [I-D.drake-bess-datacenter-gateway]:

o Each GW is configured with an identifier for the domain that is common across all GWs to the domain (i.e., the same identifier is used by all GWs to the same domain) and unique across all domains that are connected.

o A route target [RFC4360] is attached to each GW's auto-discovery route and has its value set to the domain identifier.

o Each GW constructs an import filtering rule to import any route that carries a route target with the same domain identifier that the GW itself uses. This means that only these GWs will import those routes and that all GWs to the same domain will import each other's routes and will learn (auto-discover) the current set of active GWs for the domain.

o The auto-discovery route each GW advertises consists of the following:

* An IPv4 or IPv6 NLRI containing one of the GW's loopback addresses (that is, with an AFI/SAFI that is one of 1/1, 2/1, 1/4, or 2/4).

* A Tunnel Encapsulation attribute containing the GW's encapsulation information, which at a minimum consists of an SR tunnel TLV (type to be allocated by IANA) with a Remote Endpoint sub-TLV [I-D.ietf-idr-tunnel-encaps].

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW should use a different loopback address for this purpose.

Each GW will include an SR tunnel instance for each GW that is active for the domain (including itself) in the Tunnel Encapsulation attribute of every route it advertises externally to the domain. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW), each externally advertised route will be re-advertised with the set of SR tunnel instances reflecting the current set of active GWs.

9.2. Relationship to BGP Link State and Egress Peer Engineering

When a remote GW receives a route to a prefix X, it can use the SR tunnel instances within the contained Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR TE paths across the backbone network from the information advertised to it in BGP-LS with SR extensions [I-D.gredler-idr-bgp-ls-segment-routing-ext], correlated using the domain identity. SR Egress Peer Engineering (EPE) [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement the information advertised in BGP-LS.
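The following Python sketch shows, under assumed data structures, how a remote GW might correlate the Remote Endpoint sub-TLVs from a received route's SR tunnel instances with BGP-LS node information keyed by the domain identifier, in order to list the candidate GWs through which prefix X is reachable. The class definitions and addresses are hypothetical; a real implementation would work from the actual parsed BGP and BGP-LS data.

   from dataclasses import dataclass

   @dataclass
   class SRTunnelInstance:
       remote_endpoint: str   # GW address from the Remote Endpoint sub-TLV

   @dataclass
   class BgplsNode:
       address: str
       domain_id: str         # correlates the node with its domain

   def candidate_gws(route_tunnels, bgpls_nodes, domain_id):
       """Return the GW addresses from the route's SR tunnel instances
       that BGP-LS confirms as belonging to the given domain."""
       in_domain = {n.address for n in bgpls_nodes if n.domain_id == domain_id}
       return [t.remote_endpoint
               for t in route_tunnels if t.remote_endpoint in in_domain]

   # Example: a route to X carries tunnel instances for two GWs, and
   # BGP-LS shows both as members of domain "Dom3" (addresses are
   # from the documentation range and purely illustrative).
   tunnels = [SRTunnelInstance("192.0.2.31"), SRTunnelInstance("192.0.2.32")]
   nodes = [BgplsNode("192.0.2.31", "Dom3"), BgplsNode("192.0.2.32", "Dom3")]
   print(candidate_gws(tunnels, nodes, "Dom3"))

Each returned GW is then a potential tail end for an SR TE path computation across the backbone, optionally refined with EPE information.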
9.3. Advertising a Domain Route Externally

When a packet destined for prefix X is sent on an SR TE path to a GW for the domain containing X, it needs to carry the receiving GW's label for X such that this label rises to the top of the stack before the GW completes its processing of the packet. To achieve this, we place a Prefix-SID sub-TLV for X in each SR tunnel instance in the Tunnel Encapsulation attribute in the externally advertised route for X.

Alternatively, if the GWs for a given domain are configured to allow remote GWs to perform SR TE through that domain for a prefix X, then each GW computes an SR TE path through that domain to X from each of the currently active GWs and places each path in an MPLS Label Stack sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for that GW.

9.4. Encapsulations

If the GWs for a given domain are configured to allow remote GWs to send them packets in that domain's native encapsulation, then each GW will also include in externally advertised routes multiple instances of a tunnel TLV for that native encapsulation: one for each GW, each containing a Remote Endpoint sub-TLV with that GW's address. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the tunnel TLV instances.

10. Security Considerations

TBD

11. Management Considerations

TBD

12. IANA Considerations

This document makes no requests for IANA action.

13. Acknowledgements

TBD

14. Informative References

[I-D.drake-bess-datacenter-gateway]
Drake, J., Farrel, A., Rosen, E., Patel, K., and L. Jalil, "Gateway Auto-Discovery and Route Advertisement for Segment Routing Enabled Data Center Interconnection", draft-drake-bess-datacenter-gateway-03 (work in progress), April 2017.

[I-D.gredler-idr-bgp-ls-segment-routing-ext]
Previdi, S., Psenak, P., Filsfils, C., Gredler, H., Chen, M., and J. Tantsura, "BGP Link-State extensions for Segment Routing", draft-gredler-idr-bgp-ls-segment-routing-ext-04 (work in progress), October 2016.

[I-D.ietf-idr-bgp-prefix-sid]
Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A., and H. Gredler, "Segment Routing Prefix SID extensions for BGP", draft-ietf-idr-bgp-prefix-sid-06 (work in progress), June 2017.

[I-D.ietf-idr-bgpls-segment-routing-epe]
Previdi, S., Filsfils, C., Patel, K., Ray, S., and J. Dong, "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", draft-ietf-idr-bgpls-segment-routing-epe-13 (work in progress), June 2017.

[I-D.ietf-idr-tunnel-encaps]
Rosen, E., Patel, K., and G. Van de Velde, "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-06 (work in progress), June 2017.

[I-D.ietf-isis-segment-routing-extensions]
Previdi, S., Filsfils, C., Bashandy, A., Gredler, H., Litkowski, S., Decraene, B., and J. Tantsura, "IS-IS Extensions for Segment Routing", draft-ietf-isis-segment-routing-extensions-13 (work in progress), June 2017.

[I-D.ietf-mpls-rfc3107bis]
Rosen, E., "Using BGP to Bind MPLS Labels to Address Prefixes", draft-ietf-mpls-rfc3107bis-02 (work in progress), May 2017.

[I-D.ietf-ospf-segment-routing-extensions]
Psenak, P., Previdi, S., Filsfils, C., Gredler, H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF Extensions for Segment Routing", draft-ietf-ospf-segment-routing-extensions-17 (work in progress), June 2017.
Tantsura, "OSPF 1365 Extensions for Segment Routing", draft-ietf-ospf-segment- 1366 routing-extensions-17 (work in progress), June 2017. 1368 [I-D.ietf-pce-pce-initiated-lsp] 1369 Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "PCEP 1370 Extensions for PCE-initiated LSP Setup in a Stateful PCE 1371 Model", draft-ietf-pce-pce-initiated-lsp-10 (work in 1372 progress), June 2017. 1374 [I-D.ietf-pce-segment-routing] 1375 Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., 1376 and J. Hardwick, "PCEP Extensions for Segment Routing", 1377 draft-ietf-pce-segment-routing-09 (work in progress), 1378 April 2017. 1380 [I-D.ietf-pce-stateful-pce] 1381 Crabbe, E., Minei, I., Medved, J., and R. Varga, "PCEP 1382 Extensions for Stateful PCE", draft-ietf-pce-stateful- 1383 pce-21 (work in progress), June 2017. 1385 [I-D.ietf-spring-segment-routing] 1386 Filsfils, C., Previdi, S., Decraene, B., Litkowski, S., 1387 and R. Shakir, "Segment Routing Architecture", draft-ietf- 1388 spring-segment-routing-12 (work in progress), June 2017. 1390 [I-D.ietf-spring-segment-routing-mpls] 1391 Filsfils, C., Previdi, S., Bashandy, A., Decraene, B., 1392 Litkowski, S., and R. Shakir, "Segment Routing with MPLS 1393 data plane", draft-ietf-spring-segment-routing-mpls-10 1394 (work in progress), June 2017. 1396 [I-D.previdi-idr-segment-routing-te-policy] 1397 Previdi, S., Filsfils, C., Mattes, P., Rosen, E., and S. 1398 Lin, "Advertising Segment Routing Policies in BGP", draft- 1399 previdi-idr-segment-routing-te-policy-07 (work in 1400 progress), June 2017. 1402 [I-D.sivabalan-pce-binding-label-sid] 1403 Sivabalan, S., Filsfils, C., Previdi, S., Tantsura, J., 1404 Hardwick, J., and M. Nanduri, "Carrying Binding Label/ 1405 Segment-ID in PCE-based Networks.", draft-sivabalan-pce- 1406 binding-label-sid-02 (work in progress), October 2016. 1408 [RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended 1409 Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, 1410 February 2006, . 1412 [RFC5152] Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A 1413 Per-Domain Path Computation Method for Establishing Inter- 1414 Domain Traffic Engineering (TE) Label Switched Paths 1415 (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February 2008, 1416 . 1418 [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation 1419 Element (PCE) Communication Protocol (PCEP)", RFC 5440, 1420 DOI 10.17487/RFC5440, March 2009, 1421 . 1423 [RFC5441] Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux, 1424 "A Backward-Recursive PCE-Based Computation (BRPC) 1425 Procedure to Compute Shortest Constrained Inter-Domain 1426 Traffic Engineering Label Switched Paths", RFC 5441, 1427 DOI 10.17487/RFC5441, April 2009, 1428 . 1430 [RFC5520] Bradford, R., Ed., Vasseur, JP., and A. Farrel, 1431 "Preserving Topology Confidentiality in Inter-Domain Path 1432 Computation Using a Path-Key-Based Mechanism", RFC 5520, 1433 DOI 10.17487/RFC5520, April 2009, 1434 . 1436 [RFC6805] King, D., Ed. and A. Farrel, Ed., "The Application of the 1437 Path Computation Element Architecture to the Determination 1438 of a Sequence of Domains in MPLS and GMPLS", RFC 6805, 1439 DOI 10.17487/RFC6805, November 2012, 1440 . 1442 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 1443 S. Ray, "North-Bound Distribution of Link-State and 1444 Traffic Engineering (TE) Information Using BGP", RFC 7752, 1445 DOI 10.17487/RFC7752, March 2016, 1446 . 
[RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., Litkowski, S., Horneffer, M., and R. Shakir, "Source Packet Routing in Networking (SPRING) Problem Statement and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 2016, <https://www.rfc-editor.org/info/rfc7855>.

[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016, <https://www.rfc-editor.org/info/rfc7911>.

[RFC7926] Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G., Ceccarelli, D., and X. Zhang, "Problem Statement and Architecture for Information Exchange between Interconnected Traffic-Engineered Networks", BCP 206, RFC 7926, DOI 10.17487/RFC7926, July 2016, <https://www.rfc-editor.org/info/rfc7926>.

Authors' Addresses

Adrian Farrel
Juniper Networks

Email: afarrel@juniper.net

John Drake
Juniper Networks

Email: jdrake@juniper.net