SPRING Working Group                                          A. Farrel
Internet-Draft                                                 J. Drake
Intended status: Informational                         Juniper Networks
Expires: January 1, 2018                                  June 30, 2017

   Interconnection of Segment Routing Domains - Problem Statement and
                          Solution Landscape
            draft-farrel-spring-sr-domain-interconnect-00

Abstract

Segment Routing (SR) is now a popular forwarding paradigm for use in MPLS and IPv6 networks. It is typically deployed in discrete domains that may be data centers, access networks, or other networks that are under the control of a single operator and that can easily be upgraded to support this new technology.

Traffic originating in one SR domain often terminates in another SR domain, but must transit a backbone network that provides interconnection between those domains.

This document describes a mechanism for providing connectivity between SR domains to enable end-to-end or domain-to-domain traffic engineering.
The approach described here allows connectivity between SR domains, utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing) across the backbone network, and makes heavy use of pre-existing technologies, requiring the specification of very few additional mechanisms.

This document provides some background and a problem statement, explains the solution mechanism, and provides examples. It does not define any new protocol mechanisms.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 1, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Problem Statement
   3.  Solution Technologies
     3.1.  Characteristics of Solution Technologies
   4.  Decomposing the Problem
   5.  Solution Space
     5.1.  Global Optimization of the Paths
     5.2.  Figuring Out the GWs at a Destination Domain for a Given Prefix
     5.3.  Figuring Out the Backbone Egress ASBRs
     5.4.  Making use of RSVP-TE LSPs Across the Backbone
     5.5.  Data Plane
     5.6.  Centralized and Distributed Controllers
   6.  BGP-LS Considerations
   7.  Worked Examples
   8.  Label Stack Depth Considerations
     8.1.  Worked Example
   9.  Gateway Considerations
     9.1.  Domain Gateway Auto-Discovery
     9.2.  Relationship to BGP Link State and Egress Peer Engineering
     9.3.  Advertising a Domain Route Externally
     9.4.  Encapsulations
   10. Security Considerations
   11. Management Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. Informative References
   Authors' Addresses

1. Introduction

Data Centers are a growing market sector. They are being set up by new specialist companies, by enterprises for their own use, by legacy ISPs, and by the new wave of network operators such as Microsoft and Amazon.

The networks inside Data Centers are currently well-planned, but the traffic loads can be unpredictable. There is a need to be able to direct traffic within a Data Center to follow a specific path.

Data Centers are attached to external ("backbone") networks to allow access by users and to facilitate communication among Data Centers. An individual Data Center may be attached to multiple backbone networks, and may have multiple points of attachment to each backbone network. Traffic to or from a Data Center may need to be directed to or from any of these points of attachment.

A variety of networking technologies exist and have been proposed to steer traffic within the Data Center and across the backbone networks. This document proposes an approach that builds on existing technologies to produce mechanisms that provide scalable and flexible interconnection of Data Centers, and that will be easy to operate.

Segment Routing (SR) is a new technology that places forwarding state into each packet as a stack of loose hops, as distinct from other pre-existing techniques that require signaling protocols to install state in the network. SR is a popular option for building Data Centers, and is also seeing increasing traction in edge and access networks as well as in backbone networks.

This paper describes mechanisms to provide end-to-end SR connectivity between SR-capable domains across an MPLS backbone network that supports SR and/or MPLS-TE. This is the generalization of the requirement to provide inter-Data Center connectivity.

2. Problem Statement

Consider the network in Figure 1. Without loss of generality, this figure can be used to represent the architecture and problem space for steering traffic within and between SR edge domains. The figure shows a single destination for all traffic that we will consider. In this figure we distinguish between the PEs that provide access to the backbone networks and the Gateways that provide access to the SR edge domains: these may, in fact, be the same equipment, and the PEs might be located at the domain edges.

In describing the problem space and the solution we use the following terms for network elements:

SR edge domain : A collection of SR-capable nodes in an edge network attached to the backbone network through one or more gateways. Examples include access networks and Data Center sites.

Host : A node within an edge domain. May be an end system or a transit node in the edge domain.

Gateway (GW) : Provides access to or from an edge domain. Examples are CEs, ASBRs, and Data Center gateways.

Provider Edge (PE) : Provides access to or from the backbone network.

Autonomous System Border Router (ASBR) : Provides access to one AS in the backbone network from another AS in the backbone network.
These terms can be seen in use in Figure 1, where the various sources and destinations are hosts.

   -------------------------------------------------------------------
   | |
   | AS1 |
   | ---- ---- ---- ---- |
   -|PE1a|--|PE1b|-------------------------------------|PE2a|--|PE2b|-
   ---- ---- ---- ----
   : : ------------ ------------ : :
   : : | AS2 | | AS3 | : :
   : : | ------ ------ | : :
   : : | |ASBR2a|...|ASBR3a| | : :
   : : | ------ ------ | : :
   : : | | | | : :
   : : | ------ ------ | : :
   : : | |ASBR2b|...|ASBR3b| | : :
   : : | ------ ------ | : :
   : : | | | | : :
   : ......: | ---- | | ---- | : :
   : : -|PE2a|----- -----|PE3a|- : :
   : : ---- ---- : :
   : : ......: :....... : :
   : : : : : :
   ---- ---- ---- ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|-
   | ---- ---- | | ---- ---- |
   | | | |
   | | | |
   | | | Source3 |
   | Source2 | | |
   | | | Source4 |
   | Source1 | | |
   | | | Destination |
   | | | |
   | Dom1 | | Dom2 |
   ---------------- ----------------

   Figure 1: Reference Architecture for SR Domain Interconnect

Traffic to the destination may be sourced from multiple sources within that domain (we show two such sources: Source3 and Source4). Furthermore, traffic intended for the destination may arrive from outside the domain through any of the points of attachment to the backbone networks (we show GW2a and GW2b). This traffic may need to be steered within the domain to achieve load-balancing across network resources, to avoid degraded or out-of-service resources (including planned service outages), and to achieve different qualities of service. Of course, traffic in a remote source domain may also need to be steered within that domain. We class this problem as "Intra-Domain Traffic Steering".

Traffic across the backbone networks may need to be steered to conform to common Traffic Engineering paradigms. That is, the path across any network (shown in the figure as an AS) or across any collection of networks may need to be chosen. Furthermore, the points of inter-connection between networks may need to be selected, and that selection may influence the path chosen for the data. We class this problem as "Inter-Domain Traffic Steering".

The composite end-to-end path comprises steering in the source domain, choice of source domain exit point, steering across the backbone networks, choice of network interconnections, choice of destination domain entry point, and steering in the destination domain. These issues may be inter-dependent (for example, the best traffic steering in the source domain may help select the best exit point from that domain, but the connectivity options across the backbone network may drive the selection of a different exit point). We class this combination of problems as "End-to-End Domain Interconnect Traffic Steering".

It should be noted that the solution to the End-to-End Domain Interconnect Traffic Steering problem depends on a number of factors:

o What technology is deployed in the domains.

o What technology is deployed in the backbone networks.

o How much information the domains are willing to share with each other.

o How much information the backbone network operators and the domain operators are willing to share.
In some cases, the domains and backbone networks are all owned and operated by the same company (with the backbone network often being a private network). In other cases, the domains are operated by one company, with other companies operating the backbone.

3. Solution Technologies

Within the Data Center, Segment Routing (SR from the SPRING working group in the IETF [RFC7855] and [I-D.ietf-spring-segment-routing]) is becoming a dominant solution. SR introduces traffic steering capabilities into an MPLS network [I-D.ietf-spring-segment-routing-mpls] by utilizing existing data plane capabilities (label pop and packet forwarding - "pop and go") in combination with additions to existing IGPs [I-D.ietf-ospf-segment-routing-extensions], [I-D.ietf-isis-segment-routing-extensions], BGP (as BGP-LU) [I-D.ietf-mpls-rfc3107bis], or a centralized controller to distribute "per-hop" labels. An MPLS label stack can be imposed on a packet to describe a sequence of links/nodes to be transited by the packet; as each hop is transited, the label that represents it is popped from the stack and the packet is forwarded. Thus, on a packet-by-packet basis, traffic can be steered within the Data Center network.

Note that other Data Center data plane technologies also exist. While this document focuses on connecting domains that use MPLS Segment Routing, the techniques are equally applicable to non-MPLS domains (such as those using IP, VXLAN, and NVGRE). See Section 9 for details.

This document broadens the problem space to consider interconnection of any type of edge domain. These may be Data Center sites, but they may equally be access networks, VPN sites, or any other form of domain that includes packet sources and destinations. We particularly focus on "SR edge domains" being source or destination domains that utilize SR, but the domains could use other technologies as described in Section 9.

Backbone networks are commonly based on MPLS hardware. In these networks, a number of different options exist to establish TE paths. Among these options are static LSPs (perhaps set up by an SDN controller), LSP tunnels established using a signaling protocol (such as RSVP-TE), and inter-domain use of SR (as described above for intra-domain steering). Where traffic steering (without resource reservation) is needed, SR may be adequate. Where Traffic Engineering is needed (i.e., traffic steering with resource reservation), RSVP-TE or centralized SDN control is preferred. However, in a network that is fully managed and controlled through a centralized planning tool, resource reservation can be achieved and SR can be used for full Traffic Engineering. These solutions are already used in support of a number of edge-to-edge services such as L3VPN and L2VPN.

3.1. Characteristics of Solution Technologies

Each of the solution technologies mentioned in the previous section has certain characteristics, and the combined solution needs to recognize and address the characteristics in order to make a workable solution.

o When SR is used for traffic steering, the size of the MPLS label stack used in SR scales linearly with the length of the source route. This can cause issues with MPLS implementations that only support label stacks of a limited size.
For example, some MPLS implementations cannot push enough labels on the stack to represent an entire source route. Other implementations may be unable to do the proper "ECMP hashing" if the label stack is too long; they may be unable to read enough of the packet header to find an entropy label or to find the IP header of the payload. Increasing the packet header size also reduces the size of the payload that can be carried in an MPLS packet. There are techniques that can be used to reduce the size of the label stack. For example, a single label (known as a "binding SID") can be used to represent a sequence of nodes; this label can be replaced with a set of labels when the packet reaches the first node in the sequence. It is also possible to combine SR with conventional RSVP-TE by using a binding SID in the label stack to represent an LSP tunnel set up by RSVP-TE.

o Most of the work on using SR for traffic steering assumes that traffic only needs to be steered within a single administrative domain. If the backbone consists of multiple ASes that are not part of a common administrative domain, the use of SR across the backbone may prove to be a challenge, and its use in the backbone may be limited to cases where private networks connect the domains, rather than cases where the domains are connected by third-party network operators or by the public Internet.

o RSVP-TE has been used to provide edge-to-edge tunnels through which flows to/from many endpoints can be routed, and this provides a reduction in state while still offering Traffic Engineering across the backbone network. However, this requires O(n^2) connections, and as the number of edge domains increases this becomes unsustainable.

o A centralized control system, while capable of producing more optimal results than a distributed control system, may present challenges in large and dynamic networks. It relies on all network state being held centrally, and it is difficult to make central control as robust and self-correcting as distributed control.

This paper introduces an approach that blends the best points of each of these solution technologies to achieve a trade-off where RSVP-TE tunnels in the backbone network are stitched together using SR, and end-to-end SR paths can be created under the control of a central controller with routing devolved to the constituent networks where possible.

4. Decomposing the Problem

It is important to decompose the problem to take account of different regions spanned by the end-to-end path. These regions may use different technologies and may be under different administrative control. The separation of administrative control is particularly important because the operator of one region may be unwilling to share information about their networks, and may be resistant to allowing a third party to exert control over their network resources.

Using the reference model in Figure 1, we can consider how to get a packet from Source1 to the Destination. The following decisions must be made:

o In which domain the Destination lies.

o Which exit point from Dom1 to use.

o Which entry point to Dom2 to use.

o How to reach the exit point of Dom1 from Source1.

o How to reach the entry point to Dom2 from the exit point of Dom1.

o How to reach the Destination from the entry point to Dom2.
As already mentioned, these decisions may be inter-related. This enables us to break down the problem into three steps:

1. Get the packet from Source1 to the exit point of Dom1.

2. Get the packet from the exit point of Dom1 to the entry point of Dom2.

3. Get the packet from the entry point of Dom2 to the Destination.

The solution needs to achieve this in a way that allows:

o Adequate discovery of preferred elements in the end-to-end path (such as location of destination, destination domain entry point).

o Full control of the end-to-end path if all of the operators are willing.

o Re-use of existing techniques and technologies.

From a technology point of view we must support several functions and mixtures of those functions:

o If the domain uses MPLS Segment Routing, the labels within the domain may be populated by any means including BGP-LU [I-D.ietf-mpls-rfc3107bis], IGP, and central control. Source routes within the domain may be expressed as label stacks pushed by a controller or computed by a source router, or expressed as a single label and programmed into the domain routers by a controller.

o If the domain uses other (non-MPLS) forwarding, the domain processing is specific to that technology. See Section 9 for details.

o If the domains use Segment Routing, the source and destination domains may or may not be in the same Segment Routing domain, so that the prefix-SIDs may be the same or different in the two domains.

o The backbone network may be a single private network under the control of the owner of the domains and comprising one or more ASes, or may be a network operated by one or more third parties.

o The backbone network may utilize MPLS Traffic Engineering tunnels in conjunction with MPLS Segment Routing, and the domain-to-domain source route may be provided by stitching TE LSPs.

o A single controller may be used to handle the source and destination domains as well as the backbone network, or there may be a different controller for the backbone network separate from the one that controls the two domains, or there may be separate controllers for each network. The controllers may cooperate and share information to different degrees.

All of these different decompositions of the problem reflect different deployment choices and different commercial and operational practices, each with different functional trade-offs. For example, with separate controllers that do not share information and that only cooperate to a limited extent, it will be possible to achieve end-to-end connectivity with optimal routing at each step (domain or backbone AS), but the end-to-end path that is achieved might not be optimal.

5. Solution Space

5.1. Global Optimization of the Paths

Global optimization of the path from one domain to another requires either that the source controller has a complete view of the end-to-end topology or some form of cooperation between controllers (such as BRPC [RFC5441]).

BGP-LS [RFC7752] can be used to provide the "source" controller with a view of the topology of the backbone. This requires some of the BGP speakers in each AS to have BGP-LS sessions to the controller. Other means of obtaining this view are of course possible.
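As an illustration of the sort of processing such a "source" controller might perform, the Python sketch below merges per-AS BGP-LS exports into a single graph and computes a least-cost path over the merged view. This is only a sketch under assumed inputs: the record keys ("node", "neighbor", "te_metric", "adj_sid") are invented for the example and are not a protocol encoding.

   # Sketch: merge per-AS BGP-LS link exports and compute a path.
   import heapq

   def merge_topologies(bgp_ls_feeds):
       """Combine link records learned over BGP-LS sessions to each AS."""
       graph = {}
       for feed in bgp_ls_feeds:
           for link in feed:
               graph.setdefault(link["node"], []).append(
                   (link["neighbor"], link["te_metric"], link["adj_sid"]))
       return graph

   def least_cost_path(graph, src, dst):
       """Plain Dijkstra over the merged view; returns the SID list
       that the controller would tell the source to impose."""
       queue, seen = [(0, src, [])], set()
       while queue:
           cost, node, sids = heapq.heappop(queue)
           if node == dst:
               return sids
           if node in seen:
               continue
           seen.add(node)
           for neighbor, metric, sid in graph.get(node, ()):
               heapq.heappush(queue, (cost + metric, neighbor, sids + [sid]))
       return None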
5.2. Figuring Out the GWs at a Destination Domain for a Given Prefix

Suppose GW1 and GW2 both advertise a route to prefix X, each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically both routes do not get distributed across the backbone: only the "best" route, as selected by BGP, is distributed. But the best route according to the BGP selection process might not be the route via the GW that we want to use for traffic engineering purposes.

The obvious solution would be to use the ADD-PATH mechanism [RFC7911] to ensure that all routes to X get advertised. However, even if one does this, the identity of the GWs would get lost as soon as the routes got distributed through an ASBR that sets next hop self. And if there are multiple ASes in the backbone, not only will the next hop change several times, but the ADD-PATH mechanism experiences scaling issues. So this "obvious" solution only works within a single AS.

A better solution can be achieved using the Tunnel Encapsulation attribute [I-D.ietf-idr-tunnel-encaps] as follows:

We define a new tunnel type, "SR tunnel", and when the GWs to a given domain advertise a route to a prefix X within the domain, they each include a Tunnel Encapsulation attribute with multiple remote endpoint sub-TLVs, each identifying a specific GW to the domain.

In other words, each route advertised by any GW identifies all of the GWs to the same domain (see Section 9 for a discussion of how GWs discover each other). Therefore, only one of the routes needs to be distributed to other ASes, and it doesn't matter how many times the next hop changes: the Tunnel Encapsulation attribute (and its remote endpoint sub-TLVs) remains unchanged.

Further, when a packet destined for prefix X is sent on a TE path to GW1, we want the packet to arrive at GW1 carrying, at the top of its label stack, GW1's label for prefix X. To achieve this we will place the SID/SRGB in a sub-TLV of the Tunnel Encapsulation attribute. We will define the prefix-SID sub-TLV to be essentially identical in syntax to the prefix-SID attribute (see [I-D.ietf-idr-bgp-prefix-sid]), but the semantics are somewhat different.

It is also possible to define an "MPLS Label Stack" sub-TLV for the Tunnel Encapsulation attribute, and put this in the "SR tunnel" TLV. This allows the destination GW to specify a label stack that it wants packets destined for prefix X to have. This label stack represents a source route through the destination domain.

5.3. Figuring Out the Backbone Egress ASBRs

We need to figure out the backbone egress ASBRs that are attached to a given GW at the destination domain in order to properly engineer the path across the backbone.

The "cleanest" way to figure this out is to have the backbone egress ASBRs distribute the information to the source controller using the EPE extensions of BGP-LS [I-D.ietf-idr-bgpls-segment-routing-epe]. The EPE extensions to BGP-LS allow a BGP speaker to say, "Here is a list of my EBGP neighbors, and here is a (locally significant) adjacency-SID for each one."

It may also be possible to consider utilizing cooperating PCEs or a Hierarchical PCE approach [RFC6805]. But it should be observed that this question is dependent on the question in Section 5.2. That is, it is not possible to even start the selection of egress ASBRs until it is known which GWs at the destination domain provide access to a given prefix. Once that question has been answered, any number of PCE approaches can be used to select the right egress ASBR and, more generally, the ASBR path across the backbone.
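The Python sketch below shows how a controller might combine the information from Sections 5.2 and 5.3: the Tunnel Encapsulation attribute yields the set of GWs for a prefix, and the BGP-LS EPE reports yield the egress ASBRs (with peer-SIDs) that can reach each GW. The data structures are invented for the example, and the "SR tunnel" type is not yet allocated by IANA.

   # Sketch: pair destination-domain GWs with backbone egress ASBRs.
   def gws_for_prefix(route):
       """All GWs to the destination domain, taken from the remote
       endpoint sub-TLVs of the SR tunnel instances."""
       return [t["remote_endpoint"] for t in route["tunnel_encap"]
               if t["type"] == "SR-tunnel"]

   def candidate_exits(route, epe_reports):
       """Egress ASBRs that report a GW as an EBGP neighbor via the
       BGP-LS EPE extensions, with the peer-SID for each neighbor."""
       pairs = []
       for gw in gws_for_prefix(route):
           for asbr, neighbors in epe_reports.items():
               if gw in neighbors:
                   pairs.append((asbr, gw, neighbors[gw]))
       return pairs

   route = {"prefix": "X", "tunnel_encap": [
       {"type": "SR-tunnel", "remote_endpoint": "GW2a"},
       {"type": "SR-tunnel", "remote_endpoint": "GW2b"}]}
   epe = {"PE3a": {"GW2a": 9001}, "PE2b": {"GW2b": 9002}}
   print(candidate_exits(route, epe))
   # [('PE3a', 'GW2a', 9001), ('PE2b', 'GW2b', 9002)]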
5.4. Making use of RSVP-TE LSPs Across the Backbone

There are a number of ways to carry traffic across the backbone from one domain to another. RSVP-TE is a popular tunneling mechanism in similar scenarios (e.g., L3VPN) because it allows for reservation of resources as well as traffic steering.

A controller can cause an RSVP-TE LSP to be set up by using PCEP to talk to the LSP headend, using the PCEP extensions in [I-D.ietf-pce-pce-initiated-lsp]. That draft specifies an "LSP-initiate" message that the controller uses to specify the RSVP-TE LSP endpoints, the ERO, a "symbolic pathname", and optionally other attributes (specified in the PCEP specification, RFC 5440 [RFC5440]) such as bandwidth.

When the headend receives an LSP-initiate message, it sets up the RSVP-TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to the controller in a PCRpt message [I-D.ietf-pce-stateful-pce]. The PCRpt message also contains the symbolic name that the controller assigned to the LSP, as well as containing some information identifying the LSP-initiate message from the controller, and details of exactly how the LSP was set up (RRO, bandwidth, etc.).

The headend can add to the PCRpt message a TE-PATH-BINDING TLV [I-D.sivabalan-pce-binding-label-sid]. This allows the headend to assign a "binding SID" to the LSP, and to report to the controller that a particular binding SID corresponds to a particular LSP. The binding SID is locally scoped to the headend.

The controller can make this label be part of the label stack that it tells the source (or the GW at the source domain) to put on the data packets being sent to prefix X. When the headend receives a packet with this label at the top of the stack it will send the packet onward on the LSP.
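The exchange just described can be paraphrased in a few lines of Python. This is a sketch of the information flow only: the field names are invented, and no real PCEP message encoding is attempted.

   # Sketch: LSP-initiate answered by a report carrying a binding SID.
   class Headend:
       def __init__(self):
           self.lsps, self.next_plsp_id, self.next_bsid = {}, 1, 24000

       def lsp_initiate(self, endpoints, ero, symbolic_name):
           """Set up the RSVP-TE LSP and report back to the controller."""
           plsp_id, bsid = self.next_plsp_id, self.next_bsid
           self.next_plsp_id, self.next_bsid = plsp_id + 1, bsid + 1
           self.lsps[bsid] = {"endpoints": endpoints, "ero": ero}
           return {"plsp_id": plsp_id,            # reported in the PCRpt
                   "symbolic_name": symbolic_name,
                   "te_path_binding": bsid}       # TE-PATH-BINDING TLV

   pe2a = Headend()
   report = pe2a.lsp_initiate(("PE2a", "ASBR2a"), ["hop1", "hop2"],
                              "lsp-to-asbr2a")
   # The controller can now place report["te_path_binding"] in the
   # label stack that it tells the source (or the source-domain GW)
   # to impose on packets sent to prefix X.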
5.5. Data Plane

Consolidating all of the above, consider what happens when we want to move a data packet from Source1 to the Destination in Figure 1 via the following source route:

Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination

Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a that we want to use, as well as an RSVP-TE LSP from ASBR3a to PE3a that we want to use.

Let's suppose that Source1 pushes a label stack following instructions from the controller (for example, using BGP-LU [I-D.ietf-mpls-rfc3107bis]). We won't worry for now about source routing through the domains themselves: that is, in practice there may be additional labels in the stack to cover the source route from Source1 to GW1b and from GW2a to the Destination, but we will focus only on the labels necessary to leave the source domain, traverse the backbone, and enter the egress domain. So we only care what the stack looks like when the packet gets to GW1b.

When the packet gets to GW1b, the stack should have six labels:

Top Label:

Peer-SID or adjacency-SID identifying the link or links to PE2a. These SIDs are distributed from GW1b to the controller via the EPE extensions of BGP-LS. (This label will get popped by GW1b, which will then send the packet to PE2a.)

Second Label:

Binding SID advertised by PE2a to the controller for the RSVP-TE LSP to ASBR2a. This binding SID is advertised via the PCEP extensions discussed above. (This label will get swapped by PE2a for the label that the LSP's next hop has assigned to the LSP.)

Third Label:

Peer-SID or adjacency-SID identifying the link or links to ASBR3a, as advertised to the controller by ASBR2a using the BGP-LS EPE extensions. (This label gets popped by ASBR2a, which then sends the packet to ASBR3a.)

Fourth Label:

Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a. This binding SID is advertised via the PCEP extensions discussed above. ASBR3a treats this label just like PE2a treated the second label above.

Fifth Label:

Peer-SID or adjacency-SID identifying the link or links to GW2a, as advertised to the controller by PE3a using the BGP-LS EPE extensions. PE3a pops this label and sends the packet to GW2a.

Sixth Label:

Prefix-SID or other label identifying the Destination advertised in a Tunnel Encapsulation attribute by GW2a. (This can be omitted if GW2a is happy to accept IP packets, or prefers a VXLAN tunnel for example. That would be indicated through the Tunnel Encapsulation attribute of course.)

Note that the size of the label stack is proportional to the number of RSVP-TE LSPs that get stitched together by SR.

See Section 7 for some detailed examples that show the concrete use of labels in a sample topology.

In the above example, all labels except the sixth are locally significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs. Only the sixth label, a prefix-SID, has a domain-wide unique value. To impose that label, the source needs to know the SRGB of GW2a. If all nodes have the same SRGB, this is not a problem. Otherwise, there are a number of different ways GW2a can advertise its SRGB. This can be done via the segment routing extensions of BGP-LS, or it can be done using the prefix-SID attribute or BGP-LU [I-D.ietf-mpls-rfc3107bis], or it can be done using the BGP Tunnel Encapsulation attribute. The exact technique to be used will depend on the details of the deployment scenario.

The reason the above example is primarily based on locally significant labels is that it creates a "strict source route", and it presupposes the EPE extensions of BGP-LS. In some scenarios, the EPE extension to BGP-LS might not be available (or BGP-LS might not be available at all). In other scenarios, it may be desirable to steer a packet through a "loose source route". In such scenarios, the label stack imposed by the source will be based upon a sequence of domain-wide unique "node-SIDs", each representing one of the hops of the source route. Each label has to be computed by adding the corresponding node-SID to the SRGB of the node that will act upon the label. One way to learn the node-SIDs and SRGBs is to use the segment routing extensions of BGP-LS. Another way is to use BGP-LU as follows. Each node that may be part of a source route would originate a BGP-LU route with one of its own loopback addresses as the prefix. The BGP prefix-SID attribute would be attached to this route. The prefix-SID attribute would contain a SID, which is the domain-wide unique SID corresponding to the node's loopback address. The attribute would also contain the node's SRGB.
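A hypothetical worked example of that label arithmetic follows: each label in a loose source route is the SRGB base of the node that will act upon the label, plus the node-SID index of the hop the label names. The SRGB bases and SID indices below are invented.

   # Sketch: computing a loose source route from node-SIDs and SRGBs.
   srgb_base = {"GW1b": 16000, "PE2a": 20000}    # learned via BGP-LS
   node_sid  = {"PE2a": 21, "PE3a": 31}          # or BGP-LU, as above

   def loose_route_stack(hops):
       """Label for each hop, computed against the SRGB of the node
       that will act on that label (the previous node in the list)."""
       return [srgb_base[acting] + node_sid[target]
               for acting, target in zip(hops, hops[1:])]

   print(loose_route_stack(["GW1b", "PE2a", "PE3a"]))
   # [16021, 20031]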
While this technique is useful when BGP-LS is not available, it presupposes that the source controller has some other means of discovering the topology. In this document, we focus primarily on the scenario where BGP-LS, rather than BGP-LU, is used.

5.6. Centralized and Distributed Controllers

A controller or set of controllers is needed to collate topology and TE information from the constituent networks, to apply policies and service requirements to compute paths across those networks, to select an end-to-end path, and to program key nodes in the network to take the right forwarding actions (pushing label stacks, stitching LSPs, forwarding traffic).

o It is commonly understood that a fully optimal end-to-end path can only be computed with full knowledge of the end-to-end topology and available Traffic Engineering resources. Thus, one option is for all information about the domain networks and backbone network to be collected by a central controller that makes all path computations and is responsible for issuing the necessary programming commands. Such a model works best when there is no commercial or administrative impediment (for example, where the domains and the backbone network are owned and operated by the same organization). There may, however, be some scaling concerns if the component networks are large.

In this mode of operation, each network may use BGP-LS to export Traffic Engineering and topology information to the central controller, and the controller may use PCEP to program the network behavior.

o A similar centralized control mechanism can be used with a scalability improvement that risks a reduction in optimality. In this case, the domain networks can export to the controller just the feasibility of connectivity between data source/sink and gateway, perhaps enhancing this with some information about the Traffic Engineering metrics of the path.

This approach allows the central controller to understand the end-to-end path that it is selecting, but not to control it fully. The source route from data source to domain egress gateway is left to the source host or a controller in the source domain, while the source route from domain ingress gateway to destination is left as a decision for the domain ingress gateway or to a controller in the destination domain.

This mode of operation still leaves overall control with a centralized server, and that may not be considered suitable when there is separate commercial or administrative control of the networks.

o When there is separate commercial or administrative control of the networks, the domain operator will not want the backbone operator to have control of the source routes within the domain and may be reluctant to disclose any information about the topology or resource availability within the domains. Conversely, the backbone operator may be very unwilling to allow the domain operator (a customer) any control over or knowledge about the backbone network.
This "problem" has already been solved for Traffic Engineering in MPLS networks that span multiple administrative domains, and leads to multiple potential solutions:

* Per-domain path computation [RFC5152] can be seen as "best effort optimization". In this mode the controller for each domain is responsible for finding the best path to the next domain, but has no way of knowing which is the best exit point from the local domain. The resulting path may end up significantly sub-optimal or even blocked.

* Backward recursive path computation (BRPC) [RFC5441] is a mechanism that allows controllers to cooperate across a small set of domains (such as ASes) to build a tree of possible paths and so allow the controller for the ingress domain to select the optimal path. The details of the paths within each domain that might reveal confidential information can be hidden using Path Keys [RFC5520]. BRPC produces optimal paths but scales poorly with an increase in domains and with an increase in connectivity between domains. It can also lead to slow computation times.

* Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation process between PCEs. The child PCEs remain responsible for computing paths across their domains, and they coordinate with a parent PCE that stitches these paths together to form the end-to-end path. This approach has many similarities with BRPC but can scale better through the maintenance of a "domain topology" that shows how the domains are interconnected, and through the ability to pipe-line computation requests to all of the child domains. It has the drawback that some party has to own and operate the parent PCE.

* An alternative approach is documented by the TEAS working group [RFC7926]. In this model each network advertises to controllers for adjacent networks (using BGP-LS) selected information about potential connectivity across the network. It does not have to show full topology and can make its own decisions about which paths it considers optimal for use by its different neighbors and customers. This approach is suitable for the End-to-End Domain Interconnect Traffic Steering problem where the backbone is under different control from the domains because it allows the overlay nature of the use of the backbone network to be treated as a peer network relationship by the controllers of the domains - the domains can be operated using a single controller or a separate controller for each domain.

It is also possible to operate domain interconnection when some or all domains do not have a controller. Segment Routing is capable of routing a packet toward the next hop based on the top label on the stack, and that label does not need to indicate an immediately adjacent node or link. In these cases, the packet may be forwarded untouched, or the forwarding router may impose a locally-determined additional set of labels that define the path to the next hop.

PCE can be used to instruct the source host or a transit node on what label stacks to add to packets.
That is, a node that needs to impose labels (either to start routing the packet from the source host, or to advance the packet from a transit router toward the destination) can determine the label stack to use based on local function or can have that stack supplied by a PCE. The PCE Protocol (PCEP) has been extended to allow the PCE to supply a label stack for reaching a specific destination either in response to a request or in an unsolicited manner [I-D.ietf-pce-segment-routing].

6. BGP-LS Considerations

This section gives an overview of the use of BGP-LS to export an abstraction (or summary) of the connectivity across the backbone network by means of two figures that show different views of a sample network.

Figure 2 shows a more complex reference architecture.

Figure 3 represents the minimum set of nodes and links that need to be advertised in BGP-LS with SR in order to perform Domain Interconnect with traffic engineering across the backbone network: the PEs, ASBRs, and gateways (GWs), and the links between them. In particular, EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with associated segment IDs is advertised in BGP-LS with SR.

Links that are advertised may be physical links, links realized by LSP tunnels, or abstract links. It is assumed that intra-AS links are either real links, RSVP-TE LSPs with allocated bandwidth, or SR TE policies as described in [I-D.previdi-idr-segment-routing-te-policy]. Additional nodes internal to an AS and their links to PEs, ASBRs, and/or GWs may also be advertised (for example to avoid full mesh problems).

   -------------------------------------------------------------------
   | |
   | AS1 |
   | ---- ---- ---- ---- |
   -|PE1a|--|PE1b|-------------------------------------|PE2a|--|PE2b|-
   ---- ---- ---- ----
   : : ------------ ------------ : : :
   : : | AS2 | | AS3 | : : :
   : : | ------.....------ | : : :
   : : | |ASBR2a| |ASBR3a| | : : :
   : : | ------ ..:------ | : : :
   : : | | : | | : : :
   : : | ------..: ------ | : : :
   : : | |ASBR2b|...|ASBR3b| | : : :
   : : | ------ ------ | : : :
   : : | | | | : : :
   : : | | ------ | : : :
   : : | | ..|ASBR3c| | : : :
   : : | | : ------ | : ....: :
   : ......: | ---- | : | ---- | : : :
   : : -|PE2a|----- : -----|PE3b|- : : :
   : : ---- : ---- : : :
   : : .......: : :....... : : :
   : : : ------ : : : :
   : : : ----|ASBR4b|---- : : : :
   : : : | ------ | : : : :
   : : : ---- | : : : :
   : : : .........|PE4b| AS4 | : : : :
   : : : : ---- | : : : :
   : : : : | ---- | : : : :
   : : : : -----|PE4a|----- : : : :
   : : : : ---- : : : :
   : : : : ..: :.. : : : :
   : : : : : : : : : :
   ---- ---- ---- ---- ----: ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|-
   | ---- ---- | | ---- ---- | | ---- ---- |
   | | | | | |
   | | | | | |
   | Host1a Host1b | | Host2a Host2b | | Host3a Host3b |
   | | | | | |
   | | | | | |
   | Dom1 | | Dom2 | | Dom3 |
   ---------------- ---------------- ----------------

   Figure 2: Network View of Example Configuration

   .............................................................
   : :
   ---- ---- ---- ----
   |PE1a| |PE1b|.....................................|PE2a| |PE2b|
   ---- ---- ---- ----
   : : : : :
   : : : : :
   : : ------.....------ : : :
   : : ......|ASBR2a| |ASBR3a|...... : : :
   : : : ------ ..:------ : : : :
   : : : : : : : :
   : : : ------..: ------ : : : :
   : : : ...|ASBR2b|...|ASBR3b| : : : :
   : : : : ------ ------ : : : :
   : : : : : : : : :
   : : : : ------ : : : :
   : : : : ..|ASBR3c|... : : : :
   : : : : : ------ : : : ....: :
   : ......: ---- : ---- : : :
   : : |PE2a| : |PE3b| : : :
   : : ---- : ---- : : :
   : : .......: : :....... : : :
   : : : ------ : : : :
   : : : |ASBR4b| : : : :
   : : : ------ : : : :
   : : : ---- : : : : :
   : : : .........|PE4b|..... : : : : :
   : : : : ---- : : : : : :
   : : : : ---- : : : :
   : : : : |PE4a| : : : :
   : : : : ---- : : : :
   : : : : ..: :.. : : : :
   : : : : : : : : : :
   ---- ---- ---- ---- ----: ----
   -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|-
   | ---- ---- | | ---- ---- | | ---- ---- |
   | | | | | |
   | | | | | |
   | Host1a Host1b | | Host2a Host2b | | Host3a Host3b |
   | | | | | |
   | | | | | |
   | Dom1 | | Dom2 | | Dom3 |
   ---------------- ---------------- ----------------

   Figure 3: Topology View of Example Configuration

A node (a PCE, router, or host) that is computing a full or partial path correlates the topology information disseminated in BGP-LS with SR with the information advertised with the Tunnel Encapsulation attributes to compute that path and obtain the SIDs for the elements on that path. In order to allow a source host to compute exit points from its domain, some subset of the above information needs to be disseminated within that domain.

What is advertised external to a given AS is controlled by policy at the ASes' PEs, ASBRs, and GWs. Central control of what each node should advertise, based upon analysis of the network as a whole, is an important additional function. This and the amount of policy involved may make the use of a Route Reflector an attractive option.

The configuration of which links to other nodes, and which characteristics of those links, a given node advertises in BGP-LS with SR is done locally at each node; pairwise coordination between link end-points is required to ensure consistency.

Path Weighted ECMP (PWECMP) is assumed to be used by a GW for a given source domain to send all flows to a given destination domain using all paths in the backbone network to that destination domain in proportion to the minimum bandwidth on each path. PWECMP is also assumed to be used by hosts within a source domain to send flows to that domain's GWs.

7. Worked Examples

Figure 4 shows a view of the links, paths, and labels that can be assigned to part of the sample network shown in Figure 2 and Figure 3. The double-dash lines (===) indicate LSP tunnels across backbone ASes and dotted lines (...) are physical links.

At each node, a label may be assigned to each outgoing link. This is shown in Figure 4. For example, at GW1a the label L201 is assigned to the link connecting GW1a to PE1a. At PE1c, the label L302 is assigned to the link connecting PE1c to GW3b. Labels ("binding SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c.

At the destination domain, labels L302 and L305 are "node-SIDs"; they represent GW3b and Host3b respectively, rather than representing particular links.
When a node processes a packet, the label at the top of the label stack indicates the link (or RSVP-TE LSP) on which that node is to transmit the packet. The node pops that label off the label stack before transmitting the packet on the link. However, if the top label is a node-SID, the node processing the packet is expected to transmit the packet on whatever link it regards as the shortest path to the node represented by the label.

   ---- L202 ----
   | |=======================================================| |
   |PE1a| |PE1c|
   | |=======================================================| |
   ---- L203 ----
   : : :
   : ---- L205 ---- : :
   : |PE1b|============================================|PE1d| : :
   : ---- ---- : :
   : : : : :
   : : : : :
   : : ---- L207 ------ L209 ------ L303: : :
   :L201 : | |======|ASBR2a|......| | : : :
   : : | | ------ | | L210 ---- : : :
   : : |PE2a| |ASBR3a|======|PE3b| : : :
   : : | | L208 ------ L211 | | ---- : : :
   : : | |======|ASBR2b|......| | : : : :
   : L204: ---- ------ ------ ...: : : :
   : : : : : : :
   : ....: : : .......: : :
   : : : : : : :
   : : :L206 L301: : .........: :
   : : : : : : L304 :
   : : ....: : : : ....:
   : : : : : : : L302
   ---- ---- ----- ----
   -|GW1a|--|GW1b|- -|GW3a |--|GW3b|-
   | ---- ---- | | ----- ---- |
   | : : | | : : |
   |L103: :L102| | L303: :L304|
   | : : | | : : |
   | N1 N2 | | N3 N4 |
   | :.. ..: | | : ....: |
   | L101 : : | | : : |
   | Host1a | | Host3b (L305) |
   | | | |
   | Dom1 | | Dom3 |
   ---------------- -----------------

   Figure 4: Tunnels and Labels in Example Configuration

Let's consider several different possible ways to direct a packet from Host1a in Dom1 to Host3b in Dom3.

a. Full source route imposed at source

In this case it is assumed that the entity responsible for determining an end-to-end path has access to the topologies of both domains and of the backbone network. This might happen if all of the networks are owned by the same operator in which case the information can be shared into a single database for use by an offline tool, or the information can be distributed using routing protocols such that the source host can see enough to select the path. Alternatively, the end-to-end path could be produced through cooperation between computation entities each responsible for different domains along the path.

If the path is computed externally it is pushed to the source host. Otherwise, it is computed by the source host itself.

Suppose it is desired for a packet from Host1a to travel to Host3b via the following source route:

Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->N4->Host3b

Host1a would impose the following label stack (with the first label representing the top of stack) and then send the packet to N1:

L103, L201, L202, L302, L304, L305

N1 sees L103 at the top of the stack, so it pops the stack and forwards the packet to GW1a. GW1a sees L201 at the top of the stack, so it pops the stack and forwards the packet to PE1a. PE1a sees L202 at the top of the stack, so it pops the stack and forwards the packet over the RSVP-TE LSP to PE1c. As the packet travels over this LSP, its top label will be an RSVP-TE signaled label representing the LSP. That is, PE1a imposes an additional label stack entry for the tunnel LSP.
At the end of the LSP tunnel, the MPLS tunnel label will be popped, and PE1c will see L302 at the top of the stack. PE1c pops the stack and forwards the packet to GW3b. GW3b will see L304 at the top of the stack, so it pops the stack and forwards the packet to N4. Finally, N4 sees L305 at the top of the stack, so it pops the stack and forwards the packet to Host3b.

b. No remote visibility into Dom3

It is possible that the source domain does not have visibility into the destination domain. This occurs if the destination domain does not export its topology, but even in this case, it will export reachability information so that the source host or the path computation entity will know:

* The GWs through which the destination can be reached.

* The SID to use for the destination prefix.

Suppose we want a packet to follow the source route:

Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->...->Host3b

(The ellipsis indicates a part of the path that is not explicitly specified.) Thus, the label stack imposed at the source host would be:

L103, L201, L202, L302, L305

Processing is as per case a., but when the packet reaches the GW of the destination domain, it can either simply forward the packet along the shortest path to Host3b, or it can insert additional labels to direct the path to the destination.

c. Dom1 only has reachability information

The source domain (or the path computation entity) may be further restricted in its view of the network. It is possible that it knows the location of the destination in the destination domain, and knows the GWs to the destination domain that provide reachability to the destination, but that it has no view of the backbone network. This leads to the packet being forwarded in a manner similar to 'per-domain path computation' described in Section 5.6.

At the source host a simple label stack is imposed navigating the domain and indicating the destination GW and the destination host.

L101, L103, L302, L305

As the packet leaves the source domain, the source GW determines the PE to use to enter the backbone using nothing more than the BGP preferred route to the destination GW.

When the packet reaches the first PE it has a label stack just identifying the destination GW and host (L302, L305). The PE uses information it has about the backbone network topology and available LSPs to select an LSP tunnel, impose the tunnel label, and forward the packet.

When the packet reaches the end of the LSP tunnel, it is processed as described in case b.

d. Stitched LSPs across the backbone

A variant of all these cases arises when the packet is sent using a path that spans multiple ASes. For example, one that crosses AS2 and AS3 as shown in Figure 2.

In this case, basing the example on case a., the source host would impose the label stack:

L102, L206, L207, L209, L210, L301, L303, L305

and would then send the packet to N2.

The packet reaches PE2a as previously described, and the top label (L207) selects an LSP tunnel that leads to ASBR2a. At the end of that LSP tunnel the next label (L209) routes the packet from ASBR2a to ASBR3a, where the next label (L210) identifies the next LSP tunnel to use. Thus, SR has been used to stitch together LSPs to make a longer path segment. As the packet emerges from the final LSP tunnel, forwarding continues as previously described.
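The pop-and-forward behavior in these worked examples can be captured in a short, table-driven Python sketch. The table below paraphrases Figure 4 for case a.; an RSVP-TE LSP is modeled as a single hop rather than as the hop-by-hop signaled labels that would be seen in a real network.

   # Sketch: walking the case a. label stack through Figure 4.
   forwarding = {                        # (node, label) -> next hop
       ("N1",   "L103"): "GW1a",
       ("GW1a", "L201"): "PE1a",
       ("PE1a", "L202"): "PE1c",         # binding SID for the LSP tunnel
       ("PE1c", "L302"): "GW3b",
       ("GW3b", "L304"): "N4",
       ("N4",   "L305"): "Host3b",
   }

   def forward(node, stack):
       """Pop the top label at each node and forward on the indicated
       link (or LSP), as described in Section 7."""
       path = [node]
       while stack:
           node = forwarding[(node, stack.pop(0))]
           path.append(node)
       return path

   print(" -> ".join(forward("N1", ["L103", "L201", "L202",
                                    "L302", "L304", "L305"])))
   # N1 -> GW1a -> PE1a -> PE1c -> GW3b -> N4 -> Host3b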
8. Label Stack Depth Considerations

As described in Section 3.1, one of the issues with a Segment Routing approach is that the label stack can get large, for example when the source route becomes long. A mechanism to mitigate this problem is needed if the solution is to be fully applicable in all environments.

An Internet-Draft called "Segment Routing Traffic Engineering Policy using BGP" [I-D.previdi-idr-segment-routing-te-policy] introduces the concept of hierarchical source routes as a way to compress source route headers. It functions by having the egress node for a set of source routes advertise those source routes along with an explicit request that each node that is an ingress node for one or more of those source routes should advertise a binding SID for the set of source routes for which it is the ingress. (It should be noted that the set of source routes can either be advertised by the egress node as described here, or could be advertised by a controller on behalf of the egress node.) Such an ingress node advertises its set of source routes and a binding SID as an adjacency in BGP-LS as described in Section 6. These source routes represent the weighted ECMP paths between the ingress node and the egress node. (Note also that the binding SID may be supplied by the node that advertises the source routes - the egress or the controller - or may be chosen by the ingress node.)

A remote node that wishes to reach the egress node would then construct a source route consisting of the segment IDs necessary to reach one of the ingress nodes for the path it wishes to use, along with the binding SID that the ingress node advertised to identify the set of paths. When the selected ingress node receives a packet with a binding SID it has advertised, it replaces the binding SID with the labels for one of its source routes to the egress node (it will choose one of the source routes in the set according to its own weighting algorithms and policy).

8.1. Worked Example

Consider the topology in Figure 4. Suppose that it is desired to construct full segment routed paths from ingress to egress, but that the resulting label stack (segment route) is too large. In this case the gateways to Dom3 (GW3a and GW3b) can advertise all of the source routes from the gateways to Dom1 (GW1a and GW1b). The gateways to Dom1 then assign binding SIDs to those source routes and advertise those SIDs into BGP-LS.

Thus, GW3b would advertise the two source routes (L201, L202, L302 and L201, L203, L302), and GW1a would advertise into BGP-LS its adjacency to GW3b along with a binding SID. Should Host1a wish to send a packet via GW1a and GW3b, it can include L103 and this binding SID in the source route. GW1a is free to choose which source route to use between itself and GW3b using its weighted ECMP algorithm.

Similarly, GW3a would advertise the following set of source routes:

o L201, L202, L304

o L201, L203, L304

o L204, L205, L303

o L206, L207, L209, L210, L301

o L206, L208, L211, L210, L301

GW1a would advertise a binding SID for the first three, and GW1b would advertise a binding SID for the other two.
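A sketch of the binding-SID expansion performed by GW1a in this example is shown below. The weights are invented: a real GW would choose among the bound source routes according to its own weighted ECMP algorithm and policy, as noted above.

   # Sketch: GW1a replaces a binding SID with one bound source route.
   import random

   bound_routes = {                       # advertised into BGP-LS
       "B-GW3b": [(["L201", "L202", "L302"], 1),    # (labels, weight)
                  (["L201", "L203", "L302"], 1)],
   }

   def expand(stack):
       """If the top label is a binding SID that this GW advertised,
       replace it with one of the corresponding source routes."""
       if stack and stack[0] in bound_routes:
           routes = bound_routes[stack[0]]
           labels = random.choices([r for r, _ in routes],
                                   weights=[w for _, w in routes])[0]
           return labels + stack[1:]
       return stack

   # Host1a imposes (L103, B-GW3b); N1 pops L103 and delivers to GW1a.
   print(expand(["B-GW3b"]))   # e.g. ['L201', 'L202', 'L302']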
9. Gateway Considerations

As described in Section 5, we define a new tunnel type, "SR tunnel". When the GWs to a given domain advertise a route to a prefix X within the domain, they will each include a Tunnel Encapsulation attribute with multiple tunnel instances, each of type "SR tunnel": one for each GW, each containing a Remote Endpoint sub-TLV with that GW's address.

In other words, each route advertised by any GW identifies all of the GWs to the same domain.

Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute (and its Remote Endpoint sub-TLVs) will remain unchanged.

9.1. Domain Gateway Auto-Discovery

To allow a given domain's GWs to auto-discover each other and to coordinate their operations, the following procedures are implemented [I-D.drake-bess-datacenter-gateway]:

o Each GW is configured with an identifier for the domain that is common across all GWs to the domain (i.e., the same identifier is used by all GWs to the same domain) and unique across all domains that are connected.

o A route target [RFC4360] is attached to each GW's auto-discovery route and has its value set to the domain identifier.

o Each GW constructs an import filtering rule to import any route that carries a route target with the same domain identifier that the GW itself uses. This means that only these GWs will import those routes and that all GWs to the same domain will import each other's routes and will learn (auto-discover) the current set of active GWs for the domain.

o The auto-discovery route each GW advertises consists of the following:

* An IPv4 or IPv6 NLRI containing one of the GW's loopback addresses (that is, with an AFI/SAFI that is one of 1/1, 2/1, 1/4, or 2/4).

* A Tunnel Encapsulation attribute containing the GW's encapsulation information, which at a minimum consists of an SR tunnel TLV (type to be allocated by IANA) with a Remote Endpoint sub-TLV [I-D.ietf-idr-tunnel-encaps].

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW itself, the GW should use a different loopback address for this purpose.

Each GW will include an SR tunnel instance for each GW that is active for the domain (including itself) in the Tunnel Encapsulation attribute of every route it advertises externally to the domain. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW), each externally advertised route will be re-advertised with the set of SR tunnel instances reflecting the current set of active GWs.

9.2. Relationship to BGP Link State and Egress Peer Engineering

When a remote GW receives a route to a prefix X, it can use the SR tunnel instances within the contained Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR TE paths across the backbone network from the information advertised to it in BGP-LS with SR extensions [I-D.gredler-idr-bgp-ls-segment-routing-ext], correlated using the domain identity. SR Egress Peer Engineering (EPE) [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement the information advertised in BGP-LS.
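The following Python sketch shows, under assumed data structures, how a remote GW might correlate the Remote Endpoint sub-TLVs from a received route's SR tunnel instances with BGP-LS node information keyed by the domain identifier, in order to list the candidate GWs through which prefix X is reachable. The class definitions and addresses are hypothetical; a real implementation would work from the actual parsed BGP and BGP-LS data.

   from dataclasses import dataclass

   @dataclass
   class SRTunnelInstance:
       remote_endpoint: str   # GW address from the Remote Endpoint sub-TLV

   @dataclass
   class BgplsNode:
       address: str
       domain_id: str         # correlates the node with its domain

   def candidate_gws(route_tunnels, bgpls_nodes, domain_id):
       """Return the GW addresses from the route's SR tunnel instances
       that BGP-LS confirms as belonging to the given domain."""
       in_domain = {n.address for n in bgpls_nodes if n.domain_id == domain_id}
       return [t.remote_endpoint
               for t in route_tunnels if t.remote_endpoint in in_domain]

   # Example: a route to X carries tunnel instances for two GWs, and
   # BGP-LS shows both as members of domain "Dom3" (addresses are
   # from the documentation range and purely illustrative).
   tunnels = [SRTunnelInstance("192.0.2.31"), SRTunnelInstance("192.0.2.32")]
   nodes = [BgplsNode("192.0.2.31", "Dom3"), BgplsNode("192.0.2.32", "Dom3")]
   print(candidate_gws(tunnels, nodes, "Dom3"))

Each returned GW is then a potential tail end for an SR TE path computation across the backbone, optionally refined with EPE information.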
9.3. Advertising a Domain Route Externally

When a packet destined for prefix X is sent on an SR TE path to a GW for the domain containing X, it needs to carry the receiving GW's label for X such that this label rises to the top of the stack before the GW completes its processing of the packet. To achieve this, we place a Prefix-SID sub-TLV for X in each SR tunnel instance in the Tunnel Encapsulation attribute in the externally advertised route for X.

Alternatively, if the GWs for a given domain are configured to allow remote GWs to perform SR TE through that domain for a prefix X, then each GW computes an SR TE path through that domain to X from each of the currently active GWs and places each path in an MPLS Label Stack sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for that GW.

9.4. Encapsulations

If the GWs for a given domain are configured to allow remote GWs to send them packets in that domain's native encapsulation, then each GW will also include in externally advertised routes multiple instances of a tunnel TLV for that native encapsulation: one for each GW, each containing a Remote Endpoint sub-TLV with that GW's address. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the tunnel TLV instances.

10. Security Considerations

TBD

11. Management Considerations

TBD

12. IANA Considerations

This document makes no requests for IANA action.

13. Acknowledgements

TBD

14. Informative References

[I-D.drake-bess-datacenter-gateway]
Drake, J., Farrel, A., Rosen, E., Patel, K., and L. Jalil, "Gateway Auto-Discovery and Route Advertisement for Segment Routing Enabled Data Center Interconnection", draft-drake-bess-datacenter-gateway-03 (work in progress), April 2017.

[I-D.gredler-idr-bgp-ls-segment-routing-ext]
Previdi, S., Psenak, P., Filsfils, C., Gredler, H., Chen, M., and J. Tantsura, "BGP Link-State extensions for Segment Routing", draft-gredler-idr-bgp-ls-segment-routing-ext-04 (work in progress), October 2016.

[I-D.ietf-idr-bgp-prefix-sid]
Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A., and H. Gredler, "Segment Routing Prefix SID extensions for BGP", draft-ietf-idr-bgp-prefix-sid-06 (work in progress), June 2017.

[I-D.ietf-idr-bgpls-segment-routing-epe]
Previdi, S., Filsfils, C., Patel, K., Ray, S., and J. Dong, "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", draft-ietf-idr-bgpls-segment-routing-epe-13 (work in progress), June 2017.

[I-D.ietf-idr-tunnel-encaps]
Rosen, E., Patel, K., and G. Van de Velde, "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-06 (work in progress), June 2017.

[I-D.ietf-isis-segment-routing-extensions]
Previdi, S., Filsfils, C., Bashandy, A., Gredler, H., Litkowski, S., Decraene, B., and J. Tantsura, "IS-IS Extensions for Segment Routing", draft-ietf-isis-segment-routing-extensions-13 (work in progress), June 2017.

[I-D.ietf-mpls-rfc3107bis]
Rosen, E., "Using BGP to Bind MPLS Labels to Address Prefixes", draft-ietf-mpls-rfc3107bis-02 (work in progress), May 2017.

[I-D.ietf-ospf-segment-routing-extensions]
Psenak, P., Previdi, S., Filsfils, C., Gredler, H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF Extensions for Segment Routing", draft-ietf-ospf-segment-routing-extensions-17 (work in progress), June 2017.
Tantsura, "OSPF 1365 Extensions for Segment Routing", draft-ietf-ospf-segment- 1366 routing-extensions-17 (work in progress), June 2017. 1368 [I-D.ietf-pce-pce-initiated-lsp] 1369 Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "PCEP 1370 Extensions for PCE-initiated LSP Setup in a Stateful PCE 1371 Model", draft-ietf-pce-pce-initiated-lsp-10 (work in 1372 progress), June 2017. 1374 [I-D.ietf-pce-segment-routing] 1375 Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., 1376 and J. Hardwick, "PCEP Extensions for Segment Routing", 1377 draft-ietf-pce-segment-routing-09 (work in progress), 1378 April 2017. 1380 [I-D.ietf-pce-stateful-pce] 1381 Crabbe, E., Minei, I., Medved, J., and R. Varga, "PCEP 1382 Extensions for Stateful PCE", draft-ietf-pce-stateful- 1383 pce-21 (work in progress), June 2017. 1385 [I-D.ietf-spring-segment-routing] 1386 Filsfils, C., Previdi, S., Decraene, B., Litkowski, S., 1387 and R. Shakir, "Segment Routing Architecture", draft-ietf- 1388 spring-segment-routing-12 (work in progress), June 2017. 1390 [I-D.ietf-spring-segment-routing-mpls] 1391 Filsfils, C., Previdi, S., Bashandy, A., Decraene, B., 1392 Litkowski, S., and R. Shakir, "Segment Routing with MPLS 1393 data plane", draft-ietf-spring-segment-routing-mpls-10 1394 (work in progress), June 2017. 1396 [I-D.previdi-idr-segment-routing-te-policy] 1397 Previdi, S., Filsfils, C., Mattes, P., Rosen, E., and S. 1398 Lin, "Advertising Segment Routing Policies in BGP", draft- 1399 previdi-idr-segment-routing-te-policy-07 (work in 1400 progress), June 2017. 1402 [I-D.sivabalan-pce-binding-label-sid] 1403 Sivabalan, S., Filsfils, C., Previdi, S., Tantsura, J., 1404 Hardwick, J., and M. Nanduri, "Carrying Binding Label/ 1405 Segment-ID in PCE-based Networks.", draft-sivabalan-pce- 1406 binding-label-sid-02 (work in progress), October 2016. 1408 [RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended 1409 Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, 1410 February 2006, . 1412 [RFC5152] Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A 1413 Per-Domain Path Computation Method for Establishing Inter- 1414 Domain Traffic Engineering (TE) Label Switched Paths 1415 (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February 2008, 1416 . 1418 [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation 1419 Element (PCE) Communication Protocol (PCEP)", RFC 5440, 1420 DOI 10.17487/RFC5440, March 2009, 1421 . 1423 [RFC5441] Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux, 1424 "A Backward-Recursive PCE-Based Computation (BRPC) 1425 Procedure to Compute Shortest Constrained Inter-Domain 1426 Traffic Engineering Label Switched Paths", RFC 5441, 1427 DOI 10.17487/RFC5441, April 2009, 1428 . 1430 [RFC5520] Bradford, R., Ed., Vasseur, JP., and A. Farrel, 1431 "Preserving Topology Confidentiality in Inter-Domain Path 1432 Computation Using a Path-Key-Based Mechanism", RFC 5520, 1433 DOI 10.17487/RFC5520, April 2009, 1434 . 1436 [RFC6805] King, D., Ed. and A. Farrel, Ed., "The Application of the 1437 Path Computation Element Architecture to the Determination 1438 of a Sequence of Domains in MPLS and GMPLS", RFC 6805, 1439 DOI 10.17487/RFC6805, November 2012, 1440 . 1442 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 1443 S. Ray, "North-Bound Distribution of Link-State and 1444 Traffic Engineering (TE) Information Using BGP", RFC 7752, 1445 DOI 10.17487/RFC7752, March 2016, 1446 . 
[RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., Litkowski, S., Horneffer, M., and R. Shakir, "Source Packet Routing in Networking (SPRING) Problem Statement and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 2016, <https://www.rfc-editor.org/info/rfc7855>.

[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016, <https://www.rfc-editor.org/info/rfc7911>.

[RFC7926] Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G., Ceccarelli, D., and X. Zhang, "Problem Statement and Architecture for Information Exchange between Interconnected Traffic-Engineered Networks", BCP 206, RFC 7926, DOI 10.17487/RFC7926, July 2016, <https://www.rfc-editor.org/info/rfc7926>.

Authors' Addresses

Adrian Farrel
Juniper Networks

Email: afarrel@juniper.net

John Drake
Juniper Networks

Email: jdrake@juniper.net