SPRING Working Group                                           A. Farrel
Internet-Draft                                         Old Dog Consulting
Intended status: Informational                                  J. Drake
Expires: November 20, 2021                               Juniper Networks
                                                             May 19, 2021


     Interconnection of Segment Routing Sites - Problem Statement and
                           Solution Landscape
              draft-farrel-spring-sr-domain-interconnect-06

Abstract

   Segment Routing (SR) is a forwarding paradigm for use in MPLS and
   IPv6 networks.  It is intended to be deployed in discrete sites that
   may be data centers, access networks, or other networks that are
   under the control of a single operator and that can easily be
   upgraded to support this new technology.

   Traffic originating in one SR site often terminates in another SR
   site, but must transit a backbone network that provides
   interconnection between those sites.

   This document describes a mechanism for providing connectivity
   between SR sites to enable end-to-end or site-to-site traffic
   engineering.

   The approach described allows connectivity between SR sites,
   utilizes traffic engineering mechanisms (such as RSVP-TE or Segment
   Routing) across the backbone network, makes heavy use of
   pre-existing technologies, and requires the specification of very
   few additional mechanisms.

   This document provides some background and a problem statement,
   explains the solution mechanism, gives references to other documents
   that define protocol mechanisms, and provides examples.  It does not
   define any new protocol mechanisms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 20, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Problem Statement
   3.  Solution Technologies
     3.1.  Characteristics of Solution Technologies
   4.  Decomposing the Problem
   5.  Solution Space
     5.1.  Global Optimization of the Paths
     5.2.  Figuring Out the GWs at a Destination Site for a Given
           Prefix
     5.3.  Figuring Out the Backbone Egress ASBRs
     5.4.  Making use of RSVP-TE LSPs Across the Backbone
     5.5.  Data Plane
     5.6.  Centralized and Distributed Controllers
   6.  BGP-LS Considerations
   7.  Worked Examples
   8.  Label Stack Depth Considerations
     8.1.  Worked Example
   9.  Gateway Considerations
     9.1.  Site Gateway Auto-Discovery
     9.2.  Relationship to BGP Link State and Egress Peer Engineering
     9.3.  Advertising a Site Route Externally
     9.4.  Encapsulations
   10. Security Considerations
   11. Management Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. Informative References
   Authors' Addresses

1.  Introduction

   Data Centers are a growing market sector.  They are being set up by
   new specialist companies, by enterprises for their own use, by
   legacy ISPs, and by the new wave of network operators.  The networks
   inside Data Centers are currently well-planned, but the traffic
   loads can be unpredictable.  There is a need to be able to direct
   traffic within a Data Center to follow a specific path.

   Data Centers are attached to external ("backbone") networks to allow
   access by users and to facilitate communication among Data Centers.
   An individual Data Center may be attached to multiple backbone
   networks, and may have multiple points of attachment to each
   backbone network.  Traffic to or from a Data Center may need to be
   directed to or from any of these points of attachment.

   Segment Routing (SR) is a technology that places forwarding state
   into each packet as a stack of loose hops.  SR is an option for
   building Data Centers, and is also seeing increasing traction in
   edge and access networks as well as in backbone networks.  It is
   typically deployed in discrete sites that are under the control of a
   single operator and that can easily be upgraded to support this new
   technology.

   Traffic originating in one SR site often terminates in another SR
   site, but must transit a backbone network that provides
   interconnection between those sites.  This document describes an
   approach that builds on existing technologies to produce mechanisms
   that provide scalable and flexible interconnection of SR sites, and
   that will be easy to operate.

   The approach described allows end-to-end connectivity between SR
   sites across an MPLS backbone network, utilizes traffic engineering
   mechanisms (such as RSVP-TE or Segment Routing) across the backbone
   network, makes heavy use of pre-existing technologies, and requires
   the specification of very few additional mechanisms.

   This document provides some background and a problem statement,
   explains the solution mechanism, gives references to other documents
   that define protocol mechanisms, and provides examples.  It does not
   define any new protocol mechanisms.

1.1.  Terminology

   This document uses Segment Routing terminology from [RFC7855] and
   [RFC8402].  Particular abbreviations of note are:

   o  SID: a segment identifier

   o  SRGB: an SR Global Block

   In the context of this document, the terms "optimal" and
   "optimality" refer to making the best possible use of network
   resources, and achieving network paths that best meet the objectives
   of the network operators and customers.

   Further terms are defined in Section 2.

2.  Problem Statement

   Consider the network in Figure 1.  Without loss of generality, this
   figure can be used to represent the architecture and problem space
   for steering traffic within and between SR edge sites.  The figure
   shows a single destination for all traffic that we will consider.

   In describing the problem space and the solution we use six terms as
   follows:

   SR domain : This term is defined in [RFC8402].  It is the collection
      of all interconnected SR-capable network nodes that may be
      colocated in a site, distributed across multiple sites, present
      in SR-capable backbone networks, or located at key points within
      the backbone network.

   SR site : In this document, an SR site is a collection of SR-capable
      nodes under the control of a single administrator or protocol
      instance.  Each SR site is attached to the backbone network
      through one or more gateways.  Examples include access networks,
      Data Center sites, and backbone networks that run SR.

   Host : A node within an SR site.  It may be an end system or a
      transit node in the SR site.

   Gateway (GW) : Provides access to or from an SR site.
      Examples are Customer Edge nodes (CEs), Autonomous System Border
      Routers (ASBRs), and Data Center gateways.

   Provider Edge (PE) : Provides access to or from the backbone
      network.

   Autonomous System Border Router (ASBR) : Provides access to one
      Autonomous System (AS) in the backbone network from another AS in
      the backbone network.

   These terms can be seen in use in Figure 1, where the various
   sources and the destination are hosts.  In this figure we
   distinguish between the PEs that provide access to the backbone
   network, and the Gateways that provide access to the SR sites: these
   may, in fact, be the same equipment and the PEs might be located at
   the site edges.

   -------------------------------------------------------------------
   |                                                                 |
   | AS1                                                             |
   |  ----    ----                                     ----    ----  |
   -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|-
     ----    ----                                       ----    ----
      :       :        ------------ ------------         :       :
      :       :        | AS2      | |      AS3 |         :       :
      :       :        |  ------  | |  ------  |         :       :
      :       :        | |ASBR2a|.....|ASBR3a| |         :       :
      :       :        |  ------  | |  ------  |         :       :
      :       :        |    |     | |     |    |         :       :
      :       :        |  ------  | |  ------  |         :       :
      :       :        | |ASBR2b|.....|ASBR3b| |         :       :
      :       :        |  ------  | |  ------  |         :       :
      :       :        |    |     | |     |    |         :       :
      :  ......:       |  ----    | |    ----  |         :       :
      :  :             -|PE2a|-----  -----|PE3a|-        :       :
      :  :               ----              ----          :       :
      :  :      ......:                       :.......   :       :
      :  :      :                                    :   :       :
     ----    ----                                   ----    ----
    -|GW1a|--|GW1b|-                               -|GW2a|--|GW2b|-
    | ----    ---- |                               | ----    ---- |
    |              |                               |              |
    |              |                               |  Source3     |
    |  Source2     |                               |              |
    |              |                               |  Source4     |
    |  Source1     |                               |              |
    |              |                               |  Destination |
    |              |                               |              |
    |              |                               |              |
    | Site1        |                               | Site2        |
     --------------                                 --------------

         Figure 1: Reference Architecture for SR Site Interconnect

   Traffic to the destination may originate from multiple sources
   within the destination site (we show two such sources: Source3 and
   Source4).  Furthermore, traffic intended for the destination may
   arrive from outside the site through any of the points of attachment
   to the backbone networks (we show GW2a and GW2b).  This traffic may
   need to be steered within the site to achieve load-balancing across
   network resources, to avoid degraded or out-of-service resources
   (including planned service outages), and to achieve different
   qualities of service.  Of course, traffic in a remote source site
   may also need to be steered within that site.  We class this problem
   as "Intra-Site Traffic Steering".

   Traffic across the backbone networks may need to be steered to
   conform to common Traffic Engineering (TE) paradigms.  That is, the
   path across any network (shown in the figure as an AS) or across any
   collection of networks may need to be chosen and may be different
   from the shortest path first (SPF) routing that would occur without
   TE.  Furthermore, the points of interconnection between networks may
   need to be selected, and this choice influences the path taken by
   the data.  We class this problem as "Inter-Site Traffic Steering".

   The composite end-to-end path comprises steering in the source site,
   choice of source site exit point, steering across the backbone
   networks, choice of network interconnections, choice of destination
   site entry point, and steering in the destination site.  These
   issues may be inter-dependent (for example, the best traffic
   steering in the source site may help select the best exit point from
   that site, but the connectivity options across the backbone network
   may drive the selection of a different exit point).
   We class this combination of problems as "End-to-End Site
   Interconnect Traffic Steering".

   It should be noted that the solution to the End-to-End Site
   Interconnect Traffic Steering problem depends on a number of
   factors:

   o  What technology is deployed in the site.

   o  What technology is deployed in the backbone networks.

   o  How much information the sites are willing to share with each
      other.

   o  How much information the backbone network operators and the site
      operators are willing to share.

   In some cases, the sites and backbone networks are all owned and
   operated by the same company (with the backbone network often being
   a private network).  In other cases, the sites are operated by one
   company, with other companies operating the backbone.

3.  Solution Technologies

   Segment Routing (SR, from the SPRING working group in the IETF
   [RFC7855] and [RFC8402]) introduces traffic steering capabilities
   into an MPLS network [RFC8660] by utilizing existing data plane
   capabilities (label pop and packet forwarding - "pop and go") in
   combination with additions to existing IGPs ([RFC8665] and
   [RFC8667]), BGP (as BGP-LU) [RFC8277], or a centralized controller
   to distribute "per-hop" labels.  An MPLS label stack can be imposed
   on a packet to describe a sequence of links/nodes to be transited by
   the packet; as each hop is transited, the label that represents it
   is popped from the stack and the packet is forwarded.  Thus, on a
   packet-by-packet basis, traffic can be steered within the SR domain.

   This document broadens the problem space to consider interconnection
   of any type of site.  These may be Data Center sites, but they may
   equally be access networks, VPN sites, or any other form of domain
   that includes packet sources and destinations.  We particularly
   focus on "SR sites" being source or destination sites that utilize
   MPLS SR, but the sites could use other non-MPLS technologies (such
   as IP, VXLAN, and NVGRE) as described in Section 9.

   Backbone networks are commonly based on MPLS-capable hardware.  In
   these networks, a number of different options exist to establish TE
   paths.  Among these options are static Label Switched Paths (LSPs),
   perhaps set up by an SDN controller, LSP tunnels established using a
   signaling protocol (such as RSVP-TE), and inter-site use of SR (as
   described above for intra-site steering).  Where traffic steering
   (without resource reservation) is needed, SR may be adequate; where
   Traffic Engineering is needed (i.e., traffic steering with resource
   reservation), RSVP-TE or centralized SDN control are preferred.
   However, in a network that is fully managed and controlled through a
   centralized planning tool, resource reservation can be achieved and
   SR can be used for full Traffic Engineering.  These solutions are
   already used in support of a number of edge-to-edge services such as
   L3VPN and L2VPN.
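
   The "pop and go" behavior described above can be illustrated with a
   short Python sketch.  This sketch is illustrative only; the node
   names, label values, and table structure are invented for the
   example and form no part of any specification.

      # Minimal sketch of SR "pop and go" forwarding.  Each node pops
      # the top label and forwards on the link that label identifies.

      def forward(labels, node, link_table):
          # link_table[node][label] gives the neighbor reached over
          # the link that the label identifies at this node.
          while labels and labels[0] in link_table[node]:
              top = labels.pop(0)            # pop ...
              node = link_table[node][top]   # ... and go
          return node, labels

      # A packet carrying the stack [101, 205] is steered A->B->C:
      links = {"A": {101: "B"}, "B": {205: "C"}, "C": {}}
      print(forward([101, 205], "A", links))   # -> ('C', [])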

3.1.  Characteristics of Solution Technologies

   Each of the solution technologies mentioned in the previous section
   has certain characteristics, and the combined solution needs to
   recognize and address these characteristics in order to make a
   workable solution.

   o  When SR is used for traffic steering, the size of the MPLS label
      stack used in SR scales linearly with the length of the strict
      source route.  This can cause issues with MPLS implementations
      that only support label stacks of a limited size.  For example,
      some MPLS implementations cannot push enough labels on the stack
      to represent an entire source route.  Other implementations may
      be unable to do the proper "ECMP hashing" if the label stack is
      too long; they may be unable to read enough of the packet header
      to find an entropy label or to find the IP header of the payload.
      Increasing the packet header size also reduces the size of the
      payload that can be carried in an MPLS packet.  There are
      techniques that can be used to reduce the size of the label
      stack.  For example, a source route may be made less specific
      through the use of loose hops requiring fewer labels, or a single
      label (known as a "binding SID") can be used to represent a
      sequence of nodes; this label can be replaced with a set of
      labels when the packet reaches the first node in the sequence.
      It is also possible to combine SR with conventional RSVP-TE by
      using a binding SID in the label stack to represent an LSP tunnel
      set up by RSVP-TE.

   o  Most of the work on using SR for traffic steering assumes that
      traffic only needs to be steered within a single administrative
      domain.  If the backbone consists of multiple ASes that are not
      part of a common administrative domain, the use of SR across the
      backbone may prove to be a challenge, and its use in the backbone
      may be limited to cases where private networks connect the sites,
      rather than cases where the sites are connected by third-party
      network operators or by the public Internet.

   o  RSVP-TE has been used to provide edge-to-edge tunnels through
      which flows to/from many endpoints can be routed, and this
      provides a reduction in state while still offering Traffic
      Engineering across the backbone network.  However, a full mesh of
      edge-to-edge tunnels requires O(n^2) connections, and as the
      number of sites increases this becomes unsustainable.

   o  A centralized control system is capable of producing more
      efficient use of network resources and of allowing better
      coordination of network usage and of network diagnostics.
      However, such a system may present challenges in large and
      dynamic networks because it relies on all network state being
      held centrally, and it is difficult to make central control as
      robust and self-correcting as distributed control.

   This document introduces an approach that blends the best points of
   each of these solution technologies to achieve a trade-off where
   RSVP-TE tunnels in the backbone network are stitched together using
   SR, and end-to-end SR paths can be created under the control of a
   central controller with routing devolved to the constituent networks
   where possible.

4.  Decomposing the Problem

   It is important to decompose the problem to take account of the
   different regions spanned by the end-to-end path.  These regions may
   use different technologies and may be under different administrative
   control.  The separation of administrative control is particularly
   important because the operator of one region may be unwilling to
   share information about their networks, and may be resistant to
   allowing a third party to exert control over their network
   resources.

   Using the reference model in Figure 1, we can consider how to get a
   packet from Source1 to the Destination.  The following decisions
   must be made:

   o  In which site Destination lies.

   o  Which exit point from Site1 to use.

   o  Which entry point to Site2 to use.

   o  How to reach the exit point of Site1 from Source1.

   o  How to reach the entry point to Site2 from the exit point of
      Site1.

   o  How to reach Destination from the entry point to Site2.

   As already mentioned, these decisions may be inter-related.  This
   allows us to break down the problem into three steps:

   1.  Get the packet from Source1 to the exit point of Site1.

   2.  Get the packet from the exit point of Site1 to the entry point
       of Site2.

   3.  Get the packet from the entry point of Site2 to Destination.

   The solution needs to achieve this in a way that allows:

   o  Adequate discovery of preferred elements in the end-to-end path
      (such as the location of the destination, and the selection of
      the destination site entry point).

   o  Full control of the end-to-end path if all of the operators are
      willing.

   o  Re-use of existing techniques and technologies.

   From a technology point of view we must support several functions
   and mixtures of those functions:

   o  If a site uses MPLS Segment Routing, the labels within the site
      may be populated by any means including BGP-LU [RFC8277], IGP
      [RFC8667] [RFC8665], and central control.  Source routes within
      the site may be expressed as label stacks pushed by a controller
      or computed by a source router, or expressed as a single label
      and programmed into the site routers by a controller.

   o  If a site uses other (non-MPLS) forwarding, the site processing
      is specific to that technology.  See Section 9 for details.

   o  If the sites use Segment Routing, the prefix-SIDs for the source
      and destination may be the same or different.

   o  The backbone network may be a single private network under the
      control of the owner of the sites and comprising one or more
      ASes, or may be a network operated by one or more third parties.

   o  The backbone network may utilize MPLS Traffic Engineering tunnels
      in conjunction with MPLS Segment Routing, and the site-to-site
      source route may be provided by stitching TE LSPs.

   o  A single controller may be used to handle the source and
      destination sites as well as the backbone network, or there may
      be a controller for the backbone network separate from the one
      that controls the two sites, or there may be separate controllers
      for each network.  The controllers may cooperate and share
      information to different degrees.

   All of these different decompositions of the problem reflect
   different deployment choices and different commercial and
   operational practices, each with different functional trade-offs.
   For example, with separate controllers that do not share information
   and that only cooperate to a limited extent, it will be possible to
   achieve end-to-end connectivity with optimal routing at each step
   (site or backbone AS), but the end-to-end path that is achieved
   might not be optimal.

5.  Solution Space

5.1.  Global Optimization of the Paths

   Global optimization of the path from one site to another requires
   either that the source controller has a complete view of the
   end-to-end topology or some form of cooperation between controllers
   (such as in Backward Recursive Path Computation (BRPC) [RFC5441]).

   BGP-LS [RFC7752] can be used to provide the "source" controller with
   a view of the topology of the backbone: that topology may be
   abstracted or partial.  This requires some of the BGP speakers in
   each AS to have BGP-LS sessions to the controller.  Other means of
   obtaining this view of the topology are of course possible.
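
   As an illustration of how a controller might use such a topology
   view, the following Python sketch runs a simple shortest-path
   computation over a set of abstracted link records of the kind that
   BGP-LS could supply.  The node names are taken from the examples
   later in this document, but the metrics are invented; a real
   controller would apply TE constraints and policy in addition to
   simple metrics.

      import heapq

      # Sketch: compute a path over an abstracted backbone topology
      # learned, for example, via BGP-LS.  Links are (src, dst,
      # metric) records; metric values are invented for illustration.

      def shortest_path(links, src, dst):
          adj = {}
          for a, b, cost in links:
              adj.setdefault(a, []).append((b, cost))
          heap, seen = [(0, src, [src])], set()
          while heap:
              cost, node, path = heapq.heappop(heap)
              if node == dst:
                  return cost, path
              if node in seen:
                  continue
              seen.add(node)
              for nxt, c in adj.get(node, []):
                  heapq.heappush(heap, (cost + c, nxt, path + [nxt]))
          return None

      topology = [("GW1a", "PE1a", 10), ("PE1a", "PE1c", 30),
                  ("PE1c", "GW3b", 10), ("GW1a", "PE1b", 10),
                  ("PE1b", "PE1d", 50), ("PE1d", "GW3a", 10)]
      print(shortest_path(topology, "GW1a", "GW3b"))
      # -> (50, ['GW1a', 'PE1a', 'PE1c', 'GW3b'])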

5.2.  Figuring Out the GWs at a Destination Site for a Given Prefix

   Suppose GW2a and GW2b both advertise a route to prefix X, each
   setting itself as next hop.  One might think that the GWs for X
   could be inferred from the routes' next hop fields, but typically
   only the "best" route (as selected by BGP) gets distributed across
   the backbone: the other route is discarded.  But the best route
   according to the BGP selection process might not be the route via
   the GW that we want to use for traffic engineering purposes.

   The obvious solution would be to use the ADD-PATH mechanism
   [RFC7911] to ensure that all routes to X get advertised.  However,
   even if one does this, the identity of the GWs would get lost as
   soon as the routes got distributed through an ASBR that sets next
   hop self.  And if there are multiple ASes in the backbone, not only
   will the next hop change several times, but the ADD-PATH mechanism
   will experience scaling issues.  So this "obvious" solution only
   works within a single AS.

   A better solution can be achieved using the Tunnel Encapsulation
   attribute [RFC9012] as follows.

   We define a new tunnel type, "SR tunnel", and when the GWs to a
   given site advertise a route to a prefix X within the site, they
   each include a Tunnel Encapsulation attribute with multiple remote
   endpoint sub-TLVs, each of which identifies a specific GW to the
   site.

   In other words, each route advertised by any GW identifies all of
   the GWs to the same site (see Section 9 for a discussion of how GWs
   discover each other).  Therefore, only one of the routes needs to be
   distributed to other ASes, and it doesn't matter how many times the
   next hop changes: the Tunnel Encapsulation attribute (and its remote
   endpoint sub-TLVs) remains unchanged and discloses the full list of
   GWs to the site.

   Further, when a packet destined for prefix X is sent on a TE path to
   GW2a, we want the packet to arrive at GW2a carrying, at the top of
   its label stack, GW2a's label for prefix X.  To achieve this we
   place the SID/SRGB in a sub-TLV of the Tunnel Encapsulation
   attribute.  We define the prefix-SID sub-TLV to be essentially
   identical in syntax to the prefix-SID attribute (see [RFC8669]), but
   the semantics are somewhat different.

   We also define an "MPLS Label Stack" sub-TLV for the Tunnel
   Encapsulation attribute, and put this in the "SR tunnel" TLV.  This
   allows the destination GW to specify a label stack that it wants
   packets destined for prefix X to have.  This label stack represents
   a source route through the destination site.
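
   The following Python sketch gives a schematic model (not the
   on-the-wire encoding defined in [RFC9012]) of a route that carries
   one "SR tunnel" instance per GW.  The prefix, GW names, and SID
   value are invented for the example; the point being illustrated is
   that rewriting the next hop leaves the Tunnel Encapsulation
   attribute, and hence the GW list, intact.

      from dataclasses import dataclass, field
      from typing import List, Optional

      @dataclass
      class SRTunnelInstance:           # one instance per GW of the site
          remote_endpoint: str          # Remote Endpoint sub-TLV: GW address
          prefix_sid: Optional[int] = None   # prefix-SID sub-TLV for X
          mpls_label_stack: List[int] = field(default_factory=list)

      @dataclass
      class Route:
          prefix: str
          next_hop: str                 # rewritten by ASBRs (next hop self)
          tunnel_encap: List[SRTunnelInstance]   # carried unchanged

      route = Route("192.0.2.0/24", "GW2a",
                    [SRTunnelInstance("GW2a", prefix_sid=305),
                     SRTunnelInstance("GW2b", prefix_sid=305)])

      route.next_hop = "ASBR3a"   # next hop changes in the backbone ...
      print([t.remote_endpoint for t in route.tunnel_encap])
      # -> ['GW2a', 'GW2b']      # ... but the full GW list is intact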

5.3.  Figuring Out the Backbone Egress ASBRs

   We need to figure out the backbone egress ASBRs that are attached to
   a given GW at the destination site in order to properly engineer the
   path across the backbone.

   The "cleanest" way to do this is to have the backbone egress ASBRs
   distribute the information to the source controller using the egress
   peer engineering (EPE) extensions of BGP-LS
   [I-D.ietf-idr-bgpls-segment-routing-epe].  The EPE extensions to
   BGP-LS allow a BGP speaker to say, "Here is a list of my EBGP
   neighbors, and here is a (locally significant) adjacency-SID for
   each one."

   It may also be possible to utilize cooperating PCEs or the
   Hierarchical PCE approach of [RFC6805].  But it should be observed
   that this question is dependent on the questions in Section 5.2.
   That is, it is not possible to even start the selection of egress
   ASBRs until it is known which GWs at the destination site provide
   access to a given prefix.  Once that question has been answered, any
   number of PCE approaches can be used to select the right egress ASBR
   and, more generally, the ASBR path across the backbone.

5.4.  Making use of RSVP-TE LSPs Across the Backbone

   There are a number of ways to carry traffic across the backbone from
   one site to another.  RSVP-TE is a popular mechanism for
   establishing tunnels across MPLS networks in similar scenarios
   (e.g., L3VPN) because it allows for reservation of resources as well
   as traffic steering.

   A controller can cause an RSVP-TE LSP to be set up by talking to the
   LSP head end using the PCEP extensions described in [RFC8281].  That
   document specifies an "LSP Initiate" message (the PCInitiate
   message) that the controller uses to specify the RSVP-TE LSP
   endpoints, the explicit path, a "symbolic pathname", and other
   optional attributes (specified in the PCEP specification [RFC5440])
   such as bandwidth.

   When the head end receives a PCInitiate message, it sets up the
   RSVP-TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to
   the controller in a PCRpt message [RFC8231].  The PCRpt message also
   contains the symbolic name that the controller assigned to the LSP,
   as well as some information identifying the LSP Initiate message
   from the controller, and details of exactly how the LSP was set up
   (RRO, bandwidth, etc.).

   The head end can add a TE-PATH-BINDING TLV to the PCRpt message
   [I-D.ietf-pce-binding-label-sid].  This allows the head end to
   assign a "binding SID" to the LSP, and to report to the controller
   that a particular binding SID corresponds to a particular LSP.  The
   binding SID is locally scoped to the head end.

   The controller can make this label part of the label stack that it
   tells the source (or the GW at the source site) to impose on the
   data packets being sent to prefix X.  When the head end receives a
   packet with this label at the top of the stack it will send the
   packet onward on the LSP.
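
   The exchange described in this section can be sketched as follows.
   The Python below abstracts the message contents and is not the PCEP
   object encoding of [RFC8281] or [RFC5440]; the node names, PLSP-id
   allocation, and binding SID values are invented for the example.

      # Schematic of the controller/head-end exchange: PCInitiate in,
      # PCRpt (with PLSP-id and binding SID) back to the controller.

      class HeadEnd:
          def __init__(self):
              self.lsps = {}           # plsp_id -> (endpoints, ero)
              self.binding_sids = {}   # binding SID -> plsp_id (local)
              self._next_plsp, self._next_sid = 1, 5000

          def pcinitiate(self, symbolic_name, endpoints, ero):
              """Set up the RSVP-TE LSP and report back a PLSP-id and
              a binding SID (TE-PATH-BINDING TLV)."""
              plsp_id = self._next_plsp; self._next_plsp += 1
              self.lsps[plsp_id] = (endpoints, ero)
              sid = self._next_sid; self._next_sid += 1
              self.binding_sids[sid] = plsp_id
              return {"PCRpt": {"symbolic-name": symbolic_name,
                                "PLSP-id": plsp_id, "binding-SID": sid}}

      pe2a = HeadEnd()
      report = pe2a.pcinitiate("lsp-pe2a-asbr2a", ("PE2a", "ASBR2a"),
                               ero=["hop1", "hop2"])
      # The controller can now place report["PCRpt"]["binding-SID"] in
      # the label stack it tells the source to impose; PE2a will send
      # matching packets onward on the LSP.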

5.5.  Data Plane

   Consolidating all of the above, consider what happens when we want
   to move a data packet from Source1 to Destination in Figure 1 via
   the following source route:

   Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination

   Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a and
   an RSVP-TE LSP from ASBR3a to PE3a, both of which we want to use.

   Let's suppose that the source pushes a label stack as instructed by
   the controller (for example, using BGP-LU [RFC8277]).  We won't
   worry for now about source routing through the sites themselves:
   that is, in practice there may be additional labels in the stack to
   cover the source route from Source1 to GW1b and from GW2a to the
   Destination, but we will focus only on the labels necessary to leave
   the source site, traverse the backbone, and enter the egress site.
   So we only care what the stack looks like when the packet gets to
   GW1b.

   When the packet gets to GW1b, the stack should have six labels:

   Top Label:

      Peer-SID or adjacency-SID identifying the link or links to PE2a.
      These SIDs are distributed from GW1b to the controller via the
      EPE extensions of BGP-LS.  This label will get popped by GW1b,
      which will then send the packet to PE2a.

   Second Label:

      Binding SID advertised by PE2a to the controller for the RSVP-TE
      LSP to ASBR2a.  This binding SID is advertised via the PCEP
      extensions discussed above.  This label will get swapped by PE2a
      for the label that the LSP's next hop has assigned to the LSP.

   Third Label:

      Peer-SID or adjacency-SID identifying the link or links to
      ASBR3a, as advertised to the controller by ASBR2a using the
      BGP-LS EPE extensions.  This label gets popped by ASBR2a, which
      then sends the packet to ASBR3a.

   Fourth Label:

      Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a.
      This binding SID is advertised via the PCEP extensions discussed
      above.  ASBR3a treats this label just like PE2a treated the
      second label above.

   Fifth Label:

      Peer-SID or adjacency-SID identifying the link or links to GW2a,
      as advertised to the controller by ASBR3a using the BGP-LS EPE
      extensions.  ASBR3a pops this label and sends the packet to GW2a.

   Sixth Label:

      Prefix-SID or other label identifying the Destination, advertised
      in a Tunnel Encapsulation attribute by GW2a.  This can be omitted
      if GW2a is happy to accept IP packets, or prefers a VXLAN tunnel
      for example.  That would be indicated through the Tunnel
      Encapsulation attribute, of course.

   Note that the size of the label stack is proportional to the number
   of RSVP-TE LSPs that get stitched together by SR.

   See Section 7 for some detailed examples that show the concrete use
   of labels in a sample topology.

   In the above example, all labels except the sixth are locally
   significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs.
   Only the sixth label, a prefix-SID, has a value that is unique
   across the whole SR domain.  To impose that label, the source needs
   to know the SRGB of GW2a.  If all nodes have the same SRGB, this is
   not a problem.  Otherwise, there are a number of different ways GW2a
   can advertise its SRGB.  This can be done via the segment routing
   extensions of BGP-LS, or it can be done using the prefix-SID
   attribute of BGP-LU [RFC8277], or it can be done using the BGP
   Tunnel Encapsulation attribute.  The technique to be used will
   depend on the details of the deployment scenario.

   The reason the above example is primarily based on locally
   significant labels is that it creates a "strict source route", and
   it presupposes the EPE extensions of BGP-LS.  In some scenarios, the
   EPE extension to BGP-LS might not be available (or BGP-LS might not
   be available at all).  In other scenarios, it may be desirable to
   steer a packet through a "loose source route".  In such scenarios,
   the label stack imposed by the source will be based upon a sequence
   of "node-SIDs" that are unique across the whole SR domain, where
   each represents one of the hops of the source route.  Each label has
   to be computed by adding the corresponding node-SID to the SRGB of
   the node that will act upon the label.  One way to learn the
   node-SIDs and SRGBs is to use the segment routing extensions of
   BGP-LS.  Another way is to use BGP-LU as follows:

      Each node that may be part of a source route originates a BGP-LU
      route with one of its own loopback addresses as the prefix.  The
      BGP prefix-SID attribute is attached to this route.  The
      prefix-SID attribute contains the SID corresponding to the node's
      loopback address, a SID that is unique across the whole SR
      domain.  The attribute also contains the node's SRGB.

   While this technique is useful when BGP-LS is not available, there
   needs to be some other means for the source controller to discover
   the topology.  In this document, we focus primarily on the scenario
   where BGP-LS, rather than BGP-LU, is used.
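
   The arithmetic for a loose source route can be illustrated as
   follows: each stack entry is computed as the SRGB base of the node
   that will act on the label plus the node-SID index of the next hop.
   All of the SRGB bases and SID indexes in this Python sketch are
   invented for the example.

      # Sketch of building a loose source route from node-SIDs and
      # SRGBs: label = SRGB base of the acting node + node-SID index.

      srgb_base = {"GW1b": 16000, "PE2a": 20000,
                   "PE3a": 24000, "GW2a": 28000}
      node_sid = {"PE2a": 101, "PE3a": 102,
                  "GW2a": 103, "Destination": 305}

      def loose_stack(hops):
          """hops: list of (node_that_acts_on_label, sid_index)."""
          return [srgb_base[node] + sid for node, sid in hops]

      # A loose route GW1b -> PE2a -> PE3a -> GW2a -> Destination:
      stack = loose_stack([("GW1b", node_sid["PE2a"]),
                           ("PE2a", node_sid["PE3a"]),
                           ("PE3a", node_sid["GW2a"]),
                           ("GW2a", node_sid["Destination"])])
      print(stack)   # -> [16101, 20102, 24103, 28305]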

5.6.  Centralized and Distributed Controllers

   A controller or set of controllers is needed to collate topology and
   TE information from the constituent networks, to apply policies and
   service requirements to compute paths across those networks, to
   select an end-to-end path, and to program key nodes in the network
   to take the right forwarding actions (pushing label stacks,
   stitching LSPs, forwarding traffic).

   o  It is commonly understood that a fully optimal end-to-end path
      can only be computed with full knowledge of the end-to-end
      topology and available Traffic Engineering resources.  Thus, one
      option is for all information about the site networks and
      backbone network to be collected by a central controller that
      makes all path computations and is responsible for issuing the
      necessary programming commands.  Such a model works best when
      there is no commercial or administrative impediment (for example,
      where the sites and the backbone network are owned and operated
      by the same organization).  There may, however, be some scaling
      concerns if the component networks are large.

      In this mode of operation, each network may use BGP-LS to export
      Traffic Engineering and topology information to the central
      controller, and the controller may use PCEP to program the
      network behavior.

   o  A similar centralized control mechanism can be used with a
      scalability improvement that risks a reduction in optimality.  In
      this case, the site networks can export to the controller just
      the feasibility of connectivity between data source/sink and
      gateway, perhaps enhancing this with some information about the
      Traffic Engineering metrics of the potential paths.

      This approach allows the central controller to understand the
      end-to-end path that it is selecting, but not to control it
      fully.  The source route from data source to site egress gateway
      is left to the source host or a controller in the source site,
      while the source route from site ingress gateway to destination
      is left as a decision for the site ingress gateway or for a
      controller in the destination site; in both cases the traffic may
      be left to follow the IGP shortest path.

      This mode of operation still leaves overall control with a
      centralized server, and that may not be considered suitable when
      there is separate commercial or administrative control of the
      networks.

   o  When there is separate commercial or administrative control of
      the networks, the site operator will not want the backbone
      operator to have control of the paths within the sites and may be
      reluctant to disclose any information about the topology or
      resource availability within the sites.  Conversely, the backbone
      operator may be very unwilling to allow the site operator (a
      customer) any control over or knowledge about the backbone
      network.

      This "problem" has already been solved for Traffic Engineering in
      MPLS networks that span multiple administrative domains, and that
      work suggests several potential solutions:

      *  Per-domain path computation [RFC5152] can be seen as "best
         effort optimization".  In this mode the controller for each
         domain is responsible for finding the best path to the next
         domain, but has no way of knowing which is the best exit point
         from the local domain.  The resulting path may end up
         significantly sub-optimal or even blocked.

      *  Backward recursive path computation (BRPC) [RFC5441] is a
         mechanism that allows controllers to cooperate across a small
         set of domains (such as ASes) to build a tree of possible
         paths and so allow the controller for the ingress domain to
         select the optimal path.  The details of the paths within each
         domain that might reveal confidential information can be
         hidden using Path Keys [RFC5520].  BRPC produces optimal
         paths, but scales poorly with an increase in domains and with
         an increase in connectivity between domains.  It can also lead
         to slow computation times.

      *  Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation
         process between PCEs.  The child PCEs remain responsible for
         computing paths across their domains, and they coordinate with
         a parent PCE that stitches these paths together to form the
         end-to-end path.  This approach has many similarities with
         BRPC but can scale better through the maintenance of a "domain
         topology" that shows how the domains are interconnected, and
         through the ability to pipe-line computation requests to all
         of the child domains.  It has the drawback that some party has
         to own and operate the parent PCE.

      *  An alternative approach is documented by the TEAS working
         group [RFC7926].  In this model each network advertises to the
         controllers for adjacent networks (using BGP-LS) selected
         information about potential connectivity across the network.
         It does not have to show full topology and can make its own
         decisions about which paths it considers optimal for use by
         its different neighbors and customers.  This approach is
         suitable for the End-to-End Site Interconnect Traffic Steering
         problem where the backbone is under different control from the
         sites, because it allows the overlay nature of the use of the
         backbone network to be treated as a peer network relationship
         by the controllers of the sites - the sites can be operated
         using a single controller or a separate controller for each
         site.

   It is also possible to operate site interconnection when some or all
   of the networks do not have a controller.  Segment Routing is
   capable of routing a packet toward the next hop based on the top
   label on the stack, and that label does not need to indicate an
   immediately adjacent node or link.  In these cases, the packet may
   be forwarded untouched, or the forwarding router may impose a
   locally-determined additional set of labels that define the path to
   the next hop.

   PCE can be used to instruct the source host or a transit node about
   what label stacks to add to packets.  That is, a node that needs to
   impose labels (either to start routing the packet from the source
   host, or to advance the packet from a transit router toward the
   destination) can determine the label stack to use based on local
   function or can have that stack supplied by a PCE.
   The PCE Communication Protocol (PCEP) has been extended to allow the
   PCE to supply a label stack for reaching a specific destination,
   either in response to a request or in an unsolicited manner
   [RFC8664].

6.  BGP-LS Considerations

   This section gives an overview of the use of BGP-LS to export an
   abstraction (or summary) of the connectivity across the backbone
   network by means of two figures that show different views of a
   sample network.

   Figure 2 shows a more complex reference architecture.

   Figure 3 represents the minimum set of nodes and links that need to
   be advertised in BGP-LS with SR in order to perform Site
   Interconnect with traffic engineering across the backbone network:
   the PEs, ASBRs, and GWs, and the links between them.  In particular,
   EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with
   associated segment IDs are advertised in BGP-LS with SR.

   Links that are advertised may be physical links, links realized by
   LSP tunnels or SR paths, or abstract links.  It is assumed that
   intra-AS links are either real links, RSVP-TE LSPs with allocated
   bandwidth, or SR TE policies as described in
   [I-D.ietf-idr-segment-routing-te-policy].  Additional nodes internal
   to an AS and their links to PEs, ASBRs, and/or GWs may also be
   advertised (for example, to avoid full mesh problems).

   Note that Figure 3 does not show full interconnectivity.  For
   example, there is no possibility of connectivity between PE1a and
   PE1c (because there is no RSVP-TE LSP established across AS1 between
   these two nodes) and so no link is presented in the topology view.
   [RFC7926] contains further discussion of topological abstractions
   that may be useful in understanding this distinction.

   -------------------------------------------------------------------
   |                                                                 |
   | AS1                                                             |
   |  ----    ----                                     ----    ----  |
   -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|-
     ----    ----                                       ----    ----
      :       :        ------------ ------------         :   :   :
      :       :        | AS2      | |      AS3 |         :   :   :
      :       :        |  ------.....------    |         :   :   :
      :       :        | |ASBR2a|   |ASBR3a|   |         :   :   :
      :       :        |  ------  ..:------    |         :   :   :
      :       :        |    |   ..:    |       |         :   :   :
      :       :        |  ------:    ------    |         :   :   :
      :       :        | |ASBR2b|...|ASBR3b|   |         :   :   :
      :       :        |  ------     ------    |         :   :   :
      :       :        |    |           |      |         :   :   :
      :       :        |    |         ------   |         :   :   :
      :       :        |    |      ..|ASBR3c|  |         :   :   :
      :       :        |    |      :  ------   |         : ...:   :
      :  ......:       |  ----     :   ----    |         : :      :
      :  :             -|PE2a|-----:-----|PE3b|-         : :      :
      :  :               ----      :      ----           : :      :
      :  :       .......:          :          :.......   : :      :
      :  :       :              ------            :      : :      :
      :  :       :          ----|ASBR4b|----      :      : :      :
      :  :       :          |    ------    |      :      : :      :
      :  :       :   ----   |              |      :      : :      :
      :  :       :...|PE4b|        AS4     |      :      : :      :
      :  :       :    ----  |              |      :      : :      :
      :  :       :          |     ----     |      :      : :      :
      :  :       :          -----|PE4a|-----      :      : :      :
      :  :       :                ----            :      : :      :
      :  :       :              ..:  :..          :      : :      :
      :  :       :              :      :          :      : :      :
     ----    ----              ----    ----        ----    ----
    -|GW1a|--|GW1b|-          -|GW2a|--|GW2b|-    -|GW3a|--|GW3b|-
    | ----    ---- |          | ----    ---- |    | ----    ---- |
    |              |          |              |    |              |
    |              |          |              |    |              |
    | Host1a Host1b|          | Host2a Host2b|    | Host3a Host3b|
    |              |          |              |    |              |
    |              |          |              |    |              |
    | Site1        |          | Site2        |    | Site3        |
     --------------            --------------      --------------

              Figure 2: Network View of Example Configuration

     ............................................................
     :                                                          :
    ----    ----                                       ----    ----
   |PE1a|  |PE1b|.....................................|PE1c|  |PE1d|
    ----    ----                                       ----    ----
      :       :                                          :   :   :
      :       :         ------.....------                :   :   :
      :       :   ......|ASBR2a|   |ASBR3a|......         :   :   :
      :       :   :      ------  ..:------      :         :   :   :
      :       :   :        :   ..:     :        :         :   :   :
      :       :   :      ------..:   ------     :         :   :   :
      :       :   :  ...|ASBR2b|...|ASBR3b|     :         :   :   :
      :       :   :  :    ------     ------     :         :   :   :
      :       :   :  :      :           :       :         :   :   :
      :       :   :  :      :        ------     :         :   :   :
      :       :   :  :      :     ..|ASBR3c|... :         :   :   :
      :       :   :  :      :     :  ------   : :         : ...:  :
      :  ......:          ----    :    ----     :         : :     :
      :  :               |PE2a|   :   |PE3b|    :         : :     :
      :  :                ----    :    ----     :         : :     :
      :  :       .......:         :        :.......       : :     :
      :  :       :             ------          :          : :     :
      :  :       :            |ASBR4b|         :          : :     :
      :  :       :             ------          :          : :     :
      :  :       :     ----......:             :          : :     :
      :  :       :.....|PE4b|.....             :          : :     :
      :  :       :      ----     :             :          : :     :
      :  :       :              ----           :          : :     :
      :  :       :             |PE4a|          :          : :     :
      :  :       :              ----           :          : :     :
      :  :       :             ..:  :..        :          : :     :
      :  :       :             :      :        :          : :     :
     ----    ----             ----    ----       ----    ----
    -|GW1a|--|GW1b|-         -|GW2a|--|GW2b|-   -|GW3a|--|GW3b|-
    | ----    ---- |         | ----    ---- |   | ----    ---- |
    |              |         |              |   |              |
    |              |         |              |   |              |
    | Host1a Host1b|         | Host2a Host2b|   | Host3a Host3b|
    |              |         |              |   |              |
    |              |         |              |   |              |
    | Site1        |         | Site2        |   | Site3        |
     --------------           --------------     --------------

             Figure 3: Topology View of Example Configuration

   A node (a PCE, router, or host) that is computing a full or partial
   path correlates the topology information disseminated in BGP-LS with
   the information advertised in BGP (with the Tunnel Encapsulation
   attributes) and uses this to compute that path and obtain the SIDs
   for the elements on that path.  In order to allow a source host to
   compute exit points from its site, some subset of the above
   information needs to be disseminated within that site.

   What is advertised outside a given AS is controlled by policy at the
   AS's PEs, ASBRs, and GWs.  Central control of what each node should
   advertise, based upon analysis of the network as a whole, is an
   important additional function.  This and the amount of policy
   involved may make the use of a Route Reflector an attractive option.

   Local configuration at each node determines which links to other
   nodes are advertised in BGP-LS, and determines which characteristics
   of those links are advertised.  Pairwise coordination between link
   end-points is required to ensure consistency.

   Path Weighted ECMP (PWECMP) is a mechanism to load-balance traffic
   across parallel equal-cost links or paths.  In this approach, an
   ingress node distributes the flows destined for a given egress node
   across the equal-cost paths to that node, in proportion to the
   bandwidth of the lowest-bandwidth link on each path.  PWECMP can be
   used by a GW for a given source site to send all flows to a given
   destination site using all paths in the backbone network to that
   destination site, in proportion to the minimum bandwidth on each
   path.  PWECMP may also be used by hosts within a source site to send
   flows to that site's GWs.
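
   The following Python sketch illustrates the PWECMP behavior just
   described: flows are spread over the available paths in proportion
   to the lowest-bandwidth link on each path.  The paths and bandwidth
   figures are invented for the example.

      import random

      # PWECMP sketch: weight each path by its minimum link bandwidth
      # and spread flows across paths in proportion to those weights.

      paths = {                     # path -> link bandwidths (Gb/s)
          ("GW1a", "PE1a", "PE1c", "GW3b"): [100, 40, 100],
          ("GW1a", "PE1b", "PE1d", "GW3a"): [100, 10, 100],
      }
      weights = {p: min(bw) for p, bw in paths.items()}  # 40 and 10

      def pick_path(flow_id):
          """Stable per-flow selection: here, 4/5 of flows take the
          first path because its bottleneck link is four times wider."""
          total = sum(weights.values())
          point = random.Random(flow_id).uniform(0, total)
          for path, w in weights.items():
              point -= w
              if point <= 0:
                  return path
          return path

      print(pick_path(1234))   # same flow always maps to the same path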

7.  Worked Examples

   Figure 4 shows a view of the links, paths, and labels that can be
   assigned to part of the sample network shown in Figure 2 and
   Figure 3.  The double-dash lines (===) indicate LSP tunnels across
   backbone ASes and dotted lines (...) are physical links.

   A label may be assigned to each outgoing link at each node.  This is
   shown in Figure 4.  For example, at GW1a the label L201 is assigned
   to the link connecting GW1a to PE1a.  At PE1c, the label L302 is
   assigned to the link connecting PE1c to GW3b.  Labels ("binding
   SIDs") may also be assigned to RSVP-TE LSPs.  For example, at PE1a,
   label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c.

   At the destination site, label L305 is a "node-SID"; it represents
   Host3b, rather than representing a particular link.

   When a node processes a packet, the label at the top of the label
   stack indicates the link (or RSVP-TE LSP) on which that node is to
   transmit the packet.  The node pops that label off the label stack
   before transmitting the packet on the link.  However, if the top
   label is a node-SID, the node processing the packet is expected to
   transmit the packet on whatever link it regards as the shortest path
   to the node represented by the label.

     ----      L202                                          ----
    |    |==================================================|    |
    |PE1a|                                                  |PE1c|
    |    |==================================================|    |
     ----      L203                                          ----
      :                                                 L304:   :L302
      :                                                     :   :
      :    ----              L205                 ----      :   :
      :   |PE1b|==================================|PE1d|    :   :
      :    ----                                    ----     :   :
      :      :                                 L303:  :     :   :
      :      :                                     :  :     :   :
      :      :    ----   L207   ----               :  :     :   :
      :      :   |    |=======|ASBR|  L209  ----   :  :     :   :
      :      :   |    |       | 2a |.......|    |  :  :     :   :
      :      :   |    |        ----  L210  |    |  :  :     :   :
      :      :   |PE2a|        ----  ======|ASBR|  :  :     :   :
      :      :   |    |=======|ASBR|  L211 | 3a |..:..:...  :   :
      :      :   |    | L208  | 2b |.......|    |  :  : L301:   :
      :      :    ----         ----         ----   :  :     :   :
      :      :     :                               :  :     :   :
      :  ....:     :                           ....:  :     :   :
      :  :         :                           :      :     :   :
      :  :     ....:                           :      :  ...:   :
      :  :     :                               :      :  :      :
   L201:  :L204:L206                           :      :  :      :
     ----    ----                             -----    ----
    -|GW1a|--|GW1b|-                         -|GW3a |--|GW3b|-
    | ----    ---- |                         | -----    ---- |
    |   :       :  |                         | L303:    :L304|
    |   :       :  |                         |     :    :    |
    |L103:  :L102  |                         |     :    :    |
    |  N1    N2    |                         |  N3      N4   |
    |   :..  ..:   |                         |   :   ....:   |
    |     :  :     |                         |   :   :       |
    | L101:  :     |                         |   :   :       |
    |   Host1a     |                         | Host3b (L305) |
    |              |                         |               |
    | Site1        |                         | Site3         |
     --------------                           ---------------

          Figure 4: Tunnels and Labels in Example Configuration

   Note that label spaces can overlap so that, for example, the figure
   shows two instances of L303 and L304.  This is acceptable because of
   the separation between the sites, and because SIDs applied to
   outgoing interfaces are locally scoped.

   Let's consider several different possible ways to direct a packet
   from Host1a in Site1 to Host3b in Site3.

   a.  Full source route imposed at source

          In this case it is assumed that the entity responsible for
          determining an end-to-end path has access to the topologies
          of both the source and destination sites as well as of the
          backbone network.  This might happen if all of the networks
          are owned by the same operator, in which case the information
          can be shared into a single database for use by an offline
          tool, or the information can be distributed using routing
          protocols such that the source host can see enough to select
          the path.  Alternatively, the end-to-end path could be
          produced through cooperation between computation entities
          each responsible for different sites and ASes along the path.

          If the path is computed externally it is pushed to the source
          host.  Otherwise, it is computed by the source host itself.

          Suppose it is desired for a packet from Host1a to travel to
          Host3b via the following source route:

             Host1a->N1->GW1a->PE1a->(RSVP-TE
             LSP)->PE1c->GW3b->N4->Host3b

          Host1a imposes the following label stack (with the first
          label representing the top of stack), and then sends the
          packet to N1:

             L103, L201, L202, L302, L304, L305

          N1 sees L103 at the top of the stack, so it pops the stack
          and forwards the packet to GW1a.  GW1a sees L201 at the top
          of the stack, so it pops the stack and forwards the packet to
          PE1a.  PE1a sees L202 at the top of the stack, so it pops the
          stack and forwards the packet over the RSVP-TE LSP to PE1c.
          As the packet travels over this LSP, its top label is an
          RSVP-TE signaled label representing the LSP.  That is, PE1a
          imposes an additional label stack entry for the tunnel LSP.

          At the end of the LSP tunnel, the MPLS tunnel label is
          popped, and PE1c sees L302 at the top of the stack.  PE1c
          pops the stack and forwards the packet to GW3b.  GW3b sees
          L304 at the top of the stack, so it pops the stack and
          forwards the packet to N4.  Finally, N4 sees L305 at the top
          of the stack, so it pops the stack and forwards the packet to
          Host3b.

   b.  It is possible that the source site does not have visibility
       into the destination site.

          This occurs if the destination site does not export its
          topology, but does export basic reachability information so
          that the source host or the path computation entity will
          know:

          +  The GWs through which the destination can be reached.

          +  The SID to use for the destination prefix.

          Suppose we want a packet to follow the source route:

             Host1a->N1->GW1a->PE1a->(RSVP-TE
             LSP)->PE1c->GW3b->...->Host3b

          The ellipsis indicates a part of the path that is not
          explicitly specified.  Thus, the label stack imposed at the
          source host is:

             L103, L201, L202, L302, L305

          Processing is as per case a., but when the packet reaches the
          GW of the destination site (GW3b) it can either simply
          forward the packet along the shortest path to Host3b, or it
          can insert additional labels to direct the path to the
          destination.

   c.  Site1 only has reachability information for the backbone and
       destination networks

          The source site (or the path computation entity) may be
          further restricted in its view of the network.  It is
          possible that it knows the location of the destination in the
          destination site, and knows the GWs to the destination site
          that provide reachability to the destination, but that it has
          no view of the backbone network.  This leads to the packet
          being forwarded in a manner similar to the 'per-domain path
          computation' described in Section 5.6.

          At the source host a simple label stack is imposed that
          navigates the site and indicates the destination GW and the
          destination host.

             L103, L302, L305

          As the packet leaves the source site, the source GW (GW1a)
          determines the PE to use to enter the backbone using nothing
          more than the BGP preferred route to the destination GW (it
          could be PE1a or PE1b).

          When the packet reaches the first PE it has a label stack
          just identifying the destination GW and the host (L302,
          L305).  The PE uses information it has about the backbone
          network topology and available LSPs to select an LSP tunnel,
          impose the tunnel label, and forward the packet.

          When the packet reaches the end of the LSP tunnel, it is
          processed as described in case b.

   d.  Stitched LSPs across the backbone

          A variant of all these cases arises when the packet is sent
          using a path that spans multiple ASes, for example, one that
          crosses AS2 and AS3 as shown in Figure 2.

          In this case, basing the example on case a., the source host
          imposes the label stack:

             L102, L206, L207, L209, L210, L301, L303, L305

          It then sends the packet to N2.

          When the packet reaches PE2a, as previously described, the
          top label (L207) indicates an LSP tunnel that leads to
          ASBR2a.  At the end of that LSP tunnel the next label (L209)
          routes the packet from ASBR2a to ASBR3a, where the next label
          (L210) identifies the next LSP tunnel to use.  Thus, SR has
          been used to stitch together LSPs to make a longer path
          segment.  As the packet emerges from the final LSP tunnel,
          forwarding continues as previously described.

8.  Label Stack Depth Considerations

   As described in Section 3.1, one of the issues with a Segment
   Routing approach is that the label stack can get large, for example
   when the source route becomes long.  A mechanism to mitigate this
   problem is needed if the solution is to be fully applicable in all
   environments.

   [I-D.ietf-idr-segment-routing-te-policy] introduces the concept of
   hierarchical source routes as a way to compress source route
   headers.  It functions by having the egress node for a set of source
   routes advertise those source routes along with an explicit request
   that each node that is an ingress node for one or more of those
   source routes should advertise a binding SID for the set of source
   routes for which it is the ingress.  It should be noted that the set
   of source routes can either be advertised by the egress node as
   described here, or advertised by a controller on behalf of the
   egress node.

   Such an ingress node advertises its set of source routes and a
   binding SID as an adjacency in BGP-LS as described in Section 6.
   These source routes represent the weighted ECMP paths between the
   ingress node and the egress node.  Note also that the binding SID
   may be supplied by the node that advertises the source routes (the
   egress or the controller), or may be chosen by the ingress.

   A remote node that wishes to reach the egress node constructs a
   source route consisting of the segment IDs necessary to reach one of
   the ingress nodes for the path it wishes to use, along with the
   binding SID that the ingress node advertised to identify the set of
   paths.  When the selected ingress node receives a packet with a
   binding SID it has advertised, it replaces the binding SID with the
   labels for one of its source routes to the egress node (it will
   choose one of the source routes in the set according to its own
   weighting algorithms and policy).
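
   The behavior of the selected ingress node can be sketched as follows
   in Python.  The binding SID value is invented, the label values
   mirror those used in Figure 4, and the random selection stands in
   for a real weighted ECMP algorithm.

      import random

      # Sketch of binding SID expansion at an ingress GW: a binding
      # SID is replaced by one of the source routes it represents.

      class IngressGW:
          def __init__(self):
              self.bindings = {}   # binding SID -> list of label stacks

          def advertise_binding(self, sid, source_routes):
              self.bindings[sid] = source_routes  # advertised in BGP-LS

          def process(self, labels):
              top = labels[0]
              if top in self.bindings:
                  chosen = random.choice(self.bindings[top])
                  return chosen + labels[1:]      # expand binding SID
              return labels

      gw1a = IngressGW()
      gw1a.advertise_binding(900, [[201, 202, 302], [201, 203, 302]])
      # Host1a sends [103, 900, 305]; after N1 pops 103, GW1a expands
      # the binding SID 900 into one of the two source routes:
      print(gw1a.process([900, 305]))   # e.g. [201, 202, 302, 305]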
8.1.  Worked Example

   Consider the topology in Figure 4.  Suppose that it is desired to
   construct full segment routed paths from ingress to egress, but that
   the resulting label stack (segment route) is too large.  In this case
   the gateways to Site3 (GW3a and GW3b) can advertise all of the source
   routes from the gateways to Site1 (GW1a and GW1b).  The gateways to
   Site1 then assign binding SIDs to those source routes and advertise
   those SIDs into BGP-LS.

   Thus, GW3b advertises the two source routes (L201, L202, L302 and
   L201, L203, L302), and GW1a advertises into BGP-LS its adjacency to
   GW3b along with a binding SID.  Should Host1a wish to send a packet
   via GW1a and GW3b, it can include L103 and this binding SID in the
   source route.  GW1a is free to choose which source route to use
   between itself and GW3b using its weighted ECMP algorithm.

   Similarly, GW3a can advertise the following set of source routes:

   o  L201, L202, L304

   o  L201, L203, L304

   o  L204, L205, L303

   o  L206, L207, L209, L210, L301

   o  L206, L208, L211, L210, L301

   GW1a advertises a binding SID for the first three, and GW1b
   advertises a binding SID for the other two.

9.  Gateway Considerations

   As described in Section 5.2, [I-D.ietf-bess-datacenter-gateway]
   defines a new tunnel type, "SR tunnel".  When the GWs to a given site
   advertise a route to a prefix X within the site, each of them
   includes a Tunnel Encapsulation attribute with multiple tunnel
   instances, each of type "SR tunnel": one for each GW, each containing
   a Remote Endpoint sub-TLV with that GW's address.

   In other words, each route advertised by any GW identifies all of the
   GWs to the same site.

   Therefore, even if only one of the routes is distributed to other
   ASes, it will not matter how many times the next hop changes, as the
   Tunnel Encapsulation attribute (and its remote endpoint sub-TLVs)
   will remain unchanged.

9.1.  Site Gateway Auto-Discovery

   To allow a given site's GWs to auto-discover each other and to
   coordinate their operations, the following procedures are implemented
   as described in [I-D.ietf-bess-datacenter-gateway]:

   o  Each GW is configured with an identifier of the site that is
      common across all GWs to the site and unique across all sites that
      are connected.

   o  A route target [RFC4360] is attached to each GW's auto-discovery
      route and has its value set to the site identifier.

   o  Each GW constructs an import filtering rule to import any route
      that carries a route target with the same site identifier that the
      GW itself uses (see the sketch at the end of this section).  This
      means that only these GWs will import those routes, and that all
      GWs to the same site will import each other's routes and will
      learn (auto-discover) the current set of active GWs for the site.

   o  The auto-discovery route each GW advertises consists of the
      following:

      *  An IPv4 or IPv6 NLRI containing one of the GW's loopback
         addresses (that is, with an AFI/SAFI that is one of 1/1, 2/1,
         1/4, or 2/4).

      *  A Tunnel Encapsulation attribute containing the GW's
         encapsulation information, which at a minimum consists of an SR
         tunnel TLV with a Remote Endpoint sub-TLV [RFC9012].

   To avoid the side effect of applying the Tunnel Encapsulation
   attribute to any packet that is addressed to the GW, the GW should
   use a different loopback address in the advertisement from that used
   to reach the GW itself.

   Each GW will include a Tunnel Encapsulation attribute for each GW
   that is active for the site (including itself), and will include
   these in every route advertised by each GW to peers outside the site.
   As the current set of active GWs changes (due to the addition of a
   new GW or the failure/removal of an existing GW) each externally
   advertised route will be re-advertised with the set of SR tunnel
   instances reflecting the current set of active GWs.
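   As an illustration of the import rule above, the following Python
   fragment filters received auto-discovery routes by route target.
   The site identifiers, GW names, and route representation are
   assumptions invented for this sketch, not a protocol encoding.

      # Import only auto-discovery routes whose route target matches
      # this GW's configured site identifier.
      MY_SITE_ID = "SITE-3"   # configured identically on all Site3 GWs

      def import_route(route):
          return route.get("route_target") == MY_SITE_ID

      received = [
          {"gw": "GW3a", "route_target": "SITE-3"},
          {"gw": "GW3b", "route_target": "SITE-3"},
          {"gw": "GW1a", "route_target": "SITE-1"},   # not imported
      ]

      active_gws = [r["gw"] for r in received if import_route(r)]
      print(active_gws)   # ['GW3a', 'GW3b'] - the auto-discovered set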
9.2.  Relationship to BGP Link State and Egress Peer Engineering

   When a remote GW receives a route to a prefix X, it can use the SR
   tunnel instances within the contained Tunnel Encapsulation attribute
   to identify the GWs through which X can be reached.  It uses this
   information to compute SR TE paths across the backbone network from
   the information advertised to it in SR BGP Link State (BGP-LS)
   [I-D.ietf-idr-bgp-ls-segment-routing-ext], correlated using the site
   identity.  SR Egress Peer Engineering (EPE)
   [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement
   the information advertised in BGP-LS.

9.3.  Advertising a Site Route Externally

   When a packet destined for prefix X is sent on an SR TE path to a GW
   for the site containing X, it needs to carry the receiving GW's label
   for X such that this label rises to the top of the stack before the
   GW completes its processing of the packet.  To achieve this we place
   a prefix-SID sub-TLV for X in each SR tunnel instance in the Tunnel
   Encapsulation attribute in the externally advertised route for X
   (see the sketch following Section 9.4).

   Alternatively, if the GWs for a given site are configured to allow
   remote GWs to perform SR TE through that site for prefix X, then each
   GW computes an SR TE path through that site to X from each of the
   current active GWs and places each in an MPLS label stack sub-TLV
   [RFC9012] in the SR tunnel instance for that GW.

9.4.  Encapsulations

   If the GWs for a given site are configured to allow remote GWs to
   send them packets in that site's native encapsulation, then each GW
   will also include multiple instances of a tunnel TLV for that native
   encapsulation in the externally advertised routes: one for each GW,
   and each containing a remote endpoint sub-TLV with that GW's address.
   A remote GW may then encapsulate a packet according to the rules
   defined via the sub-TLVs included in each of the tunnel TLV
   instances.
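   The following Python fragment gives a schematic of the externally
   advertised route described in Sections 9 and 9.3.  The field names
   are informal stand-ins for the BGP encodings of [RFC9012], and the
   addresses, prefix, and SID value are assumptions invented for this
   sketch.

      # Schematic of a route for prefix X carrying one SR tunnel
      # instance per active GW, each with a Remote Endpoint sub-TLV
      # and a prefix-SID sub-TLV for X.  All values are illustrative.
      SITE_GWS = {"GW3a": "192.0.2.1", "GW3b": "192.0.2.2"}

      def advertise_route(prefix, prefix_sid):
          return {
              "nlri": prefix,
              "tunnel_encapsulation": [
                  {"type": "SR tunnel",
                   "remote_endpoint": addr,   # Remote Endpoint sub-TLV
                   "prefix_sid": prefix_sid}  # prefix-SID sub-TLV for X
                  for addr in SITE_GWS.values()
              ],
          }

      print(advertise_route("198.51.100.0/24", "L302"))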
10.  Security Considerations

   There are several security domains and associated threats in this
   architecture.  SR is itself a data transmission encapsulation that
   provides no additional security, so security in this architecture
   relies on higher-layer mechanisms (for example, end-to-end encryption
   of payload data), on the security of the protocols used to establish
   connectivity and distribute network information, and on access
   control so that control plane and data plane packets are not admitted
   to the network from outside.

   This architecture utilizes a number of control plane protocols within
   sites, within the backbone, and north-south between controllers and
   sites.  Only minor modifications are made to BGP, as described in
   [I-D.ietf-bess-datacenter-gateway]; otherwise this architecture uses
   existing protocols and extensions, so no new security risks are
   introduced.

   Special care should, however, be taken when routing protocols export
   or import information from or to domains that might have a security
   model based on secure boundaries and internal mutual trust.  This is
   notable when:

   o  BGP-LS is used to export topology information from within a domain
      to a controller that is sited outside the domain.

   o  A southbound protocol such as BGP-LU or NETCONF is used to install
      state in the network from a controller that may be sited outside
      the domain.

   In these cases protocol security mechanisms should be used to protect
   the information in transit entering or leaving the domain, and to
   authenticate the out-of-domain nodes (the controllers) to ensure that
   confidential/private information is not lost and that data or
   configuration is not falsified.

   In this context, a domain may be considered to be a site, an AS, or
   the whole SR domain.

11.  Management Considerations

   Configuration elements for the approaches described in this document
   are minor but crucial.

   Each GW to a site is configured with the same site identifier, and
   that identifier is unique across all sites that are connected.  This
   requires some coordination both within a site and between cooperating
   sites.  There are no requirements for how this configuration and
   coordination are achieved, but it is assumed that management systems
   are involved.

   Policy determines what topology information is shared by a BGP-LS
   speaker (see Section 6).  This applies both to the advertisement of
   interdomain links and their characteristics, and to the advertisement
   of summarized domain topology or connectivity.  This policy is a
   local (i.e., domain-scoped) configuration dependent on the objectives
   and business imperatives of the domain operator.

   Domain boundaries are usually configured to limit the control and
   interaction from other domains (for example, to not allow end-to-end
   TE paths to be set up across AS boundaries).  As noted in
   Section 9.3, the GWs for a given site can be configured to allow
   remote GWs to perform SR TE through that site for a given prefix, a
   set of prefixes, or all reachable prefixes.

   Similarly (as described in Section 9.4), the GWs for a given site can
   be configured to allow remote GWs to send them packets in that site's
   native encapsulation.

12.  IANA Considerations

   This document makes no requests for IANA action.

13.  Acknowledgements

   Thanks to Jeffery Zhang for his careful review.
Lin, "Advertising Segment 1441 Routing Policies in BGP", draft-ietf-idr-segment-routing- 1442 te-policy-12 (work in progress), May 2021. 1444 [I-D.ietf-pce-binding-label-sid] 1445 Sivabalan, S., Filsfils, C., Tantsura, J., Previdi, S., 1446 and C. Li, "Carrying Binding Label/Segment Identifier in 1447 PCE-based Networks.", draft-ietf-pce-binding-label-sid-08 1448 (work in progress), April 2021. 1450 [RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended 1451 Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, 1452 February 2006, . 1454 [RFC5152] Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A 1455 Per-Domain Path Computation Method for Establishing Inter- 1456 Domain Traffic Engineering (TE) Label Switched Paths 1457 (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February 2008, 1458 . 1460 [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation 1461 Element (PCE) Communication Protocol (PCEP)", RFC 5440, 1462 DOI 10.17487/RFC5440, March 2009, 1463 . 1465 [RFC5441] Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux, 1466 "A Backward-Recursive PCE-Based Computation (BRPC) 1467 Procedure to Compute Shortest Constrained Inter-Domain 1468 Traffic Engineering Label Switched Paths", RFC 5441, 1469 DOI 10.17487/RFC5441, April 2009, 1470 . 1472 [RFC5520] Bradford, R., Ed., Vasseur, JP., and A. Farrel, 1473 "Preserving Topology Confidentiality in Inter-Domain Path 1474 Computation Using a Path-Key-Based Mechanism", RFC 5520, 1475 DOI 10.17487/RFC5520, April 2009, 1476 . 1478 [RFC6805] King, D., Ed. and A. Farrel, Ed., "The Application of the 1479 Path Computation Element Architecture to the Determination 1480 of a Sequence of Domains in MPLS and GMPLS", RFC 6805, 1481 DOI 10.17487/RFC6805, November 2012, 1482 . 1484 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 1485 S. Ray, "North-Bound Distribution of Link-State and 1486 Traffic Engineering (TE) Information Using BGP", RFC 7752, 1487 DOI 10.17487/RFC7752, March 2016, 1488 . 1490 [RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., 1491 Litkowski, S., Horneffer, M., and R. Shakir, "Source 1492 Packet Routing in Networking (SPRING) Problem Statement 1493 and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 1494 2016, . 1496 [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, 1497 "Advertisement of Multiple Paths in BGP", RFC 7911, 1498 DOI 10.17487/RFC7911, July 2016, 1499 . 1501 [RFC7926] Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G., 1502 Ceccarelli, D., and X. Zhang, "Problem Statement and 1503 Architecture for Information Exchange between 1504 Interconnected Traffic-Engineered Networks", BCP 206, 1505 RFC 7926, DOI 10.17487/RFC7926, July 2016, 1506 . 1508 [RFC8231] Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path 1509 Computation Element Communication Protocol (PCEP) 1510 Extensions for Stateful PCE", RFC 8231, 1511 DOI 10.17487/RFC8231, September 2017, 1512 . 1514 [RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address 1515 Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017, 1516 . 1518 [RFC8281] Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "Path 1519 Computation Element Communication Protocol (PCEP) 1520 Extensions for PCE-Initiated LSP Setup in a Stateful PCE 1521 Model", RFC 8281, DOI 10.17487/RFC8281, December 2017, 1522 . 1524 [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., 1525 Decraene, B., Litkowski, S., and R. Shakir, "Segment 1526 Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, 1527 July 2018, . 
   [RFC8660]  Bashandy, A., Ed., Filsfils, C., Ed., Previdi, S.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing with the MPLS Data Plane", RFC 8660,
              DOI 10.17487/RFC8660, December 2019,
              <https://www.rfc-editor.org/info/rfc8660>.

   [RFC8664]  Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W.,
              and J. Hardwick, "Path Computation Element Communication
              Protocol (PCEP) Extensions for Segment Routing", RFC 8664,
              DOI 10.17487/RFC8664, December 2019,
              <https://www.rfc-editor.org/info/rfc8664>.

   [RFC8665]  Psenak, P., Ed., Previdi, S., Ed., Filsfils, C., Gredler,
              H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF
              Extensions for Segment Routing", RFC 8665,
              DOI 10.17487/RFC8665, December 2019,
              <https://www.rfc-editor.org/info/rfc8665>.

   [RFC8667]  Previdi, S., Ed., Ginsberg, L., Ed., Filsfils, C.,
              Bashandy, A., Gredler, H., and B. Decraene, "IS-IS
              Extensions for Segment Routing", RFC 8667,
              DOI 10.17487/RFC8667, December 2019,
              <https://www.rfc-editor.org/info/rfc8667>.

   [RFC8669]  Previdi, S., Filsfils, C., Lindem, A., Ed., Sreekantiah,
              A., and H. Gredler, "Segment Routing Prefix Segment
              Identifier Extensions for BGP", RFC 8669,
              DOI 10.17487/RFC8669, December 2019,
              <https://www.rfc-editor.org/info/rfc8669>.

   [RFC9012]  Patel, K., Van de Velde, G., Sangli, S., and J. Scudder,
              "The BGP Tunnel Encapsulation Attribute", RFC 9012,
              DOI 10.17487/RFC9012, April 2021,
              <https://www.rfc-editor.org/info/rfc9012>.

Authors' Addresses

   Adrian Farrel
   Old Dog Consulting

   Email: adrian@olddog.co.uk

   John Drake
   Juniper Networks

   Email: jdrake@juniper.net