2 SPRING Working Group A. Farrel 3 Internet-Draft J. Drake 4 Intended status: Informational Juniper Networks 5 Expires: April 16, 2019 October 13, 2018 7 Interconnection of Segment Routing Domains - Problem Statement and 8 Solution Landscape 9 draft-farrel-spring-sr-domain-interconnect-05 11 Abstract 13 Segment Routing (SR) is a forwarding paradigm for use in MPLS and 14 IPv6 networks. It is intended to be deployed in discrete domains 15 that may be data centers, access networks, or other networks that are 16 under the control of a single operator and that can easily be 17 upgraded to support this new technology. 19 Traffic originating in one SR domain often terminates in another SR 20 domain, but must transit a backbone network that provides 21 interconnection between those domains. 23 This document describes a mechanism for providing connectivity 24 between SR domains to enable end-to-end or domain-to-domain traffic 25 engineering.
27 The approach described allows connectivity between SR domains, 28 utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing) 29 across the backbone network, makes heavy use of pre-existing 30 technologies, and requires the specification of very few additional 31 mechanisms. 33 This document provides some background and a problem statement, 34 explains the solution mechanism, gives references to other documents 35 that define protocol mechanisms, and provides examples. It does not 36 define any new protocol mechanisms. 38 Status of This Memo 40 This Internet-Draft is submitted in full conformance with the 41 provisions of BCP 78 and BCP 79. 43 Internet-Drafts are working documents of the Internet Engineering 44 Task Force (IETF). Note that other groups may also distribute 45 working documents as Internet-Drafts. The list of current Internet- 46 Drafts is at https://datatracker.ietf.org/drafts/current/. 48 Internet-Drafts are draft documents valid for a maximum of six months 49 and may be updated, replaced, or obsoleted by other documents at any 50 time. It is inappropriate to use Internet-Drafts as reference 51 material or to cite them other than as "work in progress." 53 This Internet-Draft will expire on April 16, 2019. 55 Copyright Notice 57 Copyright (c) 2018 IETF Trust and the persons identified as the 58 document authors. All rights reserved. 60 This document is subject to BCP 78 and the IETF Trust's Legal 61 Provisions Relating to IETF Documents 62 (https://trustee.ietf.org/license-info) in effect on the date of 63 publication of this document. Please review these documents 64 carefully, as they describe your rights and restrictions with respect 65 to this document. Code Components extracted from this document must 66 include Simplified BSD License text as described in Section 4.e of 67 the Trust Legal Provisions and are provided without warranty as 68 described in the Simplified BSD License. 70 Table of Contents 72 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 73 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 74 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 75 3. Solution Technologies . . . . . . . . . . . . . . . . . . . . 7 76 3.1. Characteristics of Solution Technologies . . . . . . . . 7 77 4. Decomposing the Problem . . . . . . . . . . . . . . . . . . . 9 78 5. Solution Space . . . . . . . . . . . . . . . . . . . . . . . 11 79 5.1. Global Optimization of the Paths . . . . . . . . . . . . 11 80 5.2. Figuring Out the GWs at a Destination Domain for a Given 81 Prefix . . . . . . . . . . . . . . . . . . . . . . . . . 11 82 5.3. Figuring Out the Backbone Egress ASBRs . . . . . . . . . 12 83 5.4. Making use of RSVP-TE LSPs Across the Backbone . . . . . 12 84 5.5. Data Plane . . . . . . . . . . . . . . . . . . . . . . . 13 85 5.6. Centralized and Distributed Controllers . . . . . . . . . 15 86 6. BGP-LS Considerations . . . . . . . . . . . . . . . . . . . . 18 87 7. Worked Examples . . . . . . . . . . . . . . . . . . . . . . . 21 88 8. Label Stack Depth Considerations . . . . . . . . . . . . . . 26 89 8.1. Worked Example . . . . . . . . . . . . . . . . . . . . . 27 90 9. Gateway Considerations . . . . . . . . . . . . . . . . . . . 28 91 9.1. Domain Gateway Auto-Discovery . . . . . . . . . . . . . . 28 92 9.2. Relationship to BGP Link State and Egress Peer 93 Engineering . . . . . . . . . . . . . . . . . . . . . . . 29 94 9.3. Advertising a Domain Route Externally . . . . . . . . . . 29 95 9.4. Encapsulations . . . 
. . . . . . . . . . . . . . . . . . 30 97 10. Security Considerations . . . . . . . . . . . . . . . . . . . 30 98 11. Management Considerations . . . . . . . . . . . . . . . . . . 31 99 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 100 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 31 101 14. Informative References . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 35 104 1. Introduction 106 Data Centers are a growing market sector. They are being set up by 107 new specialist companies, by enterprises for their own use, by legacy 108 ISPs, and by the new wave of network operators. The networks inside 109 Data Centers are currently well-planned, but the traffic loads can be 110 unpredictable. There is a need to be able to direct traffic within a 111 Data Center to follow a specific path. 113 Data Centers are attached to external ("backbone") networks to allow 114 access by users and to facilitate communication among Data Centers. 115 An individual Data Center may be attached to multiple backbone 116 networks, and may have multiple points of attachment to each backbone 117 network. Traffic to or from a Data Center may need to be directed to 118 or from any of these points of attachment. 120 Segment Routing (SR) is a technology that places forwarding state 121 into each packet as a stack of loose hops. SR is an option for 122 building Data Centers, and is also seeing increasing traction in edge 123 and access networks as well as in backbone networks. It is typically 124 deployed in discrete domains that are under the control of a single 125 operator and that can easily be upgraded to support this new 126 technology. 128 Traffic originating in one SR domain often terminates in another SR 129 domain, but must transit a backbone network that provides 130 interconnection between those domains. This document describes an 131 approach that builds on existing technologies to produce mechanisms 132 that provide scalable and flexible interconnection of SR domains, and 133 that will be easy to operate. 135 The approach described allows end-to-end connectivity between SR 136 domains across an MPLS backbone network, utilizes traffic engineering 137 mechanisms (RSVP-TE or Segment Routing) across the backbone network, 138 makes heavy use of pre-existing technologies, and requires the 139 specification of very few additional mechanisms. 141 This document provides some background and a problem statement, 142 explains the solution mechanism, gives references to other documents 143 that define protocol mechanisms, and provides examples. It does not 144 define any new protocol mechanisms. 146 1.1. Terminology 148 This document uses Segment Routing terminology from [RFC7855] and 149 [RFC8402]. Particular abbreviations of note are: 151 o SID: a segment identifier 153 o SRGB: an SR Global Block 155 In the context of this document, the terms "optimal" and "optimality" 156 refer to making the best possible use of network resources, and 157 achieving network paths that best meet the objectives of the network 158 operators and customers. 160 Further terms are defined in Section 2. 162 2. Problem Statement 164 Consider the network in Figure 1. Without loss of generality, this 165 figure can be used to represent the architecture and problem space 166 for steering traffic within and between SR edge domains. The figure 167 shows a single destination for all traffic that we will consider. 
169 In describing the problem space and the solution we use five terms 170 for network nodes as follows: 172 SR domain : This term is defined in [RFC8402]. In this document, an 173 SR domain is a collection of SR-capable nodes under the care of 174 one administrator or protocol. This may mean that each edge 175 network is an SR domain attached to the backbone network through 176 one or more gateways. Examples include access networks, Data 177 Center sites, and backbone networks that run SR. 180 Host : A node within an edge domain. It may be an end system or a 181 transit node in the edge domain. 183 Gateway (GW) : Provides access to or from an edge domain. Examples 184 are Customer Edge nodes (CEs), Autonomous System Border Routers 185 (ASBRs), and Data Center gateways. 187 Provider Edge (PE) : Provides access to or from the backbone 188 network. 190 Autonomous System Border Router (ASBR) : Provides access to one 191 Autonomous System (AS) in the backbone network from another AS in 192 the backbone network. 194 These terms can be seen in Figure 1 where the various sources 195 and the destination are hosts. In this figure we distinguish between 196 the PEs that provide access to the backbone networks and the Gateways 197 that provide access to the SR edge domains: these may, in fact, be 198 the same equipment and the PEs might be located at the domain edges. 200 ------------------------------------------------------------------- 201 | | 202 | AS1 | 203 | ---- ---- ---- ---- | 204 -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|- 205 ---- ---- ---- ---- 206 : : ------------ ------------ : : 207 : : | AS2 | | AS3 | : : 208 : : | ------ ------ | : : 209 : : | |ASBR2a|...|ASBR3a| | : : 210 : : | ------ ------ | : : 211 : : | | | | : : 212 : : | ------ ------ | : : 213 : : | |ASBR2b|...|ASBR3b| | : : 214 : : | ------ ------ | : : 215 : : | | | | : : 216 : ......: | ---- | | ---- | : : 217 : : -|PE2a|----- -----|PE3a|- : : 218 : : ---- ---- : : 219 : : ......: :....... : : 220 : : : : : : 221 ---- ---- ---- ---- 222 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- 223 | ---- ---- | | ---- ---- | 224 | | | | 225 | | | Source3 | 226 | Source2 | | | 227 | | | Source4 | 228 | Source1 | | | 229 | | | Destination | 230 | | | | 231 | | | | 232 | Domain1 | | Domain2 | 233 ---------------- ---------------- 235 Figure 1: Reference Architecture for SR Domain Interconnect 237 Traffic to the destination may originate from multiple sources within 238 the destination's own domain (we show two such sources: Source3 and Source4). 239 Furthermore, traffic intended for the destination may arrive from 240 outside the domain through any of the points of attachment to the 241 backbone networks (we show GW2a and GW2b). This traffic may need to 242 be steered within the domain to achieve load-balancing across network 243 resources, to avoid degraded or out-of-service resources (including 244 planned service outages), and to achieve different qualities of 245 service. Of course, traffic in a remote source domain may also need 246 to be steered within that domain. We class this problem as "Intra- 247 Domain Traffic Steering". 249 Traffic across the backbone networks may need to be steered to 250 conform to common Traffic Engineering (TE) paradigms. That is, the 251 path across any network (shown in the figure as an AS) or across any 252 collection of networks may need to be chosen and may be different 253 from the shortest path first (SPF) routing that would occur without 254 TE.
Furthermore, the points of inter-connection between networks may 255 need to be selected, and that selection influences the path chosen for the data. We 256 class this problem as "Inter-Domain Traffic Steering". 258 The composite end-to-end path comprises steering in the source 259 domain, choice of source domain exit point, steering across the 260 backbone networks, choice of network interconnections, choice of 261 destination domain entry point, and steering in the destination 262 domain. These issues may be inter-dependent (for example, the best 263 traffic steering in the source domain may help select the best exit 264 point from that domain, but the connectivity options across the 265 backbone network may drive the selection of a different exit point). 266 We class this combination of problems as "End-to-End Domain 267 Interconnect Traffic Steering". 269 It should be noted that the solution to the End-to-End Domain 270 Interconnect Traffic Steering problem depends on a number of factors: 272 o What technology is deployed in the domains. 274 o What technology is deployed in the backbone networks. 276 o How much information the domains are willing to share with each 277 other. 279 o How much information the backbone network operators and the domain 280 operators are willing to share. 282 In some cases, the domains and backbone networks are all owned and 283 operated by the same company (with the backbone network often being a 284 private network). In other cases, the domains are operated by one 285 company, with other companies operating the backbone. 287 3. Solution Technologies 289 Segment Routing (SR from the SPRING working group in the IETF 290 [RFC7855] and [RFC8402]) introduces traffic steering capabilities 291 into an MPLS network [I-D.ietf-spring-segment-routing-mpls] by 292 utilizing existing data plane capabilities (label pop and packet 293 forwarding - "pop and go") in combination with additions to existing 294 IGPs ([I-D.ietf-ospf-segment-routing-extensions] and 295 [I-D.ietf-isis-segment-routing-extensions]), BGP (as BGP-LU) 296 [RFC8277], or a centralized controller to distribute "per-hop" 297 labels. An MPLS label stack can be imposed on a packet to describe a 298 sequence of links/nodes to be transited by the packet; as each hop is 299 transited, the label that represents it is popped from the stack and 300 the packet is forwarded. Thus, on a packet-by-packet basis, traffic 301 can be steered within the SR domain. 303 This document broadens the problem space to consider interconnection 304 of any type of edge domain. These may be Data Center sites, but they 305 may equally be access networks, VPN sites, or any other form of 306 domain that includes packet sources and destinations. We 307 particularly focus on "SR edge domains" being source or destination 308 domains that utilize MPLS SR, but the domains could use other non- 309 MPLS technologies (such as IP, VXLAN, and NVGRE) as described in 310 Section 9. 312 Backbone networks are commonly based on MPLS-capable hardware. In 313 these networks, a number of different options exist to establish TE 314 paths. Among these options are static Label Switched Paths (LSPs), 315 perhaps set up by an SDN controller, LSP tunnels established using a 316 signaling protocol (such as RSVP-TE), and inter-domain use of SR (as 317 described above for intra-domain steering). Where traffic steering 318 (without resource reservation) is needed, SR may be adequate; where 319 Traffic Engineering is needed (i.e., traffic steering with resource 320 reservation), RSVP-TE or centralized SDN control are preferred. 321 However, in a network that is fully managed and controlled through a 322 centralized planning tool, resource reservation can be achieved and 323 SR can be used for full Traffic Engineering. These solutions are 324 already used in support of a number of edge-to-edge services such as 325 L3VPN and L2VPN.
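Returning to the "pop and go" behavior described at the start of this
section, the following minimal Python sketch shows how a stack of
per-hop labels steers a packet, one pop per hop. It is illustrative
only: the topology and label values are invented and do not come from
the figures in this document.

   # Minimal sketch of SR-MPLS "pop and go" forwarding (invented
   # topology and labels, for illustration only).

   # Each node maps an incoming top-of-stack label to the neighbor on
   # the corresponding outgoing link.
   FIB = {
       "A": {101: "B"},
       "B": {102: "C"},
       "C": {103: "D"},
   }

   def forward(node, label_stack):
       """Pop the top label at each hop and forward on the link it names."""
       while label_stack:
           top = label_stack[0]
           next_hop = FIB[node][top]      # link identified by the top label
           label_stack = label_stack[1:]  # pop before forwarding
           node = next_hop
       return node

   # A stack of per-hop labels steers the packet A -> B -> C -> D.
   assert forward("A", [101, 102, 103]) == "D"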
327 3.1. Characteristics of Solution Technologies 329 Each of the solution technologies mentioned in the previous section 330 has certain characteristics, and the combined solution needs to 331 recognize and address these characteristics in order to make a 332 workable solution. 334 o When SR is used for traffic steering, the size of the MPLS label 335 stack used in SR scales linearly with the length of the strict 336 source route. This can cause issues with MPLS implementations 337 that only support label stacks of a limited size. For example, 338 some MPLS implementations cannot push enough labels on the stack 339 to represent an entire source route. Other implementations may be 340 unable to do the proper "ECMP hashing" if the label stack is too 341 long; they may be unable to read enough of the packet header to 342 find an entropy label or to find the IP header of the payload. 343 Increasing the packet header size also reduces the size of the 344 payload that can be carried in an MPLS packet. There are 345 techniques that can be used to reduce the size of the label stack. 346 For example, a source route may be made less specific through the 347 use of loose hops requiring fewer labels, or a single label (known 348 as a "binding SID") can be used to represent a sequence of nodes; 349 this label can be replaced with a set of labels when the packet 350 reaches the first node in the sequence. It is also possible to 351 combine SR with conventional RSVP-TE by using a binding SID in the 352 label stack to represent an LSP tunnel set up by RSVP-TE. A sketch of this stack-compression technique appears after this list. 354 o Most of the work on using SR for traffic steering assumes that 355 traffic only needs to be steered within a single administrative 356 domain. If the backbone consists of multiple ASes that are not 357 part of a common administrative domain, the use of SR across the 358 backbone may prove to be a challenge, and its use in the backbone 359 may be limited to cases where private networks connect the 360 domains, rather than cases where the domains are connected by 361 third-party network operators or by the public Internet. 363 o RSVP-TE has been used to provide edge-to-edge tunnels through 364 which flows to/from many endpoints can be routed, and this 365 provides a reduction in state while still offering Traffic 366 Engineering across the backbone network. However, this requires 367 O(n^2) connections, and as the number of edge domains increases this 368 becomes unsustainable. 370 o A centralized control system is capable of producing more 371 efficient use of network resources and of allowing better 372 coordination of network usage and of network diagnostics. 373 However, such a system may present challenges in large and dynamic 374 networks because it relies on all network state being held 375 centrally, and it is difficult to make central control as robust 376 and self-correcting as distributed control.
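The following Python fragment sketches the binding SID idea from the
first bullet above. It uses the label names that appear later in
Figure 4 and Section 8.1; the binding SID value "B1" is hypothetical.

   # Sketch of label stack compression with a binding SID.  "B1" is a
   # hypothetical binding SID that GW1a has advertised for the source
   # route L201, L202, L302 (see Figure 4 and Section 8.1).
   BINDING_SIDS = {"B1": ["L201", "L202", "L302"]}

   def expand_binding(stack):
       """At the node that owns the binding SID, swap the single label
       for the sequence of labels it represents."""
       top = stack[0]
       if top in BINDING_SIDS:
           return BINDING_SIDS[top] + stack[1:]
       return stack

   # The source imposes two labels instead of four; the full stack only
   # exists downstream, at the node that advertised the binding SID.
   assert expand_binding(["B1", "L305"]) == ["L201", "L202", "L302", "L305"]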
378 This document introduces an approach that blends the best points of 379 each of these solution technologies to achieve a trade-off where 380 RSVP-TE tunnels in the backbone network are stitched together using 381 SR, and end-to-end SR paths can be created under the control of a 382 central controller with routing devolved to the constituent networks 383 where possible. 385 4. Decomposing the Problem 387 It is important to decompose the problem to take account of different 388 regions spanned by the end-to-end path. These regions may use 389 different technologies and may be under different administrative 390 control. The separation of administrative control is particularly 391 important because the operator of one region may be unwilling to 392 share information about their networks, and may be resistant to 393 allowing a third party to exert control over their network resources. 395 Using the reference model in Figure 1, we can consider how to get a 396 packet from Source1 to the Destination. The following decisions must 397 be made: 399 o In which domain Destination lies. 401 o Which exit point from Domain1 to use. 403 o Which entry point to Domain2 to use. 405 o How to reach the exit point of Domain1 from Source1. 407 o How to reach the entry point to Domain2 from the exit point of 408 Domain1. 410 o How to reach Destination from the entry point to Domain2. 412 As already mentioned, these decisions may be inter-related. 413 Nevertheless, we can break the problem down into three steps: 415 1. Get the packet from Source1 to the exit point of Domain1. 417 2. Get the packet from exit point of Domain1 to entry point of 418 Domain2. 420 3. Get the packet from entry point of Domain2 to Destination. 422 The solution needs to achieve this in a way that allows: 424 o Adequate discovery of preferred elements in the end-to-end path 425 (such as the location of the destination, and the selection of the 426 destination domain entry point). 428 o Full control of the end-to-end path if all of the operators are 429 willing. 431 o Re-use of existing techniques and technologies. 433 From a technology point of view we must support several functions and 434 mixtures of those functions: 436 o If a domain uses MPLS Segment Routing, the labels within the 437 domain may be populated by any means including BGP-LU [RFC8277], 438 IGP [I-D.ietf-isis-segment-routing-extensions] 439 [I-D.ietf-ospf-segment-routing-extensions], and central control. 440 Source routes within the domain may be expressed as label stacks 441 pushed by a controller or computed by a source router, or 442 expressed as a single label and programmed into the domain routers 443 by a controller. 445 o If a domain uses other (non-MPLS) forwarding, the domain 446 processing is specific to that technology. See Section 9 for 447 details. 449 o If the domains use Segment Routing, the source and destination 450 domains may or may not be in the same 'Segment Routing domain' 451 [RFC8402], so that the prefix-SIDs may be the same or different in 452 the two domains. 454 o The backbone network may be a single private network under the 455 control of the owner of the domains and comprising one or more 456 ASes, or may be a network operated by one or more third parties. 458 o The backbone network may utilize MPLS Traffic Engineering tunnels 459 in conjunction with MPLS Segment Routing, and the domain-to-domain 460 source route may be provided by stitching TE LSPs.
462 o A single controller may be used to handle the source and 463 destination domains as well as the backbone network, or there may 464 be a different controller for the backbone network separate from 465 the one that controls the two domains, or there may be separate 466 controllers for each network. The controllers may cooperate and 467 share information to different degrees. 469 All of these different decompositions of the problem reflect 470 different deployment choices and different commercial and operational 471 practices, each with different functional trade-offs. For example, 472 with separate controllers that do not share information and that only 473 cooperate to a limited extent, it will be possible to achieve end-to- 474 end connectivity with optimal routing at each step (domain or 475 backbone AS), but the end-to-end path that is achieved might not be 476 optimal. 478 5. Solution Space 480 5.1. Global Optimization of the Paths 482 Global optimization of the path from one domain to another requires 483 either that the source controller has a complete view of the end-to- 484 end topology or some form of cooperation between controllers (such as 485 in Backward Recursive Path Computation (BRPC) in [RFC5441]). 487 BGP-LS [RFC7752] can be used to provide the "source" controller with 488 a view of the topology of the backbone: that topology may be 489 abstracted or partial. This requires some of the BGP speakers in 490 each AS to have BGP-LS sessions to the controller. Other means of 491 obtaining this view of the topology are of course possible. 493 5.2. Figuring Out the GWs at a Destination Domain for a Given Prefix 495 Suppose GW2a and GW2b both advertise a route to prefix X, each 496 setting itself as next hop. One might think that the GWs for X could 497 be inferred from the routes' next hop fields, but typically only the 498 "best" route (as selected by BGP) gets distributed across the 499 backbone: the other route is discarded. But the best route according 500 to the BGP selection process might not be the route via the GW that 501 we want to use for traffic engineering purposes. 503 The obvious solution would be to use the ADD-PATH mechanism [RFC7911] 504 to ensure that all routes to X get advertised. However, even if one 505 does this, the identity of the GWs would get lost as soon as the 506 routes got distributed through an ASBR that sets next hop self. And 507 if there are multiple ASes in the backbone, not only will the next 508 hop change several times, but the ADD-PATH mechanism will experience 509 scaling issues. So this "obvious" solution only works within a 510 single AS. 512 A better solution can be achieved using the Tunnel Encapsulation 513 [I-D.ietf-idr-tunnel-encaps] attribute as follows. 515 We define a new tunnel type, "SR tunnel", and when the GWs to a given 516 domain advertise a route to a prefix X within the domain, they each 517 include a Tunnel Encapsulation attribute with multiple remote 518 endpoint sub-TLVs, each of which identifies a specific GW to the 519 domain. 521 In other words, each route advertised by any GW identifies all of the 522 GWs to the same domain (see Section 9 for a discussion of how GWs 523 discover each other). Therefore, only one of the routes needs to be 524 distributed to other ASes, and it doesn't matter how many times the 525 next hop changes: the Tunnel Encapsulation attribute (and its remote 526 endpoint sub-TLVs) remains unchanged and discloses the full list of 527 GWs to the domain.
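The effect can be pictured with a short Python sketch. This is
schematic only: the dictionaries stand in for BGP routes, and the
field names are simplified from [I-D.ietf-idr-tunnel-encaps].

   # Schematic of the multi-gateway advertisement (simplified fields;
   # not a real BGP implementation).

   # A GW to the destination domain advertises prefix X listing *all*
   # of the domain's GWs as "SR tunnel" remote endpoints.
   route_from_gw2a = {
       "prefix": "X",
       "next_hop": "GW2a",
       "tunnel_encap": [{"type": "SR tunnel", "remote_endpoint": gw}
                        for gw in ("GW2a", "GW2b")],
   }

   def asbr_next_hop_self(route, asbr):
       """An ASBR rewrites the next hop but never touches the attribute."""
       return {**route, "next_hop": asbr}

   # Even after several next-hop rewrites, the receiver can still
   # recover the full GW set from the Tunnel Encapsulation attribute.
   r = asbr_next_hop_self(asbr_next_hop_self(route_from_gw2a, "ASBR3a"),
                          "PE1c")
   gws = {t["remote_endpoint"] for t in r["tunnel_encap"]}
   assert gws == {"GW2a", "GW2b"}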
529 Further, when a packet destined for prefix X is sent on a TE path to 530 GW2a, we want the packet to arrive at GW2a carrying, at the top of its 531 label stack, GW2a's label for prefix X. To achieve this we place the 532 SID/SRGB in a sub-TLV of the Tunnel Encapsulation attribute. We 533 define the prefix-SID sub-TLV to be essentially identical in syntax 534 to the prefix-SID attribute (see [I-D.ietf-idr-bgp-prefix-sid]), but 535 the semantics are somewhat different. 537 We also define an "MPLS Label Stack" sub-TLV for the Tunnel 538 Encapsulation attribute, and put this in the "SR tunnel" TLV. This 539 allows the destination GW to specify a label stack that it wants 540 packets destined for prefix X to have. This label stack represents a 541 source route through the destination domain. 543 5.3. Figuring Out the Backbone Egress ASBRs 545 We need to figure out the backbone egress ASBRs that are attached to 546 a given GW at the destination domain in order to properly engineer 547 the path across the backbone. 549 The "cleanest" way to do this is to have the backbone egress ASBRs 550 distribute the information to the source controller using the egress 551 peer engineering (EPE) extensions of BGP-LS 552 [I-D.ietf-idr-bgpls-segment-routing-epe]. The EPE extensions to BGP- 553 LS allow a BGP speaker to say, "Here is a list of my EBGP neighbors, 554 and here is a (locally significant) adjacency-SID for each one." 556 It may also be possible to use cooperating PCEs or the Hierarchical 557 PCE approach described in [RFC6805]. But it should be observed 558 that this question is dependent on the questions in Section 5.2. 559 That is, it is not possible to even start the selection of egress 560 ASBRs until it is known which GWs at the destination domain provide 561 access to a given prefix. Once that question has been answered, any 562 number of PCE approaches can be used to select the right egress ASBR 563 and, more generally, the ASBR path across the backbone. 565 5.4. Making use of RSVP-TE LSPs Across the Backbone 567 There are a number of ways to carry traffic across the backbone from 568 one domain to another. RSVP-TE is a popular mechanism for 569 establishing tunnels across MPLS networks in similar scenarios (e.g., 570 L3VPN) because it allows for reservation of resources as well as 571 traffic steering. 573 A controller can cause an RSVP-TE LSP to be set up by talking to the 574 LSP head end using PCEP extensions as described in [RFC8281]. That 575 document specifies an "LSP Initiate" message (the PCInitiate message) 576 that the controller uses to specify the RSVP-TE LSP endpoints, the 577 explicit path, a "symbolic pathname", and other optional attributes 578 (specified in the PCEP specification [RFC5440]) such as bandwidth. 580 When the head end receives a PCInitiate message, it sets up the RSVP- 581 TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to the 582 controller in a PCRpt message [RFC8231]. The PCRpt message also 583 contains the symbolic name that the controller assigned to the LSP, 584 as well as containing some information identifying the LSP Initiate 585 message from the controller, and details of exactly how the LSP was 586 set up (RRO, bandwidth, etc.). 588 The head end can add a TE-PATH-BINDING TLV to the PCRpt message 589 [I-D.sivabalan-pce-binding-label-sid]. This allows the head end to 590 assign a "binding SID" to the LSP, and to report to the controller 591 that a particular binding SID corresponds to a particular LSP. The 592 binding SID is locally scoped to the head end.
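This exchange can be sketched in Python as follows. The sketch is
schematic only: the dictionaries stand in for PCEP messages, the
field names are simplified from [RFC8281], [RFC8231], and
[I-D.sivabalan-pce-binding-label-sid], and the endpoints and values
anticipate the example in Section 5.5.

   # Schematic of the controller / head-end exchange (not a real PCEP
   # stack; message fields simplified for illustration).

   def pc_initiate(endpoints, ero, symbolic_name, bandwidth=None):
       """Controller -> head end: request an RSVP-TE LSP (RFC 8281)."""
       return {"endpoints": endpoints, "ero": ero,
               "symbolic_name": symbolic_name, "bandwidth": bandwidth}

   def head_end_setup(msg, plsp_id, binding_sid):
       """Head end: signal the LSP, then report it back (RFC 8231),
       adding a TE-PATH-BINDING TLV that ties a binding SID to the LSP."""
       return {"plsp_id": plsp_id,
               "symbolic_name": msg["symbolic_name"],
               "rro": msg["ero"],               # path as actually set up
               "te_path_binding": binding_sid}  # locally scoped label

   init = pc_initiate(("PE2a", "ASBR2a"), ["hop1", "hop2"],
                      "lsp-pe2a-asbr2a")
   report = head_end_setup(init, plsp_id=7, binding_sid=1007)
   # The controller can now use label 1007 in stacks it programs at
   # sources sending traffic toward prefix X.
   assert report["te_path_binding"] == 1007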
594 The controller can make this label be part of the label stack that it 595 tells the source (or the GW at the source domain) to impose on the 596 data packets being sent to prefix X. When the head end receives a 597 packet with this label at the top of the stack it will send the 598 packet onward on the LSP. 600 5.5. Data Plane 602 Consolidating all of the above, consider what happens when we want to 603 move a data packet from Source1 to Destination in Figure 1 via the 604 following source route: 606 Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination 608 Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a and 609 an RSVP-TE LSP from ASBR3a to PE3a, both of which we want to use. 611 Let's suppose that Source1 pushes a label stack as instructed by 612 the controller (for example, using BGP-LU [RFC8277]). We won't worry 613 for now about source routing through the domains themselves: that is, 614 in practice there may be additional labels in the stack to cover the 615 source route from Source1 to GW1b and from GW2a to the Destination, 616 but we will focus only on the labels necessary to leave the source 617 domain, traverse the backbone, and enter the egress domain. So we 618 only care what the stack looks like when the packet gets to GW1b. 620 When the packet gets to GW1b, the stack should have six labels: 622 Top Label: 624 Peer-SID or adjacency-SID identifying the link or links to PE2a. 625 These SIDs are distributed from GW1b to the controller via the EPE 626 extensions of BGP-LS. This label will get popped by GW1b, which 627 will then send the packet to PE2a. 629 Second Label: 631 Binding SID advertised by PE2a to the controller for the RSVP-TE 632 LSP to ASBR2a. This binding SID is advertised via the PCEP 633 extensions discussed above. This label will get swapped by PE2a 634 for the label that the LSP's next hop has assigned to the LSP. 636 Third Label: 638 Peer-SID or adjacency-SID identifying the link or links to ASBR3a, 639 as advertised to the controller by ASBR2a using the BGP-LS EPE 640 extensions. This label gets popped by ASBR2a, which then sends 641 the packet to ASBR3a. 643 Fourth Label: 645 Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a. 646 This binding SID is advertised via the PCEP extensions discussed 647 above. ASBR3a treats this label just like PE2a treated the second 648 label above. 650 Fifth Label: 652 Peer-SID or adjacency-SID identifying the link or links to GW2a, 653 as advertised to the controller by PE3a using the BGP-LS EPE 654 extensions. PE3a pops this label and sends the packet to GW2a. 656 Sixth Label: 658 Prefix-SID or other label identifying the Destination advertised 659 in a Tunnel Encapsulation attribute by GW2a. This can be omitted 660 if GW2a is happy to accept IP packets, or prefers a VXLAN tunnel 661 for example. That would be indicated through the Tunnel 662 Encapsulation attribute of course.
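The resulting stack can be written compactly as follows. This is a
sketch: the names are symbolic stand-ins for the actual label values
learned through the mechanisms noted in the comments.

   # The six-label stack imposed for the path
   # Source1--GW1b--PE2a--ASBR2a--ASBR3a--PE3a--GW2a--Destination.
   stack = [
       "PEER_SID(GW1b->PE2a)",      # 1: EPE/BGP-LS, popped by GW1b
       "BINDING_SID(PE2a LSP)",     # 2: PCEP report, swapped onto the LSP
       "PEER_SID(ASBR2a->ASBR3a)",  # 3: EPE/BGP-LS, popped by ASBR2a
       "BINDING_SID(ASBR3a LSP)",   # 4: PCEP report, swapped onto the LSP
       "PEER_SID(PE3a->GW2a)",      # 5: EPE/BGP-LS, popped by PE3a
       "PREFIX_SID(Destination)",   # 6: Tunnel Encapsulation attribute
   ]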
664 Note that the size of the label stack is proportional to the number 665 of RSVP-TE LSPs that get stitched together by SR. 667 See Section 7 for some detailed examples that show the concrete use 668 of labels in a sample topology. 670 In the above example, all labels except the sixth are locally 671 significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs. Only 672 the sixth label, a prefix-SID, has a domain-wide unique value. To 673 impose that label, the source needs to know the SRGB of GW2a. If all 674 nodes have the same SRGB, this is not a problem. Otherwise, there 675 are a number of different ways GW2a can advertise its SRGB. This can 676 be done via the segment routing extensions of BGP-LS, or it can be 677 done using the prefix-SID attribute or BGP-LU [RFC8277], or it can be 678 done using the BGP Tunnel Encapsulation attribute. The technique to 679 be used will depend on the details of the deployment scenario. 681 The reason the above example is primarily based on locally 682 significant labels is that it creates a "strict source route", and it 683 presupposes the EPE extensions of BGP-LS. In some scenarios, the EPE 684 extension to BGP-LS might not be available (or BGP-LS might not be 685 available at all). In other scenarios, it may be desirable to steer 686 a packet through a "loose source route". In such scenarios, the 687 label stack imposed by the source will be based upon a sequence of 688 domain-wide unique "node-SIDs", each representing one of the hops of 689 the source route. Each label has to be computed by adding the 690 corresponding node-SID to the SRGB of the node that will act upon the 691 label. One way to learn the node-SIDs and SRGBs is to use the 692 segment routing extensions of BGP-LS. Another way is to use BGP-LU 693 as follows: 695 Each node that may be part of a source route originates a BGP-LU 696 route with one of its own loopback addresses as the prefix. The 697 BGP prefix-SID attribute is attached to this route. The prefix- 698 SID attribute contains a SID that is the domain-wide unique SID 699 corresponding to the node's loopback address. The attribute also 700 contains the node's SRGB. 702 While this technique is useful when BGP-LS is not available, there 703 needs to be some other means for the source controller to discover 704 the topology. In this document, we focus primarily on the scenario 705 where BGP-LS, rather than BGP-LU, is used.
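For the loose source route case, the label arithmetic can be sketched
as follows. The SRGB bases and SID indices are invented for
illustration.

   # Computing the label for each loose hop: node-SID index plus the
   # SRGB base of the node that will act on the label (invented values).

   srgb_base = {"PE1a": 16000, "PE1c": 20000, "GW2a": 24000}
   node_sid = {"PE1a": 11, "PE1c": 12, "GW2a": 13}

   def label_for(acting_node, target_node):
       """Label that 'acting_node' interprets as target_node's node-SID."""
       return srgb_base[acting_node] + node_sid[target_node]

   # A loose source route PE1a -> PE1c -> GW2a: each label must be
   # computed in the SRGB of the node that will act on it.
   stack = [label_for("PE1a", "PE1c"),   # 16012, acted on by PE1a
            label_for("PE1c", "GW2a")]   # 20013, acted on by PE1c
   assert stack == [16012, 20013]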
707 5.6. Centralized and Distributed Controllers 709 A controller or set of controllers is needed to collate topology and 710 TE information from the constituent networks, to apply policies and 711 service requirements to compute paths across those networks, to 712 select an end-to-end path, and to program key nodes in the network to 713 take the right forwarding actions (pushing label stacks, stitching 714 LSPs, forwarding traffic). 716 o It is commonly understood that a fully optimal end-to-end path can 717 only be computed with full knowledge of the end-to-end topology 718 and available Traffic Engineering resources. Thus, one option is 719 for all information about the domain networks and backbone network 720 to be collected by a central controller that makes all path 721 computations and is responsible for issuing the necessary 722 programming commands. Such a model works best when there is no 723 commercial or administrative impediment (for example, where the 724 domains and the backbone network are owned and operated by the 725 same organization). There may, however, be some scaling concerns 726 if the component networks are large. 728 In this mode of operation, each network may use BGP-LS to export 729 Traffic Engineering and topology information to the central 730 controller, and the controller may use PCEP to program the network 731 behavior. 733 o A similar centralized control mechanism can be used with a 734 scalability improvement that risks a reduction in optimality. In 735 this case, the domain networks can export to the controller just 736 the feasibility of connectivity between data source/sink and 737 gateway, perhaps enhancing this with some information about the 738 Traffic Engineering metrics of the potential paths. 740 This approach allows the central controller to understand the end- 741 to-end path that it is selecting, but not to control it fully. 742 The source route from data source to domain egress gateway is left 743 to the source host or a controller in the source domain, while the 744 source route from domain ingress gateway to destination is left as 745 a decision for the domain ingress gateway or to a controller in 746 the destination domain. In both cases the traffic may simply be left 747 to follow the IGP shortest path. 749 This mode of operation still leaves overall control with a 750 centralized server, and that may not be considered suitable when 751 there is separate commercial or administrative control of the 752 networks. 754 o When there is separate commercial or administrative control of the 755 networks, the domain operator will not want the backbone operator 756 to have control of the paths within the domains and may be 757 reluctant to disclose any information about the topology or 758 resource availability within the domains. Conversely, the 759 backbone operator may be very unwilling to allow the domain 760 operator (a customer) any control over or knowledge about the 761 backbone network. 763 This "problem" has already been solved for Traffic Engineering in 764 MPLS networks that span multiple administrative domains, and this leads 765 to several potential solutions: 767 * Per-domain path computation [RFC5152] can be seen as "best 768 effort optimization". In this mode the controller for each 769 domain is responsible for finding the best path to the next 770 domain, but has no way of knowing which is the best exit point 771 from the local domain. The resulting path may end up 772 significantly sub-optimal or even blocked. 774 * Backward recursive path computation (BRPC) [RFC5441] is a 775 mechanism that allows controllers to cooperate across a small 776 set of domains (such as ASes) to build a tree of possible paths 777 and so allow the controller for the ingress domain to select 778 the optimal path. The details of the paths within each domain 779 that might reveal confidential information can be hidden using 780 Path Keys [RFC5520]. BRPC produces optimal paths, but scales 781 poorly with an increase in domains and with an increase in 782 connectivity between domains. It can also lead to slow 783 computation times. 785 * Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation 786 process between PCEs. The child PCEs remain responsible for 787 computing paths across their domains, and they coordinate with 788 a parent PCE that stitches these paths together to form the 789 end-to-end path. This approach has many similarities with BRPC 790 but can scale better through the maintenance of a "domain 791 topology" map that shows how the domains are interconnected, and 792 through the ability to pipeline computation requests to all of 793 the child domains. It has the drawback that some party has to 794 own and operate the parent PCE. 796 * An alternative approach is documented by the TEAS working group 797 [RFC7926]. In this model each network advertises to 798 controllers for adjacent networks (using BGP-LS) selected 799 information about potential connectivity across the network.
800 It does not have to show full topology and can make its own 801 decisions about which paths it considers optimal for use by its 802 different neighbors and customers. This approach is suitable 803 for the End-to-End Domain Interconnect Traffic Steering problem 804 where the backbone is under different control from the domains 805 because it allows the overlay nature of the use of the backbone 806 network to be treated as a peer network relationship by the 807 controllers of the domains - the domains can be operated using 808 a single controller or a separate controller for each domain. 810 It is also possible to operate domain interconnection when some or 811 all domains do not have a controller. Segment Routing is capable of 812 routing a packet toward the next hop based on the top label on the 813 stack, and that label does not need to indicate an immediately 814 adjacent node or link. In these cases, the packet may be forwarded 815 untouched, or the forwarding router may impose a locally-determined 816 additional set of labels that define the path to the next hop. 818 PCE can be used to instruct the source host or a transit node about 819 what label stacks to add to packets. That is, a node that needs to 820 impose labels (either to start routing the packet from the source 821 host, or to advance the packet from a transit router toward the 822 destination) can determine the label stack to use based on local 823 function or can have that stack supplied by a PCE. The PCE 824 Communication Protocol (PCEP) has been extended to allow the PCE to 825 supply a label stack for reaching a specific destination either in 826 response to a request or in an unsolicited manner 827 [I-D.ietf-pce-segment-routing]. 829 6. BGP-LS Considerations 831 This section gives an overview of the use of BGP-LS to export an 832 abstraction (or summary) of the connectivity across the backbone 833 network by means of two figures that show different views of a sample 834 network. 836 Figure 2 shows a more complex reference architecture. 838 Figure 3 represents the minimum set of nodes and links that need to 839 be advertised in BGP-LS with SR in order to perform Domain 840 Interconnect with traffic engineering across the backbone network: 841 the PEs, ASBRs, and GWs, and the links between them. In particular, 842 EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with 843 associated segment IDs is advertised in BGP-LS with SR. 845 Links that are advertised may be physical links, links realized by 846 LSP tunnels or SR paths, or abstract links. It is assumed that 847 intra-AS links are either real links, RSVP-TE LSPs with allocated 848 bandwidth, or SR TE policies as described in 849 [I-D.ietf-idr-segment-routing-te-policy]. Additional nodes internal 850 to an AS and their links to PEs, ASBRs, and/or GWs may also be 851 advertised (for example, to avoid full mesh problems). 853 Note that Figure 3 does not show full interconnectivity. For 854 example, there is no possibility of connectivity between PE1a and 855 PE1c (because there is no RSVP-TE LSP established across AS1 between 856 these two nodes) and so no link is presented in the topology view. 857 [RFC7926] contains further discussion of topological abstractions 858 that may be useful in understanding this distinction. 
860 ------------------------------------------------------------------- 861 | | 862 | AS1 | 863 | ---- ---- ---- ---- | 864 -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|- 865 ---- ---- ---- ---- 866 : : ------------ ------------ : : : 867 : : | AS2 | | AS3 | : : : 868 : : | ------.....------ | : : : 869 : : | |ASBR2a| |ASBR3a| | : : : 870 : : | ------ ..:------ | : : : 871 : : | | ..: | | : : : 872 : : | ------: ------ | : : : 873 : : | |ASBR2b|...|ASBR3b| | : : : 874 : : | ------ ------ | : : : 875 : : | | | | : : : 876 : : | | ------ | : : : 877 : : | | ..|ASBR3c| | : : : 878 : : | | : ------ | : ....: : 879 : ......: | ---- | : | ---- | : : : 880 : : -|PE2a|----- : -----|PE3b|- : : : 881 : : ---- : ---- : : : 882 : : .......: : :....... : : : 883 : : : ------ : : : : 884 : : : ----|ASBR4b|---- : : : : 885 : : : | ------ | : : : : 886 : : : ---- | : : : : 887 : : : .........|PE4b| AS4 | : : : : 888 : : : : ---- | : : : : 889 : : : : | ---- | : : : : 890 : : : : -----|PE4a|----- : : : : 891 : : : : ---- : : : : 892 : : : : ..: :.. : : : : 893 : : : : : : : : : : 894 ---- ---- ---- ---- ----: ---- 895 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 896 | ---- ---- | | ---- ---- | | ---- ---- | 897 | | | | | | 898 | | | | | | 899 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 900 | | | | | | 901 | | | | | | 902 | Domain1 | | Domain2 | | Domain3 | 903 ---------------- ---------------- ---------------- 905 Figure 2: Network View of Example Configuration 907 ............................................................. 908 : : 909 ---- ---- ---- ---- 910 |PE1a| |PE1b|.....................................|PE1c| |PE1d| 911 ---- ---- ---- ---- 912 : : : : : 913 : : ------.....------ : : : 914 : : ......|ASBR2a| |ASBR3a|...... : : : 915 : : : ------ ..:------ : : : : 916 : : : : : : : : 917 : : : ------..: ------ : : : : 918 : : : ...|ASBR2b|...|ASBR3b| : : : : 919 : : : : ------ ------ : : : : 920 : : : : : : : : : 921 : : : : ------ : : : : 922 : : : : ..|ASBR3c|... : : : : 923 : : : : : ------ : : : ....: : 924 : ......: ---- : ---- : : : 925 : : |PE2a| : |PE3b| : : : 926 : : ---- : ---- : : : 927 : : .......: : :....... : : : 928 : : : ------ : : : : 929 : : : |ASBR4b| : : : : 930 : : : ------ : : : : 931 : : : ----.....: : : : : : 932 : : : .........|PE4b|..... : : : : : 933 : : : : ---- : : : : : : 934 : : : : ---- : : : : 935 : : : : |PE4a| : : : : 936 : : : : ---- : : : : 937 : : : : ..: :.. : : : : 938 : : : : : : : : : : 939 ---- ---- ---- ---- ----: ---- 940 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 941 | ---- ---- | | ---- ---- | | ---- ---- | 942 | | | | | | 943 | | | | | | 944 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 945 | | | | | | 946 | | | | | | 947 | Domain1 | | Domain2 | | Domain3 | 948 ---------------- ---------------- ---------------- 950 Figure 3: Topology View of Example Configuration 952 A node (a PCE, router, or host) that is computing a full or partial 953 path correlates the topology information disseminated in BGP-LS with 954 the information advertised in BGP (with the Tunnel Encapsulation 955 attributes) and uses this to compute that path and obtain the SIDs 956 for the elements on that path. In order to allow a source host to 957 compute exit points from its domain, some subset of the above 958 information needs to be disseminated within that domain. 960 What is advertised external to a given AS is controlled by policy at 961 the ASes' PEs, ASBRs, and GWs. 
Central control of what each node 962 should advertise, based upon analysis of the network as a whole, is 963 an important additional function. This and the amount of policy 964 involved may make the use of a Route Reflector an attractive option. 966 Local configuration at each node determines which links to other 967 nodes are advertised in BGP-LS, and determines which characteristics 968 of those links are advertised. Pairwise coordination between link 969 end-points is required to ensure consistency. 971 Path Weighted ECMP (PWECMP) is a mechanism to load-balance traffic 972 across parallel equal cost links or paths. In this approach an 973 ingress node distributes the flows from it to a given egress node 974 across the equal cost paths to the egress node in proportion to the 975 lowest bandwidth link on each path. PWECMP can be used by a GW for a 976 given source domain to send all flows to a given destination domain 977 using all paths in the backbone network to that destination domain in 978 proportion to the minimum bandwidth on each path. PWECMP may also be 979 used by hosts within a source domain to send flows to that domain's 980 GWs. 982 7. Worked Examples 984 Figure 4 shows a view of the links, paths, and labels that can be 985 assigned to part of the sample network shown in Figure 2 and 986 Figure 3. The double-dash lines (===) indicate LSP tunnels across 987 backbone ASes and dotted lines (...) are physical links. 989 A label may be assigned to each outgoing link at each node. This is 990 shown in Figure 4. For example, at GW1a the label L201 is assigned 991 to the link connecting GW1a to PE1a. At PE1c, the label L302 is 992 assigned to the link connecting PE1c to GW3b. Labels ("binding 993 SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, 994 label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c. 996 At the destination domain, label L305 is a "node-SID"; it represents 997 Host3b, rather than representing a particular link. 999 When a node processes a packet, the label at the top of the label 1000 stack indicates the link (or RSVP-TE LSP) on which that node is to 1001 transmit the packet. The node pops that label off the label stack 1002 before transmitting the packet on the link. However, if the top 1003 label is a node-SID, the node processing the packet is expected to 1004 transmit the packet on whatever link it regards as the shortest path 1005 to the node represented by the label. 1007 ---- L202 ---- 1008 | |===================================================| | 1009 |PE1a| |PE1c| 1010 | |===================================================| | 1011 ---- L203 ---- 1012 : L304: :L302 1013 : : : 1014 : ---- L205 ---- : : 1015 : |PE1b|========================================|PE1d| : : 1016 : ---- ---- : : 1017 : : L303: : : 1018 : : ---- : : : 1019 : : ---- L207 |ASBR|L209 ---- : : : 1020 : : | |======| 2a |......| | : : : 1021 : : | | ---- | |L210 ---- : : : 1022 : : |PE2a| |ASBR|======|PE3b| : : : 1023 : : | |L208 ---- L211 | 3a | ---- : : : 1024 : : | |======|ASBR|......| | L301: : : : 1025 : : ---- | 2b | ---- ...: : : : 1026 : : : ---- : : : : 1027 : ....: : : .......: : : 1028 : : : : : : : 1029 : : : : : .........: : 1030 : : : : : : : 1031 : : ....: : : : ....: 1032 L201: :L204 :L206 : : : : 1033 ---- ---- ----- ---- 1034 -|GW1a|--|GW1b|- -|GW3a |--|GW3b|- 1035 | ---- ---- | | ----- ---- | 1036 | : : | | L303: :L304| 1037 | : : | | : : | 1038 |L103: :L102| | : : | 1039 | N1 N2 | | N3 N4 | 1040 | :.. 
..: | | : ....: | 1041 | : : | | : : | 1042 | L101: : | | : : | 1043 | Host1a | | Host3b (L305) | 1044 | | | | 1045 | Domain1 | | Domain3 | 1046 ---------------- ----------------- 1048 Figure 4: Tunnels and Labels in Example Configuration 1050 Note that label spaces can overlap so that, for example, the figure 1051 shows two instances of L303 and L304. This is acceptable because of 1052 the separation between the SR domains, and because SIDs applied to 1053 outgoing interfaces are locally scoped. 1055 Let's consider several different possible ways to direct a packet 1056 from Host1a in Domain1 to Host3b in Domain3. 1058 a. Full source route imposed at source 1060 In this case it is assumed that the entity responsible for 1061 determining an end-to-end path has access to the topologies of 1062 both the source and destination domains as well as of the 1063 backbone network. This might happen if all of the networks 1064 are owned by the same operator, in which case the information 1065 can be collected into a single database for use by an offline 1066 tool, or the information can be distributed using routing 1067 protocols such that the source host can see enough to select 1068 the path. Alternatively, the end-to-end path could be 1069 produced through cooperation between computation entities each 1070 responsible for different domains along the path. 1072 If the path is computed externally, it is pushed to the source 1073 host. Otherwise, it is computed by the source host itself. 1075 Suppose it is desired for a packet from Host1a to travel to 1076 Host3b via the following source route: 1078 Host1a->N1->GW1a->PE1a->(RSVP-TE 1079 LSP)->PE1c->GW3b->N4->Host3b 1081 Host1a imposes the following label stack (with the first label 1082 representing the top of stack), and then sends the packet to 1083 N1: 1085 L103, L201, L202, L302, L304, L305 1087 N1 sees L103 at the top of the stack, so it pops the stack and 1088 forwards the packet to GW1a. GW1a sees L201 at the top of the 1089 stack, so it pops the stack and forwards the packet to PE1a. 1090 PE1a sees L202 at the top of the stack, so it pops the stack 1091 and forwards the packet over the RSVP-TE LSP to PE1c. As the 1092 packet travels over this LSP, its top label is an RSVP-TE 1093 signaled label representing the LSP. That is, PE1a imposes an 1094 additional label stack entry for the tunnel LSP. 1096 At the end of the LSP tunnel, the MPLS tunnel label is popped, 1097 and PE1c sees L302 at the top of the stack. PE1c pops the 1098 stack and forwards the packet to GW3b. GW3b sees L304 at the 1099 top of the stack, so it pops the stack and forwards the packet 1100 to N4. Finally, N4 sees L305 at the top of the stack, so it 1101 pops the stack and forwards the packet to Host3b. A sketch that simulates this processing appears after this list. 1103 b. It is possible that the source domain does not have visibility 1104 into the destination domain. 1106 This occurs if the destination domain does not export its 1107 topology, but does export basic reachability information so 1108 that the source host or the path computation entity will know: 1110 + The GWs through which the destination can be reached. 1112 + The SID to use for the destination prefix. 1114 Suppose we want a packet to follow the source route: 1116 Host1a->N1->GW1a->PE1a->(RSVP-TE 1117 LSP)->PE1c->GW3b->...->Host3b 1119 The ellipsis indicates a part of the path that is not 1120 explicitly specified. Thus, the label stack imposed at the 1121 source host is: 1123 L103, L201, L202, L302, L305 1125 Processing is as per case a., but when the packet reaches the 1126 GW of the destination domain (GW3b) it can either simply 1127 forward the packet along the shortest path to Host3b, or it 1128 can insert additional labels to direct the path to the 1129 destination. 1131 c. Domain1 only has reachability information for the backbone and 1132 destination networks 1134 The source domain (or the path computation entity) may be 1135 further restricted in its view of the network. It is possible 1136 that it knows the location of the destination in the 1137 destination domain, and knows the GWs to the destination 1138 domain that provide reachability to the destination, but that 1139 it has no view of the backbone network. This leads to the 1140 packet being forwarded in a manner similar to 'per-domain path 1141 computation' described in Section 5.6. 1143 At the source host, a simple label stack is imposed that navigates 1144 the source domain and indicates the destination GW and the 1145 destination host. 1147 L103, L302, L305 1149 As the packet leaves the source domain, the source GW (GW1a) 1150 determines the PE to use to enter the backbone using nothing 1151 more than the BGP preferred route to the destination GW (it 1152 could be PE1a or PE1b). 1154 When the packet reaches the first PE it has a label stack just 1155 identifying the destination GW and the host (L302, L305). The 1156 PE uses information it has about the backbone network topology 1157 and available LSPs to select an LSP tunnel, impose the tunnel 1158 label, and forward the packet. 1160 When the packet reaches the end of the LSP tunnel, it is 1161 processed as described in case b. 1163 d. Stitched LSPs across the backbone 1165 A variant of all these cases arises when the packet is sent 1166 using a path that spans multiple ASes: for example, one that 1167 crosses AS2 and AS3 as shown in Figure 2. 1169 In this case, basing the example on case a., the source host 1170 imposes the label stack: 1172 L102, L206, L207, L209, L210, L301, L303, L305 1174 It then sends the packet to N2. 1176 When the packet reaches PE2a, as previously described, the top 1177 label (L207) indicates an LSP tunnel that leads to ASBR2a. At 1178 the end of that LSP tunnel the next label (L209) routes the 1179 packet from ASBR2a to ASBR3a, where the next label (L210) 1180 identifies the next LSP tunnel to use. Thus, SR has been used 1181 to stitch together LSPs to make a longer path segment. As the 1182 packet emerges from the final LSP tunnel, forwarding continues 1183 as previously described.
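The processing in case a. can be simulated with the following Python
sketch. It is a toy model: it uses the label names from Figure 4 and
elides the RSVP-TE tunnel label that PE1a pushes and that is popped
at the LSP egress.

   # Toy simulation of case a. using the labels of Figure 4.  A plain
   # string entry means "pop and forward on that link"; a ("lsp", node)
   # entry stands for an RSVP-TE LSP tunnel whose signaled label is
   # pushed at the head end and popped before the far end (elided).
   TABLES = {
       "N1":   {"L103": "GW1a"},
       "GW1a": {"L201": "PE1a"},
       "PE1a": {"L202": ("lsp", "PE1c")},  # binding to the RSVP-TE LSP
       "PE1c": {"L302": "GW3b"},
       "GW3b": {"L304": "N4"},
       "N4":   {"L305": "Host3b"},         # node-SID: shortest-path hop
   }

   def deliver(node, stack):
       """Pop one label per hop until the stack is empty."""
       while stack:
           entry = TABLES[node][stack.pop(0)]
           node = entry[1] if isinstance(entry, tuple) else entry
       return node

   assert deliver("N1", ["L103", "L201", "L202",
                         "L302", "L304", "L305"]) == "Host3b"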
1185 8. Label Stack Depth Considerations 1187 As described in Section 3.1, one of the issues with a Segment Routing 1188 approach is that the label stack can get large, for example when the 1189 source route becomes long. A mechanism to mitigate this problem is 1190 needed if the solution is to be fully applicable in all environments. 1192 [I-D.ietf-idr-segment-routing-te-policy] introduces the concept of 1193 hierarchical source routes as a way to compress source route headers. 1194 It functions by having the egress node for a set of source routes 1195 advertise those source routes, along with an explicit request that 1196 each node that is an ingress node for one or more of those source 1197 routes should advertise a binding SID for the set of source routes 1198 for which it is the ingress.
   Such an ingress node advertises its set of source routes and a
   binding SID as an adjacency in BGP-LS as described in Section 6.
   These source routes represent the weighted ECMP paths between the
   ingress node and the egress node.  Note also that the binding SID
   may be supplied by the node that advertises the source routes (the
   egress or the controller), or may be chosen by the ingress.

   A remote node that wishes to reach the egress node constructs a
   source route consisting of the segment IDs necessary to reach one of
   the ingress nodes for the path it wishes to use, along with the
   binding SID that the ingress node advertised to identify the set of
   paths.  When the selected ingress node receives a packet with a
   binding SID it has advertised, it replaces the binding SID with the
   labels for one of its source routes to the egress node (it will
   choose one of the source routes in the set according to its own
   weighting algorithms and policy).

8.1.  Worked Example

   Consider the topology in Figure 4.  Suppose that it is desired to
   construct full segment routed paths from ingress to egress, but that
   the resulting label stack (segment route) is too large.  In this
   case the gateways to Domain3 (GW3a and GW3b) can advertise all of
   the source routes from the gateways to Domain1 (GW1a and GW1b).  The
   gateways to Domain1 then assign binding SIDs to those source routes
   and advertise those SIDs into BGP-LS.

   Thus, GW3b advertises the two source routes (L201, L202, L302 and
   L201, L203, L302), and GW1a advertises into BGP-LS its adjacency to
   GW3b along with a binding SID.  Should Host1a wish to send a packet
   via GW1a and GW3b, it can include L103 and this binding SID in the
   source route.  GW1a is free to choose which source route to use
   between itself and GW3b using its weighted ECMP algorithm.

   Similarly, GW3a can advertise the following set of source routes:

   o  L201, L202, L304

   o  L201, L203, L304

   o  L204, L205, L303

   o  L206, L207, L209, L210, L301

   o  L206, L208, L211, L210, L301

   GW1a advertises a binding SID for the first three, and GW1b
   advertises a binding SID for the other two.

9.  Gateway Considerations

   As described in Section 5.2, [I-D.ietf-bess-datacenter-gateway]
   defines a new tunnel type, "SR tunnel".  When the GWs to a given
   domain advertise a route to a prefix X within the domain, they each
   include a Tunnel Encapsulation attribute with multiple tunnel
   instances, each of type "SR tunnel": one for each GW, each
   containing a Remote Endpoint sub-TLV with that GW's address.

   In other words, each route advertised by any GW identifies all of
   the GWs to the same domain.

   Therefore, even if only one of the routes is distributed to other
   ASes, it will not matter how many times the next hop changes, as the
   Tunnel Encapsulation attribute (and its Remote Endpoint sub-TLVs)
   will remain unchanged.
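   The following Python sketch shows, in abstract form, how any one
   GW's advertisement of a prefix carries an SR tunnel instance for
   every active GW to the same domain.  The dictionary field names and
   the addresses are informal assumptions and do not represent the
   encoding defined in [I-D.ietf-idr-tunnel-encaps].

      # Illustrative only: loopbacks of the currently active GWs.
      ACTIVE_GWS = ["192.0.2.1", "192.0.2.2"]

      def build_route(prefix, gws):
          """One advertised route carrying an SR tunnel instance
          (with a Remote Endpoint sub-TLV) per active GW."""
          return {
              "nlri": prefix,
              "tunnel_encap": [
                  {"tunnel_type": "SR tunnel", "remote_endpoint": gw}
                  for gw in gws
              ],
          }

      # A receiver learns all GWs to the domain from any one route,
      # regardless of which GW advertised it or of next-hop changes.
      route = build_route("198.51.100.0/24", ACTIVE_GWS)
      for instance in route["tunnel_encap"]:
          print(instance)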
9.1.  Domain Gateway Auto-Discovery

   To allow a given domain's GWs to auto-discover each other and to
   coordinate their operations, the following procedures are
   implemented as described in [I-D.ietf-bess-datacenter-gateway]:

   o  Each GW is configured with an identifier of the domain that is
      common across all GWs to the domain and unique across all domains
      that are connected.

   o  A route target [RFC4360] is attached to each GW's auto-discovery
      route and has its value set to the domain identifier.

   o  Each GW constructs an import filtering rule to import any route
      that carries a route target with the same domain identifier that
      the GW itself uses.  This means that only these GWs will import
      those routes, and that all GWs to the same domain will import
      each other's routes and will learn (auto-discover) the current
      set of active GWs for the domain.

   o  The auto-discovery route each GW advertises consists of the
      following:

      *  An IPv4 or IPv6 NLRI containing one of the GW's loopback
         addresses (that is, with an AFI/SAFI of 1/1, 2/1, 1/4, or
         2/4).

      *  A Tunnel Encapsulation attribute containing the GW's
         encapsulation information, which at a minimum consists of an
         SR tunnel TLV with a Remote Endpoint sub-TLV
         [I-D.ietf-idr-tunnel-encaps].

   To avoid the side effect of applying the Tunnel Encapsulation
   attribute to any packet that is addressed to the GW, the GW should
   use a different loopback address in the advertisement from that used
   to reach the GW itself.

   Each GW will include a Tunnel Encapsulation attribute for each GW
   that is active for the domain (including itself), and will include
   these in every route it advertises to peers outside the domain.  As
   the current set of active GWs changes (due to the addition of a new
   GW or the failure/removal of an existing GW), each externally
   advertised route will be re-advertised with the set of SR tunnel
   instances reflecting the current set of active GWs.
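   A minimal sketch of this route target based import filtering is
   shown below.  The domain identifier and the addresses are
   illustrative assumptions; a real implementation would operate on BGP
   extended communities [RFC4360] rather than strings.

      MY_DOMAIN_ID = "DOMAIN-3"   # assumed identifier for Domain3

      # Auto-discovery routes received from BGP peers.
      received = [
          {"nlri": "192.0.2.1/32", "route_target": "DOMAIN-3"},
          {"nlri": "192.0.2.2/32", "route_target": "DOMAIN-3"},
          {"nlri": "203.0.113.1/32", "route_target": "DOMAIN-1"},
      ]

      # Import filter: accept only routes that carry a route target
      # matching our own domain identifier.
      active_gws = [r["nlri"] for r in received
                    if r["route_target"] == MY_DOMAIN_ID]

      # The GW has now auto-discovered its peer GWs for the domain.
      print(active_gws)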
9.2.  Relationship to BGP Link State and Egress Peer Engineering

   When a remote GW receives a route to a prefix X, it can use the SR
   tunnel instances within the contained Tunnel Encapsulation attribute
   to identify the GWs through which X can be reached.  It uses this
   information to compute SR TE paths across the backbone network,
   looking at the information advertised to it in SR BGP Link State
   (BGP-LS) [I-D.ietf-idr-bgp-ls-segment-routing-ext] and correlated
   using the domain identity.  SR Egress Peer Engineering (EPE)
   [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement
   the information advertised in BGP-LS.

9.3.  Advertising a Domain Route Externally

   When a packet destined for prefix X is sent on an SR TE path to a GW
   for the domain containing X, it needs to carry the receiving GW's
   label for X such that this label rises to the top of the stack
   before the GW completes its processing of the packet.  To achieve
   this, we place a prefix-SID sub-TLV for X in each SR tunnel instance
   in the Tunnel Encapsulation attribute in the externally advertised
   route for X.

   Alternatively, if the GWs for a given domain are configured to allow
   remote GWs to perform SR TE through that domain for prefix X, then
   each GW computes an SR TE path through that domain to X from each of
   the currently active GWs and places each in an MPLS label stack
   sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for
   that GW.
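   Continuing the abstract representation used earlier, the sketch
   below shows how each SR tunnel instance for prefix X might carry
   either a prefix-SID sub-TLV or, where SR TE through the domain is
   permitted, an MPLS label stack sub-TLV.  Field names and label
   values are assumptions, not the encoding defined in
   [I-D.ietf-idr-tunnel-encaps].

      def sr_tunnel_instance(gw, prefix_sid=None, label_stack=None):
          """One SR tunnel instance for a GW, carrying one of the two
          optional sub-TLVs."""
          instance = {"tunnel_type": "SR tunnel",
                      "remote_endpoint": gw}
          if prefix_sid is not None:
              instance["prefix_sid"] = prefix_sid
          if label_stack is not None:
              instance["mpls_label_stack"] = label_stack
          return instance

      # Default behavior: each instance carries the GW's prefix-SID
      # for X so that the right label surfaces at the receiving GW.
      plain = [sr_tunnel_instance("192.0.2.1", prefix_sid="L303"),
               sr_tunnel_instance("192.0.2.2", prefix_sid="L304")]

      # If SR TE through the domain is allowed, each instance instead
      # carries a computed label stack from that GW to X.
      sr_te = [sr_tunnel_instance("192.0.2.1",
                                  label_stack=["L303", "L305"]),
               sr_tunnel_instance("192.0.2.2",
                                  label_stack=["L304", "L305"])]

      print(plain)
      print(sr_te)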
9.4.  Encapsulations

   If the GWs for a given domain are configured to allow remote GWs to
   send them packets in that domain's native encapsulation, then each
   GW will also include multiple instances of a tunnel TLV for that
   native encapsulation in the externally advertised routes: one for
   each GW, each containing a Remote Endpoint sub-TLV with that GW's
   address.  A remote GW may then encapsulate a packet according to the
   rules defined via the sub-TLVs included in each of the tunnel TLV
   instances.

10.  Security Considerations

   There are several security domains and associated threats in this
   architecture.  SR is itself a data transmission encapsulation that
   provides no additional security, so security in this architecture
   relies on higher-layer mechanisms (for example, end-to-end
   encryption of payload data), on the security of the protocols used
   to establish connectivity and distribute network information, and on
   access control so that control plane and data plane packets are not
   admitted to the network from outside.

   This architecture utilizes a number of control plane protocols
   within domains, within the backbone, and north-south between
   controllers and domains.  Only minor modifications are made to BGP,
   as described in [I-D.ietf-bess-datacenter-gateway]; otherwise, this
   architecture uses existing protocols and extensions, so no new
   security risks are introduced.

   Special care should, however, be taken when routing protocols export
   or import information from or to domains that might have a security
   model based on secure boundaries and internal mutual trust.  This is
   notable when:

   o  BGP-LS is used to export topology information from within a
      domain to a controller that is sited outside the domain.

   o  A southbound protocol such as BGP-LU or NETCONF is used to
      install state in the network from a controller that may be sited
      outside the domain.

   In these cases, protocol security mechanisms should be used to
   protect the information in transit entering or leaving the domain,
   and to authenticate the out-of-domain nodes (the controllers) to
   ensure that confidential/private information is not lost and that
   data or configuration is not falsified.

11.  Management Considerations

   Configuration elements for the approaches described in this document
   are minor but crucial.

   Each GW to a domain is configured with the same identifier of the
   domain, and that identifier is unique across all domains that are
   connected.  This requires some coordination both within a domain and
   between cooperating domains.  There are no requirements for how this
   configuration and coordination is achieved, but it is assumed that
   management systems are involved.

   Policy determines what topology information is shared by a BGP-LS
   speaker (see Section 6).  This applies both to the advertisement of
   interdomain links and their characteristics, and to the
   advertisement of summarized domain topology or connectivity.  This
   policy is a local (i.e., domain-scoped) configuration dependent on
   the objectives and business imperatives of the domain operator.

   Domain boundaries are usually configured to limit the control and
   interaction from other domains (for example, to not allow end-to-end
   TE paths to be set up across domain boundaries).  As noted in
   Section 9.3, the GWs for a given domain can be configured to allow
   remote GWs to perform SR TE through that domain for a given prefix,
   a set of prefixes, or all reachable prefixes.

   Similarly (as described in Section 9.4), the GWs for a given domain
   can be configured to allow remote GWs to send them packets in that
   domain's native encapsulation.

12.  IANA Considerations

   This document makes no requests for IANA action.

13.  Acknowledgements

   Thanks to Jeffery Zhang for his careful review.

14.  Informative References

   [I-D.ietf-bess-datacenter-gateway]
              Drake, J., Farrel, A., Rosen, E., Patel, K., and L.
              Jalil, "Gateway Auto-Discovery and Route Advertisement
              for Segment Routing Enabled Domain Interconnection",
              draft-ietf-bess-datacenter-gateway-01 (work in progress),
              May 2018.

   [I-D.ietf-idr-bgp-ls-segment-routing-ext]
              Previdi, S., Talaulikar, K., Filsfils, C., Gredler, H.,
              and M. Chen, "BGP Link-State extensions for Segment
              Routing", draft-ietf-idr-bgp-ls-segment-routing-ext-09
              (work in progress), October 2018.

   [I-D.ietf-idr-bgp-prefix-sid]
              Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A.,
              and H. Gredler, "Segment Routing Prefix SID extensions
              for BGP", draft-ietf-idr-bgp-prefix-sid-27 (work in
              progress), June 2018.

   [I-D.ietf-idr-bgpls-segment-routing-epe]
              Previdi, S., Filsfils, C., Patel, K., Ray, S., and J.
              Dong, "BGP-LS extensions for Segment Routing BGP Egress
              Peer Engineering", draft-ietf-idr-bgpls-segment-routing-
              epe-15 (work in progress), March 2018.

   [I-D.ietf-idr-segment-routing-te-policy]
              Previdi, S., Filsfils, C., Jain, D., Mattes, P., Rosen,
              E., and S. Lin, "Advertising Segment Routing Policies in
              BGP", draft-ietf-idr-segment-routing-te-policy-04 (work
              in progress), July 2018.

   [I-D.ietf-idr-tunnel-encaps]
              Rosen, E., Patel, K., and G. Velde, "The BGP Tunnel
              Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-10
              (work in progress), August 2018.

   [I-D.ietf-isis-segment-routing-extensions]
              Previdi, S., Ginsberg, L., Filsfils, C., Bashandy, A.,
              Gredler, H., Litkowski, S., Decraene, B., and J.
              Tantsura, "IS-IS Extensions for Segment Routing", draft-
              ietf-isis-segment-routing-extensions-19 (work in
              progress), July 2018.

   [I-D.ietf-ospf-segment-routing-extensions]
              Psenak, P., Previdi, S., Filsfils, C., Gredler, H.,
              Shakir, R., Henderickx, W., and J. Tantsura, "OSPF
              Extensions for Segment Routing", draft-ietf-ospf-segment-
              routing-extensions-25 (work in progress), April 2018.

   [I-D.ietf-pce-segment-routing]
              Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx,
              W., and J. Hardwick, "PCEP Extensions for Segment
              Routing", draft-ietf-pce-segment-routing-13 (work in
              progress), October 2018.

   [I-D.ietf-spring-segment-routing-mpls]
              Bashandy, A., Filsfils, C., Previdi, S., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing with MPLS
              data plane", draft-ietf-spring-segment-routing-mpls-14
              (work in progress), June 2018.
   [I-D.sivabalan-pce-binding-label-sid]
              Sivabalan, S., Tantsura, J., Filsfils, C., Previdi, S.,
              Hardwick, J., and D. Dhody, "Carrying Binding Label/
              Segment-ID in PCE-based Networks.", draft-sivabalan-pce-
              binding-label-sid-04 (work in progress), March 2018.

   [RFC4360]  Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
              Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
              February 2006, <https://www.rfc-editor.org/info/rfc4360>.

   [RFC5152]  Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A
              Per-Domain Path Computation Method for Establishing
              Inter-Domain Traffic Engineering (TE) Label Switched
              Paths (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February
              2008, <https://www.rfc-editor.org/info/rfc5152>.

   [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
              Element (PCE) Communication Protocol (PCEP)", RFC 5440,
              DOI 10.17487/RFC5440, March 2009,
              <https://www.rfc-editor.org/info/rfc5440>.

   [RFC5441]  Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux,
              "A Backward-Recursive PCE-Based Computation (BRPC)
              Procedure to Compute Shortest Constrained Inter-Domain
              Traffic Engineering Label Switched Paths", RFC 5441,
              DOI 10.17487/RFC5441, April 2009,
              <https://www.rfc-editor.org/info/rfc5441>.

   [RFC5520]  Bradford, R., Ed., Vasseur, JP., and A. Farrel,
              "Preserving Topology Confidentiality in Inter-Domain Path
              Computation Using a Path-Key-Based Mechanism", RFC 5520,
              DOI 10.17487/RFC5520, April 2009,
              <https://www.rfc-editor.org/info/rfc5520>.

   [RFC6805]  King, D., Ed. and A. Farrel, Ed., "The Application of the
              Path Computation Element Architecture to the
              Determination of a Sequence of Domains in MPLS and
              GMPLS", RFC 6805, DOI 10.17487/RFC6805, November 2012,
              <https://www.rfc-editor.org/info/rfc6805>.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A.,
              and S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP",
              RFC 7752, DOI 10.17487/RFC7752, March 2016,
              <https://www.rfc-editor.org/info/rfc7752>.

   [RFC7855]  Previdi, S., Ed., Filsfils, C., Ed., Decraene, B.,
              Litkowski, S., Horneffer, M., and R. Shakir, "Source
              Packet Routing in Networking (SPRING) Problem Statement
              and Requirements", RFC 7855, DOI 10.17487/RFC7855, May
              2016, <https://www.rfc-editor.org/info/rfc7855>.

   [RFC7911]  Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", RFC 7911,
              DOI 10.17487/RFC7911, July 2016,
              <https://www.rfc-editor.org/info/rfc7911>.

   [RFC7926]  Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G.,
              Ceccarelli, D., and X. Zhang, "Problem Statement and
              Architecture for Information Exchange between
              Interconnected Traffic-Engineered Networks", BCP 206,
              RFC 7926, DOI 10.17487/RFC7926, July 2016,
              <https://www.rfc-editor.org/info/rfc7926>.

   [RFC8231]  Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path
              Computation Element Communication Protocol (PCEP)
              Extensions for Stateful PCE", RFC 8231,
              DOI 10.17487/RFC8231, September 2017,
              <https://www.rfc-editor.org/info/rfc8231>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

   [RFC8281]  Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "Path
              Computation Element Communication Protocol (PCEP)
              Extensions for PCE-Initiated LSP Setup in a Stateful PCE
              Model", RFC 8281, DOI 10.17487/RFC8281, December 2017,
              <https://www.rfc-editor.org/info/rfc8281>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018, <https://www.rfc-editor.org/info/rfc8402>.
Authors' Addresses

   Adrian Farrel
   Juniper Networks

   Email: adrian@olddog.co.uk

   John Drake
   Juniper Networks

   Email: jdrake@juniper.net