SPRING Working Group                                          A. Farrel
Internet-Draft                                                 J. Drake
Intended status: Informational                         Juniper Networks
Expires: December 15, 2018                                June 13, 2018

   Interconnection of Segment Routing Domains - Problem Statement and
                           Solution Landscape
             draft-farrel-spring-sr-domain-interconnect-04

Abstract

   Segment Routing (SR) is a forwarding paradigm for use in MPLS and
   IPv6 networks.  It is intended to be deployed in discrete domains
   that may be data centers, access networks, or other networks that
   are under the control of a single operator and that can easily be
   upgraded to support this new technology.

   Traffic originating in one SR domain often terminates in another SR
   domain, but must transit a backbone network that provides
   interconnection between those domains.

   This document describes a mechanism for providing connectivity
   between SR domains to enable end-to-end or domain-to-domain traffic
   engineering.

   The approach described allows connectivity between SR domains,
   utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing)
   across the backbone network, makes heavy use of pre-existing
   technologies, and requires the specification of very few additional
   mechanisms.

   This document provides some background and a problem statement,
   explains the solution mechanism, gives references to other documents
   that define protocol mechanisms, and provides examples.  It does not
   define any new protocol mechanisms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 15, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Problem Statement
   3.  Solution Technologies
     3.1.  Characteristics of Solution Technologies
   4.  Decomposing the Problem
   5.  Solution Space
     5.1.  Global Optimization of the Paths
     5.2.  Figuring Out the GWs at a Destination Domain for a Given
           Prefix
     5.3.  Figuring Out the Backbone Egress ASBRs
     5.4.  Making use of RSVP-TE LSPs Across the Backbone
     5.5.  Data Plane
     5.6.  Centralized and Distributed Controllers
   6.  BGP-LS Considerations
   7.  Worked Examples
   8.  Label Stack Depth Considerations
     8.1.  Worked Example
   9.  Gateway Considerations
     9.1.  Domain Gateway Auto-Discovery
     9.2.  Relationship to BGP Link State and Egress Peer Engineering
     9.3.  Advertising a Domain Route Externally
     9.4.  Encapsulations
   10. Security Considerations
   11. Management Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. Informative References
   Authors' Addresses

1.  Introduction

   Data Centers are a growing market sector.  They are being set up by
   new specialist companies, by enterprises for their own use, by
   legacy ISPs, and by the new wave of network operators.  The networks
   inside Data Centers are currently well-planned, but the traffic
   loads can be unpredictable.  There is a need to be able to direct
   traffic within a Data Center to follow a specific path.

   Data Centers are attached to external ("backbone") networks to allow
   access by users and to facilitate communication among Data Centers.
   An individual Data Center may be attached to multiple backbone
   networks, and may have multiple points of attachment to each
   backbone network.  Traffic to or from a Data Center may need to be
   directed to or from any of these points of attachment.

   Segment Routing (SR) is a technology that places forwarding state
   into each packet as a stack of loose hops.  SR is an option for
   building Data Centers, and is also seeing increasing traction in
   edge and access networks as well as in backbone networks.  It is
   typically deployed in discrete domains that are under the control of
   a single operator and that can easily be upgraded to support this
   new technology.

   Traffic originating in one SR domain often terminates in another SR
   domain, but must transit a backbone network that provides
   interconnection between those domains.  This document describes an
   approach that builds on existing technologies to produce mechanisms
   that provide scalable and flexible interconnection of SR domains,
   and that will be easy to operate.

   The approach described allows end-to-end connectivity between SR
   domains across an MPLS backbone network, utilizes traffic
   engineering mechanisms (RSVP-TE or Segment Routing) across the
   backbone network, makes heavy use of pre-existing technologies, and
   requires the specification of very few additional mechanisms.

   This document provides some background and a problem statement,
   explains the solution mechanism, gives references to other documents
   that define protocol mechanisms, and provides examples.  It does not
   define any new protocol mechanisms.

1.1.  Terminology

   This document uses Segment Routing terminology from [RFC7855] and
   [I-D.ietf-spring-segment-routing].  Particular abbreviations of note
   are:

   o  SID: a segment identifier

   o  SRGB: an SR Global Block

   In the context of this document, the terms "optimal" and
   "optimality" refer to making the best possible use of network
   resources, and achieving network paths that best meet the objectives
   of the network operators and customers.

   Further terms are defined in Section 2.

2.  Problem Statement

   Consider the network in Figure 1.  Without loss of generality, this
   figure can be used to represent the architecture and problem space
   for steering traffic within and between SR edge domains.  The figure
   shows a single destination for all traffic that we will consider.
   In describing the problem space and the solution we use five terms
   for network nodes as follows:

   SR domain :  This term is defined in
      [I-D.ietf-spring-segment-routing].  In this document, an SR
      domain is a collection of SR-capable nodes under the care of one
      administrator or protocol.  This may mean that each edge network
      is an SR domain attached to the backbone network through one or
      more gateways.  Examples include access networks, Data Center
      sites, and backbone networks that run SR.

   Host :  A node within an edge domain.  It may be an end system or a
      transit node in the edge domain.

   Gateway (GW) :  Provides access to or from an edge domain.  Examples
      are Customer Edge nodes (CEs), Autonomous System Border Routers
      (ASBRs), and Data Center gateways.

   Provider Edge (PE) :  Provides access to or from the backbone
      network.

   Autonomous System Border Router (ASBR) :  Provides access to one
      Autonomous System (AS) in the backbone network from another AS in
      the backbone network.

   These terms are shown in use in Figure 1, where the various sources
   and the destination are hosts.  In this figure we distinguish
   between the PEs that provide access to the backbone networks and the
   Gateways that provide access to the SR edge domains: these may, in
   fact, be the same equipment, and the PEs might be located at the
   domain edges.

   -------------------------------------------------------------------
   |                                                                 |
   | AS1                                                             |
   | ----    ----                                       ----    ---- |
   -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|-
    ----    ----                                       ----    ----
      :       :     ------------   ------------          :       :
      :       :     | AS2      |   |      AS3 |          :       :
      :       :     |  ------  |   |  ------  |          :       :
      :       :     | |ASBR2a|.......|ASBR3a| |          :       :
      :       :     |  ------  |   |  ------  |          :       :
      :       :     |     |    |   |    |     |          :       :
      :       :     |  ------  |   |  ------  |          :       :
      :       :     | |ASBR2b|.......|ASBR3b| |          :       :
      :       :     |  ------  |   |  ------  |          :       :
      :       :     |  |       |   |       |  |          :       :
      : ......:     |----      |   |     ---- |          :       :
      : :          -|PE2a|------   -----|PE3a|-          :       :
      : :            ----                ----            :       :
      : :       ......:                    :............ :       :
      : :       :                                      : :       :
     ----    ----                                       ----    ----
    -|GW1a|--|GW1b|-                                   -|GW2a|--|GW2b|-
    | ----    ---- |                                   | ----    ---- |
    |              |                                   |              |
    |              |                                   |   Source3    |
    |   Source2    |                                   |              |
    |              |                                   |   Source4    |
    |   Source1    |                                   |              |
    |              |                                   |  Destination |
    |              |                                   |              |
    |              |                                   |              |
    |   Domain1    |                                   |   Domain2    |
     ----------------                                   ----------------

        Figure 1: Reference Architecture for SR Domain Interconnect

   Traffic to the destination may originate from multiple sources
   within the destination domain (we show two such sources: Source3 and
   Source4).  Furthermore, traffic intended for the destination may
   arrive from outside the domain through any of the points of
   attachment to the backbone networks (we show GW2a and GW2b).  This
   traffic may need to be steered within the domain to achieve load-
   balancing across network resources, to avoid degraded or
   out-of-service resources (including planned service outages), and to
   achieve different qualities of service.  Of course, traffic in a
   remote source domain may also need to be steered within that domain.
   We class this problem as "Intra-Domain Traffic Steering".

   Traffic across the backbone networks may need to be steered to
   conform to common Traffic Engineering (TE) paradigms.  That is, the
   path across any network (shown in the figure as an AS) or across any
   collection of networks may need to be chosen and may be different
   from the shortest path first (SPF) routing that would occur without
   TE.
   Furthermore, the points of interconnection between networks may need
   to be selected, and that selection influences the path chosen for
   the data.  We class this problem as "Inter-Domain Traffic Steering".

   The composite end-to-end path comprises steering in the source
   domain, choice of source domain exit point, steering across the
   backbone networks, choice of network interconnections, choice of
   destination domain entry point, and steering in the destination
   domain.  These issues may be interdependent (for example, the best
   traffic steering in the source domain may help select the best exit
   point from that domain, but the connectivity options across the
   backbone network may drive the selection of a different exit point).
   We class this combination of problems as "End-to-End Domain
   Interconnect Traffic Steering".

   It should be noted that the solution to the End-to-End Domain
   Interconnect Traffic Steering problem depends on a number of
   factors:

   o  What technology is deployed in the domains.

   o  What technology is deployed in the backbone networks.

   o  How much information the domains are willing to share with each
      other.

   o  How much information the backbone network operators and the
      domain operators are willing to share.

   In some cases, the domains and backbone networks are all owned and
   operated by the same company (with the backbone network often being
   a private network).  In other cases, the domains are operated by one
   company, with other companies operating the backbone.

3.  Solution Technologies

   Segment Routing (SR, from the SPRING working group in the IETF; see
   [RFC7855] and [I-D.ietf-spring-segment-routing]) introduces traffic
   steering capabilities into an MPLS network
   [I-D.ietf-spring-segment-routing-mpls] by utilizing existing data
   plane capabilities (label pop and packet forwarding - "pop and go")
   in combination with additions to existing IGPs
   ([I-D.ietf-ospf-segment-routing-extensions] and
   [I-D.ietf-isis-segment-routing-extensions]), BGP (as BGP-LU
   [RFC8277]), or a centralized controller to distribute "per-hop"
   labels.  An MPLS label stack can be imposed on a packet to describe
   a sequence of links/nodes to be transited by the packet; as each hop
   is transited, the label that represents it is popped from the stack
   and the packet is forwarded.  Thus, on a packet-by-packet basis,
   traffic can be steered within the SR domain.

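   The "pop and go" behavior can be illustrated with a toy sketch.  The
   following Python fragment uses an entirely hypothetical topology and
   label values; it is not intended to model a real MPLS forwarding
   plane, only the way one label is consumed per hop of the source
   route.

      # Toy model of SR-MPLS "pop and go" forwarding.  The topology
      # and all label values are hypothetical.
      TOPOLOGY = {            # node -> {incoming top label: next node}
          "A": {101: "B"},
          "B": {102: "C"},
          "C": {103: "D"},
      }

      def forward(node, stack):
          """Deliver a packet by popping one label per hop transited."""
          while stack:
              top, stack = stack[0], stack[1:]
              node = TOPOLOGY[node][top]
          return node

      # A three-label stack steers the packet A -> B -> C -> D.
      assert forward("A", [101, 102, 103]) == "D"
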
   This document broadens the problem space to consider interconnection
   of any type of edge domain.  These may be Data Center sites, but
   they may equally be access networks, VPN sites, or any other form of
   domain that includes packet sources and destinations.  We
   particularly focus on "SR edge domains" being source or destination
   domains that utilize MPLS SR, but the domains could use other
   non-MPLS technologies (such as IP, VXLAN, and NVGRE) as described in
   Section 9.

   Backbone networks are commonly based on MPLS-capable hardware.  In
   these networks, a number of different options exist to establish TE
   paths.  Among these options are static Label Switched Paths (LSPs),
   perhaps set up by an SDN controller, LSP tunnels established using a
   signaling protocol (such as RSVP-TE), and inter-domain use of SR (as
   described above for intra-domain steering).  Where traffic steering
   (without resource reservation) is needed, SR may be adequate; where
   Traffic Engineering is needed (i.e., traffic steering with resource
   reservation), RSVP-TE or centralized SDN control is preferred.
   However, in a network that is fully managed and controlled through a
   centralized planning tool, resource reservation can be achieved and
   SR can be used for full Traffic Engineering.  These solutions are
   already used in support of a number of edge-to-edge services such as
   L3VPN and L2VPN.

3.1.  Characteristics of Solution Technologies

   Each of the solution technologies mentioned in the previous section
   has certain characteristics, and the combined solution needs to
   recognize and address these characteristics in order to make a
   workable solution.

   o  When SR is used for traffic steering, the size of the MPLS label
      stack used in SR scales linearly with the length of the strict
      source route.  This can cause issues with MPLS implementations
      that only support label stacks of a limited size.  For example,
      some MPLS implementations cannot push enough labels on the stack
      to represent an entire source route.  Other implementations may
      be unable to do the proper "ECMP hashing" if the label stack is
      too long; they may be unable to read enough of the packet header
      to find an entropy label or to find the IP header of the payload.
      Increasing the packet header size also reduces the size of the
      payload that can be carried in an MPLS packet.  There are
      techniques that can be used to reduce the size of the label
      stack.  For example, a source route may be made less specific
      through the use of loose hops requiring fewer labels, or a single
      label (known as a "binding SID") can be used to represent a
      sequence of nodes; this label can be replaced with a set of
      labels when the packet reaches the first node in the sequence (a
      sketch of this substitution follows this list).  It is also
      possible to combine SR with conventional RSVP-TE by using a
      binding SID in the label stack to represent an LSP tunnel set up
      by RSVP-TE.

   o  Most of the work on using SR for traffic steering assumes that
      traffic only needs to be steered within a single administrative
      domain.  If the backbone consists of multiple ASes that are not
      part of a common administrative domain, the use of SR across the
      backbone may prove to be a challenge, and its use in the backbone
      may be limited to cases where private networks connect the
      domains, rather than cases where the domains are connected by
      third-party network operators or by the public Internet.

   o  RSVP-TE has been used to provide edge-to-edge tunnels through
      which flows to/from many endpoints can be routed, and this
      provides a reduction in state while still offering Traffic
      Engineering across the backbone network.  However, this requires
      O(n^2) connections, and as the number of edge domains increases
      this becomes unsustainable.

   o  A centralized control system is capable of producing more
      efficient use of network resources and of allowing better
      coordination of network usage and of network diagnostics.
      However, such a system may present challenges in large and
      dynamic networks because it relies on all network state being
      held centrally, and it is difficult to make central control as
      robust and self-correcting as distributed control.

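   As promised in the first item above, here is a minimal sketch of how
   a binding SID shortens the stack imposed at the source.  All label
   values are hypothetical, and simple list manipulation stands in for
   what would be MPLS label operations in hardware.

      # Hypothetical labels: a five-hop strict segment is represented
      # by a single binding SID advertised by the first node of the
      # sequence.
      SEGMENT = [1001, 1002, 1003, 1004, 1005]
      BINDING_SID = 2001

      def compress(stack, segment, binding_sid):
          """Replace a contiguous label sequence with one binding SID."""
          for i in range(len(stack) - len(segment) + 1):
              if stack[i:i + len(segment)] == segment:
                  return (stack[:i] + [binding_sid]
                          + stack[i + len(segment):])
          return stack

      def expand(stack, binding_sid, segment):
          """At the advertising node, swap the binding SID back for
          the full label sequence."""
          if stack and stack[0] == binding_sid:
              return segment + stack[1:]
          return stack

      imposed = compress([3001] + SEGMENT + [3002], SEGMENT, BINDING_SID)
      assert imposed == [3001, 2001, 3002]        # 3 labels instead of 7
      assert expand(imposed[1:], BINDING_SID, SEGMENT) == SEGMENT + [3002]

   The same substitution is what allows a binding SID to stand for an
   RSVP-TE LSP: the "expansion" at the head end is then a swap onto the
   signaled LSP label rather than a push of further SR labels.
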
   This document introduces an approach that blends the best points of
   each of these solution technologies to achieve a trade-off where
   RSVP-TE tunnels in the backbone network are stitched together using
   SR, and end-to-end SR paths can be created under the control of a
   central controller with routing devolved to the constituent networks
   where possible.

4.  Decomposing the Problem

   It is important to decompose the problem to take account of
   different regions spanned by the end-to-end path.  These regions may
   use different technologies and may be under different administrative
   control.  The separation of administrative control is particularly
   important because the operator of one region may be unwilling to
   share information about their networks, and may be resistant to
   allowing a third party to exert control over their network
   resources.

   Using the reference model in Figure 1, we can consider how to get a
   packet from Source1 to the Destination.  The following decisions
   must be made:

   o  In which domain Destination lies.

   o  Which exit point from Domain1 to use.

   o  Which entry point to Domain2 to use.

   o  How to reach the exit point of Domain1 from Source1.

   o  How to reach the entry point to Domain2 from the exit point of
      Domain1.

   o  How to reach Destination from the entry point to Domain2.

   As already mentioned, these decisions may be inter-related.  This
   enables us to break down the problem into three steps:

   1.  Get the packet from Source1 to the exit point of Domain1.

   2.  Get the packet from exit point of Domain1 to entry point of
       Domain2.

   3.  Get the packet from entry point of Domain2 to Destination.

   The solution needs to achieve this in a way that allows:

   o  Adequate discovery of preferred elements in the end-to-end path
      (such as the location of the destination, and the selection of
      the destination domain entry point).

   o  Full control of the end-to-end path if all of the operators are
      willing.

   o  Re-use of existing techniques and technologies.

   From a technology point of view we must support several functions
   and mixtures of those functions:

   o  If a domain uses MPLS Segment Routing, the labels within the
      domain may be populated by any means including BGP-LU [RFC8277],
      IGP [I-D.ietf-isis-segment-routing-extensions]
      [I-D.ietf-ospf-segment-routing-extensions], and central control.
      Source routes within the domain may be expressed as label stacks
      pushed by a controller or computed by a source router, or
      expressed as a single label and programmed into the domain
      routers by a controller.

   o  If a domain uses other (non-MPLS) forwarding, the domain
      processing is specific to that technology.  See Section 9 for
      details.

   o  If the domains use Segment Routing, the source and destination
      domains may or may not be in the same 'Segment Routing domain'
      [I-D.ietf-spring-segment-routing], so that the prefix-SIDs may be
      the same or different in the two domains.

   o  The backbone network may be a single private network under the
      control of the owner of the domains and comprising one or more
      ASes, or may be a network operated by one or more third parties.

   o  The backbone network may utilize MPLS Traffic Engineering tunnels
      in conjunction with MPLS Segment Routing and the domain-to-domain
      source route may be provided by stitching TE LSPs.
   o  A single controller may be used to handle the source and
      destination domains as well as the backbone network, or there may
      be a different controller for the backbone network separate from
      the one that controls the two domains, or there may be separate
      controllers for each network.  The controllers may cooperate and
      share information to different degrees.

   All of these different decompositions of the problem reflect
   different deployment choices and different commercial and
   operational practices, each with different functional trade-offs.
   For example, with separate controllers that do not share information
   and that only cooperate to a limited extent, it will be possible to
   achieve end-to-end connectivity with optimal routing at each step
   (domain or backbone AS), but the end-to-end path that is achieved
   might not be optimal.

5.  Solution Space

5.1.  Global Optimization of the Paths

   Global optimization of the path from one domain to another requires
   either that the source controller has a complete view of the
   end-to-end topology or some form of cooperation between controllers
   (such as the Backward Recursive Path Computation (BRPC) technique of
   [RFC5441]).

   BGP-LS [RFC7752] can be used to provide the "source" controller with
   a view of the topology of the backbone: that topology may be
   abstracted or partial.  This requires some of the BGP speakers in
   each AS to have BGP-LS sessions to the controller.  Other means of
   obtaining this view of the topology are of course possible.

5.2.  Figuring Out the GWs at a Destination Domain for a Given Prefix

   Suppose GW2a and GW2b both advertise a route to prefix X, each
   setting itself as next hop.  One might think that the GWs for X
   could be inferred from the routes' next hop fields, but typically
   only the "best" route (as selected by BGP) gets distributed across
   the backbone: the other route is discarded.  But the best route
   according to the BGP selection process might not be the route via
   the GW that we want to use for traffic engineering purposes.

   The obvious solution would be to use the ADD-PATH mechanism
   [RFC7911] to ensure that all routes to X get advertised.  However,
   even if one does this, the identity of the GWs would get lost as
   soon as the routes got distributed through an ASBR that sets next
   hop self.  And if there are multiple ASes in the backbone, not only
   will the next hop change several times, but the ADD-PATH mechanism
   will experience scaling issues.  So this "obvious" solution only
   works within a single AS.

   A better solution can be achieved using the Tunnel Encapsulation
   attribute [I-D.ietf-idr-tunnel-encaps] as follows.

   We define a new tunnel type, "SR tunnel", and when the GWs to a
   given domain advertise a route to a prefix X within the domain, they
   each include a Tunnel Encapsulation attribute with multiple remote
   endpoint sub-TLVs, each of which identifies a specific GW to the
   domain.

   In other words, each route advertised by any GW identifies all of
   the GWs to the same domain (see Section 9 for a discussion of how
   GWs discover each other).  Therefore, only one of the routes needs
   to be distributed to other ASes; no matter how many times the next
   hop changes, the Tunnel Encapsulation attribute (and its remote
   endpoint sub-TLVs) remains unchanged and discloses the full list of
   GWs to the domain.

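   The following Python fragment sketches this information model.  The
   class and field names are purely illustrative (they are not the TLV
   encodings defined in [I-D.ietf-idr-tunnel-encaps]), and the prefix
   and GW names are hypothetical.

      from dataclasses import dataclass, field

      @dataclass
      class SRTunnelInstance:
          remote_endpoint: str      # identifies one GW to the domain

      @dataclass
      class RouteToPrefix:
          prefix: str
          next_hop: str             # rewritten hop by hop
          tunnel_encap: list = field(default_factory=list)  # carried intact

      # GW2a advertises prefix X, listing *both* GWs to the domain.
      route = RouteToPrefix(
          prefix="192.0.2.0/24",
          next_hop="GW2a",
          tunnel_encap=[SRTunnelInstance("GW2a"), SRTunnelInstance("GW2b")],
      )

      # An ASBR that sets next hop self changes only the next hop...
      route.next_hop = "ASBR3a"

      # ...so a remote controller can still recover the full GW set.
      gws = [t.remote_endpoint for t in route.tunnel_encap]
      assert gws == ["GW2a", "GW2b"]
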
   Further, when a packet destined for prefix X is sent on a TE path to
   GW2a, we want the packet to arrive at GW2a carrying, at the top of
   its label stack, GW2a's label for prefix X.  To achieve this we
   place the SID/SRGB in a sub-TLV of the Tunnel Encapsulation
   attribute.  We define the prefix-SID sub-TLV to be essentially
   identical in syntax to the prefix-SID attribute (see
   [I-D.ietf-idr-bgp-prefix-sid]), but the semantics are somewhat
   different.

   We also define an "MPLS Label Stack" sub-TLV for the Tunnel
   Encapsulation attribute, and put this in the "SR tunnel" TLV.  This
   allows the destination GW to specify a label stack that it wants
   packets destined for prefix X to have.  This label stack represents
   a source route through the destination domain.

5.3.  Figuring Out the Backbone Egress ASBRs

   We need to figure out the backbone egress ASBRs that are attached to
   a given GW at the destination domain in order to properly engineer
   the path across the backbone.

   The "cleanest" way to do this is to have the backbone egress ASBRs
   distribute the information to the source controller using the egress
   peer engineering (EPE) extensions of BGP-LS
   [I-D.ietf-idr-bgpls-segment-routing-epe].  The EPE extensions to
   BGP-LS allow a BGP speaker to say, "Here is a list of my EBGP
   neighbors, and here is a (locally significant) adjacency-SID for
   each one."

   It may also be possible to utilize cooperating PCEs or the
   Hierarchical PCE approach of [RFC6805].  But it should be observed
   that this choice depends on the questions in Section 5.2.  That is,
   it is not possible to even start the selection of egress ASBRs until
   it is known which GWs at the destination domain provide access to a
   given prefix.  Once that question has been answered, any number of
   PCE approaches can be used to select the right egress ASBR and, more
   generally, the ASBR path across the backbone.

5.4.  Making use of RSVP-TE LSPs Across the Backbone

   There are a number of ways to carry traffic across the backbone from
   one domain to another.  RSVP-TE is a popular mechanism for
   establishing tunnels across MPLS networks in similar scenarios
   (e.g., L3VPN) because it allows for reservation of resources as well
   as traffic steering.

   A controller can cause an RSVP-TE LSP to be set up by talking to the
   LSP head end using PCEP extensions as described in [RFC8281].  That
   document specifies an "LSP Initiate" message (the PCInitiate
   message) that the controller uses to specify the RSVP-TE LSP
   endpoints, the explicit path, a "symbolic pathname", and other
   optional attributes (specified in the PCEP specification [RFC5440])
   such as bandwidth.

   When the head end receives a PCInitiate message, it sets up the
   RSVP-TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to
   the controller in a PCRpt message [RFC8231].  The PCRpt message also
   contains the symbolic name that the controller assigned to the LSP,
   as well as information identifying the PCInitiate message from the
   controller, and details of exactly how the LSP was set up (RRO,
   bandwidth, etc.).

   The head end can add a TE-PATH-BINDING TLV to the PCRpt message
   [I-D.sivabalan-pce-binding-label-sid].  This allows the head end to
   assign a "binding SID" to the LSP, and to report to the controller
   that a particular binding SID corresponds to a particular LSP.

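   A hedged sketch of the controller-side bookkeeping that this message
   exchange implies is shown below.  The method and field names are
   illustrative stand-ins for the PCEP objects of [RFC8281], [RFC8231],
   and the TE-PATH-BINDING TLV; this is not a PCEP implementation, and
   all identifiers and values are hypothetical.

      class ControllerLspDb:
          """Tracks LSPs the controller has asked head ends to create."""

          def __init__(self):
              self.by_symbolic_name = {}

          def on_pcinitiate_sent(self, symbolic_name, endpoints, ero):
              # Remember what we asked the head end to set up.
              self.by_symbolic_name[symbolic_name] = {
                  "endpoints": endpoints, "ero": ero,
                  "plsp_id": None, "binding_sid": None,
              }

          def on_pcrpt(self, symbolic_name, plsp_id, binding_sid=None):
              # The head end reports the PLSP-id it assigned and,
              # optionally, a binding SID for the LSP.
              lsp = self.by_symbolic_name[symbolic_name]
              lsp["plsp_id"] = plsp_id
              lsp["binding_sid"] = binding_sid

          def label_for(self, symbolic_name):
              # The label to place in a source-route stack in order to
              # steer traffic into this LSP at its head end.
              return self.by_symbolic_name[symbolic_name]["binding_sid"]

      db = ControllerLspDb()
      db.on_pcinitiate_sent("pe2a-to-asbr2a", ("PE2a", "ASBR2a"), ["hop1"])
      db.on_pcrpt("pe2a-to-asbr2a", plsp_id=17, binding_sid=1020)
      assert db.label_for("pe2a-to-asbr2a") == 1020
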
   Note that the binding SID is locally scoped to the head end.

   The controller can make this label be part of the label stack that
   it tells the source (or the GW at the source domain) to impose on
   the data packets being sent to prefix X.  When the head end receives
   a packet with this label at the top of the stack it will send the
   packet onward on the LSP.

5.5.  Data Plane

   Consolidating all of the above, consider what happens when we want
   to move a data packet from Source1 to Destination in Figure 1 via
   the following source route:

   Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination

   Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a and
   an RSVP-TE LSP from ASBR3a to PE3a, both of which we want to use.

   Let's suppose that the Source pushes a label stack as instructed by
   the controller (for example, using BGP-LU [RFC8277]).  We won't
   worry for now about source routing through the domains themselves:
   that is, in practice there may be additional labels in the stack to
   cover the source route from Source1 to GW1b and from GW2a to the
   Destination, but we will focus only on the labels necessary to leave
   the source domain, traverse the backbone, and enter the egress
   domain.  So we only care what the stack looks like when the packet
   gets to GW1b.

   When the packet gets to GW1b, the stack should have six labels:

   Top Label:

      Peer-SID or adjacency-SID identifying the link or links to PE2a.
      These SIDs are distributed from GW1b to the controller via the
      EPE extensions of BGP-LS.  This label will get popped by GW1b,
      which will then send the packet to PE2a.

   Second Label:

      Binding SID advertised by PE2a to the controller for the RSVP-TE
      LSP to ASBR2a.  This binding SID is advertised via the PCEP
      extensions discussed above.  This label will get swapped by PE2a
      for the label that the LSP's next hop has assigned to the LSP.

   Third Label:

      Peer-SID or adjacency-SID identifying the link or links to
      ASBR3a, as advertised to the controller by ASBR2a using the
      BGP-LS EPE extensions.  This label gets popped by ASBR2a, which
      then sends the packet to ASBR3a.

   Fourth Label:

      Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a.
      This binding SID is advertised via the PCEP extensions discussed
      above.  ASBR3a treats this label just like PE2a treated the
      second label above.

   Fifth Label:

      Peer-SID or adjacency-SID identifying the link or links to GW2a,
      as advertised to the controller by PE3a using the BGP-LS EPE
      extensions.  PE3a pops this label and sends the packet to GW2a.

   Sixth Label:

      Prefix-SID or other label identifying the Destination, advertised
      in a Tunnel Encapsulation attribute by GW2a.  This can be omitted
      if GW2a is willing to accept IP packets or, for example, prefers
      a VXLAN tunnel; that preference would be indicated through the
      Tunnel Encapsulation attribute.

   Note that the size of the label stack is proportional to the number
   of RSVP-TE LSPs that get stitched together by SR.

   See Section 7 for some detailed examples that show the concrete use
   of labels in a sample topology.

   In the above example, all labels except the sixth are locally
   significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs.
   Only the sixth label, a prefix-SID, has a domain-wide unique value.
   To impose that label, the source needs to know the SRGB of GW2a.

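   As an illustration, the fragment below assembles the six-label stack
   of this example.  Every numeric value is hypothetical; in practice
   the peer-SIDs and binding SIDs would be learned via the BGP-LS EPE
   and PCEP mechanisms described above, and the SRGB and prefix-SID
   index via one of the mechanisms described next.

      def prefix_sid_label(srgb_base, sid_index):
          """An SR-MPLS prefix-SID label is the advertising node's
          SRGB base plus the domain-wide SID index."""
          return srgb_base + sid_index

      GW2A_SRGB_BASE = 16000    # hypothetical start of GW2a's SRGB
      DEST_SID_INDEX = 105      # hypothetical prefix-SID index for X

      stack = [
          299,     # 1: GW1b's peer-SID for the link(s) to PE2a
          1020,    # 2: PE2a's binding SID for the LSP to ASBR2a
          307,     # 3: ASBR2a's peer-SID for the link(s) to ASBR3a
          1031,    # 4: ASBR3a's binding SID for the LSP to PE3a
          311,     # 5: PE3a's peer-SID for the link(s) to GW2a
          prefix_sid_label(GW2A_SRGB_BASE, DEST_SID_INDEX),  # 6: 16105
      ]
      assert stack[-1] == 16105
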
   If all nodes have the same SRGB, learning it is not a problem.
   Otherwise, there are a number of different ways GW2a can advertise
   its SRGB.  This can be done via the segment routing extensions of
   BGP-LS, or it can be done using the prefix-SID attribute or BGP-LU
   [RFC8277], or it can be done using the BGP Tunnel Encapsulation
   attribute.  The technique to be used will depend on the details of
   the deployment scenario.

   The reason the above example is primarily based on locally
   significant labels is that it creates a "strict source route", and
   it presupposes the EPE extensions of BGP-LS.  In some scenarios, the
   EPE extension to BGP-LS might not be available (or BGP-LS might not
   be available at all).  In other scenarios, it may be desirable to
   steer a packet through a "loose source route".  In such scenarios,
   the label stack imposed by the source will be based upon a sequence
   of domain-wide unique "node-SIDs", each representing one of the hops
   of the source route.  Each label has to be computed by adding the
   corresponding node-SID to the SRGB of the node that will act upon
   the label.  One way to learn the node-SIDs and SRGBs is to use the
   segment routing extensions of BGP-LS.  Another way is to use BGP-LU
   as follows:

      Each node that may be part of a source route originates a BGP-LU
      route with one of its own loopback addresses as the prefix.  The
      BGP prefix-SID attribute is attached to this route.  The
      prefix-SID attribute contains a SID that is the domain-wide
      unique SID corresponding to the node's loopback address.  The
      attribute also contains the node's SRGB.

   While this technique is useful when BGP-LS is not available, there
   needs to be some other means for the source controller to discover
   the topology.  In this document, we focus primarily on the scenario
   where BGP-LS, rather than BGP-LU, is used.

5.6.  Centralized and Distributed Controllers

   A controller or set of controllers is needed to collate topology and
   TE information from the constituent networks, to apply policies and
   service requirements to compute paths across those networks, to
   select an end-to-end path, and to program key nodes in the network
   to take the right forwarding actions (pushing label stacks,
   stitching LSPs, forwarding traffic).

   o  It is commonly understood that a fully optimal end-to-end path
      can only be computed with full knowledge of the end-to-end
      topology and available Traffic Engineering resources.  Thus, one
      option is for all information about the domain networks and
      backbone network to be collected by a central controller that
      makes all path computations and is responsible for issuing the
      necessary programming commands.  Such a model works best when
      there is no commercial or administrative impediment (for example,
      where the domains and the backbone network are owned and operated
      by the same organization).  There may, however, be some scaling
      concerns if the component networks are large.

      In this mode of operation, each network may use BGP-LS to export
      Traffic Engineering and topology information to the central
      controller, and the controller may use PCEP to program the
      network behavior.

   o  A similar centralized control mechanism can be used with a
      scalability improvement that risks a reduction in optimality.
      In this case, the domain networks can export to the controller
      just the feasibility of connectivity between data source/sink and
      gateway, perhaps enhancing this with some information about the
      Traffic Engineering metrics of the potential paths.

      This approach allows the central controller to understand the
      end-to-end path that it is selecting, but not to control it
      fully.  The source route from data source to domain egress
      gateway is left to the source host or a controller in the source
      domain, while the source route from domain ingress gateway to
      destination is left as a decision for the domain ingress gateway
      or to a controller in the destination domain; in both cases the
      traffic may be left to follow the IGP shortest path.

      This mode of operation still leaves overall control with a
      centralized server, which may not be considered suitable when
      there is separate commercial or administrative control of the
      networks.

   o  When there is separate commercial or administrative control of
      the networks, the domain operator will not want the backbone
      operator to have control of the paths within the domains and may
      be reluctant to disclose any information about the topology or
      resource availability within the domains.  Conversely, the
      backbone operator may be very unwilling to allow the domain
      operator (a customer) any control over or knowledge about the
      backbone network.

      This "problem" has already been solved for Traffic Engineering in
      MPLS networks that span multiple administrative domains, and it
      leads to several potential solutions:

      *  Per-domain path computation [RFC5152] can be seen as "best
         effort optimization".  In this mode the controller for each
         domain is responsible for finding the best path to the next
         domain, but has no way of knowing which is the best exit point
         from the local domain.  The resulting path may end up
         significantly sub-optimal or even blocked.

      *  Backward recursive path computation (BRPC) [RFC5441] is a
         mechanism that allows controllers to cooperate across a small
         set of domains (such as ASes) to build a tree of possible
         paths and so allow the controller for the ingress domain to
         select the optimal path.  The details of the paths within each
         domain that might reveal confidential information can be
         hidden using Path Keys [RFC5520].  BRPC produces optimal
         paths, but scales poorly with an increase in domains and with
         an increase in connectivity between domains.  It can also lead
         to slow computation times.  (A toy sketch of this style of
         computation follows this list.)

      *  Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation
         process between PCEs.  The child PCEs remain responsible for
         computing paths across their domains, and they coordinate with
         a parent PCE that stitches these paths together to form the
         end-to-end path.  This approach has many similarities with
         BRPC but can scale better through the maintenance of a "domain
         topology" that shows how the domains are interconnected, and
         through the ability to pipe-line computation requests to all
         of the child domains.  It has the drawback that some party has
         to own and operate the parent PCE.

      *  An alternative approach is documented by the TEAS working
         group [RFC7926].  In this model each network advertises to
         controllers for adjacent networks (using BGP-LS) selected
         information about potential connectivity across the network.
         It does not have to show full topology and can make its own
         decisions about which paths it considers optimal for use by
         its different neighbors and customers.  This approach is
         suitable for the End-to-End Domain Interconnect Traffic
         Steering problem where the backbone is under different control
         from the domains because it allows the overlay nature of the
         use of the backbone network to be treated as a peer network
         relationship by the controllers of the domains.  The domains
         can be operated using a single controller or a separate
         controller for each domain.

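   As promised in the BRPC item above, here is a toy sketch of backward
   recursive computation across a chain of two domains.  The topology,
   entry/exit point names, and costs are entirely hypothetical; real
   BRPC exchanges PCEP messages between PCEs rather than sharing a data
   structure.

      # Each domain is modeled as {(entry_point, exit_point): cost}.
      DOMAIN_CHAIN = [
          {("a1", "a2"): 1, ("a1", "a3"): 4},   # source domain
          {("a2", "d"): 5, ("a3", "d"): 1},     # destination domain
      ]

      def brpc(chain, destination):
          """Work backward from the destination, keeping the best cost
          from each candidate entry point; the ingress domain's
          controller then holds the optimal end-to-end choice."""
          best = {destination: 0}
          for domain in reversed(chain):
              nxt = {}
              for (entry, exit_), cost in domain.items():
                  if exit_ in best:
                      total = cost + best[exit_]
                      if total < nxt.get(entry, float("inf")):
                          nxt[entry] = total
              best = nxt
          return best

      # The best end-to-end cost is 5 (via a3), even though a greedy
      # per-domain choice would leave the source domain via a2 (cost 1)
      # and end up with total cost 6 -- the pitfall of per-domain
      # computation noted above.
      assert brpc(DOMAIN_CHAIN, "d") == {"a1": 5}
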
   It is also possible to operate domain interconnection when some or
   all domains do not have a controller.  Segment Routing is capable of
   routing a packet toward the next hop based on the top label on the
   stack, and that label does not need to indicate an immediately
   adjacent node or link.  In these cases, the packet may be forwarded
   untouched, or the forwarding router may impose a locally-determined
   additional set of labels that define the path to the next hop.

   PCE can be used to instruct the source host or a transit node about
   what label stacks to add to packets.  That is, a node that needs to
   impose labels (either to start routing the packet from the source
   host, or to advance the packet from a transit router toward the
   destination) can determine the label stack to use based on local
   function or can have that stack supplied by a PCE.  The PCE
   Communication Protocol (PCEP) has been extended to allow the PCE to
   supply a label stack for reaching a specific destination either in
   response to a request or in an unsolicited manner
   [I-D.ietf-pce-segment-routing].

6.  BGP-LS Considerations

   This section gives an overview of the use of BGP-LS to export an
   abstraction (or summary) of the connectivity across the backbone
   network by means of two figures that show different views of a
   sample network.

   Figure 2 shows a more complex reference architecture.

   Figure 3 represents the minimum set of nodes and links that need to
   be advertised in BGP-LS with SR in order to perform Domain
   Interconnect with traffic engineering across the backbone network:
   the PEs, ASBRs, and GWs, and the links between them.  In particular,
   EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with
   associated segment IDs is advertised in BGP-LS with SR.

   Links that are advertised may be physical links, links realized by
   LSP tunnels or SR paths, or abstract links.  It is assumed that
   intra-AS links are either real links, RSVP-TE LSPs with allocated
   bandwidth, or SR TE policies as described in
   [I-D.ietf-idr-segment-routing-te-policy].  Additional nodes internal
   to an AS and their links to PEs, ASBRs, and/or GWs may also be
   advertised (for example, to avoid full mesh problems).

   Note that Figure 3 does not show full interconnectivity.  For
   example, there is no possibility of connectivity between PE1a and
   PE1c (because there is no RSVP-TE LSP established across AS1 between
   these two nodes) and so no link is presented in the topology view.
   [RFC7926] contains further discussion of topological abstractions
   that may be useful in understanding this distinction.

862 ------------------------------------------------------------------- 863 | | 864 | AS1 | 865 | ---- ---- ---- ---- | 866 -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|- 867 ---- ---- ---- ---- 868 : : ------------ ------------ : : : 869 : : | AS2 | | AS3 | : : : 870 : : | ------.....------ | : : : 871 : : | |ASBR2a| |ASBR3a| | : : : 872 : : | ------ ..:------ | : : : 873 : : | | ..: | | : : : 874 : : | ------: ------ | : : : 875 : : | |ASBR2b|...|ASBR3b| | : : : 876 : : | ------ ------ | : : : 877 : : | | | | : : : 878 : : | | ------ | : : : 879 : : | | ..|ASBR3c| | : : : 880 : : | | : ------ | : ....: : 881 : ......: | ---- | : | ---- | : : : 882 : : -|PE2a|----- : -----|PE3b|- : : : 883 : : ---- : ---- : : : 884 : : .......: : :....... : : : 885 : : : ------ : : : : 886 : : : ----|ASBR4b|---- : : : : 887 : : : | ------ | : : : : 888 : : : ---- | : : : : 889 : : : .........|PE4b| AS4 | : : : : 890 : : : : ---- | : : : : 891 : : : : | ---- | : : : : 892 : : : : -----|PE4a|----- : : : : 893 : : : : ---- : : : : 894 : : : : ..: :.. : : : : 895 : : : : : : : : : : 896 ---- ---- ---- ---- ----: ---- 897 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 898 | ---- ---- | | ---- ---- | | ---- ---- | 899 | | | | | | 900 | | | | | | 901 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 902 | | | | | | 903 | | | | | | 904 | Domain1 | | Domain2 | | Domain3 | 905 ---------------- ---------------- ---------------- 907 Figure 2: Network View of Example Configuration 909 ............................................................. 910 : : 911 ---- ---- ---- ---- 912 |PE1a| |PE1b|.....................................|PE1c| |PE1d| 913 ---- ---- ---- ---- 914 : : : : : 915 : : ------.....------ : : : 916 : : ......|ASBR2a| |ASBR3a|...... : : : 917 : : : ------ ..:------ : : : : 918 : : : : : : : : 919 : : : ------..: ------ : : : : 920 : : : ...|ASBR2b|...|ASBR3b| : : : : 921 : : : : ------ ------ : : : : 922 : : : : : : : : : 923 : : : : ------ : : : : 924 : : : : ..|ASBR3c|... : : : : 925 : : : : : ------ : : : ....: : 926 : ......: ---- : ---- : : : 927 : : |PE2a| : |PE3b| : : : 928 : : ---- : ---- : : : 929 : : .......: : :....... : : : 930 : : : ------ : : : : 931 : : : |ASBR4b| : : : : 932 : : : ------ : : : : 933 : : : ----.....: : : : : : 934 : : : .........|PE4b|..... : : : : : 935 : : : : ---- : : : : : : 936 : : : : ---- : : : : 937 : : : : |PE4a| : : : : 938 : : : : ---- : : : : 939 : : : : ..: :.. : : : : 940 : : : : : : : : : : 941 ---- ---- ---- ---- ----: ---- 942 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 943 | ---- ---- | | ---- ---- | | ---- ---- | 944 | | | | | | 945 | | | | | | 946 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 947 | | | | | | 948 | | | | | | 949 | Domain1 | | Domain2 | | Domain3 | 950 ---------------- ---------------- ---------------- 952 Figure 3: Topology View of Example Configuration 954 A node (a PCE, router, or host) that is computing a full or partial 955 path correlates the topology information disseminated in BGP-LS with 956 the information advertised in BGP (with the Tunnel Encapsulation 957 attributes) and uses this to compute that path and obtain the SIDs 958 for the elements on that path. In order to allow a source host to 959 compute exit points from its domain, some subset of the above 960 information needs to be disseminated within that domain. 962 What is advertised external to a given AS is controlled by policy at 963 the ASes' PEs, ASBRs, and GWs. 
Central control of what each node 964 should advertise, based upon analysis of the network as a whole, is 965 an important additional function. This and the amount of policy 966 involved may make the use of a Route Reflector an attractive option. 968 Local configuration at each node determines which links to other 969 nodes are advertised in BGP-LS, and determines which characteristics 970 of those links are advertised. Pairwise coordination between link 971 end-points is required to ensure consistency. 973 Path Weighted ECMP (PWECMP) is a mechanism to load-balance traffic 974 across parallel equal cost links or paths. In this approach an 975 ingress node distributes the flows from it to a given egress node 976 across the equal cost paths to the egress node in proportion to the 977 lowest bandwidth link on each path. PWECMP can be used by a GW for a 978 given source domain to send all flows to a given destination domain 979 using all paths in the backbone network to that destination domain in 980 proportion to the minimum bandwidth on each path. PWECMP may also be 981 used by hosts within a source domain to send flows to that domain's 982 GWs. 984 7. Worked Examples 986 Figure 4 shows a view of the links, paths, and labels that can be 987 assigned to part of the sample network shown in Figure 2 and 988 Figure 3. The double-dash lines (===) indicate LSP tunnels across 989 backbone ASes and dotted lines (...) are physical links. 991 A label may be assigned to each outgoing link at each node. This is 992 shown in Figure 4. For example, at GW1a the label L201 is assigned 993 to the link connecting GW1a to PE1a. At PE1c, the label L302 is 994 assigned to the link connecting PE1c to GW3b. Labels ("binding 995 SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, 996 label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c. 998 At the destination domain, label L305 is a "node-SID"; it represents 999 Host3b, rather than representing a particular link. 1001 When a node processes a packet, the label at the top of the label 1002 stack indicates the link (or RSVP-TE LSP) on which that node is to 1003 transmit the packet. The node pops that label off the label stack 1004 before transmitting the packet on the link. However, if the top 1005 label is a node-SID, the node processing the packet is expected to 1006 transmit the packet on whatever link it regards as the shortest path 1007 to the node represented by the label. 1009 ---- L202 ---- 1010 | |===================================================| | 1011 |PE1a| |PE1c| 1012 | |===================================================| | 1013 ---- L203 ---- 1014 : L304: :L302 1015 : : : 1016 : ---- L205 ---- : : 1017 : |PE1b|========================================|PE1d| : : 1018 : ---- ---- : : 1019 : : L303: : : 1020 : : ---- : : : 1021 : : ---- L207 |ASBR|L209 ---- : : : 1022 : : | |======| 2a |......| | : : : 1023 : : | | ---- | |L210 ---- : : : 1024 : : |PE2a| |ASBR|======|PE3b| : : : 1025 : : | |L208 ---- L211 | 3a | ---- : : : 1026 : : | |======|ASBR|......| | L301: : : : 1027 : : ---- | 2b | ---- ...: : : : 1028 : : : ---- : : : : 1029 : ....: : : .......: : : 1030 : : : : : : : 1031 : : : : : .........: : 1032 : : : : : : : 1033 : : ....: : : : ....: 1034 L201: :L204 :L206 : : : : 1035 ---- ---- ----- ---- 1036 -|GW1a|--|GW1b|- -|GW3a |--|GW3b|- 1037 | ---- ---- | | ----- ---- | 1038 | : : | | L303: :L304| 1039 | : : | | : : | 1040 |L103: :L102| | : : | 1041 | N1 N2 | | N3 N4 | 1042 | :.. 
..: | | : ....: | 1043 | : : | | : : | 1044 | L101: : | | : : | 1045 | Host1a | | Host3b (L305) | 1046 | | | | 1047 | Domain1 | | Domain3 | 1048 ---------------- ----------------- 1050 Figure 4: Tunnels and Labels in Example Configuration 1052 Note that label spaces can overlap so that, for example, the figure 1053 shows two instances of L303 and L304. This is acceptable because of 1054 the separation between the SR domains, and because SIDs applied to 1055 outgoing interfaces are locally scoped. 1057 Let's consider several different possible ways to direct a packet 1058 from Host1a in Domain1 to Host3b in Domain3. 1060 a. Full source route imposed at source 1062 In this case it is assumed that the entity responsible for 1063 determining an end-to-end path has access to the topologies of 1064 both the source and destination domains as well as of the 1065 backbone network. This might happen if all of the networks 1066 are owned by the same operator in which case the information 1067 can be shared into a single database for use by an offline 1068 tool, or the information can be distributed using routing 1069 protocols such that the source host can see enough to select 1070 the path. Alternatively, the end-to-end path could be 1071 produced through cooperation between computation entities each 1072 responsible for different domains along the path. 1074 If the path is computed externally it is pushed to the source 1075 host. Otherwise, it is computed by the source host itself. 1077 Suppose it is desired for a packet from Host1a to travel to 1078 Host3b via the following source route: 1080 Host1a->N1->GW1a->PE1a->(RSVP-TE 1081 LSP)->PE1c->GW3b->N4->Host3b 1083 Host1a imposes the following label stack (with the first label 1084 representing the top of stack), and then sends the packet to 1085 N1: 1087 L103, L201, L202, L302, L304, L305 1089 N1 sees L103 at the top of the stack, so it pops the stack and 1090 forwards the packet to GW1a. GW1a sees L201 at the top of the 1091 stack, so it pops the stack and forwards the packet to PE1a. 1092 PE1a sees L202 at the top of the stack, so it pops the stack 1093 and forwards the packet over the RSVP-TE LSP to PE1c. As the 1094 packet travels over this LSP, its top label is an RSVP-TE 1095 signaled label representing the LSP. That is, PE1a imposes an 1096 additional label stack entry for the tunnel LSP. 1098 At the end of the LSP tunnel, the MPLS tunnel label is popped, 1099 and PE1c sees L302 at the top of the stack. PE1c pops the 1100 stack and forwards the packet to GW3b. GW3b sees L304 at the 1101 top of the stack, so it pops the stack and forwards the packet 1102 to N4. Finally, N4 sees L305 at the top of the stack, so it 1103 pops the stack and forwards the packet to Host3b. 1105 b. It is possible that the source domain does not have visibility 1106 into the destination domain. 1108 This occurs if the destination domain does not export its 1109 topology, but does export basic reachability information so 1110 that the source host or the path computation entity will know: 1112 + The GWs through which the destination can be reached. 1114 + The SID to use for the destination prefix. 1116 Suppose we want a packet to follow the source route: 1118 Host1a->N1->GW1a->PE1a->(RSVP-TE 1119 LSP)->PE1c->GW3b->...->Host3b 1121 The ellipsis indicates a part of the path that is not 1122 explicitly specified. 
 Thus, the label stack imposed at the
       source host is:

          L103, L201, L202, L302, L305

       Processing is as per case a., but when the packet reaches the
       GW of the destination domain (GW3b) it can either simply forward
       the packet along the shortest path to Host3b, or it can insert
       additional labels to direct the path to the destination.

   c.  Domain1 only has reachability information for the backbone and
       destination networks

       The source domain (or the path computation entity) may be
       further restricted in its view of the network.  It is possible
       that it knows the location of the destination in the destination
       domain, and knows the GWs to the destination domain that provide
       reachability to the destination, but that it has no view of the
       backbone network.  This leads to the packet being forwarded in a
       manner similar to 'per-domain path computation' described in
       Section 5.6.

       At the source host a simple label stack is imposed that
       navigates the source domain and indicates the destination GW and
       the destination host:

          L103, L302, L305

       As the packet leaves the source domain, the source GW (GW1a)
       determines the PE to use to enter the backbone using nothing
       more than the BGP preferred route to the destination GW (it
       could be PE1a or PE1b).

       When the packet reaches the first PE it has a label stack just
       identifying the destination GW and the host (L302, L305).  The
       PE uses information it has about the backbone network topology
       and available LSPs to select an LSP tunnel, impose the tunnel
       label, and forward the packet.

       When the packet reaches the end of the LSP tunnel, it is
       processed as described in case b.

   d.  Stitched LSPs across the backbone

       A variant of all these cases arises when the packet is sent
       using a path that spans multiple ASes, for example, one that
       crosses AS2 and AS3 as shown in Figure 2.

       In this case, basing the example on case a., the source host
       imposes the label stack:

          L102, L206, L207, L209, L210, L301, L303, L305

       It then sends the packet to N2.

       When the packet reaches PE2a, as previously described, the top
       label (L207) indicates an LSP tunnel that leads to ASBR2a.  At
       the end of that LSP tunnel the next label (L209) routes the
       packet from ASBR2a to ASBR3a, where the next label (L210)
       identifies the next LSP tunnel to use.  Thus, SR has been used
       to stitch together LSPs to make a longer path segment.  As the
       packet emerges from the final LSP tunnel, forwarding continues
       as previously described.

8.  Label Stack Depth Considerations

   As described in Section 3.1, one of the issues with a Segment
   Routing approach is that the label stack can get large, for example
   when the source route becomes long.  A mechanism to mitigate this
   problem is needed if the solution is to be fully applicable in all
   environments.

   [I-D.ietf-idr-segment-routing-te-policy] introduces the concept of
   hierarchical source routes as a way to compress source route
   headers.  It functions by having the egress node for a set of source
   routes advertise those source routes, along with an explicit request
   that each node that is an ingress node for one or more of those
   source routes should advertise a binding SID for the set of source
   routes for which it is the ingress.

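   In data-plane terms, the ingress node's job is then to swap a
   binding SID it has advertised for one of the source routes bound to
   that SID.  The sketch below uses hypothetical labels and weights,
   and a random weighted choice stands in for the ingress node's own
   weighting algorithm (a real implementation would typically hash per
   flow rather than choose per packet).

      import random

      # Source routes this ingress advertised under one binding SID,
      # each with a locally configured weight.
      BINDING_SID = 5001
      SOURCE_ROUTES = [
          ([2101, 2102, 2103], 8),    # preferred path, weight 8
          ([2104, 2105, 2103], 2),    # alternate path, weight 2
      ]

      def expand_binding_sid(stack):
          """Replace our binding SID at the top of the stack with one
          of the source routes it represents."""
          if stack and stack[0] == BINDING_SID:
              routes = [route for route, _ in SOURCE_ROUTES]
              weights = [weight for _, weight in SOURCE_ROUTES]
              chosen = random.choices(routes, weights=weights, k=1)[0]
              return chosen + stack[1:]
          return stack

      # e.g. [2101, 2102, 2103, 9000] about 80% of the time
      print(expand_binding_sid([BINDING_SID, 9000]))
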
8. Label Stack Depth Considerations

As described in Section 3.1, one of the issues with a Segment Routing approach is that the label stack can become large, for example when the source route is long. A mechanism to mitigate this problem is needed if the solution is to be fully applicable in all environments.

[I-D.ietf-idr-segment-routing-te-policy] introduces the concept of hierarchical source routes as a way to compress source route headers. It functions by having the egress node for a set of source routes advertise those source routes along with an explicit request that each node that is an ingress node for one or more of those source routes advertise a binding SID for the set of source routes for which it is the ingress. Note that the set of source routes can be advertised either by the egress node as described here, or by a controller on behalf of the egress node.

Such an ingress node advertises its set of source routes and a binding SID as an adjacency in BGP-LS as described in Section 6. These source routes represent the weighted ECMP paths between the ingress node and the egress node. Note also that the binding SID may be supplied by the node that advertises the source routes (the egress or the controller), or may be chosen by the ingress.

A remote node that wishes to reach the egress node constructs a source route consisting of the segment IDs necessary to reach one of the ingress nodes for the path it wishes to use, along with the binding SID that the ingress node advertised to identify the set of paths. When the selected ingress node receives a packet carrying a binding SID it has advertised, it replaces the binding SID with the labels for one of its source routes to the egress node, choosing one of the source routes in the set according to its own weighting algorithms and policy.

8.1. Worked Example

Consider the topology in Figure 4. Suppose that it is desired to construct full segment routed paths from ingress to egress, but that the resulting label stack (segment route) is too large. In this case the gateways to Domain3 (GW3a and GW3b) can advertise all of the source routes from the gateways to Domain1 (GW1a and GW1b). The gateways to Domain1 then assign binding SIDs to those source routes and advertise those SIDs into BGP-LS.

Thus, GW3b advertises the two source routes (L201, L202, L302 and L201, L203, L302), and GW1a advertises into BGP-LS its adjacency to GW3b along with a binding SID. Should Host1a wish to send a packet via GW1a and GW3b, it can include L103 and this binding SID in the source route. GW1a is free to choose which source route to use between itself and GW3b using its weighted ECMP algorithm, as sketched at the end of this section.

Similarly, GW3a can advertise the following set of source routes:

o L201, L202, L304

o L201, L203, L304

o L204, L205, L303

o L206, L207, L209, L210, L301

o L206, L208, L211, L210, L301

GW1a advertises a binding SID for the first three, and GW1b advertises a binding SID for the other two.
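The binding SID expansion at GW1a can be made concrete with the following Python sketch. It is illustrative only: the binding SID value "B1" and the ECMP weights are invented for this example, and it is assumed for the illustration that the label for the destination host (L305) sits below the binding SID in the stack.

   # Illustrative binding SID expansion ("B1" and the weights are
   # invented; the label lists are from the worked example above).
   import random

   # GW1a has bound SID "B1" to the set of source routes toward GW3b.
   BINDINGS = {
       "B1": [
           (["L201", "L202", "L302"], 0.5),
           (["L201", "L203", "L302"], 0.5),
       ],
   }

   def expand_binding_sid(stack):
       """Replace a top-of-stack binding SID with one bound route."""
       sid, rest = stack[0], stack[1:]
       routes, weights = zip(*BINDINGS[sid])
       chosen = random.choices(routes, weights=weights, k=1)[0]
       return list(chosen) + rest

   # Host1a imposed L103 (already popped by N1), the binding SID,
   # and the host label.
   print(expand_binding_sid(["B1", "L305"]))
   # e.g. ['L201', 'L202', 'L302', 'L305']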
9. Gateway Considerations

As described in Section 5.2, [I-D.ietf-bess-datacenter-gateway] defines a new tunnel type, "SR tunnel". When the GWs to a given domain advertise a route to a prefix X within the domain, they each include a Tunnel Encapsulation attribute with multiple tunnel instances, each of type "SR tunnel": one for each GW, and each containing a Remote Endpoint sub-TLV with that GW's address.

In other words, each route advertised by any GW identifies all of the GWs to the same domain.

Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute (and its Remote Endpoint sub-TLVs) will remain unchanged.

9.1. Domain Gateway Auto-Discovery

To allow a given domain's GWs to auto-discover each other and to coordinate their operations, the following procedures are implemented as described in [I-D.ietf-bess-datacenter-gateway]:

o Each GW is configured with an identifier of the domain that is common across all GWs to the domain and unique across all domains that are connected.

o A route target [RFC4360] is attached to each GW's auto-discovery route, with its value set to the domain identifier.

o Each GW constructs an import filtering rule to import any route that carries a route target with the same domain identifier that the GW itself uses (see the sketch at the end of this section). This means that only these GWs will import those routes, and that all GWs to the same domain will import each other's routes and so learn (auto-discover) the current set of active GWs for the domain.

o The auto-discovery route each GW advertises consists of the following:

  * An IPv4 or IPv6 NLRI containing one of the GW's loopback addresses (that is, with an AFI/SAFI that is one of 1/1, 2/1, 1/4, or 2/4).

  * A Tunnel Encapsulation attribute containing the GW's encapsulation information, which at a minimum consists of an SR tunnel TLV with a Remote Endpoint sub-TLV [I-D.ietf-idr-tunnel-encaps].

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW, the GW should use a different loopback address in the advertisement from the one used to reach the GW itself.

Each GW will include a Tunnel Encapsulation attribute for each GW that is active for the domain (including itself), and will include these in every route advertised to peers outside the domain. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW), each externally advertised route will be re-advertised with the set of SR tunnel instances reflecting the current set of active GWs.
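The import filtering described above can be modeled very simply. The following Python sketch is an abstract model, not a BGP implementation: the class and field names are invented for illustration, and the addresses are taken from the documentation address ranges.

   # Abstract model of the auto-discovery import rule (illustrative).
   from dataclasses import dataclass, field

   @dataclass
   class AutoDiscoveryRoute:
       gw_loopback: str              # NLRI: one of the GW's loopbacks
       route_targets: set = field(default_factory=set)

   MY_DOMAIN_ID = "domain-3"         # same value on every Domain3 GW

   def import_route(route):
       # Import only routes carrying a route target equal to our own
       # domain identifier.
       return MY_DOMAIN_ID in route.route_targets

   received = [
       AutoDiscoveryRoute("192.0.2.1", {"domain-3"}),    # a Domain3 GW
       AutoDiscoveryRoute("198.51.100.7", {"domain-1"}), # ignored
   ]
   active_gws = [r.gw_loopback for r in received if import_route(r)]
   print(active_gws)                 # ['192.0.2.1']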
9.2. Relationship to BGP Link State and Egress Peer Engineering

When a remote GW receives a route to a prefix X, it can use the SR tunnel instances within the contained Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR TE paths across the backbone network from the information advertised to it in SR BGP Link State (BGP-LS) [I-D.ietf-idr-bgp-ls-segment-routing-ext], correlated using the domain identity. SR Egress Peer Engineering (EPE) [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement the information advertised in BGP-LS.

9.3. Advertising a Domain Route Externally

When a packet destined for prefix X is sent on an SR TE path to a GW for the domain containing X, it needs to carry the receiving GW's label for X such that this label rises to the top of the stack before the GW completes its processing of the packet. To achieve this, we place a prefix-SID sub-TLV for X in each SR tunnel instance in the Tunnel Encapsulation attribute in the externally advertised route for X.

Alternatively, if the GWs for a given domain are configured to allow remote GWs to perform SR TE through that domain for prefix X, then each GW computes an SR TE path through that domain to X from each of the currently active GWs and places each path in an MPLS label stack sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for that GW.

9.4. Encapsulations

If the GWs for a given domain are configured to allow remote GWs to send them packets in that domain's native encapsulation, then each GW will also include multiple instances of a tunnel TLV for that native encapsulation in the externally advertised routes: one for each GW, and each containing a Remote Endpoint sub-TLV with that GW's address. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the tunnel TLV instances.
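The externally advertised route described in this section and in Sections 9 and 9.3 can be summarized with an abstract data model. The following Python sketch is for illustration only: the class and field names are invented, the prefix, GW addresses, and prefix-SID values are example placeholders, and the actual TLV encodings are those defined in [I-D.ietf-idr-tunnel-encaps].

   # Abstract model of an externally advertised route (illustrative).
   from dataclasses import dataclass
   from typing import List, Optional

   @dataclass
   class TunnelInstance:
       tunnel_type: str       # e.g., "SR tunnel" or a native encap
       remote_endpoint: str   # GW's address (Remote Endpoint sub-TLV)
       prefix_sid: Optional[str] = None  # GW's label for X (Sec. 9.3)

   @dataclass
   class AdvertisedRoute:
       prefix: str
       tunnel_encap: List[TunnelInstance]

   # A route to prefix X in Domain3 lists one tunnel instance per
   # active GW, so the set of usable entry points survives any number
   # of next-hop changes.
   route_for_x = AdvertisedRoute(
       prefix="203.0.113.0/24",
       tunnel_encap=[
           TunnelInstance("SR tunnel", "192.0.2.1", prefix_sid="L303"),
           TunnelInstance("SR tunnel", "192.0.2.2", prefix_sid="L304"),
       ],
   )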
10. Security Considerations

There are several security domains and associated threats in this architecture. SR is itself a data transmission encapsulation that provides no additional security, so security in this architecture relies on higher layer mechanisms (for example, end-to-end encryption of payload data), on the security of the protocols used to establish connectivity and distribute network information, and on access control so that control plane and data plane packets are not admitted to the network from outside.

This architecture utilizes a number of control plane protocols within domains, within the backbone, and north-south between controllers and domains. Only minor modifications are made to BGP as described in [I-D.ietf-bess-datacenter-gateway]; otherwise this architecture uses existing protocols and extensions, so no new security risks are introduced.

Special care should, however, be taken when routing protocols export or import information from or to domains that might have a security model based on secure boundaries and internal mutual trust. This is notable when:

o BGP-LS is used to export topology information from within a domain to a controller that is sited outside the domain.

o A southbound protocol such as BGP-LU or NETCONF is used to install state in the network from a controller that may be sited outside the domain.

In these cases protocol security mechanisms should be used to protect the information in transit entering or leaving the domain, and to authenticate the out-of-domain nodes (the controllers) to ensure that confidential/private information is not lost and that data or configuration is not falsified.

11. Management Considerations

Configuration elements for the approaches described in this document are minor but crucial.

Each GW to a domain is configured with the same identifier for that domain, and that identifier is unique across all domains that are connected. This requires some coordination both within a domain and between cooperating domains. There are no requirements for how this configuration and coordination is achieved, but it is assumed that management systems are involved.

Policy determines what topology information is shared by a BGP-LS speaker (see Section 6). This applies both to the advertisement of interdomain links and their characteristics, and to the advertisement of summarized domain topology or connectivity. This policy is a local (i.e., domain-scoped) configuration dependent on the objectives and business imperatives of the domain operator.

Domain boundaries are usually configured to limit the control and interaction allowed from other domains (for example, to not allow end-to-end TE paths to be set up across domain boundaries). As noted in Section 9.3, the GWs for a given domain can be configured to allow remote GWs to perform SR TE through that domain for a given prefix, a set of prefixes, or all reachable prefixes.

Similarly, as described in Section 9.4, the GWs for a given domain can be configured to allow remote GWs to send them packets in that domain's native encapsulation.

12. IANA Considerations

This document makes no requests for IANA action.

13. Acknowledgements

Thanks to Jeffrey Zhang for his careful review.

14. Informative References

[I-D.ietf-bess-datacenter-gateway]
   Drake, J., Farrel, A., Rosen, E., Patel, K., and L. Jalil, "Gateway Auto-Discovery and Route Advertisement for Segment Routing Enabled Domain Interconnection", draft-ietf-bess-datacenter-gateway-01 (work in progress), May 2018.

[I-D.ietf-idr-bgp-ls-segment-routing-ext]
   Previdi, S., Talaulikar, K., Filsfils, C., Gredler, H., and M. Chen, "BGP Link-State extensions for Segment Routing", draft-ietf-idr-bgp-ls-segment-routing-ext-08 (work in progress), May 2018.

[I-D.ietf-idr-bgp-prefix-sid]
   Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A., and H. Gredler, "Segment Routing Prefix SID extensions for BGP", draft-ietf-idr-bgp-prefix-sid-21 (work in progress), May 2018.

[I-D.ietf-idr-bgpls-segment-routing-epe]
   Previdi, S., Filsfils, C., Patel, K., Ray, S., and J. Dong, "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", draft-ietf-idr-bgpls-segment-routing-epe-15 (work in progress), March 2018.

[I-D.ietf-idr-segment-routing-te-policy]
   Previdi, S., Filsfils, C., Jain, D., Mattes, P., Rosen, E., and S. Lin, "Advertising Segment Routing Policies in BGP", draft-ietf-idr-segment-routing-te-policy-03 (work in progress), May 2018.

[I-D.ietf-idr-tunnel-encaps]
   Rosen, E., Patel, K., and G. Velde, "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-09 (work in progress), February 2018.

[I-D.ietf-isis-segment-routing-extensions]
   Previdi, S., Ginsberg, L., Filsfils, C., Bashandy, A., Gredler, H., Litkowski, S., Decraene, B., and J. Tantsura, "IS-IS Extensions for Segment Routing", draft-ietf-isis-segment-routing-extensions-16 (work in progress), April 2018.

[I-D.ietf-ospf-segment-routing-extensions]
   Psenak, P., Previdi, S., Filsfils, C., Gredler, H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF Extensions for Segment Routing", draft-ietf-ospf-segment-routing-extensions-25 (work in progress), April 2018.

[I-D.ietf-pce-segment-routing]
   Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., and J. Hardwick, "PCEP Extensions for Segment Routing", draft-ietf-pce-segment-routing-11 (work in progress), November 2017.

[I-D.ietf-spring-segment-routing]
   Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.
[I-D.ietf-spring-segment-routing-mpls]
   Bashandy, A., Filsfils, C., Previdi, S., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing with MPLS data plane", draft-ietf-spring-segment-routing-mpls-14 (work in progress), June 2018.

[I-D.sivabalan-pce-binding-label-sid]
   Sivabalan, S., Tantsura, J., Filsfils, C., Previdi, S., Hardwick, J., and D. Dhody, "Carrying Binding Label/Segment-ID in PCE-based Networks.", draft-sivabalan-pce-binding-label-sid-04 (work in progress), March 2018.

[RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, February 2006, <https://www.rfc-editor.org/info/rfc4360>.

[RFC5152] Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A Per-Domain Path Computation Method for Establishing Inter-Domain Traffic Engineering (TE) Label Switched Paths (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February 2008, <https://www.rfc-editor.org/info/rfc5152>.

[RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation Element (PCE) Communication Protocol (PCEP)", RFC 5440, DOI 10.17487/RFC5440, March 2009, <https://www.rfc-editor.org/info/rfc5440>.

[RFC5441] Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux, "A Backward-Recursive PCE-Based Computation (BRPC) Procedure to Compute Shortest Constrained Inter-Domain Traffic Engineering Label Switched Paths", RFC 5441, DOI 10.17487/RFC5441, April 2009, <https://www.rfc-editor.org/info/rfc5441>.

[RFC5520] Bradford, R., Ed., Vasseur, JP., and A. Farrel, "Preserving Topology Confidentiality in Inter-Domain Path Computation Using a Path-Key-Based Mechanism", RFC 5520, DOI 10.17487/RFC5520, April 2009, <https://www.rfc-editor.org/info/rfc5520>.

[RFC6805] King, D., Ed. and A. Farrel, Ed., "The Application of the Path Computation Element Architecture to the Determination of a Sequence of Domains in MPLS and GMPLS", RFC 6805, DOI 10.17487/RFC6805, November 2012, <https://www.rfc-editor.org/info/rfc6805>.

[RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and S. Ray, "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", RFC 7752, DOI 10.17487/RFC7752, March 2016, <https://www.rfc-editor.org/info/rfc7752>.

[RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., Litkowski, S., Horneffer, M., and R. Shakir, "Source Packet Routing in Networking (SPRING) Problem Statement and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 2016, <https://www.rfc-editor.org/info/rfc7855>.

[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016, <https://www.rfc-editor.org/info/rfc7911>.

[RFC7926] Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G., Ceccarelli, D., and X. Zhang, "Problem Statement and Architecture for Information Exchange between Interconnected Traffic-Engineered Networks", BCP 206, RFC 7926, DOI 10.17487/RFC7926, July 2016, <https://www.rfc-editor.org/info/rfc7926>.

[RFC8231] Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful PCE", RFC 8231, DOI 10.17487/RFC8231, September 2017, <https://www.rfc-editor.org/info/rfc8231>.

[RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017, <https://www.rfc-editor.org/info/rfc8277>.

[RFC8281] Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "Path Computation Element Communication Protocol (PCEP) Extensions for PCE-Initiated LSP Setup in a Stateful PCE Model", RFC 8281, DOI 10.17487/RFC8281, December 2017, <https://www.rfc-editor.org/info/rfc8281>.
Authors' Addresses

Adrian Farrel
Juniper Networks

Email: afarrel@juniper.net

John Drake
Juniper Networks

Email: jdrake@juniper.net