idnits 2.17.1 draft-farrel-spring-sr-domain-interconnect-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (January 6, 2018) is 2301 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Outdated reference: A later version (-13) exists of draft-ietf-bess-datacenter-gateway-00 == Outdated reference: A later version (-18) exists of draft-ietf-idr-bgp-ls-segment-routing-ext-03 == Outdated reference: A later version (-27) exists of draft-ietf-idr-bgp-prefix-sid-09 == Outdated reference: A later version (-19) exists of draft-ietf-idr-bgpls-segment-routing-epe-14 == Outdated reference: A later version (-26) exists of draft-ietf-idr-segment-routing-te-policy-01 == Outdated reference: A later version (-22) exists of draft-ietf-idr-tunnel-encaps-07 == Outdated reference: A later version (-25) exists of draft-ietf-isis-segment-routing-extensions-15 == Outdated reference: A later version (-27) exists of draft-ietf-ospf-segment-routing-extensions-24 == Outdated reference: A later version (-16) exists of draft-ietf-pce-segment-routing-11 == Outdated reference: A later version (-15) exists of draft-ietf-spring-segment-routing-14 == Outdated reference: A later version (-22) exists of draft-ietf-spring-segment-routing-mpls-11 == Outdated reference: A later version (-07) exists of draft-sivabalan-pce-binding-label-sid-03 -- Obsolete informational reference (is this intentional?): RFC 7752 (Obsoleted by RFC 9552) Summary: 0 errors (**), 0 flaws (~~), 13 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 SPRING Working Group A. Farrel 3 Internet-Draft J. Drake 4 Intended status: Informational Juniper Networks 5 Expires: July 10, 2018 January 6, 2018 7 Interconnection of Segment Routing Domains - Problem Statement and 8 Solution Landscape 9 draft-farrel-spring-sr-domain-interconnect-03 11 Abstract 13 Segment Routing (SR) is a popular forwarding paradigm for use in MPLS 14 and IPv6 networks. It is typically deployed in discrete domains that 15 may be data centers, access networks, or other networks that are 16 under the control of a single operator and that can easily be 17 upgraded to support this new technology. 19 Traffic originating in one SR domain often terminates in another SR 20 domain but must transit a backbone network that provides 21 interconnection between those domains. 23 This document describes a mechanism for providing connectivity 24 between SR domains to enable end-to-end or domain-to-domain traffic 25 engineering. 
27 The approach described allows connectivity between SR domains, 28 utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing) 29 across the backbone network, makes heavy use of pre-existing 30 technologies, and requires the specification of very few additional 31 mechanisms. 33 This document provides some background and a problem statement, 34 explains the solution mechanism, gives references to other documents 35 that define protocol mechanisms, and provides examples. It does not 36 define any new protocol mechanisms. 38 Status of This Memo 40 This Internet-Draft is submitted in full conformance with the 41 provisions of BCP 78 and BCP 79. 43 Internet-Drafts are working documents of the Internet Engineering 44 Task Force (IETF). Note that other groups may also distribute 45 working documents as Internet-Drafts. The list of current Internet- 46 Drafts is at https://datatracker.ietf.org/drafts/current/. 48 Internet-Drafts are draft documents valid for a maximum of six months 49 and may be updated, replaced, or obsoleted by other documents at any 50 time. It is inappropriate to use Internet-Drafts as reference 51 material or to cite them other than as "work in progress." 53 This Internet-Draft will expire on July 10, 2018. 55 Copyright Notice 57 Copyright (c) 2018 IETF Trust and the persons identified as the 58 document authors. All rights reserved. 60 This document is subject to BCP 78 and the IETF Trust's Legal 61 Provisions Relating to IETF Documents 62 (https://trustee.ietf.org/license-info) in effect on the date of 63 publication of this document. Please review these documents 64 carefully, as they describe your rights and restrictions with respect 65 to this document. Code Components extracted from this document must 66 include Simplified BSD License text as described in Section 4.e of 67 the Trust Legal Provisions and are provided without warranty as 68 described in the Simplified BSD License. 70 Table of Contents 72 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 73 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 74 2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 4 75 3. Solution Technologies . . . . . . . . . . . . . . . . . . . . 7 76 3.1. Characteristics of Solution Technologies . . . . . . . . 8 77 4. Decomposing the Problem . . . . . . . . . . . . . . . . . . . 9 78 5. Solution Space . . . . . . . . . . . . . . . . . . . . . . . 11 79 5.1. Global Optimization of the Paths . . . . . . . . . . . . 11 80 5.2. Figuring Out the GWs at a Destination Domain for a Given 81 Prefix . . . . . . . . . . . . . . . . . . . . . . . . . 11 82 5.3. Figuring Out the Backbone Egress ASBRs . . . . . . . . . 12 83 5.4. Making use of RSVP-TE LSPs Across the Backbone . . . . . 13 84 5.5. Data Plane . . . . . . . . . . . . . . . . . . . . . . . 13 85 5.6. Centralized and Distributed Controllers . . . . . . . . . 16 86 6. BGP-LS Considerations . . . . . . . . . . . . . . . . . . . . 18 87 7. Worked Examples . . . . . . . . . . . . . . . . . . . . . . . 22 88 8. Label Stack Depth Considerations . . . . . . . . . . . . . . 26 89 8.1. Worked Example . . . . . . . . . . . . . . . . . . . . . 27 90 9. Gateway Considerations . . . . . . . . . . . . . . . . . . . 28 91 9.1. Domain Gateway Auto-Discovery . . . . . . . . . . . . . . 28 92 9.2. Relationship to BGP Link State and Egress Peer 93 Engineering . . . . . . . . . . . . . . . . . . . . . . . 29 94 9.3. Advertising a Domain Route Externally . . . . . . . . . . 29 95 9.4. Encapsulations . . . 
. . . . . . . . . . . . . . . . . . 30 97 10. Security Considerations . . . . . . . . . . . . . . . . . . . 30 98 11. Management Considerations . . . . . . . . . . . . . . . . . . 31 99 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 100 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 31 101 14. Informative References . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 34 104 1. Introduction 106 Data Centers are a growing market sector. They are being set up by 107 new specialist companies, by enterprises for their own use, by legacy 108 ISPs, and by the new wave of network operators. The networks inside 109 Data Centers are currently well-planned, but the traffic loads can be 110 unpredictable. There is a need to be able to direct traffic within a 111 Data Center to follow a specific path. 113 Data Centers are attached to external ("backbone") networks to allow 114 access by users and to facilitate communication among Data Centers. 115 An individual Data Center may be attached to multiple backbone 116 networks, and may have multiple points of attachment to each backbone 117 network. Traffic to or from a Data Center may need to be directed to 118 or from any of these points of attachment. 120 Segment Routing (SR) is a technology that places forwarding state 121 into each packet as a stack of loose hops. SR is a popular option 122 for building Data Centers, and is also seeing increasing traction in 123 edge and access networks as well as in backbone networks. It is 124 typically deployed in discrete domains that may be data centers, 125 access networks, or other networks that are under the control of a 126 single operator and that can easily be upgraded to support this new 127 technology. 129 Traffic originating in one SR domain often terminates in another SR 130 domain but must transit a backbone network that provides 131 interconnection between those domains. This document describes an 132 approach that builds on existing technologies to produce mechanisms 133 that provide scalable and flexible interconnection of Data Centers, 134 and that will be easy to operate. 136 The mechanisms described provide end-to-end SR connectivity between 137 SR-capable domains across an MPLS backbone network that supports SR 138 and/or MPLS-TE. This is the generalization of the requirement to 139 provide inter-Data Center connectivity. 141 The approach described allows connectivity between SR domains, 142 utilizes traffic engineering mechanisms (RSVP-TE or Segment Routing) 143 across the backbone network, makes heavy use of pre-existing 144 technologies, and requires the specification of very few additional 145 mechanisms. 147 This document provides some background and a problem statement, 148 explains the solution mechanism, gives references to other documents 149 that define protocol mechanisms, and provides examples. It does not 150 define any new protocol mechanisms. 152 1.1. Terminology 154 This document uses Segment Routing terminology from [RFC7855] and 155 [I-D.ietf-spring-segment-routing]. Particular abbreviations of note 156 are: 158 o SID: a segment identifier 160 o SRGB: an SR Global Block 162 Further terms are defined in Section 2. 164 2. Problem Statement 166 Consider the network in Figure 1. Without loss of generality, this 167 figure can be used to represent the architecture and problem space 168 for steering traffic within and between SR edge domains. 
The figure 169 shows a single destination for all traffic that we will consider. 171 In describing the problem space and the solution we use five terms 172 for network nodes as follows: 174 SR domain : This term is defined in 175 [I-D.ietf-spring-segment-routing]. In this document, an SR domain 176 is a collection of SR-capable nodes under the care of one 177 administrator or protocol. This may mean that each edge network 178 is an SR domain attached to the backbone network through one or 179 more gateways. Examples include, access networks, Data Center 180 sites, backbone networks that run SR, and blessings of unicorns. 182 Host : A node within an edge domain. May be an end system or a 183 transit node in the edge domain. 185 Gateway (GW) : Provides access to or from an edge domain. Examples 186 are Customer Edge nodes (CEs), Autonomous System Border Routers 187 (ASBRs), and Data Center gateways. 189 Provider Edge (PE) : Provides access to or from the backbone 190 network. 192 Autonomous System Border Router (ASBR) : Provides access to one AS 193 in the backbone network from another AS in the backbone network. 195 These terms can be seen used in Figure 1 where the various sources 196 and the destination are hosts. In this figure we distinguish between 197 the PEs that provide access to the backbone networks and the Gateways 198 that provide access to the SR edge domains: these may, in fact, be 199 the same equipment and the PEs might be located at the domain edges. 201 ------------------------------------------------------------------- 202 | | 203 | AS1 | 204 | ---- ---- ---- ---- | 205 -|PE1a|--|PE1b|-------------------------------------|PE2a|--|PE2b|- 206 ---- ---- ---- ---- 207 : : ------------ ------------ : : 208 : : | AS2 | | AS3 | : : 209 : : | ------ ------ | : : 210 : : | |ASBR2a|...|ASBR3a| | : : 211 : : | ------ ------ | : : 212 : : | | | | : : 213 : : | ------ ------ | : : 214 : : | |ASBR2b|...|ASBR3b| | : : 215 : : | ------ ------ | : : 216 : : | | | | : : 217 : ......: | ---- | | ---- | : : 218 : : -|PE2a|----- -----|PE3a|- : : 219 : : ---- ---- : : 220 : : ......: :....... : : 221 : : : : : : 222 ---- ---- ---- ---- 223 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- 224 | ---- ---- | | ---- ---- | 225 | | | | 226 | | | | 227 | | | Source3 | 228 | Source2 | | | 229 | | | Source4 | 230 | Source1 | | | 231 | | | Destination | 232 | | | | 233 | Domain1 | | Domain2 | 234 ---------------- ---------------- 236 Figure 1: Reference Architecture for SR Domain Interconnect 238 Traffic to the destination may originated from multiple sources 239 within that domain (we show two such sources: Source3 and Source4). 240 Furthermore, traffic intended for the destination may arrive from 241 outside the domain through any of the points of attachment to the 242 backbone networks (we show GW2a and GW2b). This traffic may need to 243 be steered within the domain to achieve load-balancing across network 244 resources, to avoid degraded or out-of-service resources (including 245 planned service outages), and to achieve different qualities of 246 service. Of course, traffic in a remote source domain may also need 247 to be steered within that domain. We class this problem as "Intra- 248 Domain Traffic Steering". 250 Traffic across the backbone networks may need to be steered to 251 conform to common Traffic Engineering (TE) paradigms. 
That is, the 252 path across any network (shown in the figure as an Autonomous System 253 (AS)) or across any collection of networks may need to be chosen and 254 may be different from the shortest path first (SPF) routing that 255 would occur without TE. Furthermore, the points of inter-connection 256 between networks may need to be selected and influence the path 257 chosen for the data. We class this problem as "Inter-Domain Traffic 258 Steering". 260 The composite end-to-end path comprises steering in the source 261 domain, choice of source domain exit point, steering across the 262 backbone networks, choice of network interconnections, choice of 263 destination domain entry point, and steering in the destination 264 domain. These issues may be inter-dependent (for example, the best 265 traffic steering in the source domain may help select the best exit 266 point from that domain, but the connectivity options across the 267 backbone network may drive the selection of a different exit point). 268 We class this combination of problems as "End-to-End Domain 269 Interconnect Traffic Steering". 271 It should be noted that the solution to the End-to-End Domain 272 Interconnect Traffic Steering problem depends on a number of factors: 274 o What technology is deployed in the domains. 276 o What technology is deployed in the backbone networks. 278 o How much information the domains are willing to share with each 279 other. 281 o How much information the backbone network operators and the domain 282 operators are willing to share. 284 In some cases, the domains and backbone networks are all owned and 285 operated by the same company (with the backbone network often being a 286 private network). In other cases, the domains are operated by one 287 company, with other companies operating the backbone. 289 3. Solution Technologies 291 Within the Data Center, Segment Routing (SR from the SPRING working 292 group in the IETF [RFC7855] and [I-D.ietf-spring-segment-routing]) is 293 a popular solution. SR introduces traffic steering capabilities into 294 an MPLS network [I-D.ietf-spring-segment-routing-mpls] by utilizing 295 existing data plane capabilities (label pop and packet forwarding - 296 "pop and go") in combination with additions to existing IGPs 297 [I-D.ietf-ospf-segment-routing-extensions], 298 [I-D.ietf-isis-segment-routing-extensions], BGP (as BGP-LU) 299 [RFC8277], or a centralized controller to distribute "per-hop" 300 labels. An MPLS label stack can be imposed on a packet to describe a 301 sequence of links/nodes to be transited by the packet; as each hop is 302 transited, the label that represents it is popped from the stack and 303 the packet is forwarded. Thus, on a packet-by-packet basis, traffic 304 can be steered within the Data Center network. 306 This document broadens the problem space to consider interconnection 307 of any type of edge domain. These may be Data Center sites, but they 308 may equally be access networks, VPN sites, or any other form of 309 domain that includes packet sources and destinations. We 310 particularly focus on "SR edge domains" being source or destination 311 domains that utilize MPLS SR, but the domains could use other non- 312 MPLS technologies (such as IP, VXLAN, and NVGRE) as described in 313 Section 9. 315 Backbone networks are commonly based on MPLS-capable hardware. In 316 these networks, a number of different options exist to establish TE 317 paths. 
Among these options are static Label Switched Paths (LSPs) 318 perhaps set up by an SDN controller, LSP tunnels established using a 319 signaling protocol (such as RSVP-TE), and inter-domain use of SR (as 320 described above for intra-domain steering). Where traffic steering 321 (without resource reservation) is needed, SR may be adequate; where 322 Traffic Engineering is needed (i.e., traffic steering with resource 323 reservation) RSVP-TE or centralized SDN control are preferred. 324 However, in a network that is fully managed and controlled through a 325 centralized planning tool, resource reservation can be achieved and 326 SR can be used for full Traffic Engineering. These solutions are 327 already used in support of a number of edge-to-edge services such as 328 L3VPN and L2VPN. 330 3.1. Characteristics of Solution Technologies 332 Each of the solution technologies mentioned in the previous section 333 has certain characteristics, and the combined solution needs to 334 recognize and address the characteristics in order to make a workable 335 solution. 337 o When SR is used for traffic steering, the size of the MPLS label 338 stack used in SR scales linearly with the length of the strict 339 source route. This can cause issues with MPLS implementations 340 that only support label stacks of a limited size. For example, 341 some MPLS implementations cannot push enough labels on the stack 342 to represent an entire source route. Other implementations may be 343 unable to do the proper "ECMP hashing" if the label stack is too 344 long; they may be unable to read enough of the packet header to 345 find an entropy label or to find the IP header of the payload. 346 Increasing the packet header size also reduces the size of the 347 payload that can be carried in an MPLS packet. There are 348 techniques that can be used to reduce the size of the label stack. 349 For example, a single label (known as a "binding SID") can be used 350 to represent a sequence of nodes; this label can be replaced with 351 a set of labels when the packet reaches the first node in the 352 sequence. It is also possible to combine SR with conventional 353 RSVP-TE by using a binding SID in the label stack to represent an 354 LSP tunnel set up by RSVP-TE. 356 o Most of the work on using SR for traffic steering assumes that 357 traffic only needs to be steered within a single administrative 358 domain. If the backbone consists of multiple ASes that are not 359 part of a common administrative domain, the use of SR across the 360 backbone may prove to be a challenge, and its use in the backbone 361 may be limited to cases where private networks connect the 362 domains, rather than cases where the domains are connected by 363 third-party network operators or by the public Internet. 365 o RSVP-TE has been used to provide edge-to-edge tunnels through 366 which flows to/from many endpoints can be routed, and this 367 provides a reduction in state while still offering Traffic 368 Engineering across the backbone network. However, this requires 369 O(n2) connections and as the number of edge domains increases this 370 becomes unsustainable. 372 o A centralized control system, while capable of producing more 373 optimal results than a distributed control system, may present 374 challenges in large and dynamic networks. It relies on all 375 network state being held centrally, and it is difficult to make 376 central control as robust and self-correcting as distributed 377 control. 
379 This document introduces an approach that blends the best points of 380 each of these solution technologies to achieve a trade-off where 381 RSVP-TE tunnels in the backbone network are stitched together using 382 SR, and end-to-end SR paths can be created under the control of a 383 central controller with routing devolved to the constituent networks 384 where possible. 386 4. Decomposing the Problem 388 It is important to decompose the problem to take account of different 389 regions spanned by the end-to-end path. These regions may use 390 different technologies and may be under different administrative 391 control. The separation of administrative control is particularly 392 important because the operator of one region may be unwilling to 393 share information about their networks, and may be resistant to 394 allowing a third party to exert control over their network resources. 396 Using the reference model in Figure 1, we can consider how to get a 397 packet from Source1 to the Destination. The following decisions must 398 be made: 400 o In which domain the Destination lies. 402 o Which exit point from Domain1 to use. 404 o Which entry point to Domain2 to use. 406 o How to reach the exit point of Domain1 from Source1. 408 o How to reach the entry point to Domain2 from the exit point of 409 Domain1. 411 o How to reach the Destination from the entry point to Domain2. 413 As already mentioned, these decisions may be inter-related. This 414 enables us to break down the problem into three steps: 416 1. Get the packet from Source1 to the exit point of Domain1. 418 2. Get the packet from exit point of Domain1 to entry point of 419 Domain2. 421 3. Get the packet from entry point of Domain2 to Destination. 423 The solution needs to achieve this in a way that allows: 425 o Adequate discovery of preferred elements in the end-to-end path 426 (such as the location of the destination, and the selection of the 427 destination domain entry point). 429 o Full control of the end-to-end path if all of the operators are 430 willing. 432 o Re-use of existing techniques and technologies. 434 From a technology point of view we must support several functions and 435 mixtures of those functions: 437 o If a domain uses MPLS Segment Routing, the labels within the 438 domain may be populated by any means including BGP-LU [RFC8277], 439 IGP, and central control. Source routes within the domain may be 440 expressed as label stacks pushed by a controller or computed by a 441 source router, or expressed as a single label and programmed into 442 the domain routers by a controller. 444 o If a domain uses other (non-MPLS) forwarding, the domain 445 processing is specific to that technology. See Section 9 for 446 details. 448 o If the domains use Segment Routing, the source and destination 449 domains may or may not be in the same 'Segment Routing domain' 450 [I-D.ietf-spring-segment-routing], so that the prefix-SIDs may be 451 the same or different in the two domains. 453 o The backbone network may be a single private network under the 454 control of the owner of the domains and comprising one or more 455 ASes, or may be a network operated by one or more third parties. 457 o The backbone network may utilize MPLS Traffic Engineering tunnels 458 in conjunction with MPLS Segment Routing and the domain-to-domain 459 source route may be provided by stitching TE LSPs. 
461 o A single controller may be used to handle the source and 462 destination domains as well as the backbone network, or there may 463 be a different controller for the backbone network separate from 464 that that controls the two domains, or there may be separate 465 controllers for each network. The controllers may cooperate and 466 share information to different degrees. 468 All of these different decompositions of the problem reflect 469 different deployment choices and different commercial and operational 470 practices, each with different functional trade-offs. For example, 471 with separate controllers that do not share information and that only 472 cooperate to a limited extent, it will be possible to achieve end-to- 473 end connectivity with optimal routing at each step (domain or 474 backbone AS), but the end-to-end path that is achieved might not be 475 optimal. 477 5. Solution Space 479 5.1. Global Optimization of the Paths 481 Global optimization of the path from one domain to another requires 482 either that the source controller has a complete view of the end-to- 483 end topology or some form of cooperation between controllers (such as 484 in Backward Recursive Path Computation (BRPC) in [RFC5441]). 486 BGP-LS [RFC7752] can be used to provide the "source" controller with 487 a view of the topology of the backbone. This requires some of the 488 BGP speakers in each AS to have BGP-LS sessions to the controller. 489 Other means of obtaining this view of the topology are of course 490 possible. 492 5.2. Figuring Out the GWs at a Destination Domain for a Given Prefix 494 Suppose GW2a and GW2b both advertise a route to prefix X, each 495 setting itself as next hop. One might think that the GWs for X could 496 be inferred from the routes' next hop fields, but typically only the 497 "best" route as selected by BGP gets distributed across the backbone: 498 the other route is discarded. But the best route according to the 499 BGP selection process might not be the route via the GW that we want 500 to use for traffic engineering purposes. 502 The obvious solution would be to use the ADD-PATH mechanism [RFC7911] 503 to ensure that all routes to X get advertised. However, even if one 504 does this, the identity of the GWs would get lost as soon as the 505 routes got distributed through an ASBR that sets next hop self. And 506 if there are multiple ASes in the backbone, not only will the next 507 hop change several times, but the ADD-PATH mechanism will experience 508 scaling issues. So this "obvious" solution only works within a 509 single AS. 511 A better solution can be achieved using the Tunnel Encapsulation 512 [I-D.ietf-idr-tunnel-encaps] attribute as follows. 514 We define a new tunnel type, "SR tunnel" and when the GWs to a given 515 domain advertise a route to a prefix X within the domain, they each 516 include a Tunnel Encapsulation attribute with multiple remote 517 endpoint sub-TLVs each of which identifies a specific GW to the 518 domain. 520 In other words, each route advertised by any GW identifies all of the 521 GWs to the same domain (see Section 9 for a discussion of how GWs 522 discover each other). Therefore, only one of the routes needs to be 523 distributed to other ASes, and it doesn't matter how many times the 524 next hop changes, the Tunnel Encapsulation attribute (and its remote 525 endpoint sub-TLVs) remains unchanged and disclose the full list of 526 GWs. 
528 Further, when a packet destined for prefix X is sent on a TE path to 529 GW2a we want the packet to arrive at GW2a carrying, at the top of its 530 label stack, GW2a's label for prefix X. To achieve this we place the 531 SID/SRGB in a sub-TLV of the Tunnel Encapsulation attribute. We 532 define the prefix-SID sub-TLV to be essentially identical in syntax 533 to the prefix-SID attribute (see [I-D.ietf-idr-bgp-prefix-sid]), but 534 the semantics are somewhat different. 536 We also define an "MPLS Label Stack" sub-TLV for the Tunnel 537 Encapsulation attribute, and put this in the "SR tunnel" TLV. This 538 allows the destination GW to specify a label stack that it wants 539 packets destined for prefix X to have. This label stack represents a 540 source route through the destination domain. 542 5.3. Figuring Out the Backbone Egress ASBRs 544 We need to figure out the backbone egress ASBRs that are attached to 545 a given GW at the destination domain in order to properly engineer 546 the path across the backbone. 548 The "cleanest" way to do this is to have the backbone egress ASBRs 549 distribute the information to the source controller using the egress 550 peer engineering (EPE) extensions of BGP-LS 551 [I-D.ietf-idr-bgpls-segment-routing-epe]. The EPE extensions to BGP- 552 LS allow a BGP speaker to say, "Here is a list of my EBGP neighbors, 553 and here is a (locally significant) adjacency-SID for each one." 555 It may also be possible to consider utilizing cooperating PCEs or a 556 Hierarchical PCE approach in [RFC6805]. But it should be observed 557 that this question is dependent on the questions in Section 5.2. 558 That is, it is not possible to even start the selection of egress 559 ASBRs until it is known which GWs at the destination domain provide 560 access to a given prefix. Once that question has been answered, any 561 number of PCE approaches can be used to select the right egress ASBR 562 and, more generally, the ASBR path across the backbone. 564 5.4. Making use of RSVP-TE LSPs Across the Backbone 566 There are a number of ways to carry traffic across the backbone from 567 one domain to another. RSVP-TE is a popular tunneling mechanism in 568 similar scenarios (e.g., L3VPN) because it allows for reservation of 569 resources as well as traffic steering. 571 A controller can cause an RSVP-TE LSP to be set up by talking to the 572 LSP head end, using PCEP extensions [RFC8281]. That document 573 specifies an "LSP Initiate" message (the PCInitiate message) that the 574 controller uses to specify the RSVP-TE LSP endpoints, the ERO, a 575 "symbolic pathname", and optionally other attributes (specified in 576 the PCEP specification [RFC5440]) such as bandwidth. 578 When the head end receives a PCInitiate message, it sets up the RSVP- 579 TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to the 580 controller in a PCRpt message [RFC8231]. The PCRpt message also 581 contains the symbolic name that the controller assigned to the LSP, 582 as well as containing some information identifying the LSP-initiate 583 message from the controller, and details of exactly how the LSP was 584 set up (RRO, bandwidth, etc.). 586 The head end can add a TE-PATH-BINDING TLV to the PCRpt message 587 [I-D.sivabalan-pce-binding-label-sid]. This allows the head end to 588 assign a "binding SID" to the LSP, and to report to the controller 589 that a particular binding SID corresponds to a particular LSP. The 590 binding SID is locally scoped to the head end. 
592 The controller can make this label be part of the label stack that it 593 tells the source (or the GW at the source domain) to impose on the 594 data packets being sent to prefix X. When the head end receives a 595 packet with this label at the top of the stack it will send the 596 packet onward on the LSP. 598 5.5. Data Plane 600 Consolidating all of the above, consider what happens when we want to 601 move a data packet from Source to Destination in Figure 1via the 602 following source route: 604 Source1---GW1b---PE2a---ASBR2a---ASBR3a---PE3a---GW2a---Destination 606 Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a and 607 an RSVP-TE LSP from ASBR3a to PE3a both of which we want to use. 609 Let's suppose that the Source pushes a label stack following 610 instructions from the controller (for example, using BGP-LU 611 [RFC8277]). We won't worry for now about source routing through the 612 domains themselves: that is, in practice there may be additional 613 labels in the stack to cover the source route from the Source to GW1b 614 and from GW2a to the Destination, but we will focus only on the 615 labels necessary to leave the source domain, traverse the backbone, 616 and enter the egress domain. So we only care what the stack looks 617 like when the packet gets to GW1b. 619 When the packet gets to GW1b, the stack should have six labels: 621 Top Label: 623 Peer-SID or adjacency-SID identifying the link or links to PE2a. 624 These SIDs are distributed from GW1b to the controller via the EPE 625 extensions of BGP-LS. This label will get popped by GW1b, which 626 will then send the packet to PE2a. 628 Second Label: 630 Binding SID advertised by PE2a to the controller for the RSVP-TE 631 LSP to ASBR2a. This binding SID is advertised via the PCEP 632 extensions discussed above. This label will get swapped by PE2a 633 for the label that the LSP's next hop has assigned to the LSP. 635 Third Label: 637 Peer-SID or adjacency-SID identifying the link or links to ASBR3a, 638 as advertised to the controller by ASBR2a using the BGP-LS EPE 639 extensions. This label gets popped by ASBR2a, which then sends 640 the packet to ASBR3a. 642 Fourth Label: 644 Binding SID advertised by ASBR3a for the RSVP-TE LSP to PE3a. 645 This binding SID is advertised via the PCEP extensions discussed 646 above. ASBR3a treats this label just like PE2a treated the second 647 label above. 649 Fifth label: 651 Peer-SID or adjacency-SID identifying link or links to GW2a, as 652 advertised to the controller by ASBR3a using the BGP-LS EPE 653 extensions. ASBR3a pops this label and sends the packet to GW2a. 655 Sixth Label: 657 Prefix-SID or other label identifying the Destination advertised 658 in a Tunnel Encapsulation attribute by GW2a. This can be omitted 659 if GW2a is happy to accept IP packets, or prefers a VXLAN tunnel 660 for example. That would be indicated through the Tunnel 661 Encapsulation attribute of course. 663 Note that the size of the label stack is proportional to the number 664 of RSVP-TE LSPs that get stitched together by SR. 666 See Section 7 for some detailed examples that show the concrete use 667 of labels in a sample topology. 669 In the above example, all labels except the sixth are locally 670 significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs. Only 671 the sixth label, a prefix-SID, has a domain-wide unique value. To 672 impose that label, the source needs to know the SRGB of GW2a. If all 673 nodes have the same SRGB, this is not a problem. 
Otherwise, there 674 are a number of different ways GW3a can advertise its SRGB. This can 675 be done via the segment routing extensions of BGP-LS, or it can be 676 done using the prefix-SID attribute or BGP-LU [RFC8277], or it can be 677 done using the BGP Tunnel Encapsulation attribute. The technique to 678 be used will depend on the details of the deployment scenario. 680 The reason the above example is primarily based on locally 681 significant labels is that it creates a "strict source route", and it 682 presupposes the EPE extensions of BGP-LS. In some scenarios, the EPE 683 extension to BGP-LS might not be available (or BGP-LS might not be 684 available at all). In other scenarios, it may be desirable to steer 685 a packet through a "loose source route". In such scenarios, the 686 label stack imposed by the source will be based upon a sequence of 687 domain-wide unique "node-SIDs", each representing one of the hops of 688 source route. Each label has to be computed by adding the 689 corresponding node-SID to the SRGB of the node that will act upon the 690 label. One way to learn the node-SIDs and SRGBs is to use the 691 segment routing extensions of BGP-LS. Another way is to use BGP-LU 692 as follows: 694 Each node that may be part of a source route would originate a 695 BGP-LU route with one of its own loopback addresses as the prefix. 696 The BGP prefix-SID attribute would be attached to this route. The 697 prefix-SID attribute would contain a SID, which is the domain-wide 698 unique SID corresponding to the node's loopback address. The 699 attribute would also contain the node's SRGB. 701 While this technique is useful when BGP-LS is not available, there 702 has to be some other means for the source controller to discover the 703 topology. In this document, we focus primarily on the scenario where 704 BGP-LS, rather than BGP-LU, is used. 706 5.6. Centralized and Distributed Controllers 708 A controller or set of controllers is needed to collate topology and 709 TE information from the constituent networks, to apply policies and 710 service requirements to compute paths across those networks, to 711 select an end-to-end path, and to program key nodes in the network to 712 take the right forwarding actions (pushing label stacks, stitching 713 LSPs, forwarding traffic). 715 o It is commonly understood that a fully optimal end-to-end path can 716 only be computed with full knowledge of the end-to-end topology 717 and available Traffic Engineering resources. Thus, one option is 718 for all information about the domain networks and backbone network 719 to be collected by a central controller that makes all path 720 computations and is responsible for issuing the necessary 721 programming commands. Such a model works best when there is no 722 commercial or administrative impediment (for example, where the 723 domains and the backbone network are owned and operated by the 724 same organization). There may, however, be some scaling concerns 725 if the component networks are large. 727 In this mode of operation, each network may use BGP-LS to export 728 Traffic Engineering and topology information to the central 729 controller, and the controller may use PCEP to program the network 730 behavior. 732 o A similar centralized control mechanism can be used with a 733 scalability improvement that risks a reduction in optimality. 
In 734 this case, the domain networks can export to the controller just 735 the feasibility of connectivity between data source/sink and 736 gateway, perhaps enhancing this with some information about the 737 Traffic Engineering metrics of the potential paths. 739 This approach allows the central controller to understand the end- 740 to-end path that it is selecting, but not to control it fully. 741 The source route from data source to domain egress gateway is left 742 to the source host or a controller in the source domain, while the 743 source route from domain ingress gateway to destination is left as 744 a decision for the domain ingress gateway or to a controller in 745 the destination domain and in both cases the traffic may be left 746 to follow the IGP shortest path. 748 This mode of operation still leaves overall control with a 749 centralized server and that may not be considered suitable when 750 there is separate commercial or administrative control of the 751 networks. 753 o When there is separate commercial or administrative control of the 754 networks the domain operator will not want the backbone operator 755 to have control of the paths within the domains and may be 756 reluctant to disclose any information about the topology or 757 resource availability within the domains. Conversely, the 758 backbone operator may be very unwilling to allow the domain 759 operator (a customer) any control over or knowledge about the 760 backbone network. 762 This "problem" has already been solved for Traffic Engineering in 763 MPLS networks that span multiple administrative domains and leads 764 to several potential solutions: 766 * Per-domain path computation [RFC5152] can be seen as "best 767 effort optimization". In this mode the controller for each 768 domain is responsible for finding the best path to the next 769 domain, but has no way of knowing which is the best exit point 770 from the local domain. The resulting path may end up 771 significantly sub-optimal or even blocked. 773 * Backward recursive path computation (BRPC) [RFC5441] is a 774 mechanism that allows controllers to cooperate across a small 775 set of domains (such as ASes) to build a tree of possible paths 776 and so allow the controller for the ingress domain to select 777 the optimal path. The details of the paths within each domain 778 that might reveal confidential information can be hidden using 779 Path Keys [RFC5520]. BRPC produces optimal paths, but scales 780 poorly with an increase in domains and with an increase in 781 connectivity between domains. It can also lead to slow 782 computation times. 784 * Hierarchical PCE (H-PCE) [RFC6805] is a two-level cooperation 785 process between PCEs. The child PCEs remain responsible for 786 computing paths across their domains, and they coordinate with 787 a parent PCE that stitches these paths together to form the 788 end-to-end path. This approach has many similarities with BRPC 789 but can scale better through the maintenance of "domain 790 topology" that shows how the domains are interconnected, and 791 through the ability to pipe-line computation requests to all of 792 the child domains. It has the drawback that some party has to 793 own and operate the parent PCE. 795 * An alternative approach is documented by the TEAS working group 796 [RFC7926]. In this model each network advertises to 797 controllers for adjacent networks (using BGP-LS) selected 798 information about potential connectivity across the network. 
799 It does not have to show full topology and can make its own 800 decisions about which paths it considers optimal for use by its 801 different neighbors and customers. This approach is suitable 802 for the End-to-End Domain Interconnect Traffic Steering problem 803 where the backbone is under different control from the domains 804 because it allows the overlay nature of the use of the backbone 805 network to be treated as a peer network relationship by the 806 controllers of the domains - the domains can be operated using 807 a single controller or a separate controller for each domain. 809 It is also possible to operate domain interconnection when some or 810 all domains do not have a controller. Segment Routing is capable of 811 routing a packet toward the next hop based on the top label on the 812 stack, and that label does not need to indicate an immediately 813 adjacent node or link. In these cases, the packet may be forwarded 814 untouched, or the forwarding router may impose a locally-determined 815 additional set of labels that define the path to the next hop. 817 PCE can be used to instruct the source host or a transit node about 818 what label stacks to add to packets. That is, a node that needs to 819 impose labels (either to start routing the packet from the source 820 host, or to advance the packet from a transit router toward the 821 destination) can determine the label stack to use based on local 822 function or can have that stack supplied by a PCE. The PCE 823 Communication Protocol (PCEP) has been extended to allow the PCE to 824 supply a label stack for reaching a specific destination either in 825 response to a request or in an unsolicited manner 826 [I-D.ietf-pce-segment-routing]. 828 6. BGP-LS Considerations 830 This section gives an overview of the use of BGP-LS to export an 831 abstraction (or summary) of the connectivity across the backbone 832 network by means of two figures that show different views of a sample 833 network. 835 Figure 2 shows a more complex reference architecture. 837 Figure 3 represents the minimum set of nodes and links that need to 838 be advertised in BGP-LS with SR in order to perform Domain 839 Interconnect with traffic engineering across the backbone network: 840 the PEs, ASBRs, and GWs, and the links between them. In particular, 841 EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with 842 associated segment IDs is advertised in BGP-LS with SR. 844 Links that are advertised may be physical links, links realized by 845 LSP tunnels or SR paths, or abstract links. It is assumed that 846 intra-AS links are either real links, RSVP-TE LSPs with allocated 847 bandwidth, or SR TE policies as described in 848 [I-D.ietf-idr-segment-routing-te-policy]. Additional nodes internal 849 to an AS and their links to PEs, ASBRs, and/or GWs may also be 850 advertised (for example to avoid full mesh problems). 852 Note that Figure 3 does not show full interconnectivity. For 853 example, there is no possibility of connectivity between PE1a and 854 PE1c (because there is no RSVP-TE LSP established across AS1 between 855 these two nodes) and so not link is presented in the topology view. 856 [RFC7926] gives further discussion of topological abstractions that 857 may be useful in understanding this distinction. 
859 ------------------------------------------------------------------- 860 | | 861 | AS1 | 862 | ---- ---- ---- ---- | 863 -|PE1a|--|PE1b|-------------------------------------|PE1c|--|PE1d|- 864 ---- ---- ---- ---- 865 : : ------------ ------------ : : : 866 : : | AS2 | | AS3 | : : : 867 : : | ------.....------ | : : : 868 : : | |ASBR2a| |ASBR3a| | : : : 869 : : | ------ ..:------ | : : : 870 : : | | ..: | | : : : 871 : : | ------: ------ | : : : 872 : : | |ASBR2b|...|ASBR3b| | : : : 873 : : | ------ ------ | : : : 874 : : | | | | : : : 875 : : | | ------ | : : : 876 : : | | ..|ASBR3c| | : : : 877 : : | | : ------ | : ....: : 878 : ......: | ---- | : | ---- | : : : 879 : : -|PE2a|----- : -----|PE3b|- : : : 880 : : ---- : ---- : : : 881 : : .......: : :....... : : : 882 : : : ------ : : : : 883 : : : ----|ASBR4b|---- : : : : 884 : : : | ------ | : : : : 885 : : : ---- | : : : : 886 : : : .........|PE4b| AS4 | : : : : 887 : : : : ---- | : : : : 888 : : : : | ---- | : : : : 889 : : : : -----|PE4a|----- : : : : 890 : : : : ---- : : : : 891 : : : : ..: :.. : : : : 892 : : : : : : : : : : 893 ---- ---- ---- ---- ----: ---- 894 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 895 | ---- ---- | | ---- ---- | | ---- ---- | 896 | | | | | | 897 | | | | | | 898 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 899 | | | | | | 900 | | | | | | 901 | Domain1 | | Domain2 | | Domain3 | 902 ---------------- ---------------- ---------------- 904 Figure 2: Network View of Example Configuration 906 ............................................................. 907 : : 908 ---- ---- ---- ---- 909 |PE1a| |PE1b|.....................................|PE1c| |PE1d| 910 ---- ---- ---- ---- 911 : : : : : 912 : : ------.....------ : : : 913 : : ......|ASBR2a| |ASBR3a|...... : : : 914 : : : ------ ..:------ : : : : 915 : : : : : : : : 916 : : : ------..: ------ : : : : 917 : : : ...|ASBR2b|...|ASBR3b| : : : : 918 : : : : ------ ------ : : : : 919 : : : : : : : : : 920 : : : : ------ : : : : 921 : : : : ..|ASBR3c|... : : : : 922 : : : : : ------ : : : ....: : 923 : ......: ---- : ---- : : : 924 : : |PE2a| : |PE3b| : : : 925 : : ---- : ---- : : : 926 : : .......: : :....... : : : 927 : : : ------ : : : : 928 : : : |ASBR4b| : : : : 929 : : : ------ : : : : 930 : : : ----.....: : : : : : 931 : : : .........|PE4b|..... : : : : : 932 : : : : ---- : : : : : : 933 : : : : ---- : : : : 934 : : : : |PE4a| : : : : 935 : : : : ---- : : : : 936 : : : : ..: :.. : : : : 937 : : : : : : : : : : 938 ---- ---- ---- ---- ----: ---- 939 -|GW1a|--|GW1b|- -|GW2a|--|GW2b|- -|GW3a|--|GW3b|- 940 | ---- ---- | | ---- ---- | | ---- ---- | 941 | | | | | | 942 | | | | | | 943 | Host1a Host1b | | Host2a Host2b | | Host3a Host3b | 944 | | | | | | 945 | | | | | | 946 | Domain1 | | Domain2 | | Domain3 | 947 ---------------- ---------------- ---------------- 949 Figure 3: Topology View of Example Configuration 951 A node (a PCE, router, or host) that is computing a full or partial 952 path correlates the topology information disseminated in BGP-LS with 953 the information advertised in BGP with the Tunnel Encapsulation 954 attributes and uses this to compute that path and obtain the SIDs for 955 the elements on that path. In order to allow a source host to 956 compute exit points from its domain, some subset of the above 957 information needs to be disseminated within that domain. 959 What is advertised external to a given AS is controlled by policy at 960 the ASes' PEs, ASBRs, and GWs. 
Central control of what each node 961 should advertise, based upon analysis of the network as a whole, is 962 an important additional function. This and the amount of policy 963 involved may make the use of a Route Reflector an attractive option. 965 The configuration of which links to other nodes and the 966 characteristics of those links a given node advertises in BGP-LS is 967 done locally at each node and pairwise coordination between link end- 968 points is required to ensure consistency. 970 Path Weighted ECMP (PWECMP) is assumed to be used by a GW for a given 971 source domain to send all flows to a given destination domain using 972 all paths in the backbone network to that destination domain in 973 proportion to the minimum bandwidth on each path. PWECMP is also 974 assumed to be used by hosts within a source domain to send flows to 975 that domain's GWs. 977 7. Worked Examples 979 Figure 4 shows a view of the links, paths, and labels that can be 980 assigned to part of the sample network shown in Figure 2 and 981 Figure 3. The double-dash lines (===) indicate LSP tunnels across 982 backbone ASes and dotted lines (...) are physical links. 984 A label may be assigned to each outgoing link at each node. This is 985 shown in Figure 4. For example, at GW1a the label L201 is assigned 986 to the link connecting GW1a to PE1a. At PE1c, the label L302 is 987 assigned to the link connecting PE1c to GW3b. Labels ("binding 988 SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, 989 label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c. 991 At the destination domain, label L305 is a "node-SID"; it represents 992 Host3b, rather than representing a particular link. 994 When a node processes a packet, the label at the top of the label 995 stack indicates the link (or RSVP-TE LSP) on which that node is to 996 transmit the packet. The node pops that label off the label stack 997 before transmitting the packet on the link. However, if the top 998 label is a node-SID, the node processing the packet is expected to 999 transmit the packet on whatever link it regards as the shortest path 1000 to the node represented by the label. 1002 ---- L202 ---- 1003 | |======================================================| | 1004 |PE1a| |PE1c| 1005 | |======================================================| | 1006 ---- L203 ---- 1007 : L304: :L 1008 : : :3 1009 : ---- L205 ---- : :0 1010 : |PE1b|============================================|PE1d| : :2 1011 : ---- ---- : : 1012 : : L303: : : 1013 : : : : : 1014 : : ---- L207 ------ L209 ------ : : : 1015 : : | |======|ASBR2a|......| | : : : 1016 : : | | ------ | |L210 ---- : : : 1017 : : |PE2a| |ASBR3a|======|PE3b| : : : 1018 : : | |L208 ------ L211 | | ---- : : : 1019 : : | |======|ASBR2b|......| | L301: : : : 1020 : : ---- ------ ------ ...: : : : 1021 : : : : : : : 1022 : ....: : : .......: : : 1023 : : : : : : : 1024 : : : : : .........: : 1025 : : : : : : : 1026 : : ....: : : : ....: 1027 L201: :L204 :L206 : : : : 1028 ---- ---- ----- ---- 1029 -|GW1a|--|GW1b|- -|GW3a |--|GW3b|- 1030 | ---- ---- | | ----- ---- | 1031 | : : | | L303: :L304| 1032 | : : | | : : | 1033 |L103: :L102| | : : | 1034 | N1 N2 | | N3 N4 | 1035 | :.. 
..: | | : ....: | 1036 | : : | | : : | 1037 | L101: : | | : : | 1038 | Host1a | | Host3b (L305) | 1039 | | | | 1040 | Domain1 | | Domain3 | 1041 ---------------- ----------------- 1043 Figure 4: Tunnels and Labels in Example Configuration 1045 Note the overlap of label space that occurs so that the figure shows 1046 two instances of L303 and L304. This is acceptable because of 1047 separation between the SR domains and because SIDs applied to 1048 outgoing interfaces are locally scoped. 1050 Let's consider several different possible ways to direct a packet 1051 from Host1a in Domain1 to Host3b in Domain3. 1053 a. Full source route imposed at source 1055 In this case it is assumed that the entity responsible for 1056 determining an end-to-end path has access to the topologies of 1057 both the source and destination domains as well as of the backbone 1058 network. This might happen if all of the networks are owned by 1059 the same operator in which case the information can be shared into 1060 a single database for use by an offline tool, or the information 1061 can be distributed using routing protocols such that the source 1062 host can see enough to select the path. Alternatively, the end- 1063 to-end path could be produced through cooperation between 1064 computation entities each responsible for different domains along 1065 the path. 1067 If the path is computed externally it is pushed to the source 1068 host. Otherwise, it is computed by the source host itself. 1070 Suppose it is desired for a packet from Host1a to travel to Host3b 1071 via the following source route: 1073 Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->N4->Host3b 1075 Host1a would impose the following label stack would be imposed 1076 (with the first label representing the top of stack), and then 1077 send the packet to N1: 1079 L103, L201, L202, L302, L304, L305 1081 N1 sees L103 at the top of the stack, so it pops the stack and 1082 forwards the packet to GW1a. GW1a sees L201 at the top of the 1083 stack, so it pops the stack and forwards the packet to PE1a. PE1a 1084 sees L202 at the top of the stack, so it pops the stack and 1085 forwards the packet over the RSVP-TE LSP to PE1c. As the packet 1086 travels over this LSP, its top label will be an RSVP-TE signaled 1087 label representing the LSP. That is, PE1a imposes an additional 1088 label stack entry for the tunnel LSP. 1090 At the end of the LSP tunnel, the MPLS tunnel label will be 1091 popped, and PE1c will see L302 at the top of the stack. PE1c pops 1092 the stack and forwards the packet to GW3b. GW3b will see L304 at 1093 the top of the stack, so it pops the stack and forwards the packet 1094 to N4. Finally, N4 sees L305 at the top of the stack, so it pops 1095 the stack and forwards the packet to Host3b. 1097 b. It is possible that the source domain does not have visibility 1098 into the destination domain 1100 This occurs if the destination domain does not export its 1101 topology, but does export basic reachability information so that 1102 the source host or the path computation entity will know: 1104 * The GWs through which the destination can be reached. 1106 * The SID to use for the destination prefix. 1108 Suppose we want a packet to follow the source route: 1110 Host1a->N1->GW1a->PE1a->(RSVP-TE LSP)->PE1c->GW3b->...->Host3b 1112 The ellipsis indicates a part of the path that is not explicitly 1113 specified. 
Thus, the label stack imposed at the source host would 1114 be: 1116 L103, L201, L202, L302, L305 1118 Processing is as per case a., but when the packet reaches the GW 1119 of the destination domain (GW3b), it can either simply forward the 1120 packet along the shortest path to Host3b, or it can insert 1121 additional labels to direct the path to the destination. 1123 c. Domain1 only has reachability information for the backbone and 1124 destination networks 1126 The source domain (or the path computation entity) may be further 1127 restricted in its view of the network. It is possible that it 1128 knows the location of the destination in the destination domain, 1129 and knows the GWs to the destination domain that provide 1130 reachability to the destination, but that it has no view of the 1131 backbone network. This leads to the packet being forwarded in a 1132 manner similar to 'per-domain path computation' described in 1133 Section 5.6. 1135 At the source host a simple label stack is imposed navigating the 1136 domain and indicating the destination GW and the destination host. 1138 L103, L302, L305 1140 As the packet leaves the source domain, the source GW (GW1a) 1141 determines the PE to use to enter the backbone using nothing more 1142 than the BGP preferred route to the destination GW (it could be 1143 PE1a or PE1b). 1145 When the packet reaches the first PE it has a label stack just 1146 identifying the destination GW and the host (L302, L305). The PE 1147 uses information it has about the backbone network topology and 1148 available LSPs to select an LSP tunnel, impose the tunnel label, 1149 and forward the packet. 1151 When the packet reaches the end of the LSP tunnel, it is processed 1152 as described in case b. 1154 d. Stitched LSPs across the backbone 1156 A variant of all these cases arises when the packet is sent using 1157 a path that spans multiple ASes. For example, one that crosses 1158 AS2 and AS3 as shown in Figure 2. 1160 In this case, basing the example on case a., the source host would 1161 impose the label stack: 1163 L102, L206, L207, L209, L210, L301, L303, L305 1165 It would then send the packet to N2. 1167 When the packet reaches PE2a as previously described and the top 1168 label (L207) selects an LSP tunnel that leads to ASBR2a. At the 1169 end of that LSP tunnel the next label (L209) routes the packet 1170 from ASBR2a to the ASBR3a, where the next label (L210) identifies 1171 the next LSP tunnel to use. Thus, SR has been used to stitch 1172 together LSPs to make a longer path segment. As the packet 1173 emerges from the final LSP tunnel, forwarding continues as 1174 previously described. 1176 8. Label Stack Depth Considerations 1178 As described in Section 3.1, one of the issues with a Segment Routing 1179 approach is that the label stack can get large, for example when the 1180 source route becomes long. A mechanism to mitigate this problem is 1181 needed if the solution is to be fully applicable in all environments. 1183 [I-D.ietf-idr-segment-routing-te-policy] introduces the concept of 1184 hierarchical source routes as a way to compress source route headers. 1185 It functions by having the egress node for a set of source routes 1186 advertise those source routes along with an explicit request that 1187 each node that is an ingress node for one or more of those source 1188 routes should advertise a binding SID for the set of source routes 1189 for which it is the ingress. 
8.  Label Stack Depth Considerations

   As described in Section 3.1, one of the issues with a Segment
   Routing approach is that the label stack can become large, for
   example when the source route is long.  A mechanism to mitigate this
   problem is needed if the solution is to be fully applicable in all
   environments.

   [I-D.ietf-idr-segment-routing-te-policy] introduces the concept of
   hierarchical source routes as a way to compress source route
   headers.  It functions by having the egress node for a set of source
   routes advertise those source routes along with an explicit request
   that each node that is an ingress node for one or more of those
   source routes should advertise a binding SID for the set of source
   routes for which it is the ingress.  Note that the set of source
   routes can be advertised either by the egress node, as described
   here, or by a controller on behalf of the egress node.

   Such an ingress node advertises its set of source routes and a
   binding SID as an adjacency in BGP-LS as described in Section 6.
   These source routes represent the weighted ECMP paths between the
   ingress node and the egress node.  Note also that the binding SID
   may be supplied by the node that advertises the source routes (the
   egress node or a controller) or may be chosen by the ingress node.

   A remote node that wishes to reach the egress node would then
   construct a source route consisting of the segment IDs necessary to
   reach one of the ingress nodes for the path it wishes to use, along
   with the binding SID that the ingress node advertised to identify
   the set of paths.  When the selected ingress node receives a packet
   with a binding SID it has advertised, it replaces the binding SID
   with the labels for one of its source routes to the egress node
   (choosing one of the source routes in the set according to its own
   weighting algorithms and policy).

8.1.  Worked Example

   Consider the topology in Figure 4.  Suppose that it is desired to
   construct full segment routed paths from ingress to egress, but that
   the resulting label stack (segment route) would be too large.  In
   this case the gateways to Domain3 (GW3a and GW3b) can advertise all
   of the source routes leading to them from the gateways to Domain1
   (GW1a and GW1b).  The gateways to Domain1 then assign binding SIDs
   to those source routes and advertise those SIDs into BGP-LS.

   Thus, GW3b would advertise its two source routes (L201, L202, L302
   and L201, L203, L302), and GW1a would advertise into BGP-LS its
   adjacency to GW3b along with a binding SID.  Should Host1a wish to
   send a packet via GW1a and GW3b, it can include L103 and this
   binding SID in the source route.  GW1a is free to choose which
   source route to use between itself and GW3b using its weighted ECMP
   algorithm.

   Similarly, GW3a would advertise the following set of source routes:

   o  L201, L202, L304

   o  L201, L203, L304

   o  L204, L205, L303

   o  L206, L207, L209, L210, L301

   o  L206, L208, L211, L210, L301

   GW1a would advertise a binding SID for the first three, and GW1b
   would advertise a binding SID for the other two.
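   The expansion of a binding SID at the ingress GW can be sketched as
   follows.  The binding SID value "B1" and the route weights are
   hypothetical; the two source routes are those advertised by GW3b in
   the worked example above.

      import random

      # Sketch (Python): GW1a has advertised binding SID "B1" for the
      # set of source routes from itself to GW3b.
      BINDING_SIDS = {
          ("GW1a", "B1"): [
              # (weight, source route to GW3b) - weights hypothetical
              (1, ["L201", "L202", "L302"]),
              (1, ["L201", "L203", "L302"]),
          ],
      }

      def expand_binding_sid(node, stack):
          """Replace a binding SID at the top of the stack with one of
          the advertised source routes, selected by weight (standing
          in for the node's weighted ECMP algorithm and policy)."""
          routes = BINDING_SIDS[(node, stack[0])]
          weights = [w for w, _ in routes]
          _, route = random.choices(routes, weights=weights, k=1)[0]
          return route + stack[1:]

      # Host1a imposes (L103, B1, L305); the stack at GW1a is then
      # (B1, L305), which GW1a expands into a full source route.
      print(expand_binding_sid("GW1a", ["B1", "L305"]))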
9.  Gateway Considerations

   As described in Section 5.2, [I-D.ietf-bess-datacenter-gateway]
   defines a new tunnel type, "SR tunnel".  When the GWs to a given
   domain advertise a route to a prefix X within the domain, they each
   include a Tunnel Encapsulation attribute containing multiple tunnel
   instances of type "SR tunnel": one for each GW, each containing a
   Remote Endpoint sub-TLV with that GW's address.

   In other words, each route advertised by any GW identifies all of
   the GWs to the same domain.

   Therefore, even if only one of the routes is distributed to other
   ASes, it will not matter how many times the next hop changes, as the
   Tunnel Encapsulation attribute (and its Remote Endpoint sub-TLVs)
   will remain unchanged.

9.1.  Domain Gateway Auto-Discovery

   To allow a given domain's GWs to auto-discover each other and to
   coordinate their operations, the following procedures are
   implemented, as described in [I-D.ietf-bess-datacenter-gateway]:

   o  Each GW is configured with an identifier for the domain.  The
      identifier is common to all GWs to the domain and unique across
      all domains that are connected.

   o  A route target [RFC4360] is attached to each GW's auto-discovery
      route with its value set to the domain identifier.

   o  Each GW constructs an import filtering rule to import any route
      that carries a route target with the same domain identifier that
      the GW itself uses.  This means that only these GWs will import
      those routes, and that all GWs to the same domain will import
      each other's routes and so learn (auto-discover) the current set
      of active GWs for the domain.

   o  The auto-discovery route each GW advertises consists of the
      following:

      *  An IPv4 or IPv6 NLRI containing one of the GW's loopback
         addresses (that is, with an AFI/SAFI that is one of 1/1, 2/1,
         1/4, or 2/4).

      *  A Tunnel Encapsulation attribute containing the GW's
         encapsulation information, which at a minimum consists of an
         SR tunnel TLV with a Remote Endpoint sub-TLV
         [I-D.ietf-idr-tunnel-encaps].

   To avoid the side effect of applying the Tunnel Encapsulation
   attribute to any packet that is addressed to the GW, the GW should
   use a loopback address in this advertisement that is different from
   the one used to reach the GW itself.

   Each GW will include a tunnel instance for each GW that is active
   for the domain (including itself) in the Tunnel Encapsulation
   attribute of every route it advertises to peers outside the domain.
   As the current set of active GWs changes (due to the addition of a
   new GW or the failure/removal of an existing GW), each externally
   advertised route will be re-advertised with the set of SR tunnel
   instances reflecting the current set of active GWs.

9.2.  Relationship to BGP Link State and Egress Peer Engineering

   When a remote GW receives a route to a prefix X, it can use the SR
   tunnel instances within the contained Tunnel Encapsulation attribute
   to identify the GWs through which X can be reached.  It uses this
   information to compute SR TE paths across the backbone network,
   looking at the information advertised to it in SR BGP Link State
   (BGP-LS) [I-D.ietf-idr-bgp-ls-segment-routing-ext] and correlated
   using the domain identity.  SR Egress Peer Engineering (EPE)
   [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement
   the information advertised in BGP-LS.

9.3.  Advertising a Domain Route Externally

   When a packet destined for prefix X is sent on an SR TE path to a GW
   for the domain containing X, it needs to carry the receiving GW's
   label for X such that this label rises to the top of the stack
   before the GW completes its processing of the packet.  To achieve
   this, a Prefix-SID sub-TLV for X is placed in each SR tunnel
   instance in the Tunnel Encapsulation attribute of the externally
   advertised route for X.

   Alternatively, if the GWs to a given domain are configured to allow
   remote GWs to perform SR TE through that domain for a prefix X, then
   each GW computes an SR TE path through the domain to X from each of
   the currently active GWs and places each path in an MPLS label stack
   sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for
   the corresponding GW.
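   The shape of such an externally advertised route can be modeled with
   simple data structures, as sketched below.  This is a conceptual
   model only: the field names and label values are illustrative and
   are not the encodings defined in [I-D.ietf-idr-tunnel-encaps].

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class SRTunnelInstance:
          """One "SR tunnel" instance; one is carried per active GW."""
          remote_endpoint: str      # Remote Endpoint sub-TLV (GW address)
          prefix_sid: str           # Prefix-SID sub-TLV (GW's label for X)
          # Optional MPLS label stack sub-TLV: an SR TE path from this
          # GW to X, if the domain permits remote SR TE through it.
          mpls_label_stack: List[str] = field(default_factory=list)

      @dataclass
      class AdvertisedRoute:
          prefix: str               # the NLRI, prefix X
          tunnels: List[SRTunnelInstance]

      # Hypothetical advertisement of prefix X by the Domain3 GWs: one
      # SR tunnel instance per active GW, so the full set of GWs
      # survives any number of next-hop changes as the route
      # propagates.
      route_to_x = AdvertisedRoute(
          prefix="X",
          tunnels=[
              SRTunnelInstance("GW3a", "L-X-3a"),
              SRTunnelInstance("GW3b", "L-X-3b"),
          ],
      )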
9.4.  Encapsulations

   If the GWs to a given domain are configured to allow remote GWs to
   send them packets in that domain's native encapsulation, then each
   GW will also include in the externally advertised routes multiple
   instances of a tunnel TLV for that native encapsulation: one for
   each GW, each containing a Remote Endpoint sub-TLV with that GW's
   address.  A remote GW may then encapsulate a packet according to the
   rules defined via the sub-TLVs included in each of the tunnel TLV
   instances.
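   A remote GW's choice between an SR tunnel and the domain's native
   encapsulation can be sketched as a simple selection over the
   advertised tunnel TLV instances.  The sketch is illustrative; the
   use of VXLAN as the native encapsulation and the preference policy
   are assumptions, not requirements of this architecture.

      # Sketch (Python): select among tunnel TLV instances advertised
      # for prefix X, preferring the native encapsulation if present.
      def select_tunnel(advertised_tunnels, prefer_native=True):
          if prefer_native:
              for t in advertised_tunnels:
                  if t["type"] != "SR tunnel":
                      return t              # native encapsulation
          for t in advertised_tunnels:
              if t["type"] == "SR tunnel":
                  return t
          raise ValueError("no usable tunnel instance advertised")

      tunnels_for_x = [
          {"type": "SR tunnel", "remote_endpoint": "GW3a"},
          {"type": "VXLAN",     "remote_endpoint": "GW3a"},
      ]
      print(select_tunnel(tunnels_for_x)["type"])    # -> VXLAN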
10.  Security Considerations

   There are several security domains and associated threats in this
   architecture.  SR is itself a data transmission encapsulation that
   provides no additional security, so security in this architecture
   relies on higher-layer mechanisms (for example, end-to-end
   encryption of payload data), on the security of the protocols used
   to establish connectivity and distribute network information, and on
   access control so that control plane and data plane packets are not
   admitted to the network from outside.

   This architecture utilizes a number of control plane protocols
   within domains, within the backbone, and north-south between
   controllers and domains.  Only minor modifications are made to BGP,
   as described in [I-D.ietf-bess-datacenter-gateway]; otherwise, this
   architecture uses existing protocols and extensions, so no new
   security risks are introduced.

   Special care should, however, be taken when routing protocols export
   or import information from or to domains that might have a security
   model based on secure boundaries and internal mutual trust.  This is
   notable when:

   o  BGP-LS is used to export topology information from within a
      domain to a controller that is sited outside the domain.

   o  A southbound protocol such as BGP-LU or NETCONF is used to
      install state in the network from a controller that may be sited
      outside the domain.

   In these cases protocol security mechanisms should be used to
   protect the information in transit entering or leaving the domain
   and to authenticate the out-of-domain nodes (the controllers) so
   that confidential/private information is not lost and that data or
   configuration is not falsified.

11.  Management Considerations

   TBD

12.  IANA Considerations

   This document makes no requests for IANA action.

13.  Acknowledgements

   Thanks to Jeffery Zhang for his careful review.

14.  Informative References

   [I-D.ietf-bess-datacenter-gateway]
              Drake, J., Farrel, A., Rosen, E., Patel, K., and L.
              Jalil, "Gateway Auto-Discovery and Route Advertisement
              for Segment Routing Enabled Domain Interconnection",
              draft-ietf-bess-datacenter-gateway-00 (work in progress),
              October 2017.

   [I-D.ietf-idr-bgp-ls-segment-routing-ext]
              Previdi, S., Psenak, P., Filsfils, C., Gredler, H., and
              M. Chen, "BGP Link-State extensions for Segment Routing",
              draft-ietf-idr-bgp-ls-segment-routing-ext-03 (work in
              progress), July 2017.

   [I-D.ietf-idr-bgp-prefix-sid]
              Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A.,
              and H. Gredler, "Segment Routing Prefix SID extensions
              for BGP", draft-ietf-idr-bgp-prefix-sid-09 (work in
              progress), January 2018.

   [I-D.ietf-idr-bgpls-segment-routing-epe]
              Previdi, S., Filsfils, C., Patel, K., Ray, S., and J.
              Dong, "BGP-LS extensions for Segment Routing BGP Egress
              Peer Engineering", draft-ietf-idr-bgpls-segment-routing-
              epe-14 (work in progress), December 2017.

   [I-D.ietf-idr-segment-routing-te-policy]
              Previdi, S., Filsfils, C., Mattes, P., Rosen, E., and S.
              Lin, "Advertising Segment Routing Policies in BGP",
              draft-ietf-idr-segment-routing-te-policy-01 (work in
              progress), December 2017.

   [I-D.ietf-idr-tunnel-encaps]
              Rosen, E., Patel, K., and G. Velde, "The BGP Tunnel
              Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-07
              (work in progress), July 2017.

   [I-D.ietf-isis-segment-routing-extensions]
              Previdi, S., Ginsberg, L., Filsfils, C., Bashandy, A.,
              Gredler, H., Litkowski, S., Decraene, B., and J.
              Tantsura, "IS-IS Extensions for Segment Routing",
              draft-ietf-isis-segment-routing-extensions-15 (work in
              progress), December 2017.

   [I-D.ietf-ospf-segment-routing-extensions]
              Psenak, P., Previdi, S., Filsfils, C., Gredler, H.,
              Shakir, R., Henderickx, W., and J. Tantsura, "OSPF
              Extensions for Segment Routing", draft-ietf-ospf-segment-
              routing-extensions-24 (work in progress), December 2017.

   [I-D.ietf-pce-segment-routing]
              Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx,
              W., and J. Hardwick, "PCEP Extensions for Segment
              Routing", draft-ietf-pce-segment-routing-11 (work in
              progress), November 2017.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-14 (work
              in progress), December 2017.

   [I-D.ietf-spring-segment-routing-mpls]
              Filsfils, C., Previdi, S., Bashandy, A., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing with MPLS
              data plane", draft-ietf-spring-segment-routing-mpls-11
              (work in progress), October 2017.

   [I-D.sivabalan-pce-binding-label-sid]
              Sivabalan, S., Filsfils, C., Previdi, S., Tantsura, J.,
              Hardwick, J., and D. Dhody, "Carrying Binding Label/
              Segment-ID in PCE-based Networks", draft-sivabalan-pce-
              binding-label-sid-03 (work in progress), July 2017.

   [RFC4360]  Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
              Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
              February 2006, <https://www.rfc-editor.org/info/rfc4360>.

   [RFC5152]  Vasseur, JP., Ed., Ayyangar, A., Ed., and R. Zhang, "A
              Per-Domain Path Computation Method for Establishing
              Inter-Domain Traffic Engineering (TE) Label Switched
              Paths (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February
              2008, <https://www.rfc-editor.org/info/rfc5152>.

   [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
              Element (PCE) Communication Protocol (PCEP)", RFC 5440,
              DOI 10.17487/RFC5440, March 2009,
              <https://www.rfc-editor.org/info/rfc5440>.

   [RFC5441]  Vasseur, JP., Ed., Zhang, R., Bitar, N., and JL. Le Roux,
              "A Backward-Recursive PCE-Based Computation (BRPC)
              Procedure to Compute Shortest Constrained Inter-Domain
              Traffic Engineering Label Switched Paths", RFC 5441,
              DOI 10.17487/RFC5441, April 2009,
              <https://www.rfc-editor.org/info/rfc5441>.
   [RFC5520]  Bradford, R., Ed., Vasseur, JP., and A. Farrel,
              "Preserving Topology Confidentiality in Inter-Domain Path
              Computation Using a Path-Key-Based Mechanism", RFC 5520,
              DOI 10.17487/RFC5520, April 2009,
              <https://www.rfc-editor.org/info/rfc5520>.

   [RFC6805]  King, D., Ed. and A. Farrel, Ed., "The Application of the
              Path Computation Element Architecture to the
              Determination of a Sequence of Domains in MPLS and
              GMPLS", RFC 6805, DOI 10.17487/RFC6805, November 2012,
              <https://www.rfc-editor.org/info/rfc6805>.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A.,
              and S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP",
              RFC 7752, DOI 10.17487/RFC7752, March 2016,
              <https://www.rfc-editor.org/info/rfc7752>.

   [RFC7855]  Previdi, S., Ed., Filsfils, C., Ed., Decraene, B.,
              Litkowski, S., Horneffer, M., and R. Shakir, "Source
              Packet Routing in Networking (SPRING) Problem Statement
              and Requirements", RFC 7855, DOI 10.17487/RFC7855, May
              2016, <https://www.rfc-editor.org/info/rfc7855>.

   [RFC7911]  Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", RFC 7911,
              DOI 10.17487/RFC7911, July 2016,
              <https://www.rfc-editor.org/info/rfc7911>.

   [RFC7926]  Farrel, A., Ed., Drake, J., Bitar, N., Swallow, G.,
              Ceccarelli, D., and X. Zhang, "Problem Statement and
              Architecture for Information Exchange between
              Interconnected Traffic-Engineered Networks", BCP 206,
              RFC 7926, DOI 10.17487/RFC7926, July 2016,
              <https://www.rfc-editor.org/info/rfc7926>.

   [RFC8231]  Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path
              Computation Element Communication Protocol (PCEP)
              Extensions for Stateful PCE", RFC 8231,
              DOI 10.17487/RFC8231, September 2017,
              <https://www.rfc-editor.org/info/rfc8231>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

   [RFC8281]  Crabbe, E., Minei, I., Sivabalan, S., and R. Varga,
              "Path Computation Element Communication Protocol (PCEP)
              Extensions for PCE-Initiated LSP Setup in a Stateful PCE
              Model", RFC 8281, DOI 10.17487/RFC8281, December 2017,
              <https://www.rfc-editor.org/info/rfc8281>.

Authors' Addresses

   Adrian Farrel
   Juniper Networks

   Email: afarrel@juniper.net

   John Drake
   Juniper Networks

   Email: jdrake@juniper.net