idnits 2.17.1

draft-ietf-bess-bgp-multicast-controller-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 22, 2020) is 1304 days in the past.  Is this
     intentional?
  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC5331' is mentioned on line 249, but not defined

  == Missing Reference: 'RFC 7752' is mentioned on line 332, but not defined

  ** Obsolete undefined reference: RFC 7752 (Obsoleted by RFC 9552)

  == Unused Reference: 'RFC6513' is defined on line 872, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-07) exists of
     draft-ietf-bess-bgp-multicast-02

  == Outdated reference: A later version (-26) exists of
     draft-ietf-idr-segment-routing-te-policy-09

  == Outdated reference: A later version (-22) exists of
     draft-ietf-idr-tunnel-encaps-17

  == Outdated reference: A later version (-11) exists of
     draft-ietf-idr-wide-bgp-communities-05

     Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   BESS                                                            Z. Zhang
3   Internet-Draft                                          Juniper Networks
4   Intended status: Standards Track                               R. Raszuk
5   Expires: March 26, 2021                                     Bloomberg LP
6                                                                 D. Pacella
7                                                                    Verizon
8                                                                   A. Gulko
9                                                                  Refinitiv
10                                                        September 22, 2020

12                  Controller Based BGP Multicast Signaling
13                draft-ietf-bess-bgp-multicast-controller-05

15  Abstract

17     This document specifies a way that one or more centralized
18     controllers can use BGP to set up a multicast distribution tree in a
19     network.  In the case of a labeled tree, the labels are assigned by
20     the controllers either from the controllers' local label spaces, or
21     from a common Segment Routing Global Block (SRGB), or from each
22     router's Segment Routing Local Block (SRLB) that the controllers learn.
23     In case of labeled unidirectional tree and label allocation from the
24     common SRGB or from the controllers' local spaces, a single common
25     label can be used for all routers on the tree to send and receive
26     traffic with.  Since the controllers calculate the trees, they can
27     use sophisticated algorithms and constraints to achieve traffic
28     engineering.

30  Requirements Language

32     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
33     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
34     "OPTIONAL" in this document are to be interpreted as described in BCP
35     14 [RFC2119] [RFC8174] when, and only when, they appear in all
36     capitals, as shown here.

38  Status of This Memo

40     This Internet-Draft is submitted in full conformance with the
41     provisions of BCP 78 and BCP 79.

43     Internet-Drafts are working documents of the Internet Engineering
44     Task Force (IETF).  Note that other groups may also distribute
45     working documents as Internet-Drafts.  The list of current
46     Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

48     Internet-Drafts are draft documents valid for a maximum of six months
49     and may be updated, replaced, or obsoleted by other documents at any
50     time.  It is inappropriate to use Internet-Drafts as reference
51     material or to cite them other than as "work in progress."

53     This Internet-Draft will expire on March 26, 2021.

55  Copyright Notice

57     Copyright (c) 2020 IETF Trust and the persons identified as the
58     document authors.  All rights reserved.

60     This document is subject to BCP 78 and the IETF Trust's Legal
61     Provisions Relating to IETF Documents
62     (https://trustee.ietf.org/license-info) in effect on the date of
63     publication of this document.  Please review these documents
64     carefully, as they describe your rights and restrictions with respect
65     to this document.
       Code Components extracted from this document must
66     include Simplified BSD License text as described in Section 4.e of
67     the Trust Legal Provisions and are provided without warranty as
68     described in the Simplified BSD License.

70  Table of Contents

72     1.  Overview
73       1.1.  Introduction
74       1.2.  Resilience
75       1.3.  Signaling
76       1.4.  Label Allocation
77         1.4.1.  Using a Common per-tree Label for All Routers
78         1.4.2.  Upstream-assignment from Controller's Local Label
79                 Space
80       1.5.  Determining Root/Leaves
81         1.5.1.  PIM-SSM/Bidir or mLDP
82         1.5.2.  PIM ASM
83       1.6.  Multiple Domains
84       1.7.  SR-P2MP
85     2.  Specification
86       2.1.  Enhancements to TEA
87         2.1.1.  Any-Encapsulation Tunnel
88         2.1.2.  Load-balancing Tunnel
89         2.1.3.  Receiving MPLS Label Stack
90         2.1.4.  RPF Sub-TLV
91         2.1.5.  Tree Label Stack sub-TLV
92         2.1.6.  Backup Tunnel sub-TLV
93       2.2.  Context Label TLV in BGP-LS Node Attribute
94       2.3.  SR P2MP Signaling
95         2.3.1.  S-PMSI A-D Route for SR P2MP
96         2.3.2.  BGP Community Container for SR P2MP Policy
97         2.3.3.  SR Policy Tunnel Type
98     3.
Procedures
99     4.  Security Considerations
100    5.  IANA Considerations
101    6.  Acknowledgements
102    7.  References
103      7.1.  Normative References
104      7.2.  Informative References
105    Authors' Addresses

107 1.  Overview

109 1.1.  Introduction

111    [I-D.ietf-bess-bgp-multicast] describes a way to use BGP as a
112    replacement signaling for PIM [RFC7761] or mLDP [RFC6388].  The BGP-
113    based multicast signaling described there provides a mechanism for
114    setting up both (s,g)/(*,g) multicast trees (as PIM does, but
115    optionally with labels) and labeled (MPLS) multicast tunnels (as mLDP
116    does).  Each router on a tree performs essentially the same
117    procedures as it would perform if using PIM or mLDP, but all the
118    inter-router signaling is done using BGP.

120    These procedures allow the routers to set up a separate tree for each
121    individual multicast (x,g) flow where the 'x' could be either 's' or
122    '*', but they also allow the routers to set up trees that are used
123    for more than one flow.  In the latter case, the trees are often
124    referred to as "multicast tunnels" or "multipoint tunnels", and
125    specifically in this document they are mLDP tunnels (except that they
126    are set up with BGP signaling).  While the trees do not actually have
127    to be restricted to mLDP tunnels, the mLDP FEC is conveniently borrowed
128    to identify the tunnel.  In the rest of the document, the terms tree
129    and tunnel are used interchangeably.

131    The trees/tunnels are set up using the "receiver-initiated join"
132    technique of PIM/mLDP, hop by hop from downstream routers towards the
133    root.
       The BGP messages are either sent hop by hop between downstream
134    routers and their upstream neighbors, or can be reflected by Route
135    Reflectors (RRs).

137    As an alternative to each hop independently determining its upstream
138    router and signaling upstream towards the root (following the PIM/mLDP
139    model), the entire tree can be calculated by a centralized
140    controller, and the signaling can be entirely done from the
141    controller, using the same BGP messages as defined in
142    [I-D.ietf-bess-bgp-multicast].  For that, some additional procedures
143    and optimizations are specified in this document.

145    While it is outside the scope of this document, signaling from the
146    controllers could be done via other means as well, like Netconf or
147    any other SDN methods.

149 1.2.  Resilience

151    Each router could establish direct BGP sessions with one or more
152    controllers, or it could establish BGP sessions with RRs that in turn
153    peer with controllers.  For the same tree/tunnel, each controller may
154    independently calculate the tree/tunnel and signal the routers on the
155    tree/tunnel using MCAST-TREE Leaf A-D routes
156    [I-D.ietf-bess-bgp-multicast].  How the tree/tunnel roots/leaves are
157    discovered and how the calculation is done are outside the scope of
158    this document.

160    On each router, BGP route selection rules will lead to one
161    controller's route for the tree/tunnel being selected as the active
162    route and used for setting up forwarding state.  As long as all the
163    routers on a tree/tunnel consistently pick the same controller's
164    routes for the tree/tunnel, the setup should be consistent.  If the
165    tree/tunnel is labeled, different labels will be used from different
166    controllers so there is no traffic loop issue even if the routers do
167    not consistently select the same controller's routes.
168    In the unlabeled case, to ensure consistency, the selection SHOULD be
169    solely based on the identifier of the controller, which could be
170    carried in an Address Specific Extended Community (EC).

172    Another consistency issue is when a bidirectional tree/tunnel needs
173    to be re-routed.  Because this is no longer triggered hop-by-hop from
174    downstream to upstream, it is possible that the upstream change
175    happens before the downstream change, causing a traffic loop.  In the
176    unlabeled case, there is no good solution (other than that the
177    controller issues the upstream change only after it gets an
178    acknowledgement from downstream).  In the labeled case, as long as a
179    new label is used there should be no problem.

181    Besides the traffic loop issue, there could be transient traffic loss
182    before the forwarding state of both the upstream and the downstream is
183    updated.  This could be mitigated if the upstream keeps sending
184    traffic on the old path (in addition to the new path) and the
185    downstream keeps accepting traffic on the old path (but not on the new
186    path) for some time.  When the downstream switches to the new path is
187    a local matter - it could be data driven (e.g., after traffic
188    arrives on the new path) or timer driven.

190    For each tree, multiple disjoint instances could be calculated and
191    signaled for live-live protection.  Different labels are used for
192    different instances, so that the leaves can differentiate incoming
193    traffic on different instances.  As far as transit routers are
194    concerned, the instances are just independent.  Note that the two
195    instances are not expected to share common transit routers (it is
196    otherwise outside the scope of this document/revision).

198 1.3.  Signaling

200    Each router only receives Leaf A-D routes from the controllers but
201    does not originate or re-advertise S-PMSI/Leaf A-D routes.
202    The re-advertisement of a received route can be blocked based on the
203    fact that a configured import RT matches the RT of the route, which
204    indicates that this router is the target and consumer of the route,
205    hence it should not be re-advertised further.  The routes include
206    the forwarding information in the form of Tunnel Encapsulation
207    Attributes (TEA) [I-D.ietf-idr-tunnel-encaps], with enhancements
208    specified in this document.

210    Suppose that for a particular tree, there are two downstream routers
211    D1 and D2 for a particular upstream router U.  A controller C may
212    send two Leaf A-D routes to U, as if the two routes were originated
213    by D1 and D2 but reflected by the controller.  Alternatively, C could
214    just send one route to U, with the Upstream Router's IP Address field
215    set to U's IP address and the TEA specifying both of U's downstreams
216    and its upstream (see Section 2.1.4).  In this case, the Originating
217    Router's Address field of the Leaf A-D route is set to the
218    controller's address.  Note that for a TEA attached to a unicast
219    NLRI, only one of the tunnels in a TEA is used for forwarding a
220    particular packet, while all the tunnels in a TEA are used to reach
221    multiple endpoints when it is attached to a multicast NLRI.

223    Note that, in case of labeled trees, the (x,g) or mLDP FEC signaling
224    is actually not needed on transit routers but only on the tunnel
225    root/leaves.  However, for consistency, the same signaling is used
226    for all routers.

228 1.4.  Label Allocation

230    In the case of labeled multicast signaled hop by hop towards the
231    root, whether it's (x,g) multicast or "mLDP" tunnel, labels are
232    assigned by a downstream router and advertised to its upstream router
233    (from traffic direction point of view).
       In the case of controller
234    based signaling, routers do not originate tree join (S-PMSI/Leaf A-D)
235    routes anymore, so the controllers have to assign labels on behalf of
236    routers, and there are three options for label assignment:

238    o  From each router's SRLB that the controller learns

240    o  From the common SRGB that the controller learns

241    o  From the controller's local label space

243    Assignment from each router's SRLB is no different from each router
244    assigning labels from its own local label space in the hop-by-hop
245    signaling case.  The assignments for a router are independent of
246    assignments for another router, even for the same tree.

248    Assignment from the controller's local label space is upstream-
249    assigned [RFC5331].  It is used if the controller does not learn the
250    common SRGB or each router's SRLB.  Assignment from the SRGB
251    [RFC8402] is only meaningful if all SRGBs are the same and a single
252    common label is used for all the routers on a tree in case of
253    unidirectional tree/tunnel (Section 1.4.1).  Otherwise, assignment
254    from SRLB is preferred.

256    The choice of which of the options to use depends on many factors.
257    An operator may want to use a single common label per tree for ease
258    of monitoring and debugging, but that requires explicit RPF checking
259    and either SRGB or upstream assigned labels, which may not be
260    supported due to either software or hardware limitations (e.g.,
261    label imposition/disposition limits).  In an SR network, a good
262    choice is assignment from the common SRGB if a single common label
263    per unidirectional tree is required, or from the SRLB otherwise,
264    because neither requires support for context label spaces.

266 1.4.1.  Using a Common per-tree Label for All Routers

268    MPLS labels only have local significance.
       For an LSP that goes
269    through a series of routers, each router allocates a label
270    independently and it swaps the incoming label (that it advertised to
271    its upstream) to an outgoing label (that it received from its
272    downstream) when it forwards a labeled packet.  Even if the incoming
273    and outgoing labels happen to be the same on a particular router,
274    that is just incidental.

276    With Segment Routing, it is becoming a common practice that all
277    routers use the same SRGB so that a SID maps to the same label on all
278    routers.  This makes it easier for operators to monitor and debug
279    their network.  The same concept applies to multicast trees as well -
280    a common per-tree label is used for a router to receive traffic from
281    its upstream neighbor and replicate traffic to all its downstream
282    neighbors.

284    However, a common per-tree label can only be used for unidirectional
285    trees.  Additionally, it requires each router to do an explicit RPF
286    check, so that only packets from its expected upstream neighbor are
287    accepted.  Otherwise, traffic loops may form during topology changes,
288    because the forwarding state update is no longer ordered.

290    Traditionally, P2MP MPLS forwarding does not require an explicit RPF
291    check as a downstream router advertises a label only to its upstream
292    router and all traffic with that incoming label is presumed to be
293    from the upstream router and accepted.  When a downstream router
294    switches to a different upstream router a different label will be
295    advertised, so it can determine if traffic is from its expected
296    upstream neighbor purely based on the label.  Now with a single
297    common label used for all routers on a tree to send and receive
298    traffic with, a router can no longer determine if the traffic is from
299    its expected neighbor just based on that common tree label.
300    Therefore, an explicit RPF check is needed.
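
The reasoning above can be illustrated with a minimal sketch. It is not
part of the specification: the Packet type, its "sender" field (a
stand-in for whatever identifies the sending neighbor), and the label
values are all assumptions chosen for illustration. The sketch models a
downstream router D that re-routes from upstream U1 to upstream U2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    sender: str  # hypothetical: identity of the neighbor that sent the packet
    label: int   # incoming tree label


def accept_by_label(pkt: Packet, expected_label: int) -> bool:
    """Traditional P2MP MPLS: the incoming label alone decides acceptance."""
    return pkt.label == expected_label


def accept_with_rpf(pkt: Packet, expected_label: int, expected_upstream: str) -> bool:
    """Common-label case: the label must match AND the sender must be
    the expected upstream neighbor (explicit RPF check)."""
    return pkt.label == expected_label and pkt.sender == expected_upstream


# Per-upstream labels: D advertised 1001 to U1, then 1002 to U2 after re-routing.
stale = Packet(sender="U1", label=1001)
fresh = Packet(sender="U2", label=1002)
per_upstream = (accept_by_label(stale, 1002), accept_by_label(fresh, 1002))

# Common per-tree label: both U1 and U2 send with the same label.
COMMON = 5000
stale_c = Packet(sender="U1", label=COMMON)
fresh_c = Packet(sender="U2", label=COMMON)
label_only = (accept_by_label(stale_c, COMMON), accept_by_label(fresh_c, COMMON))
with_rpf = (accept_with_rpf(stale_c, COMMON, "U2"),
            accept_with_rpf(fresh_c, COMMON, "U2"))
```

With per-upstream labels the stale packet from U1 is rejected by the
label check alone. With the common label, the label check accepts
traffic from both neighbors, and only the additional sender check
rejects the stale packet.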
       Instead of interface based
301    RPF checking as in the PIM case, neighbor based RPF checking is used -
302    a label identifying the upstream neighbor precedes the tree label and
303    the receiving router checks if that preceding neighbor label matches
304    its expected upstream neighbor.  Notice that this is similar to
305    what's described in Section "9.1.1 Discarding Packets from Wrong PE"
306    of [RFC6513] (an egress PE discards traffic sent from a wrong ingress
307    PE).  The only difference is one is used for label based forwarding
308    and the other is used for (s,g) based forwarding.  [note: for
309    bidirectional trees, we may be able to use two labels per tree - one
310    for upstream traffic and one for downstream traffic.  This needs
311    further verification].

313    Both the common per-tree label and the neighbor label are allocated
314    either from the common SRGB or from the controller's local label
315    space.  In the latter case, an additional label identifying the
316    controller's label space is needed, as described in the following
317    section.

319 1.4.2.  Upstream-assignment from Controller's Local Label Space

321    In this case, in the multicast packet's label stack the tree label and
322    upstream neighbor label (if used in case of single common-label per
323    tree) are preceded by a downstream-assigned "context label".  The
324    context label identifies a context-specific label space (the
325    controller's local label space), and the upstream-assigned label that
326    follows it is looked up in that space.

328    This specification requires that, in case of upstream-assignment from
329    a controller's local label space, each router D assign,
330    corresponding to each controller C, a context label that identifies
331    the upstream-assigned label space used by that controller.  This
332    label, call it Lc-D, is communicated by D to C via BGP-LS [RFC7752].

334    Suppose a controller C is setting up unidirectional tree T.
       It assigns
335    that tree the label Lt, and assigns label Lu to identify router U
336    which is the upstream of router D on tree T.  Controller C needs to
337    tell U: "to send a packet on the given tree/tunnel, one of the things
338    you have to do is push Lt onto the packet's label stack, then push Lu,
339    then push Lc-D onto the packet's label stack, then unicast the packet
340    to D".  Controller C also needs to inform router D of the
341    correspondence between the label stack (Lc-D, Lu, Lt) and tree T.

343    To achieve that, when C sends a Leaf A-D route, for each tunnel in
344    the TEA, it includes a label stack Sub-TLV
345    [I-D.ietf-idr-tunnel-encaps], with the outer label being the context
346    label Lc-D (received by the controller from the corresponding
347    downstream), the next label being the upstream neighbor label Lu, and
348    the inner label being the label Lt assigned by the controller for the
349    tree.  The router receiving the route will use the label stacks to
350    send traffic to its downstreams.

352    For C to signal the expected label stack for D to receive traffic
353    with, we overload a tunnel TLV in the TEA of the Leaf A-D route sent
354    to D - if the tunnel TLV has an RPF sub-TLV (Section 2.1.4), then it
355    indicates that this is actually for receiving traffic from the
356    upstream.

358 1.5.  Determining Root/Leaves

360    For the controller to calculate a tree, it needs to determine the
361    root and leaves of the tree.  This may be based on provisioning
362    (static or dynamically programmed), or based on BGP signaling using
363    the BGP multicast messages defined in [I-D.ietf-bess-bgp-multicast],
364    as described in the following two sections.

366    In both cases, the BGP updates are targeted at the controller, via an
367    address specific Route Target with Global Administration Field set to
368    the controller's address and the Local Administration Field set to 0,
369    or a value pre-assigned to identify a VPN.

371 1.5.1.
PIM-SSM/Bidir or mLDP

373    In this case, the PIM Last Hop Routers (LHRs) with interested
374    receivers or mLDP tunnel leaves encode a Leaf A-D route with the
375    Upstream Router's IP Address field set to the controller's address
376    and the Originating Router's IP Address set to the address of the LHR
377    or the P2MP tunnel leaf.  The encoded PIM SSM source or mLDP FEC
378    provides root information and the Originating Router's IP Address
379    provides leaf information.

381 1.5.2.  PIM ASM

383    In this case, the First Hop Routers (FHRs) originate Source Active
384    routes which provide root information, and the LHRs originate Leaf
385    A-D routes, encoded as in the PIM-SSM case except that it is (*,G)
386    instead of (S,G).  The Leaf A-D routes provide leaf information.

388 1.6.  Multiple Domains

390    An end to end multicast tree may span multiple routing domains, and
391    the setup of the tree in each domain may be done differently as
392    specified in [I-D.ietf-bess-bgp-multicast].  This section discusses a
393    few aspects specific to controller signaling.

395    Consider two adjacent domains each with its own controller in the
396    following configuration where router B is an upstream node of C for a
397    multicast tree:

399                         |
400                domain 1 | domain 2
401                         |
402                ctrlr1   |   ctrlr2
403                  /\     |     /\
404                 /  \    |    /  \
405                /    \   |   /    \
406               A--...-B--|--C--...-D
407                         |

409    In the case of native (un-labeled) IP multicast, nothing special is
410    needed.  Controller 1 signals B to send traffic out of the B-C link
411    while Controller 2 signals C to accept traffic on the B-C link.

413    In the case of labeled IP multicast or mLDP tunnel, the controllers
414    may be able to coordinate their actions such that Controller 1
415    signals B to send traffic out of the B-C link with label X while
416    Controller 2 signals C to accept traffic with the same label X on the
417    B-C link.
       If the coordination is not possible, then C needs to use
418    hop-by-hop BGP signaling to signal towards B, as specified in
419    [I-D.ietf-bess-bgp-multicast].

421    The configuration could also be as follows, where router B borders
422    both domain 1 and domain 2 and is controlled by both controllers:

424                        |
425               domain 1 | domain 2
426                        |
427                 ctrlr1 | ctrlr2
428                   /\   |   /\
429                  /  \  |  /  \
430                 /    \ | /    \
431                /      \|/      \
432               A--...---B--...---C
433                        |

435    As discussed in Section 1.2, when B receives signaling from both
436    Controller 1 and Controller 2, only one of the routes would be
437    selected as the best route and used for programming the forwarding
438    state of the corresponding segment.  For B to stitch the two segments
439    together, B is expected to know by provisioning that it is a
440    border router so that B will look for the other segment (represented
441    by the signaling from the other controller) and stitch the two
442    together.

444 1.7.  SR-P2MP

446    [I-D.voyer-pim-sr-p2mp-policy] describes an architecture to construct
447    a Point-to-Multipoint (P2MP) tree to deliver Multi-point services in
448    a Segment Routing domain.  An SR P2MP tree is constructed by
449    stitching together a set of Replication Segments that are specified
450    in [I-D.voyer-spring-sr-replication-segment].  An SR Point-to-
451    Multipoint (SR P2MP) Policy is used to define and instantiate a P2MP
452    tree which is computed by a controller.

454    An SR P2MP tree is no different from an mLDP tunnel in the MPLS
455    forwarding plane.  The difference is in the control plane - instead of
456    hop-by-hop mLDP signaling from leaves towards the root, to set up SR
457    P2MP trees controllers program forwarding state (referred to as
458    Replication Segments) to the root, leaves, and intermediate
459    replication points using Netconf, PCEP, BGP or any other reasonable
460    signaling/programming methods.
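
As a rough illustration of the controller-driven model described above,
the sketch below derives per-node replication state from a tree that a
controller has already computed. It is only a toy model: the function
name, the edge-list input, and the node names are assumptions, and real
Replication Segments carry SIDs and other attributes that are omitted
here.

```python
from collections import defaultdict


def replication_segments(root, edges):
    """Return {node: sorted list of downstream nodes} for every node on the tree.

    `edges` is a list of (upstream, downstream) pairs computed by the
    controller; leaves end up with an empty downstream list.
    """
    children = defaultdict(list)
    nodes = {root}
    for upstream, downstream in edges:
        children[upstream].append(downstream)
        nodes.update((upstream, downstream))
    return {node: sorted(children[node]) for node in sorted(nodes)}


# Hypothetical tree: root R replicates to A and L3; A replicates to L1 and L2.
tree_edges = [("R", "A"), ("A", "L1"), ("A", "L2"), ("R", "L3")]
segments = replication_segments("R", tree_edges)
```

In this model the controller would program one entry of `segments` to
each node (root, leaves, and intermediate replication points), rather
than having each node learn its downstreams through hop-by-hop joins.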
462    Procedures in this document can be used for controllers to set up SR
463    P2MP trees with just an additional S-PMSI route type.

465    If/once the SR Replication Segment is extended to be bidirectional,
466    and SR MP2MP is introduced, the same procedures in this document
467    would apply to SR MP2MP as well.

469 2.  Specification

471 2.1.  Enhancements to TEA

473    This document specifies two new Tunnel Types and four new sub-TLVs.
474    The type codes will be assigned by IANA from the "BGP Tunnel
475    Encapsulation Attribute Tunnel Types" registry.

477 2.1.1.  Any-Encapsulation Tunnel

479    When a multicast packet needs to be sent from an upstream node to a
480    downstream node, it may not matter how it is sent - natively when the
481    two nodes are directly connected or tunneled otherwise.  In case of
482    tunneling, it may not matter what kind of tunnel is used - MPLS, GRE,
483    IPinIP, or any other type.

485    To support this, an "Any-Encapsulation" tunnel type is defined.  This
486    tunnel MUST have a Tunnel Endpoint Sub-TLV and SHOULD NOT have any
487    other Sub-TLVs.  The Tunnel Endpoint Sub-TLV specifies an IP address,
488    which could be any of the following:

490    o  An interface's local address - when a packet needs to be sent out
491       of the corresponding interface natively.  On a LAN, a multicast MAC
492       address MUST be used.

494    o  A directly connected neighbor's interface address - when a packet
495       needs to be unicast to the address natively.

497    o  An address that is not directly connected - when a packet needs to
498       be tunneled to the address (any tunnel type/instance can be used).

500 2.1.2.  Load-balancing Tunnel

502    Consider that a multicast packet needs to be sent to a downstream
503    node, which could be reached via four paths P1~P4.  If it does not
504    matter which path is taken, an "Any-Encapsulation" tunnel with the
505    Tunnel Endpoint Sub-TLV specifying the downstream node's loopback
506    address works well.
       If the controller wants to specify that only
507    P1~P2 should be used, then a "Load-balancing" tunnel needs to be
508    used, listing P1 and P2 as member tunnels of the "Load-balancing"
509    tunnel.

511    A load-balancing tunnel has one "Member Tunnels" Sub-TLV defined in
512    this document.  The Sub-TLV is a list of tunnels, each specifying a
513    way to reach the downstream.  A packet will be sent out of one of the
514    tunnels listed in the Member Tunnels Sub-TLV of the load-balancing
515    tunnel.

517 2.1.3.  Receiving MPLS Label Stack

519    While [I-D.ietf-bess-bgp-multicast] uses S-PMSI A-D routes to signal
520    forwarding information for MP2MP upstream traffic, when controller
521    signaling is used, a single Leaf A-D route is used for both upstream
522    and downstream traffic.  Since different upstream and downstream
523    labels need to be used, a new "Receiving MPLS Label Stack" of type
524    TBD is added as a tunnel sub-TLV in addition to the existing MPLS
525    Label Stack sub-TLV.  Other than the type difference, the two are
526    encoded the same way.

528    The Receiving MPLS Label Stack sub-TLV is added to each downstream
529    tunnel in the TEA of the Leaf A-D route for an MP2MP tunnel to specify
530    the forwarding information for upstream traffic from the
531    corresponding downstream node.  A label stack instead of a single
532    label is used because of the need for neighbor based RPF check, as
533    further explained in the following section.

535    The Receiving MPLS Label Stack sub-TLV is also used for downstream
536    traffic from the upstream for both P2MP and MP2MP, as specified
537    below.

539 2.1.4.  RPF Sub-TLV

541    The RPF sub-TLV has a type to be allocated by IANA and a one-octet
542    length.  The length is 0 currently, but if necessary in the future,
543    sub-sub-TLVs could be placed in its value part.  If the RPF sub-TLV
544    appears in a tunnel, it indicates that the "tunnel" is for the
545    upstream node instead of a downstream node.
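
A receiving router could separate the tunnels of a TEA by direction
along the lines of the sketch below. This is an illustration only: the
Tunnel class, the dictionary representation of sub-TLVs, and the "rpf"
key are assumptions (the actual sub-TLV type code is still to be
allocated), and no wire-format parsing is shown.

```python
from dataclasses import dataclass, field

RPF = "rpf"  # hypothetical key; the real RPF sub-TLV type code is TBD


@dataclass
class Tunnel:
    endpoint: str
    sub_tlvs: dict = field(default_factory=dict)  # sub-TLV name -> value bytes


def split_by_direction(tunnels):
    """Tunnels carrying an RPF sub-TLV describe the upstream node;
    all other tunnels describe downstream nodes."""
    upstream = [t for t in tunnels if RPF in t.sub_tlvs]
    downstream = [t for t in tunnels if RPF not in t.sub_tlvs]
    return upstream, downstream


# A TEA for router U's neighbors: one upstream tunnel and two downstream tunnels.
# The RPF sub-TLV currently has a zero-length value, modeled as empty bytes.
tea = [Tunnel("U", {RPF: b""}), Tunnel("D1"), Tunnel("D2")]
up, down = split_by_direction(tea)
```

Keying the decision on the mere presence of the sub-TLV mirrors the
zero-length encoding: the sub-TLV acts as a flag, and any future
sub-sub-TLVs would not change the classification.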
       Such a tunnel contains a
546    Receiving MPLS Label Stack sub-TLV for downstream traffic from the
547    upstream node, and in case of MP2MP it also contains a regular MPLS
548    Label Stack sub-TLV for upstream traffic to the upstream node.

550    The innermost label in the Receiving MPLS Label Stack is the
551    incoming label identifying the tree (for comparison, the innermost
552    label for a regular MPLS Label Stack is the outgoing label).  If the
553    Receiving MPLS Label Stack sub-TLV has more than one label, the
554    second innermost label in the stack identifies the expected upstream
555    neighbor and explicit RPF checking needs to be set up for the tree
556    label accordingly.

558 2.1.5.  Tree Label Stack sub-TLV

560    The MPLS Label Stack sub-TLV can be used to specify the complete
561    label stack used to send traffic, with the stack including both a
562    transport label (stack) and label(s) that identify the (tree,
563    neighbor) to the downstream node.  There are cases where the
564    controller only wants to specify the tree-identifying labels but
565    leave the transport details to the router itself.  For example, the
566    router could locally determine a transport label (stack) and combine
567    it with the tree-identifying labels signaled from the controller to
568    get the complete outgoing label stack.

570    For that purpose, a new Tree Label Stack sub-TLV is defined, with a
571    one-octet length field.  The value field contains a label stack with
572    the same encoding as the value part of the MPLS Label Stack sub-TLV,
573    but the sub-TLV has a different type.  A stack is specified because it
574    may take up to three labels (see Section 1.4):

576    o  If different nodes use different labels (allocated from the common
577       SRGB or the node's SRLB) for a (tree, neighbor) tuple, only a
578       single label is in the stack.  This is similar to the current mLDP
579       hop-by-hop signaling case.
581    o  If different nodes use the same tree label, then an additional
582       neighbor-identifying label is needed in front of the tree label.

584    o  For the previous bullet, if the neighbor-identifying label is
585       allocated from the controller's local label space, then an
586       additional context label is needed in front of the neighbor label.

588 2.1.6.  Backup Tunnel sub-TLV

590    The Backup Tunnel sub-TLV is used to specify the backup paths for the
591    tunnel.  The length field is two octets.  The value part encodes a
592    one-octet flags field and a variable length Tunnel Encapsulation
593    Attribute.  If the tunnel goes down, traffic that is normally sent out
594    of the tunnel is fast rerouted to the tunnels listed in the encoded TEA.

596           +--------------------------------+
597           |  Sub-TLV Type (1 Octet, TBD)   |
598           +--------------------------------+
599           |  Sub-TLV Length (2 Octets)     |
600           +--------------------------------+
601           |  P |  rest of 1 Octet Flags    |
602           +--------------------------------+
603           |  Backup TEA (variable length)  |
604           +--------------------------------+

606    The backup tunnels can go to the same node that the original tunnel
607    reaches or to different nodes.

609    If the tunnel carries an RPF sub-TLV and a Backup Tunnel sub-TLV, then
610    traffic arriving both on the original tunnel and on the tunnels
611    encoded in the Backup Tunnel sub-TLV's TEA can be accepted, if the
612    Parallel (P-)bit in the flags field is set.  If the P-bit is not set,
613    then traffic arriving on the backup tunnel is accepted only if the
614    router has switched to receiving on the backup tunnel (this is the
615    equivalent of PIM/mLDP MoFRR).

617 2.2.
Context Label TLV in BGP-LS Node Attribute 619 For a router to signal the context label that it assigns for a 620 controller (or any label allocator that assigns labels - from its 621 local label space -- that will be received by this router), a new 622 BGP-LS Node Attribute TLV is defined: 624 0 1 2 3 625 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 626 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 627 | Type | Length | 628 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 629 | Context Label | 630 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 631 | IPv4/v6 Address of Label Space Owner | 632 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 634 The Length field implies the type of the address. Multiple Context 635 Label TLVs may be included in a Node Attribute, one for each label 636 space owner. 638 An as example, a controller with address 11.11.11.11 allocates label 639 200 from its own label space, and router A assigns label 100 to 640 identify this controller's label space. The router includes the 641 Context Label TLV (100, 11.11.11.11) in its BGP-LS Node Attribute and 642 the controller instructs router B to send traffic to router A with a 643 label stack (100, 200), and router A uses label 100 to determine the 644 Label FIB in which to look up label 200. 646 2.3. SR P2MP Signaling 648 An SR P2MP policy for an SR P2MP tree is identified by a (Root, Tree- 649 id) tuple. It has a set of leaves and set of Candidate Paths (CPs). 650 The policy is instantiated on the root of the tree, with 651 corresponding Replication Segments - identified by (Root, Tree-id, 652 Tree-Node-id) - instantiated on the tree nodes (root, leaves, and 653 intermediate replication points). The Candidate Path is implicitly 654 identified by the Route Distinguisher. 656 2.3.1. 
S-PMSI A-D Route for SR P2MP 658 With BGP signaled IP multicast trees and mLDP tunnels, the tree/ 659 tunnel identification is encoded in the NLRI of S-PMSI A-D routes and 660 corresponding Leaf A-D routes. The signaling sets up forwarding 661 state on each node of the tree, so the NLRI also contains the 662 identification of the node in the "Upstream Router's IP Address" 663 field. 665 For SR P2MP, forwarding state are represented as Replication Segments 666 and are signaled from controllers to tree nodes. A Replication 667 Segment is identified in a new type of S-PMSI A-D route and 668 corresponding Leaf A-D route (note that the "Leaf" term here does not 669 refer to tree leaves): 671 +- +-----------------------------------+ 672 | | Route Type - 4 (Leaf A-D) | 673 | +-----------------------------------+ 674 | | Length (1 octet) | 675 | L +- +-----------------------------------+ --+ 676 L | E | | Route Type - 0x83 (SR P2MP S-PMSI)| | S 677 E | A | +-----------------------------------+ | | 678 A | F | | Length (1 octet) | | P 679 F | | +-----------------------------------+ | M 680 | R | | RD (8 octets) | | S 681 | O | +-----------------------------------+ | I 682 | U | | Root ID (4 or 16 octets) | | 683 N | T | +-----------------------------------+ | N 684 L | E | | Tree ID (4 octets) | | L 685 R | | +-----------------------------------+ | R 686 I | K | | Upstream Router's IP Address | | I 687 | E | +-----------------------------------+ --+ 688 | Y | | Originating Router's IP Address | 689 +- +- +-----------------------------------+ 691 Leaf A-D route for SR Replication Segment 693 2.3.2. 
BGP Community Container for SR P2MP Policy 695 The Leaf A-D route for Replication Segments signaled to the root is 696 also used to signal (parts of) the SR P2MP Policy - the policy name, 697 the set of leaves (optional, for informational purpose), preference 698 of the CP and other information are all encoded in a newly defined 699 BGP Community Container (BCC) [I-D.ietf-idr-wide-bgp-communities] 700 called SR P2MP Policy BCC. 702 The SR P2MP Policy BCC has a BGP Community Container type to be 703 assigned by IANA. It is composed of a fixed 4-octet Candidate Path 704 Preference value, optionally followed by TLVs. 706 0 1 2 3 707 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 708 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 709 | Candidate Path Preference | 710 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 711 | | 712 | TLVs (optional) | 713 | | 714 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 716 BGP Community Container for SR P2MP Policy 718 One optional TLV is to enclose the following optional Atoms TLVs that 719 are already defined in [I-D.ietf-idr-wide-bgp-communities]: 721 o An IPv4 or IPv6 Prefix list - for the set of leaves 723 o A UTF-8 string - for the policy name 725 If more information for the policy are needed, more Atoms TLVs or SR 726 P2MP Policy BCC specific TLVs can be defined. 728 The root receives one Leaf A-D route for each Candidate Path of the 729 policy. Only one of the routes need to, though more than one MAY 730 include the above listed optional Atom TLVs in the SR P2MP Policy 731 BCC. 733 2.3.3. SR Policy Tunnel Type 735 The Tunnel Encapsulation Attribute (TEA) attached to Leaf A-D routes 736 encodes all replication branch information. 
   For example, if an SR explicit path is to be used to reach a
   particular downstream node, the TEA will include a tunnel that lists
   the entire label stack for that SR path, plus the label that
   identifies the SR P2MP tree to the downstream node.

   That SR path may have been installed on the node as a unicast SR
   policy with a corresponding Binding SID.  Instead of listing the
   entire label stack in an MPLS tunnel in the TEA, a different tunnel,
   the SR Policy Tunnel [I-D.ietf-idr-segment-routing-te-policy], can be
   used as an alternative.  The tunnel includes a Binding SID sub-TLV,
   an optional endpoint sub-TLV that identifies the downstream node, and
   an optional one-segment segment list that identifies the SR P2MP tree
   to the downstream node.  When a node receives a Leaf A-D route whose
   TEA contains an SR Policy Tunnel without an RPF sub-TLV, the Binding
   SID is used to locate the corresponding outgoing segment lists used
   to reach the downstream node; the tree-identifying segment from the
   optional one-segment segment list is added to the outgoing segment
   lists mapped from the Binding SID to form the entire segment list
   used to send traffic to the downstream node.

   Note that the SR Policy Tunnel was initially defined to instantiate
   an SR policy.  For that use case it provides information associated
   with the policy, e.g., Binding SID, preference, and segment lists.
   The receiving node installs that policy and establishes the mapping
   from the Binding SID to the outgoing segments.  In this document the
   SR Policy Tunnel instead refers to a pre-installed SR policy, so the
   preference and segment lists are not used.

   If a tunnel in the TEA carries an RPF sub-TLV, it is for the upstream
   node.  The tunnel may be an MPLS tunnel in the case of SR-MPLS, and
   the Receiving MPLS Label Stack sub-TLV specifies the incoming label
   stack that identifies the tree and optionally the upstream neighbor.
   Alternatively, for both SR-MPLS and SRv6, an SR Policy Tunnel with
   the RPF sub-TLV can be used, in which the Binding SID sub-TLV is the
   SID for the tree.

   If the node is the root and a Binding SID is allocated by the
   controller, the Binding SID is signaled to the root in a TEA tunnel
   with an RPF sub-TLV as above but without a destination sub-TLV.

3.  Procedures

   Details to be added.  The general idea is described in the
   introduction section.

4.  Security Considerations

   This document does not introduce new security risks.

5.  IANA Considerations

   This document makes the following IANA requests:

   o  Assign "Any-Encapsulation" and "Load-balancing" tunnel types from
      the "BGP Tunnel Encapsulation Attribute Tunnel Types" registry.

   o  Assign "Member Tunnels", "Receiving MPLS Label Stack", "Tree Label
      Stack" and "RPF" sub-TLV types from the "BGP Tunnel Encapsulation
      Attribute Sub-TLVs" registry.  The "Member Tunnels" sub-TLV has a
      two-octet value length (so the type should be in the 128-255
      range), while the "Receiving MPLS Label Stack", "Tree Label Stack"
      and "RPF" sub-TLVs have a one-octet value length.

   o  Assign a "Context Label TLV" type from the "BGP-LS Node
      Descriptor, Link Descriptor, Prefix Descriptor, and Attribute
      TLVs" registry.

   o  Assign an "S-PMSI A-D Route for SR P2MP" route type from the "BGP
      MCAST-TREE Route Types" registry, with a suggested value of 0x83.

   o  Assign a new BGP Community Container type "SR P2MP Policy", and
      create an "SR P2MP Policy Community Container TLV Registry", with
      an initial entry for "TLV for Atoms".

6.  Acknowledgements

   The authors thank Eric Rosen for his questions, suggestions, and help
   finding solutions to some issues like the neighbor based explicit RPF
   checking.  The authors also thank Lenny Giuliano, Sanoj Vivekanandan
   and IJsbrand Wijnands for their review and comments.

7.  References

7.1.  Normative References

   [I-D.ietf-bess-bgp-multicast]
              Zhang, Z., Giuliano, L., Patel, K., Wijnands, I., Mishra,
              M., and A. Gulko, "BGP Based Multicast",
              draft-ietf-bess-bgp-multicast-02 (work in progress),
              June 2020.

   [I-D.ietf-idr-segment-routing-te-policy]
              Previdi, S., Filsfils, C., Talaulikar, K., Mattes, P.,
              Rosen, E., Jain, D., and S. Lin, "Advertising Segment
              Routing Policies in BGP",
              draft-ietf-idr-segment-routing-te-policy-09 (work in
              progress), May 2020.

   [I-D.ietf-idr-tunnel-encaps]
              Patel, K., Velde, G., Sangli, S., and J. Scudder, "The BGP
              Tunnel Encapsulation Attribute",
              draft-ietf-idr-tunnel-encaps-17 (work in progress),
              July 2020.

   [I-D.ietf-idr-wide-bgp-communities]
              Raszuk, R., Haas, J., Lange, A., Decraene, B., Amante, S.,
              and P. Jakma, "BGP Community Container Attribute",
              draft-ietf-idr-wide-bgp-communities-05 (work in progress),
              July 2018.

   [I-D.voyer-pim-sr-p2mp-policy]
              Voyer, D., Filsfils, C., Parekh, R., Bidgoli, H., and Z.
              Zhang, "Segment Routing Point-to-Multipoint Policy",
              draft-voyer-pim-sr-p2mp-policy-02 (work in progress),
              July 2020.

   [I-D.voyer-spring-sr-replication-segment]
              Voyer, D., Filsfils, C., Parekh, R., Bidgoli, H., and Z.
              Zhang, "SR Replication Segment for Multi-point Service
              Delivery", draft-voyer-spring-sr-replication-segment-04
              (work in progress), July 2020.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

7.2.  Informative References

   [RFC6388]  Wijnands, IJ., Ed., Minei, I., Ed., Kompella, K., and B.
              Thomas, "Label Distribution Protocol Extensions for
              Point-to-Multipoint and Multipoint-to-Multipoint Label
              Switched Paths", RFC 6388, DOI 10.17487/RFC6388,
              November 2011.

   [RFC6513]  Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in
              MPLS/BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513,
              February 2012.

   [RFC7761]  Fenner, B., Handley, M., Holbrook, H., Kouvelas, I.,
              Parekh, R., Zhang, Z., and L. Zheng, "Protocol Independent
              Multicast - Sparse Mode (PIM-SM): Protocol Specification
              (Revised)", STD 83, RFC 7761, DOI 10.17487/RFC7761,
              March 2016.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018.

Authors' Addresses

   Zhaohui Zhang
   Juniper Networks

   EMail: zzhang@juniper.net


   Robert Raszuk
   Bloomberg LP

   EMail: robert@raszuk.net


   Dante Pacella
   Verizon

   EMail: dante.j.pacella@verizon.com


   Arkadiy Gulko
   Refinitiv

   EMail: arkadiy.gulko@refinitiv.com