BESS                                                            Z. Zhang
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                               R. Raszuk
Expires: August 10, 2019                                    Bloomberg LP
                                                              D. Pacella
                                                                 Verizon
                                                                A. Gulko
                                                         Thomson Reuters
                                                        February 6, 2019


                Controller Based BGP Multicast Signaling
             draft-zzhang-bess-bgp-multicast-controller-01

Abstract

   This document specifies a way that one or more centralized
   controllers can use BGP to set up a multicast distribution tree in a
   network.  In the case of a labeled tree, the labels are assigned by
   the controllers either from the controllers' local label spaces, from
   a common Segment Routing Global Block (SRGB), or from each router's
   Segment Routing Local Block (SRLB) that the controllers learn.  In
   the case of a labeled unidirectional tree with label allocation from
   the common SRGB or from the controllers' local label spaces, a single
   common label can be used for all routers on the tree to send and
   receive traffic with.
   Since the controllers calculate the trees, they can use sophisticated
   algorithms and constraints to achieve traffic engineering.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 10, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Overview
     1.1.  Introduction
     1.2.  Resilience
     1.3.  Signaling
     1.4.  Label Allocation
       1.4.1.  Using a Common per-tree Label for All Routers
       1.4.2.  Upstream-assignment from Controller's Local Label Space
   2.  Specification
     2.1.  Additional Tunnel Type for TEA
     2.2.  RPF Label Stack Sub-TLV
     2.3.  Context Label Wide Community
     2.4.  Procedures
   3.  Security Considerations
   4.  IANA Considerations
   5.  Acknowledgements
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Overview

1.1.  Introduction

   [I-D.zzhang-bess-bgp-multicast] describes a way to use BGP as a
   replacement signaling protocol for PIM [RFC7761] or mLDP [RFC6388].
   The BGP-based multicast signaling described there provides a
   mechanism for setting up both (s,g)/(*,g) multicast trees (as PIM
   does, but optionally with labels) and labeled (MPLS) multicast
   tunnels (as mLDP does).  Each router on a tree performs essentially
   the same procedures as it would perform if using PIM or mLDP, but all
   the inter-router signaling is done using BGP.
   These procedures allow the routers to set up a separate tree for each
   individual multicast (x,g) flow, where the 'x' could be either 's' or
   '*', but they also allow the routers to set up trees that are used
   for more than one flow.  In the latter case, the trees are often
   referred to as "multicast tunnels" or "multipoint tunnels", and
   specifically in this document they are mLDP tunnels (except that they
   are set up with BGP signaling).  While the mechanism does not have to
   be restricted to mLDP tunnels, the mLDP FEC is conveniently borrowed
   to identify the tunnel.  In the rest of the document, the terms tree
   and tunnel are used interchangeably.

   The trees/tunnels are set up using the "receiver-initiated join"
   technique of PIM/mLDP, hop by hop from downstream routers towards the
   root.  The BGP messages are either sent hop by hop between downstream
   routers and their upstream neighbors, or reflected by Route
   Reflectors (RRs).

   As an alternative to each hop independently determining its upstream
   router and signaling upstream towards the root (following the
   PIM/mLDP model), the entire tree can be calculated by a centralized
   controller, and the signaling can be done entirely from the
   controller, using the same BGP messages as defined in
   [I-D.zzhang-bess-bgp-multicast].  For that, some additional
   procedures and optimizations are specified in this document.

   While it is outside the scope of this document, signaling from the
   controllers could also be done via other means, such as NETCONF or
   other SDN methods.

1.2.  Resilience

   Each router could establish direct BGP sessions with one or more
   controllers, or it could establish BGP sessions with RRs that in turn
   peer with controllers.
   For the same tree/tunnel, each controller may independently calculate
   the tree/tunnel and signal the routers on the tree/tunnel using
   MCAST-TREE S-PMSI/Leaf A-D routes [I-D.zzhang-bess-bgp-multicast].
   How the tree/tunnel roots/leaves are discovered and how the
   calculation is done are outside the scope of this document.

   On each router, BGP route selection rules will lead to one
   controller's route for the tree/tunnel being selected as the active
   route and used for setting up forwarding state.  As long as all the
   routers on a tree/tunnel consistently pick the same controller's
   routes for the tree/tunnel, the setup should be consistent.  If the
   tree/tunnel is labeled, different controllers use different labels,
   so there is no traffic loop issue even if the routers do not
   consistently select the same controller's routes.  In the unlabeled
   case, to ensure consistency, the selection SHOULD be based solely on
   the identifier of the controller, which could be carried in an
   Address Specific Extended Community (EC).

   Another consistency issue arises when a bidirectional tree/tunnel
   needs to be re-routed.  Because re-routing is no longer triggered hop
   by hop from downstream to upstream, the upstream change may happen
   before the downstream change, causing a traffic loop.  In the
   unlabeled case, there is no good solution (other than the controller
   issuing the upstream change only after it gets an acknowledgement
   from the downstream).  In the labeled case, as long as a new label is
   used, there should be no problem.

   Besides the traffic loop issue, there could be transient traffic loss
   before both the upstream's and the downstream's forwarding state are
   updated.  This could be mitigated if the upstream keeps sending
   traffic on the old path (in addition to the new path) and the
   downstream keeps accepting traffic on the old path (but not on the
   new path) for some time.
   It is a local matter when the downstream switches to the new path; it
   could be data driven (e.g., after traffic arrives on the new path) or
   timer driven.

   For each tree, multiple disjoint instances could be calculated and
   signaled for live-live protection.  Different labels are used for
   different instances, so that the leaves can differentiate incoming
   traffic on different instances.  As far as transit routers are
   concerned, the instances are simply independent.  Note that the
   instances are not expected to share common transit routers (the case
   where they do is outside the scope of this revision of the document).

1.3.  Signaling

   Each router only receives S-PMSI/Leaf A-D routes from the controllers
   and does not originate or re-advertise those routes.  The
   re-advertisement of a received route can be blocked when a
   configured import RT matches the RT of the route, which indicates
   that this router is the target and consumer of the route and hence
   that the route should not be re-advertised further.  The routes
   include the outgoing forwarding information in the form of Tunnel
   Encapsulation Attributes (TEA), with optional enhancements specified
   in this document.  In the case of an unlabeled tree, the router
   infers the incoming forwarding information from the Upstream Router's
   IP Address field in the NLRI.

   Suppose that for a particular tree, there are two downstream routers
   D1 and D2 for a particular upstream router U.  A controller C may
   send two Leaf A-D routes to U, as if the two routes were originated
   by D1 and D2 but reflected by the controller.  Alternatively, in the
   case of a labeled tree, C could send just one route to U, with a
   Composite Tunnel in the TEA that specifies both downstreams (in this
   case, the Originating Router's Address field of the Leaf A-D route is
   set to the controller's address).
   The tunnel in a TEA or Composite Tunnel is of type "MPLS
   Encapsulation" with a Label Stack Sub-TLV to encode label
   information.

   For comparison, the existing TEA as specified in
   [I-D.ietf-idr-tunnel-encaps] can include multiple tunnels, but only
   one of those is used, while with a Composite Tunnel, traffic is sent
   out of all the enclosed tunnels to reach multiple endpoints.

   Note that, in the case of labeled trees, the (x,g) or mLDP FEC
   signaling is actually not needed by transit routers but only by the
   tunnel root/leaves.  However, for consistency, the same signaling is
   used for all routers.

1.4.  Label Allocation

   In the case of labeled multicast signaled hop by hop towards the
   root, whether it is (x,g) multicast or an "mLDP" tunnel, labels are
   assigned by a downstream router and advertised to its upstream router
   (from the traffic direction point of view).  In the case of
   controller-based signaling, routers no longer originate tree join
   (S-PMSI/Leaf A-D) routes, so the controllers have to assign labels on
   behalf of the routers.  There are three options for label assignment:

   o  From each router's SRLB that the controller learns

   o  From the common SRGB that the controller learns

   o  From the controller's local label space

   Assignment from each router's SRLB is no different from each router
   assigning labels from its own local label space in the hop-by-hop
   signaling case.  The assignments for one router are independent of
   the assignments for another router, even for the same tree.

   Assignment from the controller's local label space is
   upstream-assigned [RFC5331].  It is used if the controller does not
   learn the common SRGB or each router's SRLB.  Assignment from the
   SRGB [RFC8402] is only meaningful if all SRGBs are the same and a
   single common label is used for all the routers on a tree in the case
   of a unidirectional tree/tunnel (Section 1.4.1).
   Otherwise, assignment from the SRLB is preferred.

   The choice of which option to use depends on many factors.  An
   operator may want to use a single common label per tree for ease of
   monitoring and debugging, but that requires explicit RPF checking and
   either SRGB or upstream-assigned labels, which may not be supported
   due to software or hardware limitations (e.g., label
   imposition/disposition limits).  In an SR network, assignment from
   the common SRGB (if a single common label per unidirectional tree is
   required) or otherwise from the SRLB is a good choice, because
   neither requires support for context label spaces.

1.4.1.  Using a Common per-tree Label for All Routers

   MPLS labels only have local significance.  For an LSP that goes
   through a series of routers, each router allocates a label
   independently, and it swaps the incoming label (that it advertised to
   its upstream) to an outgoing label (that it received from its
   downstream) when it forwards a labeled packet.  Even if the incoming
   and outgoing labels happen to be the same on a particular router,
   that is just incidental.

   With Segment Routing, it is becoming a common practice that all
   routers use the same SRGB so that a SID maps to the same label on all
   routers.  This makes it easier for operators to monitor and debug
   their network.  The same concept applies to multicast trees as well:
   a common per-tree label is used for a router to receive traffic from
   its upstream neighbor and to replicate traffic to all its downstream
   neighbors.

   However, a common per-tree label can only be used for unidirectional
   trees.  Additionally, it requires each router to do an explicit RPF
   check, so that only packets from its expected upstream neighbor are
   accepted.  Otherwise, traffic loops may form during topology changes,
   because the forwarding state updates are no longer ordered.
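   The requirement that all routers share the same SRGB can be
   illustrated with a small Python sketch.  It assumes the usual SR-MPLS
   convention in which an index into the SRGB maps to the label (SRGB
   base + index) on each router, and it models the controller's choice
   of a common per-tree label as a hypothetical per-tree index; this
   document does not define such an index.  The function names, router
   names, and label values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical SRGB of one router: (base label, number of labels).
Srgb = Tuple[int, int]


def label_for_index(srgb: Srgb, index: int) -> int:
    """Map an index into the SRGB to the label used on one router."""
    base, size = srgb
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside an SRGB of size {size}")
    return base + index


def common_tree_label(srgbs: Dict[str, Srgb], tree_index: int) -> Optional[int]:
    """Return the single label every router would use for the tree,
    or None if the routers' SRGBs map the index to different labels."""
    labels = {label_for_index(srgb, tree_index) for srgb in srgbs.values()}
    return labels.pop() if len(labels) == 1 else None


same = {"U": (16000, 8000), "D1": (16000, 8000), "D2": (16000, 8000)}
mixed = {"U": (16000, 8000), "D1": (16000, 8000), "D2": (20000, 8000)}
print(common_tree_label(same, 100))   # 16100
print(common_tree_label(mixed, 100))  # None
```

   With identical SRGBs, all three routers use label 16100 for the tree;
   once D2's SRGB differs, no single common label exists, which is the
   case where assignment from each router's SRLB is preferred.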
   Traditionally, P2MP MPLS forwarding does not require an explicit RPF
   check, as a downstream router advertises a label only to its upstream
   router, and all traffic with that incoming label is presumed to be
   from the upstream router and accepted.  When a downstream router
   switches to a different upstream router, a different label will be
   advertised, so it can determine whether traffic is from its expected
   upstream neighbor purely based on the label.  Now, with a single
   common label used for all routers on a tree to send and receive
   traffic with, a router can no longer determine whether the traffic is
   from its expected neighbor based on that common tree label alone.
   Therefore, an explicit RPF check is needed.  Instead of the
   interface-based RPF checking used with PIM, neighbor-based RPF
   checking is used: a label identifying the upstream neighbor precedes
   the tree label, and the receiving router checks whether that
   preceding neighbor label matches its expected upstream neighbor.
   This is similar to what is described in Section 9.1.1 ("Discarding
   Packets from Wrong PE") of [RFC6513], where an egress PE discards
   traffic sent from a wrong ingress PE.  The only difference is that
   the check here is used for label-based forwarding, while the one in
   [RFC6513] is used for (s,g)-based forwarding.  [Note: for
   bidirectional trees, it may be possible to use two labels per tree,
   one for upstream traffic and one for downstream traffic.  This needs
   further verification.]

   Both the common per-tree label and the neighbor label are allocated
   either from the common SRGB or from the controller's local label
   space.  In the latter case, an additional label identifying the
   controller's label space is needed, as described in the following
   section.

1.4.2.  Upstream-assignment from Controller's Local Label Space

   In this case, in the multicast packet's label stack, the tree label
   and the upstream neighbor label (if used, in the case of a single
   common label per tree) are preceded by a downstream-assigned "context
   label".  The context label identifies a context-specific label space
   (the controller's local label space), and the upstream-assigned label
   that follows it is looked up in that space.

   In the case of upstream assignment from a controller's local label
   space, this specification requires each router D to assign,
   corresponding to each controller C, a context label that identifies
   the upstream-assigned label space used by that controller.  This
   label, call it Lc-D, is communicated by D to C.

   Suppose a controller C is setting up unidirectional tree T.  It
   assigns that tree the label Lt, and assigns label Lu to identify
   router U, which is the upstream of router D on tree T.  C needs to
   tell U: "to send a packet on the given tree/tunnel, one of the things
   you have to do is push Lt onto the packet's label stack, then push
   Lu, then push Lc-D onto the packet's label stack, then unicast the
   packet to D".

   Controller C also needs to inform router D of the correspondence
   between <Lc-D, Lu, Lt> and tree T.

   To achieve that, when C sends an S-PMSI/Leaf A-D route, for each
   tunnel in the TEA or in the Composite Tunnel TLV, it includes a Label
   Stack Sub-TLV [I-D.ietf-idr-tunnel-encaps], with the outer label
   being the context label Lc-D (received by the controller from the
   corresponding downstream), the next label being the upstream neighbor
   label Lu, and the inner label being the label Lt assigned by the
   controller for the tree.  The router receiving the route will use the
   label stacks to send traffic to its downstreams.
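   The label stack that U imposes toward D, and the context lookup and
   neighbor-based RPF check that D then performs, can be sketched in
   Python as follows.  This is a minimal model, not a forwarding
   implementation: a label stack is a list with the outermost label
   first, and D's receive state is reduced to two hypothetical
   dictionaries (context label to controller, and (controller, tree
   label) to expected upstream neighbor label, as would be installed
   from an RPF Label Stack Sub-TLV).  All names and label values are
   hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def build_label_stack(lc_d: int, lu: int, lt: int) -> List[int]:
    """Return the stack U imposes toward D, outermost label first.

    U pushes Lt, then Lu, then Lc-D, so Lc-D ends up on top.
    """
    stack: List[int] = []
    for label in (lt, lu, lc_d):  # push order described in the text
        stack.insert(0, label)    # each push places the label on top
    return stack


@dataclass
class Receiver:
    """Hypothetical receive-side state of downstream router D."""
    # Context labels D assigned: context label -> controller (label space).
    context_labels: Dict[int, str] = field(default_factory=dict)
    # (controller, tree label) -> expected upstream neighbor label.
    expected_upstream: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def accept(self, stack: List[int]) -> bool:
        """Context-label lookup followed by a neighbor-based RPF check."""
        if len(stack) < 3:
            return False
        lc, lu, lt = stack[0], stack[1], stack[2]
        controller = self.context_labels.get(lc)
        if controller is None:
            return False  # not a context label that D assigned
        # Lu and Lt are interpreted in the controller's label space.
        expected_lu = self.expected_upstream.get((controller, lt))
        return expected_lu is not None and expected_lu == lu


# Hypothetical label values.
LC_D, LU, LT = 1001, 50, 200
d = Receiver(context_labels={LC_D: "C"}, expected_upstream={("C", LT): LU})
stack = build_label_stack(LC_D, LU, LT)
print(stack)                                      # [1001, 50, 200]
print(d.accept(stack))                            # True
print(d.accept(build_label_stack(LC_D, 51, LT)))  # False: wrong upstream
```

   The tree label alone is the same on every hop, so it is the neighbor
   label Lu, interpreted in the space selected by Lc-D, that lets D
   reject traffic from an unexpected upstream.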
   For C to signal the expected label stack that D receives traffic
   with, we overload a tunnel TLV in either the TEA or the Composite
   Tunnel in the Leaf A-D route sent to D: if the remote endpoint of
   that tunnel TLV matches the Upstream Router field in the Leaf A-D
   route, then the TLV is actually for receiving traffic from the
   upstream.  If a common tree label is used, then the TLV contains a
   variant of the Label Stack Sub-TLV, because D needs to treat the
   second innermost label as the upstream neighbor label and set up
   forwarding state accordingly for the explicit RPF check.  This
   variant is referred to as the RPF Label Stack Sub-TLV (Section 2.2).

   Note that the use of the TEA to specify downstream and upstream
   forwarding information also applies to label assignment from the
   common SRGB or from each router's SRLB, with two differences: the
   context label is not needed in the SRGB/SRLB case, and in the SRLB
   case only a Label Stack Sub-TLV with a single SRLB label is used for
   upstream and downstream forwarding information (no RPF Label Stack
   Sub-TLV is needed).

2.  Specification

2.1.  Additional Tunnel Type for TEA

   This document specifies a Composite Tunnel TLV and a TEA Tunnel TLV.
   The type codes will be assigned by IANA.

   A Tunnel Encapsulation Attribute includes Tunnel TLVs, and a router
   receiving the TEA (associated with a route) selects one of the Tunnel
   TLVs to set up forwarding state; a packet is sent out of only one of
   the tunnels.  To specify that traffic needs to be sent out of
   multiple tunnels, a Composite Tunnel TLV is used.  The value part of
   the TLV includes a list of sub-TLVs, each being a Tunnel TLV.  A
   Composite Tunnel TLV MUST NOT be a sub-TLV of a Composite Tunnel TLV.
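   The difference between an ordinary TEA and a Composite Tunnel TLV,
   together with the nesting rule above, can be sketched in Python as
   follows.  The classes and the outgoing_copies() helper are
   hypothetical, only "MPLS Encapsulation" member tunnels are modeled,
   and the sketch simply uses the first top-level Tunnel TLV, whereas
   the actual selection among the Tunnel TLVs of a TEA follows
   [I-D.ietf-idr-tunnel-encaps].

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class MplsTunnel:
    """Tunnel TLV of type "MPLS Encapsulation" with a Label Stack Sub-TLV."""
    endpoint: str
    labels: List[int]  # outermost label first


@dataclass
class CompositeTunnel:
    """Composite Tunnel TLV: a copy is sent out of every enclosed tunnel."""
    members: List[Union[MplsTunnel, "CompositeTunnel"]] = field(default_factory=list)

    def validate(self) -> None:
        # A Composite Tunnel TLV MUST NOT be a sub-TLV of a Composite Tunnel TLV.
        for member in self.members:
            if isinstance(member, CompositeTunnel):
                raise ValueError("nested Composite Tunnel TLV")


Tunnel = Union[MplsTunnel, CompositeTunnel]


def outgoing_copies(tea: List[Tunnel]) -> List[Tuple[str, List[int]]]:
    """Return (endpoint, label stack) for each copy the router sends.

    Only one top-level Tunnel TLV is used (this sketch takes the first).
    If it is a Composite Tunnel, the packet is replicated to every
    enclosed tunnel; otherwise a single copy is sent.
    """
    if not tea:
        return []
    chosen = tea[0]
    if isinstance(chosen, CompositeTunnel):
        chosen.validate()
        return [(m.endpoint, m.labels) for m in chosen.members]
    return [(chosen.endpoint, chosen.labels)]


to_d1 = MplsTunnel("D1", [300])  # hypothetical labels
to_d2 = MplsTunnel("D2", [301])
print(outgoing_copies([CompositeTunnel([to_d1, to_d2])]))  # [('D1', [300]), ('D2', [301])]
print(outgoing_copies([to_d1, to_d2]))                     # [('D1', [300])]
```

   The same two member tunnels yield two copies when enclosed in a
   Composite Tunnel TLV, but only one copy when listed directly in the
   TEA.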
   Consider a Composite Tunnel TLV whose sub-TLVs specify the tunnels
   used to send traffic to a set of endpoints.  There may be multiple
   ways to reach a particular endpoint, and any one, but only one, of
   them should be used.  For that purpose, a TEA Tunnel TLV (for lack of
   a better name) is used for that endpoint.  The TEA Tunnel TLV
   includes a list of sub-TLVs, each being a Tunnel TLV that specifies
   one way to reach the same endpoint.  This is similar to a Tunnel
   Encapsulation Attribute, hence the name TEA Tunnel TLV.

2.2.  RPF Label Stack Sub-TLV

   This is almost identical to the Label Stack Sub-TLV.  The only
   difference is that the second innermost label in the stack
   identifies the expected upstream neighbor, and explicit RPF checking
   needs to be set up for the tree label accordingly.

2.3.  Context Label Wide Community

   For a router to signal the context label that it assigns for a
   controller (or any label allocator that assigns labels that will be
   seen by this router), it attaches a Context Label Wide Community
   [I-D.ietf-idr-wide-bgp-communities] to the host route for its own
   address used in its BGP session towards the controllers (directly or
   via RRs).  This is a new wide community that specifies the (Label
   Allocator, Context Label) tuple; the exact format will be specified
   in a future revision.

2.4.  Procedures

   Details to be added.  The general idea is described in the Overview
   (Section 1).

3.  Security Considerations

   This document does not introduce new security risks?

4.  IANA Considerations

   To be added.

5.  Acknowledgements

   The authors thank Eric Rosen for his questions, suggestions, and help
   finding solutions to some issues, such as the neighbor-based explicit
   RPF checking.  The authors also thank Lenny Giuliano and IJsbrand
   Wijnands for their review and comments.

6.  References

6.1.  Normative References

   [I-D.ietf-idr-tunnel-encaps]
              Rosen, E., Patel, K., and G. Velde, "The BGP Tunnel
              Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-10
              (work in progress), August 2018.

   [I-D.ietf-idr-wide-bgp-communities]
              Raszuk, R., Haas, J., Lange, A., Decraene, B., Amante, S.,
              and P. Jakma, "BGP Community Container Attribute",
              draft-ietf-idr-wide-bgp-communities-05 (work in progress),
              July 2018.

   [I-D.zzhang-bess-bgp-multicast]
              Zhang, Z., Giuliano, L., Patel, K., Wijnands, I., Mishra,
              M., and A. Gulko, "BGP Based Multicast",
              draft-zzhang-bess-bgp-multicast-02 (work in progress),
              December 2018.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

6.2.  Informative References

   [RFC5331]  Aggarwal, R., Rekhter, Y., and E. Rosen, "MPLS Upstream
              Label Assignment and Context-Specific Label Space",
              RFC 5331, DOI 10.17487/RFC5331, August 2008.

   [RFC6388]  Wijnands, IJ., Ed., Minei, I., Ed., Kompella, K., and B.
              Thomas, "Label Distribution Protocol Extensions for
              Point-to-Multipoint and Multipoint-to-Multipoint Label
              Switched Paths", RFC 6388, DOI 10.17487/RFC6388, November
              2011.

   [RFC6513]  Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in
              MPLS/BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513,
              February 2012.

   [RFC7761]  Fenner, B., Handley, M., Holbrook, H., Kouvelas, I.,
              Parekh, R., Zhang, Z., and L. Zheng, "Protocol Independent
              Multicast - Sparse Mode (PIM-SM): Protocol Specification
              (Revised)", STD 83, RFC 7761, DOI 10.17487/RFC7761, March
              2016.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018.
Authors' Addresses

   Zhaohui Zhang
   Juniper Networks

   EMail: zzhang@juniper.net


   Robert Raszuk
   Bloomberg LP

   EMail: robert@raszuk.net


   Dante Pacella
   Verizon

   EMail: dante.j.pacella@verizon.com


   Arkadiy Gulko
   Thomson Reuters

   EMail: arkadiy.gulko@thomsonreuters.com