Network Working Group                                      Eric C. Rosen
Internet Draft                                                 Yiqun Cai
Intended Status: Informational                         IJsbrand Wijnands
Expires: November 10, 2010                           Cisco Systems, Inc.
                                                            May 10, 2010

                     Multicast in MPLS/BGP IP VPNs

                      draft-rosen-vpn-mcast-14.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright and License Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   This draft describes the deployed MVPN (Multicast in BGP/MPLS IP
   VPNs) solution of Cisco Systems.

Table of Contents

    1          Specification of requirements
    2          Introduction
    2.1        Scaling Multicast State Info. in the Network Core
    2.2        Overview
    3          Multicast VRFs
    4          Multicast Domains
    4.1        Model of Operation
    5          Multicast Tunnels
    5.1        Ingress PEs
    5.2        Egress PEs
    5.3        Tunnel Destination Address(es)
    5.4        Auto-Discovery
    5.4.1      MDT-SAFI
    5.5        Which PIM Variant to Use
    5.6        Inter-AS MDT Construction
    5.6.1      The PIM MVPN Join Attribute
    5.6.1.1    Definition
    5.6.1.2    Usage
    5.7        Encapsulation in GRE
    5.8        MTU
    5.9        TTL
    5.10       Differentiated Services
    5.11       Avoiding Conflict with Internet Multicast
    6          The PIM C-Instance and the MT
    6.1        PIM C-Instance Control Packets
    6.2        PIM C-Instance RPF Determination
    6.2.1      Connector Attribute
    7          Data MDT: Optimizing Flooding
    7.1        Limitation of Multicast Domain
    7.2        Signaling Data MDT Trees
    7.3        Use of SSM for Data MDTs
    8          Packet Formats and Constants
    8.1        MDT TLV
    8.2        MDT Join TLV for IPv4 streams
    8.3        MDT Join TLV for IPv6 streams
    8.4        Multiple MDT Join TLVs per Datagram
    8.5        Constants
    9          IANA Considerations
   10          Security Considerations
   11          Acknowledgments
   12          Normative References
   13          Informative References
   14          Authors' Addresses

1. Specification of requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Introduction

   This draft describes the deployed MVPN (Multicast in BGP/MPLS IP
   VPNs) solution of Cisco Systems. This document is being made
   available as a reference for interoperating with deployed
   implementations.

   The procedures specified in this draft are largely a subset of the
   generalized MVPN framework defined in [MVPN]. However, as this
   draft specifies an implementation that precedes the standardization
   of [MVPN] by ten years, it does differ in a few minor respects from a
   fully standards-compliant implementation. These differences are
   pointed out where they occur.

   The base specification for BGP/MPLS IP VPNs [RFC4364] does not
   provide a way for IP multicast data or control traffic to travel from
   one VPN site to another. This document extends that specification by
   specifying the necessary protocols and procedures for support of IP
   multicast.

   This specification presupposes that:

     1. PIM [PIMv2], running over either IPv4 or IPv6, is the multicast
        routing protocol used within the VPN,

     2. PIM, running over IPv4, is the multicast routing protocol used
        within the SP network, and

     3. the SP network supports native IPv4 multicast forwarding.

   Familiarity with the terminology and procedures of [RFC4364] is
   presupposed. Familiarity with [PIMv2] is also presupposed.

2.1. Scaling Multicast State Info. in the Network Core

   The BGP/MPLS IP VPN service of [RFC4364] provides a VPN with
   "optimal" unicast routing through the SP backbone, in that a packet
   follows the "shortest path" across the backbone, as determined by the
   backbone's own routing algorithm. This optimal routing is provided
   without requiring the P routers to maintain any routing information
   that is specific to a VPN; indeed, the P routers do not maintain any
   per-VPN state at all.

   Unfortunately, optimal MULTICAST routing cannot be provided without
   requiring the P routers to maintain some VPN-specific state
   information. Optimal multicast routing would require that one or
   more multicast distribution trees be created in the backbone for each
   multicast group that is in use. If a particular multicast group from
   within a VPN is using source-based distribution trees, optimal
   routing requires that there be one distribution tree for each
   transmitter of that group. If shared trees are being used, one tree
   for each group is still required. Each such tree requires state in
   some set of the P routers, with the amount of state being
   proportional to the number of multicast transmitters. The reason
   there needs to be at least one distribution tree per multicast group
   is that each group may have a different set of receivers; multicast
   routing algorithms generally go to great lengths to ensure that a
   multicast packet will not be sent to a node that is not on the path
   to a receiver.
   Given that an SP generally supports many VPNs, where each VPN may
   have many multicast groups, and each multicast group may have many
   transmitters, it is not scalable to have one or more distribution
   trees for each multicast group. The SP has no control whatsoever
   over the number of multicast groups and transmitters that exist in
   the VPNs, and it is difficult to place any bound on these numbers.

   In order to have a scalable multicast solution for MPLS/BGP IP VPNs,
   the amount of state maintained by the "P routers" (routers in the
   provider backbone, other than the "provider edge" or "PE" routers)
   needs to be proportional to something that IS under the control of
   the SP. This specification describes such a solution. In this
   solution, the amount of state maintained in the P routers is
   proportional only to the number of VPNs that run over the backbone;
   the amount of state in the P routers is NOT sensitive to the number
   of multicast groups or to the number of multicast transmitters within
   the VPNs. To achieve this scalability, the optimality of the
   multicast routes is reduced. A PE that is not on the path to any
   receiver of a particular multicast group may still receive multicast
   packets for that group, and if so, will have to discard them. The SP
   does, however, have control over the tradeoff between optimal routing
   and scalability.

2.2. Overview

   An SP determines whether a particular VPN is multicast-enabled. If
   it is, it corresponds to a "Multicast Domain". A PE that attaches to
   a particular multicast-enabled VPN is said to belong to the
   corresponding Multicast Domain. For each Multicast Domain, there is
   a default "Multicast Distribution Tree (MDT)" through the backbone,
   connecting ALL of the PEs that belong to that Multicast Domain. A
   given PE may be in as many Multicast Domains as there are VPNs
   attached to that PE. However, each Multicast Domain has its own MDT.
   The MDTs are created by running PIM in the backbone, and in general
   an MDT also includes P routers on the paths between the PE routers.

   In a departure from the usual multicast tree distribution procedures,
   the Default MDT for a Multicast Domain is constructed automatically
   as the PEs in the domain come up. Construction of the Default MDT
   does not depend on the existence of multicast traffic in the domain;
   it will exist before any such multicast traffic is seen. Default
   MDTs correspond to the "MI-PMSIs" of [MVPN].

   In BGP/MPLS IP VPNs, each CE ("Customer Edge", see [RFC4364]) router
   is a unicast routing adjacency of a PE router, but CE routers at
   different sites do NOT become unicast routing adjacencies of each
   other. This important characteristic is retained for multicast
   routing -- a CE router becomes a PIM adjacency of a PE router, but CE
   routers at different sites do NOT become PIM adjacencies of each
   other. Multicast packets from within a VPN are received from a CE
   router by an ingress PE router. The ingress PE encapsulates the
   multicast packets and (initially) forwards them along the Default MDT
   tree to all the PE routers connected to sites of the given VPN.
   Every PE router attached to a site of the given VPN thus receives all
   multicast packets from within that VPN. If a particular PE router
   is not on the path to any receiver of a packet's multicast group,
   the PE simply discards that packet.

   If a large amount of traffic is being sent to a particular multicast
   group, but that group does not have receivers at all the VPN sites,
   it can be wasteful to forward that group's traffic along the Default
   MDT. Therefore, we also specify a method for establishing individual
   MDTs for specific multicast groups. We call these "Data MDTs". A
   Data MDT delivers VPN data traffic for a particular multicast group
   only to those PE routers that are on the path to receivers of that
   multicast group. Using a Data MDT has the benefit of reducing the
   amount of multicast traffic on the backbone, as well as reducing the
   load on some of the PEs; it has the disadvantage of increasing the
   amount of state that must be maintained by the P routers. The SP has
   complete control over this tradeoff. Data MDTs correspond to the
   S-PMSIs of [MVPN].

   This solution requires the SP to deploy appropriate protocols and
   procedures, but is transparent to the SP's customers. An enterprise
   that uses PIM-based multicasting in its network can migrate from a
   private network to a BGP/MPLS IP VPN service, while continuing to use
   whatever multicast router configurations it was previously using; no
   changes need be made to CE routers or to other routers at customer
   sites. For instance, any dynamic RP-discovery procedures that are
   already in use may be left in place.

3. Multicast VRFs

   The notion of a "VRF", defined in [RFC4364], is extended to include
   multicast routing entries as well as unicast routing entries.

   Each VRF has its own multicast routing table. When a multicast data
   or control packet is received from a particular CE device, multicast
   routing is done in the associated VRF.

   Each PE router runs a number of instances of PIM-SM, as many as one
   per VRF. In each instance of PIM-SM, the PE maintains a PIM
   adjacency with each of the PIM-capable CE routers associated with
   that VRF. The multicast routing table created by each instance is
   specific to the corresponding VRF. We will refer to these PIM
   instances as "VPN-specific PIM instances", or "PIM C-instances".
   Each PE router also runs a "provider-wide" instance of PIM-SM (a "PIM
   P-instance"), in which it has a PIM adjacency with each of its IGP
   neighbors (i.e., with P routers), but NOT with any CE routers, and
   not with other PE routers (unless they happen to be adjacent in the
   SP's network). The P routers also run the P-instance of PIM, but do
   NOT run a C-instance.

   In order to help clarify when we are speaking of the PIM P-instance
   and when we are speaking of a PIM C-instance, we will also apply
   the prefixes "P-" and "C-" respectively to control messages,
   addresses, etc. Thus a P-Join would be a PIM Join that is processed
   by the PIM P-instance, and a C-Join would be a PIM Join that is
   processed by a C-instance. A P-group address would be a group
   address in the SP's address space, and a C-group address would be a
   group address in a VPN's address space.

4. Multicast Domains

4.1. Model of Operation

   A "Multicast Domain (MD)" is essentially a set of VRFs associated
   with interfaces that can send multicast traffic to each other. From
   the standpoint of a PIM C-instance, a multicast domain is equivalent
   to a multi-access interface. The PE routers in a given MD become PIM
   adjacencies of each other in the PIM C-instance.

   Each multicast VRF is assigned to one MD. Each MD is configured with
   a distinct multicast P-group address, called the "Default MDT group
   address". This address is used to build the Default MDT for the MD.

   When a PE router needs to send PIM C-instance control traffic to the
   other PE routers in the MD, it encapsulates the control traffic, with
   its own IPv4 address as source IP address and the Default MDT group
   address as destination IP address. Note that the Default MDT is part
   of the PIM P-instance, whereas the PEs that communicate over the
   Default MDT are PIM adjacencies in a C-instance. Within the
   C-instance, the Default MDT appears to be a multi-access network to
   which all the PEs are attached. This is discussed in more detail in
   section 5.

   The Default MDT does not only carry the PIM control traffic of the
   MD's PIM C-instance. It also, by default, carries the multicast data
   traffic of the C-instance. In some cases though, multicast data
   traffic in a particular MD will be sent on a Data MDT rather than on
   the Default MDT. The use of Data MDTs is described in section 7.

   Note that, if an MDT (Default or Data) is set up using the ASM ("Any
   Source Multicast") Service Model, the MDT (Default or Data) must have
   a P-group address that is "globally unique" (more precisely, unique
   over the set of SP networks carrying the multicast traffic of the
   corresponding MD). If the MDT is set up using the SSM
   ("Source-Specific Multicast") model, the P-group address of an MDT
   only needs to be unique relative to the source of the MDT (though
   see section 5.4). However, some implementations require the same SSM
   group address to be assigned to all the PEs. Interoperability with
   those implementations requires conformance to this restriction.

5. Multicast Tunnels

   An MD can be thought of as a set of PE routers connected by a
   "multicast tunnel (MT)". From the perspective of a VPN-specific PIM
   instance, an MT is a single multi-access interface. In the SP
   network, a single MT is realized as a Default MDT combined with zero
   or more Data MDTs.

5.1. Ingress PEs

   An ingress PE is a PE router that is connected to the multicast
   sender in the VPN, either directly or via a CE router. When the
   multicast sender starts transmitting, and if there are receivers (or
   a PIM RP) behind other PE routers in the common MD, the ingress PE
   becomes the transmitter of either the Default MDT group or a Data MDT
   group in the SP network.

5.2. Egress PEs

   A PE router with a VRF configured in an MD becomes a receiver of the
   Default MDT group for that MD. A PE router may also join a Data MDT
   group if it has a VPN-specific PIM instance in which it is
   forwarding to one of its attached sites traffic for a particular
   C-group, and that particular C-group has been associated with that
   particular Data MDT. When a PE router joins any P-group used for
   encapsulating VPN multicast traffic, the PE router becomes one of the
   endpoints of the corresponding MT.

   When a packet is received from an MT, the receiving PE derives the MD
   from the destination address, which is a P-group address, of the
   received packet. The packet is then passed to the corresponding
   Multicast VRF and VPN-specific PIM instance for further processing.

5.3. Tunnel Destination Address(es)

   An MT is an IP tunnel for which the destination address is a P-group
   address. However, an MT is not limited to using only one P-group
   address for encapsulation. Based on the payload VPN multicast
   traffic, it can choose to use the Default MDT group address, or one
   of the Data MDT group addresses (as described in section 7 of this
   document), allowing the MT to reach a different set of PE routers in
   the common MD.

5.4. Auto-Discovery

   Any of the variants of PIM may be used to set up the Default MDT:
   PIM-SM, Bidirectional PIM [BIDIR], or PIM-SSM [SSM]. Except in the
   case of PIM-SSM, the PEs need only know the proper P-group address in
   order to begin setting up the Default MDTs. The PEs will then
   discover each other's addresses by virtue of receiving PIM control
   traffic, e.g., PIM Hellos, sourced (and encapsulated) by each other.

   However, in the case of PIM-SSM, the necessary MDTs for an MD cannot
   be set up until each PE in the MD knows the source address of each of
   the other PEs in that same MD. This information needs to be
   auto-discovered.
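   As a rough illustration of why PIM-SSM depends on auto-discovery,
   the following Python sketch lists the (source, group) channels a PE
   would need to join for an SSM Default MDT once it has learned the
   other PEs' source addresses. The function name and the example
   addresses are hypothetical and are not part of this specification;
   how the addresses are learned is outside the sketch.

```python
from ipaddress import IPv4Address


def ssm_default_mdt_joins(local_pe, discovered_pes, p_group):
    """Return the (S, G) channels a PE would join for an SSM Default MDT.

    local_pe       -- this PE's own source address (never joined)
    discovered_pes -- source addresses learned for PEs in the same MD
    p_group        -- the P-group address of the Default MDT
    """
    group = IPv4Address(p_group)
    if not group.is_multicast:
        raise ValueError("P-group must be an IPv4 multicast address")
    local = IPv4Address(local_pe)
    # One SSM channel per remote PE; duplicates and the local PE are dropped.
    sources = {IPv4Address(pe) for pe in discovered_pes} - {local}
    return [(source, group) for source in sorted(sources)]
```

   With no discovered sources the list is empty, which mirrors the
   point made above: under PIM-SSM nothing can be joined until the
   other PEs' addresses are known.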
   A new BGP Address Family, MDT-SAFI, is defined. The NLRI for this
   address family consists of an RD, an IPv4 unicast address, and a
   multicast group address. A given PE router in a given MD constructs
   an NLRI in this family from:

     - Its own IPv4 address. If it has several, it uses the one that it
       will be placing in the IP source address field of multicast
       packets that it will be sending over the MDT.

     - An RD that has been assigned to the MD.

     - The P-group address, an IPv4 multicast address that is to be used
       as the IP destination address field of multicast packets that
       will be sent over the MDT.

   When a PE distributes this NLRI via BGP, it may include a Route
   Target (RT) Extended Communities attribute. This RT must be an
   "Import RT" [RFC4364] of each VRF in the MD. The ordinary BGP
   distribution procedures used by [RFC4364] will then ensure that each
   PE learns the MDT-SAFI "address" of each of the other PEs in the MD,
   and that the learned MDT-SAFI addresses get associated with the right
   VRFs.

   If a PE receives an MDT-SAFI NLRI that does not have an RT attribute,
   the P-group address from the NLRI has to be used to associate the
   NLRI with a particular VRF. In this case, each multicast domain must
   be associated with a unique P-address, even if PIM-SSM is used.
   However, finding a unique P-address for a multi-provider multicast
   group may be difficult.

   In order to facilitate the deployment of multi-provider multicast
   domains, this specification REQUIRES the use of the MDT-SAFI NLRI
   (even if PIM-SSM is not used to set up the default MDT). This
   specification also REQUIRES that an implementation be capable of
   using PIM-SSM to set up the default MDT.

   In [MVPN], the MDT-SAFI is replaced by the "Intra-AS I-PMSI A-D
   Route." The latter is a generalized version of the MDT-SAFI, which
   allows the "default MDTs" and "data MDTs" to be implemented as MPLS
   P2MP LSPs ("Point-to-Multipoint Label Switched Paths") or MP2MP LSPs
   ("Multipoint-to-Multipoint Label Switched Paths"), as well as by
   PIM-created multicast distribution trees. In the latter case, the
   Intra-AS A-D routes carry the same information that the MDT-SAFI
   does, though with a different encoding.

   The Intra-AS A-D Routes also carry Route Targets, and so may be
   distributed inter-AS in the same manner as unicast routes. (Inter-AS
   distribution of "Intra-AS I-PMSI A-D routes" is necessary in some
   cases, see below.)

   The encoding of the MDT-SAFI is specified in the following
   subsection.

5.4.1. MDT-SAFI

   BGP messages in which AFI=1 and SAFI=66 are "MDT-SAFI" messages.

   The NLRI format is 8-byte-RD:IPv4-address followed by the MDT group
   address. That is, the MP_REACH attribute for this SAFI will contain
   one or more tuples of the following form:

            +-------------------------------+
            |                               |
            |  RD:IPv4-address (12 octets)  |
            |                               |
            +-------------------------------+
            |    Group Address (4 octets)   |
            +-------------------------------+

   The IPv4 address identifies the PE that originated this route, and
   the RD identifies a VRF in that PE. The group address MUST be an
   IPv4 multicast group address, and is used to build the P-tunnels.
   All PEs attached to a given MVPN MUST specify the same group-address,
   even if the group is an SSM group. MDT-SAFI routes do not carry RTs,
   and the group address is used to associate a received MDT-SAFI route
   with a VRF.

5.5. Which PIM Variant to Use

   To minimize the amount of multicast routing state maintained by the P
   routers, the Default MDTs should be realized as shared trees, such as
   PIM Bidirectional trees. However, the operational procedures for
   assigning P-group addresses may be greatly simplified, especially in
   the case of multi-provider MDs, if PIM-SSM is used.

   Data MDTs are best realized as source trees, constructed via PIM-SSM.

5.6. Inter-AS MDT Construction

   Standard PIM techniques for the construction of source trees
   presuppose that every router has a route to the source of the tree.
   However, if the source of the tree is in a different AS than a
   particular P router, it is possible that the P router will not have a
   route to the source. For example, the remote AS may be using BGP to
   distribute a route to the source, but a particular P router may be
   part of a "BGP-free core", in which the P routers are not aware of
   BGP-distributed routes.

   What is needed in this case is a way for a PE to tell PIM to
   construct the tree through a particular BGP speaker, the "BGP next
   hop" for the tree source. This can be accomplished with a PIM
   extension.

   If the PE has selected the source of the tree from the MDT SAFI
   address family, then it may be desirable to build the tree along the
   route to the MDT SAFI address, rather than along the route to the
   corresponding IPv4 address. This enables the inter-AS portion of the
   tree to follow a path that is specifically chosen for multicast
   (i.e., it allows the inter-AS multicast topology to be
   "non-congruent" to the inter-AS unicast topology). This too requires
   a PIM extension.

   The necessary PIM extension is the PIM MVPN Join Attribute described
   in the following sub-section.

5.6.1. The PIM MVPN Join Attribute

5.6.1.1. Definition

   In [PIM-ATTRIB], the notion of a "join attribute" is defined, and a
   format for including join attributes in PIM Join/Prune messages is
   specified. We now define a new join attribute, which we call the
   "MVPN Join Attribute".
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|E|   Type    |     Length    |      Proxy IP address
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      RD
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-.......

   The 6-bit Type field of the MVPN Join Attribute is set to 1.

   The F bit is set to 0, indicating that the attribute is
   non-transitive.

   Rules for setting the E bit are given in [PIM-ATTRIB].

   Two information fields are carried in the MVPN Join attribute:

     - Proxy: The IP address of the node towards which the PIM
       Join/Prune message is to be forwarded. This will either be an
       IPv4 or an IPv6 address, depending on whether the PIM Join/Prune
       message itself is IPv4 or IPv6.

     - RD: An eight-byte RD. This immediately follows the proxy IP
       address.

   The PIM message also carries the address of the upstream PE.

   In the case of an intra-AS MVPN, the proxy and the upstream PE are
   the same. In the case of an inter-AS MVPN, the proxy will be the
   ASBR that is the exit point from the local AS on the path to the
   upstream PE.

5.6.1.2. Usage

   When a PE router creates a PIM Join/Prune message in order to set up
   an inter-AS default MDT, it does so as a result of having received a
   particular MDT-SAFI route. It includes an MVPN Join attribute whose
   fields are set as follows:

     - If the upstream PE is in the same AS as the local PE, then the
       proxy field contains the address of the upstream PE. Otherwise,
       it contains the address of the BGP next hop on the route to the
       upstream PE.

     - The RD field contains the RD from the NLRI of the MDT-SAFI route.

     - The upstream PE field contains the address of the PE that
       originated the MDT-SAFI route (obtained from the NLRI of that
       route).
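   The attribute layout defined in 5.6.1.1 can be sketched in Python as
   follows. This is only an illustration under stated assumptions, not
   a definitive encoder: the function name is hypothetical, the Length
   octet is assumed to count the octets that follow it (proxy address
   plus RD), and the E bit is taken as an input because its rules are
   given in [PIM-ATTRIB] rather than here.

```python
from ipaddress import ip_address

MVPN_JOIN_ATTR_TYPE = 1   # 6-bit Type value given in section 5.6.1.1
RD_LENGTH = 8             # an RD is eight bytes


def encode_mvpn_join_attribute(proxy, rd, e_bit=False):
    """Build the bytes of an MVPN Join Attribute (illustrative sketch).

    proxy -- IPv4 or IPv6 address string of the node to forward towards
    rd    -- the 8-byte RD taken from the NLRI of the MDT-SAFI route
    e_bit -- value of the E bit; see [PIM-ATTRIB] for when to set it
    """
    if len(rd) != RD_LENGTH:
        raise ValueError("RD must be exactly 8 bytes")
    proxy_bytes = ip_address(proxy).packed      # 4 bytes (IPv4) or 16 (IPv6)
    value = proxy_bytes + bytes(rd)             # RD immediately follows proxy
    f_bit = 0                                   # non-transitive
    first_octet = (f_bit << 7) | (int(bool(e_bit)) << 6) | MVPN_JOIN_ATTR_TYPE
    # ASSUMPTION: Length is the number of octets in the value that follows.
    return bytes([first_octet, len(value)]) + value
```

   For an intra-AS MVPN the caller would pass the upstream PE's address
   as the proxy; for an inter-AS MVPN it would pass the BGP next hop on
   the route to the upstream PE, as described in the list above.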
   When a PIM router processes a PIM Join/Prune message with an MVPN
   Join Attribute, it first checks to see if the proxy field contains
   one of its own addresses.

   If not, the router uses the proxy IP address in order to determine
   the RPF interface and neighbor. The MVPN Join Attribute MUST be
   passed upstream, unchanged.

   If the proxy address is one of the router's own IP addresses, then
   the router looks in its BGP routing table for an MDT-SAFI route whose
   NLRI consists of the upstream PE address prepended with the RD from
   the Join attribute. If there is no match, the PIM message is
   discarded. If there is a match, the IP address from the BGP next hop
   field of the matching route is used in order to determine the RPF
   interface and neighbor. When the PIM Join/Prune is forwarded
   upstream, the proxy field is replaced with the address of the BGP
   next hop, and the RD and upstream PE fields are left unchanged.

5.7. Encapsulation in GRE

   GRE [GRE1701] encapsulation is used when sending multicast traffic
   through an MDT. The following diagram shows the progression of the
   packet as it enters and leaves the service provider network.

   Packets received        Packets in transit      Packets forwarded
   at ingress PE           in the service          by egress PEs
                           provider network

                           +---------------+
                           |  P-IP Header  |
                           +---------------+
                           |      GRE      |
   ++=============++       ++=============++       ++=============++
   || C-IP Header ||       || C-IP Header ||       || C-IP Header ||
   ++=============++ >>>>> ++=============++ >>>>> ++=============++
   || C-Payload   ||       || C-Payload   ||       || C-Payload   ||
   ++=============++       ++=============++       ++=============++

   The IPv4 Protocol Number field in the P-IP Header MUST be set to 47.
   The Protocol Type field of the GRE Header MUST be set to 0x0800 if
   the C-IP header is an IPv4 header; it MUST be set to 0x86dd if the
   C-IP header is an IPv6 header.
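   The Protocol Type selection can be sketched in Python as below. The
   sketch assumes a basic 4-octet GRE header with no optional checksum,
   key, or sequence number fields (so the first 16 bits are zero), and
   it leaves construction of the outer P-IP header to the caller. The
   function and constant names are illustrative only.

```python
import struct

IPPROTO_GRE = 47          # Protocol Number to place in the P-IP header
GRE_PROTO_IPV4 = 0x0800   # C-IP header is IPv4
GRE_PROTO_IPV6 = 0x86DD   # C-IP header is IPv6


def gre_header_for(c_packet):
    """Return a 4-octet GRE header matching the C-packet's IP version."""
    if not c_packet:
        raise ValueError("empty C-packet")
    version = c_packet[0] >> 4            # IP version is the top nibble
    if version == 4:
        protocol_type = GRE_PROTO_IPV4
    elif version == 6:
        protocol_type = GRE_PROTO_IPV6
    else:
        raise ValueError("C-packet is neither IPv4 nor IPv6")
    # Flags and version word is all zeros: no optional fields, version 0.
    return struct.pack("!HH", 0, protocol_type)


def gre_encapsulate(c_packet):
    """Prepend the GRE header; the P-IP header would be added in front."""
    return gre_header_for(c_packet) + c_packet
```

   The ingress PE would then prepend a P-IP header whose Protocol field
   is IPPROTO_GRE and whose destination is the P-group address of the
   MDT being used.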
   [GRE2784] specifies an optional GRE checksum, and [GRE2890] specifies
   optional GRE key and sequence number fields.

   The GRE key field is not needed because the P-group address in the
   delivery IP header already identifies the MD, and thus the VRF
   context in which the payload packet is to be further processed.

   The GRE sequence number field is also not needed because the
   transport layer services for the original application will be
   provided by the C-IP Header.

   The use of the GRE checksum field MUST follow [GRE2784].

   To facilitate high speed implementation, this document recommends
   that the ingress PE routers encapsulate VPN packets without setting
   the checksum, key or sequence field.

5.8. MTU

   Because multicast group addresses are used as tunnel destination
   addresses, existing Path MTU discovery mechanisms cannot be used.
   This requires that:

   1. The ingress PE router (one that does the encapsulation) MUST
      NOT set the DF ("Don't Fragment") bit in the outer header, and

   2. If the "DF" bit is cleared in the IP header of the C-Packet,
      the ingress PE router fragment the C-Packet before
      encapsulation, if appropriate. This is very important in
      practice because the performance of the reassembly function is
      significantly lower than that of decapsulating and forwarding
      packets on today's router implementations.

5.9. TTL

   The ingress PE should not copy the TTL field from the payload IP
   header received from a CE router to the delivery IP header. The
   setting of the TTL of the delivery IP header is determined by the
   local policy of the ingress PE router.

5.10. Differentiated Services

   By default, the setting of the DS ("Differentiated Services") field
   in the delivery IP header should follow the guidelines outlined in
   [DIFF2983]. An SP may also choose to deploy any of the additional
   mechanisms the PE routers support.

5.11. Avoiding Conflict with Internet Multicast

   If the SP is providing Internet multicast, distinct from its VPN
   multicast services, it must ensure that the P-group addresses that
   correspond to its MDs are distinct from any of the group addresses of
   the Internet multicasts it supports. This is best done by using
   administratively scoped addresses [ADMIN-ADDR].

   The C-group addresses need not be distinct from either the P-group
   addresses or the Internet multicast addresses.

6. The PIM C-Instance and the MT

   If a particular VRF is in a particular MD, the corresponding MT is
   treated by that VRF's VPN-specific PIM instances as a LAN interface.
   As a result, the PEs that are adjacent on the MT will generate and
   process PIM control packets, such as Hello, Join/Prune, and Assert.
   DF election occurs just as it would on an actual LAN interface.

6.1. PIM C-Instance Control Packets

   The PIM protocol packets are sent to ALL-PIM-ROUTERS (224.0.0.13 for
   IPv4 or ff02::d for IPv6) in the context of that VRF, but when in
   transit in the provider network, they are encapsulated using the
   Default MDT group configured for that MD. This allows VPN-specific
   PIM routes to be extended from site to site without appearing in the
   P routers.

   If a PIM C-Instance control packet is an IPv6 packet, its source
   address is the IPv4-mapped IPv6 address corresponding to the IPv4
   address of the PE router sending the packet.

6.2. PIM C-Instance RPF Determination

   Although the MT is treated as a PIM-enabled interface, unicast
   routing is NOT run over it, and there are no unicast routing
   adjacencies over it. It is therefore necessary to specify special
   procedures for determining when the MT is to be regarded as the "RPF
   Interface" for a particular C-address.

   When a PE needs to determine the RPF interface of a particular
   C-address, it looks up the C-address in the VRF.
   If the route matching it is not a VPN-IP route learned from MP-BGP
   as described in [RFC4364], or if that route's outgoing interface is
   one of the interfaces associated with the VRF, then ordinary PIM
   procedures for determining the RPF interface apply.

   However, if the route matching the C-address is a VPN-IP route whose
   outgoing interface is not one of the interfaces associated with the
   VRF, then PIM will consider the outgoing interface to be the MT
   associated with the VPN-specific PIM instance.

   Once PIM has determined that the RPF interface for a particular
   C-address is the MT, it is necessary for PIM to determine the RPF
   neighbor for that C-address. This will be one of the other PEs that
   is a PIM adjacency over the MT.

   A BGP "Connector" attribute is defined for this purpose. Whenever a
   PE router distributes a VPN-IP address from a VRF that is part of an
   MD, it SHOULD distribute a Connector attribute along with it. The
   Connector attribute specifies the MDT address family, and its value
   is the IP address that the PE router is using as its source IP
   address for the multicast packets that are encapsulated and sent
   over the MT. When a PE has determined that the RPF interface for a
   particular C-address is the MT, it looks up the Connector attribute
   that was distributed along with the VPN-IP address corresponding to
   that C-address. The value of this Connector attribute is considered
   to be the RPF adjacency for the C-address.

   There are older implementations in which the Connector attribute is
   not present. In this case, as long as the "BGP Next Hop" for the
   C-address is one of the PEs that is a PIM adjacency, then that PE is
   treated as the RPF adjacency for that C-address.
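   The selection of the RPF adjacency described above can be sketched
   as follows. This is an illustration only: the VpnRoute type, its
   field names, and the function name are hypothetical, and addresses
   are represented as plain strings. The sketch assumes the caller has
   already determined that the RPF interface for the C-address is the
   MT.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class VpnRoute:
    """Hypothetical view of the VPN-IP route matching a C-address."""
    bgp_next_hop: str
    connector: Optional[str] = None  # Connector attribute value, if present


def rpf_neighbor_on_mt(route: VpnRoute,
                       pim_neighbors_on_mt: Set[str]) -> Optional[str]:
    """Pick the RPF adjacency once the RPF interface is known to be the MT."""
    # Preferred case: the Connector attribute value is taken as the
    # RPF adjacency, as stated in the text.
    if route.connector is not None:
        return route.connector
    # Older implementations: fall back to the BGP Next Hop, but only
    # if it is one of the PEs that is a PIM adjacency over the MT.
    if route.bgp_next_hop in pim_neighbors_on_mt:
        return route.bgp_next_hop
    # No usable RPF adjacency; the RPF check does not succeed.
    return None
```

   A return value of None corresponds to the case in which the RPF
   check cannot succeed without the Connector attribute.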
   However, if the MD spans multiple Autonomous Systems, and an "option
   b" interconnect ([RFC4364], section 10) is used, the BGP Next Hop
   might not be a PIM adjacency, and the RPF check will not succeed
   unless the Connector attribute is used.

   In [MVPN], the connector attribute is replaced by the "VRF Route
   Import Extended Community" attribute. The latter is a generalized
   version, but carries the same information as the connector attribute
   does; the encoding, however, is different.

   The connector attribute is defined in the following sub-section.

6.2.1. Connector Attribute

   The Connector Attribute is an optional transitive attribute. Its
   value field is formatted as follows:

       0                   1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1|
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |
      |      IPv4 Address of PE       |
      |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

7. Data MDT: Optimizing Flooding

7.1. Limitation of Multicast Domain

   While the procedure specified in the previous section requires the P
   routers to maintain multicast state, the amount of state is bounded
   by the number of supported VPNs. The P routers do NOT run any
   VPN-specific PIM instances.

   In particular, the use of a single bidirectional tree per VPN scales
   well as the number of transmitters and receivers increases, but not
   so well as the amount of multicast traffic per VPN increases.

   The multicast routing provided by this scheme is not optimal, in that
   a packet of a particular multicast group may be forwarded to PE
   routers that have no downstream receivers for that group, and hence
   which may need to discard the packet.

   In the simplest configuration model, only the Default MDT group is
   configured for each MD.
   The result of the configuration is that all VPN multicast traffic,
   control or data, will be encapsulated and forwarded to all PE
   routers that are part of the MD. While this limits the amount of
   multicast routing state the provider network has to maintain, it
   also requires PE routers to discard multicast C-packets if there are
   no receivers for those packets in the corresponding sites. In some
   cases, especially when the content involves high bandwidth but only
   a limited set of receivers, it is desirable that certain C-packets
   only travel to PE routers that do have receivers in the VPN to save
   bandwidth in the network and reduce load on the PE routers.

7.2. Signaling Data MDT Trees

   A simple protocol is proposed to signal additional P-group addresses
   to encapsulate VPN traffic. These P-group addresses are called data
   MDT groups. The ingress PE router advertises a different P-group
   address (as opposed to always using the Default MDT group) to
   encapsulate VPN multicast traffic. Only the PE routers on the path
   to eventual receivers join the P-group, and therefore form an optimal
   multicast distribution tree in the service provider network for the
   VPN multicast traffic. These multicast distribution trees are called
   Data MDT trees because they do not carry PIM control packets
   exchanged by PE routers.

   The following describes the procedures for the initiation and
   teardown of the Data MDT trees. The definitions of the constants
   and timers can be found in section 8.

   - The PE router connected to the source of the content initially
     uses the Default MDT group when forwarding the content to the MD.

   - When one or more pre-configured conditions are met, it starts to
     periodically announce an MDT Join TLV at the interval of
     [MDT_INTERVAL]. The MDT Join TLV is forwarded to all the PE
     routers in the MD.

     If a PE in a particular MD transmits a C-multicast data packet to
     the backbone, by transmitting it through an MD, every other PE in
     that MD will receive it. Any of those PEs that are not on a
     C-multicast distribution tree for the packet's C-multicast
     destination address (as determined by applying ordinary PIM
     procedures to the corresponding multicast VRF) will have to
     discard the packet.

     A commonly used condition is the bandwidth. When the VPN traffic
     exceeds a certain threshold, it is more desirable to deliver the
     flow only to the PE routers connected to receivers in order to
     optimize the performance of PE routers and the resources of the
     provider network. However, other conditions can also be devised
     and they are purely implementation specific.

   - The MDT Join TLV is encapsulated in UDP.

     UDP over IPv4 is used if the multicast stream being assigned to a
     data-MDT is an IPv4 stream. In this case the UDP datagram is
     addressed to ALL-PIM-ROUTERS (224.0.0.13).

     UDP over IPv6 is used if the multicast stream being assigned to a
     data-MDT is an IPv6 stream. In this case the UDP datagram is
     addressed to ALL-PIM-ROUTERS (ff02::d).

     The destination UDP port is 3232.

     The UDP datagram is sent on the Default MDT. This allows all PE
     routers to receive the information. Any MDT Join that is not
     received over a Default MDT MUST be dropped.

   - Upon receiving an MDT Join TLV, PE routers connected to receivers
     will join the Data MDT group announced by the MDT Join TLV in the
     global table. When the Data MDT group is in PIM-SM or
     bidirectional PIM mode, the PE routers build a shared tree toward
     the RP. When the Data MDT group is set up using PIM-SSM, the PE
     routers build a source tree toward the PE router that is
     advertising the MDT Join TLV. The source address of this tree is
     the same as the source IP address used in the IP packet
     advertising the MDT Join TLV.

     PE routers that are not connected to receivers may wish to cache
     the states in order to reduce the delay when a receiver comes up
     in the future.

   - After [MDT_DATA_DELAY], the PE router connected to the source
     starts encapsulating traffic using the Data MDT group.

   - When the pre-configured conditions are no longer met, e.g. the
     traffic stops, the PE router connected to the source stops
     announcing the MDT Join TLV.

   - If the MDT Join TLV is not received for a period of
     [MDT_DATA_TIMEOUT], PE routers connected to the receivers leave
     the Data MDT group in the global instance.

7.3. Use of SSM for Data MDTs

   The use of Data MDTs requires that a set of multicast P-addresses be
   pre-allocated and dedicated for use as the destination addresses for
   the Data MDTs.

   If SSM is used to set up the Data MDTs, then each MD needs to be
   assigned a set of these multicast P-addresses. Each VRF in the MD
   needs to be configured with this set (i.e., all VRFs in the MD are
   configured with the same set). If there are n addresses in this set,
   then each PE in the MD can be the source of n Data MDTs in that MD.

   If SSM is not used for setting up Data MDTs, then each VRF needs to
   be configured with a unique set of multicast P-addresses; two VRFs in
   the same MD cannot be configured with the same set of addresses.
   This requires the pre-allocation of many more multicast P-addresses,
   and the need to configure a different set for each VRF greatly
   complicates the operations and management. Therefore the use of SSM
   for Data MDTs is very strongly recommended.

8. Packet Formats and Constants

8.1. MDT TLV

   "MDT TLV" has the following format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |            Length             |     Value     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               .                               |
   |                               .                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Type (8 bits):

      the type of the MDT TLV. In this specification, types 1 and 4
      are defined.

   Length (16 bits):

      the total number of octets in the TLV for this type, including
      both the Type and Length fields.

   Value (variable length):

      the content of the TLV.

8.2. MDT Join TLV for IPv4 streams

   "MDT Join TLV for IPv4 streams" has the following format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |            Length             |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           C-Source                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            C-Group                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            P-Group                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Type (8 bits):

      Must be set to 1.

   Length (16 bits):

      Must be set to 16.

   Reserved (8 bits):

      for future use.

   C-Source (32 bits):

      the IPv4 address of the traffic source in the VPN.

   C-Group (32 bits):

      the IPv4 address of the multicast traffic destination address in
      the VPN.

   P-Group (32 bits):

      the IPv4 group address that the PE router is going to use to
      encapsulate the flow (C-Source, C-Group).

8.3. MDT Join TLV for IPv6 streams

   "MDT Join TLV for IPv6 streams" has the following format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |            Length             |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           C-Source                            |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                            C-Group                            |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            P-Group                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Type (8 bits):

      Must be set to 4.

   Length (16 bits):

      Must be set to 40.

   Reserved (8 bits):

      for future use.

   C-Source (128 bits):

      the IPv6 address of the traffic source in the VPN.

   C-Group (128 bits):

      the IPv6 address of the multicast traffic destination address in
      the VPN.

   P-Group (32 bits):

      the IPv4 group address that the PE router is going to use to
      encapsulate the flow (C-Source, C-Group).

8.4. Multiple MDT Join TLVs per Datagram

   A single UDP datagram MAY carry multiple MDT Join TLVs, as many as
   can fit entirely within it. If there are multiple MDT Join TLVs in a
   UDP datagram, they MUST be of the same type. The end of the last MDT
   Join TLV (as determined by the MDT Join TLV length field) MUST
   coincide with the end of the UDP datagram, as determined by the UDP
   length field. When processing a received UDP datagram that contains
   one or more MDT Join TLVs, a router MUST be able to process all the
   MDT Join TLVs that fit into the datagram.

8.5. Constants

   [MDT_DATA_DELAY]:

      the interval that the PE router connected to the source waits
      before switching to the Data MDT group. The default value is 3
      seconds.

   [MDT_DATA_TIMEOUT]:

      the interval after which a PE router connected to receivers
      times out a received MDT Join TLV and leaves the Data MDT group.
      The default value is 3 minutes.
      This value must be consistent among PE routers.

   [MDT_DATA_HOLDOWN]:

      the interval that the PE router waits, after it has started
      encapsulating packets using the Data MDT group, before it will
      switch back to the Default MDT tree. This is used to avoid
      oscillation when traffic is bursty. The default value is 1
      minute.

   [MDT_INTERVAL]:

      the interval at which the source PE router periodically sends
      the MDT Join TLV message. The default value is 60 seconds.

9. IANA Considerations

   The codepoint for the connector attribute is defined in IANA's
   registry of BGP attributes. The reference should be changed to refer
   to this document.

   The codepoint for MDT-SAFI is defined in IANA's registry of BGP SAFI
   assignments. The reference should be changed to refer to this
   document.

10. Security Considerations

   [RFC4364] discusses in general the security considerations that
   pertain when the RFC 4364 type of VPN is deployed.

   [PIMv2] discusses the security considerations that pertain to the use
   of PIM.

   The security considerations of [RFC4023] and [RFC4797] apply whenever
   VPN traffic is carried through IP or GRE tunnels.

11. Acknowledgments

   Major contributions to this work have been made by Dan Tappan and
   Tony Speakman.

   The authors also wish to thank Arjen Boers, Robert Raszuk, Toerless
   Eckert and Ted Qian for their help and their ideas.

12. Normative References

   [GRE2784] "Generic Routing Encapsulation (GRE)", Farinacci, Li,
   Hanks, Meyer, Traina, March 2000, RFC 2784

   [PIMv2] "Protocol Independent Multicast - Sparse Mode (PIM-SM)",
   Fenner, Handley, Holbrook, Kouvelas, August 2006, RFC 4601

   [PIM-ATTRIB] "The PIM Join Attribute Format", A. Boers, IJ. Wijnands,
   E.
   Rosen, November 2008, RFC 5384

   [RFC2119] "Key words for use in RFCs to Indicate Requirement
   Levels", Bradner, March 1997, RFC 2119

   [RFC4364] "BGP/MPLS IP VPNs", Rosen, Rekhter, February 2006, RFC 4364

13. Informative References

   [ADMIN-ADDR] "Administratively Scoped IP Multicast", Meyer, July
   1998, RFC 2365

   [BIDIR] "Bidirectional Protocol Independent Multicast", Handley,
   Kouvelas, Speakman, Vicisano, October 2007, RFC 5015

   [DIFF2983] "Differentiated Services and Tunnels", Black, October
   2000, RFC 2983

   [GRE1701] "Generic Routing Encapsulation (GRE)", Farinacci, Li,
   Hanks, Traina, October 1994, RFC 1701

   [GRE2890] "Key and Sequence Number Extensions to GRE", Dommety,
   September 2000, RFC 2890

   [MVPN] "Multicast in MPLS/BGP IP VPNs", Rosen, Aggarwal,
   draft-ietf-l3vpn-2547bis-mcast-10.txt, January 2010

   [SSM] "Source-Specific Multicast for IP", Holbrook, Cain, August
   2006, RFC 4607

   [RFC4023] "Encapsulating MPLS in IP or Generic Routing Encapsulation
   (GRE)", T. Worster, Y. Rekhter, E. Rosen, Ed., March 2005, RFC 4023

   [RFC4797] "Use of Provider Edge to Provider Edge (PE-PE) Generic
   Routing Encapsulation (GRE) or IP in BGP/MPLS IP Virtual Private
   Networks", Y. Rekhter, R. Bonica, E. Rosen, January 2007, RFC 4797

14. Authors' Addresses

   Yiqun Cai (Editor)
   Cisco Systems, Inc.
   170 Tasman Drive
   San Jose, CA, 95134
   E-mail: ycai@cisco.com

   Eric C. Rosen (Editor)
   Cisco Systems, Inc.
   1414 Massachusetts Avenue
   Boxborough, MA, 01719
   E-mail: erosen@cisco.com

   IJsbrand Wijnands
   Cisco Systems, Inc.
   170 Tasman Drive
   San Jose, CA, 95134
   E-mail: ice@cisco.com