BESS                                                            Z. Zhang
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                                K. Patel
Expires: April 18, 2016                                    Cisco Systems
                                                        October 16, 2015


                          BGP Based Multicast
                   draft-zzhang-bess-bgp-multicast-00

Abstract

   This document describes multicast signaling based on Border Gateway
   Protocol (BGP).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 18, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect



Zhang & Patel            Expires April 18, 2016                 [Page 1]

Internet-Draft                  bgp-mcast                   October 2015


   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Motivation
     1.2.  Overview
       1.2.1.  BGP Sessions
       1.2.2.  LAN and Parallel Links
       1.2.3.  Source Discovery for ASM
       1.2.4.  Bidirectional Trees
       1.2.5.  Transition
   2.  Specification
     2.1.  BGP NLRIs and Attributes
       2.1.1.  S-PMSI A-D Route
       2.1.2.  Source Active A-D Route
     2.2.  Procedures
       2.2.1.  Originating Tree Join Routes
       2.2.2.  Receiving Tree Join Routes
       2.2.3.  Originating S-PMSI A-D Routes
       2.2.4.  Receiving S-PMSI A-D Routes
   3.  Security Considerations
   4.  Acknowledgements
   5.  References
     5.1.  Normative References
     5.2.  Informative References
   Authors' Addresses

1.  Introduction

1.1.  Motivation

   Protocol Independent Multicast (PIM) has been the prevailing
   multicast protocol for many years.  Despite its success, it has two
   drawbacks:

   o  Complexity arising from the RPT/SPT switchover and the data-driven
      nature of PIM-ASM.

   o  Periodic protocol state refreshes due to its soft-state nature.

   While PIM-SSM removes the complexity of PIM-ASM, there has not been a
   good way of discovering sources, which limits its deployment.
   PIM-Port (PIM over Reliable Transport) solves the soft-state issue,
   though its deployment has also been limited.

   Partly because of the complexity concern, some Data Center operators
   have been avoiding deploying multicast in their networks.

   BGP-MVPN [RFC6514] uses BGP to signal VPN customer multicast state
   over provider networks.  It removes the above-mentioned problems, and
   the deployment experience has been encouraging.
   [draft-ietf-bess-mvpn-pe-ce] adapts the concept of BGP-MVPN to PE-CE
   links, and this document extends it further to general topologies, so
   that it can be deployed in any network where BGP is running, or can
   be run, on all or most routers.  One target deployment would be a
   Data Center that requires multicast and that uses BGP as its only
   routing protocol.

1.2.  Overview

   In a nutshell, this is PIM with BGP based join/prune signaling, plus
   BGP based source discovery in the case of ASM.

   The same RPF procedures as in PIM are used for each router to
   determine the RPF neighbor for a particular source or RPA (in the
   case of a Bidirectional Tree).  Except in the Bidirectional Tree
   case, no (*,G) join is used: LHRs discover the sources for ASM and
   then join towards the sources directly.  Data driven mechanisms like
   PIM Assert are replaced by control driven mechanisms (Section 1.2.2).

   The joins are carried in BGP Updates with CMCAST SAFI and types of
   routes as defined in [draft-ietf-bess-mvpn-pe-ce].  CMCAST NLRIs are
   targeted at the upstream neighbor by use of Route Targets.

1.2.1.  BGP Sessions

   As specified in [draft-ietf-bess-mvpn-pe-ce-00], in order for two BGP
   speakers to exchange C-MCAST NLRI, they must use BGP Capabilities
   Advertisement [RFC5492] to ensure that they both are capable of
   properly processing the C-MCAST NLRI.  This is done as specified in
   [RFC4760], by using capability code 1 (multiprotocol BGP) with an AFI
   of IPv4 (1) or IPv6 (2) and a SAFI of C-MCAST with a value to be
   assigned by IANA.
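
   As an illustration of the capability exchange described above, the
   following minimal sketch encodes the Multiprotocol Extensions
   capability (code 1) using the value layout from RFC 4760: AFI (2
   octets), a reserved octet, and SAFI (1 octet).  The C-MCAST SAFI has
   not been assigned, so the value 241 below is only a placeholder
   assumed for this example, and the function names are hypothetical.

```python
import struct

AFI_IPV4 = 1
AFI_IPV6 = 2

# Assumption: the C-MCAST SAFI is still to be assigned by IANA.
# 241 is a placeholder for illustration only, not a real assignment.
CMCAST_SAFI_PLACEHOLDER = 241

MP_EXT_CAPABILITY_CODE = 1  # "multiprotocol BGP" capability code


def encode_mp_capability(afi: int, safi: int) -> bytes:
    """Encode one capability as code, length, then AFI/reserved/SAFI."""
    value = struct.pack("!HBB", afi, 0, safi)
    return struct.pack("!BB", MP_EXT_CAPABILITY_CODE, len(value)) + value


def supports_cmcast(local_caps, peer_caps, afi,
                    safi=CMCAST_SAFI_PLACEHOLDER) -> bool:
    """C-MCAST NLRI may be exchanged only if both sides advertised it."""
    wanted = encode_mp_capability(afi, safi)
    return wanted in local_caps and wanted in peer_caps


if __name__ == "__main__":
    caps = {encode_mp_capability(AFI_IPV4, CMCAST_SAFI_PLACEHOLDER)}
    print(supports_cmcast(caps, caps, AFI_IPV4))   # both sides advertise
    print(supports_cmcast(caps, set(), AFI_IPV4))  # peer does not
```

   Both speakers must advertise the same AFI/SAFI pair; a mismatch on
   either side means C-MCAST NLRI is not exchanged for that address
   family.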

   How the BGP peer sessions are provisioned, whether EBGP or IBGP,
   whether statically, automatically (e.g., based on IGP neighbor
   discovery), or programmably via an external controller, is outside
   the scope of this document.

   In the case of IBGP, every router could peer with Route Reflectors,
   or hop by hop IBGP sessions could be used to exchange CMCAST NLRIs
   for joins.  In the latter case, unless desired otherwise for reasons
   outside of the scope of this document, the hop by hop IBGP sessions
   MUST only be used to exchange CMCAST NLRIs.

   FHRs and LHRs also establish BGP sessions to some Route Reflectors
   for source discovery purposes (Section 1.2.3).

   With traditional PIM, the FHRs and LHRs refer to the PIM DRs on the
   source or receiver networks.  With BGP based multicast, PIM may not
   be running at all, and the FHRs and LHRs refer to the IGMP/MLD
   queriers in that case.

1.2.2.  LAN and Parallel Links

   There could be parallel links between two BGP peers.  A single
   multi-hop session, whether IBGP or EBGP, between loopback addresses
   may be used.  Except for LAN interfaces, any link between the two
   peers can be automatically used by a downstream peer to receive
   traffic from the upstream peer, and it is for the upstream peer to
   decide which link to use.  If one of the links goes down, the
   upstream peer switches to a different link and no change is needed on
   the downstream peer.

   The upstream peer MAY prefer LAN interfaces to send traffic, since
   multiple downstream peers may be reached simultaneously, or it may
   make a decision based on local policy, e.g., for load balancing
   purposes.

   Because different downstream peers might choose different upstream
   peers for RPF, when an upstream peer decides to use a LAN interface
   to send traffic, it originates an S-PMSI A-D route indicating that
   one or more LAN interfaces will be used.
   The route carries Route Targets specific to the LANs so that all the
   peers on the LANs import the route.  If more than one router
   originates the route specifying the same LAN for the same (s,g) or
   (*,g) flow, then an assert procedure based on the S-PMSI A-D routes
   takes place and the assert losers stop sending traffic to the LAN.

   In this multi-hop session case, there needs to be a way to determine
   whether two peers are directly connected, so that traffic can be sent
   natively when possible or tunneled when necessary.  Advertising
   attached interface addresses, as LDP does, could be one way.  Those
   advertisements can be limited to peers that are directly connected by
   use of Route Targets.  More details may be provided in a future
   revision, pending further consideration.

   Alternatively, multiple single-hop sessions between interface
   addresses, whether IBGP or EBGP, can be used.  This is especially
   suitable in DC scenarios.

1.2.3.  Source Discovery for ASM

   This document does not support ASM via shared trees (aka RP Tree, or
   RPT).  Instead, FHRs, RPs, and LHRs work together to propagate/
   discover source information via the control plane, and LHRs join
   source specific Shortest Path Trees (SPT) directly.

   The RPs are just Route Reflectors.  Multicast data traffic does not
   necessarily go through them, and redundancy can be easily achieved by
   having multiple RRs.  They do not participate in any multicast
   specific procedures, other than redistributing Source Active A-D
   routes.

   An FHR originates Source Active A-D routes upon discovering sources
   for particular flows and advertises them to the RRs, carrying an IPv4
   or IPv6 address specific Route Target.  The Global Administrator
   field is set to the group address of the flow, and the Local
   Administrator field is set to 0.
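
   The Route Target construction just described can be sketched as
   follows.  This is an illustration under stated assumptions, not a
   normative encoding: it uses the IPv4-address-specific extended
   community layout of RFC 4360 (type 0x01, sub-type 0x02, 8 octets) and
   the IPv6-address-specific layout of RFC 5701 (type 0x00, sub-type
   0x02, 20 octets); the function name is hypothetical.

```python
import ipaddress
import struct

RT_SUBTYPE = 0x02          # Route Target sub-type
TYPE_IPV4_SPECIFIC = 0x01  # RFC 4360, transitive IPv4-address-specific
TYPE_IPV6_SPECIFIC = 0x00  # RFC 5701, transitive IPv6-address-specific


def sa_route_target(group: str) -> bytes:
    """Build the RT carried by a Source Active A-D route.

    Global Administrator = group address, Local Administrator = 0.
    """
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast group address")
    if addr.version == 4:
        # 1 + 1 + 4 + 2 = 8 octets
        return struct.pack("!BB4sH", TYPE_IPV4_SPECIFIC, RT_SUBTYPE,
                           addr.packed, 0)
    # 1 + 1 + 16 + 2 = 20 octets
    return struct.pack("!BB16sH", TYPE_IPV6_SPECIFIC, RT_SUBTYPE,
                       addr.packed, 0)


if __name__ == "__main__":
    print(sa_route_target("239.1.1.1").hex())
    print(len(sa_route_target("ff0e::1")))
```

   Because the RT is derived only from the group address, every FHR that
   discovers a source for the same group attaches the same RT, which is
   what lets a receiver ask for all sources of a group with one value.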

   An LHR originates Route Target Constraint routes towards the RRs for
   the groups it wants to receive traffic for, with the Route Target
   field in the NLRI set to the corresponding group-based Route Target.
   That way, the RRs maintain all source information but distribute it
   only to interested LHRs, on demand.

   Because the RPs are only used for distributing SA routes and not as
   data rendezvous points, a small number of them is enough and there is
   no need to have different RPs for different groups.  As a result,
   static configuration is sufficient; there is no need for dynamic RP
   learning protocols like BSR and Auto-RP.

1.2.3.1.  Integration with BGP-MVPN

   For each VPN, the RRs for that VPN can be completely separate from
   those for a different VPN.  The provider is not involved at all, as
   in the Inter-site Shared C-Tree model described in Section 13 of RFC
   6514.

   Alternatively, one or more PEs can serve as the RRs for their local
   sites for the purpose of distributing SA routes.  Compared to the
   approach in the previous paragraph, those PEs use a single session
   (vs. one session for each VPN) to exchange BGP-MVPN SA routes
   (MCAST-VPN SAFI) among themselves, following the procedures defined
   in Section 14 of RFC 6514.  That is in addition to exchanging BGP SA
   routes (CMCAST SAFI) between a PE and the FHRs/LHRs that it is
   responsible for.

   Note that RFC 6514 does not explicitly specify that an egress PE
   translate received BGP-MVPN SA A-D routes into PIM Null Register
   messages or MSDP SA routes (for the purpose of Anycast RP).  In this
   document, a PE acting as an RR for SA A-D routes does translate
   received BGP-MVPN SA A-D routes to BGP SA A-D routes, and vice versa.

1.2.4.  Bidirectional Trees

   For Bidirectional PIM, on transit LANs it is required that a DF is
   elected to forward traffic to/from the RPA direction.  This is based
   on DF messages exchanged rapidly among the BIDIR-PIM routers on the
   same LAN.
   The procedure is complicated and may not be robust enough in all
   situations.  In a typical provider network, transit LANs are rarely
   used; therefore, for simplicity, this document does not support
   transit LANs for bidirectional trees.

   For resilience purposes the RPA is typically a "virtual address" on a
   multi-access link and is not associated with any router.  No DF
   election is needed on this RPL (Rendezvous Point Link), and all
   routers on the RPL forward traffic to/from the RPL.  With Bidir-PIM,
   the RPL routers terminate the Join/Prune messages from downstream
   neighbors, and the same applies if BGP is used for signaling.

1.2.5.  Transition

   A network currently running PIM can be incrementally transitioned to
   BGP based multicast.  At any time, a router supporting BGP based
   multicast can use PIM with some neighbors (upstream or downstream)
   and BGP with some other neighbors.  PIM and BGP MUST NOT be used
   simultaneously between two neighbors for multicast purposes, and
   routers connected to the same LAN MUST be transitioned during the
   same maintenance window.

   In the case of PIM-SSM, any router can be transitioned at any time
   (except that on a LAN all routers must be transitioned together).  It
   may receive source tree joins from a mixed set of BGP and PIM
   downstream neighbors and send source tree joins to its upstream
   neighbor using either PIM or BGP signaling.

   In the case of PIM-ASM, the RPs are first upgraded to support BGP
   based multicast.  They learn sources either via PIM procedures from
   PIM FHRs, or via Source Active A-D routes from BGP FHRs.  In the
   former case, the RPs can originate proxy Source Active A-D routes.
   There may be a mixed set of RPs/RRs: some capable of both traditional
   PIM RP functionality and SA route redistribution, while others only
   redistribute SA routes.

   Then any routers can be transitioned incrementally.  A transitioned
   LHR pulls Source Active A-D routes from the RPs when it receives
   IGMP/MLD (*,G) joins for ASM groups, and may send either PIM (s,g)
   joins or BGP Source Tree Join routes.
   A transitioned transit router may receive (*,g) PIM joins but only
   sends source tree joins after pulling Source Active A-D routes from
   the RPs.

2.  Specification

2.1.  BGP NLRIs and Attributes

   The same CMCAST SAFI and types of routes as defined in
   [draft-ietf-bess-mvpn-pe-ce] are used, except that the Source Prune
   A-D Route is not used, and two additional types are defined.  In
   summary:

      3 - S-PMSI A-D Route         [new]
      5 - Source Active A-D Route  [new]
      6 - Shared Tree Join Route   [existing]
      7 - Source Tree Join Route   [existing]
      8 - Source Prune A-D Route   [not used]

   Except for the Source Active A-D routes, the routes carry a
   NO_ADVERTISE community so that the receiving peer will not propagate
   them further.

2.1.1.  S-PMSI A-D Route

   Similar to what is defined in RFC 6514, an S-PMSI A-D Route Type
   specific CMCAST NLRI consists of the following, though it does not
   have an RD:

      +-----------------------------------+
      | Multicast Source Length (1 octet) |
      +-----------------------------------+
      |    Multicast Source (variable)    |
      +-----------------------------------+
      | Multicast Group Length (1 octet)  |
      +-----------------------------------+
      |    Multicast Group (variable)     |
      +-----------------------------------+
      |   Originating Router's IP Addr    |
      +-----------------------------------+

   If the Multicast Source (or Group) field contains an IPv4 address,
   then the value of the Multicast Source (or Group) Length field is 32.
   If the Multicast Source (or Group) field contains an IPv6 address,
   then the value of the Multicast Source (or Group) Length field is
   128.

   Usage of other values of the Multicast Source Length and Multicast
   Group Length fields is outside the scope of this document.

   Usage of S-PMSI A-D routes is described in Section 2.2.3 and
   Section 2.2.4.

2.1.2.  Source Active A-D Route

   Similar to what is defined in RFC 6514, a Source Active A-D Route
   Type specific CMCAST NLRI consists of the following:

      +-----------------------------------+
      | Multicast Source Length (1 octet) |
      +-----------------------------------+
      |    Multicast Source (variable)    |
      +-----------------------------------+
      | Multicast Group Length (1 octet)  |
      +-----------------------------------+
      |    Multicast Group (variable)     |
      +-----------------------------------+

   The definitions of the source/length and group/length fields are the
   same as in the S-PMSI A-D routes.

   Source Active A-D routes with a Multicast group belonging to the
   Source Specific Multicast (SSM) range (as defined in [RFC4607], and
   potentially extended locally on a router) MUST NOT be advertised by a
   router and MUST be discarded if received.

   Usage of Source Active A-D routes is described in Section 1.2.3.

2.2.  Procedures

2.2.1.  Originating Tree Join Routes

   When a router learns from IGMP/MLD or a downstream PIM/BGP peer that
   it needs to join an SPT to receive traffic for a particular (s,g)
   flow, it determines the RPF neighbor with respect to the source
   following the same RPF procedures as defined for PIM.  If the RPF
   neighbor supports CMCAST SAFI, it originates a Source Tree Join Route
   and advertises the route to the RPF neighbor (in the case of EBGP or
   hop-by-hop IBGP), or to one or more RRs.

   When a router learns that it needs to join a bi-directional tree for
   a particular group, it determines the RPF neighbor with respect to
   the RPA.  If the neighbor supports CMCAST SAFI, it originates a
   Shared Tree Join Route and advertises the route to the RPF neighbor
   (in the case of EBGP or hop-by-hop IBGP), or to one or more RRs.

   When a router first learns that it needs to receive traffic for an
   ASM group, it originates an RTC route with the NLRI's AS field set to
   its AS number and the Route Target field set to an address based
   Route Target, with the Global Administrator field set to the group
   address and the Local Administrator field set to 0.
   The route is advertised to the RRs, so that the RRs can re-advertise
   the matching Source Active A-D routes to this router.  Upon receiving
   the Source Active A-D routes, the router originates Source Tree Join
   routes as described above, as long as it still needs to receive
   traffic for the flows (i.e., the corresponding IGMP/MLD membership
   exists or a join from a downstream PIM/BGP neighbor exists).

   When a Source/Shared Tree Join route is originated by this router, it
   sets up corresponding forwarding state such that the expected
   incoming interface list includes all non-LAN interfaces directly
   connecting to the upstream neighbor.  LAN interfaces are added upon
   receiving the corresponding S-PMSI A-D route (Section 2.2.4).  In
   this revision, it is assumed that single-hop peering is used for DC
   deployments.  As discussed earlier, additional signaling could be
   used for a router to discover direct interfaces connected to its
   upstream or downstream neighbors.

   The Source/Shared Tree Join routes carry an Address Specific RT, with
   the Global Administrator field set to the upstream peer's address and
   the Local Administrator field set to 0.

2.2.2.  Receiving Tree Join Routes

   A router (auto-)configures Import RTs matching itself so that it can
   import tree join routes from its peers.

   When a router receives a tree join route from a downstream router and
   imports it, it determines whether it needs to originate its own
   corresponding route and advertise it further upstream with respect to
   the source or RPA.  If it is itself the FHR or is on the RPL, then it
   does not need to.  Otherwise the procedures in Section 2.2.1 are
   followed.

   Additionally, the router sets up its corresponding forwarding state
   such that one of the interfaces that directly connects to the
   downstream neighbor is added to the outgoing interface list.
   If there is a LAN interface connecting to the downstream neighbor, it
   MAY be preferred over non-LAN interfaces, but an S-PMSI A-D route
   MUST be originated (Section 2.2.3).  In this revision, it is assumed
   that single-hop peering is used for DC deployments.  As discussed
   earlier, additional signaling could be used for a peer to discover
   direct interfaces connected to its upstream or downstream neighbors.

2.2.3.  Originating S-PMSI A-D Routes

   If this router chooses to use a LAN interface to send traffic to its
   neighbors for a particular (s,g) or (*,g) flow, it MUST announce that
   by originating a corresponding S-PMSI A-D route.  The Tunnel Type in
   the PTA is set to 0 (No tunnel information present).  The LAN
   interface is identified by an IP address specific RT, with the Global
   Administrator field set to the LAN interface's address prefix and the
   Local Administrator field set to the prefix length.  The RT also
   serves the purpose of restricting import of the route to the routers
   on the LAN.  If multiple LAN interfaces are to be used (to reach
   different sets of neighbors), then the route includes multiple RTs,
   one for each LAN interface used, constructed as described above.

   The S-PMSI A-D routes may also be used to announce tunnels that could
   be used to send traffic to downstream neighbors that are not directly
   connected.  This is outside the scope of this document for now.

2.2.4.  Receiving S-PMSI A-D Routes

   A router (auto-)configures an Import RT for each of its LAN
   interfaces over which BGP is used for multicast signaling.  The
   construction of the RT is described in the previous section.

   When a router imports an S-PMSI A-D route, it checks whether it has
   also originated the same route and whether that route has at least
   one RT in common with the received one.  If so, both this router and
   the originator of the received route want to send to the same LANs.
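
   The LAN Route Target construction of Section 2.2.3 and the common-RT
   check described above can be sketched as follows.  This is a minimal
   model under stated assumptions: RTs are represented as (prefix,
   prefix length) tuples rather than encoded extended communities, and
   the function names are hypothetical.

```python
import ipaddress


def lan_route_target(interface: str) -> tuple:
    """Derive the LAN RT from an interface address such as 192.0.2.17/24.

    Global Administrator = the LAN's address prefix,
    Local Administrator  = the prefix length.
    """
    network = ipaddress.ip_interface(interface).network
    return (str(network.network_address), network.prefixlen)


def shared_lans(own_route_rts, received_route_rts) -> set:
    """Return the LAN RTs present on both S-PMSI A-D routes.

    A non-empty result means both originators intend to send the same
    flow onto at least one common LAN.
    """
    return set(own_route_rts) & set(received_route_rts)


if __name__ == "__main__":
    mine = [lan_route_target("192.0.2.17/24"),
            lan_route_target("198.51.100.1/24")]
    theirs = [lan_route_target("192.0.2.33/24")]
    print(shared_lans(mine, theirs))
```

   Two routers on the same LAN derive the same RT from their different
   interface addresses, because only the prefix and its length are
   used; that shared value is what makes the overlap detectable.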

   A conflict of this kind kicks off the assert procedure to elect a
   winner: the router with the highest next hop address wins.  An assert
   loser does not include the corresponding LAN interface in its
   outgoing interface list, but it keeps the S-PMSI A-D route that it
   originated.

   If this router does not have a matching S-PMSI route of its own with
   some common RTs, and the originator of the received S-PMSI route is a
   chosen upstream neighbor for the corresponding flow, then this router
   updates its forwarding state to include the LAN interface in the
   incoming interface list.  When the last S-PMSI route with an RT
   matching the LAN is later withdrawn, the LAN interface is removed
   from the incoming interface list.

   Note that a downstream router on the LAN does not participate in the
   assert procedure.  It adds/keeps the LAN interface in the expected
   incoming interfaces as long as its chosen upstream peer originates
   the S-PMSI A-D route.  It does not switch to the assert winner as its
   upstream.  An assert loser MAY keep sending joins upstream based on
   local policy even if it has no other downstream neighbors (this could
   be used for fast switchover in case the assert winner fails).

3.  Security Considerations

   This document does not introduce new security risks.

4.  Acknowledgements

   The authors thank Marco Rodrigues and Lenny Giuliano for their
   initial idea/ask of using BGP for multicast signaling beyond MVPN.
   We also thank Eric Rosen for his questions, suggestions, and help
   finding solutions to some issues.

5.  References

5.1.  Normative References

   [I-D.ietf-bess-mvpn-pe-ce]
              Patel, K., Rosen, E., and Y. Rekhter, "BGP as an MVPN
              PE-CE Protocol", draft-ietf-bess-mvpn-pe-ce-00 (work in
              progress), April 2015.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast
              (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October
              2007.

   [RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
              Encodings and Procedures for Multicast in MPLS/BGP IP
              VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012.

5.2.  Informative References

   [I-D.ietf-rtgwg-bgp-routing-large-dc]
              Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP
              for routing in large-scale data centers",
              draft-ietf-rtgwg-bgp-routing-large-dc-02 (work in
              progress), April 2015.

Authors' Addresses

   Zhaohui Zhang
   Juniper Networks

   EMail: zzhang@juniper.net


   Keyur Patel
   Cisco Systems

   EMail: keyupate@cisco.com