Inter-Domain Multicast Routing (IDMR) A. Ballardie INTERNET-DRAFT Consultant B. Cain Bay Networks Z. Zhang Bay Networks March 1998. Core Based Tree (CBT) Multicast Border Router Specification Status of this Memo This document is an Internet Draft. Internet Drafts are working doc- uments of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute work- ing documents as Internet Drafts). Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft. Abstract This draft specifies the behaviour of a CBT multicast border router (BR). This specification assumes the use of CBTv3 - the latest CBT protocol version [3]. CBTv3 has capabilities which make CBT equally well suited for use in stub- or transit- domains; this draft describes mechanisms which enable a CBT distribution tree to span only those routers and links leading to interested receivers or receiver-domains. Expires October 1998 [Page 1] INTERNET-DRAFT CBT Border Router Specification March 1998 1. Changes from Previous Revision This draft differs significantly from previous revisions, and incor- porates mostly new procedures and mechanisms. 2. Interoperability Model The interoperability model follows that described in [2]. Particular attention is drawn to sections 1 and 2 of that document. For brevity, some of the more fundamental aspects of interoperability are listed below: +o logically, a BR has at least two "components", each component being associated with a particular multicast routing protocol. Each com- ponent may have more than one associated interface which is running the particular multicast protocol associated with the component. At least one of these components is a CBT component. Figure 1 provides an example (logical) representation of a border router. +o besides a CBT component owning its own (private) forwarding cache (hereafter referred to as the PFC), all components share a common, protocol independent, multicast forwarding cache (hereafter referred to as the SFC) which supports source specific (i.e (S, G)), and source independent (i.e. (*, G)) entries. The latest CBT specification recommends that all CBT router implementations include support for an SFC, allowing any CBT router to assume the role of Border Router if necessary. A CBT component's PFC must support (*, G), (S, G), and (*, Core) entries; (*, Core) entries are not relevant to the SFC. To ensure that all components have a consistent view of the SFC a BR's components must be able to communicate with each other; how is implementation dependent (guidelines provided in [2]). +o the parent for all PFC entries shall point towards the local domain core for G. There is no notion of "incoming" interface wrt any PFC state. The semantics of the SFC cannot be stated until such time as the inter-domain multicast routing architecture is fully understood. Expires October 1998 [Page 2] INTERNET-DRAFT CBT Border Router Specification March 1998 +o It is suggested the SFC is only used on active CBT Border Routers; the PFC is used on all other CBT routers. +o mixed multicast protocol LANs are not permitted. ------------X----------------------X--X-------- | | | | | | X = component | | comp A | | comp B | | interface | ------------- ----------- | | | comp = component | ----------------------------- | | | Shared Multicast | | | | Forwarding Cache (S,G) | | | ----------------------------- | | | | ------------ ---------- | | | comp C | | comp D | | | | | | | | ----------X----X----------------------X-------- Figure 1: Example Representation of a Border Router 3. Multicast ASs vs. Multicast Domains/"Regions" It is important to distinguish between a multicast Autonomous System (AS) - an AS whose unicast and multicast routing boundaries are aligned, and a multicast domain (or "region"), whose unicast and mul- ticast routing boundaries are not aligned. In the former case BGP-4 [6] is often deployed as a separator between interior and exterior routing. For multicast ASs BGP-4+ [1] is assumed, which allows a domain to express multicast policy, i.e. "come from" paths, as well as unicast policy, i.e. "go to" paths, for particular network addresses (or prefixes). The advantage of BGP-4+ is highlighted in the case where a multicast AS is multi-homed (multi-homed stub AS, or transit AS) - BGP-4+ has the ability to select a single ingress border router (BR) per external multicast source (network or prefix), thereby avoiding the potential for the very damaging effect of multicast packet duplicates being injected into a domain (or AS). Expires October 1998 [Page 3] INTERNET-DRAFT CBT Border Router Specification March 1998 Note that, if BGP-4+ is assumed, it must be deployed on all of an AS's BRs. In circumstances where BGP-4+ cannot be assumed, a single ingress BR for a particular multicast source (network or prefix) must be selected by alternate means. One alternative would be to manually configure a CBT domain's multicast BRs, but this does not scale for large numbers of BRs. We therefore recommend that each of a CBT domain's BRs implement an "arbitration process" on each BR, responsi- ble for dynamically selecting a single ingress BR per multicast source (network or prefix). This scheme arbitrates using (multicast) routing metrics as its BR selection criterium; its goal is to select a single ingress BR per external source, not replace BGP-4+ with its fine-grained policy expression capabilities. The resulting effect of the arbitration process is to allow a CBT component to potentially create/modify/delete a BR shared forwarding cache entry as necessary to prevent the injection of multicast dupli- cates inside the CBT domain. A description of one possible implementation of an "arbitration pro- cess" is provided in the Appendix. Hereafter, the terms "multicast domain" and "multicast AS" will be simply referred to as multicast "domain". 4. The Architectural Model This section explains the overall architecture of a CBT multicast domain that attaches to other multicast domain(s). +o a CBT Border Router (BR) which is used to forward traffic towards an external receiver domain can be thought of as a group member wrt the CBT domain. +o Domain Wide Reports (DWRs) (see section 5) are used in a CBT domain so that BRs learn of internal group membership, and downstream domain group membership. DWRs need not necessarily be implemented in singly-homed stub CBT domains, at the potential cost of traffic flowing unnecessarily between the ingress BR and core router(s). Expires October 1998 [Page 4] INTERNET-DRAFT CBT Border Router Specification March 1998 +o DWRs are only sent by core routers. They are sent whenever a core router gets its first child, or when a core loses its last child. DWRs are distributed via the "all-cbt-border-routers" (ABR) multi- cast group, administratively scoped as 239.X.X.X. All CBT BRs must join this group at initialization time. +o CBT core routers have authoritative group membership information for the CBT domain for those groups for which they are the core. Hence, DWRs sent by core routers are authoritative - BRs use DWRs to decide whether or not to inject traffic into the CBT domain. +o CBT core routers also have authoritative group membership informa- tion wrt BRs' attached domain(s) - this is reflected in a core router's forwarding cache; BRs not interested in receiving traffic on behalf of a neighbouring domain send a QUIT_NOTIFICATION (prune) of the corresponding granularity towards the core router. Hence, a core router knows whether or not there are interested receivers downstream of it (internal or external). +o CBT BRs may issue (*, Core), (*, G), or (S, G) quits (prunes). In CBT, quits always instantiate uni-directional prune state; by send- ing a quit the BR is electing not to receive traffic via the CBT domain, but may inject externally sourced traffic into the CBT domain. A quit always follows the state that it is pruning - towards the core; if a quit reaches a core router, it is never for- warded beyond the core router. +o A BR may instantiate a priori state between itself and a core router, or that state may be explicitly invoked (see section 4.1 below). For the case where no a priori state exists between a BR and core router, if the BR receives externally sourced data and is the ingress BR for that data, if the BR has a cached DWR Join (received recently from a core router) the ingress BR instantiates (*, Core) (uni-directional) state between itself and the core, UNLESS the BR has appropriate pre-existing (*, G) or (S, G) state. This way, externally sourced data traffic can always be injected natively into the CBT domain. +o A join explicitly invoked by another BR component (as opposed to a DWR) - signalling that a neighbouring domain is interested in a group - instantiates bi-directional state between the BR and core (otherwise data would not be "pulled down" to the BR). This state may be (*, Core), (*, G), or (S, G). Expires October 1998 [Page 5] INTERNET-DRAFT CBT Border Router Specification March 1998 In each of the following two sections we look at the procedures followed by two different circumstances: firstly, when a BR's neighbouring domain is able to explicitly signal its group membership, and secondly, when a BR's neighbouring domain cannot (or cannot to the same degree compared to the first case) explicitly signal group membership. 4.1. A Neighbouring Domain can Explitly Signal Group Membership +o CBT BR's send (*,G), (S,G), or (*, Core) JOIN_REQUESTs towards the relevant core router when a neighbouring domain has group members, i.e. a join-alert is received by the CBT BR component from another BR component. The resulting state is bi-directional. +o CBT BR's send (*,G), (S,G), or (*, Core) QUIT_NOTIFICATIONs towards the relevant core router when the neighbouring domain no longer has group members, i.e. a prune-alert is received by the CBT component from another BR component. +o When a core router gets its first child or loses its last child it issues a DWR of the corresponding granularity. This is received by all BR's; as a result, the BRs know whether or not to inject exter- nally sourced traffic. 4.2. A Neighbouring Domain cannot Explitly Signal Group Membership +o CBT BR's instantiate (*,Core) state at initialization time to all core routers in a CBT domain that are associated with inter-domain scoped groups. The resulting state is bi-directional. +o Though a neighbouring BR component might not be explicitly informed of group membership inside its domain, it may still send prune- alerts (e.g. DVMRP) to the CBT component. Upon receiving a prune- alert from another component, the CBT component sends a (*, G), (S, G), or (*, Core) QUIT_NOTIFICATION (prune) towards the relevant core. Since a quit (prune) is always uni-directional, the BR - if ingress for some external sources - is still able to inject exter- nally sourced data into the CBT domain. Expires October 1998 [Page 6] INTERNET-DRAFT CBT Border Router Specification March 1998 +o Though a neighbouring BR component might not be explicitly informed of group membership inside its domain, it may still send join- alerts (e.g. DVMRP) to the CBT component. Upon receiving a join- alert from another component, the CBT component sends a (*, G) (or (S, G)) JOIN_REQUEST toward the relevant core, UNLESS there exists appropriate non-pruned less specific state, i.e. (*, Core). The resulting state is bi-directional. 4.3. Architectural Summary In the context of multicast domain interconnection, a CBT domain exhibits the following attributes: +o if at least one of a BR's neighbouring domains cannot explicitly signal group membership, the BR must instantiate a priori (*, Core) state (bi-directional) between itself and each domain core. +o if all of a BR's neighbouring domains can explicitly signal group membership, the BR need not instantiate any state between itself and domain cores until group membership is signalled. +o if a BR receives a DWR Join from a domain core, the DWR is cached. If, during the DWR cache lifetime data arrives for a member group and the BR is the ingress BR for that data, the BR instantiates uni-directional (*, Core) state between itself and the core so the data can be injected into the CBT domain natively, UNLESS there exists appropriate (*, G) or (S, G) state. A BR may explicitly tear down (using a quit message) uni-directional (*, Core) state after a suitable data flow idle period, or the state may remain. +o if a BR receives a DWR Leave from a domain core, the DWR is cached. A DWR Leave results in BRs with any (*, G) or (S,G) PFC (bi-direc- tional) states pruning the relevant state by sending a quit mes- sage. The BR also removes the appropriate interface from its SFC entry. An ingress BR which is receiving traffic for the now pruned group (injecting it using (*,Core) state) either tears down the one-way (*, Core) state, or marks its parent PFC as pruned; this is the ONLY instance of a parent interface being pruned. Expires October 1998 [Page 7] INTERNET-DRAFT CBT Border Router Specification March 1998 5. Domain Wide Reports (DWRs) Domain Wide Reports (DWRs) are used in a CBT domain to enable BRs to learn - in a dynamic and timely fashion - of internal group member- ship, and downstream domain group membership. Group member- ship/absence is indicated by means of DWR Join and Leave messages, respectively. It is assumed DWRs are refreshed periodically, and cached by receiv- ing BRs for a lifetime of X seconds, after which time they expire in the BR's DWR cache. If DWRs are in use in a CBT domain, they are only ever issued by core routers. DWRs issued by core routers are authoritative. DWRs can be source-group specific (S, G), or source independent (*, G). A DWR represents aggregated state where possible. For example, if a core has only one child for each of its (*, G) and (S, G) states, it generates a (*, G) DWR to cover (*, G) and (S, G). Finer grained aggregates may be represented if DWRs support mask information. The DWR processing rules are as follows: +o whenever a CBT component receives an (S,G) DWR Join message it is only processed by the ingress BR for S. If there exists no (S, G) SFC entry at the ingress BR, an (S, G) SFC entry is created by the CBT component, and an (S, G) Creation-Alert is generated. Then (or if an (S, G) entry already exists) the interface via which the DWR originating core router is reachable is added (or un-pruned) in the entry's child list. If this interface is the only interface in the child list of the entry, the CBT component generates an (S, G) Join-Alert. If the CBT component's PFC does not have equal- or less specific state that includes the same interface, a (*, G) JOIN_REQUEST (including the "uni-directional" join option) is sent over the interface towards the domain core for G. [An (S,G) join is not sent because (S,G) state does not exist between the ingress BR and core router]. +o whenever a CBT component receives an (S,G) DWR Leave message it is processed only by the ingress BR for S. If the ingress BR for S has an equal- or less specific SFC entry that includes a pruned Expires October 1998 [Page 8] INTERNET-DRAFT CBT Border Router Specification March 1998 outgoing interface corresponding to that leading to the relevant domain core router, no further action need be taken. Otherwise, an (S, G) SFC entry is created (causing an (S, G) Cre- ation-Alert), and the interface leading to the relevant domain core router is included in the outgoing interface list and marked as pruned. If no further non-pruned children remain in the (S, G) SFC child list, the CBT component sends an (S, G) Prune-Alert to the entry's owner. +o whenever a CBT component receives a (*, G) DWR Join, a (*, G) SFC entry is created (unless it, or a less specific entry, already exists) and the interface leading to the domain core for G is added as a child in the entry. The CBT component's PFC is checked to ensure the same interface belongs to an equal- or less specific entry. If no such entry exists, the CBT component sends a (*, G) JOIN_REQUEST (uni-directional) towards the domain core for G, instantiating (*, G) PFC state. +o When a DWR (*, G) Leave message is received by a CBT component if no (*, G) SFC entry exists one is created, and the interface lead- ing to the relevant domain core router is added as a child and marked as pruned. If no more non-pruned (*, G) SFC children remain, the CBT component sends an (*, G) Prune-Alert to the entry's owner. 6. More BR Component Interactions +o upon receipt of an (S,G) Join-Alert (see [2]) by a CBT component, if the interface towards the domain core for G is owned by the CBT component, it adds the interface as the (S, G) SFC entry's incoming interface. If the CBT component's PFC has no equal- or less specific state that includes the same interface, an (S, G) JOIN_REQUEST is sent over the interface towards G. +o the receipt of a (*,G) Join-Alert (see [2]) by a CBT component results in the CBT component including the interface leading to the Expires October 1998 [Page 9] INTERNET-DRAFT CBT Border Router Specification March 1998 relevant domain core as the parent in the (*, G) SFC entry. The CBT component's PFC is checked to ensure the same interface belongs to an equal- or less specific entry. If no such entry exists the CBT component sends a (*, G) JOIN_REQUEST towards the domain core for G, instantiating (*, G) PFC state. +o upon receipt of an (S,G) Prune-Alert (see [2]) by a CBT component, if the next-hop interface towards the domain core for G is owned by the CBT component, the CBT component removes the interface from the (S, G) SFC entry, and an (S, G) QUIT_NOTIFICATION is sent towards the domain core for G, instantiating (S, G) PFC prune state. +o the receipt of an (*,G) Prune-Alert (see [2]) by a CBT component causes the CBT component to remove the interface leading to the relevant domain core from the (*, G) SFC entry, then send a (*,G) QUIT_NOTIFICATION over that interface. +o whenever a more specific PFC entry is created and there exists a less specific entry/entries, the child list of the new entry is the union of the less specific entry/entries child list(s). The child list of the new entry must also include the interface over which the triggering control message was received. 7. Tunnel Issues IP multicast deployment in the Internet is a slow process. A unicast AS may not have the resources, or it may not be practical, to migrate a complete AS into one that is completely multicast capable (i.e. multicast AS), so multicast "islands" (i.e. multicast domains - see section 3) may be created within the unicast AS infrastructure as part of a longer term migration strategy. A shortage of resources may mean that core routers must be shared between any multicast domains, implying that, from the perspective of the Bootstrap Mechanism [5], multiple multicast domains may be seen as one single domain. Configured (IP-in-IP) tunnels provide multi- cast connectivity between the multicast domain "islands". Under these circumstances the Bootstrap Mechanism operating within each multicast domain must be modified such that, if the core router for some set of groups belongs to another domain, the local domain tunnel end-point advertises ("proxies") itself as the core router for that set of groups. The tunnel end-point (router) simply acts as a Expires October 1998 [Page 10] INTERNET-DRAFT CBT Border Router Specification March 1998 "relay" agent for CBT joins, forwarding them to the remote tunnel end-point for onward forwarding. Acknowledgements Special thanks goes to Paul Francis, NTT Japan, for the original brainstorming sessions that led to the development of CBT. Others that have contributed to the progress of CBT include Ken Carl- berg, Eric Crawley, Jon Crowcroft, Bill Fenner, Mark Handley, Ahmed Helmy, Nitin Jain, Alan O'Neill, Steven Ostrowsksi, Radia Perlman, Scott Reeve, Benny Rodrig, Clay Shields, Martin Tatham, Dave Thaler, Sue Thompson, Paul White, and other participants of the IETF IDMR working group. Thanks also to 3Com Corporation and British Telecom Plc for assisting with funding this work. References [1] Multiprotocol Extensions to BGP-4; T. Bates et al. ftp://ds.internic.net/internet-drafts/draft-ietf-idr-bgp4-multiproto- col-02.txt [2] Interoperability Rules for Multicast Routing Protocols; D. Thaler; ftp://ds.internic.net/internet-drafts/draft-thaler-multicast- interop-01.txt; Working Draft, March 1997. [3] Core Based Trees (CBTv3) Multicast Routing: Protocol Specifica- tion; A. Ballardie, B. Cain, Z. Zhang; ftp://ds.internic.net/inter- net-drafts/draft-ietf-idmr-cbt-spec-**.txt Working Draft, March 1998. [4] Domain Wide Multicast Group Membership Reports; W. Fenner; draft- ietf-idmr-membership-reports-00.txt; Working Draft, November 1997. [5] A Dynamic Bootstrap Mechanism for Rendezvous-based Multicast Rout- ing; D. Estrin et al.; Technical Report, available from: http//netweb.usc.edu/pim [6] A Border Gateway Protocol 4 (BGP-4); Y. Rekhter and T. Li; RFC 1771, March 1995. ftp://ds.internic.net/rfc/rfc1771.txt Expires October 1998 [Page 11] INTERNET-DRAFT CBT Border Router Specification March 1998 APPENDIX The BR "Arbitration Process" The specific details of the arbitration process (AP) are implementa- tion dependent, but we provide an outline of its possible operation for reference. The AP is applicable to any multi-homed (stub or transit) CBT domain whose BRs have not deployed BGP-4+ [1]. The goal of the arbitration process is to allow CBT BRs attached to a CBT domain to select a sin- gle ingress BR per external multicast source in a timely fashion. A diagram showing the arbitration process in a CBT BR component is shown in figure 2. ------------X----------------------X--X-------- | | | | | | X = component | | comp A | | comp B | | interface | ------------- ----------- | | ----------------------------- | | | Shared Multicast | | | | Forwarding Cache | | | ----------------------------- | | comp C _______ | | _______| AP | | | | CBT | | ___________ | | | ----> | | comp D | | | | <---- | | | | ----------X----X----------------------X-------- Figure 5: Logical Representation of the Arbitration Process The arbitration process is only triggered by externally sourced pack- ets, i.e. those passed to a CBT component interface by the BR process that handles SFC forwarding); the AP is not triggered by data packets arriving from inside the CBT domain. All of a CBT domain's BRs are joined to the "all-CBT-border-routers" (ABR) multicast group (see section 4), the group address for which is domain-scoped, 239.X.X.X. This group is used both by the arbitration process, and by CBT transit domains for propogating multicast routing Expires October 1998 [Page 12] INTERNET-DRAFT CBT Border Router Specification March 1998 information (in cases where multicast topology discovery is neces- sary). Whenever a CBT component receives an externally sourced data packet for the first time (since some time 't-zero') the CBT component arbi- tration process is invoked. The arbiter queries the BR component owning the interface via which the externally sourced data packet arrived, i.e. the component owning the interface nearest to S, to find out that component's current (multicast) metric for S. The arbiter multicasts an (S, G, metric, protocol component) tuple - where protocol component is DVMRP, PIM- DM, etc. - to the ABR group, and the message is processed by each receiving CBT component arbitration process. The receiving arbiter(s) cache the received tuple. The unsolicited arrival of a triple triggers the receiving arbiter to reply with its own corresponding tuple; those CBT components not receiving the (S,G) multicast data simply reply with a NULL tuple, where "metric" and "protocol component" are both NULL. Failure to receive a reply from each other BR belonging to the ABR group results in the tuple being resent (unicast, unless no reply is forthcoming from any BR) after 3 (??) seconds. The receipt of a non-NULL tuple with a better metric for S than this BR's tuple (for the same protocol), or an equal metric but sent by a lower-addressed BR, causes the arbiter to instigate the removal of the CBT component interface from the relevant SFC entry's child list. The subsequent arrival of data packets from S are injected into the CBT domain via a single BR, with the same packets arriving at any of the other BRs being filtered. The arrival of subsequent data packets does not result in any exchanges between BR arbiters for the lifetime of the relevant cache entry's timeout period, which must be synchro- nised across all of the domain's BRs (recommended lifetime, X secs). Whilst this method does not guarantee against multicast duplicates being injected into the CBT domain, it should ensure that any dupli- Expires October 1998 [Page 13] INTERNET-DRAFT CBT Border Router Specification March 1998 cation is short-lived. Author Information: Tony Ballardie, Research Consultant. e-mail: ABallardie@acm.org Brad Cain, Bay Networks Inc., 3, Federal Street, Billerica, MA 01821, USA. e-mail: bcain@baynetworks.com voice: +1 978 916 1316 Zhaohui "Jeffrey" Zhang, Bay Networks Inc., 600 Technology Park Drive, Billerica, MA 01821, USA. Phone: +1 (978) 439 0280 Fax: +1 978 670 8760 e-mail: zzhang@baynetworks.com Expires October 1998 [Page 14]