INTERNET-DRAFT                                   T. Anker, D. Breitgand
File: draft-anker-congress-01.txt                    D. Dolev, Z. Levy
                                          The Hebrew Univ. of Jerusalem
Expiration: 18 July 1998

                  IMSS: IP Multicast Shortcut Service

Status of this Memo

   This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

   This memo describes an IP Multicast Shortcut Service (IMSS) over a large ATM cloud. The service enables cut-through routing between routers serving different Logical IP Subnets (LISs). The presented solution is complementary to MARS [2], adopted as the IETF standard solution for IP multicast over ATM.

   IMSS consists of two orthogonal components: CONnection-oriented Group address RESolution Service (CONGRESS) and IP multicast SErvice for Non-broadcast Access Networking TEchnology (IP-SENATE). An IP class D address is resolved into a set of addresses of multicast routers that should receive the multicast traffic targeted to this class D address. This task is accomplished using CONGRESS. The cut-through routing decisions and actual data transmission are performed by IP-SENATE.

   IMSS preserves the classical LIS model [8]. The scope of IMSS is to facilitate inter-LIS cut-through routing, while MARS provides tools for the intra-LIS IP multicast.
Anker, Breitgand et. al        Expires July 1998               [Page 1]
Internet Draft                                              1 July 1997

Table of Contents

   1.     Introduction
   1.1    Background
   1.2    CONGRESS
   1.3    IP-SENATE
   2.     Discussion
   3.     IMSS Overview
   3.1    Network Model
   3.2    CONGRESS
   3.2.1  CONGRESS' API
   3.3    IP-SENATE
   4.     Architecture
   4.1    CONGRESS Architecture
   4.2    IP-SENATE Architecture
   4.3    IMSS Architecture
   5.     CONGRESS Protocol
   5.1    Data Structures
   5.2    IMSS Router Joining/Leaving a D-group
   5.3    Reception of Incremental Membership Notifications
   5.4    Resolution of D-Group Address
   5.5    Handling of Failures
   5.5.1  IMSS Router Failure
   5.5.2  Domain Failure
   5.5.3  Domain Recovery
   6.     IP-SENATE Protocol
   6.1    Main Data Structures
   6.2    Maintenance of D-groups
   6.2.1  Joining D-Groups
   6.2.2  Leaving D-Groups
   6.2.3  Client and Server Operational Roles
   6.2.4  Regular and Sender-Only Modes
   6.3    Forwarding Decisions
   6.3.1  A Server Receives a Datagram from a Client
   6.3.2  A Server Receives a Datagram from another Server
   6.3.3  A Client Receives a Datagram from an IDMR Interface
   6.3.4  A Server Receives a Datagram from an IDMR Interface
   6.3.5  Pruning Mechanism
   7.     Fault Tolerance
   8.     Security Considerations
   9.     Message Formats
   9.1    CONGRESS Messages
   9.2    IP-SENATE Messages
   10.    References
   11.    Acknowledgments
   12.    List of Abbreviations

1. Introduction

   As was noted in VENUS [3]: "The development of NHRP [21], a protocol for discovering and managing unicast forwarding paths that bypass IP routers, has led to some calls for an IP multicast equivalent. Unfortunately, the IP multicast service is a rather different beast to the IP unicast service." The problems correctly identified by VENUS can be divided into two broad categories: 1) problems associated with multicast group membership maintenance and resolution and 2) problems concerned with multicast routing.
   Although VENUS "...focuses exclusively on the problems associated with extending the MARS model to cover multiple clusters or clusters spanning more than one subnet", most of the discussed problems are, in fact, intrinsic to any cut-through routing solution. The main conclusion that one can draw from VENUS is that these problems cannot be solved just by a straightforward extension of MARS to cover multiple LISs.

   This memo presents a solution that relies on MARS for intra-LIS multicast communication, and uses an alternative methodology to provide an inter-LIS multicast shortcut service that scales to large ATM clouds.

   It is assumed that the reader is familiar with the classical LIS model [8], MARS [2] and the basics of the Inter-Domain Multicast Routing (IDMR) protocols [4,5,9,10,11].

   This document has two goals:

   o To provide a generic protocol for dynamic mapping of any IP class D address onto a set of the multicast routers that have ATM (or any other SVC-based Data Link subnetwork) connectivity and have either directly attached hosts, or downstream routers (w.r.t. a specific multicast tree) that need to receive the corresponding multicast traffic. The resolved addresses are used to establish the shortcut ATM connections among the multicast routers. The mapping protocol should be independent of any underlying IP multicast protocol.

     It should be specifically noted that this document proposes usage of the shortcut multicast connections on a per-source basis. This is motivated by the fact that the shortcut connections will be mainly used by multicast applications that need guaranteed QoS. For all other multicast applications the current IP over ATM paradigm would probably suffice. Multicast applications that require QoS, such as video-conferencing, transmission of high quality video streams, interactive games, etc., will usually involve a small number of sources and will require source-specific multicast trees in order to achieve the required QoS.
   o To provide a solution for the generic interoperability and routing problems that arise when any cut-through routing protocol is deployed in conjunction with the existing IDMR protocols.

   This document proposes an architectural separation between the two problem domains above, so that each of them can be tackled with the most appropriate methodology and in the most generic manner.

1.1 Background

   The classical IP network over an ATM cloud consists of multiple Logical IP Subnets (LISs) interconnected by IP routers [8]. The standardized solution for IP Multicast over ATM, the Multicast Address Resolution Service (MARS [2]), follows the classical model. In the MARS approach, each LIS is served by a single MARS server and is termed a "MARS cluster".

   MARS can be viewed "as an extended analog of the ATM ARP server [8]". From the IP multicast perspective, MARS is functionally equivalent to IGMP [1]. Similarly to IGMP, a MARS server registers the hosts that are directly attached to a multicast router and are interested in receiving multicast traffic targeted to a specific IP class D address. The important difference, however, is that MARS is aware of the connection-oriented nature of the underlying network. For each relevant IP class D address, the MARS server maintains a set (membership) of the hosts that belong to the same LIS and have been registered to receive IP datagrams sent to this address.

   The process of mapping an IP class D address onto a set of ATM end-point addresses is termed "multicast address resolution". Each such set is used to establish native ATM connections between an IP multicast router and the local members of the IP multicast group. The IP multicast datagrams targeted to a specific class D address are propagated over these connections.
   The layout of the ATM connections within a MARS cluster may be based either on a mesh of point to multipoint (ptmpt) Virtual Circuits (VCs) [6,7], or on a Multicast Server (MCS).

   There is work in progress to distribute the MARS server in order to provide for load balancing and fault tolerance [17]. A group of redundant MARS servers will constitute a single logical entity that provides the same functionality as a non-distributed MARS server.

   There is other work in progress, EARTH [12], that intends to extend the scope of the services provided by MARS to multiple LISs. EARTH defines a Multicast LIS (MLIS) that is composed of a number of LISs and is served by a single EARTH server. Due to the centralized approach taken by EARTH, very large MLISs would ultimately look like very large MARS clusters. Thus the discussion and the conclusions provided in VENUS are equally applicable to EARTH.

   In the classical LIS model, a LIS has the following properties:

   o All members of a LIS have the same IP network/subnet number and address mask;

   o All members of a LIS are directly connected to the same NBMA subnetwork;

   o All hosts and routers outside the LIS are accessed via a router;

   o All members of a LIS access each other directly (without routers).

   In the MARS model, which retains the LIS model, it is assumed that all the multicast communication outside the LISs is performed via multicast routers that run some IDMR protocols.

   As explained in [13], the classical LIS model may be too restrictive for networks based on switched virtual circuit technology, e.g., ATM. If LISs share the same physical ATM network (ATM cloud), the LIS internetworking model may introduce extra routing hops. This mismatch between the IP and ATM topologies complicates full utilization of the capabilities provided by the ATM network (e.g., QoS).
   In addition, the extra routing hops impose an unnecessary segmentation and reassembly overhead, because every IP datagram must be reassembled at every router so that the router can make routing decisions.

   The "short-cut" (or "cut-through") paradigm seeks to eliminate the mismatch between the topology of IP and that of the underlying ATM network. Unfortunately, as was already stated above, bypassing the extra routing hops is not a trivial task.

1.2 CONGRESS

   The purpose of cut-through routing is to establish direct communication links among the multicast group members. The discovery of the multicast group members' addresses is performed by a multicast group address resolution and maintenance service. Generally this service maps some application-defined character string, a multicast group address, onto a set of identifiers of the group members.

   Since a multicast group address resolution and maintenance service is crucial to any multicast routing short-cut solution over NBMA networks, it is appropriate to ask whether it should be implemented once as a generic stand-alone service or tailored specifically to each and every multicast short-cut service. The tradeoff here is between generality and efficiency w.r.t. a specific multicast routing protocol.

   In the IMSS approach a general multicast address resolution service, CONGRESS, is used. CONGRESS is a multicast address resolution and maintenance service for NBMA networks that is independent of the underlying multicast protocol. It is a generic stand-alone service. Although CONGRESS may be exploited by native ATM applications, as well as by the network layer (IP), this document focuses only on the aspects of CONGRESS related to IP. In fact, a reduced version of CONGRESS having the minimal set of features is presented in this memo. The interested reader is encouraged to refer to [14] for more information.
   CONGRESS operates in the native ATM environment. Its purpose is to provide a multicast address resolution and maintenance service scalable to a large ATM WAN. The CONGRESS design is based on the following principles:

   o No flooding: CONGRESS does not flood the WAN on every multicast group membership change.

   o Hierarchical design: CONGRESS services are provided to applications by multiple hierarchically organized servers.

   o Robustness: Due to network failures and/or network reconfiguration and re-planning, some CONGRESS servers may temporarily disconnect and later reconnect. CONGRESS withstands such transient failures by providing a best-effort service to applications.

   It is important to stress that CONGRESS is not concerned with the actual data transfer. Its functionality is limited to the resolution of multicast group addresses upon requests from the applications. An overview of CONGRESS is provided in Section 3.2.

1.3 IP-SENATE

   IP-SENATE is the second component of IMSS. It is concerned with the actual IP datagram transmission over the short-cut communication links, establishment of these links, routing decisions and interoperability with the existing IDMR protocols. IP-SENATE provides a solution for the problems arising from bypassing the multicast routers. Most of these problems are general and independent of the underlying IDMR protocols. The design philosophy of IP-SENATE is based on the following principles:

   o IP-SENATE is a best effort service. IP-SENATE does not guarantee that a short-cut is always possible, but it attempts to perform the short-cut wherever possible.

   o Short-cut is performed only among the multicast routers and not directly among hosts.

   o IP-SENATE facilitates (a) communication based on a full mesh of ptmpt connections, (b) communication based on multicast servers and (c) a hybrid form of communication based on the previous two.
   o IP-SENATE facilitates migration from a mesh of ptmpt connections to multicast server-based connections, and load balancing among the multicast servers, without a need for global reconfiguration.

   o IP-SENATE uses CONGRESS services for resolution of the multicast addresses into a set of addresses of the relevant multicast routers, and for maintenance of these sets. IP-SENATE may use any other service providing the same functionality as CONGRESS.

   o IP-SENATE is an inter-LIS protocol. It extends only the IDMR routers. The host interface to IP multicast services [19] is not changed.

   o IP-SENATE relies on MARS to facilitate all the intra-LIS IP multicast traffic.

   o IP-SENATE does not assume a single multicast routing domain. IP-SENATE is designed to operate in a heterogeneous network where the network consists of multiple interconnected multicast routing domains. Consequently, IP-SENATE is not tailored for any specific multicast routing protocol, but can be dynamically configured to inter-work with different multicast protocols.

   o IP-SENATE is to be implemented as an extension to the existing multicast routing software.

2. Discussion

   A designer of a short-cut multicast routing solution is faced with multiple non-trivial problems. The more prominent problems are discussed below.

   o If hosts are allowed to communicate directly with other hosts (as in [3]), bypassing the multicast routers, then each host must maintain membership information about all other hosts scattered all over the internet and belonging to the same IP multicast group. This scheme does not scale well because:

      - The hosts must maintain large amounts of data that should be kept consistent and updated.

      - A considerable traffic and signalling overhead is introduced when membership changes, e.g., join or leave events, are flooded over the network.
      - As was noted in RFC2121 [18], an ATM Network Interface Card (NIC) is capable of supporting a limited number of connections (i.e., VCs originating from a NIC or terminating at a NIC). If a full mesh of ptmpt VCs is used for cut-through communication within a multicast group, NICs might not be capable of supporting all the simultaneous connections.

   o To solve the NIC limitations problem, the current IETF IP multicast over ATM solution, MARS, supports a migrate functionality that allows switching from a mesh of ptmpt connections to multicast server based communication within a single MARS cluster. It is not clear how to extend this functionality to a large ATM cloud. Such switching obsoletes membership information kept at the hosts that are scattered throughout the internet. As a result, some currently active connections may become stale or terminate abruptly.

   The IMSS solution presented in this memo performs cut-through only among the multicast routers, reducing the problems above to a certain extent. The NIC limitation problem is not completely eliminated, however. Hence, IMSS facilitates deployment of "multicast servers" for other routers that are termed "clients". In IMSS some of the multicast routers may also function as multicast servers.

   Cut-through mechanisms may have a negative impact on the conventional IDMR protocols. For the sake of discussion of the interoperability issues with the IDMR protocols, we divide the IDMR protocols into two large families: "broadcast & prune"-based [10] and "explicit join"-based [4,5,9,11]. In the first model, periodical flooding of the network and the subsequent pruning of irrelevant branches of the multicast propagation trees is employed. In the second model, some explicit information about the topology of the IP multicast groups is exchanged among the multicast routers.

   As we see it, a cut-through solution will have to co-exist with a regular Inter-Domain Multicast Routing protocol in the same routing domain. One of the reasons for deployment of an IDMR protocol in addition to the cut-through mechanism, in the same ATM cloud, is that it is not guaranteed that cut-through connections can reach all the relevant targets in the ATM cloud.

   [Figure 1. Labels: IP cloud (DVMRP); IP/ATM cloud. Legend: S - source; D - destination; R - DVMRP router; CTR - cut-through router; x - cut-through connection; # - DVMRP branch.]

   Another important reason is that if a "broadcast & prune" IDMR protocol is used in some non-ATM based IP subnetworks connected to the ATM cloud, the border routers that connect these subnetworks to the ATM cloud do not receive explicit notifications that some downstream routers could be a part of an IDMR multicast propagation tree (as depicted in Figure 1). Thus, the broadcast & prune mechanism of the IDMR protocol should be exploited periodically by the cut-through multicast routers in order to learn about the downstream routers that depend on them. The discovery process is based on analysis of the prune messages that the multicast router receives from the neighboring routers.

   On the other hand, the co-existence of IDMR protocols with the cut-through solution raises several problems:

   o Routing decisions are normally made at the multicast routers. If hosts can bypass a multicast router, the latter should be aware of all the hosts in its own LIS (and in all of the downstream LISs) that participate in the cut-through connections. Otherwise the IDMR protocols would not be able to construct the multicast propagation trees correctly and multicast datagrams may be lost.

   o If a multicast cut-through mechanism is deployed in conjunction with some IDMR protocol, then conflicts with Reverse Path Forwarding (RPF) [20] may occur. The RPF mechanisms prevent routing loops and are crucial for the correct operation of IDMR protocols. Thus, the cut-through traffic should be treated carefully in order not to confuse the IDMR protocol.

   o A multicast distribution tree of an IDMR protocol may span non-ATM based IP subnetworks and contain more than one border router that connects these subnetworks to the ATM cloud, as shown in Figure 2. If these border routers maintain cut-through ATM connections to all other relevant border routers, undesired datagram duplication may result.

   o Another scenario that may lead to routing loops and undesired datagram duplication may arise when both a cut-through mechanism and some conventional IDMR protocol are deployed in the same ATM cloud. This means that an IDMR tree spans some routers within the ATM cloud and not only the border routers.

   [Figure 2. Labels: IP/ATM + Shortcut Domain; DVMRP Domain; CTR(a); CTR(b). Legend: S - the source; R - IP router; CTR - cut-through router; x - cut-through connection; # - DVMRP branch.]

3.
IMSS Overview

   IMSS organizes IP multicast routers into logical groups, where each group corresponds to some class D IP address and contains routers that have members of this IP multicast group, or senders to it, in their domain. These groups are termed "D-groups". D-groups are further discussed in Section 4.2. The resolution and management of these multicast router groups is performed through the CONGRESS services described in Section 3.2.

3.1 Network Model

   In this memo, the physical layer is assumed to be comprised of different interconnected Data Link subnetworks: ATM, Ethernet, Switched Ethernet, Token Ring, etc. IMSS facilitates IP multicast data transfer over a large-scale Non-Broadcast Multiple Access (NBMA) network.

   We assume that ATM is the underlying NBMA network. We call a single ATM Data Link subnetwork an ATM cloud. For administrative and policy reasons a single ATM cloud may be partitioned into several disjoint logical ATM clouds, so that direct connectivity is allowed only within the same logical cloud. Hereafter, unless otherwise specified, we use the term ATM cloud to mean logical ATM cloud.

   We assume that the network layer is IP. The topology of the IP network consists of hosts (that may be either ATM based or non-ATM based) and IP routers. IP multicast traffic (which is our focus) is routed using IP multicast routers running some (potentially different) IDMR protocols.

   The internals of the IP implementation may vary from one IP subnetwork to another. The differences are due to the usage of different Data Link layers. If the underlying network is ATM, then the IP subnetwork's implementation can be based either on the LAN Emulation or on the Classical IP and ARP over ATM (RFC1577) [8] standards.

   We differentiate between two types of IP multicast routers: a) routers that run an IDMR protocol and b) those that run both an IDMR protocol and the IP-SENATE protocol. We refer to the latter routers as "border routers".
   A border router connects either an ATM based LIS, or a conventional IP subnetwork, to an ATM cloud.

   An important assumption is that only one IDMR protocol is allowed _inside_ (including the border routers) the same logical ATM cloud. Having multiple IDMR protocols in the same logical ATM cloud considerably complicates the task of avoiding the datagram duplication that may occur, as was explained in Section 2. If multiple IDMR protocols need to be deployed in an ATM cloud, then each of the respective multicast routing domains will constitute a distinct logical ATM cloud.

   It should be noted that we use the term border router in a slightly different manner than it is usually used. Namely, if upon receiving an IP multicast datagram via an IDMR protocol, a border router for some reason cannot forward it using a cut-through connection, it may use an IDMR protocol for the next hop forwarding. The border router behaves just as a regular router in this case. For this reason, we will sometimes refer to a border router simply as an "IP-SENATE router", to stress that it may make either IDMR routing decisions or IP-SENATE routing decisions at any given time w.r.t. the same network interface.

   Depending on the direction of the IP multicast traffic, a border router may be called an "ingress router" (if the traffic is directed to the IP subnetwork), or an "egress router" (if the traffic is directed outside the IP subnetwork).

   All IDMR protocols make use of multicast distribution trees over which IP multicast datagrams are propagated. Multicast routers that comprise a specific tree receive datagrams from the upstream routers and forward them to the downstream routers.

   For the sake of simplicity, we assume that each border router has only one ATM interface that participates in the IP-SENATE protocol.
3.2 CONGRESS

   CONGRESS is a native ATM protocol that provides multicast group address (name) resolution and dynamic membership monitoring services to higher-level applications. Multicast group names are application-defined character strings. CONGRESS does not deal with actual data transmission. Address resolution services provided by CONGRESS are used by applications in order to open and maintain native ATM connections for data transmission. Although CONGRESS is much more than just an auxiliary service for IP-SENATE, in this document we concentrate only on those features of CONGRESS that are relevant for IP-SENATE (the interested reader is advised to read the full version of the CONGRESS protocol presented in [14]). From the perspective of CONGRESS, IP-SENATE is one of the applications that utilize its services.

3.2.1 CONGRESS' API

   We refer to a client that uses CONGRESS services by the generic term end-point (in the context of this document, an end-point is always an IP-SENATE router). An end-point may become a group member by joining a group or cease its membership by leaving a group.

   Each join or leave request of an end-point leads to the generation of an Incremental Membership Notification w.r.t. a specific group. Incremental membership notifications reflect only the difference between the new membership and the previously reported one. The full membership of a group may be constructed by resolving a group name once upon joining and then by applying the incremental membership notifications as they arrive. Incremental membership notifications may also be triggered by various asynchronous network events, e.g., host or communication link crash/recovery.

   The CONGRESS services are provided by a library that includes the following basic functions:

   o join(G, id, id_len): Makes the invoking end-point a registered member of a multicast group G. id is the identifier of the new member (a pointer to some application-specific structure). id_len is the size of this application-specific structure.

   o leave(G, id, id_len): Unregisters the invoking end-point from G.

   o resolve(G): A multicast group name G is resolved into a set of ATM end-point identifiers. This set includes all the end-points that joined G and have not disconnected due to a network failure or a host crash.

   o set_flag(G, imn_flag): Enables or disables the reception of the incremental membership notifications w.r.t. G by the invoking end-point.

   In the context of this memo, a multicast group is always a D-group.

3.3 IP-SENATE

   An IP-SENATE extension at a multicast router uses the group membership information that it receives from CONGRESS in order to open ATM connections that bypass the IP routing mechanism. Since the number of multicast routers is considerably lower than the overall number of the ATM-based destinations (both hosts and multicast routers), IP-SENATE reduces the number of potential short-cut connections compared to straightforward host to host cut-through routing. It may still be the case, however, that the number of multicast routers participating in a mesh of ptmpt connections is very large. Using the address resolution services of CONGRESS, IP-SENATE can support both hierarchies of multicast servers and meshes of ptmpt connections, and can switch back and forth between these two layouts as required. This will be described in Subsection 6.2.3.

   In order to avoid stable routing loops, an IP-SENATE router never routes IP multicast datagrams using cut-through connections if they were received from another IP-SENATE router. In addition, an RPF-like mechanism is deployed by IP-SENATE in order to prevent the extensive duplication of IP multicast datagrams.
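   As an illustration of how an end-point such as an IP-SENATE router could keep the full membership of a D-group current, the following minimal sketch resolves the group once and then applies incremental membership notifications as they arrive, as described in Section 3.2.1. The names Notification and MembershipView, and the representation of end-point identifiers as strings, are assumptions made for this example only; they are not part of the CONGRESS library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    """Hypothetical incremental membership notification for one group.

    It carries only the difference from the previously reported membership.
    """
    group: str
    joined: frozenset = frozenset()
    left: frozenset = frozenset()


class MembershipView:
    """Hypothetical local view of one D-group kept by an end-point."""

    def __init__(self, group, resolved):
        self.group = group
        # 'resolved' stands for the result of resolve(G), taken once at join time.
        self.members = set(resolved)

    def apply(self, note):
        """Apply one incremental notification; ignore those for other groups."""
        if note.group != self.group:
            return
        self.members |= note.joined
        self.members -= note.left
```

   For example, a view created from a resolved set {"atm-A", "atm-B"} that then receives a notification reporting that "atm-C" joined and "atm-A" left would hold {"atm-B", "atm-C"}, which is the set of routers to which cut-through connections would be maintained.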
   Datagram duplication of this kind may result from multiple IP-SENATE routers setting up multiple cut-through connections to the same destinations (see Figure 2).

   We assume that IP-SENATE will be used along with conventional IDMR protocols and that not all of the multicast routers will run IP-SENATE within an ATM cloud. As was explained in Section 2, this deployment mode may lead to unnecessary datagram duplication when a datagram is propagated over some multicast distribution tree and, simultaneously, over a cut-through connection. IP-SENATE provides a pruning mechanism that cuts the branches of an IDMR multicast distribution tree so that an IP-SENATE multicast router that receives datagrams via a cut-through connection does not receive duplicates via IDMR.

4. Architecture

4.1 CONGRESS Architecture

   CONGRESS services are provided by a set of servers. There are two kinds of CONGRESS servers: Local Membership Servers (LMSs) and Global Membership Servers (GMSs). An LMS resides at the same host as a multicast router and constitutes this router's interface to the CONGRESS services. GMSs are organized in a hierarchical structure throughout the network, and may run either on dedicated machines or in switches. Logically, an LMS location is independent of the router's host.

   CONGRESS views the network as a hierarchy of domains, where each domain is serviced by a CONGRESS server (the CONGRESS hierarchy can be readily mapped onto a peer group hierarchy provided by the native ATM network layer, PNNI). Note that there is no relationship between a CONGRESS domain and a LIS.

   At the lowest level, a domain consists of a single multicast router. Such a domain is called a "host domain" and is serviced by the LMS of the router's host. The LMS is called a "representative" of a host domain. Higher level domains consist of a set of the lower level domain representatives.
Thus, a single GMS may serve a domain that consists of either several LMSs, or several GMSs that are representatives of their respective lower level domains. A CONGRESS `domain identifier' is the longest common address prefix of the domains it is built of. The domain identifier of a host domain is the ATM address of the host itself.

Figure 3 illustrates the CONGRESS domain layout. Note that the addresses in the figure below bear no relation whatsoever to IP addresses. The IP-like addresses were chosen to illustrate the hierarchy idea in the simplest way.

========================================================================

              GMS ------------------------------- GMS
              1.1                                 1.7
             /   \                               /   \
           /       \                           /       \
         /           \                       /           \
      GMS ----------- GMS                 GMS ----------- GMS
     1.1.1           1.1.2               1.7.4           1.7.2
     /   \           /   \               /   \           /   \
    /     \         /     \             /     \         /     \
  LMS     LMS     LMS     LMS         LMS     LMS     LMS     LMS
1.1.1.2 1.1.1.5 1.1.2.1 1.1.2.3     1.7.4.8 1.7.4.9 1.7.2.1 1.7.2.6

Figure 3.
========================================================================

In order to avoid flooding of the whole network upon every membership change occurring in every D-group, membership notifications pertaining to a D-group are propagated using a distributed spanning tree for this group. This spanning tree is a sub-tree of the CONGRESS servers hierarchy. The CONGRESS servers comprising the sub-tree corresponding to a D-group are the servers that have multicast routers from this group in their domains. Each server in the CONGRESS hierarchy maintains only a part of the spanning tree that consists of its immediate neighbours. The spanning tree is constructed and maintained according to the multicast routers' join/leave requests issued through their LMSs.
In addition, asynchronous network events such as crashes/recoveries of end-points (multicast routers) or CONGRESS servers, and/or failures of communication links, change the topology of the spanning tree (such events are detected by a best-effort fault detector module). Obviously, since CONGRESS operates in an asynchronous environment, the spanning tree of a group can only be a best-effort approximation.

4.2 IP-SENATE Architecture

In Figure 4, the architecture of an IP-SENATE router is presented. An IP-SENATE router is, by definition, a border router that connects a cut-through routing domain to some IDMR routing domain(s). As shown in the figure, IP-SENATE extends a multicast router's software. D-groups of IP-SENATE are managed through CONGRESS. We employ an LMS at each IP-SENATE router in order to provide the interface to the CONGRESS services. In order to make routing decisions and to open cut-through connections, IP-SENATE communicates with the CONGRESS protocol that supplies group address resolution and maintenance services.

==================================================================

[Figure 4 diagram: two IP-SENATE routers shown side by side. In each router, IP, IP-SENATE and IDMR sit above a CGS if and an RFC 1577 + MARS if, which in turn sit above ATM signalling and the physical layer; a separate IDMR stack runs over MAC and its own physical layer.]

CGS if  - CONGRESS interface
MARS if - MARS interface

Figure 4.
==================================================================

In the classical IP multicast model [19], a host does not have to become a registered member of a multicast group in order to send datagrams to this group. A sender does not see any difference between sending a datagram to a multicast IP address and sending it to a unicast IP address. The difference is in the multicast router, which has to participate in some IDMR protocol that builds a multicast propagation tree. In this model, a multicast router usually needs to know only about its immediate neighbours that belong to the propagation tree, and not about the whole tree (an example of an exception to this is MOSPF [11]). IP-SENATE provides the hosts with the same interface for IP multicast service as in the classical model.

A border IP-SENATE router that forwards IP multicast datagrams from a particular source residing in a non-ATM cloud into the ATM cloud, or from an ATM-based host residing in the router's LIS, is termed an injector for the corresponding (source, class D address) pair. (Note that the same router may function as an injector for multiple pairs.) Injectors for any specific class D address must know the identifiers of all other IP-SENATE routers that must receive the traffic targeted to this class D address. For any such pair, shortcut connections should be opened by the corresponding injectors to these IP-SENATE routers. Ideally, only a single injector should be active w.r.t. any source in order to avoid datagram duplication.

The set of IP-SENATE routers' identifiers that has to be maintained per IP class D address includes the identifiers of the IP-SENATE routers that have either

o directly connected hosts that registered (e.g., using IGMP) to receive IP multicast traffic pertaining to a specific class D address, or

o some downstream multicast routers (w.r.t. some source) that have receivers in their LISs.
This set of IP-SENATE routers is termed a D-group. In order to obtain the membership of a D-group, an IP-SENATE router joins this group via CONGRESS. The name associated with this multicast group is just a class D address interpreted as a character string. The details of how D-groups are formed and managed are provided in Subsection 6.2.

It may seem that an IP-SENATE router that does not have any downstream receivers (neither routers nor hosts) w.r.t. any source does not need to be a member of a D-group because it does not need to receive any traffic. Such a router could use the CONGRESS resolve operation each time it needs to learn about the membership of the corresponding D-group (for example, when it needs to send a datagram originated by a sender in its domain). In this scheme, however, CONGRESS would be heavily used and unnecessary overhead would be imposed on the network. In our approach, an IP-SENATE router joins the relevant D-group even if it does not have to receive the multicast traffic. In this case, it will receive incremental membership notifications concerning the D-group. This scheme is less costly. In order to prevent such a router from being added as a leaf to the cut-through connections within the D-group, special sub-identifiers are added to the IP-SENATE router's identifier. This is explained in Subsection 6.1.

In order to overcome the previously mentioned NIC limitations on the number of simultaneously open connections, some IP-SENATE routers may act as multicast servers, serving other IP-SENATE routers that are termed clients. It is important to stress that an IP-SENATE router acting as a server in one D-group may act as a client in another one. Moreover, as will be explained in Subsection 6.2.3, the operational roles of the IP-SENATE routers may dynamically change within the same D-group.
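
The D-group naming and the sender-only sub-identifier described above can be illustrated with a short Python sketch. It is not part of the protocol specification: the names RouterId, d_group_name and cut_through_leaves are hypothetical, and the identifier fields follow the id structure of Subsection 6.1. The sketch assumes that a router opening cut-through connections simply skips members whose mode is `sender-only'.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterId:
    """Hypothetical model of an IP-SENATE router identifier."""
    atm_address: str  # physical (ATM) address of the router
    role: str         # sub-identifier: "client" or "server"
    mode: str         # sub-identifier: "regular" or "sender-only"


def d_group_name(class_d_address: str) -> str:
    """Return the group name: the class D address as a character string."""
    addr = ipaddress.IPv4Address(class_d_address)
    if not addr.is_multicast:
        raise ValueError(f"{class_d_address} is not a class D address")
    return str(addr)


def cut_through_leaves(members, own_id):
    """Members eligible to be leaves: not ourselves, not sender-only."""
    return [m for m in members if m != own_id and m.mode == "regular"]
```

A sender-only member still appears in the membership list (so that it keeps receiving incremental membership notifications), but it is filtered out when the leaves of a cut-through connection are chosen.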
It is important to understand that maintaining a distinct multicast group simultaneously for every possible IP class D address is technically infeasible. Fortunately, there is no real need to do this, because only a part of these addresses is actually in use at any given time. It is also unlikely that the same multicast router would belong to ALL the D-groups. In IP-SENATE's approach, membership of D-groups is formed on demand using CONGRESS, as will be explained in Subsection 6.2.

Another very important property of the IP-SENATE solution is that IP-SENATE can tear down the cut-through connections among the members of a D-group when no multicast data is transmitted over these connections for a sufficiently long period of time. The cut-through connections may be resumed later on demand, using CONGRESS to obtain updated membership information. Note that when an IP-SENATE router terminates the inactive connections within a D-group, this does not affect CONGRESS, which may continue to monitor the membership of the group "in the background". Thus, when the cut-through connections need to be resumed, the membership information is instantly available.

For a variety of reasons that were explained in Section 2, IP-SENATE may have to co-exist with some IDMR protocol in the same ATM cloud. This implies that an IP-SENATE router may receive IP multicast datagrams both via an IDMR protocol and via the cut-through connections on the same network interface. For the correct operation of the IP-SENATE protocol, it is necessary to differentiate between these two cases. One way to do this is to use the protocol field of the IP datagram header. The IP-SENATE protocol should be assigned a special unique number. Each time an IP-SENATE router forwards a datagram over a cut-through connection, the original protocol number is extracted and appended to the end of the datagram.
The IP-SENATE protocol number is inserted into the protocol field, and all other relevant fields of the IP datagram header (total length, header checksum, etc.) should be updated appropriately. Obviously, the reverse operations should be performed by the IP-SENATE routers on the other side of the cut-through connections. A more detailed description of this encapsulation technique is to be provided.

4.3 IMSS Architecture

In Figure 5 the architecture of IMSS is summarized. IMSS does not change a MARS server's functionality. An IP-SENATE router interacts with the MARS server in order to carry out IP multicast transmission within the LIS. An LMS serves as a CONGRESS front-end to the IP-SENATE router. An IP-SENATE router communicates with an LMS in order to handle the membership of the relevant D-groups. An LMS communicates with the GMS as was explained in Section 4.1. In the figure below, an LMS is shown running on the same machine as the IP-SENATE router. This layout is the most reliable, since the LMS monitors the IP-SENATE router's liveness using IPC tools. It is possible, however, to run an LMS on a different machine.

===================================================================

                  --------------------------------
                  |                              |
             |    |                              |
---------    |    | -------------        ------- |         -------
| MARS  | <--|--> | | IP-SENATE | <----> | LMS | | <-----> | GMS |
| Server|    |    | |  Router   |        ------- |         -------
---------    |    | -------------                |
             |    |                              |
            LIS   --------------------------------
          border

Figure 5.
===================================================================

5. CONGRESS Protocol

5.1 Data Structures

In this subsection we summarize the main data structures used by both LMS and GMS types of CONGRESS servers.

Each LMS maintains a Local Membership List. This list contains the D-group addresses that the multicast router local to the LMS has joined through CONGRESS.
In order to avoid constant flooding of the network with excess messages, the GMSs maintain for each D-group G a distributed CONGRESS "group control tree", T(G), that is a sub-tree of the CONGRESS hierarchy tree. Vertices of T(G) are LMSs and GMSs (where LMSs are the leaves of T(G)) that have members of G in their respective domains. All CONGRESS protocol messages concerning G are confined to T(G).

Each GMS maintains only a local part of T(G) for each D-group G in a vector GT(G). GT(G) holds an entry for each neighbour (i.e., parent, sibling or child) of the GMS in T(G). The value of an entry in this vector can be either `resolve' or `all'. In the case of `resolve', only `resolve' requests are forwarded to the corresponding neighbour (because no member of G in its domain has set the on-line flag). A value of `all' means that all CONGRESS protocol messages concerning G should be forwarded to that neighbour. When a GMS first creates a vector for a group, it initializes the entries of all of the GMS's neighbours to `all'.

Each GMS also keeps track of the liveness of its neighbours through updates supplied by its fault-detector module.

5.2 IMSS Router Joining/Leaving a D-group

When an IMSS router wishes to join a D-group G, it issues a `join' request to its LMS, L, using some local IPC mechanism. Next, L informs its GMS about the new member of G by forwarding it a `join' message. The `join' message must be propagated to all members of G that have requested incremental membership notifications. As will be explained later, a multicast router that acts as a client of a multicast server does not require constant reception of incremental membership notifications.

When a `join' message travels in the CONGRESS hierarchy, GMSs can learn about the new member of G and update their GT(G) accordingly in order to ensure the correct operation of future `resolve' operations.
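
The GT(G) vector of Subsection 5.1 can be modelled with a minimal Python sketch. This is only an illustration under stated assumptions: the names Entry, GroupVector and targets are hypothetical, and the sketch captures only the `resolve'/`all' entry values and the liveness filter. It deliberately ignores the direction rules (children versus siblings and parent) that govern forwarding in this section, except that a message is never sent back to the neighbour it came from.

```python
from enum import Enum


class Entry(Enum):
    RESOLVE = "resolve"  # forward only `resolve' requests to this neighbour
    ALL = "all"          # forward every CONGRESS message concerning G


class GroupVector:
    """Hypothetical local part of T(G) kept by one GMS."""

    def __init__(self, neighbours):
        # A newly created vector marks every neighbour as `all'.
        self.entries = {n: Entry.ALL for n in neighbours}
        # Liveness is assumed to be kept current by the fault detector.
        self.alive = set(neighbours)

    def targets(self, message_kind, received_from=None):
        """Return the live neighbours that should get this message."""
        selected = []
        for neighbour, entry in self.entries.items():
            if neighbour == received_from or neighbour not in self.alive:
                continue
            if message_kind == "resolve" or entry is Entry.ALL:
                selected.append(neighbour)
        return selected
```

With this model, a `resolve' request reaches every live neighbour listed in the vector, whereas a `join' or `leave' notification reaches only the neighbours marked `all'.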
If a GMS receives the `join' notification from one of its children C, and GT(G) does not exist (i.e., the new member of G is the first one in the GMS's domain), then the GMS initializes it, and forwards this message to all its live siblings and the parent. If GT(G) exists, the GMS sets GT(G,C) to `all' and forwards the notification to all its live siblings and the parent that have `all' in their corresponding entries of GT(G). As a special case, upon the reception of the `join' notification directly from an LMS, a GMS also forwards it to all of its children (i.e., LMSs) that are alive and have `all' in their corresponding entries of GT(G).

If a `join' notification w.r.t. G is received by a GMS from its parent or a sibling, X, and GT(G) does not exist, the notification is ignored. Otherwise, the entry GT(G,X) is set to `all' and the GMS forwards the notification to all its live children that have `all' in the corresponding entries of GT(G).

Upon the reception of a notification from its GMS about a new router joining G, an LMS delivers a corresponding incremental membership notification to the local IMSS router.

In order to maintain T(G) accurately, GMSs should prune from their GT(G) entries all their neighbours that do not have members of G in their respective domains. This keeps the message overhead linear in the size of G. Immediately after a new router registers in a new D-group G, the local LMS issues a `resolve' request w.r.t. G on its behalf. This request is handled as described in Subsection 5.4. The CONGRESS servers that reply with an empty list of members (routers) are removed from GT(G) by the GMSs throughout the hierarchy. Note that if a parent GMS replies with an empty list to its child (in the CONGRESS hierarchy), the child does not remove the corresponding entry of the parent from its GT(G).

An IMSS router leaves a D-group by issuing a `leave' request to its LMS.
The propagation of the `leave' notification corresponding to this request is exactly the same as that of the `join' notification described above. In addition, if a GMS S discovers that there are no more members of a group G in its domain, it deletes the GT(G) vector from its GT. After that, S informs all its neighbours to which it forwarded the corresponding `leave' notification that they should remove the GT(G,S) entry from their GTs (the set of these neighbours does not include neighbouring LMSs). Note that an LMS knows that a group should be deleted by directly monitoring the membership of its local IMSS router. A GMS knows that a group should be deleted when all of its children have reported that there are no more members of the group G in their domains.

5.3 Reception of Incremental Membership Notifications

Whenever an IMSS router wishes to start or stop receiving incremental membership notifications w.r.t. a D-group G of which it is a member, all the GMSs that have members of G in their domains must know this. This is necessary for accurate propagation of future membership changes of G occurring in their domains. However, a GMS does not need to receive a notification of this request if the requesting router is not the first inside (or outside) the GMS's domain to request incremental membership notifications. The same is true if the router is not the last inside (or outside) the GMS's domain to request that incremental membership notifications be stopped.

An IMSS router may wish to stop the reception of incremental membership notifications if it decides to operate in a `client' role, as will be explained in Section 6.2.3.

Let G' be the set of members of G that requested to receive incremental membership notifications. When an IMSS router R desires to receive incremental membership notifications w.r.t.
a D-group G, it issues a `set_flag' request with the `online_flag' parameter set to TRUE to its LMS. The LMS forwards the `set_flag' request message m to its GMS. Similarly, when R desires to stop receiving incremental membership notifications, it issues a `set_flag' request with the `online_flag' parameter set to FALSE to its LMS.

When a GMS receives m from a neighbour, it sets the entry of this neighbour in GT(G) to `all' if `online_flag' is TRUE, and to `resolve' otherwise. If R is the first member of G' in the GMS's domain or G' has no more members in this domain, m is forwarded to all the siblings that are listed in GT(G), and to the parent. If R is the first member of G' outside the GMS's domain or G' has no more members outside this domain, then m is forwarded to all the children of the GMS that are listed in GT(G). It should be noted that each CONGRESS server always marks its parent as `all'.

5.4 Resolution of D-Group Address

An IMSS router that is a member of a D-group G can resolve G's name into a list of the live registered members by issuing an appropriate `resolve' request to its LMS. The LMS then generates an appropriate message m from it, and forwards m to its GMS.

When a GMS receives m from one of its children, it forwards m to all the live siblings and the parent that are listed in GT(G). As a special case, if m was received from an LMS, the GMS also forwards m to the live LMSs that are listed in GT(G). If m was received by the GMS from either its parent or a sibling, it forwards it to all the live children that are listed in GT(G). The GMS then collects the responses to m until all the neighbours have responded or become disconnected. Then the GMS sends the aggregated response to the neighbour from which the request was received.

When an LMS receives a `resolve' request message m w.r.t. G, it responds with the address of the local router.
If the local router is not a member of G, the LMS responds with an `empty' message.

This way, the `resolve' request is propagated to the relevant LMSs that are leaves of T(G). The responses of these LMSs are aggregated by the GMSs, the intermediate nodes of T(G). The final response will be received from its GMS by the LMS that originated the `resolve' request, and will be delivered to the requesting IMSS router.

5.5 Handling of Failures

The CONGRESS handling of failures focuses on asynchronous host crashes/recoveries and communication link failures/recoveries. In order to handle these failures, each CONGRESS server interacts with a local "fault detector" module that monitors the liveness of this CONGRESS server's neighbours. All the messages that are sent or received by a CONGRESS server pass through the fault detector first. Thus, a message received from a CONGRESS server is interpreted by the fault detector as evidence of the sender's liveness. If a server's neighbour was suspected by the fault detector of this server, and later a message from the presumably failed neighbour is received, the fault detector delivers the notification about the neighbour's liveness before the delivery of its message.

5.5.1 IMSS Router Failure

When an IMSS router fails, its local LMS discovers this using internal IPC mechanisms. This event is handled by the LMS as if the failed router had issued a `leave' request w.r.t. all the D-groups that it was a member of.

5.5.2 Domain Failure

When a CONGRESS server disconnects from the rest of the hierarchy due to a communication link failure or a host crash, this event is interpreted by its neighbours as if all the IMSS routers that reside in its domain have left their respective D-groups. Instead of sending multiple `leave' notifications, each GMS that detects a failure of a neighbouring CONGRESS server generates a `domain leave' notification message that contains the domain identifier of the failed domain and a list of all the D-groups that had members in this domain. The latter is obtained from the local GT table. The `domain leave' notification is propagated and processed throughout the CONGRESS hierarchy in the same way as a `join'/`leave' notification.

An IMSS router outside the failed domain can compute the new membership of a D-group from the `domain leave' notification by discarding all the IMSS routers whose addresses have the failed domain identifier as a prefix. Similarly, an IMSS router within the failed domain discards all the IMSS routers whose address prefix differs from the failed domain identifier.

5.5.3 Domain Recovery

A GMS and its respective domain are considered recovered whenever the GMS re-connects or re-starts execution. For each D-group G in the recovered domain, group membership information must be updated throughout the re-merged T(G). A recovered GMS initializes its data structures from scratch as described in Section 5.1.

When a GMS detects (through the fault detector) a recovery of one of its siblings in the CONGRESS hierarchy, it resolves all the D-groups that are present in its GT by issuing `resolve' requests to its children. The aggregated replies to these `resolve' requests are sent as ordinary `join' notifications to the recovered sibling.

When a GMS detects a recovery of one of its children, it does not perform any actions except marking this server as alive. The necessary actions will be initiated by the recovered child as described below.

When a CONGRESS server detects a recovery of its parent, it generates `resolve' requests to its children w.r.t. all the D-groups known to this CONGRESS server.
The aggregated results are sent to the parent as special `join' notifications. These notifications are forwarded as ordinary `join' notifications, but are also marked with a special flag. When such a message w.r.t. a D-group G is received by some GMS from its sibling, the GMS resolves the membership of G within its domain and sends back the aggregated result as an ordinary `join' notification.

6. IP-SENATE Protocol

In this section we provide a detailed description of the IP-SENATE protocol. For the sake of simplicity we divide the protocol into two parts: a) D-groups' formation and maintenance, b) datagram forwarding decisions.

The IP-SENATE routers are event-driven. This means that at the core of the program there is a main event-dispatching loop, and when a certain event occurs, an appropriate event handler function is invoked. After an event has been processed, control is returned to the main loop. It is important to stress that the event handling is atomic, i.e., a pending event is not handled until the current event has been fully processed.

For the sake of simplicity, we provide all the explanations for a single IP multicast group (i.e., a single class D address).

6.1 Main Data Structures

This subsection depicts the main data structures used by the IP-SENATE routers.

o RAV[G]: Each IP-SENATE router R maintains a Redundancy Avoidance Vector (RAV) for each D-group G with which R is involved. RAV[G] has an entry for each source (originator) of the IP multicast datagrams that were forwarded to R by other IP-SENATE routers (i.e., via short-cut connections). We define the "remoteness" of an injector to be an estimate of the distance of the router from a datagram source. This estimate can be based, for instance, on the TTL value of the packet received from the source (the higher the value of the TTL field, the closer the injector is to the source).
Another method for measuring the remoteness is piggybacking the routing metrics derived from the routing tables of the injector on the packets forwarded over the short-cut connections. It should be noted that using TTL as a measure of remoteness may cause some problems, as will be explained later. We will use a function denoted remoteness(m, R), where m is a datagram received by an IP-SENATE router R, in order to calculate the remoteness of R from the source of m. We use regular mathematical notation to compare two remoteness values. The meaning of remoteness(m, R) < remoteness(m, R') is that R is closer to m's source than R'. The entry RAV[G][S] holds the name of the IP-SENATE router that has the minimal remoteness value w.r.t. the source S and is forwarding datagrams from S to R through short-cut connections. The value of remoteness is kept in the same entry as the router's identifier. The information kept in RAV[G] is temporary and is refreshed regularly, as will be explained later.

o eif: expected network interface variable. This variable is concerned with the RPF techniques that are used by the IDMR protocols in order to break routing loops that may occur in multicast distribution trees. When a multicast IP datagram arrives at a multicast router, the router checks whether it received it from the "expected" network interface. The expected network interface for a multicast datagram originated at some source S is the interface that would be used by this multicast router to forward unicast datagrams to S. If a multicast datagram arrives from an unexpected interface, it is silently discarded, because it was not propagated over the optimal branch. Obviously, for each IP multicast datagram originated at some source S, the value of this variable depends on the IDMR routing tables. It is important to understand that an actual implementation is not required to support eif explicitly.
This variable is used by us in order to simplify the presentation of the algorithms.

o id: identifier of an IP-SENATE router. This is a structure containing the following fields:

   - physical address: an ATM address of the IP-SENATE router;
   - operational role: `client' or `server';
   - mode: `sender-only' or `regular'.

o Membership[G]: group membership table. For each D-group of which an IP-SENATE router is a member, there is a row in this table. Each item in the row is an id structure, as explained above. These memberships are maintained through CONGRESS' incremental membership notifications.

6.2 Maintenance of D-groups

In this subsection we explain in more detail how IP-SENATE routers build and manage D-groups.

6.2.1 Joining D-Groups

The code below deals with the handling of four kinds of events that cause an IP-SENATE router to join a D-group.

C1. explicitly requested join:

   C1.1 An IP-SENATE router R finds out (e.g., through processing of an IGMP "join_group" request or a "MARS_JOIN" request) that there exists some destination within its LIS that needs to receive IP multicast datagrams that are sent to some IP class D address.

   C1.2 An IP-SENATE router R learns via some mechanism (e.g., via some control messages) that there exist downstream multicast routers that depend on it for receiving multicast datagrams for some group.

C2. traffic-driven join:

   C2.1 An IP-SENATE router R receives an IP multicast datagram via some IDMR propagation tree from some neighbouring multicast router.

   C2.2 An IP-SENATE router R receives an IP multicast datagram from some directly attached host.

In cases C2.1 and C2.2 an IP-SENATE router should decide whether it will forward a multicast datagram further. Moreover, if it decides to forward, it should also decide which protocol it will use, i.e., via IP-SENATE cut-through connections or via some IDMR multicast distribution tree.
The IP-SENATE approach is to use cut-through wherever possible. In order to open the cut-through connections to all other relevant IP-SENATE routers, an IP-SENATE router joins an appropriate D-group.

As was explained in Section 4.2, an IP-SENATE router may join a D-group assuming either a server or a client operational role. The operational role of an IP-SENATE router is indicated by its identifier. Further explanations about the operational roles are provided in Subsection 6.2.3.

If an IP-SENATE router joins a D-group as a sender-only member, it schedules a timer-related event handler that will terminate the membership of this router in the D-group if no directly attached host emits multicast datagrams for a sufficiently long time. This timer will be referred to later as a D-timer.

--------------------------------------------------------
if R is a member of G
    /* go to forwarding decisions */
    go to the table of forwarding decisions (Figure 6);
else if case C1.1 or case C1.2 or case C2.1
    decide on the operational role according to local conditions;
    id = {R, role, regular};
    join(G, id, ...);    /* Join via CONGRESS */
else    /* case C2.2 */
    decide on role according to local conditions;
    id = {R, role, sender-only};
    join(G, id, ...);    /* Join via CONGRESS */
    Reset D-timer;
    go to the table of forwarding decisions (Figure 6);
--------------------------------------------------------

Note that if downstream routers participate in a "broadcast & prune"-based IDMR protocol, case C1.2 is problematic, since no explicit information about these routers is available. This is a generic problem that does not pertain to cut-through routing only. The same problem arises when any "broadcast & prune"-based routing protocol works in conjunction with a protocol based on "explicit join" messages. As an example, consider the PIM [4,5] and DVMRP [10] interoperability issues [15].
Another work in progress that attempts to classify the inter-operability issues that arise from the deployment of various IDMR protocols is given in [16].

In the IP-SENATE approach we solve this problem as follows. Since we allow IP-SENATE to coexist with other IDMR protocols (see Section 4.2) on the same NIC, an IP-SENATE router may periodically propagate datagrams using both an IDMR protocol and cut-through connections. This way a multicast propagation tree of an IDMR protocol will be preserved, and all IP-SENATE routers that are also nodes in some IDMR propagation tree (see case C2.1) will join the relevant D-group. As will be explained in the following subsection, an IP-SENATE router leaves this D-group when it receives "prune" messages from all of its neighbouring downstream multicast routers and no directly attached hosts desire to receive multicast traffic for this class D address.

6.2.2 Leaving D-Groups

This subsection depicts the part of an IP-SENATE router's algorithm that deals with leaving D-groups.

Generally, an IP-SENATE router may leave a D-group corresponding to some class D IP address when this router has neither directly attached hosts nor downstream routers that need to receive the IP multicast traffic pertaining to the multicast IP address, or need to send datagrams to it. This happens when

o all directly attached hosts performed IGMP/MARS leave, and

o all neighbouring multicast routers (of attached networks), running some IDMR protocol, have sent `prune' or `leave' messages (depending on the IDMR protocol) for this group, or

o the router is a `sender-only' member, and its D-timer for this group has expired.

6.2.3 Client and Server Operational Roles

An IP-SENATE router locally decides whether it will assume a client or a server role upon joining the relevant D-group.
The decision depends on the number of connections that are already
supported by the IP-SENATE router's NIC and the number of additional
connections that would need to be supported if the router assumed a
specific operational role.

When an IP-SENATE router joins a D-group in the client operational
role, it expects that some server will take care of it. If no server
takes care of this client for a certain period of time, the client
starts using an IDMR protocol for the forwarding of IP multicast
traffic.

The IP-SENATE routers that act as servers learn about the new client
through the CONGRESS incremental membership notifications. Based on
the load of the server's NICs and CPU, physical distance,
administrative policies etc., each server locally decides whether to
take care of the new client. If a server decides to serve a client,
it tries to open a native ATM VC to this client (or to add this
client as a leaf to an already opened ptmpt connection). If the
client has already accepted some other server's connection set-up
request, it may either refuse to accept the new connection, or tear
down the previous connection and switch to the new one. In both cases
this is a local decision of the client.

If a server fails, all its clients should re-join the relevant
D-group. This will once again trigger the procedure described above.

It should be noted that the operational roles are not fixed "once and
for all". Depending on the size of a D-group and the local NIC and
CPU load, an IP-SENATE router may desire to change its operational
role. In order to do this, an IP-SENATE router should simply leave
its D-group and then re-join it with the appropriately updated
identifier that indicates its new operational role (see Section 6.1).

6.2.4 Regular and Sender-Only Modes

An IP-SENATE router may operate either in `regular' or `sender-only'
mode, as was explained in Section 4.2. An IP-SENATE router may wish
to change its mode from sender-only to regular if it learns about
some downstream host or router that needs to receive the multicast
traffic pertaining to a specific class D address. In order to perform
this transition, an IP-SENATE router should leave the relevant
D-group and re-join it with the updated identifier indicating that it
is acting in the regular mode.

Note that there is no need for the transition in the opposite
direction, i.e., from a regular to a sender-only mode. Indeed, if an
IP-SENATE router does not have any downstream hosts or routers that
desire to receive multicast traffic, this IP-SENATE router will
simply leave the relevant D-group (see Subsection 6.2.2). If there
exist some downstream senders, this IP-SENATE router will re-join the
group on demand later, as was explained in Subsection 6.2.1.

6.3 Forwarding Decisions

This subsection depicts the forwarding algorithm executed by the
IP-SENATE routers. Due to the assumed heterogeneous network model,
there are multiple cases that should be handled carefully.

By using CONGRESS membership services and the
encapsulation/decapsulation technique described in Section 4.2, an
IP-SENATE router can differentiate between the multicast traffic that
it receives from other IP-SENATE routers via the cut-through
connections and traffic received via an IDMR propagation tree.

An IP-SENATE router decides how to forward an incoming multicast
packet according to the identity and operational role of the sending
router and according to its own operational role. For each possible
pair of sender and receiver, the table in Figure 6 provides a pointer
to the subsection that describes the relevant part of the
pseudo-code.
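As a reading aid, the dispatch on the (receiver, sender) pair can be
sketched as a lookup table in Python. This is only an illustrative
sketch: the names Role, Sender, DISPATCH and forwarding_rule are
hypothetical and are not defined by IP-SENATE; the mapped values are
the entries that Figure 6 assigns to each pair.

```python
from enum import Enum


class Role(Enum):
    """Operational role of the receiving IP-SENATE router (hypothetical names)."""
    CLIENT = "client"
    SERVER = "server"


class Sender(Enum):
    """Kind of sender the datagram was received from (hypothetical names)."""
    IDMR_OR_HOST = "idmr-or-host"  # multicast router via IDMR, or a directly attached host
    CLIENT = "client"              # IP-SENATE client
    SERVER = "server"              # IP-SENATE server


# (receiver role, sender kind) -> entry of Figure 6.
# A string such as "6.3.1" names the subsection holding the pseudo-code.
DISPATCH = {
    (Role.CLIENT, Sender.IDMR_OR_HOST): "6.3.3",
    (Role.CLIENT, Sender.CLIENT): None,  # marked "X" in Figure 6
    (Role.CLIENT, Sender.SERVER): "forward m using IDMR protocol",
    (Role.SERVER, Sender.IDMR_OR_HOST): "6.3.4",
    (Role.SERVER, Sender.CLIENT): "6.3.1",
    (Role.SERVER, Sender.SERVER): "6.3.2",
}


def forwarding_rule(receiver: Role, sender: Sender):
    """Return the Figure 6 entry for this (receiver, sender) pair."""
    return DISPATCH[(receiver, sender)]
```

For instance, forwarding_rule(Role.SERVER, Sender.CLIENT) yields
"6.3.1", the subsection that handles a server receiving a datagram
from one of its clients.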
The short parts of the pseudo-code are shown directly in the table.

============================================================
----------------------------------------------------------
|  \   Sender | Multicast router | IP-SENATE | IP-SENATE |
|   \         | (via IDMR        | client    | server    |
|    \        | protocol) or a   |           |           |
|     \       | directly         |           |           |
| Receiver \  | attached host    |           |           |
|--------------------------------------------------------|
| IP-SENATE   | 6.3.3            | X         | Forward m |
| client      |                  |           | using IDMR|
|             |                  |           | protocol. |
|--------------------------------------------------------|
| IP-SENATE   | 6.3.4            | 6.3.1     | 6.3.2     |
| server      |                  |           |           |
----------------------------------------------------------

                        Figure 6.
============================================================

For the sake of simplicity and shorter representation, we assume that
the involved IP-SENATE routers have already joined the relevant
D-groups, according to the algorithm explained in Subsection 6.2.1.

In all of the following cases we depict the steps taken by an
IP-SENATE router R upon reception of an IP multicast datagram m
originated at some source S and targeted to some multicast group G.

6.3.1 A Server Receives a Datagram from a Client

An IP-SENATE router acting as a server is responsible for the
propagation of the multicast traffic that it receives from its
clients to all the relevant multicast routers and directly attached
hosts.

In order to avoid undesired duplication of IP multicast datagrams, an
IP-SENATE router should check whether some other IP-SENATE router(s)
might propagate the IP multicast datagrams originating at the same
source S. This may happen when a multicast distribution tree of some
IDMR protocol contains more than one egress router connecting the
branches of the propagation tree to the ATM cloud. Figure 2 provides
a graphical representation of this scenario. In such a case, it is
obviously preferable that only one of the egress routers, the one
closest to the source, transmits the datagrams.

In cases such as the one described above, IP-SENATE routers belonging
to the same D-group can deterministically choose a router that will
perform forwarding of IP multicast datagrams by using the CONGRESS
membership services. This is done by inspecting RAV[G][S]. Initially
RAV[G][S] is set to this server's identifier, and the remoteness
value is derived from either the server's routing table or from the
TTL field of the datagram seen by this router. With the passage of
time, however, the server may find out that other servers are
forwarding to it datagrams (over shortcut connections) originated at
the same source, and that these routers are located closer to the
source than itself (as seen from the piggybacked remoteness value).
In this case RAV[G][S] is set to the name and remoteness of the
router that is closest to the source S. This server (router) will be
the designated injector for the datagrams originated at S and
targeted to G.

Obviously, when a router receives a datagram from source S over a
non-shortcut connection, it may update its RAV[G][S] if its own
remoteness value is better than that of the current injector. If this
is the case, the router becomes the new designated injector.

Since we assume an asynchronous network model, it is possible that at
some point multiple IP-SENATE routers belonging to the same D-group
will consider themselves as the ones that must forward datagrams. As
time passes, however, the IP-SENATE routers will learn about this
redundancy, because it will be reflected by RAV[G]. In the following
subsection more details about RAV maintenance are provided.

In Section 6.1, two examples of measuring remoteness were provided.
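The injector selection described above can be illustrated with a
small Python sketch. It is not part of the protocol specification:
the class Rav and its method names are hypothetical, remoteness is
assumed to be a plain number for which a lower value means "closer to
the source", and the comparisons follow the pseudo-code of
Subsections 6.3.2 and 6.3.4.

```python
class Rav:
    """Sketch of RAV[G] for one group G: source -> (router id, remoteness)."""

    def __init__(self, self_id):
        self.self_id = self_id  # identifier of this router R
        self.rows = {}          # one row per source S

    def on_idmr_datagram(self, source, own_remoteness):
        """A datagram from `source` arrived over a non-shortcut connection.

        Returns True if this router now regards itself as the designated
        injector for (S, G), i.e. it should forward over the shortcuts.
        """
        row = self.rows.get(source)
        if row is None or own_remoteness <= row[1]:
            self.rows[source] = (self.self_id, own_remoteness)
            return True
        return False

    def on_shortcut_datagram(self, source, sender_id, sender_remoteness):
        """A datagram from `source` arrived from another server over a shortcut.

        The piggybacked remoteness replaces the row if it is strictly lower.
        """
        row = self.rows.get(source)
        if row is None or sender_remoteness < row[1]:
            self.rows[source] = (sender_id, sender_remoteness)

    def injector(self, source):
        """Identifier of the router currently recorded as closest to `source`."""
        row = self.rows.get(source)
        return None if row is None else row[0]
```

In this sketch a router that first sees traffic from S records itself
as the injector, yields once another server reports a lower
remoteness over a shortcut connection, and takes over again only if
its own remoteness later becomes at least as good.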
It should be noted that TTL is not always a reliable measure, since a
source may change its value arbitrarily. In this case, due to the
asynchronous nature of the network, oscillations between multiple
injectors may occur. Since source-initiated changes of TTL may occur
considerably more often than changes of the network topology, these
oscillations may present a serious problem.

The information kept in RAV[G] is temporary. Each time an IP-SENATE
router enters information into a row S of RAV[G], it resets a timer
associated with the source S. We refer to this timer as the S-timer.
If no traffic from S is encountered during the time window defined by
the S-timer, the IP-SENATE router discards the row in RAV[G]
associated with S.

When RAV[G] becomes empty, the IP-SENATE router starts another timer,
called the G-timer. If no multicast traffic is encountered within G
during the G-timer, an IP-SENATE router tears down the cut-through
connections within the corresponding D-group. These cut-through
connections may be resumed on demand later.

--------------------------------------------------------
if exists an entry RAV[G][m.S]
    if remoteness(m, R) <= RAV[G][m.S]
        /* The closest router to S is responsible for the
         * cut-through propagation, so that R is the injector */
        update RAV[G][m.S] to hold R and remoteness(m, R);
        update eif to be the correct one;
        forward m using IDMR protocol;
        forward m to all other servers that are members of G
            that act in regular mode (directly);
        forward m to all clients that are members of G
            that act in regular mode excluding the sender
            (directly);
    else
        /* m will be sent by the router nearest to source. */
        discard m;
else
    /* The source of the datagram is not in the RAV[G] yet */
    Create a new entry for m.S in RAV[G];
    update RAV[G][m.S] to hold R and remoteness(m, R);
    forward m using IDMR protocol;
    forward m to all other servers that are members of G
        that act in regular mode (directly);
    forward m to all clients that are members of G
        that act in regular mode excluding the sender
        (directly);
--------------------------------------------------------

6.3.2 A Server R Receives a Datagram from another Server R'

If a server receives multicast traffic from another server belonging
to the same D-group, the sending server believes that it is the one
closest to the source (i.e. it receives packets from the source with
a lower remoteness value than all the other IP-SENATE routers).
Otherwise it would not have been sending the datagrams.

If the entry for the sending server in RAV[G][S] is empty (e.g.
because RAV was refreshed), the receiving server should insert the
remoteness value carried by the received packet of the sending server
into the corresponding entry in RAV[G][S]. Note that this operation
may change the receiving IP-SENATE router's local notion of which
IP-SENATE router has the lowest remoteness value.

An IP-SENATE router acting as a server is responsible for the
propagation of the IP multicast traffic to all its clients belonging
to the same D-group and to all the relevant IDMR interfaces. The
latter case should be treated especially carefully, because IDMR
routers use RPF mechanisms in order to break stable routing loops.
When a multicast IP datagram arrives at an IDMR router, the router
checks whether it received it from the "expected" network interface.
An IDMR router expects to receive multicast datagrams originated at
some source S from the same network interface that this router would
use in order to forward unicast datagrams to S.
If a multicast datagram arrives from an unexpected interface, it is
silently discarded, because it was not propagated over the optimal
branch of the IDMR multicast propagation tree.

As seen from the code below, an IP-SENATE router updates the variable
eif to be as expected by the IDMR interface. Otherwise, the RPF
mechanism might erroneously discard datagrams that should not be
discarded.

Obviously, there is no need to forward the IP multicast datagram that
came from an IP-SENATE router acting as a server to other servers
belonging to the same D-group. These servers are supposed to be the
leaves of the same ptmpt connection as the receiving server.

--------------------------------------------------------
update eif to be the expected interface;
forward m using IDMR protocol;
forward m to all clients that are members of G
    that act in regular mode;

/* There is no need to forward to other servers, since
 * they are supposed to be handled by the same IP-SENATE
 * server that sent m. */

if the entry RAV[G][m.S] does not exist
    Create a new entry for m.S in RAV[G];
    /* Since this is the first datagram originated at S that this
     * router (R) sees, it is assumed that the forwarder is the
     * designated injector for (S,G) */
    update RAV[G][m.S] to hold the identifier of the
        forwarder R' and remoteness(m, R');
    return;

if (remoteness(m, R') < RAV[G][m.S])
    update RAV[G][m.S] to hold the identifier of the
        datagram forwarder R' and remoteness(m, R');
--------------------------------------------------------

6.3.3 A Client Receives a Datagram from an IDMR Interface

When an IP-SENATE router acting as a client receives an IP multicast
datagram from an IDMR interface, it should forward it to all other
involved IDMR interfaces. In order to propagate the datagram to all
the relevant IP-SENATE routers using short-cut, a client should
forward the datagram to its server.
The latter will forward it further according to the algorithm
described in Subsection 6.3.1. As will be explained in Subsection
6.3.5, IP-SENATE routers that also participate in some "broadcast &
prune"-based IDMR protocol prune the redundant branches of an IDMR
multicast propagation tree.

--------------------------------------------------------
forward m using IDMR protocol;
forward m to Multicast_Server over a point-to-point SVC;
--------------------------------------------------------

6.3.4 A Server Receives a Datagram from an IDMR Interface

If an IP-SENATE router acting as a server receives an IP multicast
datagram via an IDMR multicast propagation tree, it is responsible
for forwarding it to all the relevant non-IP-SENATE multicast routers
and to the relevant clients. If this IP-SENATE router is the
designated injector for (S,G), it should also forward the multicast
datagram to all the IP-SENATE routers acting as servers (over
short-cut connections).

--------------------------------------------------------
if exists an entry RAV[G][m.S]
    if remoteness(m, R) <= RAV[G][m.S]
        /* The closest router to S is responsible for the
         * cut-through propagation, so that R is the injector */
        update RAV[G][m.S] to hold R and remoteness(m, R);
        forward m using IDMR protocol;
        forward m to all other servers that are members of G
            that act in regular mode (directly);
        forward m to all clients that are members of G
            that act in regular mode (directly);
    else
        /* m was received or will be received from the
         * IP-SENATE router nearest to source. */
        discard m;
else
    Create a new entry for m.S in RAV[G];
    /* Since this is the first datagram originated at S that
     * this router (R) sees, it is assumed that R is the
     * designated injector for (S,G). */
    update RAV[G][m.S] to hold R and remoteness(m, R);
    forward m using IDMR protocol;
    forward m to all other servers that are members of G
        that act in regular mode (directly);
    forward m to all clients that are members of G
        that act in regular mode (directly);
--------------------------------------------------------

6.3.5 Pruning Mechanism

As mentioned earlier, IP-SENATE uses an IDMR mechanism along with
short-cutting. An IP-SENATE router that must forward multicast
traffic of a group G to directly attached hosts or to multicast
routers joins the relevant D-group upon reception of datagrams (or an
explicit join) from an IDMR interface. Consequently, shortcut
connections will be formed between the members of the D-group. At
this point the router may receive traffic both from shortcut
connections and from the existing IDMR interface. In order to avoid
this redundancy, the router prunes the upstream IDMR interface,
hereafter accepting upstream traffic only from the shortcut
connection.

==================================================================
[Diagram. A source S reaches the IP-SENATE router R' in two ways:
on the left, over the cut-through connection from S to R' (the
shortcut link, drawn with "x"); on the right, over a branch of the
IDMR propagation tree (drawn with "\") that runs from S through R1
and further routers to R, and from R to R'. Here, R' should send
prune to R. Below R' lies a DVMRP routing domain containing R2 and
R''; beneath R'' are directly attached hosts H that want to receive
datagrams targeted to G.]

                          Figure 7.
==================================================================

Figure 7 depicts a scenario in which a downstream multicast router
requests a prune in spite of having downstream routers and directly
attached hosts that are dependent on it.
Since IP-SENATE router R' receives the IP multicast traffic targeted
to a group G both via a cut-through connection and an IDMR
propagation tree, R' sends a prune message to R. Note, however, that
the rest of the IDMR multicast propagation tree located beneath the
multicast router R' continues to function as usual.

If all the downstream IDMR interfaces of an IP-SENATE router R have
been pruned and the router has no directly attached hosts that are
registered in G or are senders in G (no D-timer is set), R leaves the
relevant D-group through CONGRESS.

It should be noted that if the IDMR protocol that runs inside the ATM
cloud is based on the broadcast-and-prune model, e.g. DVMRP, then an
extensive signalling overhead may be introduced by shortcutting. This
is because a multicast propagation tree of DVMRP is reconstructed
periodically by flooding multicast traffic to all the routers
residing inside the ATM cloud. At the beginning, all routers will
join the relevant D-group in order to make themselves available for
shortcut connections. Later a considerable part of these routers will
leave this D-group, since their respective downstream routers will
send them prune messages. This way shortcut connections may be opened
to routers that, in fact, do not need to receive multicast traffic at
all. These connections will later be torn down. Obviously, it is
possible to introduce some optimizations that try to minimize the
signalling overhead but, generally speaking, we believe that
broadcast-and-prune IDMR protocols do not go well with shortcutting.

In some cases, it may happen that a short-cut connection is
mistakenly established from a downstream multicast router to the
upstream multicast routers. Such a short-cut connection would
contradict the orientation of the IDMR propagation tree.
If the upstream router blindly pruned its upstream IDMR branches just
because it has a short-cut connection, it might destroy the
connectivity of the IDMR propagation tree. In order to avoid such
situations, an IP-SENATE router requests pruning of its upstream IDMR
interfaces only if the remoteness value of a datagram received over
the short-cut connection is lower than that of the datagram received
over an IDMR tree. As was explained in Sections 6.3.1 and 6.3.4, a
downstream router's cut-through connection would be suppressed by
some other IP-SENATE router that is located closer to the source in
terms of remoteness.

7. Fault Tolerance

Currently each GMS is a single point of failure in its domain, i.e.,
when a GMS fails, its domain is disconnected from the rest of the
CONGRESS hierarchy. Note that this situation resembles a single DNS
failure in its domain. The use of a distributed GMS server comprised
of a primary and backup servers acting as a single logical entity can
make the CONGRESS protocol more robust. Another way to increase the
robustness is to elect a new GMS from the lower level in the CONGRESS
hierarchy to take over the failed server's responsibilities. This
subject is for further study.

8. Security Considerations

Security issues are not discussed in this document.

9. Message Formats

9.1 CONGRESS Messages

To be supplied.

9.2 IP-SENATE Messages

To be supplied.

10. References

[1]  Fenner, W., "Internet Group Management Protocol, Version 2",
     Internet Draft, September 1995.

[2]  Armitage, G., "Support for Multicast over UNI 3.0/3.1 based ATM
     Networks", RFC 2022, November 1996.

[3]  Armitage, G., "VENUS - Very Extensive Non-Unicast Service",
     Internet Draft, draft-armitage-ion-venus-03.txt, June 1997.

[4]  Estrin, D., et al., "Protocol Independent Multicast Sparse Mode
     (PIM-SM): Protocol Specification", Internet Draft,
     draft-ietf-idmr-PIM-SM-spec-09.ps, October 1996.

[5]  Estrin, D., et al., "Protocol Independent Multicast Dense Mode
     (PIM-DM): Protocol Specification", Internet Draft,
     draft-ietf-idmr-PIM-DM-spec-04.ps, September 1996.

[6]  ATM Forum, "ATM User-Network Interface Specification Version
     3.1", 1994.

[7]  ATM Forum, "ATM User-Network Interface Specification Version
     4.0", 1996.

[8]  Laubach, M., "Classical IP and ARP over ATM", RFC 1577,
     Hewlett-Packard Laboratories, December 1993.

[9]  Ballardie, A., "Core Based Tree (CBT) Multicast Architecture",
     Internet Draft, draft-ietf-idmr-cbt-spec-10.txt, 1997.

[10] Pusateri, T., "Distance vector multicast routing protocol",
     Internet Draft, draft-ietf-idmr-dvmrp-v3-03.[txt,ps],
     September 1996.

[11] Moy, J., "Multicast extensions to OSPF", RFC 1584, July 1993.

[12] Smirnov, M., "EARTH - EAsy IP multicast Routing THrough ATM
     clouds", Internet Draft, draft-smirnov-ion-earth-02.txt, 1997.

[13] Rekhter, Y. and Kandlur, D., ""Local/Remote" Forwarding Decision
     in Switched Data Link Subnetworks", RFC 1937.

[14] Anker, T., Breitgand, D., Dolev, D. and Levy, Z., "CONGRESS:
     CONnection-oriented Group-address RESolution Service", Technical
     Report CS96-23, The Hebrew University, Jerusalem, Israel,
     December 1996.
     http://www.cs.huji.ac.il/labs/transis/transis.html

[15] Estrin, D., Helmy, A. and Thaler, D., "PIM Multicast Border
     Router (PMBR) specification for connecting PIM-SM domains to a
     DVMRP Backbone", Internet Draft,
     draft-ietf-mboned-pmbr-spec-00.txt, February 1997.

[16] Thaler, D., "Interoperability Rules for Multicast Routing
     Protocols", Internet Draft,
     draft-ietf-mboned-imrp-some-issues-02.txt, May 1996.

[17] Armitage, G., "A Distributed MARS Protocol", Internet Draft,
     draft-armitage-ion-distmars-spec-00.txt, January 1997.

[18] Armitage, G., "Issues affecting MARS Cluster Size", RFC 2121,
     March 1997.

[19] Deering, S., "Host Extensions for IP Multicasting", RFC 1112,
     August 1989.

[20] Semeria, C., "Introduction to IP Multicast Routing", Internet
     Draft, draft-ietf-mboned-intro-multicast-00.txt, January 1997.

[21] Luciani, J., et al., "NBMA Next Hop Resolution Protocol (NHRP)",
     Internet Draft, draft-ietf-rolc-nhrp-11.txt, February 1997.

11. Acknowledgments

We would like to thank Prof. Israel Cidon from the Technion
Institute, Israel. We also thank Yoav Kluger and Benny Rodrig from
Madge Networks (Israel), for their helpful comments and their
precious time.

12. List of Abbreviations

o  IMSS      - IP Multicast Shortcut Service

o  CONGRESS  - CONnection-oriented Group address RESolution Service

o  IP-SENATE - IP multicast SErvice for Non-broadcast Access
               Networking TEchnology

o  LMS       - Local Membership Server

o  GMS       - Global Membership Server

o  MCS       - Multicast Server

Authors' Addresses

Tal Anker
The Hebrew University of Jerusalem
Computer Science Dept
Givat-Ram, Jerusalem
Israel, 91904
Phone: (972) 6585706
EMail: anker@cs.huji.ac.il

David Breitgand
The Hebrew University of Jerusalem
Computer Science Dept
Givat-Ram, Jerusalem
Israel, 91904
Phone: (972) 6585706
EMail: davb@cs.huji.ac.il

Danny Dolev
The Hebrew University of Jerusalem
Computer Science Dept
Givat-Ram, Jerusalem
Israel, 91904
Phone: (972) 6584116
EMail: dolev@cs.huji.ac.il

Zohar Levy
The Hebrew University of Jerusalem
Computer Science Dept
Givat-Ram, Jerusalem
Israel, 91904
Phone: (972) 6585706
EMail: zohar@cs.huji.ac.il