Network Working Group Yong Xie Internet-Draft Myung J. Lee Tarek N. Saadawi The City College of New York August 20, 1996 Sink-Assisted Routing Protocol (SARP) For General IP Multicasting Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.'' To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast). Abstract This Internet-Draft proposes a routing protocol for supporting real-time multicast service over the Internet. The work is motivated by the fact that the existing multicast routing protocols are designed under the assumption of symmetric paths, which is not necessarily true especially in the real-time applications. The essence of the proposed protocol is using the inter-subnet Routing Guide Message (RGM) flows generated by sink-subnets to form multicast spanning trees. The underlying algorithms enable source and intermediate nodes to construct minimal end-to-end delay paths to sink-subnets based on the path and the session information conveyed by incident RGM-flows. Expires February 20, 1997 [Page 1] Internet Draft August 20, 1996 1. Introduction The primary function of a multicast routing protocol is to select a subset of the Internet topology to form the group-specific spanning trees over which multicast datagrams are delivered from their sources to their sinks. Practically, a multipoint-to-multipoint tree is constructed by multiple point-to-multipoint subtrees. Multicast routers need to look up the actual source or the sinks of a particular datagram in order to make the forwarding decisions. There have been a number of multicast routing protocols and associated algorithms proposed by the Internet community. A typical example is the Distance Vector Multicast Routing Protocol (DVMRP) [4] which employs the Reverse Path Multicast (RPM) algorithm in the latest mrouted version 3.5 and some vender implementations. As well surveyed by [3], RPM is an enhancement to the Reverse Path Broadcast (RPB) and the Truncated Reverse Path Broadcasting (TRPB). RPM defines a way to form the source-rooted spanning trees. The distance back to the source of datagrams is the main factor in determining the forwarding paths. It assumes the symmetricity of paths which becomes a fundamental limitation of entire "reverse optimal path" routing algorithm family, including RPB, TRPB, and RPM. In fact, a shortest path does not necessarily guarantee a shortest reverse path because of the possibility that a full-duplex link could have substantially imbalanced loads on each direction. Furthermore, using fixed "metrics" (e.g., hop count) to compare alternative routes is inappropriate for the situations where routes need to be chosen based on real-time parameters (e.g., real-time delay)(see [2]). These two problems, symmetric path assumption and fixed metric, poses significant challenge for any IP multicast routing protocols should they support real-time applications. The proposed Sink-Assisted Routing Protocol (SARP) is an attempt to address aforementioned two issues. That is, SARP focuses on computing the minimal-delay path from sources to sinks using a real-time delay metric. In contrast, DVMRP is derived from the Routing Information Protocol (RIP) and concerned with computing the shortest path, in terms of fixed metrics, back to the source of a multicast datagram. In short, SARP is to form sink-rooted rather than source-rooted spanning trees using the inter-gateway primitive called the Routing Guide Message (RGM). In order to search for potential multicast source groups, it is assumed that the sink-subnet periodically generates and multicasts RGMs to every possible subnets within a limited scope of the Internet. A source-subnet starts multicasting datagrams out only after intercepting the corresponding RGM flow(s). The information conveyed by a RGM includes a list of the requested multicast groups on behalf of the entire sink-subnet, and the path status as well. Expires February 20, 1997 [Page 2] Internet Draft August 20, 1996 RGMs are to be forwarded using a typical Reverse Path Broadcast algorithm, except that the optimal path is determined based on the measure of Reverse-Relative-Delay (RRD). In other words, RGMs are routed along the paths whose reverse paths are optimal in term of real-time delay back to the sink-subnets. Currently, multicast service is considered connectionless, due to the genetic property of IP. Conventional higher layer handshake-type setup of point-to-point connection is often too expensive for most real-time and multicast applications. Also, flow control and error correction could be really complicated when maintaining a dynamic multicast session group. Nevertheless, there is an emerging demand for a quality assured IP multicast service, which requires certain degree of connectivity between sources and sinks. Towards that end, SARP is also intended to introduce the concept of "loose-IP-connectivity". That is, the logical connectivity between two IP end-points can be loosely defined by the continuity of end-to-end control-message-flow, while using the basic IP support. In SARP, multicast spanning tree is completely specified by the incidence relations between the RGM-flows and the source nodes. The sink-nodes send periodic RGMs as requests, wait for data arrival, and regard the data arrival as receiving the natural acknowledgement, while a source-node is aware of the existence of the sink-nodes and maintains transmission only if RGM-flows are engaged, giving a sense of data connection. We strongly believe that, as far as multicasting and routing are part of the IP functions, it is necessary to extend IP to provide an unacknowledged connection-oriented service. First, such service is beneficial to IP multicasting in which the point-to-point addressing is impossible and the data connection is more important than the logical connection. Second, it is capable of conjoining the routing and the resource reservation, which are naturally two inseparable issues. Throughout this draft, the term "IP-entity" is used generally to cover a single network or a subnet, in which all hosts share the same subnet IP address. It has at least one full-duplex link to its neighbor. The access point of an IP-entity to the link is called the port. The abstraction of the Internet can be considered as a planar graph with a set of IP-entities as the nodes, and pairs of the directed edges connecting adjacent nodes. Thus, we shall frequently call the IP entity as a node, for example, the source-node, the sink-node, and the intermediate-node. 2. Model of Operation Multicast routing within an entity is transparent to IP. That means, as long as a datagram remains on a subnet, it can reach any sink Expires February 20, 1997 [Page 3] Internet Draft August 20, 1996 application within the subnet using the technology that is specific to that subnetwork. For example, on the Ethernet, multicasting is handled by mapping a class-D address into an Ethernet address. On connection-oriented networks, such as an ATM network, it can be done by using some signaling mechanism to establish the host-to-host or host-to-gateway Virtual Circuits (VCs). As with the task of IP multicast routing being primarily a matter of finding gateways among networks, the proposed protocol is concerned only with routing multicast datagrams across an internetwork, along the minimal-delay paths from source-subnets to sink-subnets. SARP depends on the participations of the sinks to establish the group-specific multicast spanning trees. According the standardized host implementation for supporting IP multicasting [1], incoming multicast IP datagrams are received by upper-layer protocol modules using the same "Receive IP" operation as for normal unicast datagram, an application process need to explicitly ask its IP module to join the group through a service interface. In addition, SARP assumes that the process claim itself as the "sink" by opening the file descriptor in a read or read/write mode. The host IP module must be extended to inform the multicast session information to the subnet- router, so that the routing protocol module is able to summarize the demands of sub-hosts into the per subnet sink-group-table. As long as the table is not empty, RGMs are generated. 3. Algorithms for SARP 3.1. Measuring the Reverse-Relative-Delay of a Path Considering the real-time properties of most multicast applications, it is desirable that the distance between two nodes is quantified in term of the real-time delay. The absolute delay, however, is costly to measure since network clocks usually are not accurately synchronized enough for real-time interactive applications. Moreover, the current network synchronization protocols (e.g., [5]) assume symmetric network paths and use the round-trip-delay as an approximation. We call the delay measured by referencing asynchronous clocks the "relative-delay". It is the real-time delay measure with a clock-offset error in it. In a best-effort service packet-switching internetwork, "optimal" is quite a relative term. When a source/intermediate-node intercepts only one RGM-flow from a given sink-node, the route of the incoming RGM- flow is certainly the only and therefore the optimal route to forward datagrams. In case when multiple RGM-flows incident with a source- node are rooted at the same sink-node but via different routes, the decision of the optimal route back to the sink-node can be reached by simply comparing relative-delays of distinct paths. This is possible Expires February 20, 1997 [Page 4] Internet Draft August 20, 1996 because that incurred relative-delays result in the same amount of the clock-offset for any path between a pair of nodes. To measure the relative-delay of a path, each node in the Internet that participates in the routing protocol is assumed to send periodic delay-probe-packets to every possible adjacent node. Thus, the receiving node is able to compute the directional One-Hop-Relative- Delays (OHRDs) from (not to) its neighbors. It is required that each node maintains an OHRD cache which has entries for each neighbor. No further routine information exchange or propagation among routing entities is required by the protocol. This ensures the algorithm against from slow-convergence. RGM is designed to convey the "run- time" relative-delay information across the internetwork. Let d(i,i+1) represent the absolute-delay from node i to adjacent node i+1. The OHRD d'(i,i+1) can be measured by sending a probe IP datagram from i to i+1, whose departure time t(i) and arrival time t(i+1) are time-stamped according to the clocks at i and i+1, respectively. d'(i,i+1) actually consists of two components: d'(i,i+1) = d(i,i+1) + o(i,i+1) where o(i,i+1) denotes the clock-offset between i and i+1. Obviously, a single trial of the delay measurement would be insufficient to represent the statistics of the path. It is assumed that the OHRD database maintained by each node is updated with moving-averages of instantaneous measures. The measure interval and the averaging window size are critical to both sensibility and stability of the routing protocol. The basic criterion is that, OHRD database is expected to reflect link statistics, while minimal computation is most desirable. Further studies are required for determination of those parameters. From now on, we shall refer d'(i,i+1) as the measure of the mean OHRD from node i to its neighbor i+1. It is recorded by node i+1, and available for reference at any time instant. When a RGM traverses along a multi-hop path from node j to node i, the relative delay of the reverse path, denoted as D'(i,j), can be calculated by adding the OHRD of every hop along the path. Among other information, each RGM-datagram carries a Reverse-Relative-Delay (RRD) parameter, which is the sum of the OHRDs added every time before the RGM crossing the hop. Thus, the receiving node is able to exam the reverse path by simply reading the value of the RRD field: D'(i,j) = d'(j-1,j) + d'(j-2,j-1) + ... + d'(i-1,i) = d(j-1,j) + d(j-2,j-1)+ ...+ d(i-1,i) + o(i,j) where o(i,j) is the clock-offset between node i and node j. It is Expires February 20, 1997 [Page 5] Internet Draft August 20, 1996 equal to the sum of the clock-offset components of all OHRDs along the path. Let c(i) represent the value of node-i's clock at a given time instance, o(i,j) is always independent of the intermediate clocks, and the path as well: o(i,j) = c(i) - c(j) = c(i) - c(i+1) + c(i+1) - c(i+2) + ... + c(j-1) - c(j) = o(i,i+1) + o(i+1, i+2) + ... + o(j-1, j) As shown in the example topology below, two RGM-flows originating at node-A destinating at node-D via B and C. If D'(D,B,A) < D'(D,C,A), the port to B is considered by node-D as the optimal route to forward datagrams of certain multicast group to sink-node A, vice versa. B---------D D'(D,B,A) = d'(B,A) + d'(D,B) / / = d(B,A) + o(B,A) + d(D,B) + o(D,B) / / = D(D,B,A) + o(D,A) A---------C D'(D,C,A) = D(D,C,A) + o(D,A) 3.2. Forming the Sink-Rooted Spanning Trees As mentioned in the foregoing, RGMs are used to search for multicast source-node groups on behalf of multicast sink-node groups, and to convey the information about the end-to-end RRD measures. The IP header of a RGM datagram contains the sink-subnet-id as its Source Address, and a well-known multicast address in the Destination Address field particularly assigned to the routing protocol. The initial Time-To-Live (TTL) parameter is set to a number greater than 1 and less than 255 to limit the sink-rooted tree to span within a given scope. RGMs also carry other information which will be discussed later. RGMs are to be forwarded using the typical Reverse Path Broadcasting (RPB) algorithm. It is assumed that each node that has more than one port maintains a session database, called the "RGM-flow table", to cache the information about every incidence RGM-flow. An entry is created for a sink-node in the cache upon receiving the first RGM from that node. Successive arrivals from the same sink-node will make the entry stay effective. For a node on the edge of the Internet, i.e., it has only one port connecting to the outside world, it only has to record the union of the multicast-group-lists of all incoming RGMs. The actual source addresses of RGMs are not interested because the node ought to send datagrams through the only port any way, if it is the source node of Expires February 20, 1997 [Page 6] Internet Draft August 20, 1996 the requested multicast group. In case that an intermediate/source node has direct connections with multiple adjacent nodes, it possibly intercepts multiple RGM-flows sourced by the same sink-node but via different routes. Nevertheless , only one RGM flow from a given sink-node is accepted and possibly forwarded to form the segment of the sink-rooted spanning tree. In the RGM-flow table, an incoming RGM-flow is represented by the (SinkNode, InPort) pair. RRD is recorded for the respective RGM-flow whenever a RGM arrives. Any other ports that are currently not being contacted by the same sink-node are considered having RRD value of infinite. Suppose that node-A has three neighboring nodes, it receives RGMs rooted at sink-subnet S (134.74.60.0) from port p0 and port p2, respectively. According to the RRD measurement, the route via p0 is the optimal reverse path back to node-S. The RGMs arriving at p0 are to be forwarded through ports p1 and p2 to the downstream neighbors (with respect to RGM-flows), whereas, those arriving at p1 are discarded. The RGM-flow table of node-A should look like as follows: SinkNode InPort RRD FlowTTL -------------------------------------- 134.74.60.0 p0 12345 2 p1 inf / p2 23456 3 As shown in the table, each (SinkNode, InPort) pair also has a Flow Time-To-Live (FlowTTL) counter to monitor the continuity of the RGM- flow. FlowTTLs will be decreased by one every unit time, and reset after receiving a successive RGM. The RRD value of a sub-entry is set to infinite if no more RGM arriving at the port and the FlowTTL has expired. The entry for a given SinkNode is cleaned up only if all sub-entries have infinite RRD values. It is mandatory that the interval between any sink-node sending two consecutive RGMs along the same outgoing path must have a protocol-specific up-bound, in order to quantify the continuity of the RGM-flow. Since packet-loss and delay-jitter are expected in the Internet, the duration of FlowTTL is supposed to cover several such maximal RGM-intervals. In summary, a source/intermediate node records the RRD information for each incoming RGM-flow, and only forwards RGMs that arrive via the optimal reverse path (with smallest RRD for a given sink-node) to downstream neighbors through all ports except the incoming port. Any RGM-flow will be terminated by the receiving node if the TTLs of RGM-datagrams are exhausted, and/or the reverse path of RGMs is not considered as the optimal reverse path back to their origins. Expires February 20, 1997 [Page 7] Internet Draft August 20, 1996 3.3. Forwarding Multicast Datagrams to Sink-Subnets Once the sink-rooted trees are formed, delivery of datagrams seems straight-forward. The Forwarding-Table of a source/intermediate node is derived from its RGM-flow-table. It contains entries for each multicast group that is interested by at least one active sink-node. Each multicast group in the table is mapped to a set of ports that represents the union of the optimal routes to every sink-node within the group. Thus, the basic forwarding rule is that, any node that has either local-generated or received datagrams will forward them to selected downstream neighbors according to the MulticastGroup-to- OutPorts map. However, multiple sink-rooted trees might conjoin each other so that the aggregated spanning graph contains circuits. In order to form the group-specific spanning tree, circuits must be broken. #----------------------------------------------------------------# | | | +++++p0 +++++ G-E G E | | + G @--------@ E @------- | | \ | | ++@++ p1+++++p0 \ A-B-D A-B-D | | p1| \ | | | | |p1 \ p0 F-C (A) F-C (B)(C)(F) | | ++@++ p1+++++ p1++@++ | | + A @--------@ B @--------@ D + G-E G-E G-E | | +++++p0 ++@++p0 +++++ \ | \ | \ | | p2| A-B-D A B-D A-B D | | |p0 | | | | | +++++p0 p1++@++ F-C (D) F-C (E) F-C (G) | | + F @--------@ C + | | +++++ +++++ the tree rooted at a given | | sink-node | | | | (*) represents sink node. | #----------------------------------------------------------------# The above figure illustrates an example topology and a possible set of trees rooted at every potential sink-node in the graph. Suppose that (A,C,G) is a set of active sink-nodes of a given multicast group 239.10.10.1, accordingly, the forwarding-table of each node should contain information as follows: Expires February 20, 1997 [Page 8] Internet Draft August 20, 1996 MulticastGroup OutPorts SinkNodes ----------------------------------- node-A: 239.10.10.1 p0 C p1 G node-B: 239.10.10.1 p1 A,G p2 C node-C: 239.10.10.1 p0 A,G node-D: 239.10.10.1 p1 A,C node-E: 239.10.10.1 p0 C p1 A,G node-F: 239.10.10.1 p0 A,C,G The problem arises if node-E serves as one of the sources of the multicast group, while the branches of tree(A) and tree(C) intersect at source-node E from different directions. According to the basic forwarding rule, datagrams are to be multicasted out through both port p0 and port p1 of node-E. Upon receiving datagrams from its port p0, intermediate-node B is supposed to further forward them via port p1 and port p2, since both routes are optimal from B's view point to sink-nodes A/G and C, respectively. Similarly, node-A considers its OutPort p0 as the optimal route to sink-node C. Obviously, this is a case of "mutual confusion", in which both node-A and node-B will forward and receive duplicated datagrams to and from each other. The basic cause of such error is that, the sink-rooted routing algorithm works, so far, in a manner that is totally transparent to the source address of the multicast datagram. The protocol has defined the mechanisms to enable an intermediate-node to explicitly determine the downstream optimal route to a given sink-node. However, the node has no idea whether an upstream path is also optimal for the same sink-node. It is evident that the "mutual confusion" problem exists between a pair of sink/intermediate nodes on a circuit, only because they are receiving datagrams from a particular source/intermediate node that forks multiple data-flows by duplicating datagrams. In the previous example, only node A and node B are confused when receiving datagrams sourced by node E. Any node other than node-E will not cause the same problem. The simplest solution to the circuit-problem is that, any node that receives identical datagrams from different neighbors always forward the early copy of a datagram and discards the later one. In so doing, any intermediate-node that has multiple OutPorts for a given Expires February 20, 1997 [Page 9] Internet Draft August 20, 1996 multicast group has to check the source address and the sequence number of every incoming datagram. Furthermore, the node may send messages to suggest its neighbors not to forward datagrams from a particular source-node. The link that has such messages on both directions must be the one causing the trouble. The problem is that, in addition to the processing overhead and the potential "flip-flop" situation, link-waste is inevitable. Since any source/intermedia node knows exactly who are expecting the data and binds them to corresponding OutPorts, a deterministic solution is suggested: If a source/intermediate node "splits" the data-flow, i.e., forwards a single incoming data-flow to more than one OutPort, it is responsible for specifying the sink-information regarding the departure-port in each outgoing datagram. The information can be a list of sink-nodes either exclusive or inclusive of the OutPort. Also, a receiving intermediate-node will check and possibly re-specify the applicable sink information only if it is the next splitting point of the data-flow. The proposed forwarding-algorithm can be expressed as the following pseudo-code, in which inclusive sink-specification is assumed. while ( receiving datagram D(G) of group G from InPort(D) ) { if ( G has entry in forwarding-table && TTL(D) != 0 ) { outport[] = OutPorts(D) - InPort(D); dup = number of entries in outport[]; if ( dup == 0 ) discard D; else if ( dup == 1) { TTL(D) = TTL(D) -1; forward D through outport[0]; /* only active port */ } else { for (i=0; i August 20, 1996 the algorithm guarantees that any node intercepts data-flows only from the optimal upstream paths. Therefore, delivery paths of multicast datagrams are guaranteed end-to-end optimal. An additional advantage is that any source/intermediate node is capable of rejecting a particular sink-node for the purpose of security or resource-reservation. 4. Discussion One of the limitations of SARP is the additional overhead for processing RGMs. With the basic algorithm of SARP, the size of system memory required by the protocol is proportional to the number of active sink-nodes. Any node are potentially saturated by RGM-flows, especially when routing is taking place in an internetwork with a regular mesh topology. The protocol is expected to be further improved in terms of the protocol efficiency and scalability with other supplementary means. Since the multicast spanning tree is completely specified by the incidence relations between RGM-flows and nodes, the optimalization can be achieved by reducing unnecessary or redundant forwarding of RGMs. An immediate justification to be made is using RPM algorithm instead of RPB algorithm to forward RGMs. Moreover, an intermediate- node that interconnects two disjoined sub-topology of the Internet, such as the node on the multicast backbone, can be used to merge RGM-flows from both directions to represent two groups of sink-nodes. The current focus of SARP is to find minimal-delay paths with respect to every sink-node, and the overall network load for a given multicast group is not necessarily optimized. We reconize that realization of minimal network load is equivalent to finding minimal RGM-flow pattern. This issue will also remain under further study. Reference [1] Steve Deering, "Host Extensions for IP Multicasting," RFC1112, August 1989. [2] Hedrick, C., "Routing Information Protocol," RFC 1058, Rutgers University, June 1988. [3] C. Semeria, T. Maufer, "Introduce to IP Multicast Routing," draft-rfced-info-semeria-00.txt, 3Comy Corporation, March 1996. [4] D. Waitzman, C. Partridge, and S. Deering, "Distance Vector Multicast Routing Protocol," RFC1057, November 1988. [5] D. Mills, "Internet Time Synchronization: The Network Time Protocol," IEEE Trans. on Commun., Vol.39, Oct. 1991. Expires February 20, 1997 [Page 11] Internet Draft August 20, 1996 Author's Address Yong Xie Phone: 212-650-7261 Email: xyong@ee-mail.engr.ccny.cuny.edu Myung J. Lee Phone: 212-650-7260 Email: mjlee@ee-mail.engr.ccny.cuny.edu Tarek N. Saadawi Phone: 212-650-7263 Email: eetns@ee-mail.engr.ccny.cuny.edu Department of Electrical Engineering City University of New York, City College New York, NY 10031 Expires February 20, 1997 [Page 12]