RMT Working Group Dah Ming Chiu, CUHK Internet Engineering Task Force Seok Joo Koh, ETRI Category: Informational Miriam Kadansky, Sun Microsystems December 2003 Brian Whetten, Consultant Expires June 2004 Gursel Taskale, Tibco Tree Auto-Configuration (TREE) Building Block for Reliable Multicast Transport Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026. Internet-Drafts are valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as a "work in progress". The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document defines the TREE building block (BB) for tree configuration within a reliable multicast transport (RMT) protocol. As an RMT building block, the TREE BB is a coarse-grained modular component that may be common to multiple RMT protocols. The TREE BB provides automatic configuration of a logical ACK-tree for RMT protocols in the tree-based ACK family. A companion document [RFCxxxx] describes the TRACK (Tree-based ACK) building block, which assumes that the logical ACK tree has been configured by the TREE BB described here. Chiu, Koh, Kadansky, Whetten, Taskale [Page 1] RMT BB: Tree Auto-Configuration (TREE) December 2003 Table of Contents 1. Introduction..................................................3 2. Terminology...................................................4 3. TREE BB Rationale.............................................5 4. Functionality of TREE BB......................................5 4.1 Assumptions and Requirements..............................5 4.2 Functional Overview.......................................6 5. BB Details: Session Announcement..............................9 6. BB Details: Repair Head Discovery and Selection...............9 6.1 Repair Head Discovery Algorithms..........................9 6.2 Distance Measurement.....................................16 6.3 Repair Head Selection....................................18 7. BB Details: Tree Formation...................................19 8. BB Details: Tree Maintenance.................................21 8.1 Monitoring Parents and Children..........................21 8.2 Optimizing the Tree......................................21 9. External Interfaces..........................................22 9.1 Interfaces to this BB....................................22 9.2 Interfaces from this BB to the PI or a higher-level BB...23 10. Applicability Statement.....................................24 11. Messages for TREE BB........................................25 12. Requirements from other Building Blocks.....................26 13. Security Considerations.....................................26 14. References..................................................27 Acknowledgments.................................................27 Author's Addresses..............................................28 Chiu, Koh, Kadansky, Whetten, Taskale [Page 2] RMT BB: Tree Auto-Configuration (TREE) December 2003 1. Introduction The Reliable Multicast Transport (RMT) working group was chartered to standardize IP multicast transport services [RFC2887]. Rather than create a set of monolithic protocol specifications, the RMT WG has chosen to break the reliable multicast protocols into Building Blocks (BB) and Protocol Instantiations (PI). A Building Block is a specification of the algorithms of a single component, with an abstract interface to other BBs and PIs. A PI combines a set of BBs, adds in the additional required functionality not specified in any BB, and specifies the specific instantiation of the protocol. This document describes the Tree auto-configuration building block, or TREE BB. The function of the TREE BB is to construct a session tree, comprised of a single sender, repair heads, and receivers, for use by a tree-based ACK reliable delivery protocol, for example the Tree-Based ACK (TRACK) BB [RFCxxxx]. The design goals of the TREE BB are motivated by the TRACK BB, but trees it constructs could be useful for other functions. The process of session tree construction for IP multicast is difficult. The best session trees match the underlying multicast routing tree topology; however, the multicast service model does not provide explicit support for discovering routing tree topology. There are several viable solutions for constructing a tree, depending on network conditions. The optimality of a tree may depend on a variety of factors, such as the need for load balancing and the need to minimize the tree depth used for collecting feedback information. The TREE building block specifies a distributed procedure for automatically constructing a tree that is loop-free and as efficient as possible, given the information available. It provides a unified solution for tree construction in the presence of different multicast service models and routing protocols. In particular, it specifies a single procedure which may be used with various techniques for repair head discovery and distance measurements, several of which are specified within this document. The difference in these techniques primarily affects the optimality of the tree. Chiu, Koh, Kadansky, Whetten, Taskale [Page 3] RMT BB: Tree Auto-Configuration (TREE) December 2003 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. In addition, the following terms are applied in this document as well as the TRACK BB document [RFCxxxx]. Session A session is used to distribute data over a multicast address. A Session Tree is used to provide reliability and feedback services for a session. Session Identifier A fixed-size number, chosen either by the application that creates the session or by the transport. Senders and receivers use the session Identifier to distinguish sessions. Repair Head (RH) A node within the tree which receives and retransmits data, and aggregates and forwards control information toward the sender. The sender operates as the root repair head in any session tree. An RH that has accepted children for a session is called a parent. Session Tree (ST) The session tree is a tree spanning all receivers of a multicast session. It is rooted at the sender, consisting of zero of more repair heads as interior nodes, and zero or more receivers as leaf nodes. Parent A parent is an RH or receiver's predecessor in the ST on the path toward the sender. Every RH or receiver on the tree except the sender itself has a parent. Each parent communicates with its children using either an assigned multicast address or through unicast. Children The set of receivers and RHs for which an RH or the sender is providing repair and feedback services. Chiu, Koh, Kadansky, Whetten, Taskale [Page 4] RMT BB: Tree Auto-Configuration (TREE) December 2003 3. TREE BB Rationale In order to accommodate various multicast deployments, this document divides the tree building process into the following major components: A. Several techniques for discovering neighboring repair heads: static, expanding ring search, and mesh. Discovering neighboring repair heads is a necessary condition for getting connected, so each node in the session must implement at least one of the above techniques. B. Several algorithms for selecting neighboring repair heads: the measurement and use of neighbor distance and sender distance are described. These repair head selection algorithms help produce a good tree. C. A single distributed procedure for construction and maintenance of loop-free session Trees. 4. Functionality of TREE BB 4.1 Assumptions and Requirements This TREE BB is designed based on the recommendations for tree configuration (Section 4.5 of RFC 3048, as also shown below), along with RFC 2357, RFC 2887 and RFC 3269. It has been shown that the scalability of RM protocols can be greatly enhanced by the insertion of some kind of retransmission or feedback aggregation agents between the sender and receivers. These agents are then used to form a tree with the sender at the root, the receivers at the leaves of the tree, and the aggregation/local repair nodes in the middle. The internal nodes can either be dedicated software for this task, or they may be receivers that are performing dual duty. The effectiveness of these agents to assist in the delivery of data is highly dependent upon how well the logical tree they use to communicate matches the underlying routing topology. The purpose of this building block would be to construct and manage the logical tree connecting the agents. Ideally, this building block would perform these functions in a manner that adapts to changes in session membership, routing topology, and network availability. This TREE BB document describes how to build trees under the following conditions: Chiu, Koh, Kadansky, Whetten, Taskale [Page 5] RMT BB: Tree Auto-Configuration (TREE) December 2003 a. The multicast group has only a single sender. b. A single RH can serve multiple sessions. c. Sessions can take advantage of a pre-deployed infrastructure of RHs (ones that are not necessarily aware of a session before the receivers), or recruit receivers to be RHs. d. Generic Router Assist and Expanding Ring Search are not required of the network infrastructure, but if available they should be able to be utilized. For tree configuration, the following are specifically required: a. While the tree building process may take advantage of information from the routing layer, the mechanisms described are independent of the routing protocol(s) used by the underlying multicast tree. b. All trees constructed must be loop-free c. These mechanisms must support late joiners and tree optimization 4.2 Functional Overview Session trees are spanning trees rooted at the sender. Session trees can be used for forwarding control information (i.e. ACKs) towards the root, or for forwarding data (i.e. repairs) towards the leaf nodes. Figure 1 illustrates overall procedures for tree configuration. Chiu, Koh, Kadansky, Whetten, Taskale [Page 6] RMT BB: Tree Auto-Configuration (TREE) December 2003 +--------------------+ | 1. Session | | Advertisement | +--------------------+ | | Node receives tree-building parameters V +--------------------+ | 2. Measurements |<-------------------------| | to the sender | | | (optional) | | +--------------------+ | | | | | V | +--------------------+ | | 3. Neighbor | | | Discovery | | +--------------------+ | | | | | V | +--------------------+ | | 4. Repair Head | | | Selection | | +--------------------+ | | | | Node picks best neighbor | V | +--------------------+ | | 5. Binding to |--------------------------- | Repair Head | 6. Optimization (optional) | | and fault recovery +--------------------+ Figure 1. Generic procedures for tree configuration Session trees are constructed per sender; each node wishing to join the tree discovers its neighboring RHs and then selects its best parent based on locally available information, such as the relative sender distances and neighbor distances. This document specifies several techniques for measuring distances. It is important to note that RHs may be actual receivers (e.g. receivers willing and able to also function as RHs) or pre-deployed "specialized" servers that are signaled to join the tree by Chiu, Koh, Kadansky, Whetten, Taskale [Page 7] RMT BB: Tree Auto-Configuration (TREE) December 2003 receivers. We use the term repair head to refer to either a receiver or server that is participating as part of the logical tree formation. Tree construction, regardless of RH discovery and selection algorithm, proceeds generically as follows. 4.2.1 Session Announcement Receivers of a session use standard out-of-band mechanisms for discovery of a session's existence (e.g. session advertisement in RFC 2974). In this way, a receiver discovers the multicast group address, the sender's address, and other information necessary for logical tree construction. Sessions may be announced in two parts, the first part containing generic information about the session, such as the multicast address, and the second part, announced on the multicast address, containing additional information. 4.2.2 Measurements to the sender (optional) All RHs and receivers that know about the session optionally determine their distance to the sender. 4.2.3 Neighbor Discovery Meanwhile, each receiver discovers nearby RHs (candidate parents) for the session using the neighbor discovery algorithm(s). 4.2.4 Repair Head Selection Once a receiver (or RH needing to join the tree) discovers a nearby RH, it obtains the RH's distance to the sender as well as the RH's distance to the receiver and other suitability values, if available. After discovery is complete, the best RH is selected. 4.2.5 Binding to Repair Head The receiver or RH then binds to the chosen RH. If a bind is unsuccessful, the receiver or RH retries with another nearby RH, or starts the discovery process all over again. Once an RH receives a bind from a child, that RH then joins the tree if it has not already, discovering an RH of its own, possibly using a different method than leaf receivers. Chiu, Koh, Kadansky, Whetten, Taskale [Page 8] RMT BB: Tree Auto-Configuration (TREE) December 2003 4.2.6 Optimization (optional) and Fault Recovery During a session, a receiver or RH may change to a different RH for a number of reasons described below, including fault tolerance. The session Tree is maintained and optimized over time. 5. BB Details: Session Announcement The first step in the tree-building process is for a node to be informed of the session's existence. This can be done using some out-of-band method, such as Session Advertisement [RFC2974], URL, e-mail, etc. RHs do not necessarily receive these advertisements. If an RH is not a receiver, it obtains the advertisement information once it is contacted by a receiver. The advertisement includes the multicast address being used, the sender's address, the session Identifier, any specific port numbers to be used, and any global information useful for tree construction. The advertisement may also contain information about one or more repair head Discovery options (such as Static, ERS, and Mesh) that can possibly be used by receivers in the session. 6. BB Details: Repair Head Discovery and Selection Discovery is the process by which a node determines a suitable repair head. During the discovery process, suitable neighbors are found, sender distances are optionally exchanged, and the best RH is selected. 6.1 Repair Head Discovery Algorithms This draft describes three algorithms for discovering RHs: Static, Expanding Ring Search (ERS), and Mesh. Multiple algorithms may be used within a single session. For example, RHs may use the Mesh algorithm, while the receivers use static configuration to discover the RHs; alternatively, some receivers may use static configuration while other receivers depend on ERS (in an intranet where ERS is available). Each receiver may pre-configure which algorithm to use before it starts. The transport protocols request the following information from this BB using the getRHs interface. Chiu, Koh, Kadansky, Whetten, Taskale [Page 9] RMT BB: Tree Auto-Configuration (TREE) December 2003 Repair Heads: 1. parentAddress: the address and port of the parent node to which the node should connect 2. UDPListenPort: the UDP port number on which the node will listen for its children's control messages 3. RepairAddr: the multicast address, UDP port, and TTL on which this node sends control messages to its children. Receivers: 1. parentAddress. Senders: 1. UDPListenPort 2. RepairAddr After the above information is obtained from tree-configuration, the transport protocol may perform the necessary Bind operations for participating in the session tree. Tree configuration may rely on a functional entity, named Tree Configurator (TC), which is pre-configured by sender for tree configuration. TC may be simply implemented by a program and thus installed at sender or any other host. It is not necessarily specialized infrastructure. 6.1.1 Static In the static scheme, TC is used to govern the tree building based on its own (session-specific) tree configuration and RH(s) selection rules for the new joiners. If a TC is used for tree building, its address and port are included with the session advertisement. Receivers and RHs will realize there is a TC for the session via session Announcement, and they can contact with the TC to get a list of candidate RHs by sending a unicast Query message. In response to a Query message, TC replies with an advertisement message that contains a list of candidate RHs available to the new joiner. The rule of determining such candidate RHs may depend on the pre-configured mechanism taken by TC. For example, TC may determine the candidate RH list for a node among the possible RHs (it contains at that time) by considering which RHs are in the same Chiu, Koh, Kadansky, Whetten, Taskale [Page 10] RMT BB: Tree Auto-Configuration (TREE) December 2003 network domain with the node (i.e., via comparing their network prefix), or by considering the load balancing for the tree topology it has configured until then. For this purpose, TC maintains a pool of active RHs for the session. The list of candidate RHs carried by Advertise message is ordered in decreasing levels of preference, in which a lower number represents a higher preference. When a receiver node receives the responding advertisement message from TC, the node MAY proceed to try to bind to a candidate RH following the given order by sending a BindRequestmessage, and then waits for the responding message such as BindConfirm or BindReject from TC. These binding steps will be done according to the TRACK mechanism, as described in the TRACK BB. In the static algorithm, RH discovery with a TC proceeds as follows: 1. The node joins the multicast session and learns of the location of TC (from the session announcement). 2. The node optionally discovers its distance from the sender. Any metric described in Section 4.2 may be used. 3. The node sends a Query message to the TC for a parent. The request includes (optionally) the node's distance to the sender and whether the node functions as an RH or not. 4. The TC chooses one or more candidate parents (RHs) for the node from the active RHs by its own tree configuration rule. The selection of candidate parents may be done by comparing the network prefix or by referring to any other information such as the number of currently attached children, etc. 5. The TC responds to the Query message with an Advertise message, which include the candidate parent list. In the list, each entry contains the corresponding IP address and port of an RH. All the entries in the list need to be arranged in the decreasing order of preference levels. In the rejection case, the Advertise message does not include any candidate parent. In this case, the node may resort to the other mechanism such as ERS and Mesh. In the success case, the node will be enrolled as an active RH by the TC, if it functions as an RH in the session. Each active RH (functioning as a parent for the session Tree) sends the TC the periodic Query messages with a flag Chiu, Koh, Kadansky, Whetten, Taskale [Page 11] RMT BB: Tree Auto-Configuration (TREE) December 2003 indicating that it is active over a specific time interval. Based on the Query messages, the TC updates a pool of active RHs in the session. In response to the Query message, the TC sends an Advertise with a flag simply indicating that the Query message is received. 6. After receiving a successful Advertise message from the TC, the node will try to connect to its parent by sending BindRequest messages based on the candidate parent list. Depending on implementation, one or more TCs may be locally deployed, in which each receiver can locally access some configured list of the prospective RHs. 6.1.2 ERS ERS is used only in the network environments where IP multicast transmissions are allowed by receivers and RHs. In ERS algorithm, the RH discovery with a TC proceeds as follows: 1. The nodes first look for Neighbors using a multicast Query message. The initial TTL value in the Query message, TTLNeighborInit, is specified in the session announcement and may be as large as the session TTL (TTLSession). An RH that is able to handle additional children responds to a Query message by multicasting an Advertise message. If the RH is not yet a parent, the TTL used in this response is the same TTL used in the Query message. If the RH is a parent, the TTL used is the greater of the Query TTL and the parent's current Advertise TTL. 2. The node listens for Advertise messages after sending the Query message. If one or more Advertise messages are received during a SolicitPeriod, the best RH among them is selected. 3. If no Advertise messages are received, the node sends another multicast Query message with a TTL that is incremented by TTLIncrement. The process of sending the multicast Query message with an increasing TTL value continues until a response is received. The TTL value is limited by a value, TTLMax, also specified in the session announcement. TTLMax defaults to the value of TTLSession. Chiu, Koh, Kadansky, Whetten, Taskale [Page 12] RMT BB: Tree Auto-Configuration (TREE) December 2003 If the TTL value required to reach the soliciting node is greater than the TTL used to reach the RH, an Advertise message may not reach the node. However, if future Query messages have increased TTL values, the TTL may eventually be large enough for the Advertise message to reach the node. However, it is possible that the node will not locate any RHs using Expanding Ring Search. It is advisable that a backup method, such as static, be available. 4. RHs need to suppress sending Advertise messages in response to Query messages if one was sent with at least the Query's TTL within the last SolicitPeriod. After multicasting a Query message, a node waits for an interval, BetweenQuery, before sending another Query message. Nodes will suppress sending Query messages for BetweenQuery seconds when they first start in order to collect information from Advertise messages already solicited by other nodes. 5. Getting a successful Advertise message via ERS, the node will try to connect to the parent by sending BindRequest messages. The following variables are used in the Expanding Ring Search algorithm. - TTLNeighborInit: This is the initial TTL value to be used by the ERS if no other TTL value is specified by the algorithm. - TTLIncrement: This is the periodic increment for the TTL used in ERS. - TTLMax: This is a configured maximum TTL value to be used by either Query or Advertise messages. - TTLSession: This is the session TTL value for the multicast session. - SolicitPeriod: Each receiver MUST not send more than one QUERY message per SolicitPeriod. When a RH responds to QUERY messages, it also suppresses its ADVERTISE message if one has been sent less than SolicitPeriod ago. This parameter is used to control the amount of control traffic during tree construction if ERS is used. - BetweenQuery: This is the time interval a node waits before sending successive Query messages. Chiu, Koh, Kadansky, Whetten, Taskale [Page 13] RMT BB: Tree Auto-Configuration (TREE) December 2003 - MaxBindAttempts: This variable is an integer used to control how many times a receiver tries to bind to a RH before giving up and try another one. - BindPeriod: In order to prevent loops, sometimes a RH may reject a BindRequest (for example, when the RH is not on the tree yet and has a BindRequest outstanding itself) from a receiver. In this case, if the receiver needs to retry binding to the same RH again (perhaps because the receiver does not discover any alternative RHs), then it waits for BindPeriod seconds. 6.1.3 Mesh The mesh approach relies on a set of RHs already deployed as infrastructure servers. These RHs are on-line, but are not necessarily aware of any particular session unless informed by the following mechanisms. RHs in the mesh are configured to know who their neighbor RHs are, and exchange reachability information with their neighbors in a way analogous to routers in a network. The actually protocol used by RHs to exchange such reachability information is outside the scope of this BB. Instead, this BB specifies the following properties that the mesh of RHs MUST satisfy: a) Each RH knows a subset of RHs in the mesh as its immediate neighbors. b) Each RH has a "forwarding table", such that given any other (destination) RH in the mesh, the forwarding table gives a "next-hop" RH that can be used to reach the destination RH, plus the distance to the destination RH from the local RH. c) A given RH in the mesh can "broadcast" information to all other RHs in the mesh (in the sense of having a means of sending the same information to all other RHs, but not necessarily simultaneously). d) All potential sender and receivers of a multicast session can discover a "neighboring" RH in the mesh, using the neighbor discovery mechanisms described in Section 4.1. The reason for running a routing-like algorithm to maintain the forwarding tables in each RH is to provide fault tolerance when some RHs in the mesh fail. When that happens, the remaining RHs exchange information with each other to update the forwarding Chiu, Koh, Kadansky, Whetten, Taskale [Page 14] RMT BB: Tree Auto-Configuration (TREE) December 2003 tables. In the steady state, the mesh of RHs still needs to satisfy the above four properties. In the simplest form, each RH in the mesh has a forwarding table that contains all other RHs in the mesh. This is called a fully- connected mesh. As mentioned earlier, the mesh scheme assumes that there is a pre- deployed infrastructure of RH servers. That is, the RH-RH bindings and sender-RH (Sender's RHs) will be performed internally by the provider's own policy. The only thing done by sender is to inform its RH's that a session starts. The relationships between sender and its RH's (i.e., sender-RH bindings) MUST also be pre-configured. For example, the internal bindings between sender and RHs MAY be done as follows: 1. The sender locates a neighbor RH in the mesh by a pre- configured mechanism. This RH is referred to as the sender's RH. 2. The sender sends the multicast session id, address and port (all these can be set as a abbreviated session announcement message) to the sender's RH. 3. The sender's RH in turn "broadcasts" the session information to all RHs in the mesh; since RHs can support multiple sessions simultaneously, they keep the information about each session in an entry in a session table. 4. After the sender-RH bindings, sender multicasts its session announcement to the multicast receivers. Then, the sender's RH binds to the sender by sending a BindRequest message. During the internal processing of sender-RH binding described up to now, no Query and Advertise messages are used. When a session starts, the bindings between RH and receivers will be done. The tree binding of RH-receiver is done with the Tree Configurator (TC), which was used in the Static algorithm. The main difference between Static algorithm with TC and Mesh algorithm with TC is that the active RHs are considered as candidate RHs in the Static scheme, while the pre-deployed RHs are considered as candidates in the Mesh scheme. That is, in the Mesh scheme, the TC is required to already get the information on the locations of the pre-deployed RHs. Then the tree buildings is done as follows: Chiu, Koh, Kadansky, Whetten, Taskale [Page 15] RMT BB: Tree Auto-Configuration (TREE) December 2003 1. When a session starts, a receiver sends a Query message to the TC. The receiver may optionally discover its distance from the sender. Any metric described in Section 4.2 may be used. 2. The TC chooses one or more candidate parents RHs for the receiver from the pre-deployed RHs by its own tree configuration rule, as described in Section 4.1.1. 3. TC responds to the Query message with an Advertise message, which includes the candidate parent list. In the list, each entry contains the corresponding IP address and port of the parent. 4. Receiving a successful Advertise message from the TC, the node will try to connect to its parent by sending BindRequest messages based on the candidate parent list. Once a receiver is connected to a parent RH, the RH-RH bindings will also be done internally by the pre-configured provider's policy. For example, each RH in the mesh tries to bind to its "next-hop" RH. If the "next-hop" RH is not reachable for some reason, an RH may also try to bind to any neighbor RH as a back-up alternative. These procedures will be devised to ensure that the loop freedom is guaranteed. 6.2 Distance Measurement Different techniques can be used to determine distances between nodes in a session. The distances of interest are: Sender distance - distance from a RH to sender Neighbor distance - distance from a receiver to a neighboring RH These distances can be used in selecting a repair head if several are discovered. These techniques quantify distance differently. Each specific way of quantifying distance is called a metric. Different Metrics are not necessarily comparable. For example, if distance between A and B is X using metric m1, and distance between A and C is Y using metric m2, then X > Y does not necessarily imply B is farther from A than C. Only distances of the same metric will be compared and ranked. Therefore, a receiver will only rank two RHs based on their respective sender distance if those distances are based on the same metric. Chiu, Koh, Kadansky, Whetten, Taskale [Page 16] RMT BB: Tree Auto-Configuration (TREE) December 2003 On the other hand, it is not necessary for all receivers to use the same metric to select their neighboring RH to connect to. Suppose the receivers use neighbor distance as a selection criterion. One receiver may determine neighbor distances to RHs based on hop count, whereas another receiver may determine neighbor distances to its neighboring RHs based on delay. 6.2.1 TTL Hop-Count If this metric is used, a node periodically sends a Beacon message on the session's multicast address. The Beacon message includes the original time-to-live value set by the node. The distance to the node is then calculated as Beacon's original TTL - Beacon's current TTL One node is closer than another if its distance is a lower number. Note that the TTL value may not be available in some implementation environments. 6.2.2 Number of Levels One metric a receiver can use for RH selection is the number of levels the RH is from the sender. For example, given two RHs in close proximity to a receiver, if one RH is n levels from the sender and the other is m levels, where m