RMT Working Group Brian Whetten Internet Engineering Task Force Talarian Internet Draft Dah Ming Chiu Document: draft-ietf-rmt-bb-track-00.txt Sun Microsystems 17 November, 2000 Miriam Kadansky Expires 17 May 2001 Sun Microsystems Gursel Taskale Talarian Reliable Multicast Transport Building Block for TRACK Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document describes the TRACK Building Block. It contains functions relating to positive acknowledgments and hierarchical tree construction and maintenance. It is primarily meant to be used as part of the TRACK Protocol Instantiation. It is also designed to be useful as part of overlay multicast systems that wish to offer efficient confirmed delivery of multicast messages. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119. 1.0 Introduction INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 1 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 This document describes the TRACK Building Block. It contains functions relating to positive acknowledgments and hierarchical tree construction and maintenance. It is primarily meant to be used as part of the TRACK Protocol Instantiation. It is also designed to be useful as part of overlay multicast systems that wish to offer efficient confirmed delivery of multicast messages. As pointed out in the building blocks rationale draft [WVKHFL00], there are two different reliability tasks that can be provided by a reliable multicast transport: ensuring goodput and confirming delivery of application level messages. The NACK Protocol Instantiation and ALC Protocol Instantiation are each primarily concerned with ensuring goodput. The TRACK BB and TRACK PI relies on a repair tree to provide goodput as well as confirmed delivery. If Forward Error Correction, Generic Router Assist or other mechanisms are used to help provide goodput, they are assumed to work transparently at a layer below this BB, as if the IP multicast service has lower error rate. The TRACK BB also assumes that there is an automatic tree configuration algorithm which provides the list of parents each node should join to. If Receivers are used that may also serve as Repair Heads, the TRACK BB assumes the Auto Tree BB is also responsible for selecting the role of each Receiver as either Receiver or Repair Head. The TRACK BB also assumes that a separate session advertisement protocol notifies the receivers as to when to join a session, the data multicast address for the session, and the control parameters for the session. The TRACK BB provides additional information and aggregation capabilities, which are useful for congestion control. The TRACK BB provides the following detailed functionality. @ Hierarchical Session Creation and Maintenance. This set of functionality is responsible for creating and maintaining (but not configuring) the hierarchical tree of Repair Heads and Receivers. o Bind. When a child knows the parent it wishes to join to for a given data session, it binds to that parent. o Child Initiated Unbind. When a child wishes to leave a data session, either because the session is over or because the application is finished with the session, it initiates an unbind operation with its parent. o Parent Initiated Unbind. A parent can also force a child to unbind. This happens if the parent needs to leave the session, if the child is not behaving correctly, or if the INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 2 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 parent wants to move the child to another parent as part of tree configuration maintenance. o Fault Detection. In order to verify liveness, parents and children send regular heartbeat messages between themselves. The sender also sends regular null data messages to the group, if it has no data to send. o Fault Recovery. When a child detects that its parent is no longer reachable, it may switch to another parent. When a parent detects that one of its children is no longer reachable, it removes that child from its membership list and reports this up the tree to the Sender of the Data Session. @ TRACK Generation. This set of functionality is responsible for periodically generating TRACK messages from all receivers to acknowledge receipt of data, report missing messages, advance flow control windows, provide roundtrip time measurements and provide other group management information. The algorithms include: o TRACK Timing. In order to avoid ACK implosion, the Receivers and Repair Heads use the rotating TRACK algorithm. o Flow Control and Buffer Management. Receivers and Repair Heads maintain a set of buffers that are at least as large as the Sender's transmission window. The Receivers pass their reception status up to the sender as part of their TRACK messages. This is used to acknowledge receipt of delivery, to advance the buffer windows at each node, and to limit the sender's window advancement to the speed of the slowest receiver. o Application Level Confirmed Delivery. Confirmed Delivery provides transport level confirmation of delivery. Senders can put a "synch point" request in data messages, asking for application level confirmation. Data messages with this flag set are only confirmed by the Receivers after the Receiver applications confirm receipt. @ Local Recovery. This functionality describes how repair heads maintain state on their children and provide repairs in response to requests for retransmission contained in TRACK messages. This has overlap with the NACK BB, which is unified in the TRACK PI. @ TRACK Aggregation. In order to provide the highest levels of scalability and reliability, interior tree nodes provide aggregation of control traffic flowing up the tree. The aggregated feedback information includes that used for end-to- end confirmed delivery, flow control, congestion control, and group membership monitoring and management. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 3 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 @ Distributed RTT Calculations. One of the primary challenges of congestion control is efficient RTT calculations. TRACK provides two methods to perform these calculations. o Sender Per-Message RTT Calculations. Each message is stamped with a timestamp from the sender. As each is passed up the tree, the amount of dally time spent waiting at each node is accumulated. The highest measurements are passed up the tree, and the dally time is subtracted from the original measurement. o Local Per-Level RTT Calculations. Each parent measures the local RTT to each of its children as part of the keep-alive messages used for failure detection. 2. Design Rationale Much of the design rationale behind the protocol instantiations and building blocks being standardized by the RMT working group are laid out in [WVKHFL00]. In addition, the design rationale for the TRACK PI is laid out in [WCP00]. This building block conforms with the design rationales laid out in both of those documents. TRACK is designed to provide confirmed delivery, receiver-based flow control, distributed management of group membership (some of them may be dedicated servers in a repair tree), as well as providing aggregation of information up the tree. It also provides requests for retransmissions as part of TRACK messages, and local recovery of lost packets. This TRACK BB is primarily designed to work as part of the TRACK BB, in conjunction with other BB's including NACK, FEC, and Auto Tree. In the spirit of modular reuse specified in [WVKHFL00], it is also designed to be useful as an additional layer of functionality on top of any of the following services. 1) The functionality (if not the exact message headers) of the NORM PI. 2) The functionality (if not the exact message headers) of the ALC PI. 3) Running directly on top of an unreliable IP multicast routing protocol, but on a carefully provisioned network. 4) On top of an overlay multicast (aka application layer multicast) system. Overlay multicast is a system where servers in the network provide multicast (and unicast) routing as well as reliable multicast delivery, all on top of a combination of unicast (i.e. TCP) and, as available, reliable multicast services. There is a fundamental tradeoff between reliability and real-time performance in the face of failures. There are two primary types of single layer reliability that have been proposed to deal with this: sender reliable and receiver reliable delivery. Sender INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 4 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 reliable delivery is similar to TCP, where the sender knows the identity of the receivers in a data session, and is notified when any of them fails to receive all the data messages. Receiver reliable delivery limits knowledge of group membership and failures to only the actual receivers. Senders do not have any knowledge of the membership of a group, and do not require receivers to explicitly join or leave a data session. Receiver reliable protocols scale better in the face of networks that have frequent failures, and have very high isolation of failures between receivers. This TRACK BB provides sender reliable delivery, potentially on top of a receiver reliable system. This BB is specified according to the guidelines in [KV00]. In addition, it specifies all communication between entities in terms of messages, rather than packets. A message is an abstract communication unit, which may be part of, or all of, a given packet. It does not have a specific format, although it does contain a list of fields, some of which may be optional, and some of which may have fixed lengths associated with them. It is up to each protocol instantiation to combine the set of messages in this BB, with those in other components, and create the actual set of message formats that will be used. As mentioned in the introduction, this BB assumes the existence of a separate Auto Tree Configuration BB. It also assumes that data sessions are advertised to all receivers as part of an external BB or other component. It expects to also interact with other BB's through the TRACK PI, but does not require this. 3. Applicability Statement It is widely recognized that no single reliable multicast protocol can meet the needs of all application types over all network types. Distinguishing factors include functionality and performance. From a functionality perspective, TRACK and NACK based reliable multicast protocols present an inherently different reliability model. TRACK based protocols are able to remove messages from the retransmission window when all the children have acknowledged them. NACK based protocols have to rely on other means for determining how long to keep messages in the retransmission window. A popular method is a time based scheme[SFCGLTLLBEJMRSV00]. TRACK based protocols can keep track of the membership of the data session, and provide confirmed delivery against that membership list. NACK protocols have anonymous membership. Since reliability is obtained through control traffic, the difference in the semantics of the term reliability lead to the second distinguishing factor: performance. When a persistent failure occurs among the members of a TRACK based protocol, there INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 5 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 is a possibility that this may slow down other members of the group. NACK protocols have higher isolation of failures, as well as smaller amounts of control traffic under many scenarios. 3.1 Application types The objectives of TRACK are to provide high level reliability, high scalability, congestion control and flow control for one to many bulk data dissemination. TRACK is not designed for many to many applications Examples of applications that fit into the one-to-many data dissemination model are: real time financial news and market data distribution, electronic software distribution, audio video streaming, distance learning, software updates and server replication. But, not all of these application types have the same reliability requirements. Historically, financial applications have had the most stringent reliability requirements, while audio video streaming have had the least stringent. For applications that want to have strong confirmation of delivery guarantees, TRACK may be more applicable than alternatives such as NORM or ALC. For applications that do not require this level of reliability, or that demand the lowest levels of latency and the highest levels of failure isolation, TRACK may be less applicable. The TRACK BB, in particular, is designed to optionally work on top of a NORM or ALC PI, to allow applications to select this tradeoff on a dynamic basis. 3.2 Network Infrastructure The TRACKs also serve to provide feedback information to the sender. The sender uses this information for congestion and flow control. This allows TRACK to be applicable in most networks (i.e. managed and shared networks, and high congestion networks.) Asymmetric networks with very low upbound bandwidth and a very low loss data channel may be better served through NACK based protocols, particularly if high reliability is not required. A good example is some satellite networks. Networks that have very high loss rates, and regularly experience partial network partitions, router flapping, or other persistent faults, may be better served through NACK only protocols. 4. Message Types The following table summarizes the messages and their fields used by the TRACK BB. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 6 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 Message From To Mcast? Fields BindRequest Child Parent no Session, Scope, Level, Role, SubTreeCount BindConfirm Parent Child no Session, Level, RepairAddr, SeqNum, MemberId BindReject Parent Child no Session, Reason UnbindRequest Child Parent no Session, Reason UnbindConfirm Parent Child no Session EjectRequest Parent Child either Session, Reason EjectConfirm Child Parent no Session Heartbeat Parent Child either Session, Level, ChildrenList, SeqNum, SendTime NullData Sender all yes Session, SeqNum, Rate, AppSynch Data Sender all yes Session, SeqNum, Rate, AppSynch, HighestReleased, SendTimeStamp Retransmission Parent Child yes Session, SeqNum, Rate, AppSynch, HighestReleased, SendTimeStamp Track Child Parent no Session, SeqNum, BitMask, Scope, HighestAllowed, AppSynch, SubTreeCount, Slowest, RTT, ParentThere, ParentTimeStamp, LocalDallyTime, SenderTimeStamp, SenderDallyTime INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 7 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 The various fields of the messages are described as follows: - Session: an id that identifies the session. - Level: an integer that indicates the level in the repair tree. This value is used to keep loops in the tree from forming, in addition to indicating the distance from the sender. - SeqNum: an integer indicating the sequence number of a data message within a given data session. - AppSynch: a sequence number signaling a request for confirmed delivery by the application. - HighestAllowed: a sequence number, used for flow control from the receivers. It signals the highest sequence number the sender is allowed to send that will not overrun the receivers' buffer pools. - BitMask: an array of 1's and 0's. Together with a sequence number it is used to indicate lost data messages. If the i'th element is a 1, it indicates the message SeqNum+i is lost. - Scope: an integer to indicate how far a repair message travels. This is optional. - Role: This indicates if the bind requestor is a receiver or repair head. - SubTreeCount: This is an integer indicating the current number of receivers below the node. - ChildrenList: This field contains the identifiers for a list of children. As part of the keepalive message, this field together with the SeqNum field is used to urge those listed receivers to send a TRACK (for the provided SeqNum). The repair head sending this must have been missing the regular TRACKs from these children for an extended period of time. - Rate: This field is used by the sender to tell the receivers its sending rate, in packets per second. It is part of the data or nulldata messages. - HighestReleased: This field contains a sequence number, corresponding to the trailing edge of the sender's retransmission window. It is used (as part of the data, nulldata or retransmission headers) to inform the receivers that they should no INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 8 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 longer attempt to recover those messages with a smaller (or same) sequence number. - HighestPacketSent: This field is part of nulldata messages. It contains the highest sequence numbered data message sent so far as part of this data session. - Slowest: This field contains a field that characterizes the slowest receiver in the subtree beneath (and including) the node sending the TRACK. This is used to provide information for the congestion control BB, and the aggregation methods on this information are defined by that BB. - ParentThere: This field indicates to the parent that the receiver sending the TRACK has not been receiving the regular keepalive messages from its parent, and is wondering if it needs to find a new parent. - Reason: a code indicating the reason for the BindReject, UnbindRequest, or EjectRequest message. - RepairAddr: This field in the BindConfirm message is used to tell the receiver which multicast address the repair head will be sending retransmissions on. If this field is null, then the receiver should expect retransmissions to be sent on the sender's data multicast address. This SeqNum field in the same BindConfirm message indicates the sequence number starting from which the repair head promises to provide repair service. - SenderTimestamp: This field is included in Data messages to signal the need to do a roundtrip time measurement from the sender, through the tree, and back to the sender. It is the time (measured by the sender's local clock) when it sent the packet. - SenderDallyTime: This field is included in TRACK messages. It is associated with a SenderTimestamp field. It contains the sum of the waiting time that should be subtracted from the RTT measurement at the sender. - ParentTimestamp: This field is included in Heartbeat messages to signal the need to do a local RTT measurement from a parent. It is the time when the parent sent the packet. - ParentDallyTime: This is the same as the SenderDallyTime, but is associated with a ParentTimestamp instead of a SenderTimestamp. - MemberId: This is an integer the repair head assigns to a particular child. The child receiver uses this value to implement the rotating TRACK Generation algorithm. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 9 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 5. Global Configuration Variables and Error Codes 6. External API This section describes external interfaces for the building block. 6.1 Interfaces to the BB from the PI 6.1.1 Start(boolean RepairHead, boolean RejoinAllowed, Advertisement) Start instructs the BB to initiate operation. RepairHead indicates whether or not the node may also operate as a repair head. RejoinAllowed indicates whether or not the node is allowed to rejoin the session if the only repair heads available are missing some repair data. In particular, this parameter also controls whether or not the node is allowed to join the session after the first data messages have become unrecoverable (late join). The Advertisement parameter passes to the BB all of the parameters from the session advertisement. 6.1.2 End End instructs the BB to end its operation. If the node is the Sender, it indicates to the group that the session has ended. If the node is a repair head, it continues to operate as a repair head until all of its children have finished. 6.1.3 SendMessage(Message) SendMessage presents the BB with a data message to send to the group. 6.1.4 MessageSynched(Message) MessageSynched tells the BB that the indicated message has been synched with the application. 6.2 Interfaces from the BB to the PI 6.2.1 MessageReceived(Message, boolean SynchMessage) MessageReceived passes a data message up to the PI. SynchMessage indicates whether or not the PI should call MessageSynched once the message has been consumed by the application. 6.2.2 SenderLost SenderLost tells the PI that contact with the sender has been lost. 6.2.3 UnrecoverableData INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 10 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 UnrecoverableData indicates to the PI that the BB was unable to recover some session data. 6.2.4 SessionDone SessionDone indicates to the PI that the sender has completed sending the data, and the node has left the session. 7. Algorithms 7.1 Tree Based Session Creation and Maintenance 7.1.1 Overview of Tree Configuration Before a Data Session starts delivering data, the tree for the Data Session needs to be created. This process binds each Receiver to either a Repair Head or the Sender, and binds the participating Repair Heads into a loop-free tree structure with the Sender as the root of the tree. This process requires tree configuration knowledge, which can be provided with some combination of manual and/or automatic configuration. The algorithms for automatic tree configuration are part of the Automatic Tree Configuration BB. They return to each node the address of the parent it should bind to, as well as zero or more backup parents to use if the primary parent fails. In addition to receiving the tree configuration information, the receivers all receive a Session Advertisement message from the senders, informing them of the Data Multicast Address and other session configuration information. This advertisement may advertise other relevant session information such as whether or not Repair Heads should be used, whether manual or automatic tree configuration should be used, the time at which the session will start, and other protocol constants. This advertisement is created as part of either the PI or as part of an external service. In this way, the Sender enforces a set of uniform Session Configuration Parameters on all members of the session. As described in the automatic tree configuration BB, the general algorithm for a given node in tree creation is as follows. 1) Get advertisement that a session is starting 2) Get list of neighbor candidates, contact them 3) Select best neighbor as parent in a loop free manner 4) Bind to parent 5) Optionally, later rebind to another parent When a child finishes step 4, it is up to automatic tree configuration to, if necessary, continue building the tree in order to connect the node back to the Sender. After the session is created, children can unbind from their parents and bind again to INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 11 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 new parents. This happens when faults occur, or as part of a tree optimization process. Steps 1 through 3 are external to the TRACK BB. Step 4 is performed as part of session creation. Step 5 is performed as part of session maintenance in conjunction with automatic tree building, as either a child initiated unbind or a parent initiated unbind, combined with another bind operation. Once steps 1 through 3 are completed, Receivers join the Data Multicast Address, and attempt to bind to either the Sender or a local Repair Head. A Receiver will attempt to bind to the first node in the tree configuration list returned by step 3, and if this fails, it will move to the next one. A Receiver only binds to a single Repair Head or Sender, at a time, for each Data Session. The automatic tree building BB ensures that the tree is formed without loops. As part of this, when a Repair Head has a Receiver attempt to bind to it for a given Data Session, it may not at first be able to accept the connection, until it is able to join the tree itself. Because of this, a Receiver will sometimes have to repeatedly attempt to bind to a given parent before succeeding. Once the Sender initiates tree building, it is also free to start sending Data messages on the Data Multicast Address. Repair Heads and Receivers may start receiving these messages, but may not request retransmission or deliver data to the application until they receive confirmation that they have successfully bound to the group. 7.1.2 Bind 7.1.2.1 Input Parameters In order to join a data session and bind to the tree, the following nodes need the following parameters. A Repair Head requires the following parameters. - Session: the unique identifier for the data session to join, received from the Session Advertisement algorithm in the PI. - ParentAddress: the address and port of the parent node to which the node should connect - UDPListenPort: the number of the port on which the node will listen for its children's control messages - RepairAddr: the multicast address, UDP port, and TTL on which this node sends control messages to its children. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 12 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 A Sender requires the above parameters, except for the ParentAddress. A Receiver requires the above parameters, except for the UDPListenPort and RepairAddr. 7.1.2.2 Bind Algorithm A Bind operation happens when a child wishes to join a parent in the distribution tree for a given data session. The Receivers initiate the first Bind protocols to their parents, which then cause recursive binding by each parent, up to the Sender. Each Receiver sends a separate BindRequest message for each of the streams that it would like to join. At the discretion of the PI, multiple BindRequest messages may be bundled together in a single message. A node sends a BindRequest message to its automatically selected or manually configured parent node. The parent node sends either a BindConfirm message or a BindReject message. Reception of a BindConfirm message terminates the algorithm successfully, while receipt of a BindReject message causes the node to either retry the same parent or restart the Bind algorithm with its next parent candidate (depending on the BindReject reason code), or if it has none, to declare a REJECTED_BY_PARENT error. Reliability is achieved through the use of a standard request- response protocol. At the beginning of the algorithm, the child initializes TimeMaxBindResponse to the constant TIMEOUT_PARENT_RESPONSE and initializes NumBindResponseFailures to 0. Every time it sends a BindRequest message, it waits TimeMaxBindResponse for a response from the parent node. If no response is received, the node doubles its value for TimeMaxBindResponse, but limits TimeMaxBindResponse to be no larger than MAX_TIMEOUT_PARENT_RESPONSE. It also increments NumBindResponseFailures, and retransmits the BindRequest message. If NumBindResponseFailures reaches NUM_MAX_PARENT_ATTEMPTS, it reports a PARENT_UNREACHABLE error. When a parent receives a BindRequest message, it first consults the automatic tree building BB for approval, for instance to ensure that accepting the BindRequest will not cause a loop in the tree. Then the parent checks to be sure that it does not have more than NUM_MAX_CHILDREN children already bound to it for this session. If it can accept the child, it sends back a BindConfirm message. Otherwise, it sends the node a BindReject message. Then the parent checks to see if it is already a member of this data session. If it is not yet a member of this session, it attempts to join the tree itself. The BindConfirm message contains the lowest sequence number that the repair head has available. If this number is 0 or 1, then the INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 13 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 repair head has all of the data available from the start of the session. Otherwise, the requesting node is attempting a late join, and can only use this repair head if late join was allowed by the PI. If late join is not allowed, the node may try another repair head, or give up. Similarly, if a failure recovery occurs, when a node tries to bind to a new repair head, it must follow the same rules as for a late join. 7.1.3 Child Initiated Unbind A child may decide to leave a data session for the following reasons. 1) It detects that the data session is finished. 2) The application requests to leave the data session. 3) It is not able to keep up with the data rate of the data session. When any of these conditions occurs, it initiates an Unbind process. A Child Initiated Unbind is, like the Bind function, a simple request-reply protocol. Unlike the Bind function, it only has a single response, UnbindAccept. With this exception, the Unbind operation uses the same state variables and reliability algorithms as the Bind function. When a child receives an UnbindAccept message from its parent, it reports a LEFT_DATA_SESSION_GRACEFULLY event. If it does not receive this message after NUM_MAX_PARENT_ATTEMPTS, then it reports a LEFT_DATA_SESSION_ABNORMALLY event. 7.1.4 Parent Initiated Unbind A parent may decide to remove one or more of its children from a data stream for the following reasons. 1) The parent needs to leave the group due to application reasons. 2) The repair head detects an unrecoverable failure with either its parent or the sender. 3) The parent detects that the child is not able to keep up with the speed of the data stream. 4) The parent is not able to handle the load of its children and needs some of them to move to another parent. In the first two cases, the parent needs to multicast the advertisement of the termination of one or more data sessions to all of its children. In the second two cases, it needs to send one or more unicast notifications to one or more of its children. Consequently, a Parent Initiated Unbind can be done either with a repeated multicast advertisement message to all children, or a set of unicast request-reply messages to the subset of children that it needs to go to. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 14 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 For the multicast version of Parent Initiated Unbind, the parent sends a multicast UnbindNotification message to all of its children for a given Data Session, on its Local Multicast Channel. It is only necessary to provide statistical reliability on this message, since children will detect the parent's failure even if the message is not received. Therefore, the UnbindNotification message is sent NUM_REDUNDANT_MULTICAST_ADVERTISEMENT times, which has a recommended value of 3. For the unicast version of Parent Initiated Unbind, the parent sends a unicast UnbindNotification message to all of its children. Each of them respond with an EjectConfirm. Reliability is ensured through the same request-reply mechanism as the Bind operation. 7.1.5 Fault Detection There are three cases where fault detection is needed. 1) Detection (by a child) that a parent has failed. 2) Detection (by a parent) that a child has failed. 3) Detection (by either a RH or Receiver) that a Sender has failed. In order to be scaleable and efficient, fault detection is primarily accomplished by periodic keep-alive messages, combined with the existing TRACK messages. Nodes expect to see keep-alive messages every set period of time. If more than a fixed number of periods go by, and no keep-alive messages of a given type are received, the node declares a preliminary failure. The detecting node may then ping the potentially failed node before declaring it failed, or it can just declare it failed. Failures are detected through three keep-alive messages: Heartbeat, TRACK, and NullData. The Heartbeat message is multicast periodically from a parent to its children on its local control channel. NullData messages are multicast by a sender on the global multicast address when it has no data to send. TRACK messages are generated periodically, even if no data is being sent to a data session, as described in section 7.2. Heartbeat messages are multicast every HeartbeatPeriod seconds, from a parent to its children. Every time that a parent sends a Retransmission message or a Heartbeat message (as well as at initialization time), it resets a timer for HeartbeatPeriod seconds. If the timer goes off, a Heartbeat is sent. The HeatbeatPeriod is dynamically computed as follows: interval = AckWindow / PacketRate HeartbeatPeriod = 2 * interval INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 15 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 If no Retransmission messages are sent within HeartbeatPeriod, a Heartbeat message is generated. Global configuration parameters ConstantHeartbeatPeriod and MinimumHeartbeatPeriod can be used to either set HeartbeatPeriod to a constant, or give HeartbeatPeriod a lower bound, globally. Similarly, a NullData message is multicast by the sender to all data session members, every NULL_DATA_PERIOD. The NullData timer is set to NULL_DATA_PERIOD, and is reset every time that a Data or NullData message is sent by the Sender. The key parameter for failure detection is the global tree parameter FAILURE_DETECTION_REDUNDANCY. The higher the value for this parameter, the more keep-alive messages that must be missed before a failure is declared. A major goal of failure detection is for children to detect parent failures fast enough that there is a high probability they can rejoin the stream at another parent, before flow control has advanced the buffer window to a point where the child can not recover all lost messages in the stream. In order to attempt to do this, children detect a failure of a parent if FAILURE_DETECTION_REDUNDANCY * HeartbeatPeriod time goes by without any heartbeats. As part of buffer window advancement, described in section 7.2.4, all parents MAY choose to buffer all messages for a minimum of FAILURE_DETECTION_REDUNDANCY * 2 * HeartbeatPeriod seconds, which gives children a period of time to find a new parent before the buffers are freed. A parent detects a preliminary failure of one of its children if it does not receive any TRACK messages from that child in FAILURE_DETECTION_REDUNDANCDY * TrackTimeout seconds (see discussion of how TrackTimeout is computed in 7.2). Because a failed child can slow down the group's progress, it is very important that a parent resolve the child's status quickly. Once a parent declares a preliminary failure of a child, it issues a set of up to FAILURE_DETECTION_REDUNDANCY Heartbeat messages that are unicast (or multicast) to the failed receiver(s). These messages are spaced apart by 2*LocalRTT, where LocalRTT is the round trip time that has been measured to the child in question (see 7.4 for description of how LocalRTT is measured). These Heartbeat messages contain a ChildrenList field that contains the children who are requested to send a TRACK immediately. Whenever a child receives a Heartbeat message with an ImmediateTRACK field set to 1, it immediately sends a TRACK to its parent. If a parent does not receive a TRACK message from a child after waiting a period of 2*ChildRTT after the last Heartbeat message to that child, it declares the child failed, and removes it from the parent's child membership list. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 16 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 A child or a repair head detects the failure of a sender if it does not receive a Data or NullData message from a sender in FAILURE_DETECTION_REDUNDANCY * NULL_DATA_PERIOD. Note that the more receivers there are in a tree, and the higher the loss rate, the larger FAILURE_DETECTION_REDUNDANCY must be, in order to give the same probability that erroneous failures won't be declared. 7.1.6 Fault Notification When a parent detects the failure of a child, it adds a failure notification field to the next FAILURE_DETECTION_REDUNDANCY TRACK messages that it sends up the tree. It sends this notification multiple times because TRACKs are not delivered reliably. A failure notification field includes the failure code, as well as a list of one or more failed nodes. Failure notifications are aggregated up the tree, according to the rules in 7.3. A failure notification is not a definitive report of a failure, as the child may have moved to a different repair head. 7.1.7 Fault Recovery The Fault Recovery algorithms require a list of one or more addresses of alternate parents that can be bound to, and that still provide loop free operation. If a child detects the failure of its parent, it then re-runs the Bind operation to a new parent candidate, in order to rejoin the tree. As described above in section 7.1.2, a node may perform a late bind, i.e. binding with a repair head which cannot provide all the necessary repair data, only if allowed by the PI. 7.2 TRACK Generation This section describes the algorithms used by the receiver to determine when to send the TRACK messages. TRACK messages are sent from receivers to their parents. TRACK messages may be sent for the following purposes: - to request retransmission of messages - to advance the sender's transmission window for flow control purposes - to deliver end-to-end confirmation of data reception - to propagate other relevant feedback information up through the session (such as RTT and loss reports, for congestion control) The TRACK PI also makes use of the NACK BB, which request retransmission of messages from a parent. The TRACK request and INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 17 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 response algorithms should be highly similar to the NACK algorithms for this specific case. 7.2.1 TRACK Generation with the Rotating TRACK Algorithm Each receiver sends a TRACK message to its parent once per AckWindow (a globally configured variable) of data messages received. A receiver uses an offset from the boundary of each AckWindow to send its TRACK, in order to reduce burstiness of control traffic at the parents. Each parent has a maximum number of children, MaxChildren. When a child binds to the parent, the parent assigns a locally unique ChildID to that child, between 0 and MaxChildren-1. Each child in a tree generates a TRACK message at least once every AckWindow of data messages, when the most recent data message's sequence number, modulo MemberID, is equal to AckWindow. If the message that would have triggered a given TRACK for a given node is missed, the node will generate the TRACK as soon as it learns that it has missed the message, typically through receipt of a higher numbered data message. Together, AckWindow and MaxChildren determine the maximum ratio of control messages to data messages seen by each parent, given a constant load of data messages. In each data message, the sender advertises the current PacketRate (measured in messages per second) it is sending data at. This rate is generated by the congestion control algorithms in use at the sender. At the time a node sends a regular TRACK, it also computes a TRACKTimeout value: interval = AckWindow / PacketRate TRACKTimeout = 2 * interval If no TRACKs are sent within TRACKTimeout interval, a TRACK is generated, and TRACKTimeout is increased by a factor of 2, up to a value of MaxTRACKTimeout. This timer mechanism is used by a receiver to ensure timely repair of lost messages and regular feedback propagation up the tree even when the sender is not sending data continuously. This mechanism complements the AckWindow-based regular TRACK generation mechanism. 7.2.2 Local Repair INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 18 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 A repair head maintains the following state for each of its children, for the purpose of providing repair service to the local group: - HighestConsecutivelyReceived: a sequence number indicating all Data messages before this number (inclusive) have been received by a given child. - MissingPackets: a data structure to keep track of the reception status of the Data messages with sequence number higher than HighestConsecutivelyReceived. In addition, a repair head also maintains other state for purposes of feedback aggregation described in the next section. The minimum HighestConsecutivelyReceived value of all its children is kept as the variable LocalStable. A repair head also maintains a retransmission buffer. The size of the retransmission buffer must be greater than the maximum value of a sender's transmission window. The retransmission buffer must keep all the Data messages received by the repair head with sequence number higher than LocalStable, optionally some messages with sequence number lower than LocalStable if there is room (beyond the maximum value of sender's transmission window). The latter messages are kept in the retransmission buffer in case a receiver from another group losses its parent and needs to join this group. As TRACK messages are received, the repair head updates the above states. To perform local repair, a repair head implements a retransmission queue with memory. Each lost message (reported by a child using the BitMask field) is entered into the retransmission queue in increasing order according to its sequence number. If the same Data message has already been retransmitted recently (recognized due to the queue's memory) it is delayed by the local group RTT (see roundtrip time measurement) before retransmission. The retransmissions are sent using the same PacketRate is that used by the sender. 7.2.3 Flow Control Window Update When a receiver sends a TRACK to its parent, the HighestAllowed field provides information on the status of the receiver's flow control window. The value of HighestAllowed is computed as follows: HighestAllowed = seqnum + ReceiverWindow INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 19 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 Where seqnum is the highest sequence number of consecutively received data message at the receiver. The size of the ReceiverWindow may either be based on a parameter local to the receiver or be a global parameter. 7.2.4 Reliability Window The sender and each repair head maintain a window of messages for possible retransmission. As messages are acknowledged by all of its children, they are released from the parent's retransmission buffer, as described in 7.2.2. In addition, there are two global parameters that can affect when a parent releases a data message from the retransmission buffer -- MIN_HOLD_TIME, and MAX_HOLD_TIME. MIN_HOLD_TIME specifies a minimum length of time a message must be held for retransmission from when it was received. This parameter is useful to handle scenarios where one or more children have been disconnected from their parent, and have to reconnect to another. If, for example, MIN_HOLD_TIME is set to FAILURE_DETECTION_REDUNDANCY * 2 * ConstantHeartbeatPeriod, then there is a high likelihood that any child will be able to recover any lost messages after reconnecting to another parent. The sender continually advertises to the members of the data session both edges of its retransmission window. The higher value is the HighestPacketSent field in each data message, which specifies the highest sequence number of any data message sent. The trailing edge of the window is advertised in the HighestReleased field. This specifies the largest sequence number of any message sent that has subsequently been released from the sender's retransmission window. If both values are the same then the window is presently empty. Zero is not a legitimate value for a data sequence number, so if either field has a value of zero, then no messages have yet reached that state. All sequence number fields use sequence number arithmetic so that a data session can continue after exhausting the sequence number space. When a member of a data session receives an advertisement of a new HighestReleased value, it stores this, and is no longer allowed to ask for retransmission for any messages up to and including the HighestReleased value. If it has any outstanding missing messages that are less than or equal to HighestReleased, it MAY move forward and continue delivering the next data messages in the stream. It also SHOULD report an error for the messages that are no longer recoverable. MAX_HOLD_TIME specifies the maximum length of time a message may be held for retransmission. This parameter is set at the sender who uses it to set the HighestReleased field in data message headers. INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 20 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 This is particularly useful for real-time, semi-reliable streams such as live video, where retransmissions are only useful for up to a few seconds. When combined with Unordered delivery semantics, and application-level jitter control at the receivers, this provides Time Bounded Reliability. Obviously, MAX_HOLD_TIME must always be larger than MIN_HOLD_TIME. 7.2.5 Confirmed Delivery Flow control and the reliability window are concerned with goodput, of delivering data with a high probability that it is delivered at all receivers. However, neither mechanism provides explicit confirmation to the sender as to the list of recipients for each message. Confirmed delivery allows applications to determine the set of applications that have received a set of data messages. To request this service, a sender fills the AppSynch field of data messages with the sequence number of the highest data message it wishes to confirm delivery of. It continues to do so until it receives confirmation, moves the AppSynch point forward to a higher sequence number, or declares an error. When a receiver gets a data message with a non-zero AppSynch field, it starts including the highest sequence number that has been acknowledged by the application in the ApplicationConfirms field of each TRACK message that it sends up the tree. In order to provide reliable delivery of this acknowledgement, this continues so long as a receiver gets data messages with non-zero AppSynch fields. Each receiver is responsible for locally deciding the value of the ApplicationConfirms field. There are two primary issues a receiver must consider in setting this field: the reliability semantics of the data stream, and when a given message is considered confirmed at the receiver. As this is an application level confirmation, a handshake with the application is required to get this confirmation. One example of how an application can implicitly signal confirmation of delivery is through the freeing of buffers passed to it by the transport. The API could specify that whenever an application has freed up a buffer containing one or more data messages, then these messages are considered acknowledged by the application. Alternatively, the application could be required to explicitly acknowledge each message. With a given transport-application API for signaling acknowledgement, the transport then keeps track of all contiguous acknowledgements from that application, and reports these up in the ApplicationConfirms field. If one or more messages can not be acknowledged, the receiver should pass an error report attached to INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 21 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 the ApplicationConfirms field, detailing the type of failure that occurred, and the sequence number of the first message that has not yet been delivered. If MAX_HOLD_TIME is not in use for a data stream, so that delivery is fully reliable, then any message that can not be delivered will be considered a fatal error for that receiver. If MAX_HOLD_TIME has a non-zero value, then any messages that could not be delivered, but are less than HighestReleased as advertised by the sender, are not reported as errors. In addition to the AppSynch field, a sender may also set the ImmediateACK field to 1. When a node gets a data message that has this flag set, it will immediately send a TRACK after processing that message. 7.3 Feedback Aggregation This section describes how repair heads perform aggregation on feedback information sent up in the fields of the TRACK message, and the purposes for performing such aggregation. There are many reasons for providing feedback from all the receivers to the sender in an aggregated form. The major ones are listed below: 1) End-to-end delivery confirmation. This confirmation tells the sender that all the receivers (in the entire tree) have received data packets up to a certain sequence number. The field that carries this information is AppSynch. 2) Flow control. The aggregated information is carried in the field HighestAllowed. It tells the sender the highest sequence number that all the receivers (in the entire tree) are prepared to receive. 3) Identifying the slowest receiver. The aggregated information is carried in the field Slowest. The sender can used this value as part of congestion control. 4) Counting current membership in the group. This information is carried in the field SubTreeCount. This lets the sender know the number of receivers currently connected to the repair tree. 5) Measuring the round-trip time from the sender to the "worst" receiver. A repair head maintains state for each child. Each time a TRACK (from a child) is received, the corresponding states for that child are updated based on the information in the TRACK message. When a INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 22 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 repair head sends a TRACK message to its parent, the following fields of its TRACK message are derived from the aggregation of the corresponding states for its children. The following rules describe how the aggregation is performed: - AppSynch: take the minimum of the AppSynch value from all children - HighestAllowed: take the minimum of the HighestAllowed value from all children - Slowest: this is a measure of how slow the slowest member in the whole subtree is; take either the minimum (or maximum) of the Slowest value from all children (depending what the Slowest measure is). - SubTreeCount: take the sum of the SubTreeCount from all children - SenderDallyTime: take the minimum value of the SenderDallyTime from all children and add in the local dally time. Note, the SendTimeStamp field is left alone. The sender will derive the roundtrip time to the worst receiver by doing its local aggregation for SenderDallyTime and then compute: RTT = currentTime - SendTimeStamp - SenderDallyTime. 7.4 Measuring Round Trip Times This TRACK BB provides two algorithms for distributed RTT calculations-LocalRTT measurements and SenderRTT measurements. LocalRTT measurements are only between a parent and its children. SenderRTT measurements are an end to end RTT measurement, measuring the RTT to the worst receiver as selected by the congestion control algorithms. The SenderRTT is useful for congestion control. It can be used to set the data rate based on the TCP response function, which is being proposed for the congestion control building block. The LocalRTT can be used to (a) quickly detect faulty children (as described in 7.1) or (b) avoid sending unnecessary retransmissions (as described in 7.2 in the local repair algorithm). In the case of LocalRTT measurements, a parent initiates measurement by including a ParentTimestamp field in a Heartbeat message sent to its children. When a child receives a Heartbeat INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 23 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 message with this field set, it notes the time of receipt using its local system clock, and stores this with the message as HeartbeatReceiveTime. When the child next generates a TRACK, just before sending it, it measures its system clock again as TRACKSendTime, and calculates the LocalDallyTime. LocalDallyTime = TRACKSendTime - HeartbeatReceiveTime. The child includes this value, along with the ParentTimestamp field, as fields in the next TRACK message sent. Every heartbeat message that is multicast to all children SHOULD include a ParentTimestamp field. The SenderRTT algorithm is similar. A sender initiates the process by including a SenderTimestamp field in a data message. When a receiver gets a message with this field set, it keeps track of the DataReceiveTime for that message, and when it generates the next TRACK message, includes the SenderTimestamp and SenderDallyTime value. These values are aggregated by Repair Heads, as described in section 7.3. Each node only keeps track of the most recent value for {SenderTimestamp, DataReceiveTime} and {ParentTimestamp, HeartbeatReceiveTime}, replacing any older values any time that a new message is received with these values set. As long as it has non-zero values to report, each node sends up both a {SenderTimestamp, SenderDallyTime} and a {ParentTimestamp, LocalDallyTime} set of fields in each TRACK message generated. These measurements need to be averaged by the TRACK PI. 8. Security This BB does not specifically deal with security. It is the responsibility of the TRACK PI or the Security BB. This issue is covered in the Security Requirements For TRACK draft [HV00]. 9. References [HV00] T. Hardjono, B. Whetten, "Security Requirements For TRACK," Internet Draft, Internet Engineering Task Force, June, 2000. [KV00] R. Kermode, L. Vicisano, "Author Guidelines for RMT Building Blocks and Protocol Instantiation Documents," Internet Draft, Internet Engineering Task Force, June, 2000. [SFCGLTLLBEJMRSV00] T. Speakman, D. Farinacci, J. Crowcroft, J. Gemmell, S. Lin, A. Tweedly, D. Leshchiner, M. Luby, N. Bhaskar, R. Edmonstone, K. M. Johnson, T. Montgomery, L. Rizzo, R. Sumanasekera, and L. Vicisano, "PGM Reliable Transport Protocol INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 24 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 Specification," Internet Draft, Internet Engineering Task Force, April 2000. [WCP00] B. Whetten, D. Chiu, S. Paul, M. Kadansky, G. Taskale, "TRACK Architecture, A Scalable Real-Time Reliable Multicast Protocol," Internet Draft, Internet Engineering ask Force, July 2000. [WVKHFL00] B. Whetten, L. Vicisano, R. Kermode, M. Handley, S. Floyd, and M. Luby, "Reliable Multicast Transport Building Blocks for One-to-Many Bulk-Data Transfer," Internet Draft, Internet Engineering Task Force, October 2000. 10. Acknowledgments We would like to thank the follow people: Sanjoy Paul, Seok Joo Koh, Supratik Bhattacharyya, Joe Wesley, and Joe Provino. 11. Authors' Addresses Dah Ming Chiu dahming.chiu@sun.com Miriam Kadansky miriam.kadansky@sun.com Sun Microsystems Laboratories 1 Network Drive Burlington, MA 01803 Gursel Taskale gursel@talarian.com Brian Whetten whetten@talarian.com Talarian 333 Distel Circle Los Altos, CA 94022-1404 Full Copyright Statement "Copyright (C) The Internet Society (2000). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 25 INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt November 2000 purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE." INTERNET DRAFT draft-ietf-rmt-bb-track-00.txt 26