RMT Working Group                                         Brian Whetten
 Internet Engineering Task Force                                Talarian
 Internet Draft                                            Dah Ming Chiu
 Document: draft-ietf-rmt-bb-track-01.txt               Sun Microsystems
 2 March, 2001                                           Miriam Kadansky
 Expires 2 October, 2001                                Sun Microsystems
                                                          Gursel Taskale
                                                                Talarian


          Reliable Multicast Transport Building Block for TRACK
                    <draft-ietf-rmt-bb-track-01.txt>


 Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note that
    other groups may also distribute working documents as Internet-
    Drafts. Internet-Drafts are draft documents valid for a maximum of
    six months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use Internet-Drafts
    as reference material or to cite them other than as "work in
    progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt
    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

 Abstract

    This document describes the TRACK Building Block.  It contains
    functions relating to positive acknowledgments and hierarchical
    tree construction and maintenance.  It is primarily meant to be
    used as part of the TRACK Protocol Instantiation.  It is also
    designed to be useful as part of overlay multicast systems that
    wish to offer efficient confirmed delivery of multicast messages.

 Conventions used in this document

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
    this document are to be interpreted as described in RFC-2119.


 INTERNET DRAFT      draft-ietf-rmt-bb-track-00.txt                   1

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 Table of Contents

    1. Introduction
    2. Design Rationale
    3. Applicability Statement
    3.1 Application types
    3.2 Network Infrastructure
    4. Message Types
    5. Global Configuration Variables, Constants, and Reason Codes
    5.1 Global Configuration Variables
    5.2 Constants
    5.3 Reason Codes
    6. External APIs
    6.1 Interfaces to the BB from PI's
    6.1.1 Start(boolean RepairHead, boolean RejoinAllowed,
          Advertisement)
    6.1.2 End
    6.1.3 incomingMessage(Message)
    6.1.4 getStatistics
    6.1.5 MessageSynched(Message)
    6.1.6 RepairHead(boolean)
    6.2 Interfaces from the BB to the PI
    6.2.1 outgoingMessage(Message)
    6.2.2 MessageReceived(Message, boolean Synch)
    6.2.3 SenderLost
    6.2.4 UnrecoverableData
    6.2.5 SessionDone
    7. Algorithms
    7.1 Tree Based Session Creation and Maintenance
    7.1.1 Overview of Tree Configuration
    7.1.2 Bind
    7.1.2.1 Input Parameters
    7.1.2.2 Bind Algorithm
    7.1.3 Unbind
    7.1.4 Eject
    7.1.5 Fault Detection
    7.1.6 Fault Notification
    7.1.7 Fault Recovery
    7.2 TRACK Generation
    7.2.1 TRACK Generation with the Rotating TRACK Algorithm
    7.2.2 Local Repair
    7.2.3 Flow Control Window Update
    7.2.4 Reliability Window
    7.2.5 Confirmed Delivery
    7.3 Feedback Aggregation
    7.4 Measuring Round Trip Times
    8. Security
    9. References
    10. Acknowledgements
    11. Authors' Addresses


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   2

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 1. Introduction
    This document describes the TRACK Building Block.  It contains
    functions relating to positive acknowledgments and hierarchical
    tree construction and maintenance.  It is primarily meant to be
    used as part of the TRACK Protocol Instantiation.  It is also
    designed to be useful as part of overlay multicast systems that
    wish to offer efficient confirmed delivery of multicast messages.

    As pointed out in the building blocks rationale draft [WVKHFL00],
    there are two different reliability tasks that can be provided by a
    reliable multicast transport: ensuring goodput and confirming
    delivery of application level messages.  The NACK Protocol
    Instantiation and ALC Protocol Instantiation are each primarily
    concerned with ensuring goodput.  The TRACK BB and TRACK PI rely on
    a repair tree to provide goodput as well as confirmed delivery. If
    Forward Error Correction, Generic Router Assist or other mechanisms
    are used to help provide goodput, they are assumed to work
    transparently at a layer below this BB, as if the IP multicast
    service has lower error rate.

    The TRACK BB also assumes that there is an Automatic Tree Building
    BB [KLCWTCTK01] which provides the list of parents (known as
    Service Nodes within in Tree BB) each node should join to.  If
    Receivers are used that may also serve as Repair Heads, the TRACK
    BB assumes the Auto Tree BB is also responsible for selecting the
    role of each Receiver as either Receiver or Repair Head.  However,
    the TRACK BB may specify that a particular node may not operate as
    a Repair Head.

    The TRACK BB also assumes that a separate session advertisement
    protocol notifies the receivers as to when to join a session, the
    data multicast address for the session, and the control parameters
    for the session.

    The TRACK BB provides additional information and aggregation
    capabilities, which are useful for congestion control.

    The TRACK BB provides the following detailed functionality.

    @ Hierarchical Session Creation and Maintenance.  This set of
       functionality is responsible for creating and maintaining (but
       not configuring) the hierarchical tree of Repair Heads and
       Receivers.
         o Bind.  When a child knows the parent it wishes to join to
            for a given data session, it binds to that parent.
         o Unbind.  When a child wishes to leave a data session,
            either because the session is over or because the
            application is finished with the session, it initiates an
            unbind operation with its parent.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   3

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


         o Eject.  A parent can also force a child to unbind.  This
            happens if the parent needs to leave the session, if the
            child is not behaving correctly, or if the parent wants to
            move the child to another parent as part of tree
            configuration maintenance.
         o Fault Detection.  In order to verify liveness, parents and
            children send regular heartbeat messages between
            themselves.  The sender also sends regular null data
            messages to the group, if it has no data to send.
         o Fault Recovery.  When a child detects that its parent is no
            longer reachable, it may switch to another parent.  When a
            parent detects that one of its children is no longer
            reachable, it removes that child from its membership list
            and reports this up the tree to the Sender of the Data
            Session.

    @ TRACK Generation.  This set of functionality is responsible for
       periodically generating TRACK messages from all receivers to
       acknowledge receipt of data, report missing messages, advance
       flow control windows, provide roundtrip time measurements and
       provide other group management information.  The algorithms
       include:
         o TRACK Timing.  In order to avoid ACK implosion, the
            Receivers and Repair Heads use the rotating TRACK
            algorithm.
         o Flow Control and Buffer Management.  Receivers and Repair
            Heads maintain a set of buffers that are at least as large
            as the Sender's transmission window.  The Receivers pass
            their reception status up to the sender as part of their
            TRACK messages.  This is used to acknowledge receipt of
            delivery, to advance the buffer windows at each node, and
            to limit the sender's window advancement to the speed of
            the slowest receiver.
         o Application Level Confirmed Delivery.  Confirmed Delivery
            provides transport level confirmation of delivery.  Senders
            can put a ôsynch pointö request in data messages, asking
            for application level confirmation.  Data messages with
            this flag set are only confirmed by the Receivers after the
            Receiver applications confirm receipt.

    @ Local Recovery. This functionality describes how repair heads
       maintain state on their children and provide repairs in response
       to requests for retransmission contained in TRACK messages.
       This has overlap with the NACK BB, which is unified in the TRACK
       PI.

    @ TRACK Aggregation. In order to provide the highest levels of
       scalability and reliability, interior tree nodes provide
       aggregation of control traffic flowing up the tree. The
       aggregated feedback information includes that used for end-to-


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   4

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


       end confirmed delivery, flow control, congestion control, and
       group membership monitoring and management.

    @ Distributed RTT Calculations.  One of the primary challenges of
       congestion control is efficient RTT calculations.  TRACK
       provides two methods to perform these calculations.
         o Sender Per-Message RTT Calculations.  Each message is
            stamped with a timestamp from the sender.  As each is
            passed up the tree, the amount of dally time spent waiting
            at each node is accumulated.  The lowest measurements are
            passed up the tree, and the dally time is subtracted from
            the original measurement.
         o Local Per-Level RTT Calculations.  Each parent measures the
            local RTT to each of its children as part of the keep-alive
            messages used for failure detection.


 2. Design Rationale

    Much of the design rationale behind the protocol instantiations and
    building blocks being standardized by the RMT working group are
    laid out in [WVKHFL00].  In addition, the design rationale for the
    TRACK PI is laid out in [WCP00].  This building block conforms with
    the design rationales laid out in both of those documents.

    TRACK is designed to provide confirmed delivery, receiver-based
    flow control, distributed management of group membership (some of
    them may be dedicated servers in a repair tree), as well as
    providing aggregation of information up the tree.  It also provides
    requests for retransmissions as part of TRACK messages, and local
    recovery of lost packets.

    This TRACK BB is primarily designed to work as part of the TRACK
    PI, in conjunction with other BB's including NACK, FEC, and Auto
    Tree.  In the spirit of modular reuse specified in [WVKHFL00], it
    is also designed to be useful as an additional layer of
    functionality on top of any of the following services.  1) The
    functionality (if not the exact message headers) of the NORM PI.
    2) The functionality (if not the exact message headers) of the ALC
    PI.  3) Running directly on top of an unreliable IP multicast
    routing protocol, but on a carefully provisioned network.  4) On
    top of an overlay multicast (also known as application layer
    multicast) system.

    Overlay multicast is a system where servers in the network provide
    multicast (and unicast) routing as well as reliable multicast
    delivery, all on top of a combination of unicast (i.e. TCP) and, as
    available, reliable multicast services.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   5

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    There is a fundamental tradeoff between reliability and real-time
    performance in the face of failures.  There are two primary types
    of single layer reliability that have been proposed to deal with
    this:  sender reliable and receiver reliable delivery.  Sender
    reliable delivery is similar to TCP, where the sender knows the
    identity of the receivers in a data session, and is notified when
    any of them fails to receive all the data messages.  Receiver
    reliable delivery limits knowledge of group membership and failures
    to only the actual receivers.  Senders do not have any knowledge of
    the membership of a group, and do not require receivers to
    explicitly join or leave a data session.  Receiver reliable
    protocols scale better in the face of networks that have frequent
    failures, and have very high isolation of failures between
    receivers.  This TRACK BB provides sender reliable delivery,
    potentially on top of a receiver reliable system.

    This BB is specified according to the guidelines in [KV00].  In
    addition, it specifies all communication between entities in terms
    of messages, rather than packets.  A message is an abstract
    communication unit, which may be part of, or all of, a given
    packet.  It does not have a specific format, although it does
    contain a list of fields, some of which may be optional, and some
    of which may have fixed lengths associated with them.  It is up to
    each protocol instantiation to combine the set of messages in this
    BB, with those in other components, and create the actual set of
    message formats that will be used.

    As mentioned in the introduction, this BB assumes the existence of
    a separate Auto Tree Configuration BB.  It also assumes that data
    sessions are advertised to all receivers as part of an external BB
    or other component.  It expects to also interact with other BB's
    through the TRACK PI, but does not require this.

 3. Applicability Statement
    It is widely recognized that no single reliable multicast protocol
    can meet the needs of all application types over all network types.
    Distinguishing factors include functionality and performance.
    From a functionality perspective, TRACK and NACK based reliable
    multicast protocols present an inherently different reliability
    model.  TRACK based protocols are able to remove messages from the
    retransmission window when all the children have acknowledged them.
    NACK based protocols have to rely on other means for determining
    how long to keep messages in the retransmission window.  A popular
    method is a time based scheme[SFCGLTLLBEJMRSV00].  TRACK based
    protocols can keep track of the membership of the data session, and
    provide confirmed delivery against that membership list.  NACK
    protocols have anonymous membership.

    Since reliability is obtained through control traffic, the
    difference in the semantics of the term reliability lead to the


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   6

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    second distinguishing factor: performance.  When a persistent
    failure occurs among the members of a TRACK based protocol, there
    is a possibility that this may slow down other members of the
    group.  NACK protocols have higher isolation of failures, as well
    as smaller amounts of control traffic under many scenarios.

 3.1 Application types

    The objectives of TRACK are to provide high level reliability, high
    scalability, congestion control and flow control for one to many
    bulk data dissemination.  TRACK is not designed for many to many
    applications. Examples of applications that fit into the one-to-
    many data dissemination model are: real time financial news and
    market data distribution, electronic software distribution, audio
    video streaming, distance learning, software updates and server
    replication.  But, not all of these application types have the same
    reliability requirements.

    Historically, financial applications have had the most stringent
    reliability requirements, while audio video streaming have had the
    least stringent.  For applications that want to have strong
    confirmation of delivery guarantees, TRACK may be more applicable
    than alternatives such as NORM or ALC.  For applications that do
    not require this level of reliability, or that demand the lowest
    levels of latency and the highest levels of failure isolation,
    TRACK may be less applicable.

    The TRACK BB, in particular, is designed to optionally work on top
    of a NORM or ALC PI, to allow applications to select this tradeoff
    on a dynamic basis.

 3.2 Network Infrastructure

    The TRACKs also serve to provide feedback information to the
    sender.  The sender uses this information for congestion and flow
    control.  This allows TRACK to be applicable in most networks (i.e.
    managed and shared networks, and high congestion networks.)

    Asymmetric networks with very low upbound bandwidth and a very low
    loss data channel may be better served through NACK based
    protocols, particularly if high reliability is not required.  A
    good example is some satellite networks.

    Networks that have very high loss rates, and regularly experience
    partial network partitions, router flapping, or other persistent
    faults, may be better served through NACK only protocols.

 4. Message Types
    The following table summarizes the messages and their fields used
    by the TRACK BB.  All messages contain the session identifier.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   7

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 +--------------------------------------------------------------------+
 Message          From      To     Mcast?       Fields
 +--------------------------------------------------------------------+

 BindRequest     Child    Parent    no         Scope, Level, Role,
                                               SubTreeCount
 +--------------------------------------------------------------------+

 BindConfirm     Parent   Child     no      Level, RepairAddr, SeqNum,
                                            MemberId, CacheInfo
 +--------------------------------------------------------------------+

 BindReject      Parent   Child     no         Reason
 +--------------------------------------------------------------------+

 UnbindRequest   Child    Parent    no         Reason
 +--------------------------------------------------------------------+

 UnbindConfirm   Parent   Child     no     
 +--------------------------------------------------------------------+

 EjectRequest    Parent   Child    either         Reason
 +--------------------------------------------------------------------+

 EjectConfirm    Child    Parent    no                                            
 +--------------------------------------------------------------------+

 Heartbeat       Parent   Child    either     Level, ParentTimestamp,
                                               ChildrenList, SeqNum
 +--------------------------------------------------------------------+

 NullData        Sender    all     yes   SenderTimeStamp, AppSynch, End
 Data                                    Rate, HighestReleased, SeqNum
 +--------------------------------------------------------------------+

 Retransmission  Parent   Child    yes   SenderTimeStamp, AppSynch, End
                                         Rate, HighestReleased, SeqNum
 +--------------------------------------------------------------------+

 Track           Child    Parent      no  SeqNum, BitMask, SubTreeCount
                                          Slowest, FailedChildren,
                                          HighestAllowed,LocalDallyTime
                                          ApplicationConfirms,
                                          ParentThere, ParentTimeStamp,
                                          SenderTimeStamp,
                                          SenderDallyTime
 +--------------------------------------------------------------------+


    The various fields of the messages are described as follows:


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   8

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    - Scope: an integer to indicate how far a repair message travels.
    This is optional.

    - Level: an integer that indicates the level in the repair tree.
    This value is used to keep loops in the tree from forming, in
    addition to indicating the distance from the sender.  Any changes
    in a node's level are passed down to the Tree BB using the
    treeLevelUpdate interface.

    - Role: This indicates if the bind requestor is a receiver or
    repair head.

    - SubTreeCount: This is an integer indicating the current number of
    receivers below the node.

    - RepairAddr: This field in the BindConfirm message is used to tell
    the receiver which multicast address the repair head will be
    sending retransmissions on.  If this field is null, then the
    receiver should expect retransmissions to be sent on the sender's
    data multicast address.  

    - SeqNum: an integer indicating the sequence number of a data
    message within a given data session. The SeqNum field in the
    BindConfirm message indicates the sequence number starting from
    which the repair head promises to provide repair service.

    - MemberId: This is an integer the repair head assigns to a
    particular child.  The child receiver uses this value to implement
    the rotating TRACK Generation algorithm.

    - CacheInfo: This field contains information about the repair data
    available from this Repair Head.

    - Reason: a code indicating the reason for the BindReject,
    UnbindRequest, or EjectRequest message.

    - ParentTimestamp: This field is included in Heartbeat messages to
    signal the need to do a local RTT measurement from a parent.  It is
    the time when the parent sent the packet.

    - ChildrenList: This field contains the identifiers for a list of
    children.  As part of the keepalive message, this field together
    with the SeqNum field is used to urge those listed receivers to
    send a TRACK (for the provided SeqNum).  The repair head sending
    this must have been missing the regular TRACKs from these children
    for an extended period of time.

    - SenderTimestamp: This field is included in Data messages to
    signal the need to do a roundtrip time measurement from the sender,


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                   9

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    through the tree, and back to the sender.  It is the time (measured
    by the sender's local clock) when it sent the packet.

    - AppSynch: a sequence number signaling a request for confirmed
    delivery by the application.

    - End: indicates that this packet is the end of the data for this
    session.

    - Rate: This field is used by the sender to tell the receivers its
    sending rate, in packets per second.  It is part of the data or
    nulldata messages.

    - HighestReleased:  This field contains a sequence number,
    corresponding to the trailing edge of the sender's retransmission
    window.  It is used (as part of the data, nulldata or
    retransmission headers) to inform the receivers that they should no
    longer attempt to recover those messages with a smaller (or same)
    sequence number.

    - HighestAllowed: a sequence number, used for flow control from the
    receivers.  It signals the highest
    sequence number the sender is allowed to send that will not overrun
    the receivers' buffer pools.

    - BitMask: an array of 1's and 0's.  Together with a sequence
    number it is used to indicate lost data messages.  If the i'th
    element is a 1, it indicates the message SeqNum+i is lost.

    - Slowest: This field contains a field that characterizes the
    slowest receiver in the subtree beneath (and including) the node
    sending the TRACK.  This is used to provide information for the
    congestion control BB, and the aggregation methods on this
    information are defined by that BB.

    - ParentThere: This field indicates to the parent that the receiver
    sending the TRACK has not been receiving the regular keepalive
    messages from its parent, and is wondering if it needs to find a
    new parent.

    - SenderDallyTime: This field is associated with a SenderTimestamp
    field.  It contains the sum of the waiting time that should be
    subtracted from the RTT measurement at the sender.

    - LocalDallyTime: This is the same as the SenderDallyTime, but is
    associated with a ParentTimestamp instead of a SenderTimestamp.

    - ApplicationConfirms: This is the SeqNum value for which delivery
    has been confirmed by all children at or below this parent.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  10

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    - FailedChildren: This is a list of all children that have recently
    been dropped from the repair tree.


 5. Global Configuration Variables, Constants, and Reason Codes

 5.1 Global Configuration Variables
 These are variables that control the data session and are advertised
 to all participants.

    @ TimeMaxBindResponse: the time, in seconds, to wait for a
       response to a BindRequest.  Initial value is
       TIMEOUT_PARENT_RESPONSE (recommended value is 3).  Maximum value
       is MAX_TIMEOUT_PARENT_RESPONSE.

    @ MaxChildren: The maximum number of children a repair head is
       allowed to handle. Recommended value: 32.

    @ ConstantHeartbeatPeriod: Instead of dynamically calculating the
       HeartbeatPeriod as described in Section 7.1.5, a constant period
       may be used instead.  Recommended value: 3 seconds.

    @ MinimumHeartbeatPeriod: The minimum value for the dynamically
       calculated HeartbeatPeriod.  Recommended value: 1 second.

    @ MinHoldTime: The minimum amount of time a repair head holds on
       to data packets.

    @ MaxHoldTime: The maximum amount of time a repair head holds on
       to data packets.

    @ AckWindow: The number of packets seen before a receiver issues
       an acknowledgement.  Recommended value: 32.


 5.2 Constants

    @ NUM_MAX_PARENT_ATTEMPTS: The number of times to try to bind to a
       repair head before declaring a PARENT_UNREACHABLE error.
       Recommended value is 5.

    @ NULL_DATA_PERIOD: The time between transmission of NullData
       Messages.  Recommended value is 1.

    @ FAILURE_DETECTION_REDUNDANCY: The number of times a message is
       sent without receiving a response before declaring an error.
       Recommended value is 3.

    @ MAX_TRACK_TIMEOUT: The maximum value for TRACKTimeout.
       Recommended value is 5 seconds.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  11

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 5.3 Reason Codes

    @ BindReject reason codes
       @ LOOP_DETECTED
       @ MAX_CHILDREN_EXCEEDED

    @ UnbindRequest reason codes
       @ SESSION_DONE
       @ APPLICATION_REQUEST
       @ RECEIVER_TOO_SLOW

    @ EjectRequest reason codes
       @ PARENT_LEAVING
       @ PARENT_FAILURE
       @ CHILD_TOO_SLOW
       @ PARENT_OVERLOADED


 6. External APIs
 This section describes external interfaces for the building block.

 6.1 Interfaces to the BB from PI's

 6.1.1 Start(boolean RepairHead, boolean RejoinAllowed,
       Advertisement)

 Start instructs the BB to initiate operation.

 RepairHead indicates whether or not the node may also operate as a
 repair head. This parameter is passed along to the tree BB.

 RejoinAllowed indicates whether or not the node is allowed to rejoin
 the session if the only repair heads available are missing some repair
 data needed by this node. This parameter also controls whether or not
 the node is allowed to join the session after the first data messages
 have become unrecoverable (late join).  The BB uses this parameter to
 decide whether or not to use a particular repair head (chosen by the
 tree BB) based on its available repair data.

 The Advertisement parameter passes to the BB all of the parameters
 from the session advertisement.

 6.1.2 End

 End instructs the BB to end its operation.  If the node is the Sender,
 it indicates to the group that the last data message is the final one


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  12

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 for the session.  Once a receiver has received all of the session's
 data, it MAY unbind from its parent.  However, if the receiver is also
 a repair head, it continues to operate as a repair head until all of
 its children have finished.  Then it MAY unbind from its own parent.

 If End is called at a repair head, it MUST use the multicast Eject
 procedure to inform its children for this session that it is leaving
 the group.  Once the procedure is complete (all children have
 acknowledged receipt of the Eject, or the Eject has been sent the
 maximum number of times), the repair head MAY unbind from its own
 parent.

 If End is called at a receiver, it MUST use the Unbind procedure to
 inform its parent for this session that it is leaving the group.

 6.1.3 incomingMessage(Message)

 incomingMessage presents the BB with message received by the PI.

 6.1.4 getStatistics

 getStatistics returns current BB statistics to the upper BB or PI.

 6.1.5 MessageSynched(Message)

 MessageSynched tells the BB that the indicated message has been
 synched with the application.

 6.1.6 RepairHead(boolean)

 RepairHead tells the BB whether or not it is now acting as a Repair
 Head.

 6.2 Interfaces from the BB to the PI

 6.2.1 outgoingMessage(Message)

 outgoingMessage instructs the PI to send the message.

 6.2.2 MessageReceived(Message, boolean Synch)

 MessageReceived passes a data message up to the PI. Synch indicates
 whether or not the PI should call MessageSynched once the message has
 been consumed by the application.

 6.2.3 SenderLost

 SenderLost tells the PI that contact with the sender has been lost.

 6.2.4 UnrecoverableData


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  13

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 UnrecoverableData indicates to the PI that the BB was unable to
 recover some session data.

 6.2.5 SessionDone

 SessionDone indicates to the PI that the sender has completed sending
 the data, and the node has left the session.

 7. Algorithms

 7.1 Tree Based Session Creation and Maintenance

    7.1.1 Overview of Tree Configuration

    Before a Data Session starts delivering data, the tree for the Data
    Session needs to be created.  This process binds each Receiver to
    either a Repair Head or the Sender, and binds the participating
    Repair Heads into a loop-free tree structure with the Sender as the
    root of the tree.  This process requires tree configuration
    knowledge, which can be provided with some combination of manual
    and/or automatic configuration.  The algorithms for automatic tree
    configuration are part of the Automatic Tree Configuration BB.
    They return to each node the address of the parent it should bind
    to, as well as zero or more backup parents to use if the primary
    parent fails.

    In addition to receiving the tree configuration information, the
    receivers all receive a Session Advertisement message from the
    senders, informing them of the Data Multicast Address and other
    session configuration information.  This advertisement may contain
    other relevant session information such as whether or not Repair
    Heads should be used, whether manual or automatic tree
    configuration should be used, the time at which the session will
    start, and other protocol settings.  This advertisement is created
    as part of either the PI or as part of an external service.  In
    this way, the Sender enforces a set of uniform Session
    Configuration Parameters on all members of the session.

    As described in the automatic tree configuration BB, the general
    algorithm for a given node in tree creation is as follows.
    1) Get advertisement that a session is starting
    2) Get list of neighbor candidates using the getSNs Tree BB
       interface, contact them
    3) Select best neighbor as parent in a loop free manner
    4) Bind to parent
    5) Optionally, later rebind to another parent

    When a child finishes step 4, it is up to automatic tree
    configuration to, if necessary, continue building the tree in order


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  14

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    to connect the node back to the Sender.  After the session is
    created, children can unbind from their parents and bind again to
    new parents.  This happens when faults occur, or as part of a tree
    optimization process.  Steps 1 through 3 are external to the TRACK
    BB.  Step 4 is performed as part of session creation.  Step 5 is
    performed as part of session maintenance in conjunction with
    automatic tree building, as either an unbind or eject, combined
    with another bind operation.

    Once steps 1 through 3 are completed, Receivers join the Data
    Multicast Address, and attempt to bind to either the Sender or a
    local Repair Head.  A Receiver will attempt to bind to the first
    node in the tree configuration list returned by step 3, and if this
    fails, it will move to the next one.  A Receiver only binds to a
    single Repair Head or Sender, at a time, for each Data Session.

    The automatic tree building BB ensures that the tree is formed
    without loops.  As part of this, when a Repair Head has a Receiver
    attempt to bind to it for a given Data Session, it may not at first
    be able to accept the connection, until it is able to join the tree
    itself.  Because of this, a Receiver will sometimes have to
    repeatedly attempt to bind to a given parent before succeeding.

    Once the Sender initiates tree building, it is also free to start
    sending Data messages on the Data Multicast Address.  Repair Heads
    and Receivers may start receiving these messages, but may not
    request retransmission or deliver data to the application until
    they receive confirmation that they have successfully bound to the
    tree.

 7.1.2 Bind

 7.1.2.1 Input Parameters

    In order to join a data session and bind to the tree, the following
    nodes need the following parameters.

    A Repair Head requires the following parameters.

    - Session:  the unique identifier for the data session to join,
    received from the Session Advertisement algorithm in the PI.

    - ParentAddress:  the address and port of the parent node to which
    the node should connect

    - UDPListenPort:  the number of the port on which the node will
    listen for its children's control messages

    - RepairAddr:  the multicast address, UDP port, and TTL on which
    this node sends control messages to its children.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  15

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    A Sender requires the above parameters, except for the
    ParentAddress.  A Receiver requires the above parameters, except
    for the UDPListenPort and RepairAddr.

 7.1.2.2 Bind Algorithm

    A Bind operation happens when a child wishes to join a parent in
    the distribution tree for a given data session.  The Receivers
    initiate the first Bind protocols to their parents, which then
    cause recursive binding by each parent, up to the Sender.  Each
    Receiver sends a separate BindRequest message for each of the
    streams that it would like to join.  At the discretion of the PI,
    multiple BindRequest messages may be bundled together in a single
    message.

    A node sends a BindRequest message to its automatically selected or
    manually configured parent node.  The parent node sends either a
    BindConfirm message or a BindReject message.  Reception of a
    BindConfirm message terminates the algorithm successfully, while
    receipt of a BindReject message causes the node to either retry the
    same parent or restart the Bind algorithm with its next parent
    candidate (depending on the BindReject reason code), or if it has
    none, to declare a REJECTED_BY_PARENT error.  Once the node is
    accepted by a Repair head, it informs the Tree BB using the setSN
    interface.

    Reliability is achieved through the use of a standard request-
    response protocol.  At the beginning of the algorithm, the child
    initializes TimeMaxBindResponse to the constant
    TIMEOUT_PARENT_RESPONSE and initializes NumBindResponseFailures to
    0.  Every time it sends a BindRequest message, it waits
    TimeMaxBindResponse for a response from the parent node.  If no
    response is received, the node doubles its value for
    TimeMaxBindResponse, but limits TimeMaxBindResponse to be no larger
    than MAX_TIMEOUT_PARENT_RESPONSE.  It also
    increments NumBindResponseFailures, and retransmits the BindRequest
    message.  If NumBindResponseFailures reaches
    NUM_MAX_PARENT_ATTEMPTS, it reports a PARENT_UNREACHABLE error.

    When a parent receives a BindRequest message, it first consults the
    automatic tree building BB for approval (using the acceptChild Tree
    BB interface), for instance to ensure that accepting the
    BindRequest will not cause a loop in the tree. Then the parent
    checks to be sure that it does not have more than MaxChildren
    children already bound to it for this session.  If it can accept
    the child, it sends back a BindConfirm message.  Otherwise, it
    sends the node a BindReject message. Then the parent checks to see
    if it is already a member of this data session.  If it is not yet a
    member of this session, it attempts to join the tree itself.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  16

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    The BindConfirm message contains the lowest sequence number that
    the repair head has available.  If this number is 0 or 1, then the
    repair head has all of the data available from the start of the
    session.  Otherwise, the requesting node is attempting a late join,
    and can only use this repair head if late join was allowed by the
    PI.  If late join is not allowed, the node may try another repair
    head, or give up.

    Similarly, if a failure recovery occurs, when a node tries to bind
    to a new repair head, it must follow the same rules as for a late
    join.  See section 7.1.5.

7.1.3 Unbind

    A child may decide to leave a data session for the following
    reasons.  1) It detects that the data session is finished.  2) The
    application requests to leave the data session.  3) It is not able
    to keep up with the data rate of the data session.  When any of
    these conditions occurs, it initiates an Unbind process.

    An Unbind is, like the Bind function, a simple request-reply
    protocol.  Unlike the Bind function, it only has a single response,
    UnbindConfirm.   With this exception, the Unbind operation uses the
    same state variables and reliability algorithms as the Bind
    function.

    When a child receives an UnbindConfirm message from its parent, it
    reports a LEFT_DATA_SESSION_GRACEFULLY event.  If it does not
    receive this message after NUM_MAX_PARENT_ATTEMPTS, then it reports
    a LEFT_DATA_SESSION_ABNORMALLY event.  Unbinds are reported to the
    Tree BB using the lostSN interface.

 7.1.4 Eject

    A parent may decide to remove one or more of its children from a
    data stream for the following reasons.  1) The parent needs to
    leave the group due to application reasons.  2) The repair head
    detects an unrecoverable failure with either its parent or the
    sender.  3) The parent detects that the child is not able to keep
    up with the speed of the data stream.  4) The parent is not able to
    handle the load of its children and needs some of them to move to
    another parent.  In the first two cases, the parent needs to
    multicast the advertisement of the termination of one or more data
    sessions to all of its children.  In the second two cases, it needs
    to send one or more unicast notifications to one or more of its
    children.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  17

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    Consequently, an Eject can be done either with a repeated multicast
    advertisement message to all children, or a set of unicast request-
    reply messages to the subset of children that it needs to go to.

    For the multicast version of Eject, the parent sends a multicast
    UnbindRequest message to all of its children for a given Data
    Session, on its Local Multicast Channel.  It is only necessary to
    provide statistical reliability on this message, since children
    will detect the parent's failure even if the message is not
    received.  Therefore, the UnbindRequest message is sent
    FAILURE_DETECTION_REDUNDANCY times.

    For the unicast version of Eject, the parent sends a unicast
    UnbindRequest message to all of its children.  Each of them respond
    with an EjectConfirm.  Reliability is ensured through the same
    request-reply mechanism as the Bind operation.

    Ejections are reported to the Tree BB using the removeChild
    interface.

 7.1.5 Fault Detection

    There are three cases where fault detection is needed.  1)
    Detection (by a child) that a parent has failed.  2) Detection (by
    a parent) that a child has failed.  3) Detection (by either a
    Repair Head or Receiver) that a Sender has failed.

    In order to be scaleable and efficient, fault detection is
    primarily accomplished by periodic keep-alive messages, combined
    with the existing TRACK messages.  Nodes expect to see keep-alive
    messages every set period of time.  If more than a fixed number of
    periods go by, and no keep-alive messages of a given type are
    received, the node declares a preliminary failure.  The detecting
    node may then ping the potentially failed node before declaring it
    failed, or it can just declare it failed.

    Failures are detected through three keep-alive messages:
    Heartbeat, TRACK, and NullData.  The Heartbeat message is multicast
    periodically from a parent to its children on its local control
    channel.  NullData messages are multicast by a sender on the global
    multicast address when it has no data to send.  TRACK messages are
    generated periodically, even if no data is being sent to a data
    session, as described in section 7.2.

    Heartbeat messages are multicast every HeartbeatPeriod seconds,
    from a parent to its children.  Every time that a parent sends a
    Retransmission message or a Heartbeat message (as well as at
    initialization time), it resets a timer for HeartbeatPeriod
    seconds.  If the timer goes off, a Heartbeat is sent.  The
    HeatbeatPeriod is dynamically computed as follows:


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  18

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


            interval = AckWindow / PacketRate

            HeartbeatPeriod = 2 * interval

    Global configuration parameters ConstantHeartbeatPeriod and
    MinimumHeartbeatPeriod can be used to either set HeartbeatPeriod to
    a constant, or give HeartbeatPeriod a lower bound, globally.

    Similarly, a NullData message is multicast by the sender to all
    data session members, every NULL_DATA_PERIOD.  The NullData timer
    is set to NULL_DATA_PERIOD, and is reset every time that a Data or
    NullData message is sent by the Sender.

    The key parameter for failure detection is the global tree
    parameter FAILURE_DETECTION_REDUNDANCY.  The higher the value for
    this parameter, the more keep-alive messages that must be missed
    before a failure is declared.

    A major goal of failure detection is for children to detect parent
    failures fast enough that there is a high probability they can
    rejoin the stream at another parent, before flow control has
    advanced the buffer window to a point where the child can not
    recover all lost messages in the stream.  In order to attempt to do
    this, children detect a failure of a parent if
    FAILURE_DETECTION_REDUNDANCY * HeartbeatPeriod time goes by without
    any heartbeats.  As part of buffer window advancement, described in
    section 7.2.4, all parents MAY choose to buffer all messages for a
    minimum of FAILURE_DETECTION_REDUNDANCY * 2 * HeartbeatPeriod
    seconds, which gives children a period of time to find a new parent
    before the buffers are freed.  Children report parent failures to
    the Tree BB using the lostSN interface.

    A parent detects a preliminary failure of one of its children if it
    does not receive any TRACK messages from that child in
    FAILURE_DETECTION_REDUNDANCY * TrackTimeout seconds (see discussion
    of how TrackTimeout is computed in 7.2.1).  Because a failed child
    can slow down the group's progress, it is very important that a
    parent resolve the child's status quickly.  Once a parent declares
    a preliminary failure of a child, it issues a set of up to
    FAILURE_DETECTION_REDUNDANCY Heartbeat messages that are unicast
    (or multicast) to the failed receiver(s).  These messages are
    spaced apart by 2*LocalRTT, where LocalRTT is the round trip time
    that has been measured to the child in question (see 7.4 for
    description of how LocalRTT is measured).  These Heartbeat messages
    contain a ChildrenList field that contains the children who are
    requested to send a TRACK immediately.

    Whenever a child receives a Heartbeat message with an
    ImmediateTRACK field set to 1, it immediately sends a TRACK to its


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  19

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    parent.  If a parent does not receive a TRACK message from a child
    after waiting a period of 2*ChildRTT after the last Heartbeat
    message to that child, it declares the child failed, and removes it
    from the parent's child membership list.  It informs the Tree BB
    using the removeChild interface.

    A child or a repair head detects the failure of a sender if it does
    not receive a Data or NullData message from a sender in
    FAILURE_DETECTION_REDUNDANCY * NULL_DATA_PERIOD.

    Note that the more receivers there are in a tree, and the higher
    the loss rate, the larger FAILURE_DETECTION_REDUNDANCY must be, in
    order to give the same probability that erroneous failures won't be
    declared.

 7.1.6 Fault Notification

    When a parent detects the failure of a child, it adds a failure
    notification field to the next TRACK messages that it sends up the
    tree.  It sends this notification multiple times because TRACKs are
    not delivered reliably.  A failure notification field includes the
    failure code, as well as a list of one or more failed nodes.
    Failure notifications are aggregated up the tree, according to the
    rules in 7.3.  A failure notification is not a definitive report of
    a failure, as the child may have moved to a different repair head.

 7.1.7 Fault Recovery

    The Fault Recovery algorithms require a list of one or more
    addresses of alternate parents that can be bound to, and that still
    provide loop free operation.

    If a child detects the failure of its parent, it then re-runs the
    Bind operation to a new parent candidate, in order to rejoin the
    tree.  As described above in section 7.1.2, a node may perform a
    late join, i.e. binding with a repair head which cannot provide all
    the necessary repair data, only if allowed by the PI.

 7.2 TRACK Generation

    This section describes the algorithms used by the receiver to
    determine when to send the TRACK messages.

    TRACK messages are sent from receivers to their parents.  TRACK
    messages may be sent for the following purposes:
    - to request retransmission of messages
    - to advance the sender's transmission window for flow control
    purposes
    - to deliver end-to-end confirmation of data reception


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  20

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    - to propagate other relevant feedback information up through the
    session (such as RTT and loss reports, for congestion control)

    The TRACK PI also makes use of the NACK BB, which requests
    retransmission of messages from a parent.  The TRACK request and
    response algorithms should be highly similar to the NACK algorithms
    for this specific case.

 7.2.1 TRACK Generation with the Rotating TRACK Algorithm

    Each receiver sends a TRACK message to its parent once per
    AckWindow of data messages received.  A receiver uses an offset
    from the boundary of each AckWindow to send its TRACK, in order to
    reduce burstiness of control traffic at the parents.  Each parent
    has a maximum number of children, MaxChildren.  When a child binds
    to the parent, the parent assigns a locally unique ChildID to that
    child, between 0 and MaxChildren-1.

    Each child in a tree generates a TRACK message at least once every
    AckWindow of data messages, when the most recent data message's
    sequence number, modulo AckWindow, is equal to MemberID.  If the
    message that would have triggered a given TRACK for a given node is
    missed, the node will generate the TRACK as soon as it learns that
    it has missed the message, typically through receipt of a higher
    numbered data message.

    Together, AckWindow and MaxChildren determine the maximum ratio of
    control messages to data messages seen by each parent, given a
    constant load of data messages.

    In each data message, the sender advertises the current PacketRate
    (measured in messages per second) it is sending data at.  This rate
    is generated by the congestion control algorithms in use at the
    sender.

    At the time a node sends a regular TRACK, it also computes a
    TRACKTimeout value:

            interval = AckWindow / PacketRate

            TRACKTimeout = 2 * interval

    If no TRACKs are sent within TRACKTimeout interval, a TRACK is
    generated, and TRACKTimeout is increased by a factor of 2, up to a
    value of MAX_TRACK_TIMEOUT.

    This timer mechanism is used by a receiver to ensure timely repair
    of lost messages and regular feedback propagation up the tree even
    when the sender is not sending data continuously. This mechanism
    complements the AckWindow-based regular TRACK generation mechanism.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  21

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


 7.2.2 Local Repair

    A repair head maintains the following state for each of its
    children, for the purpose of providing repair service to the local
    group:

    - HighestConsecutivelyReceived: a sequence number indicating all
       Data messages up to this number (inclusive) have been received
       by a given child.

    - MissingPackets: a data structure to keep track of the reception
       status of the Data messages with sequence number higher than
       HighestConsecutivelyReceived.

    In addition, a repair head also maintains other state for purposes
    of feedback aggregation described in the next section.

    The minimum HighestConsecutivelyReceived value of all its children
    is kept as the variable LocalStable.

    A repair head also maintains a retransmission buffer. The size of
    the retransmission buffer must be greater than the maximum value of
    a sender's transmission window. The retransmission buffer must keep
    all the Data messages received by the repair head with sequence
    number higher than LocalStable, optionally some messages with
    sequence number lower than LocalStable if there is room (beyond the
    maximum value of sender's transmission window).  The latter
    messages are kept in the retransmission buffer in case a receiver
    from another group losses its parent and needs to join this group.

    As TRACK messages are received, the repair head updates the above
    states.

    To perform local repair, a repair head implements a retransmission
    queue with memory.  Each lost message (reported by a child using
    the BitMask field) is entered into the retransmission queue in
    increasing order according to its sequence number. If the same Data
    message has already been retransmitted recently (recognized due to
    the queue's memory) it is delayed by the local group RTT (see
    roundtrip time measurement) before retransmission.

    The retransmissions are sent using the same PacketRate is that used
    by the sender.

 7.2.3 Flow Control Window Update

    When a receiver sends a TRACK to its parent, the HighestAllowed
    field provides information on the status of the receiver's flow


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  22

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    control window.  The value of HighestAllowed is computed as
    follows:

                     HighestAllowed = seqnum + ReceiverWindow

    Where seqnum is the highest sequence number of consecutively
    received data messages at the receiver.  The size of the
    ReceiverWindow may either be based on a parameter local to the
    receiver or be a global parameter.

 7.2.4 Reliability Window

    The sender and each repair head maintain a window of messages for
    possible retransmission.  As messages are acknowledged by all of
    its children, they are released from the parent's retransmission
    buffer, as described in 7.2.2. In addition, there are two global
    parameters that can affect when a parent releases a data message
    from the retransmission buffer -- MinHoldTime, and MaxHoldTime.

    MinHoldTime specifies a minimum length of time a message must be
    held for retransmission from when it was received. This parameter
    is useful to handle scenarios where one or more children have been
    disconnected from their parent, and have to reconnect to another.
    If, for example, MinHoldTime is set to FAILURE_DETECTION_REDUNDANCY
    * 2 * ConstantHeartbeatPeriod, then there is a high likelihood that
    any child will be able to recover any lost messages after
    reconnecting to another parent.

    The sender continually advertises to the members of the data
    session both edges of its retransmission window.  The higher value
    is the SeqNum field in each Data or NullData message, which
    specifies the highest sequence number of any data message sent.
    The trailing edge of the window is advertised in the
    HighestReleased field.  This specifies the largest sequence number
    of any message sent that has subsequently been released from the
    sender's retransmission window.  If both values are the same then
    the window is presently empty.  Zero is not a legitimate value for
    a data sequence number, so if either field has a value of zero,
    then no messages have yet reached that state.  All sequence number
    fields use sequence number arithmetic so that a data session can
    continue after exhausting the sequence number space.

    When a member of a data session receives an advertisement of a new
    HighestReleased value, it stores this, and is no longer allowed to
    ask for retransmission for any messages up to and including the
    HighestReleased value.  If it has any outstanding missing messages
    that are less than or equal to HighestReleased, it MAY move forward
    and continue delivering the next data messages in the stream.  It
    also SHOULD report an error for the messages that are no longer
    recoverable.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  23

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    MaxHoldTime specifies the maximum length of time a message may be
    held for retransmission.  This parameter is set at the sender which
    uses it to set the HighestReleased field in data message headers.
    This is particularly useful for real-time, semi-reliable streams
    such as live video, where retransmissions are only useful for up to
    a few seconds.  When combined with Unordered delivery semantics,
    and application-level jitter control at the receivers, this
    provides Time Bounded Reliability.  Obviously, MaxHoldTime must
    always be larger than MinHoldTime.

 7.2.5 Confirmed Delivery

    Flow control and the reliability window are concerned with goodput,
    of delivering data with a high probability that it is delivered at
    all receivers.  However, neither mechanism provides explicit
    confirmation to the sender as to the list of recipients for each
    message.  Confirmed delivery allows applications to determine the
    set of applications that have received a set of data messages.

    To request this service, a sender fills the AppSynch field of data
    messages with the sequence number of the highest data message it
    wishes to confirm delivery of.  It continues to do so until it
    receives confirmation, moves the AppSynch point forward to a higher
    sequence number, or declares an error.

    When a receiver gets a data message with a non-zero AppSynch field,
    it starts including the highest sequence number that has been
    acknowledged by the application in the ApplicationConfirms field of
    each TRACK message that it sends up the tree.  In order to provide
    reliable delivery of this acknowledgement, this continues so long
    as a receiver gets data messages with non-zero AppSynch fields.

    Each receiver is responsible for locally deciding the value of the
    ApplicationConfirms field.  There are two primary issues a receiver
    must consider in setting this field:  the reliability semantics of
    the data stream, and when a given message is considered confirmed
    at the receiver.  As this is an application level confirmation, a
    handshake with the application is required to get this
    confirmation.

    One example of how an application can implicitly signal
    confirmation of delivery is through the freeing of buffers passed
    to it by the transport.  The API could specify that whenever an
    application has freed up a buffer containing one or more data
    messages, then these messages are considered acknowledged by the
    application.  Alternatively, the application could be required to
    explicitly acknowledge each message.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  24

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    With a given transport-application API for signaling
    acknowledgement, the transport then keeps track of all contiguous
    acknowledgements from that application, and reports these up in the
    ApplicationConfirms field.  If one or more messages can not be
    acknowledged, the receiver should pass an error code describing the
    type of failure that occurred, and the sequence number of the first
    message that has not yet been delivered.

    If MaxHoldTime is not in use for a data stream, so that delivery is
    fully reliable, then any message that can not be delivered will be
    considered a fatal error for that receiver.  If MaxHoldTime has a
    non-zero value, then any messages that could not be delivered, but
    are less than HighestReleased as advertised by the sender, are not
    reported as errors.

    In addition to the AppSynch field, a sender may also set the
    ImmediateACK field to 1.  When a node gets a data message that has
    this flag set, it will immediately send a TRACK after processing
    that message.

 7.3 Feedback Aggregation

    This section describes how repair heads perform aggregation on
    feedback information sent up in the fields of the TRACK message,
    and the purposes for performing such aggregation.
    There are many reasons for providing feedback from all the
    receivers to the sender in an aggregated form.  The major ones are
    listed below:

    1) End-to-end delivery confirmation.  This confirmation tells the
    sender that all the receivers (in the entire tree) have received
    data packets up to a certain sequence number.  The field that
    carries this information is AppSynch.

    2) Flow control.  The aggregated information is carried in the
    field HighestAllowed.  It tells the sender the highest sequence
    number that all the receivers (in the entire tree) are prepared to
    receive.

    3) Identifying the slowest receiver.  The aggregated information is
    carried in the field Slowest.  The sender can use this value as
    part of congestion control.

    4) Counting current membership in the group.  This information is
    carried in the field SubTreeCount.  This lets the sender know the
    number of receivers currently connected to the repair tree.

    5) Measuring the round-trip time from the sender to the "worstö
    receiver.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  25

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    A repair head maintains state for each child.  Each time a TRACK
    (from a child) is received, the corresponding states for that child
    are updated based on the information in the TRACK message. When a
    repair head sends a TRACK message to its parent, the following
    fields of its TRACK message are derived from the aggregation of the
    corresponding states for its children.  The following rules
    describe how the aggregation is performed:

    - AppSynch: take the minimum of the AppSynch value from all
    children

    - HighestAllowed: take the minimum of the HighestAllowed value from
    all children

    - Slowest: this is a measure of how slow the slowest member in the
    whole subtree is; take either the minimum (or maximum) of the
    Slowest value from all children (depending what the Slowest measure
    is).

    - SubTreeCount: take the sum of the SubTreeCount from all
    children

    - SenderDallyTime: take the minimum value, for all of the children,
    of

          child's reported SenderDallyTime + child's local dally time

    Note, the SendTimeStamp field is left alone.  The sender will
    derive the roundtrip time to the worst receiver by doing its local
    aggregation for SenderDallyTime and then compute:

          RTT = currentTime - SendTimeStamp - SenderDallyTime.

 7.4 Measuring Round Trip Times

    This TRACK BB provides two algorithms for distributed RTT
    calculations ù LocalRTT measurements and SenderRTT measurements.
    LocalRTT measurements are only between a parent and its children.
    SenderRTT measurements are end-to-end RTT measurements, measuring
    the RTT to the worst receiver as selected by the congestion control
    algorithms.

    The SenderRTT is useful for congestion control. It can be used to
    set the data rate based on the TCP response function, which is
    being proposed for the congestion control building block.

    The LocalRTT can be used to (a) quickly detect faulty children (as
    described in 7.1) or (b) avoid sending unnecessary retransmissions
    (as described in 7.2 in the local repair algorithm).


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  26

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    In the case of LocalRTT measurements, a parent initiates
    measurement by including a ParentTimestamp field in a Heartbeat
    message sent to its children.  When a child receives a Heartbeat
    message with this field set, it notes the time of receipt using its
    local system clock, and stores this with the message as
    HeartbeatReceiveTime.  When the child next generates a TRACK, just
    before sending it, it measures its system clock again as
    TRACKSendTime, and calculates the LocalDallyTime.

             LocalDallyTime = TRACKSendTime - HeartbeatReceiveTime.

    The child includes this value, along with the ParentTimestamp
    field, as fields in the next TRACK message sent.  Every heartbeat
    message that is multicast to all children SHOULD include a
    ParentTimestamp field.

    The SenderRTT algorithm is similar.  A sender initiates the process
    by including a SenderTimestamp field in a data message.  When a
    receiver gets a message with this field set, it keeps track of the
    DataReceiveTime for that message, and when it generates the next
    TRACK message, includes the SenderTimestamp and SenderDallyTime
    value.  These values are aggregated by Repair Heads, as described
    in section 7.3.

    Each node only keeps track of the most recent value for
    {SenderTimestamp, DataReceiveTime} and {ParentTimestamp,
    HeartbeatReceiveTime}, replacing any older values any time that a
    new message is received with these values set.  As long as it has
    non-zero values to report, each node sends up both a
    {SenderTimestamp, SenderDallyTime} and a {ParentTimestamp,
    LocalDallyTime} set of fields in each TRACK message generated.

    These measurements need to be averaged by the TRACK PI.

 8. Security

    This BB does not specifically deal with security.  It is the
    responsibility of the TRACK PI or the Security BB.  This issue is
    covered in the Security Requirements For TRACK draft [HW00].

 9. References

    [HW00] T. Hardjono, B. Whetten, "Security Requirements For TRACK,"
    Internet Draft, Internet Engineering Task Force, June, 2000.

    [KLCWTCTK01] M. Kadansky, D. Chiu, B. Whetten, B. Levine, G.
    Taskale, B. Cain, D. Thaler, S. Koh, "Reliable Multicast Transport
    Building Block: Tree Auto-Configuration," Internet Draft, Internet
    Engineering Task Force, March, 2001.


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  27

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    [KV00] R. Kermode, L. Vicisano, "Author Guidelines for RMT Building
    Blocks and Protocol Instantiation Documents," Internet Draft,
    Internet Engineering Task Force, June, 2000.

    [SFCGLTLLBEJMRSV00] T. Speakman, D. Farinacci, J. Crowcroft, J.
    Gemmell, S. Lin, A. Tweedly, D. Leshchiner, M. Luby, N. Bhaskar, R.
    Edmonstone, K. M. Johnson, T. Montgomery, L. Rizzo, R.
    Sumanasekera, and L. Vicisano, "PGM Reliable Transport Protocol
    Specification," Internet Draft, Internet Engineering Task Force,
    November 2000.

    [WCP00] B. Whetten, D. Chiu, S. Paul, M. Kadansky, G. Taskale,
    "TRACK Architecture, A Scalable Real-Time Reliable Multicast
    Protocol," Internet Draft, Internet Engineering ask Force, July
    2000.

    [WVKHFL00]  B. Whetten, L. Vicisano, R. Kermode, M. Handley, S.
    Floyd, and M. Luby, "Reliable Multicast Transport Building Blocks
    for One-to-Many Bulk-Data Transfer," RFC 3048, January 2001.

 10. Acknowledgements
    We would like to thank the follow people: Sanjoy Paul, Seok Joo
    Koh, Supratik Bhattacharyya, Joe Wesley, and Joe Provino.


 11. Authors' Addresses

    Dah Ming Chiu
    dahming.chiu@sun.com
    Miriam Kadansky
    miriam.kadansky@sun.com
    Sun Microsystems Laboratories
    1 Network Drive
    Burlington, MA 01803

    Gursel Taskale
    gursel@talarian.com
    Brian Whetten
    whetten@talarian.com
    Talarian
    333 Distel Circle
    Los Altos, CA 94022-1404


    Full Copyright Statement

    "Copyright (C) The Internet Society (2001). All Rights Reserved.
    This document and translations of it may be copied and furnished to
    others, and derivative works that comment on or otherwise explain
    it or assist in its implementation may be prepared, copied,


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  28

 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt          March 2001


    published and distributed, in whole or in part, without restriction
    of any kind, provided that the above copyright notice and this
    paragraph are included on all such copies and derivative works.
    However, this document itself may not be modified in any way, such
    as by removing the copyright notice or references to the Internet
    Society or other Internet organizations, except as needed for the
    purpose of developing Internet standards in which case the
    procedures for copyrights defined in the Internet Standards process
    must be followed, or as required to translate it into languages
    other than English.

    The limited permissions granted above are perpetual and will not be
    revoked by the Internet Society or its successors or assigns.

    This document and the information contained herein is provided on
    an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."


 INTERNET DRAFT      draft-ietf-rmt-track-bb-01.txt                  29